Using wget to recursively fetch .txt files in .php file, but filters break the command











up vote
0
down vote

favorite












I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.




  • Main page with list of accessions: http://signal.salk.edu/atg1001/download.php

  • Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]

  • I am only interested in the quality_variant_[accession] link (not quality_variant_filtered_[accession] link), which takes you to to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)


Running the command below, the files of interest are eventually outputted (but not downloaded because of the --spider argument), demonstrating that wget can move through the page's hyperlinks to the files I want.



wget --spider --recursive "http://signal.salk.edu/atg1001/download.php


I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.



# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"


However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.



# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"

# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.


I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.



Thank you!





Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.










share|improve this question
























  • Perhaps you'd be better off pulling a list of the URLs in question, and then passing those to wget?
    – miken32
    Nov 14 at 23:38










  • parsing directly the links in the page with regexp is maybe an option : stackoverflow.com/questions/1881237/…
    – SachaDee
    Nov 14 at 23:45

















up vote
0
down vote

favorite












I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.




  • Main page with list of accessions: http://signal.salk.edu/atg1001/download.php

  • Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]

  • I am only interested in the quality_variant_[accession] link (not quality_variant_filtered_[accession] link), which takes you to to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)


Running the command below, the files of interest are eventually outputted (but not downloaded because of the --spider argument), demonstrating that wget can move through the page's hyperlinks to the files I want.



wget --spider --recursive "http://signal.salk.edu/atg1001/download.php


I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.



# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"


However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.



# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"

# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.


I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.



Thank you!





Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.










share|improve this question
























  • Perhaps you'd be better off pulling a list of the URLs in question, and then passing those to wget?
    – miken32
    Nov 14 at 23:38










  • parsing directly the links in the page with regexp is maybe an option : stackoverflow.com/questions/1881237/…
    – SachaDee
    Nov 14 at 23:45















up vote
0
down vote

favorite









up vote
0
down vote

favorite











I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.




  • Main page with list of accessions: http://signal.salk.edu/atg1001/download.php

  • Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]

  • I am only interested in the quality_variant_[accession] link (not quality_variant_filtered_[accession] link), which takes you to to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)


Running the command below, the files of interest are eventually outputted (but not downloaded because of the --spider argument), demonstrating that wget can move through the page's hyperlinks to the files I want.



wget --spider --recursive "http://signal.salk.edu/atg1001/download.php


I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.



# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"


However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.



# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"

# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.


I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.



Thank you!





Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.










share|improve this question















I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.




  • Main page with list of accessions: http://signal.salk.edu/atg1001/download.php

  • Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]

  • I am only interested in the quality_variant_[accession] link (not quality_variant_filtered_[accession] link), which takes you to to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)


Running the command below, the files of interest are eventually outputted (but not downloaded because of the --spider argument), demonstrating that wget can move through the page's hyperlinks to the files I want.



wget --spider --recursive "http://signal.salk.edu/atg1001/download.php


I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.



# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"


However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.



# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"

# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.


I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.



Thank you!





Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.







wget






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 14 at 23:56









miken32

23k84671




23k84671










asked Nov 14 at 22:32









adirks

114




114












  • Perhaps you'd be better off pulling a list of the URLs in question, and then passing those to wget?
    – miken32
    Nov 14 at 23:38










  • parsing directly the links in the page with regexp is maybe an option : stackoverflow.com/questions/1881237/…
    – SachaDee
    Nov 14 at 23:45




















  • Perhaps you'd be better off pulling a list of the URLs in question, and then passing those to wget?
    – miken32
    Nov 14 at 23:38










  • parsing directly the links in the page with regexp is maybe an option : stackoverflow.com/questions/1881237/…
    – SachaDee
    Nov 14 at 23:45


















Perhaps you'd be better off pulling a list of the URLs in question, and then passing those to wget?
– miken32
Nov 14 at 23:38




Perhaps you'd be better off pulling a list of the URLs in question, and then passing those to wget?
– miken32
Nov 14 at 23:38












parsing directly the links in the page with regexp is maybe an option : stackoverflow.com/questions/1881237/…
– SachaDee
Nov 14 at 23:45






parsing directly the links in the page with regexp is maybe an option : stackoverflow.com/questions/1881237/…
– SachaDee
Nov 14 at 23:45



















active

oldest

votes











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53309727%2fusing-wget-to-recursively-fetch-txt-files-in-php-file-but-filters-break-the-c%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown






























active

oldest

votes













active

oldest

votes









active

oldest

votes






active

oldest

votes
















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53309727%2fusing-wget-to-recursively-fetch-txt-files-in-php-file-but-filters-break-the-c%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

How to change which sound is reproduced for terminal bell?

Can I use Tabulator js library in my java Spring + Thymeleaf project?

Title Spacing in Bjornstrup Chapter, Removing Chapter Number From Contents