Download Multiple ASDC Files with Wget

njester
Posts: 5
Joined: Sat Mar 06, 2021 9:03 am America/New_York

Download Multiple ASDC Files with Wget

by njester » Mon May 24, 2021 11:44 am America/New_York

The following allows you to download data from https://asdc.larc.nasa.gov/data/ using Linux, macOS, or Windows with Cygwin.

1) If you haven't done so before, create the credentials file and cookie file that wget uses to access files behind the Earthdata login page.

Code:

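USERNAME=<your earthdata username>
PASSWORD=<your earthdata password>
cd ~
touch .netrc
# Note: '>' replaces the contents of any existing ~/.netrc
echo "machine urs.earthdata.nasa.gov login $USERNAME password $PASSWORD" > .netrc
# Restrict the file to your user, since it holds your password
chmod 0600 .netrc
# Empty cookie file that wget will load from and save to
touch .urs_cookies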
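If you already keep other logins in ~/.netrc, note that the `>` redirect in the commands above replaces the whole file. The following is a minimal sketch of an alternative that appends the Earthdata entry only when it is missing. The function name `add_urs_netrc` and its optional third argument are hypothetical, made up for this illustration.

```shell
# Sketch: add the Earthdata entry to a netrc file only if it is not already there.
# add_urs_netrc is a hypothetical helper name, not part of wget or Earthdata.
add_urs_netrc() {
    user=$1
    pass=$2
    netrc="${3:-$HOME/.netrc}"   # optional third argument: path to the netrc file
    touch "$netrc"               # create the file if it does not exist
    chmod 0600 "$netrc"          # keep the stored password private
    # Append only when no entry for the Earthdata login host exists yet
    if ! grep -q "machine urs.earthdata.nasa.gov" "$netrc"; then
        echo "machine urs.earthdata.nasa.gov login $user password $pass" >> "$netrc"
    fi
}
```

Calling `add_urs_netrc "$USERNAME" "$PASSWORD"` a second time leaves the file unchanged, so it is safe to rerun.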
2) You can download data with the following commands:

Code:

URL=<your url here>
wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --auth-no-challenge=on --keep-session-cookies --content-disposition --recursive --wait=60 --no-parent --reject "index.html*" --execute robots=off "$URL"
Recursive wget is a fairly complex and powerful feature with lots of useful options for things like filtering results. I've provided an explanation of the options that I chose below.

The following arguments are required to authenticate:
  • --load-cookies ~/.urs_cookies
  • --save-cookies ~/.urs_cookies
  • --auth-no-challenge=on
  • --keep-session-cookies
  • --content-disposition
  • --no-check-certificate
Other Arguments:
  1. --recursive: To recursively download files, wget performs the following process:
    1. Downloads the requested page; this is the first level (like the root of a file tree)
    2. Scans the document(s) downloaded at this level for links
    3. Downloads all the links found at this level; these contents form a new level
    4. Repeats steps 2 and 3 for <depth> levels, as defined by the --level=depth argument. If undefined, depth is set to 5.
  2. --wait=60: Waits the specified number of seconds (in this case, 60) between retrievals. From the documentation at https://www.gnu.org/software/wget/manua ... index-wait
    “You should be warned that recursive downloads can overload the remote servers...consider using the -w option to introduce a delay between accesses to the server. The download will take a while longer, but the server administrator will not be alarmed by your rudeness.” –Wget Documentation
  3. --no-parent: ignores any links that point to a location above the starting directory (such as a link to the top-level site)
  4. --reject "index.html*": (optional) does not keep the index pages. It is unlikely that users will want these pages' contents; they will want the data in the links on the pages instead.
  5. --execute robots=off: tells wget to ignore robots.txt, a file which tells automated tools where not to search.
  6. --accept="<pattern>": (optional, not used above) filters files based on a file name pattern; takes a comma-separated list of suffixes or patterns.
  7. --accept-regex="<pattern>": (optional, not used above) filters files by URL using a regular expression.
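As a rough illustration of the filtering options, here is a sketch that wraps the command from step 2 and adds --level and --accept. The function name `fetch_asdc`, the `*.nc4` pattern, and the depth of 2 in the usage line are placeholders I picked for the example; use whatever matches the files you want. The `WGET` variable is only a hook so the command line can be inspected without touching the network, and it defaults to plain wget.

```shell
# Sketch: the step 2 command plus --level and --accept filtering.
# fetch_asdc is a hypothetical wrapper name chosen for this example.
fetch_asdc() {
    url=$1          # directory URL to start from
    pattern=$2      # file name pattern to keep, e.g. "*.nc4" (placeholder)
    depth=${3:-5}   # recursion depth; 5 is wget's default when --level is not given
    # WGET defaults to wget; set WGET=echo to print the arguments instead of running them
    ${WGET:-wget} --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies \
        --auth-no-challenge=on --keep-session-cookies --content-disposition \
        --recursive --level="$depth" --wait=60 --no-parent \
        --reject "index.html*" --accept "$pattern" \
        --execute robots=off "$url"
}
```

For example, `fetch_asdc "$URL" "*.nc4" 2` would keep only files ending in .nc4 and follow links two levels deep, assuming the data files in that directory use that extension.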
Wget has lots of options, so if you need something that the script above doesn't do, check the official documentation:
https://www.gnu.org/software/wget/

lefsky
Posts: 1
Joined: Fri Aug 13, 2021 2:00 pm America/New_York

Re: Download Multiple ASDC Files with Wget

by lefsky » Fri Aug 13, 2021 2:35 pm America/New_York

I am trying to adapt this script to allow me to download all of the file locations within a particular directory, specifically

https://asdc.larc.nasa.gov/data/DSCOVR/EPIC/L2_CLOUD_03

using the --spider option but it only downloads the index.html.* files rather than rejecting them. Can you assist?

ASDC - cheyenne.e.land
User Services
Posts: 24
Joined: Mon Mar 22, 2021 3:55 pm America/New_York
Answers: 1

Re: Download Multiple ASDC Files with Wget

by ASDC - cheyenne.e.land » Mon Aug 16, 2021 11:22 am America/New_York

Hello,

Thank you for your question. A Subject Matter Expert has been notified and will answer your question shortly.

njester
Posts: 5
Joined: Sat Mar 06, 2021 9:03 am America/New_York

Re: Download Multiple ASDC Files with Wget

by njester » Thu Aug 19, 2021 4:35 pm America/New_York

lefsky wrote: Fri Aug 13, 2021 2:35 pm America/New_York I am trying to adapt this script to allow me to download all of the file locations within a particular directory, specifically

https://asdc.larc.nasa.gov/data/DSCOVR/EPIC/L2_CLOUD_03

using the --spider option but it only downloads the index.html.* files rather than rejecting them. Can you assist?
The script provided should download all the files within a directory. Did you use the code from the first code block to create the .netrc and .urs_cookies files? If not, or if the credentials you entered were wrong, wget will download an HTML page saying that the login failed. I don't think you'll need --spider for this application unless I've misunderstood your question.
-Nathan
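
One quick way to check for the failed-login case is to look for downloaded files that are really HTML pages. This is only a sketch: the function name `find_html_downloads` is made up for the example, and it assumes a real data file will not begin with an HTML tag in its first 512 bytes.

```shell
# Sketch: list downloaded files whose first bytes look like an HTML page,
# which suggests a login page was saved instead of the data.
# find_html_downloads is a hypothetical helper name.
find_html_downloads() {
    dir=${1:-.}   # directory to scan; defaults to the current directory
    find "$dir" -type f | while IFS= read -r f; do
        # Only inspect the start of each file so large data files stay cheap to check
        if head -c 512 "$f" | grep -qi -e '<!doctype html' -e '<html'; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `find_html_downloads asdc.larc.nasa.gov` after a download should print nothing if every file is real data. Any file names it prints are worth opening in a text editor to see the error message.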
