Download Multiple ASDC Files with Wget

by njester » Mon May 24, 2021 11:44 am America/New_York

The following steps let you download data from https://asdc.larc.nasa.gov/data/ on Linux, macOS, or Windows with Cygwin.

1) If you haven't already, create an authentication cookie that can be used to access files behind the Earthdata login page.

Code: Select all

USERNAME=<your earthdata username>
PASSWORD=<your earthdata password>
cd ~
# store credentials where wget's .netrc support can find them
touch .netrc
echo "machine urs.earthdata.nasa.gov login $USERNAME password $PASSWORD" > .netrc
# lock the file down so only you can read it
chmod 0600 .netrc
# create an empty cookie jar for wget
touch .urs_cookies
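You can optionally confirm the setup before moving on (the permissions shown are what chmod 0600 should produce):

Code: Select all

ls -l ~/.netrc                         # should show -rw------- (owner-only access)
grep urs.earthdata.nasa.gov ~/.netrc   # should print your machine/login/password line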
2) You can download data with the following commands:

Code: Select all

URL=<your url here>
wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --auth-no-challenge=on --keep-session-cookies --content-disposition  --recursive --wait=60 --no-parent --reject "index.html*" --execute robots=off $URL
Recursive wget is a fairly complex and powerful tool with lots of useful options for things like filtering results. I've provided an explanation of the options I chose below.

The following arguments are required to authenticate:
  • --load-cookies ~/.urs_cookies
  • --save-cookies ~/.urs_cookies
  • --auth-no-challenge=on
  • --keep-session-cookies
  • --content-disposition
  • --no-check-certificate (listed for completeness; it is not in the command above and should only be needed if your system cannot validate the server's certificate)
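As a quick sanity check, the authentication options alone are enough to fetch a single file before attempting a full recursive download (the URL below is only a placeholder):

Code: Select all

FILE_URL=<url of a single data file>
wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --auth-no-challenge=on --keep-session-cookies --content-disposition $FILE_URL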
Other Arguments:
  1. --recursive: To recursively download files, wget performs the following process:
    1. Downloads the requested page; this is the first level (like the root of a file tree)
    2. Scans the document(s) downloaded at this level for links
    3. Downloads all the links found at this level; these contents form a new level
    4. Repeats steps 2 and 3 for the number of levels defined by the --level=depth argument. If undefined, depth defaults to 5.
  2. --wait=60: Wait the specified number of seconds (in this case, 60) between retrievals. From the documentation at https://www.gnu.org/software/wget/manua ... index-wait :
    “You should be warned that recursive downloads can overload the remote servers...consider using the -w option to introduce a delay between accesses to the server. The download will take a while longer, but the server administrator will not be alarmed by your rudeness.” –Wget Documentation
  3. --no-parent: ignores any links that point to a location above the current level (such as a link to the top-level site)
  4. --reject "index.html*": (optional) skips downloading the index page itself; users are unlikely to want this page's contents and will instead want the data behind the links on the page.
  5. --execute robots=off: tells wget to ignore robots.txt, a file which tells automated tools where not to search.
  6. --accept="<pattern>": filters files based on a file name pattern (see the sketch after this list).
  7. --accept-regex="<pattern>": filters files by URL using a regular expression.
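For example, here's a sketch of a filtered recursive download that keeps only netCDF files and limits the recursion depth (the --accept pattern and depth are illustrative; adjust them for your dataset):

Code: Select all

URL=<your url here>
wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --auth-no-challenge=on --keep-session-cookies --content-disposition --recursive --level=2 --wait=60 --no-parent --reject "index.html*" --accept "*.nc" --execute robots=off $URL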
Wget has lots of options, so if you need something that the commands above don't do, check the official documentation:
https://www.gnu.org/software/wget/
