1) If you haven't done so before, create an authentication cookie file that can be used to access files behind the EarthData login page. Please note: if you already have a .netrc file, do not run the line that starts with echo, since it would overwrite the file. Instead, open .netrc in the editor of your choice and add the line "machine urs.earthdata.nasa.gov login $USERNAME password $PASSWORD", replacing the words starting with $ with your Earthdata login credentials (or use the append sketch after the code block below).
Code: Select all
USERNAME=<your earthdata username>
PASSWORD=<your earthdata password>
cd ~
# Create the file and lock down permissions before writing credentials into it
touch .netrc
chmod 0600 .netrc
# Write the Earthdata entry ($USERNAME and $PASSWORD expand to the values set above)
echo "machine urs.earthdata.nasa.gov login $USERNAME password $PASSWORD" > .netrc
# Create an empty cookie file for wget to read and write session cookies
touch .urs_cookies
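If you already have a .netrc file, a minimal sketch of the append alternative mentioned above looks like this (it assumes the file does not already contain an entry for urs.earthdata.nasa.gov):
Code: Select all
# Append the Earthdata entry to an existing ~/.netrc instead of overwriting it
echo "machine urs.earthdata.nasa.gov login $USERNAME password $PASSWORD" >> ~/.netrc
chmod 0600 ~/.netrc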
2) Use wget together with the .netrc and cookie files to download the data:
Code: Select all
URL=<the data url goes here>
wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --auth-no-challenge=on --keep-session-cookies --content-disposition --recursive --wait=60 --no-parent --reject "index.html*" --execute robots=off $URL
The following arguments handle authentication (a minimal single-file test using just these flags follows this list):
- --load-cookies ~/.urs_cookies: read cookies from the cookie file before making requests
- --save-cookies ~/.urs_cookies: write cookies back to the cookie file when the session ends
- --auth-no-challenge=on: send Basic authentication credentials without waiting for the server's challenge
- --keep-session-cookies: save session cookies as well, which EarthData uses to track the login
- --content-disposition: name downloaded files using the server-supplied filename
- --no-check-certificate: skip SSL certificate validation (not included in the example command above; add it only if the download fails with a certificate error)
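Before launching a long recursive download, it can help to verify that authentication works by fetching a single file with just the flags above; the URL below is a placeholder for a direct link to one data file:
Code: Select all
# Authentication test: download one file (replace the placeholder with a real file link)
wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies \
     --auth-no-challenge=on --keep-session-cookies --content-disposition \
     "<a direct link to a single data file>"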
The remaining arguments control the recursive download:
- --recursive: to recursively download files, wget performs the following process:
- Downloads the requested page; this is the first level (like the root of a file tree)
- Scans the document(s) downloaded at this level for links
- Downloads all the links found at this level; their contents form a new level
- Repeats steps 2 and 3 for the number of levels defined by the --level=depth argument. If undefined, depth defaults to 5.
- --wait=60: wait the specified number of seconds (in this case, 60) between retrievals. From the documentation at https://www.gnu.org/software/wget/manua ... index-wait
“You should be warned that recursive downloads can overload the remote servers...consider using the -w option to introduce a delay between accesses to the server. The download will take a while longer, but the server administrator will not be alarmed by your rudeness.” –Wget Documentation
- --no-parent: ignores any links that point to a location above the current level (such as a link to the top-level site)
- --reject "index.html*": (optional) does not download the contents of the index pages; users are unlikely to want these pages themselves, only the data they link to.
- --execute robots=off: tells wget to ignore robots.txt, a file that tells automated tools which parts of a site not to crawl.
- --accept="<pattern>": filters files based on a file name pattern (see the example after this list).
- --accept-regex="<pattern>": filters files by URL using a regular expression.
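As an example of combining these options, the sketch below limits the recursion to two levels and keeps only files matching a name pattern; the .nc4 extension is an assumption for illustration, so adjust it to match the dataset you are downloading:
Code: Select all
# Recurse two levels deep and keep only *.nc4 files (the pattern is an assumption)
URL=<the data url goes here>
wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies \
     --auth-no-challenge=on --keep-session-cookies --content-disposition \
     --recursive --level=2 --wait=60 --no-parent --reject "index.html*" \
     --execute robots=off --accept "*.nc4" $URL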
https://www.gnu.org/software/wget/