how to wget files in more than one folder

avalente
Posts: 7
Joined: Thu Apr 30, 2020 7:44 am America/New_York
Answers: 0

by avalente » Thu Apr 30, 2020 10:37 am America/New_York

Hi,

I would like to use wget to download all MODIS-Aqua weekly (8-day) L3m chlor_a data, but I have not found a way to do it.
Can you please help me correct the command below?
wget -q -O - https://oceandata.sci.gsfc.nasa.gov/MODIS-Aqua/Mapped/8-Day/4km/chlor_a/ |grep chlor_a_4km.nc| wget --user=avalente --ask-password --auth-no-challenge=on --base https://oceandata.sci.gsfc.nasa.gov/ -N --wait=0.5 --random-wait --force-html -i -

Thanks!
Andre

OB ODPS - jgwilding
Subject Matter Expert
Posts: 118
Joined: Fri Feb 19, 2021 1:09 pm America/New_York
Answers: 0

by OB ODPS - jgwilding » Thu Apr 30, 2020 10:53 am America/New_York

Andre,

Have you tried using the file-search utility (https://oceandata.sci.gsfc.nasa.gov/api/file_search/) to obtain a file containing the URLs to the matching files and then giving that file to wget?

john

gnwiii
Posts: 677
Joined: Fri Jan 29, 2021 5:51 pm America/New_York
Answers: 2

by gnwiii » Thu Apr 30, 2020 11:37 am America/New_York

For some of us, wget has not been reliable. The "Download Methods" documentation describes methods for bulk downloads with wget, curl, and Python. The Python script has been reliable for me.

OB.DAAC - amscott
User Services
Posts: 249
Joined: Mon Jun 22, 2020 5:24 pm America/New_York
Answers: 1
Has thanked: 2 times
Been thanked: 1 time

by OB.DAAC - amscott » Thu Apr 30, 2020 11:58 am America/New_York

The issue with your command is that the directory you are using is not the direct path to the files you want to retrieve. A 'year' directory sits between the two. If you use file search, this command gets you the URLs of the data files, which you can then download:
wget -q --post-data="sensor=aqua&dtype=L3m&addurl=1&results_as_file=1&search=*8D_CHL_chlor_a_4km.nc" -O - https://oceandata.sci.gsfc.nasa.gov/api/file_search
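
For anyone who prefers to keep the directory-listing approach, the intermediate 'year' directories could be covered with a loop. This is only a minimal sketch: the base path is copied from the command in the question, while the helper name `year_dirs` and the 2003 to 2020 range are illustrative assumptions, and it has not been checked that every year directory exists on the server.

```shell
#!/bin/sh
# Base path of the 8-day, 4 km chlor_a listings (copied from the question).
BASE_DIR="https://oceandata.sci.gsfc.nasa.gov/MODIS-Aqua/Mapped/8-Day/4km/chlor_a"

# year_dirs FIRST LAST: print one listing URL per year (hypothetical helper).
year_dirs() {
    year=$1
    while [ "$year" -le "$2" ]; do
        printf '%s/%s/\n' "$BASE_DIR" "$year"
        year=$((year + 1))
    done
}

# Print the listing URLs for the assumed range of years.
year_dirs 2003 2020
```

Each printed URL could then go through the same "wget -q -O - URL | grep chlor_a_4km.nc" step inside a for loop, with the combined output piped to the downloading wget.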

avalente
Posts: 7
Joined: Thu Apr 30, 2020 7:44 am America/New_York
Answers: 0

by avalente » Thu Apr 30, 2020 3:06 pm America/New_York

Dear all,

Thanks for the messages. Can you please help me make wget use the text file with the URLs (list_files.txt)?
I tried the command below, but it is not working.
wget -i list_files.txt --user=avalente --ask-password --auth-no-challenge=on --base https://oceandata.sci.gsfc.nasa.gov/ -N --wait=0.5 --random-wait --force-html -i -

OB ODPS - jgwilding
Subject Matter Expert
Posts: 118
Joined: Fri Feb 19, 2021 1:09 pm America/New_York
Answers: 0

by OB ODPS - jgwilding » Thu Apr 30, 2020 3:40 pm America/New_York

Are you giving two -i options, one with list_files.txt and one with - (stdin)? I can see that being a conflict. I don't think you need the --base and --force-html options. Each record in list_files.txt should be a complete URL. If not, you can regenerate the file list, making sure the "Add URL Prefix" option is checked.

For example, this worked for me,

wget -i results.txt --user=my-username --ask-password --auth-no-challenge=on

john
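
A quick way to confirm that every record in the list is a complete URL, as suggested above, might look like the sketch below. The helper name `count_incomplete` is hypothetical, and list_files.txt is just the file name used earlier in this thread.

```shell
#!/bin/sh
# count_incomplete FILE: count non-empty lines that do not start with
# http:// or https:// (hypothetical helper).
count_incomplete() {
    grep -v -E '^https?://' "$1" | grep -c -v '^[[:space:]]*$'
}

# Check the list named on the command line, or list_files.txt by default.
LIST=${1:-list_files.txt}
if [ -f "$LIST" ]; then
    echo "Incomplete entries in $LIST: $(count_incomplete "$LIST")"
fi
```

A result other than 0 would suggest regenerating the list with the URL prefix included before passing it to wget -i.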

gnwiii
Posts: 677
Joined: Fri Jan 29, 2021 5:51 pm America/New_York
Answers: 2

by gnwiii » Thu Apr 30, 2020 3:42 pm America/New_York

A description of how wget is failing could help someone recognize your problem. Do some files from the list download correctly? Are you getting lots of short files with HTML content (this is easy to see if you use wget with --adjust-extension)? If you want to stick with wget, here are workarounds that have helped:

1. When using a ~/.netrc file, try adding an entry for oceandata.sci.gsfc.nasa.gov with the same username and password used with urs.earthdata.nasa.gov.

2. Sometimes wget tries to make connections using IPv6 and gets "no route to host". If you encounter this, try the --inet4-only option.

3. Try wget with and without --auth-no-challenge.

4. Try a different version of wget, or even wget2.

5. The wget 1.19.1 man page has a section describing the use of post data and cookies. There was a suggestion in a wget2 bug report that this method should be used with wget2.

This example shows how to log in to a server using POST and then proceed to download the desired pages,
           presumably only accessible to authorized users:

                   # Log in to the server.  This can be done only once.
                   wget --save-cookies cookies.txt \
                        --post-data 'user=foo&password=bar' \
                        http://example.com/auth.php

                   # Now grab the page or pages we care about.
                   wget --load-cookies cookies.txt \
                        -p http://example.com/interesting/article.php

           If the server is using session cookies to track user authentication, the above will not work because
           --save-cookies will not save them (and neither will browsers) and the cookies.txt file will be empty.  In
           that case use --keep-session-cookies along with --save-cookies to force saving of session cookies.


If you have python3, the obdaac_download.py script is straightforward to use:

% wget -q --post-data="sensor=aqua&dtype=L3m&addurl=1&results_as_file=1&search=*8D_CHL_chlor_a_4km.nc" -O - https://oceandata.sci.gsfc.nasa.gov/api/file_search > l3m_8D_CHL_chlor_a_4km.list
% head -1 l3m_8D_CHL_chlor_a_4km.list > l3m_8D_CHL_chlor_a_4km.small.list
% tail -1 l3m_8D_CHL_chlor_a_4km.list >> l3m_8D_CHL_chlor_a_4km.small.list

The list lacks a newline at the end, so add one with vi:

% vi l3m_8D_CHL_chlor_a_4km.small.list
% cat l3m_8D_CHL_chlor_a_4km.small.list
https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/A20021852002192.L3m_8D_CHL_chlor_a_4km.nc
https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/A20200892020096.L3m_8D_CHL_chlor_a_4km.nc
% ~/bin/obdaac_download.py
usage: obdaac_download.py [-h] [-v] [--filelist FILELIST]
                          [--http_manifest HTTP_MANIFEST] [--odir ODIR]
                          [--uncompress] [--force]
                          [filename]
Please provide a filename (or list file) to retrieve
% ~/bin/obdaac_download.py -v --filelist l3m_8D_CHL_chlor_a_4km.small.list
Output directory: /home/seadas
Retrieving https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/A20021852002192.L3m_8D_CHL_chlor_a_4km.nc
Retrieving https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/A20200892020096.L3m_8D_CHL_chlor_a_4km.nc
% file *.nc
A20021852002192.L3m_8D_CHL_chlor_a_4km.nc: Hierarchical Data Format (version 5) data
A20200892020096.L3m_8D_CHL_chlor_a_4km.nc: Hierarchical Data Format (version 5) data
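
The step above that adds the missing trailing newline with vi can also be done non-interactively. A minimal sketch, assuming the list is a plain text file; the helper name `ensure_trailing_newline` is hypothetical.

```shell
#!/bin/sh
# ensure_trailing_newline FILE: append a newline only if the file is
# non-empty and its last byte is not already a newline (hypothetical helper).
ensure_trailing_newline() {
    # Command substitution strips a trailing newline, so the result is
    # non-empty only when the file ends with some other character.
    if [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]; then
        printf '\n' >> "$1"
    fi
}

# Apply it to the short list from the example, if that file exists.
LIST=l3m_8D_CHL_chlor_a_4km.small.list
if [ -f "$LIST" ]; then
    ensure_trailing_newline "$LIST"
fi
```

Running it a second time leaves the file unchanged, so it is safe to include in a script.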

avalente
Posts: 7
Joined: Thu Apr 30, 2020 7:44 am America/New_York
Answers: 0

by avalente » Fri May 01, 2020 6:30 am America/New_York

The command John provided (below) works, but it only downloads files of ~12 KB.
     wget -i list_files.txt --user=avalente --ask-password -auth-no-challenge=on

All this started when my old command (below) stopped working. It was working fine a few months ago, but now it only returns 12 KB files (like the command above).
     wget -q --post-data="sensor=aqua&dtype=L3m&addurl=1&results_as_file=1&search=*8D_CHL_chlor_a_4km.nc" -O - https://oceandata.sci.gsfc.nasa.gov/api/file_search

Then I realized that something might have changed, since there is a new OBPG download_methods section with different methods (including passwords). So I adapted the example for bulk transfers of SST4 to my request (below), but I could only do it for one year at a time (e.g. 2008). My question is how to adapt the command below to get all years 2003-2020 at once (they seem to be in different folders). Is it possible? If not, what do you think is happening, since the command John gave works for him?
    wget -q -O - https://oceandata.sci.gsfc.nasa.gov/MODIS-Aqua/Mapped/8-Day/4km/chlor_a/2008/ |grep chlor_a_4km.nc| wget --user=avalente --ask-password --auth-no-challenge=on --base https://oceandata.sci.gsfc.nasa.gov/ -N --wait=0.5 --random-wait --force-html -i -

Yes, I would like to stick with wget (I don't use Python). As for the options from gnwiii:

1. How would I do that? Is it done with the commands below? If so, I did that before (not sure if correctly). Still 12 KB.
echo "machine urs.earthdata.nasa.gov login USERNAME password PASSWD" > ~/.netrc ; > ~/.urs_cookies
chmod  0600 ~/.netrc

2. No, I did not see a "no route to host" message.

3. Done, same 12 KB.

4. wget2 says "Unknown option 'ask-password'" for the command John sent, and gives errors for the command I used to use.

5. I did not understand that one.

OB ODPS - jgwilding
Subject Matter Expert
Posts: 118
Joined: Fri Feb 19, 2021 1:09 pm America/New_York
Answers: 0

by OB ODPS - jgwilding » Fri May 01, 2020 7:23 am America/New_York

How about this.

Add your user name and password to the ~/.netrc file first with,

echo "machine urs.earthdata.nasa.gov login YOUR-USERNAME password YOUR-PASSWD" >> ~/.netrc

If not already set correctly, change mode with, chmod 600 ~/.netrc

Then run,

wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --content-disposition -i FILE-CONTAINING-URLS --auth-no-challenge=on

john

gnwiii
Posts: 677
Joined: Fri Jan 29, 2021 5:51 pm America/New_York
Answers: 2

by gnwiii » Fri May 01, 2020 8:24 am America/New_York

For the record it would be good to provide the version of your wget program and OS version.  Someone with the same configuration may have a solution.  
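
One possible way to collect those details for a post is sketched below. It assumes a Unix-like shell; note that uname reports the kernel name and release, which is not always the same as the marketing OS version.

```shell
#!/bin/sh
# Print the details worth quoting in a forum post.
if command -v wget >/dev/null 2>&1; then
    wget --version | head -n 1    # the first line holds the version string
else
    echo "wget not found on PATH"
fi
uname -sr                         # kernel name and release
```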

As you can see, working with wget requires some familiarity with the command line (usually bash, but zsh is gaining popularity and is used on recent macOS).
I think you will find it helpful over the long run to spend a few afternoons studying command line concepts. A good place to start is Linux Command.

Small downloads with wget are usually HTML login pages. Did you try adding --adjust-extension to your wget command line? This may give you HTML pages you can load in a browser. Every time this happened to me I got an EarthData login page. Adding the second "machine" entry to your ~/.netrc may help:

echo "machine oceandata.sci.gsfc.nasa.gov login USERNAME password PASSWD" >> ~/.netrc

The >> appends the new entry to the existing file.  The permissions should already be OK, but you should check because wget may refuse to use a ~/.netrc with the wrong permissions:

ls -l ~/.netrc
-rw------- 1 <user> <group> 149 Apr 25 09:09 /home/<user>/.netrc
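
Because >> appends every time it is run, repeating the echo command leaves duplicate entries in the file. A minimal sketch of an append-if-missing version follows; the helper name `add_netrc_entry` is hypothetical, and it assumes single-line entries like the ones shown in this thread.

```shell
#!/bin/sh
# add_netrc_entry FILE HOST USER PASS: append an entry for HOST unless one
# is already present, then restrict the file to owner read/write.
add_netrc_entry() {
    touch "$1"
    # -F matches the host name literally, so its dots are not regex wildcards.
    if ! grep -q -F "machine $2 " "$1"; then
        printf 'machine %s login %s password %s\n' "$2" "$3" "$4" >> "$1"
    fi
    chmod 600 "$1"
}

# Example call (left commented out so that nothing is modified by accident):
# add_netrc_entry "$HOME/.netrc" oceandata.sci.gsfc.nasa.gov USERNAME PASSWD
```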


You can check the type of a downloaded file using the file command.    I used this to check the files downloaded using obdaac_download.py (note that NetCDF4 files use the HDF5 storage format) in my previous post.   If you still have one of the short .nc files that contains html, just replace .nc with .html and try viewing it in a browser.
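
If the file command is not available, a rough check for HTML content can be done with head and grep. This is only a sketch under the assumption that a login page starts with a doctype or html tag within its first 512 bytes; the helper name `looks_like_html` is hypothetical.

```shell
#!/bin/sh
# looks_like_html FILE: succeed if the first 512 bytes contain an HTML
# doctype or an opening html tag (hypothetical helper).
looks_like_html() {
    # -a treats binary input as text, -i ignores case, -q suppresses output.
    head -c 512 "$1" | grep -a -i -q -E '<!doctype html|<html'
}

# Report every .nc file in the current directory that looks like HTML.
for f in *.nc; do
    [ -e "$f" ] || continue
    if looks_like_html "$f"; then
        echo "HTML, not data: $f"
    fi
done
```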

The recent change to require passwords for OBPG downloads caused similar problems for many users. The EarthData login uses "single sign-on" (SSO), in which your first attempt to connect to the OceanData server is redirected to the EarthData server. After you log in, cookies are saved to validate your session when you connect to the OceanData server. The same process is used when you log into this forum with a browser. Different SSO mechanisms have been tried in the past, and various people have contributed support for them to the wget code base. As a result, the wget code became a tangled mess, so the current developers decided to "redo from start" and are working on wget2, which will eventually replace wget. Point 5 is the suggested workaround for SSO problems: use wget once to log in and set cookies, then use the cookies to validate your session with the OceanData server and start a download.
