Crafting URLs to download data? the page automatically blocks me

Use this Forum to find information on, or ask a question about, NASA Earth Science data.
marcsandoval
Posts: 15
Joined: Wed Sep 19, 2018 11:22 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by marcsandoval » Tue Nov 13, 2018 1:02 pm America/New_York

Hello!

I'm trying to download several MODIS-Aqua chlorophyll and SST L2 1 km images. I'm using a script like this (which I saw in an older topic with the same name, which is why I kept the topic title):

query="?sub=level1or2list&sen=am&per=DAY&day=$dat&n=$n&s=$s&w=$w&e=$e"
    wget -qO - \
    $url$query \
    | perl -n -0777 \
           -e 'if(/filenamelist&id=(\d+\.\d+)/){' \
           -e 'print `wget "'$url'?sub=filenamelist&id=$1&prm=CHL" -qO -`;' \
           -e '}' \
           -e 'elsif(/(A\d+\.L2_LAC_OC)/){' \
           -e 'print "$1\n";' \
           -e '}' > temporal.txt

while read filename; do
.....

echo $filename | wget -B https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/ --content-disposition -i -

done <temporal.txt
rm temporal.txt

And for each day "temporal.txt" can sometimes contain several swaths, but when I try to download them some of the files don't exist (I don't know why the page gives me the name of a swath that is supposed to cover my region but doesn't exist), and the terminal gets stuck trying to connect to the page until it finally gives up, shows an error, and moves on to the next file in "temporal.txt". After several errors, the page automatically blocks me. Do you know if there is an option or flag in wget to fix that?
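For example, I imagine something like adding timeout and retry limits so that a dead connection fails fast instead of hanging (just a sketch; I haven't confirmed the right values for GNU wget):

echo $filename | wget --timeout=30 --tries=2 --waitretry=5 \
    -B https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/ --content-disposition -i -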

Thanks

OB WebDev - norman
Subject Matter Expert
Posts: 143
Joined: Tue Feb 09, 2021 8:19 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by OB WebDev - norman » Tue Nov 13, 2018 3:45 pm America/New_York

Hi,

It looks like you patterned your script after my earlier post.

https://oceancolor.gsfc.nasa.gov/forum/oceancolor/topic_show.pl?pid=41578

Since you omit parts of your script, however, it is hard to guess
what might be wrong.  If you would post one of the browse.pl
URLs that is returning non-existent files along with the time
that you are making the request and the IP address you are
coming from, I could investigate further.

Regards,
Norman

marcsandoval
Posts: 15
Joined: Wed Sep 19, 2018 11:22 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by marcsandoval » Tue Nov 13, 2018 9:09 pm America/New_York

It's not that topic; it's this one: https://oceancolor.gsfc.nasa.gov/forum/oceancolor/topic_show.pl?pid=19165

but it's pretty much the same.

My code looks much like the one in that topic:
########################################################################
#! /bin/bash
# Download MODIS imagery from lat/lon limits at time (days since 1/1/70)

year=$1
month=$2
day=$3

url=http://oceancolor.gsfc.nasa.gov/cgi/browse.pl
n=-31.99
s=-35.54
e=-76.91
w=-82.65
dat=`echo $((($(date +%s --date "$year-$month-$day")-$(date +%s --date "1970-1-1"))/(3600*24)))`
time=`date -d "UTC 1970-01-01 $dat days" -u --utc`

# CHLA
  echo "************CHL on $time between $s and $n N, and between $w and $e E"
  echo  
  query="?sub=level1or2list&sen=am&per=DAY&day=$dat&n=$n&s=$s&w=$w&e=$e"
  wget -qO - \
  $url$query \
  | perl -n -0777 \
         -e 'if(/filenamelist&id=(\d+\.\d+)/){' \
         -e 'print `wget "'$url'?sub=filenamelist&id=$1&prm=CHL" -qO -`;' \
         -e '}' \
         -e 'elsif(/(A\d+\.L2_LAC_OC)/){' \
         -e 'print "$1\n";' \
         -e '}' > temporal.txt
 
  while read filename; do
    echo
    echo "working on $filename"
    echo

    if [ -f /media/marco/100246E00246CA84/Imagenes_satelitales/1km/CHL-SST-ModisAqua/JF/images/$year/$filename ]
    then
      echo
      echo "$filename already exists"
      echo
    else
      echo
      echo "$filename doesn't exist, we are going to download it"
      echo
      echo $filename | wget -B https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/ --content-disposition -i -
      echo

      echo
      echo "we move it to the folder /images/$year"
      echo
      mv A* images/$year/
    fi

  done <temporal.txt
  rm temporal.txt

##############and then is the same for SST##################

The day, month and year are passed in from a MATLAB script. The file "temporal.txt" gives me, for example:

A2017105052500.L2_LAC_OC.nc
A2017105070000.L2_LAC_OC.nc
A2017105070500.L2_LAC_OC.nc
A2017105190500.L2_LAC_OC.nc

which you can see here: https://oceandata.sci.gsfc.nasa.gov/MODIS-Aqua/L2/2017/105/
Those actually exist for SST but not for OC (chlorophyll).

My IP is: xx.xxx.xx.xx

Best regards!

OB WebDev - norman
Subject Matter Expert
Posts: 143
Joined: Tue Feb 09, 2021 8:19 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by OB WebDev - norman » Wed Nov 14, 2018 10:11 am America/New_York

Hi,

Your computation of dat yields a day that is one day too early for me because the date
command is using the local time zone and January 1 is during standard time, but
April 15 is during daylight saving time.  (Your location may have different time rules.)
Why not just avoid the issue altogether by using the -u option with the date command?

dat=`echo $(($(date -u +%s --date "$year-$month-$day")/86400))`
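If you want to double-check the value, you can turn it back into a calendar date (a quick sanity check, assuming GNU date; dat is the variable from your script):

date -u -d "@$((dat * 86400))" +%Y-%m-%d    # should print the requested day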

As for the reason that you are getting nonexistent files listed, it is because three
of the files in your example refer to nighttime scenes for which there are no
ocean-color products.  Try adding "dnm=D" to your query string like this.

query="?sub=level1or2list&sen=am&per=DAY&day=$dat&n=$n&s=$s&w=$w&e=$e&dnm=D"

When I do that, I only get the file A2017105190500.L2_LAC_OC in my list of results.
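For reference, the assembled request looks something like this (day 17271 is 15-Apr-2017, and the bounding box is the one from your script; I am using the https form of the browse.pl URL):

wget -qO - "https://oceancolor.gsfc.nasa.gov/cgi/browse.pl?sub=level1or2list&sen=am&per=DAY&day=17271&n=-31.99&s=-35.54&w=-82.65&e=-76.91&dnm=D"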

For what it's worth I do not see any activity from xx.xxx.xx.xx in our recent server logs.
If you are still having connection issues and suspect a block, please email
connection_problems@oceancolor.sci.gsfc.nasa.gov as described in the following post.
https://oceancolor.gsfc.nasa.gov/forum/oceancolor/topic_show.pl?tid=6447

Norman

marcsandoval
Posts: 15
Joined: Wed Sep 19, 2018 11:22 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by marcsandoval » Wed Nov 14, 2018 11:27 am America/New_York

Hello Norman,

I applied what you said, but it still gives me an error with that date (29-Apr-2017), for example:

##############################################

working on A2017105190500.L2_LAC_OC
A2017105190500.L2_LAC_OC doesn't exist, we are going to download it

--2018-11-14 13:15:21--  https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/A2017105190500.L2_LAC_OC
Resolving oceandata.sci.gsfc.nasa.gov (oceandata.sci.gsfc.nasa.gov)... xx.xxx.xx.xx, 2001:4d0:2418:128::84
Connecting to oceandata.sci.gsfc.nasa.gov (oceandata.sci.gsfc.nasa.gov)[xx.xxx.xx.xx]:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-11-14 13:15:27 ERROR 404: Not Found.

##############################################
I added "dnm=D" to SST to and gives me the same:

##############################################

working on A2017105190500.L2_LAC_SST
A2017105190500.L2_LAC_SST doesn't exist, we are going to download it

--2018-11-14 13:15:30--  https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/A2017105190500.L2_LAC_SST
Resolving oceandata.sci.gsfc.nasa.gov (oceandata.sci.gsfc.nasa.gov)... xx.xxx.xx.xx, 2001:4d0:2418:128::84
Connecting to oceandata.sci.gsfc.nasa.gov (oceandata.sci.gsfc.nasa.gov)[xx.xxx.xx.xx]:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-11-14 13:15:36 ERROR 404: Not Found.

##############################################

And I'm sorry, I meant to say that I had been blocked by the site, but yesterday I sent an email and now it's OK.

Regards

OB WebDev - norman
Subject Matter Expert
Subject Matter Expert
Joined: Tue Feb 09, 2021 8:19 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by OB WebDev - norman » Wed Nov 14, 2018 12:16 pm America/New_York

Hi,

There is still some confusion about dates in your message.  April 29, 2017 is the
119th day of the year, so corresponding file names would start with "A2017119".
File names starting with "A2017105" contain data from April 15, 2017.
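You can verify the day-of-year arithmetic with the date command (GNU date shown here):

date -d "2017-04-29" +%j              # 119, so file names start with A2017119
date -d "2017-01-01 +104 days" +%F    # 2017-04-15, i.e. day 105, hence A2017105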

A colleague of mine pointed out that the files you are attempting to download
are missing the ".nc" ending.  (I missed that omission in your earlier post.)
For the case where the query only returns a single file, you now need to specify
the ".nc" in the elsif statement that handles that case.  So, for OC files:

        -e 'elsif(/(A\d+\.L2_LAC_OC\.nc)/){' \
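The SST half of your script would presumably need the matching change (my assumption, based on the SST file names in your listing):

        -e 'elsif(/(A\d+\.L2_LAC_SST\.nc)/){' \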

At the time I wrote the original post you initially referenced, our files did not have the
NetCDF ending.  Again, see the more recent post at:
https://oceancolor.gsfc.nasa.gov/forum/oceancolor/topic_show.pl?pid=41578 .

I grant you that this parsing of HTML returned by our browser is an awkward way of
doing things, but the current browse code was not originally intended to support these
sorts of requests.  Other folks in our group have started to think about new ways to
implement data searches, but such improvements are probably still some years in
the future.

Norman

marcsandoval
Posts: 15
Joined: Wed Sep 19, 2018 11:22 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by marcsandoval » Thu Nov 15, 2018 7:21 am America/New_York

Hello Norman,

Now I think it's working, thank you so much. About the date, that was my mistake... my script looks for 15 days before and after a specific date, so it is OK. The solution was to add ".nc".
About what you said regarding the way to subset the data, yes, it is kind of awkward. Maybe your group could consider something like the Blended Sea Winds product with different ways to subset the data (https://www.ncei.noaa.gov/thredds/catalog/uv/daily_agg/catalog.html?dataset=uv/daily_agg/Aggregation_of_Daily_Ocean_Wind_best.ncd) or something like the marine.copernicus.eu web page with its Python package called "motuclient" (GitHub Link).

Best regards!

OB WebDev - norman
Subject Matter Expert
Posts: 143
Joined: Tue Feb 09, 2021 8:19 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by OB WebDev - norman » Thu Nov 15, 2018 8:35 am America/New_York

Hi,

I'm glad you got it working.  Thanks for the pointers to other services.
It's always helpful to see what strategies others use to distribute data.

Norman

marcsandoval
Posts: 15
Joined: Wed Sep 19, 2018 11:22 am America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by marcsandoval » Mon Jan 14, 2019 9:59 am America/New_York

Hello,

I'm writing because last week my script worked fine (although kind of slow) and now it takes very long to connect (but I can access the web page with my browser). I sent an email to connection_problems@oceancolor.sci.gsfc.nasa.gov with all the information, and Christopher J Moellers told me that xx.xxx.xx.xx and xx.xxx.xx.xx are not blocked and that they are seeing active traffic from xx.xxx.xx.xx as of 2019-01-14 09:05:58. My script, as I wrote before, is:

year=2009
month=01
day=01

url=http://oceancolor.gsfc.nasa.gov/cgi/browse.pl
n=-31.99
s=-35.54
e=-76.91
w=-82.65
dat=`echo $(($(date -u +%s --date "$year-$month-$day")/86400))`
time=`date -d "UTC 1970-01-01 $dat days" -u --utc`

# CHLA
  echo "************CHL on $time between $s and $n N, and between $w and $e E"
  echo  
  query="?sub=level1or2list&sen=am&per=DAY&day=$dat&n=$n&s=$s&w=$w&e=$e&dnm=D"
  wget -qO - \
  $url$query \
  | perl -n -0777 \
         -e 'if(/filenamelist&id=(\d+\.\d+)/){' \
         -e 'print `wget "'$url'?sub=filenamelist&id=$1&prm=CHL" -qO -`;' \
         -e '}' \
         -e 'elsif(/(A\d+\.L2_LAC_OC\.nc)/){' \
         -e 'print "$1\n";' \
         -e '}' > temporal.txt
   
  while read filename; do
    echo
    echo "working on $filename"
    echo

    if [ -f /media/marco/100246E00246CA84/Imagenes_satelitales/1km/CHL-SST-ModisAqua/JF/imagenes/$year/$filename ]
    then
      echo
      echo "the file $filename already exists"
      echo
    else
      echo
      echo "there is no $filename file, we are going to download it"
      echo
      echo $filename | wget -B https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/ --content-disposition -i -
      echo
    fi
  done <temporal.txt
  rm temporal.txt

####### and kind of the same for SST ##########

and it gets frozen at the first wget and does nothing.
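In case it helps narrow things down, I am also trying that first call with explicit timeout and retry flags so that it fails fast instead of hanging (a sketch; the values are guesses, assuming GNU wget):

wget --timeout=30 --tries=3 -qO - $url$query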

Regards!

pfsmith
Posts: 16
Joined: Wed Dec 29, 2004 2:11 pm America/New_York
Answers: 0

Crafting URLs to download data? the page automatically blocks me

by pfsmith » Mon Jan 14, 2019 10:14 am America/New_York

Marco,
Chris checked into this and did not find any blocks in place at our site. He did see traffic from xx.xxx.xx.xx today.
