xz compressed, not bzip2

Posted: Mon Aug 05, 2019 1:42 pm America/New_York
by oo_processing
Two of the files I downloaded from 2018 have the wrong compression type: they are named .bz2 but contain xz-compressed data.
This is one of them (and how I 'fixed' them):
[seadas_l1a_geo_extract_h5]$ file /cms_zfs/sat_data/modis/l0/2018/070/MOD00.P2018070.0140_1.PDS.bz2
/cms_zfs/sat_data/modis/l0/2018/070/MOD00.P2018070.0140_1.PDS.bz2: xz compressed data
[070]$ mv MOD00.P2018070.0140_1.PDS.bz2 MOD00.P2018070.0140_1.PDS.xz
[070]$ xz --decompress MOD00.P2018070.0140_1.PDS.xz
[070]$ ll MOD00.P2018070.0140_1.PDS
-rw-rw-r-- 1 bmurch cms_optics 396889536 Jul 30 01:30 MOD00.P2018070.0140_1.PDS
[070]$ bzip2 MOD00.P2018070.0140_1.PDS
[070]$ file /cms_zfs/sat_data/modis/l0/2018/070/MOD00.P2018070.0140_1.PDS.bz2
/cms_zfs/sat_data/modis/l0/2018/070/MOD00.P2018070.0140_1.PDS.bz2: bzip2 compressed data, block size = 900k

Posted: Mon Aug 05, 2019 1:53 pm America/New_York
by OB ODPS - towens
Our data provider changed to using xz compression for their long-term storage.
Our ingest code was not expecting it when we replaced some corrupted files with new copies.
I'll fix these on the server.

Thanks,
Tommy

Posted: Mon Aug 05, 2019 1:57 pm America/New_York
by OB ODPS - towens
I just checked the server; the file has the correct xz extension: MOD00.P2018070.0140_1.PDS.xz
Is your code renaming it to bz2?

Tommy

Posted: Mon Aug 05, 2019 2:37 pm America/New_York
by oo_processing
Tommy,

I use the L1/2 browser to generate an L0 list. I then save it to a file, append .bz2 to the names, and normally download them like this, where x00 is the list:

time curl --interface 2607:fe50:0:6330::100 --retry 5 --retry-delay 2 --max-time 0 --remote-name-all https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/{$(sed ':a;N;$!ba;s/\n/,/g' /cms_zfs/work_orders/modis/PDS/2018/x00)}

However, I just noticed this:

[bin]$ time curl --interface 2607:fe50:0:6330::100 --retry 5 --retry-delay 2 --max-time 0 --remote-name-all https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/MOD00.A2000364.1045_1.PDS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  289M  100  289M    0     0  20.7M      0  0:00:13  0:00:13 --:--:-- 21.8M

real    0m13.955s
user    0m0.362s
sys     0m0.258s
[bin]$ ll MOD00.A2000364.1045_1.PDS
-rw-rw-r-- 1 bmurch bmurch 303855723 Aug  5 14:23 MOD00.A2000364.1045_1.PDS
[bin]$ file MOD00.A2000364.1045_1.PDS
MOD00.A2000364.1045_1.PDS: bzip2 compressed data, block size = 900k
[bin]$ time curl --interface 2607:fe50:0:6330::100 --retry 5 --retry-delay 2 --max-time 0 --remote-name-all https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/MOD00.A2000364.1045_1.PDS.bz2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  289M  100  289M    0     0  21.3M      0  0:00:13  0:00:13 --:--:-- 21.8M

real    0m13.591s
user    0m0.330s
sys     0m0.254s
[bin]$ file MOD00.A2000364.1045_1.PDS*
MOD00.A2000364.1045_1.PDS:     bzip2 compressed data, block size = 900k
MOD00.A2000364.1045_1.PDS.bz2: bzip2 compressed data, block size = 900k
[bin]$ diff MOD00.A2000364.1045_1.PDS MOD00.A2000364.1045_1.PDS.bz2

[bin]$

So it appears that the same file is returned with or without the .bz2 extension in the cases above,
but not with the .xz extension:

time curl --interface 2607:fe50:0:6330::100 --retry 5 --retry-delay 2 --max-time 0 --remote-name-all https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/MOD00.A2000364.1045_1.PDS.xz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed 
110   665  110   665    0     0    119      0  0:00:05  0:00:05 --:--:--   197

real    0m5.581s
user    0m0.051s
sys     0m0.051s

[bin]$ cat MOD00.A2000364.1045_1.PDS.xz
<!DOCTYPE html><html lang="en-US"><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta name="ROBOTS" content="NOARCHIVE"><title>ERROR @ OceanColor Biology Processing Group (OBPG)</title></head><body link=#323232 vlink=#323232 alink=#323232 style="background-color:#ffffff; color:#323232; font-size:175%"><br><hr color=#323232><center><h1><b>.:. ERROR .:.</b></h1><h2>OceanColor Biology Processing Group (OBPG)</h2><blockquote>Sorry, an error has occurred. Use the back button to return to the previous page or go to the <a href="https://oceancolor.gsfc.nasa.gov">Ocean Color Home Page</a>.</blockquote><br><hr color= #323232></body></html>

So do you suggest that I test every downloaded file (with the file command) and determine the type of compression from that?
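
A check like that could also be done without the file command by looking at the first few bytes of each download. The sketch below is only an illustration: detect_compression is a made-up name, and it assumes head -c and od are available. bzip2 streams start with the bytes "BZh" and xz streams with FD 37 7A 58 5A 00; anything else (such as the HTML error page above) is reported as unknown.

```shell
# Hypothetical helper: report the compression type from a file's leading bytes.
detect_compression() {
    # Hex dump of the first 6 bytes, with spaces and newlines removed.
    magic=$(head -c 6 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        fd377a585a00) echo xz ;;       # xz magic: FD 37 7A 58 5A 00
        425a68*)      echo bzip2 ;;    # bzip2 magic: "BZh"
        *)            echo unknown ;;  # e.g. an HTML error page
    esac
}
```

Called on a file such as the mislabeled MOD00.P2018070.0140_1.PDS.bz2, this should report xz regardless of the name on disk.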

Posted: Mon Aug 05, 2019 6:22 pm America/New_York
by OB.DAAC - SeanBailey
I suggest you don't append the .bz2. The file search is based on the uncompressed filename, which is why it pulls down the .bz2 file even if you don't append the extension.
.xz is not one of the compression extensions (currently) recognized by the script, so it doesn't know to strip it off when doing the lookup, and so doesn't find the file. If you don't go to the effort to guess the extension, you won't have to, well, guess the extension :grin:

Let cURL assign the filename from the Content-Disposition header:

$ curl -O -J https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/MOD00.A2000364.1045_1.PDS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  289M  100  289M    0     0  6980k      0  0:00:42  0:00:42 --:--:-- 7370k
curl: Saved to filename 'MOD00.A2000364.1045_1.PDS.bz2'

$ curl -O -J https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/MOD00.P2018070.0140_1.PDS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  277M  100  277M    0     0  8695k      0  0:00:32  0:00:32 --:--:-- 8933k
curl: Saved to filename 'MOD00.P2018070.0140_1.PDS.xz'


Sean
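
For a whole list like x00, the same idea might look like the sketch below. It is an illustration only, not part of the getfile service: strip_known_ext and fetch_list are hypothetical names, and the list is assumed to hold one filename per line. Any .bz2 or .xz suffix already in the list is dropped before the request, so the name from the Content-Disposition header decides the extension.

```shell
base_url="https://oceandata.sci.gsfc.nasa.gov/cgi/getfile"

# Hypothetical helper: drop a guessed compression suffix, if any.
strip_known_ext() {
    case "$1" in
        *.bz2) printf '%s\n' "${1%.bz2}" ;;
        *.xz)  printf '%s\n' "${1%.xz}" ;;
        *)     printf '%s\n' "$1" ;;
    esac
}

# Hypothetical helper: fetch every name listed in the file given as $1.
fetch_list() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue   # skip blank lines
        # -O -J saves under the server-supplied filename.
        curl --retry 5 --retry-delay 2 -O -J "$base_url/$(strip_known_ext "$name")"
    done < "$1"
}
```

It would be run as: fetch_list /cms_zfs/work_orders/modis/PDS/2018/x00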

Posted: Wed Aug 07, 2019 2:35 pm America/New_York
by oo_processing
But, I will have to decide what to do with the downloaded file.
So do I bunzip2 it? Or unxz?
I guess test to ensure it is a bzip2 file?

Brock

Posted: Wed Aug 07, 2019 3:55 pm America/New_York
by OB.DAAC - SeanBailey

> But, I will have to decide what to do with the downloaded file.


Yes, you will, but the file extension should clue you in as to which decompression utility to use.
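
A minimal sketch of that dispatch, assuming only the two extensions seen in this thread (decompressor_for is a hypothetical name, not an existing tool):

```shell
# Hypothetical helper: name the decompression utility for a given filename.
decompressor_for() {
    case "$1" in
        *.bz2) echo bunzip2 ;;
        *.xz)  echo unxz ;;
        *)     return 1 ;;   # unrecognized extension: let the caller decide
    esac
}
```

A caller could then run something like: tool=$(decompressor_for "$f") && "$tool" "$f"
Files with any other extension are left untouched, since the helper returns a non-zero status for them.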