Python download script utf-8 UnicodeDecodeError VNP46A1

Posted: Wed Jun 07, 2023 6:44 am America/New_York
by jamieallen59
When using the Python download script from here:

To download a file, e.g.:

The script fails here:
> return result.decode('utf-8') if isinstance(result, bytes) else result

With the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

I am using Python 3.11.3. The underlying curl call (below) succeeds, but seems to return data that isn't decodable:
['curl', '--fail', '-sS', '-L', '-b session', '--get', '', '-H', 'user-agent: tis/download.py_1.0--3.11.3 (main, Apr 7 2023, 21:05:46) [Clang 14.0.0 (clang-1400.0.29.202)]', '-H', 'Authorization: Bearer XXX-TOKEN-XXX']

What I've tried:
- Checking the encoding with chardet, which reports 'Windows-1252'. Decoding with 'Windows-1252' then fails with:
"UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 5216: character maps to <undefined>"
This seems to imply that the result contains more than one encoding.
- Decoding with the less restrictive 'latin-1' encoding, but the result of this is simply 'None'.

I need to download these files for every day of the year for multiple years, so not having a suitable download script is currently slowing/blocking my research.

Re: Python download script utf-8 UnicodeDecodeError VNP46A1

Posted: Wed Jun 07, 2023 11:14 am America/New_York
by LAADS_UserServices_M
The issue is that HDF5 is a binary format, not a text format. No text encoding will decode it as text, which is what you are seeing. The script also reads CSV and JSON files from the website to get the directory listings, so it needs to handle both text and binary data. Unfortunately, encodings are difficult to detect correctly.
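
As a rough illustration of why none of the encodings tried above can work, here is a small sketch. It is not part of the download script: the helper name try_decode and the sample bytes are made up for this example. Every HDF5 file starts with the 8-byte signature \x89HDF\r\n\x1a\n, which is most likely where the 0x89 at position 0 in the original error comes from.

```python
# The 8-byte signature that opens every HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def try_decode(data: bytes, encoding: str) -> str:
    """Describe what happens when `data` is decoded with `encoding`.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode(encoding)
        return "ok"
    except UnicodeDecodeError as exc:
        return f"fails at byte 0x{data[exc.start]:02x}, position {exc.start}"


# 0x90 stands in for arbitrary binary payload after the signature.
sample = HDF5_SIGNATURE + b"\x90"

print(try_decode(sample, "utf-8"))    # fails at byte 0x89, position 0
print(try_decode(sample, "cp1252"))   # fails at byte 0x90, position 8
print(try_decode(sample, "latin-1"))  # ok
```

latin-1 "succeeds" only because it maps every possible byte to some character. The resulting string is not meaningful text, and writing it back out as text would not reproduce the original file.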

Maybe the best way to handle this is to check the filename: if it ends in .hdf, .h5, or .nc (the three data formats used by LAADS data providers), just return the result as a string of bytes, with no decoding.
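
A minimal sketch of that idea, assuming the filename (or URL) is available at the point where the script currently decodes the result. The names BINARY_SUFFIXES and decode_if_text are hypothetical and do not come from the actual script:

```python
# Extensions of the binary data formats mentioned above.
BINARY_SUFFIXES = (".hdf", ".h5", ".nc")


def decode_if_text(result, filename):
    """Return binary data files as raw bytes; decode everything else as UTF-8.

    Hypothetical replacement for the line
        return result.decode('utf-8') if isinstance(result, bytes) else result
    """
    if not isinstance(result, bytes):
        return result  # already a str (or None), nothing to decode
    if filename.lower().endswith(BINARY_SUFFIXES):
        return result  # binary science data: leave the bytes untouched
    return result.decode("utf-8")  # CSV/JSON directory listings
```

Whatever receives the bytes then needs to write them in binary mode, for example open(path, "wb"), rather than text mode, otherwise the same decoding problem reappears at write time.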