HLS Data Processing with rioxarray: parallel reading and cookie questions

Use this Forum to find information on, or ask a question about, NASA Earth Science data.
parevalo
Posts: 3
Joined: Tue Nov 26, 2024 4:07 pm America/New_York
Answers: 0

HLS Data Processing with rioxarray: parallel reading and cookie questions

by parevalo » Tue Nov 26, 2024 5:02 pm America/New_York

TLDR: is it OK to use rioxarray's parallel read option (i.e., lock=False) to read HLS COGs from the cloud?

I'm writing a Python script that aims to do the following:

- Retrieve the HLS HTTP granule links for a given tile ID and time range, using earthaccess
- Build an xarray dataset and subset the data to 8 small chips (366x366 pixels) per HLS tile ID footprint
- Compute a median composite per month, and save each monthly median per chip to a GeoTIFF file.

The central piece of this process is building each of the datasets using rioxarray, which I am doing as shown in this gist: https://gist.github.com/parevalo/a23f0fe98abde52f90cfaea337a806c0

I put together that code based on the HLS tutorial notebook (https://github.com/nasa/HLS-Data-Resources/blob/main/python/tutorials/HLS_Tutorial.ipynb) and the COG best-practices notebook (https://github.com/pangeo-data/cog-best-practices/blob/main/2-dask-localcluster.ipynb).

One thing I have noticed is that if the lock argument in the rioxarray.open_rasterio() function is set to False, building the dataset is significantly faster because the files are read in parallel. This makes a major difference when building a dataset from many granules. Since I want to apply this code over many HLS tiles in North America for multiple months, gaining this efficiency without having to download all of the data first would be great. With that in mind, I have the following questions:

1. Is it safe/sensible/recommended to use that option? I don't see it in the HLS notebook, which makes me wonder if it's discouraged. If that's the case, what is the exact reason? I couldn't find any other examples online that use this option, and it's not clear to me why.
2. If using that option is OK, do you have any information on how it may affect the writing and reading of the cookies created by libcurl? I'm asking because when the lock argument is unset there's often a single cookie in the file, but when it is set to False, the cookie file is constantly being written and overwritten, sometimes with no content inside, and it is not clear to me whether the cookies are being used properly to avoid repeated authentications. Setting GDAL's CPL_CURL_VERBOSE='ON' didn't help me answer this question fully myself.

I am trying to avoid a scenario where I accidentally request data too aggressively, or one where authentication by cookies doesn't work as expected, resulting in repeated authorizations that would slow down the system. Any guidance would be greatly appreciated!

I have run the code on my laptop (Ubuntu 20.04.6 LTS) and on a Linux HPC running AlmaLinux 8.10, with rioxarray 0.17 and rasterio 1.4.1.


LP DAACx - dgolon
User Services
Posts: 422
Joined: Mon Sep 30, 2019 10:00 am America/New_York
Answers: 0
Has thanked: 31 times
Been thanked: 8 times

Re: HLS Data Processing with rioxarray: parallel reading and cookie questions

by LP DAACx - dgolon » Mon Dec 02, 2024 10:11 am America/New_York

Hi @parevalo Thanks for writing in. One of our developers is taking a look at your question. We will follow up when we have an answer.
Subscribe to the LP DAAC listserv by sending a blank email to lpdaac-join@lists.nasa.gov.

Sign up for the Landsat listserv to receive the most up to date information about Landsat data: https://public.govdelivery.com/accounts/USDOIGS/subscriber/new#tab1.

betolink
Posts: 2
Joined: Thu May 27, 2021 2:52 pm America/New_York
Answers: 0

Re: HLS Data Processing with rioxarray: parallel reading and cookie questions

by betolink » Tue Dec 03, 2024 11:39 am America/New_York

Hi @parevalo!

I think your code is correct, and according to https://github.com/corteva/rioxarray/issues/214, `lock=True` is not a valid option. Their documentation also mentions that using locks benefits from caching if the reads are done multiple times: https://corteva.github.io/rioxarray/stable/examples/read-locks.html

Hope this helps!
Luis.

parevalo
Posts: 3
Joined: Tue Nov 26, 2024 4:07 pm America/New_York
Answers: 0

Re: HLS Data Processing with rioxarray: parallel reading and cookie questions

by parevalo » Thu Dec 05, 2024 10:23 am America/New_York

The question is not whether the code is "correct", but whether the cookies are being written to and read from properly when reading the files in parallel this way (files are not read in parallel if lock=None, but the code is significantly slower that way). I'm trying to avoid issues with the repeated authorizations that would result if the cookie file misbehaves. My assumption is that everything is working as intended, since reading the files with stackstac (a much better alternative) shows the same cookie behavior, but I just wanted to make sure, because I intend to process thousands of files this way.

LP - erik.bolch
Subject Matter Expert
Posts: 3
Joined: Mon Jun 13, 2022 10:26 am America/New_York
Answers: 2
Been thanked: 2 times

Re: HLS Data Processing with rioxarray: parallel reading and cookie questions

by LP - erik.bolch » Fri Dec 06, 2024 4:11 pm America/New_York

Hi @parevalo,

The example HLS tutorial we created doesn't utilize the `lock=False` option but using it should be fine. I'm not sure how exactly the cookie is handled, but you could also try the `GDAL_HTTP_AUTH='BEARER'` configuration option instead of the cookies.

Some other things that may be helpful:
  • You may want to change the chunk size from 512x512 to 3600x3600 (full scene) depending on how large your region of interest is. If you need all the data from a scene, using the full scene size rather than the internal tile chunk size will improve performance.
  • If you want to stack HLS Landsat and HLS Sentinel scenes, you'll have to handle the difference in quantity and naming convention of the bands.
  • There's a known issue with some HLS Landsat scenes, where the scale_factor is in the general file metadata rather than the band-level metadata. Because of this, occasionally some scenes may not scale properly with `mask_and_scale=True` in `rioxarray`.
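A hedged workaround sketch for that last point. The 0.0001 scale factor for reflectance bands is documented for HLS, but the `> 1.5` threshold below is my own heuristic (an assumption) for detecting data that were left unscaled:

```python
# Hedged sketch: if mask_and_scale missed the scale_factor for an HLS Landsat
# band, apply the documented HLS reflectance scale factor (0.0001) manually.
# The > 1.5 threshold is a heuristic: unscaled reflectance is roughly 0-10000.
import xarray as xr

HLS_SCALE = 0.0001  # documented HLS reflectance scale factor

def ensure_scaled(band: xr.DataArray) -> xr.DataArray:
    """Scale a reflectance band to 0-1 if it still looks like raw integers."""
    if float(band.max()) > 1.5:
        band = band * HLS_SCALE
    return band
```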
I made a short Jupyter notebook showing how I would approach this. If you find something more performant, please share.

Hope this helps,

Erik

parevalo
Posts: 3
Joined: Tue Nov 26, 2024 4:07 pm America/New_York
Answers: 0

Re: HLS Data Processing with rioxarray: parallel reading and cookie questions

by parevalo » Fri Dec 06, 2024 4:41 pm America/New_York

Thank you so much, Erik, this is super helpful! I didn't know about the bearer token authentication, so I will give it a try.

For what it's worth:
- I ended up using CMR-CLOUDSTAC along with stackstac, which is working very well and very fast.
- I'm only retrieving 10 'chips' of 366x366 pixels for each HLS tile footprint, so I figured a chunk size of 512 made sense.
- I am definitely taking care of the band naming convention; no problem there. The code I shared was just a small fraction of my entire script.

Thanks a LOT for the script, especially the function to pass the rio env to the dask workers!
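For others who land here: one common pattern for that last piece, sketched under my own assumptions (this is not the actual function from Erik's notebook), is a dask `WorkerPlugin` that sets GDAL environment variables on every worker, including workers that join the cluster later:

```python
# Hedged sketch (not the notebook's actual code): propagate GDAL configuration
# to every dask worker via a WorkerPlugin whose setup() runs on each worker.
import os
from dask.distributed import WorkerPlugin

class GDALEnvPlugin(WorkerPlugin):
    """Set GDAL environment variables when each worker starts."""
    def __init__(self, env: dict):
        self.env = dict(env)

    def setup(self, worker):
        os.environ.update(self.env)  # runs once in each worker process

GDAL_OPTS = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "GDAL_HTTP_COOKIEFILE": "/tmp/gdal_cookies.txt",  # assumed cookie jar path
    "GDAL_HTTP_COOKIEJAR": "/tmp/gdal_cookies.txt",
}

# Usage (illustrative):
# from dask.distributed import Client
# client = Client()
# client.register_plugin(GDALEnvPlugin(GDAL_OPTS))
```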

Paulo
