
Conversation

@jbusecke (Collaborator) commented Oct 8, 2025

Replacing #96 since we cannot deploy to sandbox from a fork.

Depends on #95 and developmentseed/titiler#1235

Tasks

  • Updated zarr to v3
  • Modified the test fixture scripts to write v2 and v3 zarr stores explicitly (I retriggered them and committed the new fixtures)
  • Add scripts/fixtures for native icechunk
  • Deploy and test on real-world native icechunk data
  • Add scripts/fixtures for virtual icechunk
  • Deploy and test on real-world virtual icechunk data

@jbusecke (Collaborator, Author) commented Oct 8, 2025

Ok, I made good progress today. The current changes (for now depending on developmentseed/titiler@f08c0ce) are working with some real-world examples:

  • MURSST from a virtual icechunk store (performance seems on par with a native icechunk store of the same data)
  • NLDAS virtual icechunk store

The container authentication is currently mocked up inside the xarray_open_dataset function for testing; we discussed that it would be best to pass this information as runtime options to titiler-multidim.

At this point I would love to get some feedback on how to proceed.

Should the format of the input stay the same?
I am currently using a dictionary that maps each container URL to a dict of icechunk.s3_storage kwargs:

settings = {
    "authorized_chunk_access": {
        "s3://nasa-waterinsight/NLDAS3/forcing/daily/": {"anonymous": True},
        "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/": {"from_env": True},
    },
}

Happy to change this though!

Where to implement the icechunk opening logic
As far as I can tell, there are a few ways to pass this information cleanly:

  • Add an additional input argument to titiler.xarray.io.xarray_open_dataset to pass the container auth info from the titiler-multidim reader module. This requires the fewest code changes, but we would have to wait for a PR merge upstream.
  • Move all the opening logic to titiler-multidim. We could either redefine xarray_open_dataset or introduce something like xarray_open_icechunk, which we would use instead of xarray_open_dataset and pass as the opener to the XarrayReader (see the sketch below). This might be faster, but I feel it might fragment the code.
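
To make the second option a bit more concrete, here is a rough sketch of what such an opener could look like, assuming the repository lives on S3 and is addressed by a bucket/prefix URL; the function name, the authorized_chunk_access parameter, and the credential wiring are illustrative, not a final design:

# Rough sketch (names are placeholders, not a final API): a dedicated opener in
# titiler-multidim that is passed to XarrayReader as `opener`.
from typing import Any, Dict, Optional
from urllib.parse import urlparse

import icechunk
import xarray as xr


def xarray_open_icechunk(
    src_path: str,
    group: Optional[str] = None,
    decode_times: bool = True,
    authorized_chunk_access: Optional[Dict[str, Dict[str, Any]]] = None,
) -> xr.Dataset:
    """Open an icechunk repository on S3 as an xarray Dataset."""
    parsed = urlparse(src_path)
    storage = icechunk.s3_storage(
        bucket=parsed.netloc,
        prefix=parsed.path.lstrip("/"),
        from_env=True,
    )
    # Virtual chunk credentials built from `authorized_chunk_access` would be
    # handed to Repository.open here; the exact keyword depends on the icechunk
    # version, so it is left out of this sketch.
    repo = icechunk.Repository.open(storage)
    session = repo.readonly_session(branch="main")
    return xr.open_zarr(
        session.store,
        group=group,
        decode_times=decode_times,
        consolidated=False,
    )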

Curious what others think? cc @hrodmn @abarciauskas-bgse @sharkinsspatial @maxrjones

@jbusecke (Collaborator, Author) commented:

More detailed follow-up question: I am currently passing all container auth info from the settings without validating that the virtual chunk containers actually exist in the store config. It might be more robust to prune the input to only those containers that exist in the store config. Happy to change that.
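
Something like this, purely illustrative (how the set of configured containers is obtained from the repository config is left open here):

from typing import Any, Dict, Set


def prune_container_auth(
    auth: Dict[str, Dict[str, Any]], configured_containers: Set[str]
) -> Dict[str, Dict[str, Any]]:
    """Keep only auth entries whose URL prefix is a configured virtual chunk container."""
    return {url: opts for url, opts in auth.items() if url in configured_containers}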

@maxrjones (Member) commented:

I think that supporting a configuration file would be the cleanest way to authorize virtual chunk containers on a per-application basis. Two common libraries for supporting configuration files are donfig and pydantic. People typically use donfig if they want a super lightweight dependency, at the expense of limited typing/validation support. Since titiler already requires pydantic, pydantic makes the most sense for handling the configuration. I think using pydantic would change both the format of the input, since you would define a pydantic model, and the opening logic, since you wouldn't pass through kwargs.
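
For illustration, a per-container credential model could look something like this (class and field names are placeholders, not a proposed schema):

from typing import Dict, Optional

from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class VirtualChunkContainerAuth(BaseModel):
    """Credential options for one virtual chunk container (an S3 URL prefix)."""

    anonymous: bool = False
    from_env: bool = False
    region: Optional[str] = None


class IcechunkSettings(BaseSettings):
    """Per-application icechunk settings, loadable from env vars or a config file."""

    model_config = SettingsConfigDict(
        env_prefix="TITILER_MULTIDIM_", env_nested_delimiter="__"
    )

    # Maps a container URL prefix to its credential options.
    authorized_chunk_access: Dict[str, VirtualChunkContainerAuth] = {}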

@jbusecke (Collaborator, Author) commented:

> and the opening logic since you wouldn't pass through kwargs.

Just want to make sure I understand you correctly. I thought that I would integrate the container auth info into the ApiSettings (which are defined here). This would still require me to pass them as kwargs to xarray_open_dataset here and here, something like this:

...
with xarray_open_dataset(
    ...,
    container_auth=settings.container_auth,
) as ds:

I might also not fully appreciate some of the details of how (and why) this is implemented the way it is. For example, why is xarray_open_dataset imported here and not just reused from the class opener attribute (which defaults to xarray_open_dataset)?

    @classmethod
    def list_variables(
        cls,
        src_path: str,
        group: Optional[Any] = None,
        decode_times: bool = True,
    ) -> List[str]:
        """List available variables in a dataset."""
        with cls.opener(  # why not like this?
            src_path,
            group=group,
            decode_times=decode_times,
        ) as ds:
            return list(ds.data_vars)  # type: ignore

Wondering if a short sync on this (maybe tomorrow?) might be useful.

@abarciauskas-bgse (Contributor) commented:

I'm down to sync on this today or Friday, but happy to hear the outcome if y'all jam on Thursday.

@maxrjones (Member) commented:

> Just want to make sure I understand you correctly. [...]
> Wondering if a short sync on this (maybe tomorrow?) might be useful.

I'd be down to sync on this tomorrow. I don't understand why you need the container_auth parameter. It seems simpler and sufficient to use the pydantic settings (serialized as TOML or JSON) directly in xarray_open_dataset in https://github.com/jbusecke/titiler/blob/0c7ce7f017334369ad3f2fd9e70beb309c23a32e/src/titiler/xarray/titiler/xarray/io.py#L92-L97.
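
Roughly along these lines; the IcechunkSettings model and the import path are the hypothetical ones sketched above, and the cached loader mirrors the settings pattern titiler already uses:

from functools import lru_cache
from typing import Any, Dict

from titiler_multidim.settings import IcechunkSettings  # hypothetical module/model from above


@lru_cache
def get_icechunk_settings() -> IcechunkSettings:
    """Load and cache the application settings (env vars, .env, or serialized file)."""
    return IcechunkSettings()


def lookup_container_auth(src_path: str) -> Dict[str, Any]:
    """Return credential options for the container prefix matching src_path, if any."""
    settings = get_icechunk_settings()
    for prefix, auth in settings.authorized_chunk_access.items():
        if src_path.startswith(prefix):
            return auth.model_dump()
    return {}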

Comment on lines +1 to +60
"""Create icechunk fixtures (native and later virtual)."""
# TODO: these files could also be generated together with the zarr files using the same data

import numpy as np
import xarray as xr
import icechunk as ic

# Define dimensions and chunk sizes
res = 5
time_dim = 10
lat_dim = 36
lon_dim = 72
chunk_size = {"time": 10, "lat": 10, "lon": 10}

# Create coordinates
time = np.arange(time_dim)
lat = np.linspace(-90.0 + res / 2, 90.0 - res / 2, lat_dim)
lon = np.linspace(-180.0 + res / 2, 180.0 - res / 2, lon_dim)

dtype = np.float64
# Initialize variables with random data
CDD0 = xr.DataArray(
    np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype),
    dims=("time", "lat", "lon"),
    name="CDD0",
)
DISPH = xr.DataArray(
    np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype),
    dims=("time", "lat", "lon"),
    name="DISPH",
)
FROST_DAYS = xr.DataArray(
    np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype),
    dims=("time", "lat", "lon"),
    name="FROST_DAYS",
)
GWETPROF = xr.DataArray(
    np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype),
    dims=("time", "lat", "lon"),
    name="GWETPROF",
)

# Create dataset
ds = xr.Dataset(
    {
        "CDD0": CDD0.chunk(chunk_size),
        "DISPH": DISPH.chunk(chunk_size),
        "FROST_DAYS": FROST_DAYS.chunk(chunk_size),
        "GWETPROF": GWETPROF.chunk(chunk_size),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)
storage = ic.local_filesystem_storage("tests/fixtures/icechunk_native")
config = ic.RepositoryConfig.default()
repo = ic.Repository.create(storage=storage, config=config)
session = repo.writable_session("main")
store = session.store

ds.to_zarr(store, consolidated=False)
session.commit("Add initial data")

Review comment (Member):
IMO it'd be nicer to generate the test stores at runtime, if they do not already exist on the user's machine, rather than storing the fixtures in the repository. It keeps the git history cleaner (no hundreds of committed files) while only taking <1s of runtime.
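
A rough sketch of what that could look like as a pytest fixture, trimmed to a single variable for brevity (the fixture name and data shapes are illustrative):

import icechunk as ic
import numpy as np
import pytest
import xarray as xr


@pytest.fixture(scope="session")
def icechunk_native_store(tmp_path_factory):
    """Build a small native icechunk store in a temp dir instead of committing fixtures."""
    path = tmp_path_factory.mktemp("fixtures") / "icechunk_native"
    repo = ic.Repository.create(storage=ic.local_filesystem_storage(str(path)))
    session = repo.writable_session("main")
    ds = xr.Dataset(
        {"CDD0": (("time", "lat", "lon"), np.random.rand(10, 36, 72))},
        coords={
            "time": np.arange(10),
            "lat": np.linspace(-87.5, 87.5, 36),
            "lon": np.linspace(-177.5, 177.5, 72),
        },
    )
    ds.chunk({"time": 10, "lat": 10, "lon": 10}).to_zarr(session.store, consolidated=False)
    session.commit("Add test data")
    return str(path)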

Reply (Collaborator Author):

Good call. But then we should probably do that for all test data? In this PR I mostly tried to adhere to the existing structure. I still think you are right, but changing this would require quite a bit more work, since we would not only have to generate the data at runtime but also the expected-result JSONs! I think that is better handled in a separate PR?
