
Conversation

@jbusecke (Collaborator) commented Oct 8, 2025

Replacing #96 since we cannot deploy to sandbox from a fork.

Depends on #95 and developmentseed/titiler#1235

Tasks

  • Updated zarr to v3
  • Modified the test fixture scripts to write v2 and v3 zarr stores explicitly (I retriggered them and committed the new fixtures)
  • Add scripts/fixtures for native icechunk
  • Deploy and test on real-world native icechunk data
  • Add scripts/fixtures for virtual icechunk
  • Deploy and test on real-world virtual icechunk data

@jbusecke (Collaborator, Author) commented Oct 8, 2025

Ok, I made good progress today. The current changes (for now depending on developmentseed/titiler@f08c0ce) are working with some real-world examples:

  • MURSST from a virtual icechunk store (performance seems on par with a native icechunk store of the same data)
  • NLDAS virtual icechunk store

The container authentication is currently mocked up inside the xarray_open_dataset function for testing; we discussed that it would be best to pass this information as runtime options to titiler-multidim.

At this point I would love to get some feedback on how to proceed.

Should the format of the input stay the same?
I am currently using a dictionary that maps each container URL to a dict of icechunk.s3_storage kwargs:

settings = {
    "authorized_chunk_access": {
        "s3://nasa-waterinsight/NLDAS3/forcing/daily/": {"anonymous": True},
        "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/": {"from_env": True},
    },
}

Happy to change this though!

Where to implement the icechunk opening logic
As far as I can tell, there are a few ways to pass this information cleanly:

  • Add an additional input argument to titiler.xarray.io.xarray_open_dataset to pass the container auth info from the titiler-multidim reader module. This requires the fewest code changes, but we would have to wait for a PR merge upstream.
  • Move all the opening logic to titiler-multidim. We could either redefine xarray_open_dataset or introduce something like xarray_open_icechunk, which we would use instead of xarray_open_dataset and pass as the opener to the XarrayReader (see the sketch below). This might be faster, but I feel it might fragment the code.
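
To make the second option a bit more concrete, here is a rough sketch of what such an opener could look like, assuming the repository lives on S3 and is addressed by a bucket/prefix URL; the function name, the authorized_chunk_access parameter, and the credential wiring are illustrative, not a final design:

# Rough sketch (names are placeholders, not a final API): a dedicated opener in
# titiler-multidim that is passed to XarrayReader as `opener`.
from typing import Any, Dict, Optional
from urllib.parse import urlparse

import icechunk
import xarray as xr


def xarray_open_icechunk(
    src_path: str,
    group: Optional[str] = None,
    decode_times: bool = True,
    authorized_chunk_access: Optional[Dict[str, Dict[str, Any]]] = None,
) -> xr.Dataset:
    """Open an icechunk repository on S3 as an xarray Dataset."""
    parsed = urlparse(src_path)
    storage = icechunk.s3_storage(
        bucket=parsed.netloc,
        prefix=parsed.path.lstrip("/"),
        from_env=True,
    )
    # Virtual chunk credentials built from `authorized_chunk_access` would be
    # handed to Repository.open here; the exact keyword depends on the icechunk
    # version, so it is left out of this sketch.
    repo = icechunk.Repository.open(storage)
    session = repo.readonly_session(branch="main")
    return xr.open_zarr(
        session.store,
        group=group,
        decode_times=decode_times,
        consolidated=False,
    )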

Curious what others think? cc @hrodmn @abarciauskas-bgse @sharkinsspatial @maxrjones

@jbusecke (Collaborator, Author) commented:

More detailed follow-up question: I am currently passing all container auth info from the settings without validating that the virtual chunk containers actually exist in the store config. It might be more robust to prune the input to only those containers that exist in the store config. Happy to change that.
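
Something like this, purely illustrative (how the set of configured containers is obtained from the repository config is left open here):

from typing import Any, Dict, Set


def prune_container_auth(
    auth: Dict[str, Dict[str, Any]], configured_containers: Set[str]
) -> Dict[str, Dict[str, Any]]:
    """Keep only auth entries whose URL prefix is a configured virtual chunk container."""
    return {url: opts for url, opts in auth.items() if url in configured_containers}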

@maxrjones (Member) commented:

I think that supporting a configuration file would be the cleanest way to authorize virtual chunk containers on a per-application basis. Two common libraries for supporting configuration files are donfig and pydantic. People typically use donfig if they want a super lightweight dependency, at the expense of limited typing/validation support. Since titiler already requires pydantic, pydantic makes the most sense for handling the configuration. I think using pydantic would change both the format of the input, since you would define a pydantic model, and the opening logic, since you wouldn't pass through kwargs.
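
For illustration, a per-container credential model could look something like this (class and field names are placeholders, not a proposed schema):

from typing import Dict, Optional

from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class VirtualChunkContainerAuth(BaseModel):
    """Credential options for one virtual chunk container (an S3 URL prefix)."""

    anonymous: bool = False
    from_env: bool = False
    region: Optional[str] = None


class IcechunkSettings(BaseSettings):
    """Per-application icechunk settings, loadable from env vars or a config file."""

    model_config = SettingsConfigDict(
        env_prefix="TITILER_MULTIDIM_", env_nested_delimiter="__"
    )

    # Maps a container URL prefix to its credential options.
    authorized_chunk_access: Dict[str, VirtualChunkContainerAuth] = {}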

@jbusecke (Collaborator, Author) commented:

> and the opening logic since you wouldn't pass through kwargs.

Just want to make sure I understand you correctly. I thought that I would integrate the container auth info into the ApiSettings (which are defined here). This would still require me to pass them as kwargs to xarray_open_dataset here and here, something like this:

...
with xarray_open_dataset(
    ...,
    container_auth=settings.container_auth,
) as ds:

I might also not fully appreciate some of the details of how (and why) this is implemented the way it is. For example, why is xarray_open_dataset imported here and not just reused from the class opener attribute (which defaults to xarray_open_dataset)?

    @classmethod
    def list_variables(
        cls,
        src_path: str,
        group: Optional[Any] = None,
        decode_times: bool = True,
    ) -> List[str]:
        """List available variables in a dataset."""
        with cls.opener(  # why not like this?
            src_path,
            group=group,
            decode_times=decode_times,
        ) as ds:
            return list(ds.data_vars)  # type: ignore

Wondering if a short sync on this (maybe tomorrow?) might be useful.

@abarciauskas-bgse (Contributor) commented:

I'm down to sync on this today or Friday, but happy to hear the outcome if y'all jam on Thursday.

@maxrjones (Member) commented:

> Just want to make sure I understand you correctly. [...]
> Wondering if a short sync on this (maybe tomorrow?) might be useful.

I'd be down to sync on this tomorrow. I don't understand why you need the container_auth parameter. It seems simpler and sufficient to use the pydantic settings (serialized as TOML or JSON) directly in xarray_open_dataset in https://github.com/jbusecke/titiler/blob/0c7ce7f017334369ad3f2fd9e70beb309c23a32e/src/titiler/xarray/titiler/xarray/io.py#L92-L97.
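
Roughly along these lines; the IcechunkSettings model and the import path are the hypothetical ones sketched above, and the cached loader mirrors the settings pattern titiler already uses:

from functools import lru_cache
from typing import Any, Dict

from titiler_multidim.settings import IcechunkSettings  # hypothetical module/model from above


@lru_cache
def get_icechunk_settings() -> IcechunkSettings:
    """Load and cache the application settings (env vars, .env, or serialized file)."""
    return IcechunkSettings()


def lookup_container_auth(src_path: str) -> Dict[str, Any]:
    """Return credential options for the container prefix matching src_path, if any."""
    settings = get_icechunk_settings()
    for prefix, auth in settings.authorized_chunk_access.items():
        if src_path.startswith(prefix):
            return auth.model_dump()
    return {}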

Comment on lines +1 to +60
"""Create icechunk fixtures (native and later virtual)."""
# TODO: these files could also be generated together with the zarr files using the same data

import numpy as np
import xarray as xr
import icechunk as ic

# Define dimensions and chunk sizes
res = 5
time_dim = 10
lat_dim = 36
lon_dim = 72
chunk_size = {"time": 10, "lat": 10, "lon": 10}

# Create coordinates
time = np.arange(time_dim)
lat = np.linspace(-90.0 + res / 2, 90.0 - res / 2, lat_dim)
lon = np.linspace(-180.0 + res / 2, 180.0 - res / 2, lon_dim)

dtype = np.float64
# Initialize variables with random data
CDD0 = xr.DataArray(
    np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype),
    dims=("time", "lat", "lon"),
    name="CDD0",
)
DISPH = xr.DataArray(
    np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype),
    dims=("time", "lat", "lon"),
    name="DISPH",
)
FROST_DAYS = xr.DataArray(
    np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype),
    dims=("time", "lat", "lon"),
    name="FROST_DAYS",
)
GWETPROF = xr.DataArray(
    np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype),
    dims=("time", "lat", "lon"),
    name="GWETPROF",
)

# Create dataset
ds = xr.Dataset(
    {
        "CDD0": CDD0.chunk(chunk_size),
        "DISPH": DISPH.chunk(chunk_size),
        "FROST_DAYS": FROST_DAYS.chunk(chunk_size),
        "GWETPROF": GWETPROF.chunk(chunk_size),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)
storage = ic.local_filesystem_storage("tests/fixtures/icechunk_native")
config = ic.RepositoryConfig.default()
repo = ic.Repository.create(storage=storage, config=config)
session = repo.writable_session("main")
store = session.store

ds.to_zarr(store, consolidated=False)
session.commit("Add initial data")

Review comment (Member):
IMO it'd be nicer to generate the test stores at runtime, if they do not already exist on the user's machine, rather than storing the fixtures in the repository. It keeps the git history cleaner (no hundreds of committed files) while only taking <1s of runtime.
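
A rough sketch of what that could look like as a pytest fixture, trimmed to a single variable for brevity (the fixture name and data shapes are illustrative):

import icechunk as ic
import numpy as np
import pytest
import xarray as xr


@pytest.fixture(scope="session")
def icechunk_native_store(tmp_path_factory):
    """Build a small native icechunk store in a temp dir instead of committing fixtures."""
    path = tmp_path_factory.mktemp("fixtures") / "icechunk_native"
    repo = ic.Repository.create(storage=ic.local_filesystem_storage(str(path)))
    session = repo.writable_session("main")
    ds = xr.Dataset(
        {"CDD0": (("time", "lat", "lon"), np.random.rand(10, 36, 72))},
        coords={
            "time": np.arange(10),
            "lat": np.linspace(-87.5, 87.5, 36),
            "lon": np.linspace(-177.5, 177.5, 72),
        },
    )
    ds.chunk({"time": 10, "lat": 10, "lon": 10}).to_zarr(session.store, consolidated=False)
    session.commit("Add test data")
    return str(path)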

Reply (Collaborator Author):

Good call. But then we should probably do that for all test data? In this PR I mostly tried to adhere to the existing structure. I still think you are right, but changing this would require quite a bit more work, since we would not only have to generate the data at runtime but also the expected-result JSONs! I think that is better handled in a separate PR?
