Support icechunk #101
Conversation
Co-authored-by: Henry Rodman <[email protected]>
Ok, I made good progress today. The current changes (for now depending on developmentseed/titiler@f08c0ce) are working with some real-world examples: MURSST from a virtual icechunk store (performance seems on par with a native icechunk store of the same data). The container authentication is currently mocked up inside the … At this point I would love to get some feedback on how to proceed:

- Should the format of the input stay the same? Happy to change this though!
- Where to implement the icechunk opening logic (see the sketch after this comment).

Curious what others think? cc @hrodmn @abarciauskas-bgse @sharkinsspatial @maxrjones
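For context, a minimal sketch of what that opening logic could look like, assuming icechunk's 1.x Python API: `containers_credentials`, `s3_credentials`, and the `authorize_virtual_chunk_access` argument follow the icechunk docs but are not taken from this PR, and the bucket/prefix names are invented:

```python
# Hedged sketch: open an icechunk repo and authorize its virtual chunk
# containers. API names follow the icechunk 1.x docs; locations are invented.
import icechunk as ic
import xarray as xr

storage = ic.s3_storage(bucket="example-bucket", prefix="virtual-store")

# Authorize only the containers declared in the repo's config, e.g. anonymous
# access to the bucket holding the referenced native files.
credentials = ic.containers_credentials(
    {"s3://example-source-bucket/": ic.s3_credentials(anonymous=True)}
)

repo = ic.Repository.open(
    storage=storage,
    authorize_virtual_chunk_access=credentials,
)
session = repo.readonly_session(branch="main")
ds = xr.open_zarr(session.store, consolidated=False)
```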
More detailed follow-up question: I am currently passing all container auth info from the settings without validating that the virtual chunk container actually exists in the store config. It might be more robust to prune the inputs to only the containers that exist in the store config. Happy to change that.
I think that supporting a configuration file would be the cleanest way to authorize virtual chunk containers on a per-application basis. Two common libraries for supporting configuration files are donfig and pydantic. People typically use donfig if they want a super lightweight dependency at the expense of limited typing/validation support. Since titiler already requires pydantic, pydantic makes the most sense for handling the configuration. I think using pydantic would change both the format of the input, since you would define a pydantic model, and the opening logic, since you wouldn't pass through kwargs.
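To make that concrete, a hedged sketch of what a pydantic-based configuration could look like; the model and field names (`VirtualChunkContainerAuth`, `container_auth`, the env prefix) are illustrative assumptions, not the PR's actual schema:

```python
# Hedged sketch of pydantic-based container auth settings; model and field
# names are illustrative, not this PR's actual schema.
from typing import Dict, Literal, Optional

from pydantic import BaseModel
from pydantic_settings import BaseSettings


class VirtualChunkContainerAuth(BaseModel):
    """Credentials for one virtual chunk container."""

    kind: Literal["anonymous", "env", "static"] = "anonymous"
    access_key_id: Optional[str] = None
    secret_access_key: Optional[str] = None


class ApiSettings(BaseSettings):
    """Application settings, including per-container auth."""

    # Maps a container URL prefix (e.g. "s3://bucket/") to its credentials,
    # so unknown containers can be validated against the store config.
    container_auth: Dict[str, VirtualChunkContainerAuth] = {}

    model_config = {"env_prefix": "TITILER_XARRAY_", "env_nested_delimiter": "__"}
```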
Just want to make sure I understand you correctly. I thought that I would integrate the container auth info in the ApiSettings (which are defined here). This would still require me to pass them as kwargs to `xarray_open_dataset` here and here, something like this:

```python
...
with xarray_open_dataset(
    ...,
    container_auth=settings.container_auth,
) as ds:
    ...
```

I might also not quite appreciate some of the details of how (and why) this is implemented in this way (e.g. why is …):

```python
    @classmethod
    def list_variables(
        cls,
        src_path: str,
        group: Optional[Any] = None,
        decode_times: bool = True,
    ) -> List[str]:
        """List available variables in a dataset."""
        with cls.opener(  # why not like this?
            src_path,
            group=group,
            decode_times=decode_times,
        ) as ds:
            return list(ds.data_vars)  # type: ignore
```

Wondering if a short sync on this (maybe tomorrow?) might be useful.
I'm down to sync on this today or Friday, but happy to hear the outcome if y'all jam on Thursday.
I'd be down to sync on this tomorrow. I don't understand why you need the …
| """Create icechunk fixtures (native and later virtual).""" | ||
| # TODO: these files could also be generated together with the zarr files using the same data | ||
|  | ||
| import numpy as np | ||
| import xarray as xr | ||
| import icechunk as ic | ||
|  | ||
| # Define dimensions and chunk sizes | ||
| res = 5 | ||
| time_dim = 10 | ||
| lat_dim = 36 | ||
| lon_dim = 72 | ||
| chunk_size = {"time": 10, "lat": 10, "lon": 10} | ||
|  | ||
| # Create coordinates | ||
| time = np.arange(time_dim) | ||
| lat = np.linspace(-90.0 + res / 2, 90.0 - res / 2, lat_dim) | ||
| lon = np.linspace(-180.0 + res / 2, 180.0 - res / 2, lon_dim) | ||
|  | ||
| dtype = np.float64 | ||
| # Initialize variables with random data | ||
| CDD0 = xr.DataArray( | ||
| np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype), | ||
| dims=("time", "lat", "lon"), | ||
| name="CDD0", | ||
| ) | ||
| DISPH = xr.DataArray( | ||
| np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype), | ||
| dims=("time", "lat", "lon"), | ||
| name="DISPH", | ||
| ) | ||
| FROST_DAYS = xr.DataArray( | ||
| np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype), | ||
| dims=("time", "lat", "lon"), | ||
| name="FROST_DAYS", | ||
| ) | ||
| GWETPROF = xr.DataArray( | ||
| np.random.rand(time_dim, lat_dim, lon_dim).astype(dtype), | ||
| dims=("time", "lat", "lon"), | ||
| name="GWETPROF", | ||
| ) | ||
|  | ||
| # Create dataset | ||
| ds = xr.Dataset( | ||
| { | ||
| "CDD0": CDD0.chunk(chunk_size), | ||
| "DISPH": DISPH.chunk(chunk_size), | ||
| "FROST_DAYS": FROST_DAYS.chunk(chunk_size), | ||
| "GWETPROF": GWETPROF.chunk(chunk_size), | ||
| }, | ||
| coords={"time": time, "lat": lat, "lon": lon}, | ||
| ) | ||
| storage = ic.local_filesystem_storage("tests/fixtures/icechunk_native") | ||
| config = ic.RepositoryConfig.default() | ||
| repo = ic.Repository.create(storage=storage, config=config) | ||
| session = repo.writable_session("main") | ||
| store = session.store | ||
|  | ||
| ds.to_zarr(store, consolidated=False) | ||
| session.commit("Add initial data") | 
IMO it'd be nicer to generate the test stores at runtime, if they do not already exist on the user's machine, rather than storing the fixtures in the repository: it keeps the git history cleaner (not storing hundreds of files) while only taking <1s of runtime.
Good call. But then we should probably do that for all test data? In this PR I mostly tried to adhere to the existing structure. I still think you are right, but changing this would require quite a bit more work, since we would not only have to generate the data at runtime but also the expected result JSONs. I think that is better handled in a separate PR?
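If we do go that route, a rough sketch of a runtime-generated fixture, assuming pytest; the fixture name is illustrative, and the dataset builder mirrors the script above (seeded so expected-result JSONs would stay reproducible):

```python
# Hedged sketch: build the icechunk fixture at test time instead of
# committing it to the repo. Fixture name and paths are illustrative.
import icechunk as ic
import numpy as np
import pytest
import xarray as xr


def _make_dataset() -> xr.Dataset:
    """Small random dataset mirroring the fixture script above."""
    rng = np.random.default_rng(0)  # seed keeps expected-result JSONs stable
    res, time_dim, lat_dim, lon_dim = 5, 10, 36, 72
    coords = {
        "time": np.arange(time_dim),
        "lat": np.linspace(-90.0 + res / 2, 90.0 - res / 2, lat_dim),
        "lon": np.linspace(-180.0 + res / 2, 180.0 - res / 2, lon_dim),
    }
    data_vars = {
        name: xr.DataArray(
            rng.random((time_dim, lat_dim, lon_dim)),
            dims=("time", "lat", "lon"),
        ).chunk({"time": 10, "lat": 10, "lon": 10})
        for name in ("CDD0", "DISPH", "FROST_DAYS", "GWETPROF")
    }
    return xr.Dataset(data_vars, coords=coords)


@pytest.fixture(scope="session")
def icechunk_native_store(tmp_path_factory) -> str:
    """Create the native icechunk store once per test session."""
    path = tmp_path_factory.mktemp("fixtures") / "icechunk_native"
    storage = ic.local_filesystem_storage(str(path))
    repo = ic.Repository.create(storage=storage, config=ic.RepositoryConfig.default())
    session = repo.writable_session("main")
    _make_dataset().to_zarr(session.store, consolidated=False)
    session.commit("Add initial data")
    return str(path)
```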
Replacing #96 since we cannot deploy to sandbox from a fork.
Depends on #95 and developmentseed/titiler#1235