Description
#5692 is now merged, and we can already start thinking about the next steps. I’m opening this issue to list and track the remaining tasks. @pydata/xarray, do not hesitate to add a comment below if you think of something that is missing here.
Continue the refactoring of the internals
Although in #5692 everything seems to work with the current pandas index wrappers for dimension coordinates, not all of Xarray's internals have been refactored yet to fully support (or at least be compatible with) custom indexes. Here is a list of Dataset / DataArray methods that still need to be checked / updated (this list may be incomplete):
- as_numpy (as_numpy changes MultiIndex #8001)
- broadcast (Bug in broadcasting with multi-indexes #6430, refactor broadcast for flexible indexes #6481)
- drop_sel (DataArray.drop_isel / .drop_sel with duplicated initial time stamp - InvalidIndexError #6605, drop_sel returns KeyError for abbreviated dates #7699)
- drop_isel
- drop_dims
- drop_duplicates ('drop_duplicates' behaves differently when using 1 vs many coordinates for an index #8499)
- transpose
- interpolate_na
- ffill
- bfill
- reduce
- map
- apply
- quantile
- rank
- integrate
- cumulative_integrate
- filter_by_attrs
- idxmin
- idxmax
- argmin
- argmax
- concat (partially refactored, may not fully work with multi-dimension indexes)
- polyfit
I ended up following a common pattern in #5692 when adding explicit / flexible index support for various features. The pattern is quite generic: the actual procedure may vary from one case to another, and many steps may be skipped.
- Check if it’s worth adding a new method to the Xarray Index base class. There may be several motivations:
  - Avoid handling Pandas index objects inside Dataset or DataArray methods (even if we don’t plan to fully support custom indexes for everything, it is preferable to put this logic behind the PandasIndex or PandasMultiIndex wrapper classes for clarity and also if eventually we want to make Xarray less dependent on Pandas)
  - We want a specific implementation rather than relying on the Variable’s corresponding method, for speed-up or for other reasons, e.g.,
    - IndexVariable.concat exists to avoid unnecessary Pandas/Numpy conversions; in Explicit indexes #5692, PandasIndex.concat has the same logic and will fully replace the former if/once we get rid of IndexVariable
    - PandasIndex.roll reuses pandas.Index indexing and append capabilities
  - Index API closely follows DataArray, Dataset and Variable API (i.e., same method names) for consistency
- Within the Dataset or DataArray method, first call the Index API (if it exists) to create new indexes
  - The Indexes class (i.e., the .xindexes property returns an instance of this class) provides a convenient API for iterating through indexes (e.g., get a list of unique indexes, get all coordinates or dimensions for a given index, etc.)
  - If there’s no implementation for the called Index API, either raise an error or fall back to calling the Variable API (below), depending on the case
- Create new coordinate variables for each of the new indexes using Index.create_variables
  - It is possible to pass a dict of current coordinate variables to Index.create_variables; it is used to propagate variable metadata (dtype, attrs and encoding)
  - Not all indexes should create new coordinate variables, only those for which it is possible to reuse index data as coordinate variable data (like Pandas indexes)
- Iterate through the variables and call the Variable API (if it exists)
  - Skip new coordinate variables created at the previous step (just reuse them)
- Propagate the indexes that are not affected by the operation and clean up all indexes, i.e., ensure consistency between indexes and coordinate variables
  - There are a couple of convenient methods that have been added in Explicit indexes #5692 for that purpose: filter_indexes_from_coords and assert_no_index_corrupted
- Replace indexes and variables, e.g., using the _replace, _replace_with_new_dims or _overwrite_indexes methods
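The steps above can be sketched with a toy model, taking roll as the example operation. This is only an illustration of the control flow: ToyIndex, roll_values and roll_dataset are hypothetical names, variables are reduced to (dim, values) pairs, and none of this is the actual code from #5692.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyIndex:
    """Hypothetical stand-in for an Index wrapping the labels of one dimension."""

    dim: str
    labels: tuple

    def roll(self, shifts):
        # Index-level implementation of the operation.
        n = shifts.get(self.dim, 0)
        n = n % len(self.labels) if self.labels else 0
        if n == 0:
            return self
        return ToyIndex(self.dim, self.labels[-n:] + self.labels[:-n])

    def create_variables(self):
        # Reuse the index data as coordinate variable data.
        return {self.dim: (self.dim, list(self.labels))}


def roll_values(values, n):
    """Variable-level implementation of the operation (1-D only)."""
    n = n % len(values) if values else 0
    return values[-n:] + values[:-n] if n else list(values)


def roll_dataset(variables, indexes, shifts):
    new_indexes, new_variables = {}, {}
    # Call the index API first, then create coordinates from the new indexes.
    for name, index in indexes.items():
        if index.dim in shifts:
            new_indexes[name] = index.roll(shifts)
            new_variables.update(new_indexes[name].create_variables())
    # Call the variable API, skipping the coordinates created just above.
    for name, (dim, values) in variables.items():
        if name not in new_variables:
            new_variables[name] = (dim, roll_values(values, shifts.get(dim, 0)))
    # Propagate the indexes that the operation did not affect.
    for name, index in indexes.items():
        new_indexes.setdefault(name, index)
    # Hand back the replaced variables and indexes.
    return new_variables, new_indexes
```

In this toy version the consistency check between indexes and coordinates is trivial because each index owns exactly one coordinate; the real helpers named above exist precisely because that is not true in general.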
Relax all constraints related to “dimension (index) coordinates” in Xarray
- Allow multi-dimensional variables with the name matching one of its dimensions: Problem opening unstructured grid ocean forecasts with 4D vertical coordinates #2233, WIP: don't create indexes on multidimensional dimensions #2405 (see also the comment in #2405)
Indexes repr
- Add an Indexes section to Dataset and DataArray reprs
- Make the repr of Indexes (i.e., .xindexes property) consistent with the repr of Coordinates (.coords property)
- Add Index._repr_inline_ for tweaking the inline representation of each index shown in the reprs above
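As a rough illustration of the last item, here is what an inline repr hook could look like. ToyRangeIndex and indexes_repr are hypothetical, and the max_width parameter is an assumption about the hook's signature, not something settled in this issue.

```python
class ToyRangeIndex:
    """Hypothetical index; only the repr hook matters here."""

    def __init__(self, start, stop, step):
        self.start, self.stop, self.step = start, stop, step

    def _repr_inline_(self, max_width):
        # One-line summary, truncated so that it never exceeds max_width.
        text = f"ToyRangeIndex(start={self.start}, stop={self.stop}, step={self.step})"
        if len(text) > max_width:
            text = text[: max(max_width - 3, 0)] + "..."
        return text


def indexes_repr(indexes, max_width=40):
    """Build an 'Indexes' section with one line per index."""
    lines = ["Indexes:"]
    for name, index in indexes.items():
        lines.append(f"    {name:<8} {index._repr_inline_(max_width)}")
    return "\n".join(lines)
```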
Public API for assigning and (re)setting indexes
There is no public API yet for creating and/or assigning existing indexes to Dataset and DataArray objects.
- Enable and/or document the indexes parameter in Dataset and DataArray constructors
  - Deprecate the implicit creation of pandas multi-index wrappers (and their corresponding coordinates) from anything passed via the data, data_vars or coords arguments in favor of a more explicit way to pass it.
  - Opening dataset without loading any indexes? #6633 (pass empty dictionary)
  - Pass indexes to the Dataset and DataArray constructors #6392
  - Pass indexes directly to the DataArray and Dataset constructors #7214
  - Expose "Coordinates" as part of Xarray's public API #7368
- Add set_xindex and drop_indexes methods
We still need to figure out how best we can (1) assign existing indexes (possibly with their coordinates) and (2) pass index build options.
Other public API for index-based operations
To fully leverage the power and flexibility of custom indexes, we might want to update some parts of Xarray’s public API in order to allow passing arbitrary options per index. For example:
- sel: the current method and tolerance may not be relevant for all indexes, pass extra arguments to Scipy's cKDTree.query, etc. Pass arbitrary options to sel() #7099
- align: tolerance for alignment #2217
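To make the idea of per-index options concrete, here is a minimal sketch in which each index receives only the options addressed to it. Everything in it is hypothetical: the ExactIndex and NearestIndex classes, the sel function and its options argument are not an existing or agreed-upon Xarray API.

```python
import bisect


class ExactIndex:
    """Hypothetical index that supports exact label lookup only."""

    def __init__(self, labels):
        self.labels = list(labels)

    def sel(self, label):
        return self.labels.index(label)


class NearestIndex:
    """Hypothetical index with its own 'tolerance' option (labels assumed sorted and non-empty)."""

    def __init__(self, labels):
        self.labels = list(labels)

    def sel(self, label, tolerance=None):
        pos = bisect.bisect_left(self.labels, label)
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(self.labels)]
        best = min(candidates, key=lambda i: abs(self.labels[i] - label))
        if tolerance is not None and abs(self.labels[best] - label) > tolerance:
            raise KeyError(label)
        return best


def sel(indexes, indexers, options=None):
    """Dispatch each label, plus the options given for that index, to the index itself."""
    options = options or {}
    return {
        name: indexes[name].sel(label, **options.get(name, {}))
        for name, label in indexers.items()
    }
```

With this shape, an option that makes no sense for a given index is simply never passed to it, rather than being a global keyword that every index has to accept or ignore.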
Also:
- Make public the Indexes API as it provides convenient methods that might be useful for end-users
- Import the Index base class into Xarray’s main namespace (i.e., xr.Index)? Also PandasIndex and PandasMultiIndex? The latter may be useful if we deprecate set_index(append=True) and/or if we deprecate “unpacking” pandas.MultiIndex objects to coordinates when given as coords in the Dataset / DataArray constructors.
  - Add references in docstrings (Explicit indexes #5692 (comment)).
Documentation
- User guide:
  - Update the “Terminology” section: “Index” may include custom indexes, review “Dimension coordinate” / “Non-dimension coordinate” as “Indexed coordinate” / “Non-indexed coordinate”
  - Update the “Data structure” section such that it clearly mentions indexes as 1st class citizen of the Xarray data model
  - Maybe update other parts of the documentation that refer to the concept of “dimension coordinate”
- API reference:
  - add Indexes API
  - add Index API: Add documentation on custom indexes #6975
- Xarray internals: add a subsection on how to add custom indexes, maybe with some basic examples: Add documentation on custom indexes #6975
- Update development roadmap section
Index types and helper classes built in Xarray
- Since a lot of potential use-cases for custom indexes may consist of adding some extra logic on top of one or more pandas indexes along one or more dimensions (i.e., “meta-indexes”), it might be worth providing a helper Index abstract subclass that would basically dispatch the given arguments to the corresponding, encapsulated PandasIndex instances and then merge the results
- Deprecate PandasMultiIndex dimension coordinate?
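The “meta-index” helper from the first item could look roughly like the sketch below. ToyPandasIndex stands in for PandasIndex and ToyMetaIndex for the proposed helper subclass; both are hypothetical, and only label selection is shown.

```python
class ToyPandasIndex:
    """Hypothetical stand-in for a PandasIndex along a single dimension."""

    def __init__(self, dim, labels):
        self.dim = dim
        self.labels = list(labels)

    def sel(self, label):
        # Return a {dimension: position} result for one label.
        return {self.dim: self.labels.index(label)}


class ToyMetaIndex:
    """Dispatch per-dimension arguments to encapsulated indexes, then merge the results."""

    def __init__(self, indexes):
        self._indexes = {index.dim: index for index in indexes}

    def sel(self, labels):
        merged = {}
        for dim, label in labels.items():
            # Extra logic specific to the meta-index would go around this call.
            merged.update(self._indexes[dim].sel(label))
        return merged
```

A concrete meta-index would subclass the helper and override only the parts where it needs extra logic, leaving dispatch and merging to the base class.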
3rd party indexes
- Add custom index entrypoint / plugin system, similarly to storage backend entrypoints
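For the plugin system, discovery could rely on the standard library's importlib.metadata, in the same spirit as the backend entrypoints. The sketch below is only a starting point: the group name "xarray.indexes" and the function discover_index_classes are made up for illustration and do not correspond to anything that exists today.

```python
from importlib.metadata import entry_points

# Assumption: this group name is hypothetical and used for illustration only.
INDEX_ENTRYPOINT_GROUP = "xarray.indexes"


def discover_index_classes(group=INDEX_ENTRYPOINT_GROUP):
    """Return {entry point name: loaded object} for everything registered under group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection of entry points.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}
```

A third-party package would then declare its index class under that group in its packaging metadata, and Xarray could look it up by name when the user asks for it.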