-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Fix behaviour of min_count in reducing functions #4911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Make sure Dask-backed arrays are not computed. * Check some specific examples give the correct output. * Run membership tests on xarray.core.dtypes.NAT_TYPES
* Fix mask checks in xarray.core.nanops._maybe_null_out to run lazily for Dask-backed arrays. * Change xarray.core.dtypes.NAT_TYPES to a set (it is only used for membership checks). * Add dtypes to NAT_TYPES rather than instances. Previously np.float64 was returning true from `dtype in NAT_TYPES` which resulted in min_count being ignored when reducing over all axes.
mathause
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
- Technically that's a breaking change (
xr.DataArray([1]).sum(min_count=1)currently has dtypeìntwhile it'sfloatafter this change). I am fine with this, maybe the docstring could be updated? Saying that if you setmin_countyou'll get a float array? - There is a failure in
ubuntu-latest py37-min-all-deps- looks likenp.whereis not yet lazy in dask 2.9.2 (I am not sure when it becomes lazy, though). You could just skip the test for the appropriate dask version, or add a minversion check for raise_if_dask_computes
We can use the |
even better would be |
* use duck_array_ops.where instead of np.where * add docstring and whatsnew messages about sum/prod on integer arrays with skipna=True and min_count != None now returning a float array.
|
mathause
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I get it - np.where does not dispatch to dask.array.where in the earlier versions...
Not for this PR but I wonder if xr.DataArray([1]).sum(min_count=2) should actually return NA.
It would make it more consistent, the current behaviour is
due to the different dtype defaults for skipna. I guess this would require rework of the dispatch logic in |
|
I think we can merge this before the weekend, unless someone objects? |
|
Thanks @bcbnz. I see this is your first PR here - welcome to xarray! |
The first commit modifies existing tests to check Dask-backed arrays are not computed. It also adds some specific checks that the correct result (NaN or a number as appropriate) is returned and some tests for checking membership of
xarray.core.dtypes.NAT_TYPES. After this commit I get 89 test failures, and they seem to cover the cases reported in #4898.The second commit fixes these failures:
The checks of the nan mask in
xarray.core.nanops._maybe_null_outare changed to usenp.wherewhich allows lazy evaluation.Previously,
xarray.core.dtypes.NAT_TYPESwas a tuple of datetime64 and timedelta64 instances; I've changed it to a set of the dtypes of these instances. It is only used for the membership check in_maybe_null_outso a set seems appropriate. The previous use of instances rather than dtypes caused a bug --np.float64 in NAT_TYPESreturned true even though it only contained datetime64/timedelta64. This meant that reducing operations over all axes (axis=Noneor...) with float64 arrays ignored min_count as the membership check in_maybe_null_outcaused it to be skipped.pre-commit run --all-fileswhats-new.rst