-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: Groupby quantiles incorrect bins #33200 #33644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Note this is a duplicate of #33571
|
Updated the PR. I did not notice the other PR. It was not referenced in the original issue #33200 . |
c15b8b8 to
c56744b
Compare
c56744b to
8e6af0e
Compare
| # Figure out how many group elements there are | ||
| grp_sz = counts[i] | ||
| non_na_sz = non_na_counts[i] | ||
| if labels.any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this required for? Do we have a test case when this is False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, if labels are not provided or the dataframe is empty, an error message will rise when applying the quantile. However, there are cases where certain operations lead to empty labels, as when time resampling some types of empty dataframe:
Here is an example of the pipeline:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=34258&view=logs&j=bef1c175-2c1b-51ae-044a-2437c76fc339&t=770e7bb1-09f5-5ebf-b63b-578d2906aac9&l=127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this accounted for in the test that you've created? If not, can you add a test for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add two cases that take the if labels.any(): is False path.
What is this required for?
otherwise labels.max() can raise ValueError: zero-size array to reduction operation maximum which has no identity
1daaef1 to
b8a19bf
Compare
b8a19bf to
5832ba9
Compare
2175657 to
662c102
Compare
|
BTW this should also close #33569 |
doc/source/whatsnew/v1.1.0.rst
Outdated
| - Bug in :meth:`DataFrame.resample` where an ``AmbiguousTimeError`` would be raised when the resulting timezone aware :class:`DatetimeIndex` had a DST transition at midnight (:issue:`25758`) | ||
| - Bug in :meth:`DataFrame.groupby` where a ``ValueError`` would be raised when grouping by a categorical column with read-only categories and ``sort=False`` (:issue:`33410`) | ||
| - Bug in :meth:`GroupBy.first` and :meth:`GroupBy.last` where None is not preserved in object dtype (:issue:`32800`) | ||
| - Bug in :meth:`SeriesGroupBy.quantile` causes the quantiles to be shifted when the ``by`` axis contains ``NaN`` (:issue:`33200`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also a bug for DataFrameGroupBy? If so may want to note that, otherwise seems pretty good to me. @WillAyd any other thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works for both. Groupby.quantile processes both Series and Dataframes, and it is the cython funtion called there the one that has been fixed. Maybe just Groupby.quantile would suffice, since that's the function changed.
b3cf5b3 to
b0c309b
Compare
|
xref #33300 to track. |
|
@mabelvj This PR seems to be timing out on Windows. is this related to the changes here? |
|
@mabelvj Can you merge master? The quantile tests were moved to /tests/groupby/test_quantile.py. |
330e24d to
131a8c1
Compare
pandas/_libs/groupby.pyx
Outdated
| if labels.any(): | ||
| # Put '-1' (NaN) labels as the last group so it does not interfere | ||
| # with the calculations. | ||
| labels[labels == -1] = np.max(labels) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears that modifying labels is causing a segfault in test_quantile_missing_group_values_no_segfaults on windows on the second call in the loop. I suspect this is the reason for the timeouts.
simonjayhawkins
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit on test. Implementation lgtm - @jreback
| # Figure out how many group elements there are | ||
| grp_sz = counts[i] | ||
| non_na_sz = non_na_counts[i] | ||
| if labels.any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this accounted for in the test that you've created? If not, can you add a test for it?
| df = pd.DataFrame({"key": key, "val": val}) | ||
|
|
||
| result = df.groupby("key").quantile() | ||
| result = df.groupby("key").quantile(0.5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add another test that does not explicitly set a quantile (e.g. like the original)
|
thanks @mabelvj for the patch. |
|
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instruction are inaccurate, feel free to suggest an improvement. |
…incorrect bins)
…bins) (#34382) Co-authored-by: Mabel Villalba <[email protected]>
Maintain the order of the bins in group_quantile. Updated tests #33200
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diff