-
Couldn't load subscription status.
- Fork 6.8k
[DataFrame] Implement quantile #1992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good, just a few comments. I wasn't able to reproduce a ValueError from the underlying pandas call. Can you describe when this would occur?
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use axis = pd.DataFrame()._get_axis_number(axis) instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would this do here? I'm checking if q is a list or not
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numeric_only specifies whether or not to filter the non-numeric columns. Does not need error checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If numeric_only is true, then pandas returns value errors for mixed dtype dataframes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not able to reproduce, seems that pandas drops non-numeric columns when numeric_only is True.
In [3]: df
Out[3]:
col1 col2
0 1 a
1 2 b
2 3 c
In [4]: df.dtypes
Out[4]:
col1 int64
col2 object
dtype: object
In [5]: df.quantile(numeric_only=True)
Out[5]:
col1 2.0
Name: 0.5, dtype: float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true also in the axis=1 case.
In [6]: df.quantile(numeric_only=True, axis=1)
Out[6]:
0 1.0
1 2.0
2 3.0
Name: 0.5, dtype: float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, when numeric_only is FALSE, it returns TYPEerrors.
Try on
df = pd.DataFrame({"A": [1, 2,
"B": [2., 3., 4.],
"C": pd.date_range('20130101', periods=3),
"D": ['foo', 'bar', 'baz']})
df.quantile(.5, axis=1, numeric_only=False)
This returns a TypeError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true, although the check here does not handle this case. The error I see is TypeError: Cannot compare type 'Timestamp' with type 'float'.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the pandas _check_percentile method here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would this ValueError be thrown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exception is thrown when there are only non-numeric columns in this partition
Added comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, this should probably be handled somewhere to ensure concordance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describe also handles it like this, so if we do something about it we should make sure to modify it there as well
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_arithmetic_helper should return a Series.
|
Test PASSed. |
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better; should probably look into the right error handling here. Seems like ignoring errors on workers is the wrong solution.
It would be better if the errors were caught in the function logic, before submitting remote tasks. Since this is an issue in describe can you update there too?
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true, although the check here does not handle this case. The error I see is TypeError: Cannot compare type 'Timestamp' with type 'float'.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed above, TypeErrors are thrown here. Perhaps you can add some logic at the beginning of the function to ensure all the dtypes are comparable.
The issue seems to arise not from non-numeric data, but from non-comparable types.
|
Updated error checking and dtyping |
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few preliminary comments.
python/ray/dataframe/__init__.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lines should be consolidated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to keep all the datetime imports on their own line, should I still do this?
python/ray/dataframe/__init__.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do this, you need to remove test_series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I just move the entire file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it just needs to be removed from the .travis.yml file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want to run those tests? Or no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While Series is unimplemented, we are importing it directly from Pandas. Skipping these tests for now as they expect NotImplementedErrors to be thrown.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this error is concordant with Pandas, but it feels very out of place in this context. Perhaps we should change it to something more descriptive. Honestly, this case seems like a bug with Pandas, it should handle the case where all columns are dropped.
Looping in @devin-petersohn here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- I originally had a more descriptive error but scrapped it in favor of using the pandas one. Something that references mixed dtypes would be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. Looping in @devin-petersohn to discuss the error message shown.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As before use axis = pd.DataFrame()._get_axis_number(axis) here
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that np.datetime64 type columns are supported.
In [8]: x
Out[8]:
col1
0 2018-05-04 20:50:29
1 2018-05-04 20:50:29
2 2018-05-04 20:50:29
In [9]: x.dtypes
Out[9]:
col1 datetime64[ns]
dtype: object
In [10]: x.quantile(q=0.5, axis=1, numeric_only=True)
Out[10]:
0 NaN
1 NaN
2 NaN
Name: 0.5, dtype: float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a fix to handle all datetime objects
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this function defined twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One returns a Series the other returns a DataFrame
|
Test PASSed. |
be8b6ed to
39e3320
Compare
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
|
Merged, thanks @11rohans! |
* master: (21 commits) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973) Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) add pthread linking (ray-project#1986) [DataFrame] Add layer of abstraction to allow OID instantiation (ray-project#1984) ...
* master: (25 commits) [DataFrame] Add direct pandas imports for MVP (ray-project#1960) Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973) Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) ...
* master: [DataFrame] Add direct pandas imports for MVP (ray-project#1960) Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
Updates the
DataFrame.quantilemethod for full use cases.Also updates the init file to account for datetime methods.
Updates equals method so that it can handle one column dataframes correctly