[DataFrame] Implement quantile #1992

11rohans · 2018-05-03T05:20:55Z

Implements the DataFrame.quantile method for the full set of use cases.
Also updates the init file to account for datetime methods.
Updates the equals method so that it handles one-column DataFrames correctly.

AmplabJenkins · 2018-05-03T06:25:07Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5158/
Test PASSed.

kunalgosar

Looks really good, just a few comments. I wasn't able to reproduce a ValueError from the underlying pandas call. Can you describe when this would occur?

kunalgosar · 2018-05-03T23:24:09Z

python/ray/dataframe/dataframe.py

Use axis = pd.DataFrame()._get_axis_number(axis) instead.

What would this do here? I'm checking if q is a list or not
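
For context on the suggestion above, here is a minimal sketch of what axis normalization does. `_get_axis_number` is a private pandas helper, so its availability across pandas versions is an assumption; the wrapper name `normalize_axis` is hypothetical and not part of this PR.

```python
import pandas as pd


def normalize_axis(axis):
    """Map an axis given as 0/1 or 'index'/'columns' to its integer form.

    Relies on the private pandas helper named in the review comment;
    being private, it may change between pandas versions.
    """
    return pd.DataFrame()._get_axis_number(axis)


row_axis = normalize_axis("index")    # same axis as 0
col_axis = normalize_axis("columns")  # same axis as 1
```

Normalizing once at the top of a method lets the rest of the body compare against plain integers. It is independent of whether q is a scalar or a list.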

kunalgosar · 2018-05-03T23:27:46Z

python/ray/dataframe/dataframe.py

numeric_only specifies whether or not to filter the non-numeric columns. Does not need error checking.

If numeric_only is true, then pandas returns value errors for mixed dtype dataframes

I'm not able to reproduce, seems that pandas drops non-numeric columns when numeric_only is True.

In [3]: df
Out[3]:
   col1 col2
0     1    a
1     2    b
2     3    c

In [4]: df.dtypes
Out[4]:
col1     int64
col2    object
dtype: object

In [5]: df.quantile(numeric_only=True)
Out[5]:
col1    2.0
Name: 0.5, dtype: float64

This is true also in the axis=1 case.

In [6]: df.quantile(numeric_only=True, axis=1)
Out[6]:
0    1.0
1    2.0
2    3.0
Name: 0.5, dtype: float64

Whoops, when numeric_only is False, it raises TypeErrors.

Try it on:

df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [2., 3., 4.],
                   "C": pd.date_range('20130101', periods=3),
                   "D": ['foo', 'bar', 'baz']})
df.quantile(.5, axis=1, numeric_only=False)

This raises a TypeError.

This is true, although the check here does not handle this case. The error I see is TypeError: Cannot compare type 'Timestamp' with type 'float'.

kunalgosar · 2018-05-03T23:29:24Z

python/ray/dataframe/dataframe.py

Use the pandas _check_percentile method here.
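
As an illustration of the kind of validation being asked for, here is a minimal stand-alone sketch. It is not the pandas implementation; the name `check_quantile` and the error text are hypothetical, and the only assumption is that valid quantiles lie in [0, 1].

```python
from numbers import Real


def check_quantile(q):
    """Validate that q, or every element of q, lies in [0, 1].

    Hypothetical stand-in for the pandas helper mentioned above; it is
    not the pandas implementation.
    """
    # Treat a scalar as a one-element list so both cases share one loop.
    values = [q] if isinstance(q, Real) else list(q)
    for value in values:
        if not 0 <= value <= 1:
            raise ValueError(
                "quantiles should all be in the interval [0, 1]; "
                "got {!r}".format(value)
            )
    return values
```

Running the check on the driver means a bad q fails immediately instead of surfacing later from a worker.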

kunalgosar · 2018-05-03T23:36:41Z

python/ray/dataframe/dataframe.py

When would this ValueError be thrown?

This exception is thrown when the partition contains only non-numeric columns.
Added a comment.

I see, this should probably be handled somewhere to ensure concordance.

Describe also handles it like this, so if we do something about it we should make sure to modify it there as well

kunalgosar · 2018-05-03T23:37:29Z

python/ray/dataframe/dataframe.py

_arithmetic_helper should return a Series.

AmplabJenkins · 2018-05-04T04:40:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5189/
Test PASSed.

AmplabJenkins · 2018-05-04T05:02:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5191/
Test PASSed.

kunalgosar

Looks much better; should probably look into the right error handling here. Seems like ignoring errors on workers is the wrong solution.

It would be better if the errors were caught in the function logic, before submitting remote tasks. Since this is also an issue in describe, can you update it there too?
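
One way to read this suggestion, sketched on plain pandas frames rather than Ray remote tasks: decide up front which partitions have anything to compute, so no exception has to be swallowed on a worker. The function name `quantile_by_column_partition` and the column-wise partition layout are assumptions for illustration, not the code in this PR.

```python
import pandas as pd


def quantile_by_column_partition(partitions, q=0.5):
    """Compute column quantiles over a list of column-wise partitions.

    Partitions without numeric columns are skipped up front instead of
    letting the per-partition call fail and ignoring the exception.
    """
    results = []
    for part in partitions:
        numeric = part.select_dtypes(include="number")
        if numeric.shape[1] == 0:
            continue  # nothing to compute for this partition
        results.append(numeric.quantile(q))
    if not results:
        raise ValueError("no numeric columns to compute a quantile over")
    return pd.concat(results)


partitions = [
    pd.DataFrame({"a": [1, 2, 3]}),
    pd.DataFrame({"s": ["x", "y", "z"]}),  # only non-numeric columns
    pd.DataFrame({"b": [4.0, 5.0, 6.0]}),
]
medians = quantile_by_column_partition(partitions)
```

The same pre-filtering could apply to describe, since the thread notes that it handles this case the same way.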

kunalgosar · 2018-05-04T05:11:33Z

python/ray/dataframe/dataframe.py

This is true, although the check here does not handle this case. The error I see is TypeError: Cannot compare type 'Timestamp' with type 'float'.

kunalgosar · 2018-05-04T05:14:22Z

python/ray/dataframe/dataframe.py

As discussed above, TypeErrors are thrown here. Perhaps you can add some logic at the beginning of the function to ensure all the dtypes are comparable.

The issue seems to arise not from non-numeric data, but from non-comparable types.
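
A minimal sketch of such a pre-check, assuming the only conflict of interest is numeric columns mixed with datetime columns. The function name `check_comparable_dtypes` and its error message are hypothetical; the dtype predicates come from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def check_comparable_dtypes(dtypes):
    """Raise TypeError if the dtypes mix numeric and datetime columns.

    Hypothetical pre-check meant to run on the driver, before any
    remote task is submitted. Other dtypes (e.g. object) are ignored
    here; a real check would need to decide how to treat them.
    """
    has_numeric = any(is_numeric_dtype(t) for t in dtypes)
    has_datetime = any(is_datetime64_any_dtype(t) for t in dtypes)
    if has_numeric and has_datetime:
        raise TypeError("Cannot compare numeric and datetime columns")


mixed = pd.DataFrame({
    "B": [2.0, 3.0, 4.0],
    "C": pd.date_range("20130101", periods=3),
})
```

Calling `check_comparable_dtypes(mixed.dtypes)` raises, while a frame holding only "B" or only "C" passes, which matches the observation that the problem is non-comparable types rather than non-numeric data.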

11rohans · 2018-05-04T18:43:20Z

Updated error checking and dtyping

AmplabJenkins · 2018-05-04T19:45:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5220/
Test PASSed.

kunalgosar

Few preliminary comments.

kunalgosar · 2018-05-04T19:50:11Z

python/ray/dataframe/__init__.py

The lines should be consolidated.

Wanted to keep all the datetime imports on their own line, should I still do this?

kunalgosar · 2018-05-04T19:50:34Z

python/ray/dataframe/__init__.py

If you do this, you need to remove test_series.

Should I just move the entire file?

No, it just needs to be removed from the .travis.yml file

Don't we want to run those tests? Or no?

While Series is unimplemented, we are importing it directly from Pandas. Skipping these tests for now as they expect NotImplementedErrors to be thrown.

kunalgosar · 2018-05-04T20:46:16Z

python/ray/dataframe/dataframe.py

I know that this error is concordant with Pandas, but it feels very out of place in this context. Perhaps we should change it to something more descriptive. Honestly, this case seems like a bug in Pandas; it should handle the case where all columns are dropped.

Looping in @devin-petersohn here.

I agree -- I originally had a more descriptive error but scrapped it in favor of using the pandas one. Something that references mixed dtypes would be better

kunalgosar

A few comments. Looping in @devin-petersohn to discuss the error message shown.

kunalgosar · 2018-05-04T20:47:08Z

python/ray/dataframe/dataframe.py

As before, use axis = pd.DataFrame()._get_axis_number(axis) here.

kunalgosar · 2018-05-04T20:53:24Z

python/ray/dataframe/dataframe.py

Seems that np.datetime64 type columns are supported.

In [8]: x
Out[8]:
                 col1
0 2018-05-04 20:50:29
1 2018-05-04 20:50:29
2 2018-05-04 20:50:29

In [9]: x.dtypes
Out[9]:
col1    datetime64[ns]
dtype: object

In [10]: x.quantile(q=0.5, axis=1, numeric_only=True)
Out[10]:
0   NaN
1   NaN
2   NaN
Name: 0.5, dtype: float64

Made a fix to handle all datetime objects

kunalgosar · 2018-05-04T20:53:59Z

python/ray/dataframe/dataframe.py

Why is this function defined twice?

One returns a Series; the other returns a DataFrame.
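
The helper functions in question are not shown in this thread, but the split mirrors how pandas itself behaves: the return type of quantile depends on whether q is a scalar or list-like. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]})

# A scalar q gives one value per column: a Series named after q.
median = df.quantile(0.5)

# A list-like q gives one row per quantile: a DataFrame indexed by q.
quartiles = df.quantile([0.25, 0.75])
```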

AmplabJenkins · 2018-05-05T00:34:00Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5228/
Test PASSed.

AmplabJenkins · 2018-05-06T18:52:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5238/
Test FAILed.

AmplabJenkins · 2018-05-06T19:35:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5239/
Test PASSed.

AmplabJenkins · 2018-05-06T19:41:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5240/
Test PASSed.

devin-petersohn · 2018-05-07T01:25:23Z

Merged, thanks @11rohans!

* master: (21 commits)
  Expand local_dir in Trial init (ray-project#2013)
  Fixing ascii error for Python2 (ray-project#2009)
  [DataFrame] Implements df.update (ray-project#1997)
  [DataFrame] Implements df.as_matrix (ray-project#2001)
  [DataFrame] Implement quantile (ray-project#1992)
  [DataFrame] Impement sort_values and sort_index (ray-project#1977)
  [DataFrame] Implement rank (ray-project#1991)
  [DataFrame] Implemented prod, product, added test suite (ray-project#1994)
  [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941)
  [DataFrame] Implement diff (ray-project#1996)
  [DataFrame] Implemented nunique, skew (ray-project#1995)
  [DataFrame] Implements filter and dropna (ray-project#1959)
  [DataFrame] Implements df.pipe (ray-project#1999)
  [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
  Clean up syntax for supported Python versions. (ray-project#1963)
  [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956)
  [DataFrame] Fix dtypes (ray-project#1930)
  keep_dims -> keepdims (ray-project#1980)
  add pthread linking (ray-project#1986)
  [DataFrame] Add layer of abstraction to allow OID instantiation (ray-project#1984)
  ...

* master: (25 commits)
  [DataFrame] Add direct pandas imports for MVP (ray-project#1960)
  Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007)
  Expand local_dir in Trial init (ray-project#2013)
  Fixing ascii error for Python2 (ray-project#2009)
  [DataFrame] Implements df.update (ray-project#1997)
  [DataFrame] Implements df.as_matrix (ray-project#2001)
  [DataFrame] Implement quantile (ray-project#1992)
  [DataFrame] Impement sort_values and sort_index (ray-project#1977)
  [DataFrame] Implement rank (ray-project#1991)
  [DataFrame] Implemented prod, product, added test suite (ray-project#1994)
  [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941)
  [DataFrame] Implement diff (ray-project#1996)
  [DataFrame] Implemented nunique, skew (ray-project#1995)
  [DataFrame] Implements filter and dropna (ray-project#1959)
  [DataFrame] Implements df.pipe (ray-project#1999)
  [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
  Clean up syntax for supported Python versions. (ray-project#1963)
  [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956)
  [DataFrame] Fix dtypes (ray-project#1930)
  keep_dims -> keepdims (ray-project#1980)
  ...

* master:
  [DataFrame] Add direct pandas imports for MVP (ray-project#1960)
  Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007)
  Expand local_dir in Trial init (ray-project#2013)
  Fixing ascii error for Python2 (ray-project#2009)
  [DataFrame] Implements df.update (ray-project#1997)
  [DataFrame] Implements df.as_matrix (ray-project#2001)
  [DataFrame] Implement quantile (ray-project#1992)
  [DataFrame] Impement sort_values and sort_index (ray-project#1977)
  [DataFrame] Implement rank (ray-project#1991)
  [DataFrame] Implemented prod, product, added test suite (ray-project#1994)
  [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941)
  [DataFrame] Implement diff (ray-project#1996)
  [DataFrame] Implemented nunique, skew (ray-project#1995)
  [DataFrame] Implements filter and dropna (ray-project#1959)
  [DataFrame] Implements df.pipe (ray-project#1999)
  [DataFrame] Apply() for Lists and Dicts (ray-project#1973)

kunalgosar suggested changes May 3, 2018

11rohans force-pushed the ray_quantile branch from 5ed3d29 to 35091c2 on May 4, 2018 03:36

kunalgosar suggested changes May 4, 2018

11rohans force-pushed the ray_quantile branch from a5cf3cd to c824f33 on May 4, 2018 18:42

kunalgosar suggested changes May 4, 2018

kunalgosar reviewed May 4, 2018

kunalgosar suggested changes May 4, 2018

11rohans and others added 8 commits May 6, 2018 11:21

added quantile method

4a08758

updated init for datetime signatures

4c7be56

updated documentation for _map_partitions return type

238a27b

removed extraneous print call

3f788a6

updated for simplicity

04b72a9

fixed dtyping issues and error raising

6bf4b10

updated datetime dtype checking

f1b0dd0

Fixing quantile implementation

39e3320

devin-petersohn force-pushed the ray_quantile branch from be8b6ed to 39e3320 on May 6, 2018 18:21

devin-petersohn added 2 commits May 6, 2018 11:32

Fix minor bug

c30be81

Fixing diff

3418aaa

devin-petersohn approved these changes May 7, 2018

devin-petersohn merged commit 1848745 into ray-project:master May 7, 2018
