Support Dask Dataframes in Hyperband #701
Conversation
Currently depends on dask/dask-ml#701. This could be improved by using an estimator that benefits from large amounts of data.
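For context, here is a minimal sketch of the kind of usage this PR enables: fitting Hyperband directly on a Dask DataFrame. The estimator, data, and parameter grid below are illustrative placeholders, not taken from the PR itself.

```python
# Sketch only: placeholder data, estimator, and parameter space.
import numpy as np
import pandas as pd
import dask.dataframe as dd
from distributed import Client
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import HyperbandSearchCV

client = Client(processes=False)  # Hyperband uses the distributed scheduler

df = pd.DataFrame({"x0": np.random.rand(1000), "x1": np.random.rand(1000)})
labels = pd.Series(np.random.randint(0, 2, size=1000))
X = dd.from_pandas(df, npartitions=4)
y = dd.from_pandas(labels, npartitions=4)

params = {"alpha": np.logspace(-4, 0, num=10)}
search = HyperbandSearchCV(SGDClassifier(), params, max_iter=9, random_state=0)
search.fit(X, y, classes=[0, 1])  # classes= is needed for partial_fit classifiers
```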
I don't have any issues or qualms with this PR, especially if the tests pass. The failing test looks relevant, but only because we were performing an unnecessary check.
```diff
- X_train, X_test, y_train, y_test = self._get_train_test_split(X, y)
+ X_train, X_test, y_train, y_test = self._get_train_test_split(
+     X, y, shuffle=True
+ )
```
Is this shuffle=True needed? There's a modification above where shuffle=True is already hard-coded.
This method warns when given a dataframe and shuffle= is not specified. We need to choose some default. Would False be a better choice?
Ah, I see. I prefer shuffle=True because it's the right choice from an optimization perspective. For example, what if the Dask DataFrame is built from one CSV per state? Calling partial_fit on any one CSV is not the right answer if each CSV is very different.
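A hypothetical sketch of that scenario:

```python
# Hypothetical layout: one CSV per state, so each partition of the
# resulting DataFrame holds a single state's rows.
import dask.dataframe as dd

ddf = dd.read_csv("data/state-*.csv")  # e.g. state-AK.csv, state-AL.csv, ...
# Without shuffling, a blockwise pass feeds partial_fit all of one state
# before any of the next, which can bias incremental optimizers.
```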
I think that in this function shuffle is only per-partition, not between partitions. I think we would leave inter-partition shuffling to the user ahead of sending data to Hyperband.
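For reference, one way a user might do that inter-partition shuffle up front (a sketch; the `_rand` helper column is purely illustrative, not anything in dask-ml):

```python
# Sketch: shuffle rows across partitions by assigning a random key
# column and repartitioning on it.
import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": np.arange(1000)}), npartitions=4)
ddf = ddf.assign(
    _rand=ddf.map_partitions(
        lambda part: pd.Series(np.random.rand(len(part)), index=part.index),
        meta=("_rand", "f8"),
    )
)
shuffled = ddf.shuffle("_rand").drop(columns="_rand")
```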
> I think that in this function shuffle is only per-partition, not between partitions.
That's not true; it looks like the default value of blockwise depends on whether it's a Dask Array or a Dask DataFrame:
dask-ml/dask_ml/model_selection/_split.py, lines 392 to 395 in fc6a04a:

> The default behavior depends on the types in arrays. For Dask Arrays,
> the default is True (data are not shuffled between blocks). For Dask
> DataFrames, the default and only allowed value is False (data are
> shuffled between blocks).
But that's unrelated to this PR. I think shuffle=True is a good choice in case the data have some order. Maybe we should make _get_train_test_split public?
> Maybe we should make _get_train_test_split public?
What's the motivation for that?
In case someone wants a specific train/test split. Maybe they have time series and don't want shuffle=True or blockwise=True? Or maybe they want the test set to be a specific chunk from the Dask Array?
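For instance, holding out a specific chunk could look something like this sketch using plain dask slicing (not a dask-ml API):

```python
# Sketch: reserve the last block of a Dask Array as the test set,
# e.g. for time-ordered data.
import dask.array as da

X = da.arange(1000, chunks=250)  # four blocks of 250 elements
X_train = X.blocks[:3]           # first three blocks
X_test = X.blocks[3]             # the final block, e.g. the most recent data
```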
shuffle=True, blockwise=True was the default behavior on master, and that's still what we have?
shuffle=True is definitely the behavior on master; shuffle=False isn't supported (_split.py#L493, _split.py#L477). blockwise is False for DataFrames and True for Arrays. I think that's the right behavior.
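For comparison, a sketch of the Array path, where as I read the docstring both blockwise settings are supported (unlike the DataFrame path discussed here):

```python
# Sketch: train_test_split on a Dask Array with both blockwise settings.
import dask.array as da
from dask_ml.model_selection import train_test_split

X = da.arange(1000, chunks=100)
# blockwise=True (the Array default): shuffle within each block only.
X_train, X_test = train_test_split(X, train_size=0.5, random_state=0)
# blockwise=False: also move rows between blocks (a full shuffle, slower).
X_train, X_test = train_test_split(X, train_size=0.5, blockwise=False, random_state=0)
```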
I need to investigate blockwise some; I'm not sure if there's a mixup in the documentation or implementation.
I think blockwise is backwards for DataFrame in the documentation & implementation. It only supports the equivalent of blockwise=True (within-block shuffling).
> I think blockwise is backwards for DataFrame in the documentation & implementation. It only supports the equivalent of blockwise=True (within-block shuffling).
I'm a little confused. Should the implementation support shuffling between chunks or within chunks? I'd like to see train_test_split support shuffling between chunks. Currently, train_test_split only supports shuffling within chunks (blockwise=True):
```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split

N = 1000
df = pd.DataFrame({"x": np.arange(N), "y": np.arange(N)})
ddf = dd.from_pandas(df, npartitions=2)

kwargs = dict(random_state=0, train_size=0.5)
train, test = train_test_split(ddf, blockwise=True, **kwargs)
assert train.compute().shape == (530, 2)

train, test = train_test_split(ddf, blockwise=False, **kwargs)  # raises NotImplementedError
```
> Should the implementation support shuffling between chunks or within chunks?
Ideally either. But that's distinct from this PR.
Ah indeed. Fixed, I think. Interestingly, tests also pass if I remove the to_dask_array lines. I'm curious: do we know what the support is like for passing pandas DataFrames to models? Is this something that we would want to leave to the user?
The sklearn dev failure can be ignored. I need to look into it today or tomorrow. Planning to merge later today, assuming the discussion in https://github.com/dask/dask-ml/pull/701/files/485ff530bca48eb6051591c75dbe6a5965d42333#diff-0a54ded74c8013caf0588e9a5c879e99 has been resolved.
Co-authored-by: Tom Augspurger <[email protected]>
Thanks! Apologies for the incorrect suggestion.
It doesn't look like we actually use exact chunk sizes anywhere, so this should be ok?
I may be wrong though. Regardless, there is probably still some cleanup to do here; I thought I'd push up something early for feedback.
cc @stsievert