Skip to content

Conversation

@mrocklin
Copy link
Member

It doesn't look like we actually use exact chunk sizes anywhere, so this should be ok?

I may be wrong though. Regardless there is probably still some cleanup to do here. I thought I'd push up something early for feedback though.

cc @stsievert

mrocklin added a commit to coiled/coiled-examples that referenced this pull request Jul 18, 2020
Currently depends on dask/dask-ml#701

This could be improved by using an estimator that benefitted from large
amounts of data.
Copy link
Member

@stsievert stsievert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any issues or qualms with this PR, especially if the tests pass. The failing test looks relevant, but only because we were performing an unnecessary check.

X_train, X_test, y_train, y_test = self._get_train_test_split(X, y)

X_train, X_test, y_train, y_test = self._get_train_test_split(
X, y, shuffle=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this shuffle=True needed? There's made a modification above where shuffle=True is hard-coded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method warns when given a dataframe and shuffle= is not specified. We need to choose some default. Would False be a better choice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I prefer shuffle=True because it's the right choice from an optimization perspective. For example, what if the Dask Dataframe are CSVs from each state? Calling partial_fit on any one CSV is not the right answer if each CSV is very different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in this function shuffle is only per-partitions, not between partitions. I think that we would leave inter-partition shuffling to the user ahead of sending to Hyperband

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in this function shuffle is only per-partitions, not between partitions.

That's not true; it looks like for default value of blockwise depends on whether it's a Dask Array or DataFrame:

The default behavior depends on the types in arrays. For Dask Arrays,
the default is True (data are not shuffled between blocks). For Dask
DataFrames, the default and only allowed value is False (data are
shuffled between blocks).

But that's unrelated to this PR. I think shuffle=True is a good choice in case there's order. Maybe we should make _get_train_test_split public?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should make _get_train_test_split public?

What's the motivation for that?

In case someone wants a specific train/test split. Maybe they have time series and don't want shuffle=True or blockwise=True? Or maybe they want the test set to be a specific chunk from the Dask Array?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shuffle=True, blockwise=True was the default behavior on master, and that's still what we have?

shuffle=True is definitely the behavior on master; shuffle=False isn't supported (_split.py#L493, _split.py#L477), and blockwise is False for DataFrames, and True for Arrays. I think that's the right behavior.

I need to investigate blockwise some; I'm not sure if there's a mixup in the documentation or implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think blockwise is backwards for DataFrame in the documentation & implementation. It only supports the equivalent of blockwise=True (within-block shuffling).

Copy link
Member

@stsievert stsievert Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think blockwise is backwards for DataFrame in the documentation & implementation. It only supports the equivalent of blockwise=True (within-block shuffling).

I'm a little confused. Should the implementation support shuffling between chunks or within chunks? I'd like to see train_test_split support shuffling between chunks. Currently, train_test_split only supports shuffling within chunks blockwise=True:

import pandas as pd
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split

N = 1000
df = pd.DataFrame({"x": np.arange(N), "y": np.arange(N)})
ddf = dd.from_pandas(df, npartitions=2)

kwargs = dict(random_state=0, train_size=0.5)
train, test = train_test_split(ddf, blockwise=True, **kwargs)
assert train.compute().shape == (530, 2)
train, test = train_test_split(ddf, blockwise=False, **kwargs)  # raises NotImplementedError

Copy link
Member

@TomAugspurger TomAugspurger Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the implementation support shuffling between chunks or within chunks?

Ideally either. But that's distinct from this PR.

@mrocklin
Copy link
Member Author

I don't have any issues or qualms with this PR, especially if the tests pass. The failing test looks relevant, but only because we were performing an unnecessary check.

Ah indeed. Fixed I think.

Interestingly. Tests also pass if I remove the to_dask_array lines. I'm curious do we know what support is like for passing Pandas dataframes around to models is? Is this something that we would want to leave for the user?

@TomAugspurger
Copy link
Member

The sklearn dev failure can be ignore. I need to look into it today or tomorrow.

Planning to merge later today, assuming the discussion in https://github.com/dask/dask-ml/pull/701/files/485ff530bca48eb6051591c75dbe6a5965d42333#diff-0a54ded74c8013caf0588e9a5c879e99 has been resolved.

@TomAugspurger
Copy link
Member

Thanks! Apologies for the incorrect suggestion.

@TomAugspurger TomAugspurger merged commit 382bbb6 into dask:master Jul 21, 2020
@mrocklin mrocklin deleted the hyperband-dataframe branch July 21, 2020 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants