Add hyper-parameter-optimization notebook with Hyperband #1
Conversation
Currently depends on dask/dask-ml#701. This could be improved by using an estimator that benefits from large amounts of data.
@stsievert if you have any time, do you have any thoughts on how this example might be improved?
Thanks! This is nice to see.
There are some improvements to make, I think. I've left some comments below to give users a better idea of why they're using Dask, along with some nits. It would also help to add text below each title describing what the cell does and why Dask is required. I'd probably point to Dask-ML's hyperparameter optimization docs too.
If you'd like, I might be able to modify this example.
hyper-parameter-optimization.ipynb (Outdated)
```
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train model"
```
I think I would split this into two sections: "Define model and hyperparameter search space" and "Find the best hyperparameters."
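For instance, the two sections could boil down to something like this (a minimal sketch; the parameter distributions and values are illustrative, not the notebook's actual ones):

```python
from scipy.stats import loguniform, uniform
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import HyperbandSearchCV

# Define model and hyperparameter search space
model = SGDClassifier(tol=1e-3, penalty="elasticnet")
params = {
    "alpha": loguniform(1e-5, 1e-1),  # regularization strength
    "l1_ratio": uniform(0, 1),        # elastic-net mixing parameter
}

# Find the best hyperparameters
search = HyperbandSearchCV(model, params, max_iter=81, random_state=0)
search.fit(X_train, y_train, classes=[0, 1])  # X_train/y_train from the notebook
print(search.best_params_)
```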
hyper-parameter-optimization.ipynb (Outdated)
```
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import SGDClassifier\n",
```
I'm not sure SGDClassifier is relevant according to the four categories in https://ml.dask.org/hyper-parameter-search.html. It's a linear model with 6 features; I don't know if I'd label that as "compute constrained."
I think there are a couple of options:
- Use a more computationally constrained model (e.g., MLPClassifier or PyTorch). I might use an MLPClassifier, then say "Realistically, a PyTorch model might be used. To do that, ... (skorch) ...." (See the sketch after this list.)
- Use IncrementalSearchCV. I think this is the appropriate search for the example as written: it's memory-constrained, not compute-constrained.
- Search over more hyperparameters. This would make the example more computationally constrained; it'd require a higher max_iter in Hyperband.
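A sketch of the first option (hedged: the architecture grid and distributions below are made up for illustration):

```python
from scipy.stats import loguniform
from sklearn.neural_network import MLPClassifier
from dask_ml.model_selection import HyperbandSearchCV

# A more compute-constrained model: searching over architecture plus
# optimization hyperparameters is where Hyperband's early stopping helps.
model = MLPClassifier()
params = {
    "hidden_layer_sizes": [(32,), (64,), (32, 32), (64, 32)],
    "alpha": loguniform(1e-6, 1e-2),
    "batch_size": [32, 64, 128],
}
search = HyperbandSearchCV(model, params, max_iter=243)
# search.fit(X_train, y_train, classes=[0, 1])
```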
I'd also point to the docs to make sure the users know why they're using Dask: https://ml.dask.org/hyper-parameter-search.html
Yeah, I think I chose SGDClassifier just because it was simple. These are all black boxes to me, so I chose the simplest black box about which I could find the most examples :)
hyper-parameter-optimization.ipynb (Outdated)
```
" \"store_and_fwd_flag\": \"category\",\n",
" \"PULocationID\": \"UInt16\",\n",
" \"DOLocationID\": \"UInt16\", \n",
" \"payment_type\": \"UInt8\",\n",
```
Nit: some columns included here are never seen again, like PULocationID.
Yeah, this was a copy-paste job from another notebook. I should probably drop some of these columns with usecols=, I guess.
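That might look like this (a sketch; the file path is hypothetical, and the column list is just the subset the example actually uses later):

```python
import dask.dataframe as dd

cols = ["passenger_count", "trip_distance", "RatecodeID",
        "payment_type", "fare_amount", "tip_amount"]
df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",  # hypothetical path
    usecols=cols,                 # only read the columns we actually use
    blocksize="16 MiB",
)
```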
hyper-parameter-optimization.ipynb (Outdated)
```
" blocksize=\"16 MiB\",\n",
")\n",
"\n",
"data = df[[\"passenger_count\", \"trip_distance\", \"RatecodeID\", \"payment_type\", \"fare_amount\"]]\n",
```
Nit: RatecodeID is categorical with 5 categories according to the column description docs. Maybe OneHotEncoder should be used on that column?

```python
from dask_ml.preprocessing import OneHotEncoder

# Dask-ML's OneHotEncoder wants a 2-D, categorical-dtype input
rate_indicators = OneHotEncoder().fit_transform(df[["RatecodeID"]].categorize())
# put rate_indicators back into df
```
hyper-parameter-optimization.ipynb (Outdated)
```
"data = df[[\"passenger_count\", \"trip_distance\", \"RatecodeID\", \"payment_type\", \"fare_amount\"]]\n",
"data = data.fillna(0)\n",
"\n",
"labels = (df.tip_amount / df.fare_amount) > 0.25\n",
```
Nit: I might predict taxi trip duration to mirror https://www.kaggle.com/c/nyc-taxi-trip-duration/. That would imply a regression problem, not a classification problem.
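If the example went that route, the regression labels might be built like this (a sketch; the datetime column names are assumptions based on the standard yellow-taxi schema, and assume those columns were parsed with parse_dates=):

```python
# Target: trip duration in seconds, mirroring the Kaggle problem
duration = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
labels = duration.dt.total_seconds()
```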
Oh cool. It would be nice to reflect an existing Kaggle problem.
And maybe at the end show how many minutes you are off:

```python
import numpy as np
import pandas as pd

pred_time = model.predict(X_test)    # predicted trip durations
err = np.abs(pred_time - real_time)  # absolute error per trip
pd.Series(err).plot.hist()
```
hyper-parameter-optimization.ipynb (Outdated)
```
"metadata": {},
"outputs": [],
"source": [
"search.score(X_test.sample(frac=0.1, random_state=123), y_test.sample(frac=0.1, random_state=123))"
```
Maybe add a comment on why frac=0.1 is used?
Yeah, this was interesting, and something that we might want to think about in Dask-ML.
My current understanding is that search.score calls a scikit-learn scorer on the inputs, and so these are brought into local memory. I imagine that this is because we haven't made dask-compatible scorers for everything. Is that correct?
Is ParallelPostFit relevant? It takes a trained model and maps the score/predict functions to each chunk.
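Something like this, maybe (a sketch; `search` is the fitted HyperbandSearchCV from earlier):

```python
from dask_ml.wrappers import ParallelPostFit

# Wrap the already-trained best model. predict/score are then mapped
# over each chunk on the workers instead of pulling the test set into
# local memory.
clf = ParallelPostFit(search.best_estimator_)
y_hat = clf.predict(X_test)  # lazy dask collection
```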
Maybe? (cc @TomAugspurger)
My sense is that scorers will likely need to be handled one at a time, and that there isn't an obvious way to map them all automatically. It looks like there is a mapping in dask_ml/model_selection/scorer.py. Maybe ParallelPostFit uses that. If so, IncrementalSearchCV and friends (Hyperband) should maybe use the same tricks?
> If so, IncrementalSearchCV and friends (Hyperband) should maybe use the same tricks?
Wrapping it in ParallelPostFit should do the trick, but every key in the hyperparameter dict params would need to be prepended with `estimator__`.
I'll think more about doing this automatically. My initial reaction is "no", since the default is to fall back to the estimator's default scoring, and I wouldn't want to complicate that.
One thing we should be doing is making something like Hyperband(..., scoring="accuracy") work. Right now we use sklearn.metrics.check_scoring; if that used dask_ml.metrics.check_scoring instead, things would work. I'll open an issue.
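For illustration, the key-prefixing amounts to this (a hypothetical sketch using scikit-learn's nested-parameter convention):

```python
params = {"alpha": 0.1, "l1_ratio": 0.5}

# The wrapped model lives under the `estimator` attribute, so every
# search key needs the `estimator__` prefix:
wrapped_params = {f"estimator__{k}": v for k, v in params.items()}
# {'estimator__alpha': 0.1, 'estimator__l1_ratio': 0.5}
```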
> Is ParallelPostFit relevant?
I meant this: ParallelPostFit(search.best_estimator_).score(X_test, y_test).
That would be very, very welcome :)
I've made some edits. A summary of the changes:
This is rough; it's far from a polished draft. @mrocklin, let me know what questions you have.
Oh, cool. This is fun to play with.
Thoughts on using PyTorch/skorch here instead? Would that make things much more complex?
I think that pointing to docs for a lot of this is good. I like the idea of using Hyperband here, but I don't like the idea of explaining all of Hyperband's knobs in a first-exposure example like this. I'm curious: are the defaults bad in this case? Would it be OK to omit the extra parameters here, or do we need to expose them for things to make sense? I ran into an issue with the …
You're talking about … I'd still link to the rule of thumb (probably the one in the example; the one in the docstring is hard to link to). I'd also add a note, something like: "If you want to sample more parameters or train your models for longer, look at HyperbandSearchCV's rule of thumb. Luckily, it's simple and only requires knowing how many hyperparameters to sample and how long to train the model."
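For reference, the rule of thumb in the Dask-ML docs amounts to roughly this (a sketch; the 50-passes figure is an illustrative stand-in for "how long to train the model"):

```python
n_params = 81                    # how many hyperparameters to sample
n_examples = 50 * len(X_train)   # how long to train the best model

max_iter = n_params
chunk_size = n_examples // n_params  # rechunk the training data to this size
search = HyperbandSearchCV(model, params, max_iter=max_iter)
```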
👍 I think it'd be nice to have PyTorch; we don't have a PyTorch + Hyperband example in dask-examples yet. I suspect your users don't want to be tied to Scikit-Learn; a PyTorch example would allow users more freedom. Looking at skorch's getting started guide, it'd amount to this much code:

```python
from skorch import NeuralNetRegressor
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class HiddenLayerNet(nn.Module):
    def __init__(self, n_features=10, n_outputs=1, n_hidden=100):
        super().__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_outputs)

    def forward(self, X, **kwargs):
        return self.fc2(F.relu(self.fc1(X)))

net = NeuralNetRegressor(
    module=HiddenLayerNet,
    module__n_hidden=200,
    optimizer=optim.SGD,
    optimizer__lr=0.1,
    max_epochs=10,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)
```

PyTorch modules require float32 input. I'd convert the dataset first.
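For example:

```python
# skorch/PyTorch expect float32 inputs; convert before fitting
X_train = X_train.astype("float32")
y_train = y_train.astype("float32")
```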
FWIW, while I suspect many researchers find those questions simple to answer, I suspect that many practitioners don't have good answers. I think one of the reasons Scikit-Learn became popular is that many things worked out of the box with sensible defaults. I wonder if there is a good default solution in this case (that's probably a problem to solve later, though).
If you're interested in writing this up I'd be in favor. (I'm really just trying to get as much free labor as I can out of you :) )
I've integrated PyTorch. I didn't have time to debug an issue I ran into: the output of ParallelPostFit(search.best_estimator).predict(X_test) is reported by Dask to be (100,), but when I compute it, it's actually (100, 50).
I'll take a look!
I think ParallelPostFit.predict is just incorrect at https://github.com/dask/dask-ml/blob/5c3179eb7eaa6bf830e0b6df162902f805a9b3c0/dask_ml/wrappers.py#L275: it doesn't handle multi-dimensional output.
I'd hoped that we could pass an empty array to `.predict()` to find the output shape, but at least some scikit-learn estimators validate that the array is non-empty.
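A sketch of that empty-array idea (hypothetical; as noted above, some estimators would reject this input):

```python
import numpy as np

# Probe the fitted estimator with a zero-row batch to discover the
# output shape for Dask's metadata. This fails on estimators that
# validate the array is non-empty.
probe = np.empty((0, X_test.shape[1]), dtype="float32")
output_shape = search.best_estimator_.predict(probe).shape[1:]
```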
Thanks for your work on this @mrocklin @stsievert @TomAugspurger! I pushed a few small updates. Namely:
Traceback:

```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-9db7871d3c39> in <module>
1 y_train2 = y_train.reshape(-1, 1).persist()
----> 2 search.fit(X_train, y_train2)
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in fit(self, X, y, **fit_params)
700 client = default_client()
701 if not client.asynchronous:
--> 702 return client.sync(self._fit, X, y, **fit_params)
703 return self._fit(X, y, **fit_params)
704
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
831 else:
832 return sync(
--> 833 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
834 )
835
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
337 if error[0]:
338 typ, exc, tb = error[0]
--> 339 raise exc.with_traceback(tb)
340 else:
341 return result[0]
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/distributed/utils.py in f()
321 if callback_timeout is not None:
322 future = asyncio.wait_for(future, callback_timeout)
--> 323 result[0] = yield future
324 except Exception as exc:
325 error[0] = sys.exc_info()
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_hyperband.py in _fit(self, X, y, **fit_params)
400
401 _SHAs = await asyncio.gather(
--> 402 *[SHAs[b]._fit(X, y, **fit_params) for b in _brackets_ids]
403 )
404 SHAs = {b: SHA for b, SHA in zip(_brackets_ids, _SHAs)}
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in _fit(self, X, y, **fit_params)
658 random_state=self.random_state,
659 verbose=self.verbose,
--> 660 prefix=self.prefix,
661 )
662 results = self._process_results(results)
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in fit(model, params, X_train, y_train, X_test, y_test, additional_calls, fit_params, scorer, random_state, verbose, prefix)
476 random_state=random_state,
477 verbose=verbose,
--> 478 prefix=prefix,
479 )
480
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in _fit(model, params, X_train, y_train, X_test, y_test, additional_calls, fit_params, scorer, random_state, verbose, prefix)
260 # async for future, result in seq:
261 for _i in itertools.count():
--> 262 metas = await client.gather(new_scores)
263
264 if log_delay and _i % int(log_delay) == 0:
~/miniforge3/envs/coiled-jrbourbeau-pytorch/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1845 exc = CancelledError(key)
1846 else:
-> 1847 raise exception.with_traceback(traceback)
1848 raise exc
1849 if errors == "skip":
/opt/conda/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in _score()
/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_scorer.py in _passthrough_scorer()
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in score()
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f()
/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_regression.py in r2_score()
/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_regression.py in _check_reg_targets()
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f()
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array()
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite()
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
```
Some thoughts!
When I reduce the partition size I get:

Traceback:

```
[CV, bracket=4] creating 81 models
[CV, bracket=3] creating 34 models
[CV, bracket=2] creating 15 models
[CV, bracket=1] creating 8 models
[CV, bracket=0] creating 5 models
[CV, bracket=0] For training there are between 46291 and 98195 examples in each chunk
[CV, bracket=2] For training there are between 46291 and 98195 examples in each chunk
[CV, bracket=3] For training there are between 46291 and 98195 examples in each chunk
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-30-9db7871d3c39> in <module>
1 y_train2 = y_train.reshape(-1, 1).persist()
----> 2 search.fit(X_train, y_train2)
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in fit(self, X, y, **fit_params)
700 client = default_client()
701 if not client.asynchronous:
--> 702 return client.sync(self._fit, X, y, **fit_params)
703 return self._fit(X, y, **fit_params)
704
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
831 else:
832 return sync(
--> 833 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
834 )
835
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
337 if error[0]:
338 typ, exc, tb = error[0]
--> 339 raise exc.with_traceback(tb)
340 else:
341 return result[0]
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/distributed/utils.py in f()
321 if callback_timeout is not None:
322 future = asyncio.wait_for(future, callback_timeout)
--> 323 result[0] = yield future
324 except Exception as exc:
325 error[0] = sys.exc_info()
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_hyperband.py in _fit(self, X, y, **fit_params)
400
401 _SHAs = await asyncio.gather(
--> 402 *[SHAs[b]._fit(X, y, **fit_params) for b in _brackets_ids]
403 )
404 SHAs = {b: SHA for b, SHA in zip(_brackets_ids, _SHAs)}
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in _fit(self, X, y, **fit_params)
658 random_state=self.random_state,
659 verbose=self.verbose,
--> 660 prefix=self.prefix,
661 )
662 results = self._process_results(results)
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in fit(model, params, X_train, y_train, X_test, y_test, additional_calls, fit_params, scorer, random_state, verbose, prefix)
476 random_state=random_state,
477 verbose=verbose,
--> 478 prefix=prefix,
479 )
480
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in _fit(model, params, X_train, y_train, X_test, y_test, additional_calls, fit_params, scorer, random_state, verbose, prefix)
260 # async for future, result in seq:
261 for _i in itertools.count():
--> 262 metas = await client.gather(new_scores)
263
264 if log_delay and _i % int(log_delay) == 0:
~/miniconda/envs/coiled-coiled-examples-pytorch/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1845 exc = CancelledError(key)
1846 else:
-> 1847 raise exception.with_traceback(traceback)
1848 raise exc
1849 if errors == "skip":
/opt/conda/lib/python3.7/site-packages/dask_ml/model_selection/_incremental.py in _score()
/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_scorer.py in _passthrough_scorer()
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in score()
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f()
/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_regression.py in r2_score()
/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_regression.py in _check_reg_targets()
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f()
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array()
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite()
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
[CV, bracket=1] For training there are between 46291 and 98195 examples in each chunk
```
I think the NaNs are happening because poor hyperparameters are chosen and the loss is climbing to infinity. I got the same error when my step size was too large (a classic result in optimization). Returning a dummy loss when the loss gets large avoids the NaNs for me:

```python
import torch
from skorch import NeuralNetRegressor

class NoNaNs(NeuralNetRegressor):
    def get_loss(self, y_pred, y_true, X=None, training=False):
        # If the model has diverged, return a constant loss (with zero
        # gradients) instead of letting the loss blow up to NaN/inf.
        if (y_true - y_pred).abs().mean() > 1e6:
            return torch.tensor([0.0], requires_grad=True)
        return super().get_loss(y_pred, y_true, X=X, training=training)

model = NoNaNs(module=HiddenLayerNet, ..., **niceties)
```

I think this issue should be reported upstream to skorch.
(edit) I haven't tested it, but it also might work to have …
Thanks for the NoNaNs fix @stsievert! That solved the issue for me too.
FYI I removed the "Visualization" and "Why not simply sampling instead?" sections as, while I found them to be informative, they take several minutes to execute. This also lets us avoid any issues with ParallelPostFit.predict.
Scott, should we make HyperbandCV robust to NaNs? Is there an obvious way to do this? Treat them as bad results that should be dropped?
I think that's a good idea. I might add infinite losses too. The obvious way to catch the NaN error is to use a try/except block around the … HyperbandSearchCV doesn't get an opportunity to see the output of …
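A hypothetical sketch of that try/except idea at the estimator level (the class name and fallback score are made up, and this is untested):

```python
import numpy as np
from skorch import NeuralNetRegressor

class RobustNet(NeuralNetRegressor):
    # Swallow NaN/inf scoring errors and report a very bad (but finite)
    # score so the search simply drops this model.
    def score(self, X, y):
        try:
            s = super().score(X, y)
        except ValueError:  # "Input contains NaN, infinity..."
            return -1e12
        return s if np.isfinite(s) else -1e12
```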
Thanks all for your work on this example! I'm going to merge this PR and we can fine-tune with follow-up PRs. Thanks again!