@haixiw haixiw commented Jun 21, 2023

Issue #, if available:
tt: V933345598
Description of changes:

Here's the doc about the requirements on input data for Distributed CPU training:
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#Instance-XGBoost-distributed-training-cpu

We already exclude an instance from training when it has no training data.
For example, if the customer provides 5 training data files but launches 6 instances, the one instance without training data is excluded from the distributed training job.

This change adds the same logic for validation data when the customer has set a validation channel. The error occurs when some instances have validation data and others don't, which crashes the eval metric calculation and the MapReduce process across all instances. With this change, that failing scenario is handled.
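
The participation rule described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `include_in_training`, its parameters, the host names, and the file names are all hypothetical, and it assumes each instance already knows which files it received on each channel.

```python
from typing import Dict, List, Sequence, Tuple


def include_in_training(
    train_files: Sequence[str],
    validation_files: Sequence[str],
    validation_channel_set: bool,
) -> bool:
    """Decide whether one instance joins the distributed job (hypothetical helper)."""
    has_train = len(train_files) > 0
    if not validation_channel_set:
        # Existing behavior: only the training data matters.
        return has_train
    # Assumed new behavior: with a validation channel, the instance needs both.
    return has_train and len(validation_files) > 0


# Hypothetical layout: 6 instances, 5 training files, 4 validation files.
hosts: Dict[str, Tuple[List[str], List[str]]] = {
    "algo-1": (["train-0.csv"], ["val-0.csv"]),
    "algo-2": (["train-1.csv"], ["val-1.csv"]),
    "algo-3": (["train-2.csv"], ["val-2.csv"]),
    "algo-4": (["train-3.csv"], ["val-3.csv"]),
    "algo-5": (["train-4.csv"], []),  # training data, no validation data
    "algo-6": ([], []),               # no data at all
}

with_validation = [
    name for name, (train, val) in hosts.items()
    if include_in_training(train, val, validation_channel_set=True)
]
without_validation = [
    name for name, (train, val) in hosts.items()
    if include_in_training(train, val, validation_channel_set=False)
]

print(with_validation)
print(without_validation)
```

In this sketch, the instance that has training data but no validation data drops out only when a validation channel is set, so every remaining instance can contribute to the eval metric.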

Testing:

Tested with customer notebook
All references can be found in tt: V933345598

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@haixiw haixiw requested a review from mabunday June 21, 2023 22:04
@haixiw haixiw marked this pull request as ready for review June 22, 2023 04:41
@haixiw haixiw marked this pull request as draft June 22, 2023 04:44
@haixiw haixiw marked this pull request as ready for review June 22, 2023 17:10
@haixiw haixiw requested a review from a team June 22, 2023 22:58
@haixiw haixiw merged commit d49a4c3 into master Jun 23, 2023
haixiw added a commit that referenced this pull request Jul 27, 2023
…ell divided (#399)

* test log

* Fix the distribute CPU training error when validation data can't be well divided

* remove some logs

* fix flake8

* minor change
@haixiw haixiw deleted the dist_cpu branch September 27, 2023 23:34
3 participants