Fix the distribute CPU training error when validation data can't be well divided #399

haixiw · 2023-06-21T22:01:04Z

Issue #, if available:
tt: V933345598
Description of changes:

Here's the doc about the requirements on input data for Distributed CPU training:
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#Instance-XGBoost-distributed-training-cpu

We already have logic to exclude instances from training when they have no training data.
For example, if there are 5 training data files but the customer launches 6 instances, the one instance without training data is excluded from the distributed training job.

This change adds the same logic for validation data when the customer has set a validation channel. The error occurs when some instances have validation data and others don't, which crashes the eval metric calculation and the MapReduce process across all instances. With this change, that failing scenario is handled.
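
The exclusion rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `HostData`, `select_participating_hosts`, and the `algo-N` host names are hypothetical, and the sketch assumes that "the same logic" means a host without validation data is excluded whenever the validation channel is set.

```python
from typing import Dict, List, NamedTuple


class HostData(NamedTuple):
    """Hypothetical per-host summary of which channels produced data."""
    has_train: bool
    has_validation: bool


def select_participating_hosts(
    hosts: Dict[str, HostData], validation_channel_set: bool
) -> List[str]:
    """Return the hosts that should take part in distributed training.

    Assumption: a host is excluded if it has no training data, or if the
    validation channel is set and the host has no validation data.
    """
    selected = []
    for name in sorted(hosts):
        data = hosts[name]
        if not data.has_train:
            continue  # existing behavior: no training data on this host
        if validation_channel_set and not data.has_validation:
            continue  # assumed new behavior: no validation data on this host
        selected.append(name)
    return selected


hosts = {
    "algo-1": HostData(has_train=True, has_validation=True),
    "algo-2": HostData(has_train=True, has_validation=False),
    "algo-3": HostData(has_train=False, has_validation=False),
}
print(select_participating_hosts(hosts, validation_channel_set=True))
```

In this sketch every remaining host has both training and validation data, so each one can contribute to the eval metric calculation.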

Testing:

Tested with customer notebook
All references can be found in tt: V933345598

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

haixiw added 3 commits June 20, 2023 21:45

test log

614f173

Fix the distribute CPU training error when validation data can't be well divided

9be1569

remove some logs

b2e4563

haixiw requested a review from mabunday June 21, 2023 22:04

fix flake8

69e1e6f

haixiw marked this pull request as ready for review June 22, 2023 04:41

haixiw marked this pull request as draft June 22, 2023 04:44

minor change

fa371c7

haixiw marked this pull request as ready for review June 22, 2023 17:10

haixiw requested a review from a team June 22, 2023 22:58

mabunday approved these changes Jun 22, 2023

malav-shastri approved these changes Jun 23, 2023

haixiw merged commit d49a4c3 into master Jun 23, 2023

haixiw deleted the dist_cpu branch September 27, 2023 23:34
