
Conversation

@karan6181
Contributor

  • Updated smdataparallel binary to support EFA

Issue #, if available:

PR Checklist

  • I've prepended the PR tag with the frameworks/jobs this applies to: [mxnet, tensorflow, pytorch] | [ei/neuron] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

  • (If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
  • (If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
  • (If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
  • (If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type
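As a sketch of how these markers combine on a single test: the marker names and argument conventions come from the checklist above, while the test name, node count, and body below are hypothetical.

```python
import pytest

# Hypothetical multi-node GPU test illustrating the checklist markers.
# Only the marker names follow this repository's conventions; the test
# itself is a made-up placeholder.
@pytest.mark.model("N/A")                    # no Deep Learning model is exercised
@pytest.mark.integration("smdataparallel")   # feature under test
@pytest.mark.multinode(2)                    # number of nodes for a multi-node test
@pytest.mark.processor("gpu")                # applicable to GPU instances only
def test_smdataparallel_efa_smoke():
    assert True  # placeholder body
```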

EIA/NEURON Checklist

  • When creating a PR:
  • I've modified src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True
  • When PR is reviewed and ready to be merged:
  • I've reverted the code change on the config file mentioned above
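The mode toggles above live in a plain Python config module. A minimal sketch of the relevant flags in src/config/build_config.py, assuming the flag names from the checklist; the surrounding file contents are not shown here.

```python
# Sketch of the mode flags in src/config/build_config.py.
# Flag names come from the PR checklist; everything else is illustrative.

# Set to True in the PR branch to build Elastic Inference (EI) images.
ENABLE_EI_MODE = True

# Set to True instead to build Neuron images.
ENABLE_NEURON_MODE = False

# Per the checklist, both flags must be reverted before the PR is merged.
```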

Benchmark Checklist

  • When creating a PR:
  • I've modified src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True
  • When PR is reviewed and ready to be merged:
  • I've reverted the code change on the config file mentioned above

Reviewer Checklist

  • For the reviewer, before merging, please cross-check:
  • I've verified the code change on the config file mentioned above has already been reverted

Description:

Tests run:

DLC image/dockerfile:

Additional context:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jeet4320
Contributor

TF2 is failing due to a capacity issue:

E sagemaker.exceptions.UnexpectedStatusException: Error for Training job test-tf-smdataparallel-1619652004-96b6: Failed. Reason: CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.

@jeet4320
Contributor

All PyTorch tests are passing on the first commit of this PR.

@jeet4320
Contributor

All TF2 tests are green as well

@karan6181
Contributor Author

> All TF2 tests are green as well

Is the TF1 CI hung?

@jeet4320
Contributor

I have cancelled those since they are not needed and they unnecessarily take up instance capacity.

@jeet4320
Contributor

Will merge the PR today, as all needed tests are passing.

* aws/master:
  [tensorflow][build][test] update TF2.3 for pillow to 8.2.0 (#1072)
  [test][huggingface_pytorch] Updated number of tests in smmp test to 500 and version for git script (#1069)
  [pytorch][release] Release pt1.6 Inference cpu, gpu and training cpu (#1074)
@jeet4320 jeet4320 merged commit faa3383 into aws:master Apr 29, 2021
@karan6181 karan6181 deleted the smddp_version_upgrade branch April 29, 2021 18:40