[pytorch][tensorflow][build][test] Build OpenMPI without libfabric support #1095

indhub · 2021-05-07T03:43:25Z

PR Checklist

I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
(If applicable) I've documented below the DLC image/dockerfile this relates to
(If applicable) I've documented below the tests I've run on the DLC image
(If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
(If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

(If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
(If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
(If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
(If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

EIA/NEURON Checklist

When creating a PR:

I've modified src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True

When PR is reviewed and ready to be merged:

I've reverted the code change on the config file mentioned above

Benchmark Checklist

When creating a PR:

I've modified src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True

When PR is reviewed and ready to be merged:

I've reverted the code change on the config file mentioned above

Reviewer Checklist

For reviewer, before merging, please cross-check:

I've verified the code change on the config file mentioned above has already been reverted

Description:
Build OpenMPI without libfabric support

Tests run:

DLC image/dockerfile:

Additional context:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

junpuf

LGTM

pytorch/training/docker/1.8/py3/cu111/Dockerfile.gpu

junpuf

LGTM

saimidu · 2021-05-07T07:15:00Z

pytorch/training/docker/1.8/py3/cu111/Dockerfile.gpu

+# Install OpenMPI without libfabric support
+RUN mkdir /tmp/openmpi && \


nit: Please change the formatting of this RUN statement block to match the blocks in lines 124-144 and 146-148.

saimidu · 2021-05-07T07:16:32Z

tensorflow/training/docker/2.4/py3/cu110/Dockerfile.gpu

+# Install OpenMPI without libfabric support
+RUN mkdir /tmp/openmpi && \


nit: Please change the formatting of this RUN statement block to match the blocks in lines 132-140 and 146-150.

saimidu · 2021-05-07T07:17:46Z

tensorflow/training/docker/2.4/py3/cu110/Dockerfile.gpu

  && rm -rf /tmp/nccl

-# Install EFA along with AWS OPEN_MPI
+# Install EFA along without AWS OPEN_MPI


With this change, will we still need RDMAV_FORK_SAFE=1 to be set in the Dockerfile's environment variables?

indhub

RDMAV_FORK_SAFE is required if a program that uses EFA needs to be able to call fork. With our change to MPI, programs that use only MPI no longer need this flag, but programs that use NCCL or Herring still do. If we don't set this flag, programs that use NCCL or Herring and call fork will crash.
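
As a minimal sketch of what keeping the flag could look like, assuming it is set through an `ENV` instruction (the actual line and its placement in the DLC Dockerfiles are not shown in this thread):

```dockerfile
# Assumption: illustrative fragment, not copied from the Dockerfiles in this PR.
# Keeps fork safety enabled for programs that use EFA through NCCL or Herring
# and call fork, as discussed above.
ENV RDMAV_FORK_SAFE=1
```

Setting the variable at image level means every process started in the container inherits it, so individual training scripts would not need to export it themselves.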

pytorch/training/docker/1.8/py3/cu111/Dockerfile.gpu

jeet4320 · 2021-05-07T21:44:59Z

Manually ran dlc-pr-pytorch_dlc-pr-ec2-test and it passed

Build OpenMPI without libfabric support

9ab7273

junpuf previously approved these changes May 7, 2021

Keep RDMAV_FORK_SAFE=1

32b915e

indhub dismissed junpuf’s stale review via 32b915e May 7, 2021 04:10

junpuf reviewed May 7, 2021

junpuf approved these changes May 7, 2021

saimidu reviewed May 7, 2021

Merge branch 'master' into openmpi

9c33872

jeet4320 merged commit 33037e9 into aws:master May 7, 2021
