[pytorch][tensorflow][build][test] Build OpenMPI without libfabric support #1095
LGTM
LGTM
# Install OpenMPI without libfabric support
RUN mkdir /tmp/openmpi && \
nit: Please change the formatting of this RUN statement block to be similar to the blocks in lines 124-144 and 146-148.
# Install OpenMPI without libfabric support
RUN mkdir /tmp/openmpi && \
nit: Please change the formatting of this RUN statement block to be similar to the blocks in lines 132-140 and 146-150.
  && rm -rf /tmp/nccl

- # Install EFA along with AWS OPEN_MPI
+ # Install EFA along without AWS OPEN_MPI
With this change, will we still need RDMAV_FORK_SAFE=1 to be set in the Dockerfile's env variables?
RDMAV_FORK_SAFE is required if a program that uses EFA needs to be able to call fork. Given our change to MPI, programs that only use MPI don't need this flag anymore. But programs that use NCCL or Herring still need it: if we don't set this flag, programs that use NCCL or Herring and call fork will crash.
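For reference, a minimal sketch of what keeping the flag looks like, assuming it is set through an ENV instruction as the question above suggests. Where exactly it sits in these Dockerfiles is not shown in this thread, so treat the placement as an assumption:

```dockerfile
# Assumed placement: alongside the image's other ENV settings.
# Keeps fork() safe for processes that use EFA through NCCL or Herring;
# MPI-only programs no longer depend on it after this change.
ENV RDMAV_FORK_SAFE=1
```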
Manually ran |
PR Checklist
Pytest Marker Checklist
@pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
@pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
@pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
@pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type
EIA/NEURON Checklist
src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True
Benchmark Checklist
src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True
Reviewer Checklist
Description:
Build OpenMPI without libfabric support
Tests run:
DLC image/dockerfile:
Additional context:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.