[pytorch][tensorflow][build][test] Build OpenMPI without libfabric support #1095
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,6 +34,7 @@ ENV CUDNN_VERSION=8.0.5.39 | |
ENV NCCL_VERSION=2.7.8 | ||
ENV HOROVOD_VERSION=0.21.3 | ||
ENV EFA_VERSION=1.11.2 | ||
ENV OMPI_VERSION=4.1.1 | ||
ENV BRANCH_OFI=1.1.1 | ||
ENV DGLBACKEND=pytorch | ||
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" | ||
|
@@ -93,17 +94,30 @@ RUN cd /tmp \ | |
&& make -j64 src.build BUILDDIR=/usr/local \ | ||
&& rm -rf /tmp/nccl | ||
|
||
# Install EFA along with AWS OPEN_MPI | ||
# Install EFA without AWS OPEN_MPI | ||
RUN mkdir /tmp/efa \ | ||
&& cd /tmp/efa \ | ||
&& curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-${EFA_VERSION}.tar.gz \ | ||
&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ | ||
&& cd aws-efa-installer \ | ||
&& ./efa_installer.sh -y --skip-kmod -g \ | ||
&& rm -rf $OPEN_MPI_PATH \ | ||
&& rm -rf /tmp/efa \ | ||
&& rm -rf /tmp/aws-efa-installer-${EFA_VERSION}.tar.gz | ||
|
||
RUN echo "pml = ob1" >> $OPEN_MPI_PATH/etc/openmpi-mca-params.conf | ||
# Install OpenMPI without libfabric support | ||
RUN mkdir /tmp/openmpi && \ | ||
Comment on lines +108 to +109: nit: Please change the formatting of this
||
cd /tmp/openmpi && \ | ||
wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz && \ | ||
tar zxf openmpi-${OMPI_VERSION}.tar.gz && \ | ||
cd openmpi-${OMPI_VERSION} && \ | ||
./configure --enable-orterun-prefix-by-default --prefix=$OPEN_MPI_PATH && \ | ||
make -j $(nproc) all && \ | ||
make install && \ | ||
ldconfig && \ | ||
cd / && \ | ||
rm -rf /tmp/openmpi | ||
|
||
ENV PATH="$OPEN_MPI_PATH/bin:$PATH" | ||
ENV LD_LIBRARY_PATH=$OPEN_MPI_PATH/lib/:$EFA_PATH/lib/:$LD_LIBRARY_PATH | ||
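The download step above hardcodes the `v4.1` release directory while the tarball name comes from `OMPI_VERSION`, so moving to another release series would also mean editing the URL by hand. A minimal sketch of deriving the directory from the version with POSIX parameter expansion follows. The function name `ompi_tarball_url` is hypothetical, and the sketch assumes the `v<major>.<minor>` directory layout seen in the URL above also holds for the target release.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): build the Open MPI tarball URL
# from a version string such as 4.1.1.
ompi_tarball_url() {
    version="$1"
    # Strip the last dot-separated field to get the release series: 4.1.1 -> 4.1
    series="${version%.*}"
    printf 'https://download.open-mpi.org/release/open-mpi/v%s/openmpi-%s.tar.gz\n' \
        "$series" "$version"
}

# Falls back to the version pinned in this PR when OMPI_VERSION is unset.
ompi_tarball_url "${OMPI_VERSION:-4.1.1}"
```

Inside the `RUN` command the same expansion could probably be written inline as `v${OMPI_VERSION%.*}`, since the shell form of `RUN` hands the line to the shell for expansion.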
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,6 +31,7 @@ ARG OPEN_MPI_PATH=/opt/amazon/openmpi | |
ARG EFA_PATH=/opt/amazon/efa | ||
ARG NCCL_VERSION=2.7.8 | ||
ARG EFA_VERSION=1.11.2 | ||
ARG OMPI_VERSION=4.1.1 | ||
ARG BRANCH_OFI=1.1.1 | ||
|
||
ARG TF_URL=https://aws-tensorflow-binaries.s3-us-west-2.amazonaws.com/tensorflow/r2.4_aws/20210127-150238/gpu/py37/cu110/tensorflow_gpu-2.4.1-cp37-cp37m-manylinux2010_x86_64.whl | ||
|
@@ -104,16 +105,30 @@ RUN cd /tmp \ | |
&& make -j64 src.build BUILDDIR=/usr/local \ | ||
&& rm -rf /tmp/nccl | ||
|
||
# Install EFA along with AWS OPEN_MPI | ||
# Install EFA along without AWS OPEN_MPI | ||
Comment: With this, will we still require
|
||
RUN mkdir /tmp/efa \ | ||
&& cd /tmp/efa \ | ||
&& curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-$EFA_VERSION.tar.gz \ | ||
&& tar -xf aws-efa-installer-$EFA_VERSION.tar.gz \ | ||
&& curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-${EFA_VERSION}.tar.gz \ | ||
&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ | ||
&& cd aws-efa-installer \ | ||
&& ./efa_installer.sh -y --skip-kmod -g \ | ||
&& rm -rf $OPEN_MPI_PATH \ | ||
&& rm -rf /tmp/efa \ | ||
&& rm -rf /tmp/aws-efa-installer-${EFA_VERSION}.tar.gz | ||
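The cleanup above passes `$OPEN_MPI_PATH` to `rm -rf` unquoted. If that variable were ever empty, the command would likely succeed without removing anything, and the Open MPI bundled with the EFA installer would stay in place unnoticed. A defensive sketch, assuming one wants the step to fail loudly in that case; `safe_rm_dir` is a hypothetical helper, not something this PR defines.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): remove a directory tree, but
# refuse an empty path or the filesystem root.
safe_rm_dir() {
    dir="$1"
    case "$dir" in
        ""|"/")
            echo "refusing to remove '$dir'" >&2
            return 1
            ;;
    esac
    # "--" stops option parsing so a path starting with "-" is not read as a flag.
    rm -rf -- "$dir"
}

# Intended use in the EFA step (shown as a comment so the sketch has no side effects):
#   safe_rm_dir "$OPEN_MPI_PATH"
```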
|
||
# Install OpenMPI without libfabric support | ||
RUN mkdir /tmp/openmpi && \ | ||
Comment on lines +119 to +120: nit: Please change the formatting of this
||
cd /tmp/openmpi && \ | ||
wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz && \ | ||
tar zxf openmpi-${OMPI_VERSION}.tar.gz && \ | ||
cd openmpi-${OMPI_VERSION} && \ | ||
./configure --enable-orterun-prefix-by-default --prefix=$OPEN_MPI_PATH && \ | ||
make -j $(nproc) all && \ | ||
make install && \ | ||
ldconfig && \ | ||
cd / && \ | ||
rm -rf /tmp/openmpi | ||
|
||
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \ | ||
&& tar -xzf boost_1_73_0.tar.gz \ | ||
&& cd boost_1_73_0 \ | ||
|
@@ -141,7 +156,6 @@ RUN echo "hwloc_base_binding_policy = none" >> $OPEN_MPI_PATH/etc/openmpi-mca-pa | |
|
||
# Set default NCCL parameters | ||
RUN echo NCCL_DEBUG=INFO >> /etc/nccl.conf | ||
RUN echo "pml = ob1" >> $OPEN_MPI_PATH/etc/openmpi-mca-params.conf | ||
ENV LD_LIBRARY_PATH=$OPEN_MPI_PATH/lib/:$EFA_PATH/lib/:$LD_LIBRARY_PATH | ||
# /usr/local/lib/libpython* needs to be accessible for dynamic linking | ||
ENV LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH | ||
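One way to check that the rebuilt Open MPI really has no libfabric (OFI) support is to look for `ofi` components in the output of `ompi_info`. The sketch below assumes that `ompi_info` from the new build is on `PATH` and that it lists components in its usual `MCA <framework>: <component> (...)` form. `has_no_ofi_components` is a hypothetical helper, not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): reads `ompi_info` output on stdin
# and succeeds only if no MCA component named "ofi" is listed.
has_no_ofi_components() {
    ! grep -Eq '^[[:space:]]*MCA [[:alnum:]_]+: ofi([[:space:]]|$)'
}

# Run the check only where ompi_info is actually installed.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_no_ofi_components; then
        echo "no OFI components found"
    else
        echo "OFI components are present" >&2
    fi
fi
```

Matching on the component name keeps the check independent of which framework (for example `mtl` or `btl`) would carry the OFI component.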
|