Merged
18 changes: 16 additions & 2 deletions pytorch/training/docker/1.8/py3/cu111/Dockerfile.gpu
@@ -34,6 +34,7 @@ ENV CUDNN_VERSION=8.0.5.39
 ENV NCCL_VERSION=2.7.8
 ENV HOROVOD_VERSION=0.21.3
 ENV EFA_VERSION=1.11.2
+ENV OMPI_VERSION=4.1.1
 ENV BRANCH_OFI=1.1.1
 ENV DGLBACKEND=pytorch
 ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
@@ -93,17 +94,30 @@ RUN cd /tmp \
 && make -j64 src.build BUILDDIR=/usr/local \
 && rm -rf /tmp/nccl

-# Install EFA along with AWS OPEN_MPI
+# Install EFA along without AWS OPEN_MPI
 RUN mkdir /tmp/efa \
 && cd /tmp/efa \
 && curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-${EFA_VERSION}.tar.gz \
 && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
 && cd aws-efa-installer \
 && ./efa_installer.sh -y --skip-kmod -g \
+&& rm -rf $OPEN_MPI_PATH \
 && rm -rf /tmp/efa \
 && rm -rf /tmp/aws-efa-installer-${EFA_VERSION}.tar.gz

-RUN echo "pml = ob1" >> $OPEN_MPI_PATH/etc/openmpi-mca-params.conf
+# Install OpenMPI without libfabric support
+RUN mkdir /tmp/openmpi && \
Comment on lines +108 to +109
Contributor

nit: Please format this RUN block like the blocks on lines 124-144 and 146-148.

+cd /tmp/openmpi && \
+wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz && \
+tar zxf openmpi-${OMPI_VERSION}.tar.gz && \
+cd openmpi-${OMPI_VERSION} && \
+./configure --enable-orterun-prefix-by-default --prefix=$OPEN_MPI_PATH && \
+make -j $(nproc) all && \
+make install && \
+ldconfig && \
+cd / && \
+rm -rf /tmp/openmpi
+
 ENV PATH="$OPEN_MPI_PATH/bin:$PATH"
 ENV LD_LIBRARY_PATH=$OPEN_MPI_PATH/lib/:$EFA_PATH/lib/:$LD_LIBRARY_PATH
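
The formatting nit on the OpenMPI RUN block can be illustrated with a short sketch. It assumes the blocks the reviewer points to use the leading-`&&` continuation style seen in the EFA install step of the same hunk; the referenced lines are not shown here, so that is an assumption. The commands themselves are unchanged from the PR, and `OMPI_VERSION` and `OPEN_MPI_PATH` are expected to be defined earlier in the Dockerfile.

```dockerfile
# Sketch only: same commands as the PR's OpenMPI block, restyled so each
# continuation line starts with "&&" (assumed to be the style the reviewer means).
# Install OpenMPI without libfabric support
RUN mkdir /tmp/openmpi \
 && cd /tmp/openmpi \
 && wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz \
 && tar zxf openmpi-${OMPI_VERSION}.tar.gz \
 && cd openmpi-${OMPI_VERSION} \
 && ./configure --enable-orterun-prefix-by-default --prefix=$OPEN_MPI_PATH \
 && make -j $(nproc) all \
 && make install \
 && ldconfig \
 && cd / \
 && rm -rf /tmp/openmpi
```

Only the line-continuation layout differs, so the resulting image layer should be the same as with the trailing-`&&` version.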

22 changes: 18 additions & 4 deletions tensorflow/training/docker/2.4/py3/cu110/Dockerfile.gpu
@@ -31,6 +31,7 @@ ARG OPEN_MPI_PATH=/opt/amazon/openmpi
 ARG EFA_PATH=/opt/amazon/efa
 ARG NCCL_VERSION=2.7.8
 ARG EFA_VERSION=1.11.2
+ARG OMPI_VERSION=4.1.1
 ARG BRANCH_OFI=1.1.1

 ARG TF_URL=https://aws-tensorflow-binaries.s3-us-west-2.amazonaws.com/tensorflow/r2.4_aws/20210127-150238/gpu/py37/cu110/tensorflow_gpu-2.4.1-cp37-cp37m-manylinux2010_x86_64.whl
@@ -104,16 +105,30 @@ RUN cd /tmp \
 && make -j64 src.build BUILDDIR=/usr/local \
 && rm -rf /tmp/nccl

-# Install EFA along with AWS OPEN_MPI
+# Install EFA along without AWS OPEN_MPI
Contributor

With this change, do we still need to set RDMAV_FORK_SAFE=1 in the Dockerfile's environment variables?

Contributor Author

RDMAV_FORK_SAFE is required if a program that uses EFA needs to be able to call fork. Given our change to MPI, programs that use only MPI no longer need this flag, but programs that use NCCL or Herring still do. If we don't set it, programs that use NCCL or Herring and call fork will crash.
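
For reference, a minimal sketch of keeping the flag in the Dockerfile, assuming the image should continue to support NCCL or Herring programs that call fork. Where the line sits in the file, and whether it is already set elsewhere in this Dockerfile, are assumptions not confirmed by this diff.

```dockerfile
# Assumption: NCCL/Herring workloads in this image use EFA and may call fork(),
# so the flag stays set image-wide even though MPI-only programs no longer need it.
ENV RDMAV_FORK_SAFE=1
```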

 RUN mkdir /tmp/efa \
 && cd /tmp/efa \
-&& curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-$EFA_VERSION.tar.gz \
-&& tar -xf aws-efa-installer-$EFA_VERSION.tar.gz \
+&& curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-${EFA_VERSION}.tar.gz \
+&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
 && cd aws-efa-installer \
 && ./efa_installer.sh -y --skip-kmod -g \
+&& rm -rf $OPEN_MPI_PATH \
 && rm -rf /tmp/efa \
 && rm -rf /tmp/aws-efa-installer-${EFA_VERSION}.tar.gz

+# Install OpenMPI without libfabric support
+RUN mkdir /tmp/openmpi && \
Comment on lines +119 to +120
Contributor

nit: Please format this RUN block like the blocks on lines 132-140 and 146-150.

+cd /tmp/openmpi && \
+wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz && \
+tar zxf openmpi-${OMPI_VERSION}.tar.gz && \
+cd openmpi-${OMPI_VERSION} && \
+./configure --enable-orterun-prefix-by-default --prefix=$OPEN_MPI_PATH && \
+make -j $(nproc) all && \
+make install && \
+ldconfig && \
+cd / && \
+rm -rf /tmp/openmpi
+
 RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
 && tar -xzf boost_1_73_0.tar.gz \
 && cd boost_1_73_0 \
@@ -141,7 +156,6 @@ RUN echo "hwloc_base_binding_policy = none" >> $OPEN_MPI_PATH/etc/openmpi-mca-pa

 # Set default NCCL parameters
 RUN echo NCCL_DEBUG=INFO >> /etc/nccl.conf
-RUN echo "pml = ob1" >> $OPEN_MPI_PATH/etc/openmpi-mca-params.conf
 ENV LD_LIBRARY_PATH=$OPEN_MPI_PATH/lib/:$EFA_PATH/lib/:$LD_LIBRARY_PATH
 # /usr/local/lib/libpython* needs to be accessible for dynamic linking
 ENV LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH