[PyTorch][Training][EC2][SageMaker]PyTorch 2.9 Currency Release #5407
base: master
Conversation
Do not forget to change the toml file to trigger the image build and tests.

Enable EFA log display in the console (https://github.com/aws/deep-learning-containers/blob/master/test/dlc_tests/container_tests/bin/efa/testEFA#L92) to check the log.

Also, check the packages installed in the container regularly in case new versions have been released.
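The toml change mentioned above refers to the developer config toggles. A hedged sketch of what such toggles look like; only the sagemaker_* style keys are cited elsewhere in this PR, and the other key names here are assumptions, so check dlc_developer_config.toml in the repo for the real ones:

```toml
# Illustrative toggles; exact key names live in dlc_developer_config.toml.
build_frameworks = ["pytorch"]    # assumption: restrict the build to PyTorch images
ec2_tests = true                  # assumption: enable EC2 test runs
sagemaker_remote_tests = true     # key cited in this PR's test instructions
```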
# Optionally set NVTE_FRAMEWORK to avoid bringing in additional frameworks during TE install
ENV NVTE_FRAMEWORK=pytorch

RUN curl -LO https://github.com/Dao-AILab/flash-attention/releases/download/v${FLASH_ATTN_VERSION}/flash_attn-${FLASH_ATTN_VERSION}+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl \
Looks like the flash attention wheel is incorrect: cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl is still built against torch 2.8.
You can see all the prebuilt wheels here: https://github.com/Dao-AILab/flash-attention/releases
If there is no prebuilt wheel for torch 2.9, we need to build it from source instead; refer to https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/2.7/py3/cu128/Dockerfile.gpu#L92
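For reference, a source build would look roughly like this (a hedged sketch, not the repo's actual lines; the MAX_JOBS value is an assumption, and the linked Dockerfile.gpu is the authoritative recipe):

```dockerfile
# Sketch of a flash-attention source build when no torch 2.9 wheel exists.
ENV MAX_JOBS=4   # assumption: cap parallel nvcc jobs to bound build memory
RUN pip install --no-cache-dir ninja \
 && pip install --no-cache-dir flash-attn==${FLASH_ATTN_VERSION} --no-build-isolation
```

ninja speeds up the CUDA extension build considerably, and --no-build-isolation lets the build see the already-installed torch.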
Thanks for catching that. Will update it.
# |___/ |_|
#################################################################

FROM 669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-base:13.0.0-gpu-py312-cu130-ubuntu22.04-ec2-pr-5468-2025-11-12-00-24-28 AS common
Once the base image is released, you can replace this with the released tag.
okay
pytest -v -s $TE_PATH/tests/pytorch/test_fused_optimizer.py
pytest -v -s $TE_PATH/tests/pytorch/test_multi_tensor.py
pytest -v -s $TE_PATH/tests/pytorch/test_fusible_ops.py
elif [ $(version $TE_VERSION) -lt $(version "3.0") ]; then
Will this condition also cover all the other versions that fall between 2.0 and 3.0?
Yes, it covers all the versions between 2.0 and 3.0.
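To see why the elif branch catches everything below 3.0, here is a minimal sketch of how a bash `version` helper like the one in this script commonly works (the exact definition below is an assumption, not the script's own code): each dotted component is zero-padded so plain integer comparison orders versions correctly.

```shell
# Pad each version component so that integer comparison works,
# e.g. "2.5.0" -> 2005000, "3.0" -> 3000000.
version() {
  echo "$@" | awk -F. '{ printf("%d%03d%03d\n", $1, $2, $3); }'
}

TE_VERSION="2.5.0"
# Any TE version >= 2.0 that is not caught by an earlier branch and is
# below 3.0 lands here, so every 2.x release takes this path.
if [ "$(version "$TE_VERSION")" -lt "$(version "3.0")" ]; then
  echo "TE ${TE_VERSION} takes the pre-3.0 branch"
fi
```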
RETURN_VAL=${PIPESTATUS[0]}
# In case, if you would like see logs, uncomment below line
# RESULT=$(cat ${TRAINING_LOG})
RESULT=$(cat ${TRAINING_LOG})
Can you revert this change since we have solved this issue?
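For context on the RETURN_VAL line in the snippet above, here is an illustrative sketch (the failing command and log path are placeholders, not the test script's own code): after `cmd | tee log`, `$?` holds tee's exit status, while `PIPESTATUS[0]` holds cmd's, which is why the script reads the training exit code from there.

```shell
# Stand-in for a failing training run piped through tee.
TRAINING_LOG=$(mktemp)
false | tee "${TRAINING_LOG}" >/dev/null
RETURN_VAL=${PIPESTATUS[0]}   # exit code of `false`, i.e. 1, not tee's 0
echo "training exit code: ${RETURN_VAL}"
```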
GitHub Issue #, if available:
Note:
If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.
All PR's are checked weekly for staleness. This PR will be closed if not updated in 30 days.
Description
Tests Run
EC2 tests: b8d42f9
SM tests: 36f8594
By default, docker image builds and tests are disabled. Two ways to run builds and tests:
How to use the helper utility for updating dlc_developer_config.toml
Assuming your remote is called origin (you can find out more with git remote -v):
python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin
python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin
python src/prepare_dlc_dev_environment.py -rcp origin
NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:
sagemaker_remote_tests = true
sagemaker_efa_tests = true
sagemaker_rc_tests = true
sagemaker_local_tests = true
How to use PR description
Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
# /buildspec <buildspec_path>
# /buildspec pytorch/training/buildspec.yml
# /tests <test_list>
# /tests sanity security ec2
Available test types: sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
Formatting
I have run black -l 100 on my code (formatting tool: https://black.readthedocs.io/en/stable/getting_started.html)
PR Checklist
Pytest Marker Checklist
I have added @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
I have added @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
I have added @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
I have added @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.