@youkaichao
Member

@youkaichao youkaichao commented Sep 1, 2025

Purpose

On arm64 platforms, PyTorch 2.8 is only available for CUDA 12.9, so this PR updates the build's CUDA version from 12.8.1 to 12.9.1.

Test Plan

Test Result



Signed-off-by: youkaichao <[email protected]>
@mergify mergify bot added the ci/build label Sep 1, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the CUDA version from 12.8.1 to 12.9.1 across the build configurations. The changes in .buildkite/release-pipeline.yaml and the CUDA_VERSION argument in docker/Dockerfile are consistent with this goal. However, I've identified a critical issue in the docker/Dockerfile where a hardcoded PyTorch version for arm64 builds is still pointing to a CUDA 12.8-specific build, which will likely cause the arm64 build to fail with the updated CUDA version. This needs to be addressed.

# docs/assets/contributing/dockerfile-stages-dependency.png

-ARG CUDA_VERSION=12.8.1
+ARG CUDA_VERSION=12.9.1

critical

While updating CUDA_VERSION to 12.9.1 is the goal of this PR, this change will likely break the arm64 build due to a hardcoded PyTorch version string.

In this file at line 349, the PyTorch version for arm64 is hardcoded as torch==2.8.0.dev20250318+cu128. The +cu128 suffix indicates it's built for CUDA 12.8.

With CUDA_VERSION set to 12.9.1, the build script will correctly look for packages in the CUDA 12.9 index (.../cu129), but it will try to install a package for CUDA 12.8, which will likely fail.

To fix this, the hardcoded torch version string needs to be updated to a version compatible with CUDA 12.9 (e.g., ...dev...+cu129). The torchvision version on the same line might also need to be updated accordingly. This is a critical issue that needs to be addressed to ensure the arm64 build succeeds.

Since this part of the file is not in the diff, I cannot provide a direct code suggestion, but the line to change is:

349:            "torch==2.8.0.dev20250318+cu128" "torchvision==0.22.0.dev20250319" ; \
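
The mismatch described in this comment can be checked mechanically. The following is a minimal shell sketch, not the project's actual build logic: `cuda_tag` and `pin_matches_cuda` are hypothetical helper names introduced here for illustration. It assumes the wheel tag is `cu` followed by the CUDA major and minor version with the dot removed (12.9.1 becomes cu129), as the `+cu128` suffix and the `.../cu129` index path mentioned above suggest.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the Dockerfile):
# map a CUDA version such as 12.9.1 to a wheel tag such as cu129.
cuda_tag() {
    # Keep only major.minor, then drop the dot.
    printf 'cu%s\n' "$(printf '%s' "$1" | cut -d. -f1,2 | tr -d '.')"
}

# Hypothetical helper (an assumption, not part of the Dockerfile):
# succeed only if the pinned requirement ends with the "+cuXYZ" local
# version suffix that corresponds to the given CUDA version.
pin_matches_cuda() {
    pin="$1"
    tag="$(cuda_tag "$2")"
    case "$pin" in
        *"+$tag") return 0 ;;
        *) return 1 ;;
    esac
}

# Example: compare the pin quoted in the review against the new CUDA version.
if pin_matches_cuda "torch==2.8.0.dev20250318+cu128" "12.9.1"; then
    echo "pin matches CUDA 12.9.1"
else
    echo "pin does not match CUDA 12.9.1"
fi
```

Run against the pin quoted above, `pin_matches_cuda` returns a non-zero status for 12.9.1 and zero for 12.8.1, which is the inconsistency this review flags.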

@youkaichao
Member Author

The arm64 build (https://buildkite.com/vllm/release/builds/7808/steps/canvas?sid=01990371-cf21-43a1-bc1f-7b51b716a488) was broken by the PyTorch 2.8 update in #20358, because it finds a CPU-only build of PyTorch.

@youkaichao youkaichao requested a review from hmellor as a code owner September 1, 2025 05:42
@mergify mergify bot added the documentation Improvements or additions to documentation label Sep 1, 2025
Member

@Isotr0py Isotr0py left a comment

Let's see if the release pipeline turns green, then.

@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 1, 2025

nWEIdia commented Sep 1, 2025

> it finds a cpu built pytorch.

Thank you! That explains why the build suddenly required numa.
So in this sense, installing libnuma-dev is not the right fix (in #23960), right?

@youkaichao
Member Author

> So in this sense, installing libnuma-dev is not the right fix (in #23960), right?

I think so. Switching to the cuda 12.9 build looks more reasonable.

@youkaichao
Member Author

Closing, as #23960 covers more aspects.

@youkaichao youkaichao closed this Sep 1, 2025