@wangshangsam (Collaborator) commented Nov 19, 2025

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@wangshangsam wangshangsam self-assigned this Nov 19, 2025
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@wangshangsam force-pushed the wangshangsam/cuda13-wheel-buildkite-step branch from 6c7ce36 to 6ad4052 on November 19, 2025 02:36

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds CI steps to build aarch64 wheels and images for CUDA 13.0. The changes introduce two new jobs to the Buildkite release pipeline. My review has identified a critical issue where the builds will likely fail due to a hardcoded PyTorch version for an older CUDA version in the Dockerfile. Additionally, I've pointed out a high-severity concern regarding a change in the base build image to a newer Ubuntu version, which could impact the binary compatibility of the generated artifacts. Both issues are present in the two new CI steps.

commands:
# #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
# https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."

critical

Setting CUDA_VERSION=13.0.1 for this arm64 build will likely cause it to fail. The docker/Dockerfile has a hardcoded PyTorch version for CUDA 12.8 (torch==2.8.0.dev20250318+cu128) for arm64 platforms (see docker/Dockerfile lines 344-352). The build process will attempt to find this cu128 package in the cu130 PyTorch index, which will not work. To fix this, the hardcoded PyTorch version in docker/Dockerfile needs to be updated or made dynamic to support CUDA 13.0.
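
As a rough illustration of the mismatch described above, the sketch below derives a PyTorch-style CUDA tag from CUDA_VERSION and compares it with the local version label of the pinned requirement. The helper names (cuda_wheel_tag, pinned_local_tag, pin_matches_cuda) are hypothetical and are not part of vLLM or docker/Dockerfile; the cuXY naming is assumed from the cu128 and cu130 labels quoted in this comment.

```python
import re
from typing import Optional


def cuda_wheel_tag(cuda_version: str) -> str:
    """Map a CUDA version such as '13.0.1' to a tag such as 'cu130' (assumed convention)."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def pinned_local_tag(requirement: str) -> Optional[str]:
    """Return the label after '+' in a pin like 'torch==X+cu128', or None if absent."""
    match = re.search(r"\+([A-Za-z0-9.]+)$", requirement.strip())
    return match.group(1) if match else None


def pin_matches_cuda(requirement: str, cuda_version: str) -> bool:
    """A pin without a local label is treated as compatible with any index."""
    tag = pinned_local_tag(requirement)
    return tag is None or tag == cuda_wheel_tag(cuda_version)


# The pin and CUDA versions below are the ones quoted in the review comment.
pin = "torch==2.8.0.dev20250318+cu128"
print(pin_matches_cuda(pin, "12.8.1"))  # pin label agrees with the CUDA 12.8 tag
print(pin_matches_cuda(pin, "13.0.1"))  # pin label disagrees with the CUDA 13.0 tag
```

A check along these lines could run before the docker build step, so that a stale pin is reported up front rather than as a failed package lookup during the build.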

queue: arm64_cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ."

critical

Similar to the wheel build step, setting CUDA_VERSION=13.0.1 for this arm64 build will likely cause a failure. The docker/Dockerfile uses a hardcoded PyTorch version for CUDA 12.8 (torch==2.8.0.dev20250318+cu128) for arm64 platforms (lines 344-352), which is incompatible with the cu130 index that will be used. The hardcoded version in docker/Dockerfile needs to be adjusted for CUDA 13.0 support.

commands:
# #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
# https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."

high

The BUILD_BASE_IMAGE is set to nvidia/cuda:13.0.1-devel-ubuntu22.04. This contradicts the project's stated goal of using an older Ubuntu version for builds to maintain broad glibc compatibility, as mentioned in docker/Dockerfile (lines 18-21). Using ubuntu22.04 may limit the portability of the generated wheel. Other arm64 builds in this pipeline use the default ubuntu20.04-based image. If this change is not intentional, consider removing the --build-arg BUILD_BASE_IMAGE to use the default.

      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
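
To make the portability concern concrete, here is a minimal sketch that assumes Ubuntu 20.04 ships glibc 2.31 and Ubuntu 22.04 ships glibc 2.35. It treats the build image's glibc as the worst-case minimum the host must provide, in the spirit of the manylinux_x_y platform tags from PEP 600. The names UBUNTU_GLIBC, manylinux_tag and can_load are hypothetical; the real requirement depends on which glibc symbols the compiled extension actually references.

```python
from typing import Tuple

# Assumed glibc versions for the two Ubuntu releases discussed in this comment.
UBUNTU_GLIBC = {"20.04": (2, 31), "22.04": (2, 35)}


def manylinux_tag(glibc: Tuple[int, int], arch: str = "aarch64") -> str:
    """Format a PEP 600 style platform tag for a given glibc version."""
    return f"manylinux_{glibc[0]}_{glibc[1]}_{arch}"


def can_load(build_glibc: Tuple[int, int], host_glibc: Tuple[int, int]) -> bool:
    """Worst case: the host glibc must be at least as new as the build glibc."""
    return host_glibc >= build_glibc


for release, glibc in UBUNTU_GLIBC.items():
    # Report the worst-case tag and whether a 20.04 host could load the result.
    runs_on_focal = can_load(glibc, UBUNTU_GLIBC["20.04"])
    print(release, manylinux_tag(glibc), "loads on a 20.04 host:", runs_on_focal)
```

Under these assumptions, an artifact built on the 22.04 image is only guaranteed to load on hosts with glibc 2.35 or newer, whereas one built on the 20.04 image also covers hosts down to glibc 2.31.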

queue: arm64_cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ."

high

The BUILD_BASE_IMAGE is set to an ubuntu22.04-based image, which may reduce the glibc compatibility of the resulting Docker image and the artifacts within. This is inconsistent with the project's documented approach in docker/Dockerfile (lines 18-21) and other arm64 builds in this file. Please consider removing the --build-arg BUILD_BASE_IMAGE argument if using ubuntu22.04 is not a strict requirement for CUDA 13.0.

      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ."
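
One way to audit which Dockerfile defaults a pipeline command overrides is to parse its --build-arg flags. The sketch below does that for an abbreviated copy of the image build command; build_args is a hypothetical helper, and the command string is shortened for readability rather than copied in full from the pipeline.

```python
import shlex
from typing import Dict


def build_args(command: str) -> Dict[str, str]:
    """Collect KEY=VALUE pairs that follow each --build-arg flag."""
    tokens = shlex.split(command)  # honours the single quotes around the arch list
    args: Dict[str, str] = {}
    for i, token in enumerate(tokens):
        if token == "--build-arg" and i + 1 < len(tokens):
            key, _, value = tokens[i + 1].partition("=")
            args[key] = value
    return args


# Abbreviated version of the image build command quoted above.
cmd = (
    "docker build --build-arg CUDA_VERSION=13.0.1 "
    "--build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 "
    "--build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' "
    "--target vllm-openai -f docker/Dockerfile ."
)
args = build_args(cmd)

# Flag an explicit base image override so a reviewer can confirm it is intentional.
if "BUILD_BASE_IMAGE" in args:
    print("base image overridden:", args["BUILD_BASE_IMAGE"])
```

Applied to the suggested command, which drops the BUILD_BASE_IMAGE argument, the same helper would report no override and the Dockerfile default would apply.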

@wangshangsam wangshangsam marked this pull request as draft November 19, 2025 02:51