[fix] Fix llama 4 long context #4809

mikeiovine · 2025-05-30T19:22:31Z

Description

The 8k context limitation was accidentally re-added during feat/llama4 integration.

Since the Blackwell kernels have not landed yet, I've added an additional get_sm_version() == 90 restriction for using the full context length.
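
For reviewers, here is a minimal sketch of the gating described above. It is not the actual diff. The function name `infer_max_seq_len_sketch`, the `LIMITED_CONTEXT` constant, and passing the SM version as an argument are all hypothetical. It also assumes that every non-SM90 architecture falls back to the 8k cap.

```python
# Hypothetical sketch only: apart from the SM 90 check and the 8k cap,
# every name here is an assumption made for illustration.
LIMITED_CONTEXT = 8192  # the 8k limitation mentioned in the description


def infer_max_seq_len_sketch(model_max_len: int, sm_version: int) -> int:
    """Return the usable context length for a given SM version."""
    # Full context is only allowed on SM 90 (Hopper) in this sketch.
    if sm_version == 90:
        return model_max_len
    # Assumed fallback: all other architectures, including SM 100
    # (Blackwell), keep the 8k cap until their kernels land.
    return min(model_max_len, LIMITED_CONTEXT)
```

In the real code the SM version would come from get_sm_version() rather than an argument. Taking it as a parameter here just keeps the sketch runnable without a GPU.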

Test Coverage

RULER evaluation passes

|  Tasks  |Version|Filter|n-shot|Metric|   | Value |   |Stderr|
|---------|------:|------|-----:|-----:|---|------:|---|------|
|ruler_cwe|      1|none  |     0| 16384|↑  | 0.9808|±  |   N/A|
|         |       |none  |     0|  4096|↑  |-1.0000|±  |   N/A|

hlu1 · 2025-05-30T21:46:46Z

Is there any test coverage for llama4 long context?

mikeiovine · 2025-05-30T23:00:29Z

> Is there any test coverage for llama4 long context?

The issue is that the release version of lm_eval does not include RULER yet, so it has to be installed from source. We need to add a very long prompt to the sanity check (test_modeling_llama4). However, the sanity check is currently disabled due to an unrelated multimodality issue. The multimodal tests need to be separated from the text ones to unblock this.

hlu1 · 2025-05-30T23:04:47Z

Can you add a test to check that the return value of infer_max_seq_len is correct on SM90/100? It's a functional test, not a real accuracy test.

mikeiovine · 2025-06-02T15:13:50Z

Decided to enable a functional llama 4 long context test in L0 instead so we don't have to put up another PR. The broken multimodal tests remain disabled for now.

mikeiovine · 2025-06-02T15:14:28Z

/bot run --add-multi-gpu-test

mikeiovine · 2025-06-02T15:25:11Z

/bot run

mikeiovine · 2025-06-02T15:26:16Z

/bot run --add-multi-gpu-test

tensorrt-cicd · 2025-06-02T15:31:09Z

PR_Github #7225 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-02T15:31:34Z

PR_Github #7226 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-02T15:31:35Z

PR_Github #7225 [ run ] completed with state ABORTED

mikeiovine · 2025-06-02T18:02:46Z

/bot run --add-multi-gpu-test

tensorrt-cicd · 2025-06-02T18:09:10Z

PR_Github #7236 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-02T18:09:12Z

PR_Github #7226 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-06-03T01:24:28Z

PR_Github #7236 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5241 completed with status: 'SUCCESS'

mikeiovine · 2025-06-03T13:54:07Z

/bot run --post-merge

tensorrt-cicd · 2025-06-03T13:59:35Z

PR_Github #7359 [ run ] triggered by Bot

schetlur-nv · 2025-06-03T19:30:58Z

/bot --help

github-actions · 2025-06-03T19:31:10Z