Description
Your current environment
Greetings, everyone.
I have built PyTorch 2.5.1 from source on the Jetson AGX Orin with CUDA support.
This is the output I get from the Python interpreter:
Python 3.10.16 (main, Dec 11 2024, 16:18:56) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version.cuda)
12.6
>>> print(torch.cuda.get_arch_list())
['sm_87']
As you can see, this PyTorch build was compiled for the sm_87 compute capability.
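The check above can be made explicit by comparing the device's compute capability with the architectures the PyTorch build was compiled for. This is a minimal sketch: the helper names `capability_to_arch` and `is_natively_supported` are hypothetical, and it says nothing about separately compiled extensions such as vLLM's own kernels, which carry their own architecture lists.

```python
def capability_to_arch(major: int, minor: int) -> str:
    """Map a compute capability such as (8, 7) to an arch string such as 'sm_87'."""
    return f"sm_{major}{minor}"


def is_natively_supported(arch_list, major: int, minor: int) -> bool:
    """True if the arch list contains a native entry for the given capability."""
    return capability_to_arch(major, minor) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            # get_device_capability() returns a (major, minor) tuple.
            major, minor = torch.cuda.get_device_capability()
            archs = torch.cuda.get_arch_list()
            print("device:", capability_to_arch(major, minor))
            print("torch arch list:", archs)
            print("native support:", is_natively_supported(archs, major, minor))
        else:
            print("CUDA is not available to this torch build")
```

On the setup described here, the arch list is `['sm_87']`, so the check would be expected to pass for PyTorch itself.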
I then followed these steps to build the latest vLLM code from GitHub:
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -vvv -e . --no-build-isolation
In addition, I edited this file: /home/nvidia/projects/vllm/.deps/flashmla-src/setup.py
and changed the line from:
cc_flag.append("arch=compute_90a,code=sm_90a")
to
cc_flag.append("arch=compute_87,code=sm_87") # for jetson agx orin
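As an illustration of how the flag string relates to the device, the sketch below derives the same `arch=compute_XX,code=sm_XX` value from a compute capability instead of hardcoding it. The helper name `gencode_flag` is hypothetical and not part of FlashMLA or vLLM, and the sketch does not address whether the FlashMLA kernels themselves can run on sm_87.

```python
def gencode_flag(major: int, minor: int) -> str:
    """Build the nvcc -gencode value for one native architecture."""
    cc = f"{major}{minor}"
    return f"arch=compute_{cc},code=sm_{cc}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            # Use the capability of the current device, e.g. (8, 7) on Jetson AGX Orin.
            print(gencode_flag(*torch.cuda.get_device_capability()))
        else:
            print("CUDA is not available to this torch build")
```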
Everything compiled successfully, and then I tried to run:
CUDA_LAUNCH_BLOCKING=1 python examples/offline_inference/basic/basic.py
I got the following errors:
INFO 03-04 11:23:24 [config.py:576] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 03-04 11:23:24 [llm_engine.py:235] Initializing a V0 LLM engine (v0.7.4.dev180+gb87c21fc.d20250304) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-04 11:23:27 [cuda.py:268] Using Flash Attention backend.
INFO 03-04 11:23:27 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-04 11:23:27 [model_runner.py:1110] Starting to load model facebook/opt-125m...
INFO 03-04 11:23:28 [weight_utils.py:257] Using model weights format ['*.bin']
INFO 03-04 11:23:29 [weight_utils.py:273] Time spent downloading weights for facebook/opt-125m: 0.622125 seconds
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.08it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.08it/s]
INFO 03-04 11:23:29 [loader.py:422] Loading weights took 0.25 seconds
INFO 03-04 11:23:29 [model_runner.py:1117] Model loading took 0.2389 GB and 1.909658 seconds
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/nvidia/projects/vllm/examples/offline_inference/basic/basic.py", line 18, in <module>
[rank0]: llm = LLM(model="facebook/opt-125m")
[rank0]: File "/home/nvidia/projects/vllm/vllm/utils.py", line 1045, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/entrypoints/llm.py", line 243, in __init__
[rank0]: self.llm_engine = self.engine_class.from_engine_args(
[rank0]: File "/home/nvidia/projects/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/nvidia/projects/vllm/vllm/engine/llm_engine.py", line 277, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/nvidia/projects/vllm/vllm/engine/llm_engine.py", line 426, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/home/nvidia/projects/vllm/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
[rank0]: results = self.collective_rpc("determine_num_available_blocks")
[rank0]: File "/home/nvidia/projects/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/utils.py", line 2232, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/worker/model_runner.py", line 1243, in profile_run
[rank0]: self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/worker/model_runner.py", line 1354, in _dummy_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/worker/model_runner.py", line 1742, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/model_executor/models/opt.py", line 353, in forward
[rank0]: hidden_states = self.model(input_ids, positions, intermediate_tensors,
[rank0]: File "/home/nvidia/projects/vllm/vllm/compilation/decorators.py", line 172, in __call__
[rank0]: return self.forward(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/model_executor/models/opt.py", line 312, in forward
[rank0]: return self.decoder(input_ids,
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/model_executor/models/opt.py", line 273, in forward
[rank0]: hidden_states = layer(hidden_states)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/model_executor/models/opt.py", line 171, in forward
[rank0]: hidden_states = self.self_attn(hidden_states=hidden_states)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/model_executor/models/opt.py", line 113, in forward
[rank0]: attn_output = self.attn(q, k, v)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/nvidia/projects/vllm/vllm/attention/layer.py", line 212, in forward
[rank0]: torch.ops.vllm.unified_attention_with_output(
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: File "/home/nvidia/projects/vllm/vllm/attention/layer.py", line 361, in unified_attention_with_output
[rank0]: self.impl.forward(self,
[rank0]: File "/home/nvidia/projects/vllm/vllm/attention/backends/flash_attn.py", line 749, in forward
[rank0]: flash_attn_varlen_func(
[rank0]: File "/home/nvidia/projects/vllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 172, in flash_attn_varlen_func
[rank0]: out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
[rank0]: File "/home/nvidia/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank0]:[W304 11:23:31.178074184 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
I am confused by this error, because I believe my PyTorch build now supports sm_87 (Jetson AGX Orin) natively and the flash attention component has also been compiled natively.
Could you please give me a hint about which step I missed? Thank you so much for your help.
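The failing call in the traceback is `torch.ops._vllm_fa2_C.varlen_fwd`, so one way to test the assumption that the flash attention extension was compiled for sm_87 is to list the architectures embedded in its shared library. The sketch below uses `cuobjdump --list-elf` from the CUDA toolkit and extracts the `sm_XX` names from its output. The extension directory `vllm/vllm_flash_attn` and the `_vllm_fa2_C*.so` filename pattern are assumptions inferred from the traceback paths, and the helper names are hypothetical.

```python
import re
import shutil
import subprocess
from pathlib import Path

# Matches architecture names such as sm_80, sm_87, or sm_90a.
SM_PATTERN = re.compile(r"sm_(\d+[a-z]?)")


def parse_embedded_archs(cuobjdump_output: str) -> set:
    """Collect the sm_XX names that appear in `cuobjdump --list-elf` output."""
    return {f"sm_{m}" for m in SM_PATTERN.findall(cuobjdump_output)}


def embedded_archs(so_path: Path) -> set:
    """Run cuobjdump on one shared library and return its embedded architectures."""
    result = subprocess.run(
        ["cuobjdump", "--list-elf", str(so_path)],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_embedded_archs(result.stdout)


if __name__ == "__main__":
    # Assumed location of the compiled extension inside the vLLM source tree.
    ext_dir = Path("vllm/vllm_flash_attn")
    if shutil.which("cuobjdump") is None:
        print("cuobjdump not found on PATH (it ships with the CUDA toolkit)")
    else:
        for so in sorted(ext_dir.glob("_vllm_fa2_C*.so")):
            print(so.name, sorted(embedded_archs(so)))
```

If `sm_87` is missing from the result, the extension was probably built for a different architecture list than PyTorch itself. An empty result may also mean that only PTX is embedded, which `cuobjdump --list-ptx` would show.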
🐛 Describe the bug
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank0]:[W304 11:23:31.178074184 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())