Description
Your current environment
vllm version: 0.6.3.post1
Model Input Dumps
No response
🐛 Describe the bug
According to the official Gemma model page (https://huggingface.co/google/gemma-2b), the context length is 8K.
However, when I load the model in vLLM and run inference with max_model_len set to 8192, I get the error below:
Traceback (most recent call last):
  File "/home/ubuntu/moa/eval.py", line 170, in <module>
    llm = LLM(model= args.llm_name, dtype='bfloat16', max_model_len= max_len,
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 177, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 570, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 903, in create_engine_config
    model_config = self.create_model_config()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 839, in create_model_config
    return ModelConfig(
  File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 192, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 1790, in _get_and_verify_max_len
    raise ValueError(
ValueError: User-specified max_model_len (8192) is greater than the derived max_model_len (sliding_window=4096 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
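To make the failure easier to reason about, here is a minimal sketch of the kind of check the error message describes. It is not vLLM's actual code: the function name `check_max_model_len` and its parameters are hypothetical, and taking the smallest configured limit as the derived maximum is an assumption based only on the wording of the message. The environment variable name is copied from the error text.

```python
import os


def check_max_model_len(user_max_len, sliding_window=None, model_max_length=None):
    """Hypothetical, simplified stand-in for the length check in the traceback."""
    # Collect whichever limits are present; None means "not set in config.json".
    limits = [v for v in (sliding_window, model_max_length) if v is not None]
    if not limits:
        return user_max_len  # nothing to compare against

    # Assumption: the derived maximum is the smallest configured limit.
    derived = min(limits)
    if user_max_len <= derived:
        return user_max_len

    # Override switch named in the error message.
    if os.environ.get("VLLM_ALLOW_LONG_MAX_MODEL_LEN") == "1":
        return user_max_len

    raise ValueError(
        f"User-specified max_model_len ({user_max_len}) is greater than "
        f"the derived max_model_len ({derived})."
    )
```

With the values from the traceback (`sliding_window=4096`, `model_max_length=None`), a request for 8192 raises `ValueError`, while 4096 is accepted. In a real script the override from the message would presumably be applied by setting `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` in the environment before `LLM(...)` is constructed, keeping in mind the message's own warning that this may lead to incorrect model outputs or CUDA errors.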