Closed
Labels
feature request (New feature or request)
Description
🚀 The feature, motivation and pitch
The model reportedly outperforms Mixtral at a smaller size, with a longer context length and multilingual support.
See https://github.com/mistralai/mistral-inference/#deployment for a Dockerfile (requires updating transformers).
It currently doesn't run with --tensor-parallel-size=2 on vllm/vllm-openai:latest:
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 243, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 153, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 128, in __init__
[rank0]: super().__init__(model_config, cache_config, parallel_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 42, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 79, in _init_executor
[rank0]: self._run_workers("load_model",
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 134, in _run_workers
[rank0]: ] + [output.get() for output in worker_outputs]
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 134, in <listcomp>
[rank0]: ] + [output.get() for output in worker_outputs]
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 58, in get
[rank0]: raise self.result.exception
[rank0]: RuntimeError: start (640) + length (640) exceeds dimension size (1024).
I can't test with --tensor-parallel-size=1.
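The RuntimeError at the bottom of the traceback looks like a tensor-parallel weight-sharding mismatch: during load_model, each rank computes a slice of a weight dimension from the model config, and that slice overruns the checkpoint's actual dimension. A minimal sketch of the arithmetic, as an assumption only (the function and its parameters are hypothetical; the numbers are chosen solely to reproduce the 640/640/1024 figures from the traceback, not taken from the real model config):

```python
def shard_bounds(total_size, tp_size, rank, shard_size):
    """Compute the [start, start + length) slice a tensor-parallel rank loads.

    Raises the same kind of error as the traceback when the per-rank shard
    size implied by the config does not match the checkpoint dimension.
    """
    start = rank * shard_size
    length = shard_size
    if start + length > total_size:
        raise RuntimeError(
            f"start ({start}) + length ({length}) "
            f"exceeds dimension size ({total_size})."
        )
    return start, length

# If the loader expects 640 elements per rank (e.g. head_dim * heads-per-rank)
# but the checkpoint dimension is only 1024 rather than 2 * 640 = 1280,
# rank 0 fits and rank 1 overruns:
print(shard_bounds(1024, 2, 0, 640))  # rank 0: (0, 640)
try:
    shard_bounds(1024, 2, 1, 640)     # rank 1: 640 + 640 > 1024
except RuntimeError as e:
    print(e)
```

With tp=1 the whole dimension is loaded in one piece, which is why a single-GPU run (which I can't test) might not hit this; the fix presumably belongs in how vLLM derives the per-rank shard size for this architecture.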
Alternatives
No response
Additional context
No response