vllm development does not work for tensor-parallel > 1 #2619

@lroberts7

I have a local dev build at the following commit:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ git log -n 1
commit 5265631d15d59735152c8b72b38d960110987f10 (HEAD -> main, origin/main, origin/HEAD)
Author: Vladimir <[email protected]>
Date:   Fri Jan 26 08:48:17 2024 +0100

    use a correct device when creating OptionalCUDAGuard (#2583)

and I have some local code that is a thin wrapper around the `LLM` class.
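For context, a wrapper of this kind might look roughly like the sketch below. The actual wrapper is not shown in this report, so `build_engine_kwargs` and `make_llm` are hypothetical names, and the argument values are assumptions taken from the engine config printed in the tensor-parallel == 1 log further down. Only `tensor_parallel_size` changes between the working and failing runs.

```python
from typing import Any, Dict


def build_engine_kwargs(model_path: str, tensor_parallel_size: int = 1) -> Dict[str, Any]:
    """Collect keyword arguments for vllm.LLM (hypothetical helper).

    The values mirror the engine config shown in the log; they are
    assumptions about the wrapper, not its real code.
    """
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    return {
        "model": model_path,
        "quantization": "awq",
        "dtype": "float16",
        "tensor_parallel_size": tensor_parallel_size,
    }


def make_llm(model_path: str, tensor_parallel_size: int = 1):
    """Construct the engine; vLLM is imported lazily so the helper above
    can be inspected on a machine without vLLM installed."""
    from vllm import LLM

    return LLM(**build_engine_kwargs(model_path, tensor_parallel_size))
```

Calling `make_llm(path, tensor_parallel_size=2)` would correspond to the failing configuration described next.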

If I run this with tensor-parallel == 2, I get the following:

lroberts@GPU77B9:~/llm_quantization$ FLASK_APP=quantized_flask_app.py FLASK_ENV=debug python3.10 -m flask run 
 * Serving Flask app 'quantized_flask_app.py' (lazy loading)
 * Environment: debug
 * Debug mode: off
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
16384
INFO 2024-01-26 22:03:13,343 abc_etal.py:195 unknown_model_name:unknown_model_version
                             Hello! logging initialized, starting up... 
INFO 2024-01-26 22:03:13,343 abc_etal.py:196 unknown_model_name:unknown_model_version
                             Git commit of model: unknown_git_commit 
INFO 2024-01-26 22:03:13,343 abc_etal.py:197 unknown_model_name:unknown_model_version
                             Git commit of cuda torch base: unknown_git_commit 
INFO 2024-01-26 22:03:14,921 abc_etal.py:200 unknown_model_name:unknown_model_version
                             Compute device available: cuda 
WARNING 01-26 22:03:16 config.py:506] Casting torch.bfloat16 to torch.float16.
WARNING 01-26 22:03:16 config.py:176] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-26 22:03:18,650 ERROR services.py:1329 -- Failed to start the dashboard , return code 1
2024-01-26 22:03:18,650 ERROR services.py:1354 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2024-01-26 22:03:18,651 ERROR services.py:1398 -- 
The last 20 lines of /tmp/ray/session_2024-01-26_22-03-16_731996_3725694/logs/dashboard.log (it contains the error message from the dashboard): 
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 16, in <module>
    from ray.job_submission import JobStatus, JobSubmissionClient
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/job_submission/__init__.py", line 2, in <module>
    from ray.dashboard.modules.job.pydantic_models import DriverInfo, JobDetails, JobType
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/pydantic_models.py", line 4, in <module>
    from ray._private.pydantic_compat import BaseModel, Field, PYDANTIC_INSTALLED
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 100, in <module>
    monkeypatch_pydantic_2_for_cloudpickle()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 58, in monkeypatch_pydantic_2_for_cloudpickle
    pydantic._internal._model_construction.SchemaSerializer = (
AttributeError: module 'pydantic._internal' has no attribute '_model_construction'
2024-01-26 22:03:18,879 INFO worker.py:1673 -- Started a local Ray instance.
[2024-01-26 22:03:19,820 E 3725694 3725694] core_worker.cc:205: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

However, tensor-parallel == 1 works fine:

lroberts@GPU77B9:~/llm_quantization$ FLASK_APP=quantized_flask_app.py FLASK_ENV=debug python3.10 -m flask run 
 * Serving Flask app 'quantized_flask_app.py' (lazy loading)
 * Environment: debug
 * Debug mode: off
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
16384
INFO 2024-01-26 22:04:03,519 abc_etal.py:195 unknown_model_name:unknown_model_version
                             Hello! logging initialized, starting up... 
INFO 2024-01-26 22:04:03,519 abc_etal.py:196 unknown_model_name:unknown_model_version
                             Git commit of model: unknown_git_commit 
INFO 2024-01-26 22:04:03,519 abc_etal.py:197 unknown_model_name:unknown_model_version
                             Git commit of cuda torch base: unknown_git_commit 
INFO 2024-01-26 22:04:05,098 abc_etal.py:200 unknown_model_name:unknown_model_version
                             Compute device available: cuda 
WARNING 01-26 22:04:06 config.py:506] Casting torch.bfloat16 to torch.float16.
WARNING 01-26 22:04:06 config.py:176] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 01-26 22:04:06 llm_engine.py:72] Initializing an LLM engine with config: model='/home/lroberts/NexusRaven-13B-AWQ/', tokenizer='/home/lroberts/NexusRaven-13B-AWQ/presaved_tokenizer', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 22:04:23 llm_engine.py:316] # GPU blocks: 4145, # CPU blocks: 327
INFO 01-26 22:04:27 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-26 22:04:27 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-26 22:04:33 model_runner.py:689] Graph capturing finished in 6 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 2024-01-26 22:04:33,205 abc_etal.py:231 unknown_model_name:unknown_model_version
                             Startup completed! 
INFO 2024-01-26 22:04:33,207 _internal.py:224 unknown_model_name:unknown_model_version
                             WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000 
INFO 2024-01-26 22:04:33,207 _internal.py:224 unknown_model_name:unknown_model_version
                             Press CTRL+C to quit 
[OpenAIMessage(role='system', content='You are a helpful assistant.'), OpenAIMessage(role='user', content='Tell me a few reasons why someone might consider higher education. Do not repeat yourself. Response:  ')]
16384
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.11s/it]
INFO 2024-01-26 22:05:05,684 _internal.py:224 unknown_model_name:unknown_model_version
                             127.0.0.1 - - [26/Jan/2024 22:05:05] "POST /sequence-generation/chat/json HTTP/1.1" 200 - 

The request is a simple curl call that looks like this:

```bash
curl -v --trace-time -X POST -H "Content-Type: application/json" --data '{"max_tokens": 500, "messages": [{"content": "You are a helpful assistant.","role": "system"}, {"content": "Tell me a few reasons why someone might consider higher education. Do not repeat yourself. Response:  ","role": "user"}], "model": "gpt-3.5-turbo", "temperature": 0}' http://localhost:5000/sequence-generation/chat/json
```

with response:

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"  There are many reasons why someone might consider higher education. Here are a few:\n\n1. To gain knowledge and skills: Higher education provides students with the opportunity to learn new knowledge and skills that can be applied in their future careers.\n2. To prepare for a career: Many people choose to pursue higher education because it is a way to prepare for a specific career. For example, a student may choose to study business because they want to work in the field.\n3. To gain a competitive edge: Higher education can provide students with a competitive edge in the job market. Many employers require a degree from a reputable institution, and having one can make a candidate more attractive to potential employers.\n4. To develop critical thinking and problem-solving skills: Higher education provides students with the opportunity to develop their critical thinking and problem-solving skills.\n5. To gain a sense of community: Higher education provides students with the opportunity to connect with other students and faculty members, which can help to create a sense of community.\n6. To gain a sense of purpose: Higher education can provide students with a sense of purpose and direction in life.\n7. To gain a sense of accomplishment: Higher education can provide students with a sense of accomplishment and pride in their achievements.\n8. To gain a sense of personal growth: Higher education can provide students with the opportunity to grow and develop as individuals.\n9. To gain a sense of independence: Higher education can provide students with the opportunity to become independent and self-sufficient.\n10. To gain a sense of fulfillment: Higher education can provide students with a sense of fulfillment and satisfaction in their lives.\n\nOverall, higher education can provide students with a wide range of benefits, including the opportunity to gain knowledge and skills, prepare for a career, gain a competitive edge, develop critical thinking and problem-solving skills, gain a sense of community, gain a sense of purpose, gain a sense of accomplishment, gain a sense of personal growth, gain a sense of independence, and gain a sense of fulfillment.","role":"assistant"}}],"created":1706306706,"id":"llama-2-7b-chat-hf","object":"chat.completion","usage":{"completion_tokens":457,"prompt_tokens":49,"total_tokens":506}}
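
For anyone who wants to script the same check, the curl request above can be approximated in Python with only the standard library. This is a sketch under assumptions: the endpoint path and payload fields are copied from the curl command, while the function names (`build_chat_payload`, `post_chat`, `first_message`) and the timeout value are illustrative.

```python
import json
import urllib.request
from typing import Any, Dict

# Endpoint taken from the curl command; adjust host and port as needed.
URL = "http://localhost:5000/sequence-generation/chat/json"


def build_chat_payload(user_prompt: str, max_tokens: int = 500) -> Dict[str, Any]:
    """Build the same JSON body that the curl command sends."""
    return {
        "max_tokens": max_tokens,
        "messages": [
            {"content": "You are a helpful assistant.", "role": "system"},
            {"content": user_prompt, "role": "user"},
        ],
        "model": "gpt-3.5-turbo",
        "temperature": 0,
    }


def post_chat(payload: Dict[str, Any], url: str = URL, timeout: float = 120.0) -> Dict[str, Any]:
    """POST the payload as JSON and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def first_message(response: Dict[str, Any]) -> str:
    """Extract the assistant text from a response shaped like the one above."""
    return response["choices"][0]["message"]["content"]
```

With the server running, `first_message(post_chat(build_chat_payload("...")))` would return the generated text.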

The error in the Ray dashboard log points to a serialization-related problem: Ray's pydantic/cloudpickle monkeypatch fails with an AttributeError.

2024-01-26 21:35:42,363 INFO utils.py:112 -- Get all modules by type: DashboardHeadModule
2024-01-26 21:35:42,407 INFO utils.py:123 -- Module ray.dashboard.modules.actor.actor_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,429 INFO utils.py:123 -- Module ray.dashboard.modules.event.event_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'grpc'
2024-01-26 21:35:42,430 INFO utils.py:123 -- Module ray.dashboard.modules.event.event_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,431 INFO utils.py:123 -- Module ray.dashboard.modules.healthz.healthz_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,431 INFO utils.py:123 -- Module ray.dashboard.modules.healthz.healthz_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,450 ERROR dashboard.py:259 -- The dashboard on node GPU77B9 failed with the following error:
Traceback (most recent call last):
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/dashboard.py", line 248, in <module>
    loop.run_until_complete(dashboard.run())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/dashboard.py", line 75, in run
    await self.dashboard_head.run()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/head.py", line 325, in run
    modules = self._load_modules(self._modules_to_load)
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/head.py", line 219, in _load_modules
    head_cls_list = dashboard_utils.get_all_modules(DashboardHeadModule)
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/utils.py", line 121, in get_all_modules
    importlib.import_module(name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 16, in <module>
    from ray.job_submission import JobStatus, JobSubmissionClient
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/job_submission/__init__.py", line 2, in <module>
    from ray.dashboard.modules.job.pydantic_models import DriverInfo, JobDetails, JobType
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/pydantic_models.py", line 4, in <module>
    from ray._private.pydantic_compat import BaseModel, Field, PYDANTIC_INSTALLED
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 100, in <module>
    monkeypatch_pydantic_2_for_cloudpickle()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 58, in monkeypatch_pydantic_2_for_cloudpickle
    pydantic._internal._model_construction.SchemaSerializer = (
AttributeError: module 'pydantic._internal' has no attribute '_model_construction'
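
The failing line in the traceback reaches `pydantic._internal._model_construction` by plain attribute access. In Python, a submodule only becomes an attribute of its parent package once something has imported it, so this kind of access can fail even when the submodule file exists. The sketch below is a small probe for that situation; `first_missing_attr` is a hypothetical helper name, and the pydantic attribute path in the usage note is copied from the traceback rather than from any documented API.

```python
import importlib
from typing import Optional


def first_missing_attr(module_name: str, dotted_attr: str) -> Optional[str]:
    """Walk an attribute chain starting at an imported module.

    Returns the dotted path of the first attribute that plain attribute
    access cannot reach, or None if the whole chain resolves.
    Raises ModuleNotFoundError if module_name itself is not installed.
    """
    obj = importlib.import_module(module_name)
    walked = module_name
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return f"{walked}.{part}"
        obj = getattr(obj, part)
        walked = f"{walked}.{part}"
    return None
```

In the environment from this report, `first_missing_attr("pydantic", "_internal._model_construction.SchemaSerializer")` would be expected to point at the same attribute named in the AttributeError.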

Relevant details about the environment:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import pydantic; print(pydantic.__version__)"
2.5.3
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import ray; print(ray.__version__)"
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2.8.0
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import torch; print(torch.__version__)"
2.1.2+cu121
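
The same version information can be gathered in one pass with `importlib.metadata`, which reads installed package metadata instead of importing each package. This is only a convenience sketch; `collect_versions` is a hypothetical helper, and the package list is an assumption based on the libraries mentioned in this report.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def collect_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names are an assumption: the libraries involved in this report.
    for name, version in collect_versions(["pydantic", "ray", "torch", "vllm"]).items():
        print(f"{name}: {version or 'not installed'}")
```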

It seems there is a known fix or workaround here: ray-project/ray#41913 (comment)

but pydantic version 2 appears to be required for the OpenAI server:

pydantic >= 2.0 # Required for OpenAI server.

Is there a suggested workaround, or should I manually downgrade pydantic to a version lower than 2.0.0?
