
Conversation

@seanshi-scale commented Oct 11, 2023

tl;dr: RPyC as implemented in this PR is still slower than Ray, although there are probably other things we could try to speed up some of the bottlenecks in this implementation.

Benchmarking on a machine with 4 A10 GPUs, and running with tensor-parallel=4, i.e.

python -m vllm.entrypoints.api_server --tensor-parallel-size 4 --model ~/path/to/model --worker-use-rpyc

and running a single request, i.e.

curl http://localhost:8000/generate -d '{"prompt": "lorem ipsum sit dolor amet", "n": 1, "temperature": 0.01, "max_tokens": 1024, "stream": false}'

I'm getting roughly 49.9 tokens/sec with the RPyC implementation and 54.7 tokens/sec with the Ray implementation as reported by vllm itself.

The main bottleneck at this point seems to be sending the data from the engine process to the worker processes, i.e. the obtain() calls in the RPyC worker class's exposed_execute_method call. I think this happens because the objects have to be pickled on the engine process's side in response to a request from the worker processes, and this happens once per worker process. There's definitely room to make this faster, e.g. by serializing the objects better so they don't have to be pickled/unpickled, by using shared memory between the processes, or maybe by calling obtain() on each argument directly.
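
Concretely, the pattern described above is roughly the following (a hypothetical sketch, not the PR's actual worker class; the WorkerService name and _worker attribute are illustrative):

import rpyc
from rpyc.utils.classic import obtain

class WorkerService(rpyc.Service):
    """Hypothetical sketch of the obtain() pattern described above."""

    def __init__(self, worker):
        # `worker` is whatever object the engine process wants to drive remotely.
        self._worker = worker

    def exposed_execute_method(self, method_name, *args, **kwargs):
        # args/kwargs arrive as netrefs (proxies); obtain() materializes a local
        # copy of each one, which forces a pickle on the engine process's side
        # and an unpickle here, repeated in every worker process.
        local_args = [obtain(a) for a in args]
        local_kwargs = {k: obtain(v) for k, v in kwargs.items()}
        return getattr(self._worker, method_name)(*local_args, **local_kwargs)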

Aside from the changes in this PR, I had to make the following change inside RPyC's code to avoid a serious slowdown:

--- a/rpyc/utils/factory.py
+++ b/rpyc/utils/factory.py
@@ -99,7 +99,7 @@ def connect(host, port, service=VoidService, config={}, ipv6=False, keepalive=False):
    """

     :returns: an RPyC connection
     """
-    s = SocketStream.connect(host, port, ipv6=ipv6, keepalive=keepalive)
+    s = SocketStream.connect(host, port, ipv6=ipv6, keepalive=keepalive, nodelay=True)

Without this patch, I was getting roughly 15 tokens/second with the same setup as before.
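
(An aside, not part of the PR: the same TCP_NODELAY behavior can in principle be enabled without patching RPyC, since SocketStream.connect already accepts a nodelay flag; a minimal sketch with an assumed host/port:)

import rpyc
from rpyc.core.stream import SocketStream

# Build the stream ourselves with nodelay=True (disables Nagle's algorithm),
# then wrap it in a connection, instead of editing rpyc.utils.factory.connect.
stream = SocketStream.connect("localhost", 18861, nodelay=True)  # host/port assumed
conn = rpyc.connect_stream(stream, service=rpyc.VoidService)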

Also, there's some messiness in terms of when certain env vars get set and when torch or other libraries get imported, which I had to hack around.

seanshi-scale and others added 30 commits September 27, 2023 12:26
…re out where the bottleneck is that's causing 0.1 seconds of iteration lag on this small model, don't think threadpoolexecutor helped
@Juelianqvq (Contributor) commented Nov 2, 2023

The asyncio.to_thread method seems to be supported only in Python >= 3.9; is there any workaround?

@seanshi-scale (Author)

I did find https://stackoverflow.com/questions/68523752/python-module-asyncio-has-no-attribute-to-thread; I don't have the time to implement it myself, but it could be worth a shot.
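
For reference, a backport sketch along those lines (essentially what asyncio.to_thread does internally in 3.9, using run_in_executor; not code from this PR):

import asyncio
import contextvars
import functools

async def to_thread(func, *args, **kwargs):
    # Rough backport of asyncio.to_thread for Python 3.8: run func in the
    # default executor while propagating the current contextvars context.
    loop = asyncio.get_running_loop()
    ctx = contextvars.copy_context()
    call = functools.partial(ctx.run, func, *args, **kwargs)
    return await loop.run_in_executor(None, call)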

@Juelianqvq (Contributor)

> I did find https://stackoverflow.com/questions/68523752/python-module-asyncio-has-no-attribute-to-thread; I don't have the time to implement it myself, but it could be worth a shot.

Yeah, I've tried that answer before, and throughput on tp=2 llama13b dropped from 35 to 7.8 tokens/s.
tp=1 not tested yet.

@seanshi-scale (Author)

I haven't found any other workarounds, unfortunately; I've only been testing with a later version of Python.

@zhuohan123 (Member) commented Jan 12, 2024

Closing this PR in favor of #2221. Please feel free to reopen the PR if you have anything to add.

@zhuohan123 closed this Jan 12, 2024
minmin-intel pushed a commit to minmin-intel/vllm that referenced this pull request Jul 15, 2025
Set vllm-hpu-extension revision to 80985d3