[Do not merge] Trying out rpyc as a replacement for ray #1318
Conversation
bring in 0.2.0 changes
…re out where the bottleneck is that's causing 0.1 seconds of iteration lag on this small model, don't think threadpoolexecutor helped
…to seanshi-scale/rpyc
The `asyncio.to_thread` method seems to be supported only in Python >= 3.9, is there any workaround?
I did find https://stackoverflow.com/questions/68523752/python-module-asyncio-has-no-attribute-to-thread; I don't have the time to implement it myself, but it's worth a shot?
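For reference, a minimal backport along the lines of that answer (assuming Python 3.8, where `asyncio.to_thread` isn't available yet) would look something like this; it mirrors what CPython does internally:

```python
import asyncio
import contextvars
import functools


async def to_thread(func, /, *args, **kwargs):
    """Backport of asyncio.to_thread for Python < 3.9."""
    loop = asyncio.get_running_loop()
    # Copy the current context so context variables propagate into the thread.
    ctx = contextvars.copy_context()
    func_call = functools.partial(ctx.run, func, *args, **kwargs)
    # Run the blocking call in the default ThreadPoolExecutor.
    return await loop.run_in_executor(None, func_call)
```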
Yeah, I've tried that answer before, but throughput on tp=2 llama-13b dropped from 35 to 7.8 tokens/s.
I haven't found any other workarounds, unfortunately; I've only been testing with a later version of Python.
Closing this PR in favor of #2221. Please feel free to reopen it if you have anything to add.
tl;dr: RPyC is still slower than Ray as implemented in this PR, although there probably are other things that we could try to speed up some of the bottlenecks in this implementation.
Benchmarking on a machine with 4 A10 GPUs, running with tensor-parallel=4, and sending a single request:
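The exact launch commands aren't reproduced above; a roughly equivalent setup using vLLM's offline `LLM` API (the model and sampling length below are placeholders, not what was actually benchmarked) would look something like:

```python
from vllm import LLM, SamplingParams

# Hypothetical reconstruction of the benchmark setup: tensor-parallel
# across the 4 A10 GPUs, one request at a time. The actual model used
# in the benchmark isn't specified in this thread.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=4)
params = SamplingParams(max_tokens=256)  # assumed generation length
outputs = llm.generate(["Hello, my name is"], params)  # single request
print(outputs[0].outputs[0].text)
```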
I'm getting roughly 49.9 tokens/sec with the RPyC implementation and 54.7 tokens/sec with the Ray implementation as reported by vllm itself.
The main bottleneck at this point seems to be sending the data from the engine process to the worker processes, i.e. the `obtain()` calls in the RPyC worker class's `exposed_execute_method`. I think this happens because the objects have to be pickled on the engine process's side in response to a request from each worker process, and this happens once per worker. There's definitely some room to make this faster, e.g. by serializing the objects better so they don't have to be pickled/unpickled, using shared memory between the processes, or maybe calling `obtain()` on each argument directly.
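For context, the pattern being described looks roughly like the sketch below. This is a simplified illustration rather than the PR's actual worker code; `self.worker` and the argument layout are assumed:

```python
import rpyc
from rpyc.utils.classic import obtain


class WorkerService(rpyc.Service):
    # Simplified sketch of an RPyC worker service as described above
    # (illustrative only; attribute names are assumptions).

    def exposed_execute_method(self, method_name, *args, **kwargs):
        # obtain() copies each netref argument by value: the engine process
        # pickles the object in response to a request from this worker
        # process, and that round trip happens once per worker.
        args = obtain(args)
        kwargs = obtain(kwargs)
        # self.worker would be the underlying model worker instance.
        return getattr(self.worker, method_name)(*args, **kwargs)
```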
Aside from this PR, I had to patch RPyC's own code to avoid a serious slowdown; without that patch, I was getting roughly 15 tokens/sec with the same setup as before.
Also, there's some messiness around when certain env vars get set and when torch or other libraries get imported, which I had to hack around.