@seanshi-scale commented Sep 27, 2023

  • Performance is not as good as with Ray; there is still a ways to go in terms of overhead.
  • Things I needed to do:
    • patch rpyc to enable TCP_NODELAY on its sockets
    • some "hacky" code to get the workers to launch, e.g. importing the worker late and setting some env vars in odd places
  • Things I see as deficiencies in the rpyc approach:
    • From the timing, serializing the data and sending it over to the worker processes incurs a significant overhead. Possible mitigations: improve the serialization (e.g. avoid pickle/unpickle in favor of something like JSON), or use shared memory between processes (see the sketch below).
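
A minimal sketch of the shared-memory idea (illustrative only, not code from this PR; publish_inputs/read_inputs and the numpy payload are made-up names): the driver writes the step's input array into a multiprocessing.shared_memory block and sends only the block name, shape, and dtype over rpyc, so the array itself is never pickled onto the socket.

import numpy as np
from multiprocessing import shared_memory

def publish_inputs(token_ids: np.ndarray):
    """Driver side: copy the inputs into shared memory, return cheap metadata."""
    shm = shared_memory.SharedMemory(create=True, size=token_ids.nbytes)
    buf = np.ndarray(token_ids.shape, dtype=token_ids.dtype, buffer=shm.buf)
    buf[:] = token_ids  # one memcpy instead of pickling and sending the array
    return shm, {"name": shm.name, "shape": token_ids.shape, "dtype": str(token_ids.dtype)}

def read_inputs(meta):
    """Worker side: attach to the block and view it without copying."""
    shm = shared_memory.SharedMemory(name=meta["name"])
    arr = np.ndarray(meta["shape"], dtype=np.dtype(meta["dtype"]), buffer=shm.buf)
    return shm, arr  # keep shm referenced while arr is in use

if __name__ == "__main__":
    # Demo runs both sides in one process; in practice read_inputs runs in the worker.
    inputs = np.arange(1024, dtype=np.int64)
    shm, meta = publish_inputs(inputs)      # driver: only `meta` goes over rpyc
    worker_shm, view = read_inputs(meta)    # worker: zero-copy view of the block
    assert int(view[10]) == 10
    worker_shm.close()
    shm.close()
    shm.unlink()

The same pattern would also need per-step lifecycle management (only unlinking a block once every worker has read it).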

Testing methodology:
Ran on a machine with 4x A10 GPUs, serving the llama-2-7b-chat model with tensor-parallel-size 4. Sent a single request with a fairly short prompt and a fairly long response (roughly a 5-10 token prompt, asking for a 500-1000 token response).

Raw notes:

curl http://localhost:8000/generate -d '{"prompt": "What is your name?", "n": 4, "temperature": 0.1}'
curl http://localhost:8000/generate -d '{"prompt": "How do you make cookies?", "n": 1, "temperature": 0.2, "max_tokens": 1024, "stream": false}'

python -m vllm.entrypoints.api_server --tensor-parallel-size 4 --model ~/llama-weights/hf-llama-2-7b-chat/ --worker-use-rpyc

EOD 9/29:
llama-2-7b-chat, n=1:
17 tok/s with 4 workers, rpyc
53ish tok/s with 4 workers, ray
22 tok/s with 2 workers, rpyc
46 tok/s with 2 workers, ray

Midday 10/4
llama-2-7b-chat n=1:
49.5 tok/s with 4 workers, rpyc
54.7 tok/s with 4 workers, ray

Maybe something is configured wrong for torch that makes things very slow? Profile execute_model and see what's going on.
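
A sketch of the kind of per-step instrumentation that produces the "execute_model time / prep inputs / model forward" lines in the trace further down; the wrapper and the prepare_inputs/model_forward split are illustrative stand-ins, not the exact code in this branch.

import os
import time

def timed_execute_model(worker, *args, **kwargs):
    # worker.prepare_inputs / worker.model_forward are hypothetical stand-ins
    # for however the step is split; only the timing pattern matters here.
    t0 = time.time()
    inputs = worker.prepare_inputs(*args, **kwargs)
    t1 = time.time()
    output = worker.model_forward(inputs)
    t2 = time.time()
    print(f"execute_model time: {t2 - t0}, prep inputs: {t1 - t0}, "
          f"model forward: {t2 - t1}, pid {os.getpid()}")
    return output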

There was some odd bottleneck somewhere in rpyc / Python / the worker processes.
(mostly) fixed via

--- a/rpyc/utils/factory.py
+++ b/rpyc/utils/factory.py
@@ -99,7 +99,7 @@ def connect(host, port, service=VoidService, config={}, ipv6=False, keepalive=Fa

     :returns: an RPyC connection
     """
-    s = SocketStream.connect(host, port, ipv6=ipv6, keepalive=keepalive)
+    s = SocketStream.connect(host, port, ipv6=ipv6, keepalive=keepalive, nodelay=True)
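
If carrying a patched rpyc is undesirable, a possible alternative (a sketch only; it relies on SocketStream.connect forwarding nodelay to the underlying socket, which is the same hook the one-line patch above uses) is to build the stream manually and hand it to connect_stream:

import rpyc
from rpyc.core.stream import SocketStream
from rpyc.utils.factory import connect_stream

def connect_nodelay(host, port, service=rpyc.VoidService, config=None):
    # Helper name is illustrative, not part of this PR.
    stream = SocketStream.connect(host, port, nodelay=True)  # disables Nagle's algorithm
    return connect_stream(stream, service=service, config=config or {})

# conn = connect_nodelay("localhost", 18861)  # placeholder host/port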

Using asyncio instead of a thread pool executor to dispatch the worker calls -> 49.5 tok/sec for rpyc vs 54.7 tok/sec for ray.
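
A rough sketch of the async dispatch pattern referenced above, assuming one rpyc connection per worker and a hypothetical exposed execute_model method (names are not from this PR): fire the call on every worker with rpyc.async_ so they run the step concurrently, then collect the AsyncResults without blocking the event loop.

import asyncio
import rpyc

async def run_workers_async(worker_conns, *args):
    loop = asyncio.get_running_loop()
    # Fire the RPC on every worker up front; async_() returns an AsyncResult
    # immediately instead of blocking on the reply.
    async_results = [rpyc.async_(c.root.execute_model)(*args) for c in worker_conns]

    def collect(res):
        res.wait()        # blocks an executor thread, not the event loop
        return res.value

    # Wait for all workers in the default thread pool so the event loop
    # stays free to serve other requests.
    return await asyncio.gather(
        *(loop.run_in_executor(None, collect, r) for r in async_results)
    )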

To investigate: model execute occasionally takes ~0.025 s instead of ~0.015 s.
Maybe not a problem anymore? My testing environment may not be reliable.

TODO:

  • replace obtain with deliver in the main process (somewhat annoying; deliver doesn't work out of the box, needs investigation) (see the sketch after this list)
  • continue adding prints inside rpyc to figure out where the bottleneck is (a GIL issue? scheduling of the worker processes? something in the socket/connection itself?) (one big bottleneck was in the socket itself, fixed with TCP_NODELAY)
  • read over the Ray code to see what they're doing differently (i.e. how remote() is implemented)
  • read over lightllm to see whether they use rpyc in a way that avoids this issue
  • see if rpyc over pipes can be made to work
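
A sketch of the obtain-vs-deliver distinction behind the first TODO item (fetch_output/push_inputs are placeholder names, not from this PR):

from rpyc.utils.classic import obtain, deliver

def fetch_output(remote_output):
    # obtain() copies the remote object back by value in one bulk transfer,
    # instead of a network round trip per attribute access on the netref.
    return obtain(remote_output)

def push_inputs(conn, inputs):
    # deliver() pushes a local object to the remote side, so the worker reads
    # its own copy rather than calling back into the driver for every field.
    return deliver(conn, inputs)

Raw timing trace from two consecutive _run_workers_async steps (one execute_model line per worker pid):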
_run_workers_async prep executors 1696448720.4718165
started at 1696448720.4718547
started at 1696448720.4719837
15871 starthead 1696448720.472019
started at 1696448720.472363
15872 starthead 1696448720.472397
15873 starthead 1696448720.4726086
started at 1696448720.4729712
15874 starthead 1696448720.4731753
15871 startexec 1696448720.4743671
15873 startexec 1696448720.4746091
15872 startexec 1696448720.475054
15874 startexec 1696448720.475109
execute_model time: 0.015847444534301758, prep inputs: 0.00012826919555664062, model forward: 0.015719175338745117, pid 15871
execute_model time: 0.015161752700805664, prep inputs: 0.00012612342834472656, model forward: 0.015035629272460938, pid 15872
15871 stopexec 1696448720.4902742
15872 stopexec 1696448720.4902766
execute_model time: 0.015143632888793945, prep inputs: 0.00013256072998046875, model forward: 0.015011072158813477, pid 15874
execute_model time: 0.01563858985900879, prep inputs: 0.00017571449279785156, model forward: 0.015462875366210938, pid 15873
15873 stopexec 1696448720.4903154
15874 stopexec 1696448720.4903169
_run_workers_async wait for gather, 1696448720.4921165
_run_workers_async end 1696448720.492134
_run_workers_async total 0.020339012145996094
_run_workers_async start 1696448720.4923127
_run_workers_async prep executors 1696448720.4923337
started at 1696448720.4923716
started at 1696448720.4925008
15871 starthead 1696448720.4925346
15872 starthead 1696448720.492742
started at 1696448720.492886
started at 1696448720.4930904
15873 starthead 1696448720.4931326
15874 starthead 1696448720.4934273
15871 startexec 1696448720.4946985
15872 startexec 1696448720.4953666
15873 startexec 1696448720.4956145
15874 startexec 1696448720.4958744
execute_model time: 0.01569080352783203, prep inputs: 0.00012826919555664062, model forward: 0.01556253433227539, pid 15872
execute_model time: 0.01636195182800293, prep inputs: 0.00014281272888183594, model forward: 0.016219139099121094, pid 15871
15872 stopexec 1696448720.5111196
15871 stopexec 1696448720.5111215
execute_model time: 0.015245199203491211, prep inputs: 0.00014734268188476562, model forward: 0.015097856521606445, pid 15874
execute_model time: 0.015501737594604492, prep inputs: 0.00017189979553222656, model forward: 0.015329837799072266, pid 15873
15873 stopexec 1696448720.5111866
15874 stopexec 1696448720.5111876
_run_workers_async wait for gather, 1696448720.5128732
_run_workers_async end 1696448720.5128915
_run_workers_async total 0.020578861236572266

seanshi-scale changed the title from "add rpc" to "add rpyc" on Sep 29, 2023
seanshi-scale marked this pull request as ready for review on October 11, 2023
seanshi-scale merged commit c6ee7f3 into main on Oct 11, 2023