add rpyc #1

seanshi-scale · 2023-09-27T21:20:06Z

Performance is not yet as good as with Ray; the rpyc path still carries noticeably more overhead.
Things I needed to do:
- Patch rpyc to use TCP nodelay.
- Add some "hacky" code to get the workers to launch, e.g. importing the worker late and setting some env vars in odd places.
Things that I see as deficiencies in the rpyc approach:
- From the timing, it looks like serializing the data and sending it over to the worker processes incurs significant overhead. Options to try: improve the serialization (e.g. serialize to JSON or similar instead of pickle/unpickle), or use shared memory between processes.
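
As a rough illustration of the shared-memory idea above, here is a minimal sketch using only the standard library. It is not what this PR does: `share_bytes`, `read_shared`, and the `step_inputs` payload are hypothetical names made up for the example, and the payload is still pickled, so this only swaps the socket transport for a shared segment. In a real setup the reader would run in a worker process and receive just the segment name and length.

```python
import pickle
from multiprocessing import shared_memory


def share_bytes(payload: bytes) -> shared_memory.SharedMemory:
    """Copy payload into a new shared-memory segment and return its handle."""
    shm = shared_memory.SharedMemory(create=True, size=max(len(payload), 1))
    shm.buf[:len(payload)] = payload
    return shm


def read_shared(name: str, length: int) -> bytes:
    """Attach to an existing segment by name and copy `length` bytes out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:length])
    finally:
        shm.close()


# Hypothetical per-step inputs; the real payload sent to workers is not shown in this PR.
step_inputs = {"input_tokens": list(range(16)), "positions": list(range(16))}
blob = pickle.dumps(step_inputs, protocol=pickle.HIGHEST_PROTOCOL)

shm = share_bytes(blob)
try:
    # Only (shm.name, len(blob)) would need to travel over the RPC channel.
    restored = pickle.loads(read_shared(shm.name, len(blob)))
finally:
    shm.close()
    shm.unlink()
```

Whether this beats the socket path would need measuring; for small payloads the pickle cost may dominate either way.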

Testing methodology:
Ran on a machine with 4xA10 GPUs. Ran the llama-2-7b chat model with tensor-parallel-size 4. Sent a single request with a short prompt and a long response (e.g. a prompt of ~5-10 tokens, asking for a response of ~500-1000 tokens).

Raw notes:

curl http://localhost:8000/generate -d '{"prompt": "What is your name?", "n": 4, "temperature": 0.1}'
curl http://localhost:8000/generate -d '{"prompt": "How do you make cookies?", "n": 1, "temperature": 0.2, "max_tokens": 1024, "stream": false}'

python -m vllm.entrypoints.api_server --tensor-parallel-size 4 --model ~/llama-weights/hf-llama-2-7b-chat/ --worker-use-rpyc

EOD 9/29:
llama-2-7b-chat, n=1:
17 tok/s with 4 workers, rpyc
53ish tok/s with 4 workers, ray
22 tok/s with 2 workers, rpyc
46 tok/s with 2 workers, ray

Midday 10/4
llama-2-7b-chat n=1:
49.5 tok/s with 4 workers, rpyc
54.7 tok/s with 4 workers, ray

~~maybe something is set wrong for torch to make things super slow? profile execute_model and see what's up~~

The slowdown came from some bottleneck in rpyc or in the Python process handling (exact cause still unclear).
(Mostly) fixed via this patch to rpyc:

--- a/rpyc/utils/factory.py
+++ b/rpyc/utils/factory.py
@@ -99,7 +99,7 @@ def connect(host, port, service=VoidService, config={}, ipv6=False, keepalive=Fa

     :returns: an RPyC connection
     """
-    s = SocketStream.connect(host, port, ipv6=ipv6, keepalive=keepalive)
+    s = SocketStream.connect(host, port, ipv6=ipv6, keepalive=keepalive, nodelay=True)
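
For context on what the patch above changes: at the socket level, "nodelay" corresponds to the `TCP_NODELAY` option, which disables Nagle's algorithm so that small writes are sent immediately instead of being batched. The sketch below shows that option on a plain Python socket; it does not use rpyc, and `connect_nodelay` is a hypothetical helper written for this example.

```python
import socket


def connect_nodelay(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection with Nagle's algorithm disabled."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # TCP_NODELAY=1: send small packets right away rather than coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


# Local demonstration: listen on an ephemeral port and connect to it.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = connect_nodelay(*server.getsockname())
nodelay_on = client.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

client.close()
server.close()
```

This matters most for request/response traffic made of many small messages, which is the pattern of one RPC per decoding step.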

Using asyncio instead of ThreadPoolExecutor: 49.5 tok/s for rpyc vs 54.7 tok/s for Ray.
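
A minimal sketch of the asyncio fan-out pattern referred to above, under the assumption that each worker call can be awaited. `call_worker` and `run_workers` are hypothetical stand-ins (the real calls go through rpyc and are not reproduced here); the sleep simulates one `execute_model` step.

```python
import asyncio


async def call_worker(rank: int, step: int) -> str:
    """Stand-in for an awaitable remote execute_model call on one worker."""
    await asyncio.sleep(0.01)  # simulated per-step latency
    return f"rank{rank}-step{step}"


async def run_workers(num_workers: int, step: int) -> list:
    """Issue the call to every worker concurrently and wait for all results."""
    return await asyncio.gather(
        *(call_worker(rank, step) for rank in range(num_workers))
    )


outputs = asyncio.run(run_workers(num_workers=4, step=0))
```

`asyncio.gather` returns results in the order the awaitables were passed, so `outputs[0]` always belongs to rank 0 regardless of which worker finishes first.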

~~to investigate: something with model execute taking 0.025ish seconds vs 0.015 seconds on occasion~~
maybe not a problem anymore? idk man my testing env might be not good?

TODO:

- Replace obtain with deliver in the main process (kind of annoying: deliver doesn't work out of the box, need to investigate).
- Continue adding prints in rpyc to figure out where the bottleneck is (a GIL issue, scheduling of worker processes, something in the socket/connection itself?). One big bottleneck was in the socket itself, fixed with TCP nodelay.
- Read over the Ray code to see what they're doing differently (i.e. how remote() is implemented).
- Read over lightllm to see if they're using rpyc in a way that avoids this issue.
- See if rpyc over pipes can be made to work.

_run_workers_async prep executors 1696448720.4718165
started at 1696448720.4718547
started at 1696448720.4719837
15871 starthead 1696448720.472019
started at 1696448720.472363
15872 starthead 1696448720.472397
15873 starthead 1696448720.4726086
started at 1696448720.4729712
15874 starthead 1696448720.4731753
15871 startexec 1696448720.4743671
15873 startexec 1696448720.4746091
15872 startexec 1696448720.475054
15874 startexec 1696448720.475109
execute_model time: 0.015847444534301758, prep inputs: 0.00012826919555664062, model forward: 0.015719175338745117, pid 15871
execute_model time: 0.015161752700805664, prep inputs: 0.00012612342834472656, model forward: 0.015035629272460938, pid 15872
15871 stopexec 1696448720.4902742
15872 stopexec 1696448720.4902766
execute_model time: 0.015143632888793945, prep inputs: 0.00013256072998046875, model forward: 0.015011072158813477, pid 15874
execute_model time: 0.01563858985900879, prep inputs: 0.00017571449279785156, model forward: 0.015462875366210938, pid 15873
15873 stopexec 1696448720.4903154
15874 stopexec 1696448720.4903169
_run_workers_async wait for gather, 1696448720.4921165
_run_workers_async end 1696448720.492134
_run_workers_async total 0.020339012145996094
_run_workers_async start 1696448720.4923127
_run_workers_async prep executors 1696448720.4923337
started at 1696448720.4923716
started at 1696448720.4925008
15871 starthead 1696448720.4925346
15872 starthead 1696448720.492742
started at 1696448720.492886
started at 1696448720.4930904
15873 starthead 1696448720.4931326
15874 starthead 1696448720.4934273
15871 startexec 1696448720.4946985
15872 startexec 1696448720.4953666
15873 startexec 1696448720.4956145
15874 startexec 1696448720.4958744
execute_model time: 0.01569080352783203, prep inputs: 0.00012826919555664062, model forward: 0.01556253433227539, pid 15872
execute_model time: 0.01636195182800293, prep inputs: 0.00014281272888183594, model forward: 0.016219139099121094, pid 15871
15872 stopexec 1696448720.5111196
15871 stopexec 1696448720.5111215
execute_model time: 0.015245199203491211, prep inputs: 0.00014734268188476562, model forward: 0.015097856521606445, pid 15874
execute_model time: 0.015501737594604492, prep inputs: 0.00017189979553222656, model forward: 0.015329837799072266, pid 15873
15873 stopexec 1696448720.5111866
15874 stopexec 1696448720.5111876
_run_workers_async wait for gather, 1696448720.5128732
_run_workers_async end 1696448720.5128915
_run_workers_async total 0.020578861236572266
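
To make the log above easier to read, the following sketch computes the per-iteration overhead from the numbers it contains: the `_run_workers_async total` minus the slowest worker's `execute_model time`. The values are copied from the two iterations shown; the variable and function names are just for this example.

```python
# Timings (seconds) copied from the two iterations in the log above.
iterations = [
    {
        "total": 0.020339012145996094,
        "execute_model": [0.015847444534301758, 0.015161752700805664,
                          0.015143632888793945, 0.01563858985900879],
    },
    {
        "total": 0.020578861236572266,
        "execute_model": [0.01569080352783203, 0.01636195182800293,
                          0.015245199203491211, 0.015501737594604492],
    },
]


def summarize(iteration: dict) -> dict:
    """Split one iteration into worker compute time and everything else."""
    slowest = max(iteration["execute_model"])  # workers run in parallel
    overhead = iteration["total"] - slowest
    return {
        "slowest_worker_s": slowest,
        "overhead_s": overhead,
        "overhead_frac": overhead / iteration["total"],
        "steps_per_s": 1.0 / iteration["total"],
    }


summaries = [summarize(it) for it in iterations]
for s in summaries:
    print(s)
```

Both iterations take roughly 20 ms in total against 15 to 16 ms of `execute_model`, leaving 4 to 5 ms per step outside the model. At one token per step (n=1), a 20 ms step is about 49 steps per second, which lines up with the 49.5 tok/s figure reported for rpyc.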


seanshi-scale added 9 commits September 27, 2023 12:26

todo

b30c21c

start adding in hooks to use rpyc

fd15d74

missed a spot

4ac4db4

.

689b5d5

idk

714c31d

something to initialize env vars for torch distributed

1c5bc06

super untested init code

04601e2

.

c0ac074

async

ff56e75

seanshi-scale changed the title ~~add rpc~~ add rpyc Sep 29, 2023

seanshi-scale added 20 commits September 29, 2023 01:04

wip added a lot of stuff

2bd498e

am hitting some runtime error probably bcz of how I'm setting up ports

52f753d

still don't know how to initialize distributed rip

b287bec

stash

459406b

save for bisecting

aa97942

get ray to not break, it's some rpyc_utils import I think

6c54a55

figured out what import breaks the ray serving mode

48ebebb

find free port

e40d4df

some asyncio bs

165b4ab

for some reason we're already done importing torch and we don't see any devices

248f0e5

got past cuda no devices

fc4d357

init workers in parallel

c640e7c

starts up but the assert outputs is wrong

371bf7d

it works??? idk if obtain(ans) is slow but watch out for that

8e6d29c

it works but it's a lot slower than ray, gotta figure out parallelism probably

37196df

tried switching over to threadpoolexecutor, more timing stuff to figure out where the bottleneck is that's causing 0.1 seconds of iteration lag on this small model, don't think threadpoolexecutor helped

fa23fd6

rip

8f8e195

help, setting keepalive on rpyc.connect doesn't help it seems?

b1547d2

todos

df301c3

Merge branch 'seanshi-scale/rpyc' of github.com:seanshi-scale/vllm into seanshi-scale/rpyc

fdbcf6e

seanshi-scale added 26 commits October 3, 2023 20:12

print conn

c6b858c

figure out connection type

ba59145

rm prints, we need to set tcp nodelay on rpyc's init, we are now at 42.2
vs ray's 56

1a62dd0

switch back to asyncio, seems a bit faster?

b57ab06

Revert "rm prints, we need to set tcp nodelay on rpyc's init, we are now at 42.2"
This reverts commit 1a62dd0.

4b527d6

print out total time

578a719

use asyncio instead of threadpoolexec for the actual loop oops

00ed936

comment out some prints, we're at about 49.5 tok/sec now

dd396f0

print prepare inputs time also

5240b46

more printing out timing

cdcf0f0

rm a print

8f6b05c

clean up more prints

ecdfb48

clean up x3

ca94c56

add back a print that's actually necessary ugh

df4a843

clean up some more prints

5ed5bad

clean up llm_engine.py

cd67e01

clean up more llm_engine.py

781ad0e

clean up more stuff

c4b469b

cleanup part 8

db596f6

more cleaning up

705fa5d

more cleanup

355983a

oops

bfc97ea

lmao

864b343

remove engine_use_rpyc

a2bfc83

more cleanup

ac86152

clean up unused worker fns

f53e1b9

seanshi-scale marked this pull request as ready for review October 11, 2023 01:40

seanshi-scale merged commit c6ee7f3 into main Oct 11, 2023
