Motivation.
In the async_engine code path, we have an option to launch the engine in a separate process using Ray:

```python
parser.add_argument('--engine-use-ray',
                    action='store_true',
                    help='Use Ray to start the LLM engine in a '
                         'separate process as the server process')
```

Originally, this option made it possible to separate the server's Python overhead from the engine's main scheduler loop.
However, a few factors have made this option less used and less popular:
- Ray is an optional component and is typically not used in single-node environments.
- The serialization and RPC overhead typically offset the theoretical performance gain.
- There are other ways to isolate the server and engine (through multiprocessing, threading, etc.).
- Recently, we have been separating the server and engine using lower-overhead approaches: [Frontend] Multiprocessing for OpenAI Server with zeromq #6883
Proposed Change.
Deprecate the flag with a warning for one release.
Remove the flag in the following release, assuming no major pushback.
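The one-release deprecation step could look roughly like the following (a hypothetical sketch, not the actual vLLM patch: the flag still parses, but using it emits a `DeprecationWarning`):

```python
import argparse
import warnings

parser = argparse.ArgumentParser()
parser.add_argument(
    '--engine-use-ray',
    action='store_true',
    help='[DEPRECATED] Use Ray to start the LLM engine in a '
         'separate process as the server process')

# Simulate a user passing the deprecated flag on the command line.
args = parser.parse_args(['--engine-use-ray'])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    if args.engine_use_ray:
        warnings.warn(
            '--engine-use-ray is deprecated and will be removed in a '
            'future release.',
            DeprecationWarning,
            stacklevel=2)
```

Existing invocations keep working for one release while users see the warning; the subsequent release can then drop the argument entirely.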
Feedback Period.
1 week
CC List.
No response
Any Other Things.
No response