multi-part-simulator

Set up a simulator for the LLMD request flow with Envoy:

  1. Client -> Envoy: The client sends a simple JSON request (see the client sketch after this list).
  2. Envoy -> EPP: Envoy's ext_proc filter streams the request to the EPP (the external processor) mock.
  3. EPP -> Envoy: The EPP mock creates the first part of the multipart body (containing the request_payload) and sends it back to Envoy. Crucially, it tells Envoy not to close the request stream to the upstream (vLLM) yet (see the chunk-builder and EPP sketches after this list).
  4. Envoy -> vLLM (Prefill): Envoy forwards this first multipart chunk to the vLLM mock instance. The TCP connection and HTTP request stream remain open, waiting for more data.
  5. vLLM Processing (The Long Part): The vLLM mock server receives the request_payload and begins fake-processing it. This is the long step (a sleep of 100 ms to 30 s). The server's request-reading code is effectively paused, waiting for the rest of the request stream to arrive (see the vLLM mock sketch after this list).
  6. vLLM -> Envoy (Response Started): Once vLLM finishes fake prefill and generates the first token, it sends its response headers (e.g., HTTP/1.1 200 OK) and the first token (randomly generated) back towards Envoy.
  7. Envoy -> EPP (Response Notification): The ext_proc filter intercepts this response from vLLM and forwards it to the EPP.
  8. EPP Gets Metadata: Now the EPP sees the first token and generates a random string as metadata.
  9. EPP -> Envoy (Injects Part 2): The EPP uses its ext_proc control channel to send a new instruction to Envoy: "For that original request stream you're holding open, here is the second multipart part (metadata) and the final boundary. Send it now and close the stream."
  10. Envoy -> vLLM (Request Finished): Envoy writes the second multipart part and the closing boundary to the still-open request stream to vLLM.
  11. vLLM Receives Full Request: The vLLM server's code, which was waiting for the request stream to finish, now receives the metadata and the end-of-stream signal. It has everything it needs, proceeds with the decode phase, and generates a mock response indicating it has finished.
  12. Response Flow: The rest of the response tokens flow from vLLM through Envoy to the client as normal.
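
The sketches below flesh out the flow in Go. First, the two installments of the multipart body. This is a minimal sketch of how the EPP mock might assemble them; the boundary string and form-data framing are illustrative assumptions, and only the part names request_payload and metadata come from the flow above.

```go
// chunks.go: how the EPP mock might assemble the two installments of
// the multipart request body. Boundary and framing are assumptions.
package main

import "fmt"

const boundary = "llmd-sim-boundary" // hypothetical fixed boundary

// firstChunk wraps the client's JSON as the request_payload part.
// Deliberately no closing boundary yet: the request stream to vLLM
// stays open until the metadata part is injected.
func firstChunk(requestJSON []byte) []byte {
	return []byte(fmt.Sprintf(
		"--%s\r\nContent-Disposition: form-data; name=\"request_payload\"\r\nContent-Type: application/json\r\n\r\n%s\r\n",
		boundary, requestJSON))
}

// secondChunk carries the metadata part plus the final boundary,
// which is vLLM's end-of-body signal.
func secondChunk(metadata string) []byte {
	return []byte(fmt.Sprintf(
		"--%s\r\nContent-Disposition: form-data; name=\"metadata\"\r\n\r\n%s\r\n--%s--\r\n",
		boundary, metadata, boundary))
}
```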
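
Next, the EPP's side of steps 3, 8, and 9 as an ext_proc Process loop. This is a rough sketch, not the repo's implementation: it assumes the ext_proc v3 gRPC bindings from go-control-plane and Envoy's duplex streamed body mode (StreamedBodyResponse), which is what allows a body mutation without ending the stream; the port and metadata format are made up, and a real EPP would also rewrite Content-Type/Content-Length with a header mutation. Whether Envoy accepts a request-body mutation this late in the exchange is exactly the timing the simulator exercises.

```go
// epp_mock.go: rough sketch of the EPP's ext_proc loop, reusing the
// chunk helpers from the previous sketch (same package).
package main

import (
	"fmt"
	"io"
	"log"
	"math/rand"
	"net"

	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
)

type eppServer struct {
	extprocv3.UnimplementedExternalProcessorServer
}

// requestBodyMutation wraps a chunk as a streamed request-body
// mutation; endOfStream tells Envoy whether it may now finish the
// upstream request stream.
func requestBodyMutation(chunk []byte, endOfStream bool) *extprocv3.ProcessingResponse {
	return &extprocv3.ProcessingResponse{
		Response: &extprocv3.ProcessingResponse_RequestBody{
			RequestBody: &extprocv3.BodyResponse{
				Response: &extprocv3.CommonResponse{
					BodyMutation: &extprocv3.BodyMutation{
						Mutation: &extprocv3.BodyMutation_StreamedResponse{
							StreamedResponse: &extprocv3.StreamedBodyResponse{
								Body:        chunk,
								EndOfStream: endOfStream,
							},
						},
					},
				},
			},
		},
	}
}

func (s *eppServer) Process(stream extprocv3.ExternalProcessor_ProcessServer) error {
	injected := false
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		switch m := req.Request.(type) {
		case *extprocv3.ProcessingRequest_RequestBody:
			// Step 3: replace the client JSON with multipart part 1 and
			// keep the upstream request stream open (endOfStream false).
			if err := stream.Send(requestBodyMutation(firstChunk(m.RequestBody.Body), false)); err != nil {
				return err
			}
		case *extprocv3.ProcessingRequest_ResponseBody:
			// Steps 8-9: the first token has arrived; inject part 2 plus
			// the closing boundary, letting Envoy finish the stream.
			if !injected {
				injected = true
				meta := fmt.Sprintf("meta-%08x", rand.Uint32())
				if err := stream.Send(requestBodyMutation(secondChunk(meta), true)); err != nil {
					return err
				}
			}
			// Pass the response body itself through unmodified.
			err := stream.Send(&extprocv3.ProcessingResponse{
				Response: &extprocv3.ProcessingResponse_ResponseBody{
					ResponseBody: &extprocv3.BodyResponse{Response: &extprocv3.CommonResponse{}},
				},
			})
			if err != nil {
				return err
			}
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":9002") // assumed EPP port
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	extprocv3.RegisterExternalProcessorServer(srv, &eppServer{})
	log.Fatal(srv.Serve(lis))
}
```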
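
The vLLM mock (steps 4-6 and 10-12) can be sketched as a standalone program with only the Go standard library: multipart.Reader.NextPart blocks until Envoy writes the next part, which models the "paused" request-reading described above, and EnableFullDuplex (Go 1.21+) permits writing response bytes while the request body is still open on HTTP/1.1. The endpoint path, port, and token text are assumptions.

```go
// vllm_mock.go: standalone sketch of the vLLM mock server.
package main

import (
	"fmt"
	"io"
	"log"
	"math/rand"
	"mime"
	"mime/multipart"
	"net/http"
	"time"
)

func handle(w http.ResponseWriter, r *http.Request) {
	_, params, err := mime.ParseMediaType(r.Header.Get("Content-Type"))
	if err != nil || params["boundary"] == "" {
		http.Error(w, "expected a multipart request", http.StatusBadRequest)
		return
	}
	mr := multipart.NewReader(r.Body, params["boundary"])

	// Allow reading the request body after response bytes have been
	// written (HTTP/1.x full-duplex, Go 1.21+).
	rc := http.NewResponseController(w)
	rc.EnableFullDuplex()

	// Steps 4-5: read part 1 (request_payload), then fake the prefill.
	part, err := mr.NextPart()
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	payload, _ := io.ReadAll(part)
	log.Printf("prefill on %d-byte payload", len(payload))
	time.Sleep(time.Duration(100+rand.Intn(29900)) * time.Millisecond) // 100 ms - 30 s

	// Step 6: response headers plus the first random token go out now.
	w.WriteHeader(http.StatusOK)
	fmt.Fprintf(w, "token-%d ", rand.Intn(1<<20))
	rc.Flush()

	// Step 11: NextPart blocks until Envoy writes part 2 (metadata)
	// and the closing boundary onto the still-open request stream.
	meta, err := mr.NextPart()
	if err != nil {
		log.Printf("metadata never arrived: %v", err)
		return
	}
	metaBytes, _ := io.ReadAll(meta)
	log.Printf("decoding with metadata %q", metaBytes)

	// Step 12: stream the remaining mock tokens and finish.
	for i := 0; i < 5; i++ {
		fmt.Fprintf(w, "token-%d ", rand.Intn(1<<20))
		rc.Flush()
	}
}

func main() {
	http.HandleFunc("/v1/completions", handle)
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```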
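
Finally, a throwaway client for steps 1 and 12, also a standalone program. The Envoy listener address is an assumption; the two bursts of printed tokens make the prefill pause and the post-metadata decode visible.

```go
// client.go: posts the simple JSON request and prints streamed tokens.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumed Envoy listener address and route.
	resp, err := http.Post("http://localhost:10000/v1/completions",
		"application/json", strings.NewReader(`{"prompt": "hello"}`))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Tokens arrive incrementally: the first after the fake prefill,
	// the rest after the EPP injects the metadata part.
	buf := make([]byte, 256)
	for {
		n, err := resp.Body.Read(buf)
		if n > 0 {
			fmt.Print(string(buf[:n]))
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
}
```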
