Set up a simulator for the LLMD request flow through Envoy
- Client -> Envoy: The client sends a simple JSON request.
- Envoy -> EPP: Envoy's ext_proc filter passes the request to the EPP mock (the external processor).
- EPP -> Envoy: The EPP mock builds the first part of the multipart body (containing the request_payload) and sends it back to Envoy. Crucially, it tells Envoy not to close the request stream to the upstream (vLLM) yet (see the multipart framing sketch after this list).
- Envoy -> vLLM (Prefill): Envoy forwards this first multipart chunk to the vLLM mock instance. The TCP connection and HTTP request stream remain open, waiting for more data.
- vLLM Processing (The Long Part): The vLLM mock server receives the request_payload and begins fake-processing it. This is the long step (a sleep of 100 ms to 30 s). The server's request-reading code is effectively paused, waiting for the rest of the request stream to arrive (a mock server sketch appears after this list).
- vLLM -> Envoy (Response Started): Once vLLM finishes the fake prefill and generates the first token, it sends its response headers (e.g., HTTP/1.1 200 OK) and a randomly generated first token back to Envoy.
- Envoy -> EPP (Response Notification): The ext_proc filter intercepts this response from vLLM and forwards it to the EPP.
- EPP Gets Metadata: Seeing the first token, the EPP generates a random string to use as metadata.
- EPP -> Envoy (Injects Part 2): The EPP uses its ext_proc control channel to send a new instruction to Envoy: "For that original request stream you're holding open, here is the second multipart part (the metadata) and the final boundary. Send it now and close the stream." (See the trailer-part sketch after this list.)
- Envoy -> vLLM (Request Finished): Envoy writes the second multipart part and the closing boundary to the still-open request stream to vLLM.
- vLLM Receives Full Request: The vLLM server's code, which was waiting for the request stream to finish, now receives the metadata and the end-of-stream signal. It has everything it needs and proceeds with the decode phase, generating the rest of the mock response to signal that it has finished.
- Response Flow: The rest of the response tokens flow from vLLM through Envoy to the client as normal.
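
The sketch below shows one way to build the first multipart chunk that the EPP mock hands back to Envoy. It is a minimal illustration, not the EPP's actual code: the `multipart/form-data` framing, the boundary string, and the example client JSON are assumptions; only the field name `request_payload` comes from the flow above. Note that no closing boundary is emitted, which is what keeps the request stream to vLLM open.

```python
import json

# Placeholder boundary; the simulator can pick any value as long as it matches
# the Content-Type header Envoy sends upstream.
BOUNDARY = "llmd-sim-boundary"


def client_json() -> bytes:
    """The simple JSON body the client sends (illustrative payload only)."""
    return json.dumps({"model": "mock-model", "prompt": "hello world"}).encode()


def opening_chunk(request_payload: bytes, boundary: str = BOUNDARY) -> bytes:
    """First multipart part, built by the EPP mock from the client's JSON.

    Deliberately omits the closing boundary so the request stream to vLLM
    stays open for the metadata part injected later.
    """
    return (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="request_payload"\r\n'
        "Content-Type: application/json\r\n"
        "\r\n"
    ).encode() + request_payload + b"\r\n"


if __name__ == "__main__":
    print(opening_chunk(client_json()).decode())
```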
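
A companion sketch for the part the EPP injects after it sees the first token: a random string as metadata, followed by the closing boundary that finally ends the request body. The field name `metadata` and the boundary are the same placeholders as above.

```python
import secrets

BOUNDARY = "llmd-sim-boundary"  # must match the boundary used for the first part


def random_metadata() -> bytes:
    """Random string standing in for whatever a real EPP would attach."""
    return secrets.token_hex(16).encode()


def closing_chunk(metadata: bytes, boundary: str = BOUNDARY) -> bytes:
    """Second part plus the final boundary; writing this ends the stream."""
    return (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="metadata"\r\n'
        "\r\n"
    ).encode() + metadata + f"\r\n--{boundary}--\r\n".encode()


if __name__ == "__main__":
    print(closing_chunk(random_metadata()).decode())
```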
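
Finally, a runnable sketch of the vLLM mock, assuming aiohttp and assuming the server may begin streaming its response while the request body is still open (HTTP/1.1 allows this, and the flow above depends on it). The multipart handling is deliberately naive: the first read is treated as the request_payload part, and everything up to end-of-stream is treated as the metadata part. The `/v1/completions` path, token format, and timings are placeholders.

```python
import asyncio
import random

from aiohttp import web


async def completions(request: web.Request) -> web.StreamResponse:
    # Phase 1: whatever Envoy has written so far is taken to be the first
    # multipart part (boundary + headers + request_payload).
    first_part = await request.content.readany()
    print(f"prefill input: {len(first_part)} bytes")

    # Fake prefill: the long step (100 ms to 30 s); shrink the upper bound
    # while testing if the waits get tedious.
    await asyncio.sleep(random.uniform(0.1, 30.0))

    # Start the response and emit a randomly generated first token while the
    # request stream is still open.
    resp = web.StreamResponse(status=200)
    resp.content_type = "text/plain"
    await resp.prepare(request)
    await resp.write(f"token-{random.randint(0, 9999)}\n".encode())

    # Phase 2: block until Envoy appends the metadata part and the closing
    # boundary, i.e. until the request body reaches end-of-stream.
    metadata_part = await request.content.read()
    print(f"metadata + closing boundary: {len(metadata_part)} bytes")

    # Fake decode: stream a few more tokens, then signal completion.
    for _ in range(5):
        await asyncio.sleep(0.05)
        await resp.write(f"token-{random.randint(0, 9999)}\n".encode())
    await resp.write(b"[DONE]\n")
    await resp.write_eof()
    return resp


app = web.Application()
app.add_routes([web.post("/v1/completions", completions)])

if __name__ == "__main__":
    web.run_app(app, port=8000)
```

With these pieces the two-phase body write can also be exercised without Envoy or the EPP: open a raw connection to the mock, send `opening_chunk(client_json())`, wait for the first token to come back, then send `closing_chunk(random_metadata())` and read the remaining tokens.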