30 changes: 29 additions & 1 deletion moonshotai/Kimi-K2.md
@@ -27,7 +27,7 @@ A sample launch command is:
# start ray on node 0 and node 1

# node 0:
-vllm serve moonshotai/Kimi-K2-Instruct --trust-remote-code --tokenizer-mode auto --tensor-parallel-size 8 --pipeline-parallel-size 2 --dtype bfloat16 --quantization fp8 --max-model-len 2048 --max-num-seqs 1 --max-num-batched-tokens 1024 --enable-chunked-prefill --disable-log-requests --kv-cache-dtype fp8
+vllm serve moonshotai/Kimi-K2-Instruct --trust-remote-code --tokenizer-mode auto --tensor-parallel-size 8 --pipeline-parallel-size 2 --dtype bfloat16 --quantization fp8 --max-model-len 2048 --max-num-seqs 1 --max-num-batched-tokens 1024 --enable-chunked-prefill --disable-log-requests --kv-cache-dtype fp8 -dcp 8
medium

The `-dcp 8` parameter has been added to the command, but it's not a standard vLLM argument and isn't explained in this document. To help users understand how to use this configuration, please add a description of what `-dcp` does to the "Key parameter notes" section.

```

Key parameter notes:
@@ -142,4 +142,32 @@ Mean ITL (ms): 58.15
Median ITL (ms): 54.59
P99 ITL (ms): 91.18
==================================================
```

After adding '-dcp 8':
```bash
============ Serving Benchmark Result ============
Successful requests: 16
Request rate configured (RPS): 10000.00
Benchmark duration (s): 47.14
Total input tokens: 128000
Total generated tokens: 16000
Request throughput (req/s): 0.34
Output token throughput (tok/s): 339.38
Peak output token throughput (tok/s): 384.00
Peak concurrent requests: 16.00
Total Token throughput (tok/s): 3054.46
---------------Time to First Token----------------
Mean TTFT (ms): 2007.87
Median TTFT (ms): 1932.03
P99 TTFT (ms): 4680.76
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.01
Median TPOT (ms): 45.10
P99 TPOT (ms): 46.51
---------------Inter-token Latency----------------
Mean ITL (ms): 45.01
Median ITL (ms): 42.01
P99 ITL (ms): 52.01
==================================================
```
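As a quick sanity check, the throughput figures in the result above can be recomputed from the request count, token totals, and benchmark duration. The sketch below is illustrative only: the variable names are chosen here, and the small gap from the reported values (339.38 and 3054.46 tok/s) is presumably due to the duration being rounded to two decimals in the report.

```shell
# Values copied from the "-dcp 8" benchmark result above.
requests=16
duration_s=47.14
input_tokens=128000
output_tokens=16000

# Per-request token counts (the divisions are exact for these totals).
input_per_request=$(( input_tokens / requests ))
output_per_request=$(( output_tokens / requests ))

# Throughput recomputed from the rounded duration.
output_tps=$(awk -v o="$output_tokens" -v d="$duration_s" \
  'BEGIN { printf "%.2f", o / d }')
total_tps=$(awk -v i="$input_tokens" -v o="$output_tokens" -v d="$duration_s" \
  'BEGIN { printf "%.2f", (i + o) / d }')

echo "input tokens per request:  $input_per_request"
echo "output tokens per request: $output_per_request"
echo "output throughput (tok/s): $output_tps"
echo "total throughput (tok/s):  $total_tps"
```

The per-request counts work out to 8000 input and 1000 output tokens, which suggests a fixed-length synthetic workload rather than a real prompt dataset.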
Comment on lines +147 to 173
medium

Thank you for adding the benchmark results. To make this section clearer and more consistent with the rest of the document, please consider the following improvements:

  • Placement: The `vllm serve` command was updated for the 16xH800 setup, but this benchmark result is placed after the 16xH200 benchmark. Please move this section to follow the 16xH800 benchmark results for consistency.
  • Missing Command: For reproducibility, please include the `vllm bench serve` command that was used to generate these results.
  • Section Title: The title "After adding '-dcp 8':" could be more descriptive. Consider making it a sub-heading like `#### With -dcp 8` under the appropriate benchmark section.
  • Code Block Language: This code block is marked as `bash`, while others are marked as `shell`. Using a consistent language specifier (`shell` or `bash`) would improve the document's consistency.
  • Final Newline: The file is missing a newline character at the end. It's a common convention to end files with a newline.
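
On the missing-command point, the sketch below reconstructs a plausible `vllm bench serve` invocation from the reported numbers. It is an assumption, not the command the PR author ran: the random dataset and the per-request lengths are inferred from the totals (128000 / 16 input tokens and 16000 / 16 output tokens), and the request rate is taken from the "Request rate configured (RPS)" line. The script only prints the command so it can be reviewed before being run against a live server.

```shell
# Hypothetical reconstruction; the PR does not state the benchmark command.
num_prompts=16
input_len=$(( 128000 / num_prompts ))   # inferred per-request input length
output_len=$(( 16000 / num_prompts ))   # inferred per-request output length

# Assumed flags: a synthetic random dataset with fixed lengths.
cmd="vllm bench serve \
--model moonshotai/Kimi-K2-Instruct \
--trust-remote-code \
--dataset-name random \
--random-input-len $input_len \
--random-output-len $output_len \
--num-prompts $num_prompts \
--request-rate 10000"

# Print for review; paste the output into a shell to run the benchmark.
echo "$cmd"
```

If the original run used a different dataset or length distribution, the flags would need to change accordingly, which is why having the author's exact command in the document matters.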