-
Notifications
You must be signed in to change notification settings - Fork 663
docs: add multinode example #2155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughA new README file has been added under Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes Possibly related PRs
Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
examples/basics/multinode/README.md (1)
361-369: Second fenced log block also missing language-``` +```text [DEBUG] KV overlap scores: {prefill-worker-1: 15 blocks, prefill-worker-2: 8 blocks} ...
🧹 Nitpick comments (2)
examples/basics/multinode/README.md (2)
111-118: Environment-variable guidance contradicts prior best practiceYou instruct setting
ETCD_ENDPOINTSandNATS_SERVERon all nodes, but earlier learnings show the head/frontend node can rely on localhost and shouldn’t need these variables. Over-exporting encourages copy-paste mistakes and needlessly pollutes envs.Recommend clarifying:
# On worker nodes only (frontend/head can omit): export ETCD_ENDPOINTS=http://<INFRA_NODE_IP>:2379 export NATS_SERVER=nats://<INFRA_NODE_IP>:4222…and adding a short note referencing the head/worker distinction.
322-329: Add language identifier to fenced log snippet (markdownlint MD040)-``` +```text [DEBUG] KV overlap scores: {worker-1: 15 blocks, worker-2: 8 blocks} ...
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/basics/multinode/README.md(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
examples/basics/multinode/README.md (2)
Learnt from: fsaady
PR: #1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Learnt from: GuanLuo
PR: #1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
🪛 markdownlint-cli2 (0.17.2)
examples/basics/multinode/README.md
324-324: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
363-363: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to unblock but please address changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
5a6ce36 to
d3b6680
Compare
Summary by CodeRabbit