feat: support MoE model in SLA Planner Sglang #3185
Conversation
Walkthrough

Adds MoE-aware profiling across profiler codepaths, introduces attention_dp_size and GPU-count sweeps, updates config modifiers with DEP/TEP and multinode support, renames SGLang service keys to short names, adds a MoE profiling Job manifest, adjusts decode request range logic, updates docs, the Dockerfile, and deployment naming behavior, and adds MoE dry-run tests.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
actor U as User
participant PS as profile_sla.py
participant CM as ConfigModifier (SGLang/VLLM/TRTLLM)
participant PF as Prefill Profiler
participant DC as Decode Profiler
participant BK as Backend Engine
participant ST as PVC (/data)
U->>PS: run_profile(args)
PS->>PS: Parse flags (is_moe_model, num_gpus_per_node)
alt MoE
PS->>CM: convert_config(..., is_moe_model=True)
PS->>CM: set_config_tep_size / set_config_dep_size(num_gpus, per-node)
PS->>DC: profile_decode(..., attention_dp_size=num_gpus)
else Dense
PS->>CM: convert_config(..., is_moe_model=False)
PS->>CM: set_config_tp_size(tp_size)
PS->>DC: profile_decode(..., attention_dp_size=1)
end
PS->>PF: profile_prefill(...)
par Prefill
PF->>BK: deploy & run prefill
BK-->>PF: TTFT/throughput
and Decode
DC->>DC: get_num_request_range(attn_dp, max_conc, granularity)
DC->>BK: deploy & run decode sweep
BK-->>DC: ITL/throughput
end
PS->>ST: Write results/plots
PS-->>U: Summary JSON/plots
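In the decode branch above, get_num_request_range(attn_dp, max_conc, granularity) chooses the concurrency levels for the decode sweep. As a rough illustration only (the real function lives in benchmarks/profiler/utils/profile_decode.py and may compute the range differently), an attention-DP-aware sweep could start at one request per DP rank and stay under the global cap:

```python
# Hypothetical sketch -- not the repo's implementation. It only illustrates how
# attention_dp_size and a global concurrency cap could shape the decode sweep.
def get_num_request_range(
    attention_dp_size: int,
    max_concurrency: int,
    interpolation_granularity: int,
) -> list[int]:
    lo = attention_dp_size  # at least one in-flight request per attention DP rank
    hi = max(max_concurrency, lo)
    step = max((hi - lo) // max(interpolation_granularity - 1, 1), 1)
    levels = list(range(lo, hi + 1, step))
    if levels[-1] != hi:
        levels.append(hi)  # always include the cap itself
    return levels

print(get_num_request_range(1, 64, 5))    # dense-style sweep: [1, 16, 31, 46, 61, 64]
print(get_num_request_range(8, 2000, 5))  # MoE/DEP-style sweep: [8, 506, 1004, 1502, 2000]
```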
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs
Pre-merge checks

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
benchmarks/profiler/utils/profile_decode.py (1)
143-170: Regression: the `_profile_decode_helper` signature change breaks `profile_decode_aiconfigurator`. The missing `attention_dp_size` argument causes a TypeError at runtime. Apply:
-def profile_decode_aiconfigurator( +def profile_decode_aiconfigurator( work_dir, num_gpus, max_kv_tokens, max_context_length, interpolation_granularity, ai_configurator_perf_estimator: AIConfiguratorPerfEstimator, **model_config_kwargs, ): @@ - return _profile_decode_helper( + return _profile_decode_helper( work_dir, num_gpus, max_kv_tokens, max_context_length, interpolation_granularity, get_itl_and_thpt_per_gpu, - ) + # For dense models attention_dp_size=1; MoE path doesn't use AI Configurator. + model_config_kwargs.get("attention_dp_size", 1) if "attention_dp_size" in model_config_kwargs else 1, + )

Optionally, make the parameter explicit:

- ai_configurator_perf_estimator: AIConfiguratorPerfEstimator, + ai_configurator_perf_estimator: AIConfiguratorPerfEstimator, + attention_dp_size: int = 1, @@ - get_itl_and_thpt_per_gpu, - # For dense models attention_dp_size=1; MoE path doesn't use AI Configurator. - model_config_kwargs.get("attention_dp_size", 1) if "attention_dp_size" in model_config_kwargs else 1, + get_itl_and_thpt_per_gpu, + attention_dp_size,

benchmarks/profiler/utils/config.py (3)
552-558: Bug: replicas set on the raw dict never makes it into the returned config. You mutate `config[...]` then return `cfg.model_dump()`, so `replicas` stays unchanged. Set it on `cfg` instead.

- # set num workers to 1 - decode_worker_config = config["spec"]["services"][ WORKER_COMPONENT_NAMES["sglang"].decode_worker_k8s_name ] - decode_worker_config["replicas"] = 1 - - return cfg.model_dump() + # set num workers to 1 + decode_worker_config = cfg.spec.services[ WORKER_COMPONENT_NAMES["sglang"].decode_worker_k8s_name ] + decode_worker_config.replicas = 1 + + return cfg.model_dump()
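The pitfall in one self-contained example (pydantic v2; the Config/Spec/Service models here are illustrative stand-ins, not the profiler's actual classes): model_validate copies the raw dict into model instances, so later mutations of the dict never reach model_dump().

```python
from pydantic import BaseModel

class Service(BaseModel):
    replicas: int = 0

class Spec(BaseModel):
    services: dict[str, Service]

class Config(BaseModel):
    spec: Spec

raw = {"spec": {"services": {"decode": {"replicas": 4}}}}
cfg = Config.model_validate(raw)  # deep-copies the dict into model instances

raw["spec"]["services"]["decode"]["replicas"] = 1       # mutates the raw dict only
print(cfg.model_dump()["spec"]["services"]["decode"])   # {'replicas': 4} -- unchanged

cfg.spec.services["decode"].replicas = 1                # mutate the model instead
print(cfg.model_dump()["spec"]["services"]["decode"])   # {'replicas': 1}
```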
451-454: Fix NameError in exception logging (the line variable may be undefined). On errors before the loop, `line` is undefined. Log the filename instead.

- except Exception as e: - logger.warning( - f"Failed to parse KV cache size from line: {line}. Error: {e}" - ) + except Exception as e: + logger.warning( + "Failed to parse KV cache size from log file %s. Error: %s", + dynamo_log_fn, + e, + )
274-281: Make removal of the flag robust: avoid ValueError if the flag is missing. Use a presence check before remove.

- # remove --is-prefill-worker flag - args.remove("--is-prefill-worker") + # remove --is-prefill-worker flag + if "--is-prefill-worker" in args: + args.remove("--is-prefill-worker")
🧹 Nitpick comments (15)
docs/benchmarks/pre_deployment_profiling.md (3)
66-68: Clarify TEP-only prefill stance for MoE. Consider adding one sentence explaining why DEP prefill is suboptimal (e.g., memory-bound tradeoffs), to avoid confusion for users comparing with dense TP.
72-74: Future TEP decode note is helpful; link to tracker/issue if available. If there's a tracking issue/PR for TEP decode support, add the link here for discoverability.
124-131: Align naming between docs and CLI: map DEP/TEP to attention_dp_size. You introduced attention_dp_size in the CLI; add a parenthetical here mapping DEP to attention_dp_size to reduce terminology drift.
deploy/utils/dynamo_deployment.py (1)
142-193: Port-forward robustness: add --address and a readiness check. To avoid binding on all interfaces and flaky sleeps:
- Add "--address 127.0.0.1".
- Probe the local URL before returning instead of fixed sleep.
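A standalone sketch of that readiness probe, pulled out as a hypothetical helper (the diff below inlines the same idea):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 15.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.2)
    return False

# e.g. after starting `kubectl port-forward ... <local_port>:<frontend_port>`:
# if not wait_for_port("127.0.0.1", local_port): raise RuntimeError("port-forward not ready")
```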
Apply this diff:
cmd = [ "kubectl", "port-forward", f"svc/{self.service_name}", f"{local_port}:{self.frontend_port}", + "--address", + "127.0.0.1", "-n", self.namespace, ] @@ - print("Waiting for port forward to establish...") - time.sleep(3) + print("Waiting for port forward to establish...") + start = time.time() + while time.time() - start < 15: + try: + import socket as _s + with _s.create_connection(("127.0.0.1", local_port), timeout=0.5): + break + except Exception: + time.sleep(0.2)benchmarks/profiler/profile_endpoint.py (1)
82-87: New --attention_dp_size flag: document allowed values and interactions. Add help text clarifying the relation to DEP (e.g., DEP size), valid ranges, and when it's used (decode/MoE only).
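For example, the help string could spell out the DEP relationship explicitly. The flag name comes from this PR; the wording, default, and surrounding parser are only a suggestion:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--attention_dp_size",
    type=int,
    default=1,
    help=(
        "Attention data-parallel (DEP) size used for the decode sweep. "
        "Keep the default of 1 for dense models; for MoE/DEP deployments set it "
        "to the number of attention DP ranks."
    ),
)
args = parser.parse_args(["--attention_dp_size", "8"])
print(args.attention_dp_size)  # 8
```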
container/Dockerfile.sglang-wideep (1)
61-61: Exclude common local/build/test files from the Docker context (update .dockerignore). Your .dockerignore blocks many model artifacts and .git but is missing common patterns that reduce image size and cache churn; add entries like:
- .venv/
- venv/
- __pycache__/
- *.pyc
- build/
- dist/
- *.egg-info/
- .pytest_cache/
- tests/
- node_modules/
Location: .dockerignore (repo root) — add the above patterns.
benchmarks/profiler/utils/defaults.py (1)
19-21: Make DECODE_MAX_CONCURRENCY configurable (env override). Hard-coding 2000 is fine as a cap; allow an env override to tune it without code changes.
Apply:
-# for MoE models with attn-dp, we might hit this limit -DECODE_MAX_CONCURRENCY = 2000 +# for MoE models with attn-dp, we might hit this limit +import os # at top of file if not present +DECODE_MAX_CONCURRENCY = int(os.getenv("DECODE_MAX_CONCURRENCY", "2000"))

Add the import near other imports:

import os

benchmarks/profiler/deploy/profile_sla_moe_job.yaml (1)
1-61: Harden pod security context; optionally pin num-gpus-per-node. Add non-root, drop caps, and disallow privilege escalation. Consider passing `--num-gpus-per-node` explicitly to document granularity. Apply:

- name: profile-sla image: ${DOCKER_IMAGE} resources: requests: cpu: "32" memory: "50Gi" + securityContext: + runAsNonRoot: true + runAsUser: 1000 + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + capabilities: + drop: ["ALL"] env: @@ - --is-moe-model + - --num-gpus-per-node + - "8"

If your cluster enforces Pod Security Standards, confirm these settings comply with your admission controller.
benchmarks/profiler/utils/config.py (7)
439-449: Robust KV concurrency parsing (avoid brittle splits and slicing). Current parsing relies on string splits and `[:-1]`, risking wrong values if the format changes. Prefer regex.

- if "Maximum concurrency for" in line: - line = line.strip().split("Maximum concurrency for ")[1] - token_count = int( - line.split(" tokens per request: ")[0].replace(",", "") - ) - concurrency = float(line.split(" tokens per request: ")[1][:-1]) - - logger.info( - f"Found KV cache info: {token_count} x {concurrency} = {int(token_count * concurrency)}" - ) - return int(token_count * concurrency) + if "Maximum concurrency for" in line: + m = re.search( + r"Maximum concurrency for\s+([\d,]+)\s+tokens per request:\s*([\d.]+)", + line, + ) + if m: + token_count = int(m.group(1).replace(",", "")) + concurrency = float(m.group(2)) + total = int(token_count * concurrency) + logger.info("Found KV cache info: %d x %s = %d", token_count, m.group(2), total) + return total
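The same regex, shown as a self-contained function with an assumed example log line (verify the exact wording against real SGLang output before relying on it):

```python
import re
from typing import Optional

_KV_RE = re.compile(
    r"Maximum concurrency for\s+([\d,]+)\s+tokens per request:\s*([\d.]+)"
)

def parse_kv_capacity(line: str) -> Optional[int]:
    """Return token_count * concurrency from a 'Maximum concurrency for ...' log line."""
    m = _KV_RE.search(line)
    if not m:
        return None
    token_count = int(m.group(1).replace(",", ""))
    concurrency = float(m.group(2))
    return int(token_count * concurrency)

# Assumed example line; the trailing 'x' is ignored by the pattern.
print(parse_kv_capacity("Maximum concurrency for 8,192 tokens per request: 100.50x"))  # 823296
```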
788-802: Narrow the broad exception in KV-cache parsing. Avoid catching bare Exception; use specific exceptions.
- except Exception as e: - logger.warning(f"Failed to parse KV cache size from log file. Error: {e}") + except (OSError, ValueError) as e: + logger.warning("Failed to parse KV cache size from log file. Error: %s", e)
432-436: Silence the unused parameter in the VLLM KV-cache method (ruff ARG003). The parameter is part of the Protocol but unused here. Add a no-op reference.
def get_kv_cache_size_from_dynamo_log( cls, dynamo_log_fn: str, attention_dp_size: int = 1 ) -> int: + _ = attention_dp_size # unused in vllm backend; kept for protocol compliance
1048-1077: Silence the unused parameter in the TRT-LLM KV-cache method (ruff ARG003). The parameter is part of the Protocol but unused here. Add a no-op reference.
def get_kv_cache_size_from_dynamo_log( cls, dynamo_log_fn: str, attention_dp_size: int = 1 ) -> int: + _ = attention_dp_size # unused in trtllm backend; kept for protocol compliance # TRT-LLM log parsing for KV cache size
31-39: Prevent duplicate log handlers on repeated imports. Wrap handler registration to avoid duplicated logs.
logger = logging.getLogger(__name__) logger.setLevel(logging.INFO) -console_handler = logging.StreamHandler() -console_handler.setLevel(logging.INFO) -formatter = logging.Formatter( - "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S" -) -console_handler.setFormatter(formatter) -logger.addHandler(console_handler) +if not logger.handlers: + console_handler = logging.StreamHandler() + console_handler.setLevel(logging.INFO) + formatter = logging.Formatter( + "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S" + ) + console_handler.setFormatter(formatter) + logger.addHandler(console_handler)
57-61: Type the multinode field to avoid relying on dynamic extra assignment. Relying on extra fields risks assignment issues under Pydantic v2. Recommend adding `multinode: Optional[MultinodeConfig] = None` to `Service` and ensuring the type is defined before `Service`.
- Option A (preferred): move `MultinodeConfig` above `Service` and add a typed field.
- Option B: add a forward reference with `from __future__ import annotations` and use `multinode: Optional["MultinodeConfig"]`.

Would you like me to generate the exact diff (including moving the class)?
Also applies to: 83-85, 167-184, 602-664, 666-731
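Sticking with Option A, a minimal sketch under assumed field names (the real Service and MultinodeConfig models in config.py carry more fields; this only shows the typed-field shape):

```python
from typing import Optional
from pydantic import BaseModel

class MultinodeConfig(BaseModel):
    nodeCount: int = 1

class Service(BaseModel):
    replicas: int = 1
    multinode: Optional[MultinodeConfig] = None  # declared field, so it always serializes

svc = Service(replicas=1)
svc.multinode = MultinodeConfig(nodeCount=2)  # ordinary assignment to a declared field
print(svc.model_dump())  # {'replicas': 1, 'multinode': {'nodeCount': 2}}
```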
102-109: Optional: remove all occurrences when stripping valued args
`remove_valued_arguments` removes only the first pair. If the flag can repeat, consider looping.

def remove_valued_arguments(args: list[str], key: str) -> list[str]: - """Remove a valued argument (e.g., --key value) from the arguments list if exists.""" - if key in args: - idx = args.index(key) - if idx + 1 < len(args): - del args[idx : idx + 2] + """Remove all occurrences of a valued argument (e.g., --key value).""" + while True: + try: + idx = args.index(key) + if idx + 1 < len(args): + del args[idx : idx + 2] + else: + del args[idx] + except ValueError: + break return args
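For reference, the loop-based variant as a standalone, runnable sketch (a behavioral illustration only; the in-repo helper may handle edge cases differently):

```python
def remove_valued_arguments(args: list[str], key: str) -> list[str]:
    """Remove every '--key value' pair (or a trailing bare key) from args in place."""
    while key in args:
        idx = args.index(key)
        end = idx + 2 if idx + 1 < len(args) else idx + 1
        del args[idx:end]
    return args

print(remove_valued_arguments(["--tp", "8", "--dp", "2", "--tp", "4"], "--tp"))
# ['--dp', '2']
```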
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- benchmarks/profiler/deploy/profile_sla_moe_job.yaml (1 hunks)
- benchmarks/profiler/profile_endpoint.py (2 hunks)
- benchmarks/profiler/profile_sla.py (22 hunks)
- benchmarks/profiler/utils/config.py (14 hunks)
- benchmarks/profiler/utils/defaults.py (1 hunks)
- benchmarks/profiler/utils/plot.py (2 hunks)
- benchmarks/profiler/utils/profile_decode.py (6 hunks)
- components/backends/sglang/deploy/agg.yaml (1 hunks)
- components/backends/sglang/deploy/agg_logging.yaml (1 hunks)
- components/backends/sglang/deploy/agg_router.yaml (1 hunks)
- components/backends/sglang/deploy/disagg.yaml (2 hunks)
- components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
- components/planner/src/dynamo/planner/defaults.py (1 hunks)
- components/planner/src/dynamo/planner/utils/planner_argparse.py (0 hunks)
- container/Dockerfile.sglang-wideep (2 hunks)
- deploy/utils/dynamo_deployment.py (2 hunks)
- docs/benchmarks/pre_deployment_profiling.md (4 hunks)
- tests/profiler/test_profile_sla_dryrun.py (1 hunks)
💤 Files with no reviewable changes (1)
- components/planner/src/dynamo/planner/utils/planner_argparse.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Applied to files:
container/Dockerfile.sglang-wideep
📚 Learning: 2025-08-15T02:01:01.238Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2453
File: deploy/dynamo_check.py:739-741
Timestamp: 2025-08-15T02:01:01.238Z
Learning: In the dynamo project, missing cargo should be treated as a fatal error in dynamo_check.py because developers need cargo to build the Rust components. The tool is designed to validate the complete development environment, not just import functionality.
Applied to files:
container/Dockerfile.sglang-wideep
🧬 Code graph analysis (3)
tests/profiler/test_profile_sla_dryrun.py (2)
tests/profiler/test_profile_sla_aiconfigurator.py (1)
Args(28-49)
benchmarks/profiler/profile_sla.py (1)
run_profile(63-708)
benchmarks/profiler/profile_sla.py (4)
benchmarks/profiler/utils/profile_decode.py (2)
get_num_request_range(25-37)
profile_decode_aiconfigurator(143-169)
benchmarks/profiler/utils/config.py (20)
convert_config(188-194)
convert_config(229-319)
convert_config(459-557)
convert_config(807-932)
set_config_tep_size(201-204)
set_config_tep_size(365-368)
set_config_tep_size(603-663)
set_config_tep_size(980-983)
set_config_tp_size(197-198)
set_config_tp_size(322-362)
set_config_tp_size(560-600)
set_config_tp_size(935-977)
set_config_dep_size(207-210)
set_config_dep_size(371-374)
set_config_dep_size(666-730)
set_config_dep_size(986-989)
get_kv_cache_size_from_dynamo_log(221-224)
get_kv_cache_size_from_dynamo_log(433-454)
get_kv_cache_size_from_dynamo_log(789-802)
get_kv_cache_size_from_dynamo_log(1049-1077)
benchmarks/profiler/utils/profile_cache.py (4)
check_prefill_results_exist(26-53)
load_existing_prefill_results(91-108)
check_decode_results_exist(56-88)
load_existing_decode_results(111-138)
benchmarks/profiler/utils/plot.py (1)
plot_prefill_performance(34-71)
benchmarks/profiler/utils/config.py (2)
components/backends/vllm/src/dynamo/vllm/args.py (1)
Config(39-64)
components/backends/trtllm/src/dynamo/trtllm/utils/trtllm_utils.py (1)
Config(28-93)
🪛 Ruff (0.13.1)
tests/profiler/test_profile_sla_dryrun.py
141-141: Probable insecure usage of temporary file or directory: "/tmp/test_profiling_results"
(S108)
benchmarks/profiler/utils/config.py
434-434: Unused class method argument: attention_dp_size
(ARG003)
633-635: Avoid specifying long messages outside the exception class
(TRY003)
696-698: Avoid specifying long messages outside the exception class
(TRY003)
800-800: Do not catch blind exception: Exception
(BLE001)
1050-1050: Unused class method argument: attention_dp_size
(ARG003)
🪛 markdownlint-cli2 (0.18.1)
docs/benchmarks/pre_deployment_profiling.md
162-162: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🪛 Checkov (3.2.334)
benchmarks/profiler/deploy/profile_sla_moe_job.yaml
[medium] 3-61: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 3-61: Minimize the admission of root containers
(CKV_K8S_23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: Build and Test - trtllm
- GitHub Check: Build and Test - sglang
- GitHub Check: Mirror Repository to GitLab
🔇 Additional comments (13)
docs/benchmarks/pre_deployment_profiling.md (1)
16-16: Support matrix update for SGLang MoE looks correct. Good to mark SGLang MoE as supported to reflect the DEP/TEP paths.
components/backends/sglang/deploy/disagg_planner.yaml (2)
132-163: Service key rename to prefill is consistent. No functional changes beyond the key; args remain valid.
101-132: Service key rename to `decode` is consistent. SGLang deployment manifests use short `decode`/`prefill` service keys across the repo; docs still mention `SGLangDecodeWorker`/`SGLangPrefillWorker`, but that's documentation-only — no changes required.
components/backends/sglang/deploy/disagg.yaml (2)
17-43: decode key rename LGTM. Args align with disaggregation-mode=decode.
44-70: prefill key rename LGTM. Args align with disaggregation-mode=prefill.
components/backends/sglang/deploy/agg_logging.yaml (1)
20-43: decode key rename LGTM. Consistent with the broader rename; no behavior change.
benchmarks/profiler/profile_endpoint.py (1)
104-115: Pass-through confirmed — the signatures match: benchmarks/profiler/utils/profile_decode.py defines profile_decode(..., attention_dp_size) and benchmarks/profiler/profile_endpoint.py passes args.attention_dp_size.
benchmarks/profiler/utils/plot.py (2)
35-52: LGTM: GPU-count labeling for prefill. API and annotations updated coherently.
75-87: LGTM: GPU-count labeling for decode. Legend and loop variables align with the new semantics.
tests/profiler/test_profile_sla_dryrun.py (1)
166-169: MoE recipe path verified — no action needed.
Found: recipes/deepseek-r1/sglang-wideep/tep16p-dep16d-disagg.yaml
components/backends/sglang/deploy/agg_router.yaml (1)
20-44: Consistent `decode` rename — confirm router and Service selectors use `decode`. Verify router logic, Kubernetes Service selectors, and any consumer code reference `decode` (not the previous service name). Check: components/backends/sglang/deploy/README.md (SGLangDecodeWorker / decode-only worker), docs/kubernetes/README.md:119, docs/architecture/distributed_runtime.md:36.
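The rename only stays consistent if the component-name mapping the profiler consults points at the new keys. The snippet below is an illustrative shape only, not the repo's actual definition of WORKER_COMPONENT_NAMES:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComponentNames:
    decode_worker_k8s_name: str
    prefill_worker_k8s_name: str

# Hypothetical mapping: short keys replacing the old SGLangDecodeWorker/SGLangPrefillWorker names.
WORKER_COMPONENT_NAMES = {
    "sglang": ComponentNames(
        decode_worker_k8s_name="decode",
        prefill_worker_k8s_name="prefill",
    ),
}

print(WORKER_COMPONENT_NAMES["sglang"].decode_worker_k8s_name)  # "decode"
```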
17-17: Rename todecodelooks good — ensure mappings & docs updated.No WORKER_COMPONENT_NAMES mapping for
decode_worker_k8s_namewas found; repo still containsSGLangDecodeWorkerreferences:
- docs/kubernetes/README.md:119
- docs/architecture/distributed_runtime.md:36
- components/backends/sglang/deploy/README.md:12,19,26
Ensure WORKER_COMPONENT_NAMES maps
`decode_worker_k8s_name` -> `"decode"` and update any consumers/log paths/docs that still expect `SGLangDecodeWorker`.
benchmarks/profiler/utils/config.py (1)
167-184: Verify Pydantic behavior: does `worker_service.multinode = MultinodeConfig(...)` persist in model_dump()? If Pydantic rejects dynamic field assignment, `multinode` won't appear in the emitted config. Consider the typed-field change above.

Run a quick check in your environment to confirm the assignment persists through `cfg.model_dump()` for this model shape.
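One way to run that check in isolation (the model shape is assumed; mirror the real Service configuration, in particular whether it allows extra fields -- if it does not, the assignment itself will raise, which also answers the question):

```python
from pydantic import BaseModel, ConfigDict

class Service(BaseModel):
    model_config = ConfigDict(extra="allow")  # adjust to match the real model's config
    replicas: int = 1

svc = Service()
svc.multinode = {"nodeCount": 2}  # dynamic extra assignment, as done in the profiler code
print(svc.model_dump())           # check whether 'multinode' appears in the dump
```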
For DS-like (DeepSeek-style) MoE models, the SLA planner:
This PR:
- Adds the `--is-moe-model` flag to the pre-deployment sweep to support TEP/DEP sweeps.

Summary by CodeRabbit
New Features
Changes
Documentation
Tests