
Conversation

@tedzhouhk tedzhouhk commented Sep 23, 2025

For DeepSeek-like MoE models, the SLA planner:

  1. uses TEP for prefill
  2. uses DEP for decode (TEP support will come later)
  3. scales the number of replicas (scaling xEP size will come later)

This PR:

  1. adds an --is-moe-model flag to the pre-deployment sweep to support TEP/DEP sweeps.
  2. makes no change to the SLA planner logic (assuming good load balance, TEP prefill is equivalent to TP, and DEP decode is equivalent to TP thanks to in-flight batching).
  3. makes many miscellaneous changes to support multi-node sweeps (see the sketch below).

Summary by CodeRabbit

  • New Features

    • MoE-aware profiling for prefill/decode with GPU-count sweeps and multinode support.
    • New CLI flags: --is-moe-model, --num-gpus-per-node; decode range auto-calculation; added --attention_dp_size.
    • Added Kubernetes Job to run MoE profiling.
    • Plots now label by GPU count.
  • Changes

    • Removed CLI flags: --isl, --osl.
    • Deployment names now include a random suffix; default frontend service name follows it.
    • Simplified SGLang service names to “prefill” and “decode”.
  • Documentation

    • Expanded pre-deployment profiling guide with MoE workflows.
  • Tests

    • Added MoE SGLang dry-run coverage.

Signed-off-by: hongkuanz <[email protected]>
@tedzhouhk tedzhouhk requested review from a team as code owners September 23, 2025 20:31
@github-actions github-actions bot added the feat label Sep 23, 2025

coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds MoE-aware profiling across profiler codepaths, introduces attention_dp_size and GPU-count sweeps, updates config modifiers with DEP/TEP and multinode support, renames SGLang service keys to short names, adds a MoE profiling Job manifest, adjusts decode request range logic, updates docs, Dockerfile, deployment naming behavior, and adds MoE dry-run tests.

Changes

Cohort / File(s) Summary
Profiler CLI & entrypoints
benchmarks/profiler/profile_endpoint.py, benchmarks/profiler/profile_sla.py
Adds --attention_dp_size (endpoint) and MoE flags (--is-moe-model, --num-gpus-per-node). Refactors profiling to sweep by GPU count; wires attention_dp_size; updates logging/paths and calls to prefill/decode with MoE-aware params.
Profiler utils: config & decode logic
benchmarks/profiler/utils/config.py, benchmarks/profiler/utils/profile_decode.py, benchmarks/profiler/utils/defaults.py, benchmarks/profiler/utils/plot.py
Extends ConfigModifierProtocol to support MoE (set_config_tep_size/dep_size, is_moe_model) and multinode. Adds get_num_request_range and plumbs attention_dp_size through decode. Replaces the fixed decode range with a capped range (DECODE_MAX_CONCURRENCY). Plotting now labels by GPU count.
K8s manifests: SGLang services rename
components/backends/sglang/deploy/agg.yaml, .../agg_logging.yaml, .../agg_router.yaml, .../disagg.yaml, .../disagg_planner.yaml
Renames spec.services keys: SGLangDecodeWorker→decode, SGLangPrefillWorker→prefill (where applicable). In agg_router, adds replicas/resources/extraPodSpec and command args.
Profiler MoE Job
benchmarks/profiler/deploy/profile_sla_moe_job.yaml
Adds Kubernetes Job profile-sla for MoE profiling with env, secrets, PVC mount, args, and serviceAccount.
Planner constants/CLI
components/planner/src/dynamo/planner/defaults.py, components/planner/src/dynamo/planner/utils/planner_argparse.py
Shortens SGLang component names to "prefill"/"decode". Removes --isl and --osl from SLA planner argparse.
Deployment helper
deploy/utils/dynamo_deployment.py
Appends short UUID to deployment_name; default service name updated accordingly; removes Grove-disable annotation.
Dockerfile
container/Dockerfile.sglang-wideep
Uses workspace copy instead of git clone; installs deps via container/deps/requirements.txt; adjusts PATH for cargo.
Docs
docs/benchmarks/pre_deployment_profiling.md
Documents MoE profiling (TEP/DEP), MoE job manifest, workflow steps updates, and notes.
Tests
tests/profiler/test_profile_sla_dryrun.py
Adds MoE SGLang dry-run fixture and async test for run_profile.
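
For context on the new decode range logic, here is a minimal sketch of what get_num_request_range plausibly computes, assuming it scales the sweep by attention_dp_size and caps it at DECODE_MAX_CONCURRENCY; the real implementation lives in benchmarks/profiler/utils/profile_decode.py and may differ:

# Hypothetical sketch of the capped decode-concurrency sweep.
DECODE_MAX_CONCURRENCY = 2000  # cap defined in benchmarks/profiler/utils/defaults.py

def get_num_request_range(
    attention_dp_size: int, max_concurrency: int, interpolation_granularity: int
) -> list[int]:
    # Scale per-attention-rank concurrency by the attn-DP size, then cap it
    # so MoE sweeps with large attn-dp don't exceed the engine limit.
    capped = min(max_concurrency * attention_dp_size, DECODE_MAX_CONCURRENCY)
    step = max(capped // interpolation_granularity, 1)
    return list(range(step, capped + 1, step))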

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant PS as profile_sla.py
  participant CM as ConfigModifier (SGLang/VLLM/TRTLLM)
  participant PF as Prefill Profiler
  participant DC as Decode Profiler
  participant BK as Backend Engine
  participant ST as PVC (/data)

  U->>PS: run_profile(args)
  PS->>PS: Parse flags (is_moe_model, num_gpus_per_node)
  alt MoE
    PS->>CM: convert_config(..., is_moe_model=True)
    PS->>CM: set_config_tep_size / set_config_dep_size(num_gpus, per-node)
    PS->>DC: profile_decode(..., attention_dp_size=num_gpus)
  else Dense
    PS->>CM: convert_config(..., is_moe_model=False)
    PS->>CM: set_config_tp_size(tp_size)
    PS->>DC: profile_decode(..., attention_dp_size=1)
  end
  PS->>PF: profile_prefill(...)
  par Prefill
    PF->>BK: deploy & run prefill
    BK-->>PF: TTFT/throughput
  and Decode
    DC->>DC: get_num_request_range(attn_dp, max_conc, granularity)
    DC->>BK: deploy & run decode sweep
    BK-->>DC: ITL/throughput
  end
  PS->>ST: Write results/plots
  PS-->>U: Summary JSON/plots

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Poem

A whisk of bytes, a hop through GPUs wide,
I map my trails: DEP and TEP beside.
Decode counts bounded, ranges neat,
Short names on pods make routes petite.
With PVC burrow, logs I stow—
MoE meadows, here I go! 🐇✨

Pre-merge checks

❌ Failed checks (2 warnings)
  • Description Check (⚠️ Warning): The PR description covers the high-level intent and the main behavioral changes for MoE support, but it does not follow the repository's required template: the "Overview", "Details", "Where should the reviewer start?" and "Related Issues" sections are missing, and there are no explicit pointers to key files and tests. Resolution: update the description to match the template by adding an "Overview" one-line summary, a "Details" section listing the key changes and affected files (for example benchmarks/profiler/profile_sla.py, benchmarks/profiler/utils/config.py, components/planner/...), a "Where should the reviewer start?" section pointing to critical files and tests, and a "Related Issues" section with any issue numbers (e.g., closes #xxx); brief commands or test guidance will also help reviewers.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 23.26%, below the required 80.00% threshold. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The title "feat: support MoE model in SLA Planner Sglang" is concise and accurately summarizes the primary change (adding MoE support to the SLA planner SGLang workflow, e.g., the --is-moe-model flag and MoE-aware profiling/config updates), so it communicates the main intent clearly to reviewers scanning history.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: hongkuanz <[email protected]>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
benchmarks/profiler/utils/profile_decode.py (1)

143-170: Regression: _profile_decode_helper signature change breaks profile_decode_aiconfigurator.

Missing attention_dp_size argument causes TypeError at runtime.

Apply:

-def profile_decode_aiconfigurator(
+def profile_decode_aiconfigurator(
     work_dir,
     num_gpus,
     max_kv_tokens,
     max_context_length,
     interpolation_granularity,
     ai_configurator_perf_estimator: AIConfiguratorPerfEstimator,
     **model_config_kwargs,
 ):
@@
-    return _profile_decode_helper(
+    return _profile_decode_helper(
         work_dir,
         num_gpus,
         max_kv_tokens,
         max_context_length,
         interpolation_granularity,
         get_itl_and_thpt_per_gpu,
-    )
+        # For dense models attention_dp_size=1; MoE path doesn't use AI Configurator.
+        model_config_kwargs.get("attention_dp_size", 1),
+    )

Optionally, make the parameter explicit:

-    ai_configurator_perf_estimator: AIConfiguratorPerfEstimator,
+    ai_configurator_perf_estimator: AIConfiguratorPerfEstimator,
+    attention_dp_size: int = 1,
@@
-        get_itl_and_thpt_per_gpu,
-        # For dense models attention_dp_size=1; MoE path doesn't use AI Configurator.
-        model_config_kwargs.get("attention_dp_size", 1),
+        get_itl_and_thpt_per_gpu,
+        attention_dp_size,
benchmarks/profiler/utils/config.py (3)

552-558: Bug: replicas set on raw dict never makes it into the returned config

You mutate config[...] then return cfg.model_dump(), so replicas stays unchanged. Set it on cfg instead.

-        # set num workers to 1
-        decode_worker_config = config["spec"]["services"][
-            WORKER_COMPONENT_NAMES["sglang"].decode_worker_k8s_name
-        ]
-        decode_worker_config["replicas"] = 1
-
-        return cfg.model_dump()
+        # set num workers to 1
+        decode_worker_config = cfg.spec.services[
+            WORKER_COMPONENT_NAMES["sglang"].decode_worker_k8s_name
+        ]
+        decode_worker_config.replicas = 1
+
+        return cfg.model_dump()

451-454: Fix NameError in exception logging (line variable may be undefined)

On errors before the loop, line is undefined. Log the filename instead.

-        except Exception as e:
-            logger.warning(
-                f"Failed to parse KV cache size from line: {line}. Error: {e}"
-            )
+        except Exception as e:
+            logger.warning(
+                "Failed to parse KV cache size from log file %s. Error: %s",
+                dynamo_log_fn,
+                e,
+            )

274-281: Make removal of flag robust: avoid ValueError if flag missing

Use a presence check before remove.

-            # remove --is-prefill-worker flag
-            args.remove("--is-prefill-worker")
+            # remove --is-prefill-worker flag
+            if "--is-prefill-worker" in args:
+                args.remove("--is-prefill-worker")
🧹 Nitpick comments (15)
docs/benchmarks/pre_deployment_profiling.md (3)

66-68: Clarify TEP-only prefill stance for MoE.

Consider adding one sentence explaining why DEP prefill is suboptimal (e.g., memory-bound tradeoffs), to avoid confusion for users comparing with dense TP.


72-74: Future TEP decode note is helpful; link to tracker/issue if available.

If there’s a tracking issue/PR for TEP decode support, add the link here for discoverability.


124-131: Align naming between docs and CLI: map DEP/TEP to attention_dp_size.

You introduced attention_dp_size in the CLI; add a parenthetical here mapping DEP to attention_dp_size to reduce terminology drift.

deploy/utils/dynamo_deployment.py (1)

142-193: Port-forward robustness: add --address and a readiness check.

To avoid binding on all interfaces and flaky sleeps:

  • Add "--address 127.0.0.1".
  • Probe the local URL before returning instead of fixed sleep.

Apply this diff:

         cmd = [
             "kubectl",
             "port-forward",
             f"svc/{self.service_name}",
             f"{local_port}:{self.frontend_port}",
+            "--address",
+            "127.0.0.1",
             "-n",
             self.namespace,
         ]
@@
-        print("Waiting for port forward to establish...")
-        time.sleep(3)
+        print("Waiting for port forward to establish...")
+        import socket as _s  # hoisted so it isn't re-imported each iteration
+        start = time.time()
+        while time.time() - start < 15:
+            try:
+                with _s.create_connection(("127.0.0.1", local_port), timeout=0.5):
+                    break
+            except OSError:
+                time.sleep(0.2)
benchmarks/profiler/profile_endpoint.py (1)

82-87: New --attention_dp_size flag: document allowed values and interactions.

Add help text clarifying relation to DEP (e.g., DEP size), valid ranges, and when it’s used (decode/MoE only).
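
For example, the help text could spell this out (a suggested sketch; this exact wording is not in the PR):

parser.add_argument(
    "--attention_dp_size",
    type=int,
    default=1,
    help=(
        "Attention data-parallel size, i.e. the DEP size used for MoE "
        "decode profiling. Leave at 1 for dense models, where attention "
        "is not data-parallel."
    ),
)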

container/Dockerfile.sglang-wideep (1)

61-61: Exclude common local/build/test files from the Docker context (update .dockerignore)

Your .dockerignore blocks many model artifacts and .git but is missing common patterns that reduce image size and cache churn — add entries like:

  • .venv/
  • venv/
  • __pycache__/
  • *.pyc
  • build/
  • dist/
  • *.egg-info/
  • .pytest_cache/
  • tests/
  • node_modules/

Location: .dockerignore (repo root) — add the above patterns.

benchmarks/profiler/utils/defaults.py (1)

19-21: Make DECODE_MAX_CONCURRENCY configurable (env override).

Hard-coding 2000 is fine as a cap; allow env override to tune without code changes.

Apply:

-# for MoE models with attn-dp, we might hit this limit
-DECODE_MAX_CONCURRENCY = 2000
+# for MoE models with attn-dp, we might hit this limit
+import os  # at top of file if not present
+DECODE_MAX_CONCURRENCY = int(os.getenv("DECODE_MAX_CONCURRENCY", "2000"))

Add the import near other imports:

import os
benchmarks/profiler/deploy/profile_sla_moe_job.yaml (1)

1-61: Harden pod security context; optionally pin num-gpus-per-node.

Add non-root, drop caps, and disallow privilege escalation. Consider passing --num-gpus-per-node explicitly to document granularity.

Apply:

       - name: profile-sla
         image: ${DOCKER_IMAGE}
         resources:
           requests:
             cpu: "32"
             memory: "50Gi"
+        securityContext:
+          runAsNonRoot: true
+          runAsUser: 1000
+          allowPrivilegeEscalation: false
+          readOnlyRootFilesystem: true
+          capabilities:
+            drop: ["ALL"]
         env:
@@
           - --is-moe-model
+          - --num-gpus-per-node
+          - "8"

If your cluster enforces Pod Security Standards, confirm these settings comply with your admission controller.

benchmarks/profiler/utils/config.py (7)

439-449: Robust KV concurrency parsing (avoid brittle splits and slicing)

Current parsing relies on string splits and [:-1], risking wrong values if format changes. Prefer regex.

-                    if "Maximum concurrency for" in line:
-                        line = line.strip().split("Maximum concurrency for ")[1]
-                        token_count = int(
-                            line.split(" tokens per request: ")[0].replace(",", "")
-                        )
-                        concurrency = float(line.split(" tokens per request: ")[1][:-1])
-
-                        logger.info(
-                            f"Found KV cache info: {token_count} x {concurrency} = {int(token_count * concurrency)}"
-                        )
-                        return int(token_count * concurrency)
+                    if "Maximum concurrency for" in line:
+                        m = re.search(
+                            r"Maximum concurrency for\s+([\d,]+)\s+tokens per request:\s*([\d.]+)",
+                            line,
+                        )
+                        if m:
+                            token_count = int(m.group(1).replace(",", ""))
+                            concurrency = float(m.group(2))
+                            total = int(token_count * concurrency)
+                            logger.info("Found KV cache info: %d x %s = %d", token_count, m.group(2), total)
+                            return total

788-802: Narrow broad exception in KV-cache parsing

Avoid catching bare Exception; use specific exceptions.

-        except Exception as e:
-            logger.warning(f"Failed to parse KV cache size from log file. Error: {e}")
+        except (OSError, ValueError) as e:
+            logger.warning("Failed to parse KV cache size from log file. Error: %s", e)

432-436: Silence unused parameter in VLLM KV-cache method (ruff ARG003)

Parameter is part of the Protocol but unused here. Add a no-op reference.

     def get_kv_cache_size_from_dynamo_log(
         cls, dynamo_log_fn: str, attention_dp_size: int = 1
     ) -> int:
+        _ = attention_dp_size  # unused in vllm backend; kept for protocol compliance

1048-1077: Silence unused parameter in TRT-LLM KV-cache method (ruff ARG003)

Parameter is part of the Protocol but unused here. Add a no-op reference.

     def get_kv_cache_size_from_dynamo_log(
         cls, dynamo_log_fn: str, attention_dp_size: int = 1
     ) -> int:
+        _ = attention_dp_size  # unused in trtllm backend; kept for protocol compliance
         # TRT-LLM log parsing for KV cache size

31-39: Prevent duplicate log handlers on repeated imports

Wrap handler registration to avoid duplicated logs.

 logger = logging.getLogger(__name__)
 logger.setLevel(logging.INFO)
-console_handler = logging.StreamHandler()
-console_handler.setLevel(logging.INFO)
-formatter = logging.Formatter(
-    "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S"
-)
-console_handler.setFormatter(formatter)
-logger.addHandler(console_handler)
+if not logger.handlers:
+    console_handler = logging.StreamHandler()
+    console_handler.setLevel(logging.INFO)
+    formatter = logging.Formatter(
+        "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S"
+    )
+    console_handler.setFormatter(formatter)
+    logger.addHandler(console_handler)

57-61: Type the multinode field to avoid relying on dynamic extra assignment

Relying on extra fields risks assignment issues under Pydantic v2. Recommend adding multinode: Optional[MultinodeConfig] = None to Service and ensuring the type is defined before Service.

  • Option A (preferred): move MultinodeConfig above Service and add a typed field.
  • Option B: add a forward reference with from __future__ import annotations and use multinode: Optional["MultinodeConfig"].

Would you like me to generate the exact diff (including moving the class)?

Also applies to: 83-85, 167-184, 602-664, 666-731
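
A minimal sketch of Option A, assuming Service is a plain Pydantic v2 BaseModel (the MultinodeConfig fields below are illustrative, not the real schema):

from typing import Optional
from pydantic import BaseModel

class MultinodeConfig(BaseModel):
    # Illustrative fields; match the real multinode schema.
    node_count: int = 1
    gpus_per_node: int = 8

class Service(BaseModel):
    # ... existing Service fields ...
    replicas: int = 1
    # Typed field: assignment is validated and reliably survives model_dump().
    multinode: Optional[MultinodeConfig] = None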


102-109: Optional: remove all occurrences when stripping valued args

remove_valued_arguments removes only the first pair. If repeated, consider looping.

 def remove_valued_arguments(args: list[str], key: str) -> list[str]:
-    """Remove a valued argument (e.g., --key value) from the arguments list if exists."""
-    if key in args:
-        idx = args.index(key)
-        if idx + 1 < len(args):
-            del args[idx : idx + 2]
+    """Remove all occurrences of a valued argument (e.g., --key value)."""
+    while True:
+        try:
+            idx = args.index(key)
+            if idx + 1 < len(args):
+                del args[idx : idx + 2]
+            else:
+                del args[idx]
+        except ValueError:
+            break
     return args
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a71802 and 76e77aa.

📒 Files selected for processing (18)
  • benchmarks/profiler/deploy/profile_sla_moe_job.yaml (1 hunks)
  • benchmarks/profiler/profile_endpoint.py (2 hunks)
  • benchmarks/profiler/profile_sla.py (22 hunks)
  • benchmarks/profiler/utils/config.py (14 hunks)
  • benchmarks/profiler/utils/defaults.py (1 hunks)
  • benchmarks/profiler/utils/plot.py (2 hunks)
  • benchmarks/profiler/utils/profile_decode.py (6 hunks)
  • components/backends/sglang/deploy/agg.yaml (1 hunks)
  • components/backends/sglang/deploy/agg_logging.yaml (1 hunks)
  • components/backends/sglang/deploy/agg_router.yaml (1 hunks)
  • components/backends/sglang/deploy/disagg.yaml (2 hunks)
  • components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
  • components/planner/src/dynamo/planner/defaults.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/planner_argparse.py (0 hunks)
  • container/Dockerfile.sglang-wideep (2 hunks)
  • deploy/utils/dynamo_deployment.py (2 hunks)
  • docs/benchmarks/pre_deployment_profiling.md (4 hunks)
  • tests/profiler/test_profile_sla_dryrun.py (1 hunks)
💤 Files with no reviewable changes (1)
  • components/planner/src/dynamo/planner/utils/planner_argparse.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • container/Dockerfile.sglang-wideep
📚 Learning: 2025-08-15T02:01:01.238Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2453
File: deploy/dynamo_check.py:739-741
Timestamp: 2025-08-15T02:01:01.238Z
Learning: In the dynamo project, missing cargo should be treated as a fatal error in dynamo_check.py because developers need cargo to build the Rust components. The tool is designed to validate the complete development environment, not just import functionality.

Applied to files:

  • container/Dockerfile.sglang-wideep
🧬 Code graph analysis (3)
tests/profiler/test_profile_sla_dryrun.py (2)
tests/profiler/test_profile_sla_aiconfigurator.py (1)
  • Args (28-49)
benchmarks/profiler/profile_sla.py (1)
  • run_profile (63-708)
benchmarks/profiler/profile_sla.py (4)
benchmarks/profiler/utils/profile_decode.py (2)
  • get_num_request_range (25-37)
  • profile_decode_aiconfigurator (143-169)
benchmarks/profiler/utils/config.py (20)
  • convert_config (188-194)
  • convert_config (229-319)
  • convert_config (459-557)
  • convert_config (807-932)
  • set_config_tep_size (201-204)
  • set_config_tep_size (365-368)
  • set_config_tep_size (603-663)
  • set_config_tep_size (980-983)
  • set_config_tp_size (197-198)
  • set_config_tp_size (322-362)
  • set_config_tp_size (560-600)
  • set_config_tp_size (935-977)
  • set_config_dep_size (207-210)
  • set_config_dep_size (371-374)
  • set_config_dep_size (666-730)
  • set_config_dep_size (986-989)
  • get_kv_cache_size_from_dynamo_log (221-224)
  • get_kv_cache_size_from_dynamo_log (433-454)
  • get_kv_cache_size_from_dynamo_log (789-802)
  • get_kv_cache_size_from_dynamo_log (1049-1077)
benchmarks/profiler/utils/profile_cache.py (4)
  • check_prefill_results_exist (26-53)
  • load_existing_prefill_results (91-108)
  • check_decode_results_exist (56-88)
  • load_existing_decode_results (111-138)
benchmarks/profiler/utils/plot.py (1)
  • plot_prefill_performance (34-71)
benchmarks/profiler/utils/config.py (2)
components/backends/vllm/src/dynamo/vllm/args.py (1)
  • Config (39-64)
components/backends/trtllm/src/dynamo/trtllm/utils/trtllm_utils.py (1)
  • Config (28-93)
🪛 Ruff (0.13.1)
tests/profiler/test_profile_sla_dryrun.py

141-141: Probable insecure usage of temporary file or directory: "/tmp/test_profiling_results"

(S108)

benchmarks/profiler/utils/config.py

434-434: Unused class method argument: attention_dp_size

(ARG003)


633-635: Avoid specifying long messages outside the exception class

(TRY003)


696-698: Avoid specifying long messages outside the exception class

(TRY003)


800-800: Do not catch blind exception: Exception

(BLE001)


1050-1050: Unused class method argument: attention_dp_size

(ARG003)

🪛 markdownlint-cli2 (0.18.1)
docs/benchmarks/pre_deployment_profiling.md

162-162: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 Checkov (3.2.334)
benchmarks/profiler/deploy/profile_sla_moe_job.yaml

[medium] 3-61: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 3-61: Minimize the admission of root containers

(CKV_K8S_23)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: Build and Test - trtllm
  • GitHub Check: Build and Test - sglang
  • GitHub Check: Mirror Repository to GitLab
🔇 Additional comments (13)
docs/benchmarks/pre_deployment_profiling.md (1)

16-16: Support matrix update for SGLang MoE looks correct.

Good to mark SGLang MoE as supported to reflect DEP/TEP paths.

components/backends/sglang/deploy/disagg_planner.yaml (2)

132-163: Service key rename to prefill is consistent.

No functional changes beyond the key; args remain valid.


101-132: Service key rename to decode is consistent.
SGLang deployment manifests use short decode/prefill service keys across the repo; docs still mention SGLangDecodeWorker/SGLangPrefillWorker but that's documentation-only — no changes required.

components/backends/sglang/deploy/disagg.yaml (2)

17-43: decode key rename LGTM.

Args align with disaggregation-mode=decode.


44-70: prefill key rename LGTM.

Args align with disaggregation-mode=prefill.

components/backends/sglang/deploy/agg_logging.yaml (1)

20-43: decode key rename LGTM.

Consistent with the broader rename; no behavior change.

benchmarks/profiler/profile_endpoint.py (1)

104-115: Pass-through confirmed — signature matches: benchmarks/profiler/utils/profile_decode.py defines profile_decode(..., attention_dp_size) and benchmarks/profiler/profile_endpoint.py passes args.attention_dp_size.

benchmarks/profiler/utils/plot.py (2)

35-52: LGTM: GPU-count labeling for prefill.

API and annotations updated coherently.


75-87: LGTM: GPU-count labeling for decode.

Legend and loop variables align with new semantics.

tests/profiler/test_profile_sla_dryrun.py (1)

166-169: MoE recipe path verified — no action needed.
Found: recipes/deepseek-r1/sglang-wideep/tep16p-dep16d-disagg.yaml

components/backends/sglang/deploy/agg_router.yaml (1)

20-44: Consistent decode rename — confirm router & service selectors use decode

Verify router logic, Kubernetes Service selectors, and any consumer code reference decode (not the previous service name). Check: components/backends/sglang/deploy/README.md (SGLangDecodeWorker / decode-only worker), docs/kubernetes/README.md:119, docs/architecture/distributed_runtime.md:36.

components/backends/sglang/deploy/agg.yaml (1)

17-17: Rename to decode looks good — ensure mappings & docs updated.

No WORKER_COMPONENT_NAMES mapping for decode_worker_k8s_name was found; repo still contains SGLangDecodeWorker references:

  • docs/kubernetes/README.md:119
  • docs/architecture/distributed_runtime.md:36
  • components/backends/sglang/deploy/README.md:12,19,26

Ensure WORKER_COMPONENT_NAMES maps decode_worker_k8s_name -> "decode" and update any consumers/log paths/docs that still expect SGLangDecodeWorker.

benchmarks/profiler/utils/config.py (1)

167-184: Verify Pydantic behavior: does worker_service.multinode = MultinodeConfig(...) persist in model_dump()?

If Pydantic rejects dynamic field assignment, multinode won’t appear in the emitted config. Consider the typed-field change above.

Run a quick check in your environment to confirm assignment persists through cfg.model_dump() for this model shape.
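
Something along these lines would confirm it (a standalone sketch; the real Service model in benchmarks/profiler/utils/config.py has more fields):

from pydantic import BaseModel, ConfigDict

class Service(BaseModel):
    # Without extra="allow", Pydantic v2 raises on assignment to an
    # undeclared attribute, so the field would never reach model_dump().
    model_config = ConfigDict(extra="allow")
    replicas: int = 1

svc = Service()
svc.multinode = {"node_count": 2}  # dynamic extra assignment
assert "multinode" in svc.model_dump(), "multinode dropped by model_dump()"
print(svc.model_dump())  # {'replicas': 1, 'multinode': {'node_count': 2}}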

Co-authored-by: hhzhang16 <[email protected]>
Signed-off-by: Hongkuan Zhou <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
@tedzhouhk tedzhouhk merged commit 6243bcb into main Sep 24, 2025
15 of 18 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/sglang-dsr1-sweep branch September 24, 2025 02:42
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: Hongkuan Zhou <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
Signed-off-by: Jason Zhou <[email protected]>
athreesh pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: Hongkuan Zhou <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: Hongkuan Zhou <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
Signed-off-by: Jason Zhou <[email protected]>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: Hongkuan Zhou <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
kylehh pushed a commit that referenced this pull request Sep 25, 2025
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: Hongkuan Zhou <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
Signed-off-by: Kyle H <[email protected]>
