diff --git a/README.md b/README.md index 9817a78dcf..bb146b2a33 100644 --- a/README.md +++ b/README.md @@ -25,17 +25,7 @@ limitations under the License. # NVIDIA Dynamo -High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. - -## Latest News - -* [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md) - -## The Era of Multi-GPU, Multi-Node - -

- Multi Node Multi-GPU topology -

+High-throughput, low-latency inference framework for serving generative AI models across multi-node distributed environments. Large language models are quickly outgrowing the memory and compute budget of any single GPU. Tensor-parallelism solves the capacity problem by spreading each layer across many GPUs—and sometimes many servers—but it creates a new one: how do you coordinate those shards, route requests, and share KV cache fast enough to feel like one accelerator? This orchestration gap is exactly what NVIDIA Dynamo is built to close. @@ -51,6 +41,12 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa Dynamo architecture

+Built in Rust for performance and Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach. + +## Latest News + +* [0.5.0] KVBM (KV Cache Block Manager) support now available in Dynamo for enhanced memory management and KV cache offloading from HBM to remote storage + ## Framework Support Matrix | Feature | vLLM | SGLang | TensorRT-LLM | @@ -58,7 +54,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa | [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ | | [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 | | [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ | -| [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 | | [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | ✅ | | [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | 🚧 | ✅ | @@ -67,88 +62,77 @@ To learn more about each framework and their capabilities, check out each framew - **[SGLang](components/backends/sglang/README.md)** - **[TensorRT-LLM](components/backends/trtllm/README.md)** -Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach. +# Quick Start -# Installation +**New to Dynamo?** **[Complete Quickstart Guide](docs/quickstart.md)** (Recommended) -The following examples require a few system level packages. -Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md) +## Local Development -## 1. Initial setup +### Prerequisites +- Ubuntu 24.04 (recommended) or compatible Linux +- NVIDIA GPU with CUDA support +- Docker & Docker Compose -The Dynamo team recommends the `uv` Python package manager, although any way works. Install uv: -``` +### 1. Install Dynamo +```bash +# Install uv (recommended Python package manager) curl -LsSf https://astral.sh/uv/install.sh | sh -``` - -### Install etcd and NATS (required) -To coordinate across a data center, Dynamo relies on etcd and NATS. To run Dynamo locally, these need to be available. - -- [etcd](https://etcd.io/) can be run directly as `./etcd`. -- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`. - -To quickly setup etcd & NATS, you can also run: +# Create virtual environment and install Dynamo +uv venv venv +source venv/bin/activate +uv pip install "ai-dynamo[sglang]" # or [vllm], [trtllm] ``` -# At the root of the repository: -# Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the nvidia container runtime isn't deployed or to be used. + +### 2. Start Infrastructure Services +```bash +# Start etcd and NATS (required for distributed communication) docker compose -f deploy/docker-compose.yml up -d ``` -## 2. Select an engine - -We publish Python wheels specialized for each of our supported engines: vllm, sglang, trtllm, and llama.cpp. The examples that follow use SGLang; continue reading for other engines. +### 3. Run Your First Model +```bash +# Terminal 1: Start frontend +python -m dynamo.frontend --http-port 8000 +# Terminal 2: Start backend worker +python -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B ``` -uv venv venv -source venv/bin/activate -uv pip install pip -# Choose one -uv pip install "ai-dynamo[sglang]" #replace with [vllm], [trtllm], etc. +### 4. 
Test It +```bash +curl localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}' ``` -## 3. Run Dynamo +## Kubernetes Deployment -### Running an LLM API server +**Production deployments** **[Kubernetes Quickstart](docs/quickstart.md#kubernetes-quickstart)** -Dynamo provides a simple way to spin up a local set of inference components including: +```bash +# Install platform +export NAMESPACE=dynamo-kubernetes +export RELEASE_VERSION=0.5.0 -- **OpenAI Compatible Frontend** – High performance OpenAI compatible http api server written in Rust. -- **Basic and Kv Aware Router** – Route and load balance traffic to a set of workers. -- **Workers** – Set of pre-configured LLM serving engines. +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz +helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default -``` -# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router. -# Pass the TLS certificate and key paths to use HTTPS instead of HTTP. -python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem] - -# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these, -# both for the same model and for multiple models. The frontend node will discover them. -python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --skip-tokenizer-init -``` +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz +helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace -#### Send a Request +# Deploy model (example: vLLM aggregated) +kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE} -```bash -curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ - "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", - "messages": [ - { - "role": "user", - "content": "Hello, how are you?" - } - ], - "stream":false, - "max_tokens": 300 - }' | jq +# Test the deployment +kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE} +curl http://localhost:8000/v1/models ``` -Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them. +**For detailed Kubernetes deployment guide**: [Kubernetes Documentation](docs/kubernetes/README.md) -### Deploying Dynamo +## Next Steps -- Follow the [Quickstart Guide](docs/kubernetes/README.md) to deploy on Kubernetes. - Check out [Backends](components/backends) to deploy various workflow configurations (e.g. SGLang with router, vLLM with disaggregated serving, etc.) - Run some [Examples](examples) to learn about building components in Dynamo and exploring various integrations. @@ -159,97 +143,42 @@ Dynamo provides comprehensive benchmarking tools to evaluate and optimize your d * **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf * **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements -# Engines - -Dynamo is designed to be inference engine agnostic. 
To use any engine with Dynamo, NATS and etcd need to be installed, along with a Dynamo frontend (`python -m dynamo.frontend [--interactive]`). - -## vLLM - -``` -uv pip install ai-dynamo[vllm] -``` - -Run the backend/worker like this: -``` -python -m dynamo.vllm --help -``` - -vLLM attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory pass `--context-length `. - -To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`. - -## SGLang - -``` -# Install libnuma -apt install -y libnuma-dev - -uv pip install ai-dynamo[sglang] -``` - -Run the backend/worker like this: -``` -python -m dynamo.sglang.worker --help -``` - -You can pass any sglang flags directly to this worker, see https://docs.sglang.ai/advanced_features/server_arguments.html . See there to use multiple GPUs. - -## TensorRT-LLM +# Supported Engines -It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for running the TensorRT-LLM engine. +Dynamo supports multiple inference engines. Choose your preferred backend: -> [!Note] -> Ensure that you select a PyTorch container image version that matches the version of TensorRT-LLM you are using. -> For example, if you are using `tensorrt-llm==1.1.0rc5`, use the PyTorch container image version `25.06`. -> To find the correct PyTorch container version for your desired `tensorrt-llm` release, visit the [TensorRT-LLM Dockerfile.multi](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi) on GitHub. Switch to the branch that matches your `tensorrt-llm` version, and look for the `BASE_TAG` line to identify the recommended PyTorch container tag. +| Engine | Install | Run Command | Notes | +|--------|---------|-------------|-------| +| **vLLM** | `uv pip install ai-dynamo[vllm]` | `python -m dynamo.vllm --model Qwen/Qwen3-0.6B` | Use `--context-length ` if KV cache doesn't fit in memory. Set `CUDA_VISIBLE_DEVICES` to specify GPUs. | +| **SGLang** | `uv pip install ai-dynamo[sglang]` | `python -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B` | Requires `apt install -y libnuma-dev` dependency. | +| **TensorRT-LLM** | `uv pip install ai-dynamo[trtllm]` | `python -m dynamo.trtllm --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B` | Requires NVIDIA PyTorch container. See [TensorRT-LLM Quickstart](docs/quickstart.md#tensorrt-llm-backend) for setup. | -> [!Important] -> Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1` - -### Install prerequisites -``` -# Optional step: Only required for Blackwell and Grace Hopper -uv pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 +**Detailed engine guides**: [vLLM](components/backends/vllm/README.md) | [SGLang](components/backends/sglang/README.md) | [TensorRT-LLM](components/backends/trtllm/README.md) -# Required until the trtllm version is bumped to include this pinned dependency itself -uv pip install "cuda-python>=12,<13" +# Development -sudo apt-get -y install libopenmpi-dev -``` +
-> [!Tip] -> You can learn more about these prequisites and known issues with TensorRT-LLM pip based installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html). +Building from Source (Click to expand) -### After installing the pre-requisites above, install Dynamo -``` -uv pip install ai-dynamo[trtllm] -``` +**For contributors and advanced users** -Run the backend/worker like this: -``` -python -m dynamo.trtllm --help -``` - -To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`. - -# Developing Locally - -## 1. Install libraries +### Prerequisites **Ubuntu:** -``` +```bash sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake ``` **macOS:** - [Homebrew](https://brew.sh/) -``` +```bash # if brew is not installed on your system, install it /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" ``` - [Xcode](https://developer.apple.com/xcode/) -``` +```bash brew install cmake protobuf ## Check that Metal is accessible @@ -257,15 +186,14 @@ xcrun -sdk macosx metal ``` If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly. +### Install Rust -## 2. Install Rust - -``` +```bash curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh source $HOME/.cargo/env ``` -## 3. Create a Python virtual env: +### Create a Python virtual env: Follow the instructions in [uv installation](https://docs.astral.sh/uv/#installation) guide to install uv if you don't have `uv` installed. Once uv is installed, create a virtual environment and activate it. @@ -280,24 +208,24 @@ uv venv dynamo source dynamo/bin/activate ``` -## 4. Install build tools +### Install build tools -``` +```bash uv pip install pip maturin ``` [Maturin](https://github.com/PyO3/maturin) is the Rust<->Python bindings build tool. -## 5. Build the Rust bindings +### Build the Rust bindings -``` +```bash cd lib/bindings/python maturin develop --uv ``` -## 6. Install the wheel +### Install the wheel -``` +```bash cd $PROJECT_ROOT uv pip install . # For development, use @@ -314,3 +242,5 @@ Remember that nats and etcd must be running (see earlier). Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`. If you use vscode or cursor, we have a .devcontainer folder built on [Microsofts Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions see the [ReadMe](.devcontainer/README.md) for more details. + +
\ No newline at end of file diff --git a/components/backends/sglang/README.md b/components/backends/sglang/README.md index d90116eed3..74cdf2b338 100644 --- a/components/backends/sglang/README.md +++ b/components/backends/sglang/README.md @@ -40,7 +40,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | | | [**Load Based Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned | -| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ❌ | Planned | +| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP | ### Large Scale P/D and WideEP Features diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md index 3d0b685570..29c59b139e 100644 --- a/components/backends/trtllm/README.md +++ b/components/backends/trtllm/README.md @@ -57,7 +57,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | | | [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned | -| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned | +| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ✅ | | ### Large Scale P/D and WideEP Features diff --git a/components/backends/vllm/README.md b/components/backends/vllm/README.md index 619ad47560..ceff111abd 100644 --- a/components/backends/vllm/README.md +++ b/components/backends/vllm/README.md @@ -40,7 +40,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | | | [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP | -| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP | +| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ✅ | | | [**LMCache**](./LMCache_Integration.md) | ✅ | | ### Large Scale P/D and WideEP Features diff --git a/docs/kubernetes/README.md b/docs/kubernetes/README.md index 22ff95675c..e98095eb65 100644 --- a/docs/kubernetes/README.md +++ b/docs/kubernetes/README.md @@ -24,7 +24,7 @@ High-level guide to Dynamo Kubernetes deployments. Start here, then dive into sp ```bash # 1. Set environment export NAMESPACE=dynamo-kubernetes -export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases +export RELEASE_VERSION=0.5.0 # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases # 2. Install CRDs helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz @@ -33,19 +33,51 @@ helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default # 3. 
Install Platform helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace + +# For multinode deployments, enable Grove/KAI with: +# helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace \ +# --set grove.enabled=true \ +# --set kai.enabled=true ``` For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](/docs/kubernetes/installation_guide.md)**. -## 2. Choose Your Backend - -Each backend has deployment examples and configuration options: - -| Backend | Available Configurations | -|---------|--------------------------| -| **[vLLM](/components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, Disaggregated Multi-node | -| **[SGLang](/components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node | -| **[TensorRT-LLM](/components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated Multi-node | +## 2. Choose Your Backend and Deployment Pattern + +### **Aggregated Serving** +Prefill and decode phases run on the same worker - simplest deployment pattern. + +| Backend | Configuration | Deploy Command | +|---------|---------------|----------------| +| **vLLM** | [Aggregated](components/backends/vllm/deploy/agg.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}` | +| **vLLM** | [Aggregated + Router](components/backends/vllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}` | +| **SGLang** | [Aggregated](components/backends/sglang/deploy/agg.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg.yaml -n ${NAMESPACE}` | +| **SGLang** | [Aggregated + Router](components/backends/sglang/deploy/agg_router.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg_router.yaml -n ${NAMESPACE}` | +| **TensorRT-LLM** | [Aggregated](components/backends/trtllm/deploy/agg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg.yaml -n ${NAMESPACE}` | +| **TensorRT-LLM** | [Aggregated + Router](components/backends/trtllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg_router.yaml -n ${NAMESPACE}` | + +### **Disaggregated Serving** +Prefill and decode phases run on separate workers - higher performance and scalability. 
+
+| Backend | Configuration | Deploy Command |
+|---------|---------------|----------------|
+| **vLLM** | [Disaggregated](components/backends/vllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg.yaml -n ${NAMESPACE}` |
+| **vLLM** | [Disaggregated + Router](components/backends/vllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_router.yaml -n ${NAMESPACE}` |
+| **vLLM** | [Disaggregated + Planner](components/backends/vllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
+| **SGLang** | [Disaggregated](components/backends/sglang/deploy/disagg.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg.yaml -n ${NAMESPACE}` |
+| **SGLang** | [Disaggregated + Planner](components/backends/sglang/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
+| **TensorRT-LLM** | [Disaggregated](components/backends/trtllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg.yaml -n ${NAMESPACE}` |
+| **TensorRT-LLM** | [Disaggregated + Router](components/backends/trtllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_router.yaml -n ${NAMESPACE}` |
+| **TensorRT-LLM** | [Disaggregated + Planner](components/backends/trtllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
+
+### **Multi-node Deployment** (Model replicas across multiple nodes)
+Scale disaggregated serving across multiple Kubernetes nodes for maximum performance.
+
+| Backend | Configuration | Deploy Command |
+|---------|---------------|----------------|
+| **vLLM** | [Multi-node](components/backends/vllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
+| **SGLang** | [Multi-node](components/backends/sglang/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
+| **TensorRT-LLM** | [Multi-node](components/backends/trtllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |

 ## 3. Deploy Your First Model
diff --git a/docs/quickstart.md b/docs/quickstart.md
new file mode 100644
index 0000000000..552553db52
--- /dev/null
+++ b/docs/quickstart.md
@@ -0,0 +1,300 @@
+# Dynamo Quickstart Guide
+
+Get up and running with NVIDIA Dynamo in minutes! This guide provides the fastest paths to deploy Dynamo for different use cases.
+
+## Choose Deployment Path
+
+| Use Case | Time to Deploy | Best For | Path |
+|----------|----------------|----------|------|
+| **Local Development** | 5 minutes | Testing, development, getting started | [Local Quickstart](#local-quickstart) |
+| **Kubernetes Production** | 15-20 minutes | Production deployments, scaling | [Kubernetes Quickstart](#kubernetes-quickstart) |
+
+---
+
+## Local Quickstart
+
+**Perfect for**: Development, testing, learning Dynamo concepts
+
+### Prerequisites
+- Ubuntu 24.04 (recommended) or compatible Linux
+- NVIDIA GPU with CUDA support
+- Docker & Docker Compose
+- Python 3.9+
+
+### 1. 
Install Dynamo + +```bash +# Install uv (recommended Python package manager) +curl -LsSf https://astral.sh/uv/install.sh | sh + +# Create virtual environment and install Dynamo +uv venv venv +source venv/bin/activate +uv pip install "ai-dynamo[sglang]==0.5.0" # or [vllm], [trtllm] +``` + +### 2. Start Infrastructure Services + +Dynamo uses **etcd** and **NATS** for distributed communication at data center scale. Even for local development, these services are required for component discovery and message passing. + +```bash +# Start etcd and NATS using Docker Compose +curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/release/0.5.0/deploy/docker-compose.yml +docker compose -f docker-compose.yml up -d +``` + +**What this sets up:** +- **etcd**: Distributed key-value store for service discovery and metadata storage +- **NATS**: High-performance message broker for inter-component communication + +### 3. Deploy Your First Model + +**Terminal 1 - Start the Frontend:** +```bash +python -m dynamo.frontend --http-port 8000 +``` + +**Terminal 2 - Start the Backend Worker:** +```bash +python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B +``` + +### 4. Test Your Deployment + +```bash +curl localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model": "Qwen/Qwen3-0.6B", + "messages": [{"role": "user", "content": "Hello!"}], + "max_tokens": 50}' +``` + +**Success!** You now have a working Dynamo deployment. + +### Cleanup +```bash +# Stop Dynamo components (Ctrl+C in each terminal) +# Stop infrastructure services +docker compose -f docker-compose.yml down +``` + +--- + +
+Framework-Specific Quickstarts (Click to expand) + +### vLLM Backend +```bash +# Install +uv pip install "ai-dynamo[vllm]" + +# Run +python -m dynamo.vllm --model Qwen/Qwen3-0.6B +``` + +### SGLang Backend +```bash +# Install dependencies +apt install -y libnuma-dev + +# Install +uv pip install "ai-dynamo[sglang]" + +# Run +python -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B +``` + +### TensorRT-LLM Backend + +**Note**: TensorRT-LLM requires the NVIDIA PyTorch container as a base, which needs NGC login. + +```bash +# 1. Login to NVIDIA NGC (required for PyTorch container) +docker login nvcr.io +# Enter your NGC username and API key when prompted + +# 2. Install prerequisites +uv pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 +uv pip install "cuda-python>=12,<13" +sudo apt-get -y install libopenmpi-dev + +# 3. Install +uv pip install "ai-dynamo[trtllm]" + +# 4. Run +python -m dynamo.trtllm --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B +``` + +**NGC Setup**: Get your NGC username and API key from [NGC Console](https://ngc.nvidia.com/setup/api-key) + +
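+
+Whichever backend you choose, the frontend side stays the same: workers connect to etcd/NATS when they start, and the frontend discovers them automatically. A minimal sketch, reusing the ports and model names from the examples above (swap in the worker command for your backend):
+
+```bash
+# Terminal 1: OpenAI-compatible frontend (backend-agnostic)
+python -m dynamo.frontend --http-port 8000
+
+# Terminal 2: any one of the workers above, e.g. vLLM
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B
+```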
+ +--- + +## Kubernetes Quickstart + +**Perfect for**: Production deployments, scaling, multi-node setups + +### Prerequisites +- Kubernetes cluster (1.24+) +- NVIDIA GPU operator installed +- kubectl configured +- Helm 3.0+ + +### 1. Install Dynamo Platform + +```bash +# Set environment +export NAMESPACE=dynamo-kubernetes +export RELEASE_VERSION=0.5.0 + +# Install CRDs +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz +helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default + +# Install Platform +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz +helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace +``` + +### 2. Deploy Your Model + +Choose your backend and deployment pattern: + +#### **Aggregated Serving** (Single-node, all-in-one) +Prefill and decode phases run on the same worker - simplest deployment pattern. + +| Backend | Configuration | Deploy Command | +|---------|---------------|----------------| +| **vLLM** | [Aggregated](components/backends/vllm/deploy/agg.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}` | +| **vLLM** | [Aggregated + Router](components/backends/vllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}` | +| **SGLang** | [Aggregated](components/backends/sglang/deploy/agg.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg.yaml -n ${NAMESPACE}` | +| **SGLang** | [Aggregated + Router](components/backends/sglang/deploy/agg_router.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg_router.yaml -n ${NAMESPACE}` | +| **TensorRT-LLM** | [Aggregated](components/backends/trtllm/deploy/agg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg.yaml -n ${NAMESPACE}` | +| **TensorRT-LLM** | [Aggregated + Router](components/backends/trtllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg_router.yaml -n ${NAMESPACE}` | + +#### **Disaggregated Serving** (Multi-node, specialized workers) +Prefill and decode phases run on separate workers - higher performance and scalability. 
+ +| Backend | Configuration | Deploy Command | +|---------|---------------|----------------| +| **vLLM** | [Disaggregated](components/backends/vllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg.yaml -n ${NAMESPACE}` | +| **vLLM** | [Disaggregated + Router](components/backends/vllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_router.yaml -n ${NAMESPACE}` | +| **vLLM** | [Disaggregated + Planner](components/backends/vllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` | +| **SGLang** | [Disaggregated](components/backends/sglang/deploy/disagg.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg.yaml -n ${NAMESPACE}` | +| **SGLang** | [Disaggregated + Planner](components/backends/sglang/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n ${NAMESPACE}` | +| **TensorRT-LLM** | [Disaggregated](components/backends/trtllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg.yaml -n ${NAMESPACE}` | +| **TensorRT-LLM** | [Disaggregated + Router](components/backends/trtllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_router.yaml -n ${NAMESPACE}` | +| **TensorRT-LLM** | [Disaggregated + Planner](components/backends/trtllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` | + +#### **Multi-node Deployment** (Distributed across multiple nodes) +Scale disaggregated serving across multiple Kubernetes nodes for maximum performance. + +| Backend | Configuration | Deploy Command | +|---------|---------------|----------------| +| **vLLM** | [Multi-node](components/backends/vllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` | +| **SGLang** | [Multi-node](components/backends/sglang/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg-multinode.yaml -n ${NAMESPACE}` | +| **TensorRT-LLM** | [Multi-node](components/backends/trtllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` | + +### 3. Test Your Deployment + +```bash +# Check status +kubectl get dynamoGraphDeployment -n ${NAMESPACE} + +# Test it +kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE} +curl http://localhost:8000/v1/models +``` + +**Success!** Your Dynamo deployment is running on Kubernetes. 
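+
+Beyond listing models, you can send a chat completion through the port-forwarded frontend as a further sanity check. A minimal sketch; replace the model name with whatever `/v1/models` reports for your deployment:
+
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
+```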
+ +### Cleanup +```bash +kubectl delete dynamoGraphDeployment agg-vllm -n ${NAMESPACE} +helm uninstall dynamo-platform -n ${NAMESPACE} +helm uninstall dynamo-crds --namespace default +``` + +--- + +## Next Steps + +### For Local Development Users + +**Dive deeper into Dynamo's architecture and Python development:** + +- **[Architecture Guide](docs/architecture/)** - Understand Dynamo's design and components +- **[Disaggregated Serving](examples/basics/disaggregated_serving/)** - Try advanced serving patterns locally +- **[Multi-node Deployment](examples/basics/multinode/)** - Scale across multiple local nodes +- **[Custom Backend Examples](examples/custom_backend/)** - Build your own Dynamo components +- **[Runtime Examples](lib/bindings/python/README.md)** - Low-level Python<>Rust bindings +- **[KV-Aware Routing](docs/architecture/kv_cache_routing.md)** - Understand intelligent request routing + +### For Kubernetes Production Users + +**Production deployment and operations:** + +- **[Kubernetes Documentation](docs/kubernetes/)** - Complete K8s deployment guide +- **[API Reference](docs/kubernetes/api_reference.md)** - DynamoGraphDeployment CRD specifications +- **[Installation Guide](docs/kubernetes/installation_guide.md)** - Detailed platform setup +- **[Monitoring Setup](docs/kubernetes/metrics.md)** - Observability and metrics +- **[Logging Configuration](docs/kubernetes/logging.md)** - Centralized logging setup +- **[Multi-node Deployment](docs/kubernetes/multinode-deployment.md)** - Scale across K8s nodes +- **[Performance Tuning](docs/benchmarks/)** - Optimize for your workload + +--- + +## Troubleshooting + +### Common Issues + +**"Connection refused" errors:** +```bash +# Check if etcd and NATS are running +docker ps | grep -E "(etcd|nats)" + +# Restart infrastructure services +docker compose -f docker-compose.yml down +docker compose -f docker-compose.yml up -d +``` + +**GPU not detected:** +```bash +# Check GPU availability +nvidia-smi + +# Verify CUDA installation +python -c "import torch; print(torch.cuda.is_available())" +``` + +**Kubernetes deployment stuck:** +```bash +# Check pod status +kubectl get pods -n dynamo-kubernetes + +# Check logs for any given component +kubectl logs -f deployment/agg-vllm-frontend -n dynamo-kubernetes +``` + +**Model download issues:** +```bash +# Set HuggingFace token for private models +export HUGGINGFACE_HUB_TOKEN=your_token_here + +# Or use local model path +python -m dynamo.vllm --model /path/to/local/model +``` + +### Getting Help + +- **[GitHub Issues](https://github.com/ai-dynamo/dynamo/issues)** - Report bugs and request features +- **[Discord Community](https://discord.gg/D92uqZRjCZ)** - Get help from the community +- **[Documentation](https://docs.nvidia.com/dynamo/latest/)** - Comprehensive guides and API docs + +--- + +## System Requirements + +For detailed compatibility information, see the [Support Matrix](docs/support_matrix.md). 
+ diff --git a/examples/basics/multinode/README.md b/examples/basics/multinode/README.md index ac2db06adb..ae077150f7 100644 --- a/examples/basics/multinode/README.md +++ b/examples/basics/multinode/README.md @@ -131,7 +131,7 @@ Open a terminal on Node 1 and launch both workers: ```bash # Launch prefill worker in background -CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \ +CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \ --model-path Qwen/Qwen3-0.6B \ --served-model-name Qwen/Qwen3-0.6B \ --page-size 16 \ @@ -141,7 +141,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \ --disaggregation-mode prefill \ --disaggregation-transfer-backend nixl & -CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \ +CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \ --model-path Qwen/Qwen3-0.6B \ --served-model-name Qwen/Qwen3-0.6B \ --page-size 16 \ @@ -157,7 +157,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \ > - `--page-size 16`: Sets the KV cache block size - must be identical across all workers > - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token > generation) > - `--disaggregation-transfer-backend nixl`: Enables high-speed GPU-to-GPU transfers -> - `--skip-tokenizer-init`: Avoids duplicate tokenizer loading since the frontend > handles tokenization +> - `--skip-tokenizer-init`: Avoids duplicate tokenizer loading since the frontend handles tokenization ### Step 3: Launch Replica 2 (Node 2) @@ -165,7 +165,7 @@ Open a terminal on Node 2 and launch both workers: ```bash # Launch prefill worker in background -CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \ +CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \ --model-path Qwen/Qwen3-0.6B \ --served-model-name Qwen/Qwen3-0.6B \ --page-size 16 \ @@ -176,7 +176,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \ --disaggregation-transfer-backend nixl & # Launch decode worker in foreground -CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \ +CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \ --model-path Qwen/Qwen3-0.6B \ --served-model-name Qwen/Qwen3-0.6B \ --page-size 16 \ @@ -473,7 +473,7 @@ Stop all components in reverse order: exit # Method 3: Kill by process name (from any terminal) - pkill -f "dynamo.sglang.worker.*prefill" + pkill -f "dynamo.sglang.*prefill" ``` 3. Stop infrastructure services: ```bash