From 0a0fd004c56fa4c5878579044efe5134dc86b3ca Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sat, 12 Jul 2025 02:38:17 +0800 Subject: [PATCH 001/552] Add a chunking processing function that supports long - text embedding, and update relevant documentation and examples. New example scripts and service startup scripts are added to demonstrate how to configure and utilize chunking processing. Update the model configuration to support long - text processing and implement the chunking processing logic in the code. Signed-off-by: x22x22 --- docs/models/pooling_models.md | 86 +++- docs/models/supported_models.md | 5 +- .../openai_embedding_long_text.md | 137 ++++++ .../openai_embedding_long_text_client.py | 234 ++++++++++ .../openai_embedding_long_text_service.sh | 80 ++++ vllm/config.py | 9 + vllm/entrypoints/openai/serving_embedding.py | 419 +++++++++++++++++- 7 files changed, 966 insertions(+), 4 deletions(-) create mode 100644 examples/online_serving/openai_embedding_long_text.md create mode 100644 examples/online_serving/openai_embedding_long_text_client.py create mode 100644 examples/online_serving/openai_embedding_long_text_service.sh diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md index f0de84a66f8..73f37f96cec 100644 --- a/docs/models/pooling_models.md +++ b/docs/models/pooling_models.md @@ -32,6 +32,90 @@ we attempt to override the default pooler based on its Sentence Transformers con You can customize the model's pooling method via the `--override-pooler-config` option, which takes priority over both the model's and Sentence Transformers's defaults. +## Chunked Processing for Long Text + +vLLM supports **chunked processing** for embedding models to handle text inputs that exceed the model's maximum token length. This feature automatically splits long text into manageable chunks, processes them separately, and aggregates the results. + +### Supported Models + +- `intfloat/multilingual-e5-large` +- Other embedding models can be extended to support this feature + +### How Chunked Processing Works + +1. **Automatic Detection**: When input text exceeds `max_model_len`, chunked processing is triggered +2. **Smart Chunking**: Text is split at token boundaries to maintain semantic integrity +3. **Parallel Processing**: Each chunk is processed independently through the model +4. **Intelligent Aggregation**: Results are combined using weighted averaging based on chunk token counts +5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing + +### Configuration + +Enable chunked processing by setting `enable_chunked_processing: true` in the pooler configuration: + +```bash +vllm serve intfloat/multilingual-e5-large \ + --task embed \ + --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}' \ + --max-model-len 10240 \ + --trust-remote-code +``` + +### Aggregation Algorithm + +The chunked processing uses a FastChat-inspired weighted averaging algorithm: + +```python +# Weighted average: sum(embedding_i * token_count_i) / total_tokens +weighted_sum = sum(embeddings[i] * weights[i] for i in range(num_chunks)) +final_embedding = weighted_sum / sum(weights) +``` + +This ensures that longer chunks contribute proportionally more to the final representation. 
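+
+For illustration only, the aggregation above can be reproduced outside vLLM with NumPy. The `chunk_embeddings` and `token_counts` values below are made-up placeholders, not output from a real model:
+
+```python
+import numpy as np
+
+# Hypothetical per-chunk results: three chunks with different token counts
+chunk_embeddings = np.array([
+    [0.1, 0.3, 0.5],   # embedding of chunk 1
+    [0.2, 0.1, 0.4],   # embedding of chunk 2
+    [0.0, 0.2, 0.6],   # embedding of chunk 3
+])
+token_counts = np.array([10240, 10240, 4520])  # tokens contained in each chunk
+
+# Weighted average: sum(embedding_i * token_count_i) / total_tokens
+weighted_sum = (chunk_embeddings * token_counts[:, None]).sum(axis=0)
+final_embedding = weighted_sum / token_counts.sum()
+print(final_embedding)  # same dimensionality as a single-chunk embedding
+```
+
+Any normalization configured for the model (for example `"normalize": true`) is applied by vLLM's pooler after this aggregation, so the sketch above deliberately skips it.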
+ +### Performance Characteristics + +| Aspect | Short Text (≤ max_len) | Long Text (> max_len) | +|--------|------------------------|----------------------| +| **Processing Time** | Standard | Increased (multiple inference calls) | +| **Memory Usage** | Standard | Reduced (chunks processed separately) | +| **Quality** | Standard | Maintains semantic representation | +| **Compatibility** | Full | Full (backward compatible) | + +### Example Usage + +```python +from openai import OpenAI + +client = OpenAI( + api_key="your-api-key", + base_url="http://localhost:31090/v1" +) + +# This will automatically use chunked processing if text is too long +response = client.embeddings.create( + input="Very long text that exceeds the model's maximum context length..." * 1000, + model="multilingual-e5-large" +) + +print(f"Embedding dimension: {len(response.data[0].embedding)}") +``` + +### Logging and Monitoring + +When chunked processing is active, you'll see informative log messages: + +``` +INFO: Input length 15000 exceeds max_model_len 10240, will use chunked processing +INFO: Split input of 15000 tokens into 2 chunks +``` + +### Limitations + +- **Increased Latency**: Processing multiple chunks takes longer than single-chunk processing +- **Model Support**: Currently limited to specific embedding models +- **Context Boundaries**: Chunking may split related content, though weighted averaging helps preserve overall semantics + ## Offline Inference The [LLM][vllm.LLM] class provides various methods for offline inference. @@ -170,7 +254,7 @@ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the dimensions parameter. ```text -curl http://127.0.0.1:8000/v1/embeddings \ +curl http://127.0.0.1:31090/v1/embeddings \ -H 'accept: application/json' \ -H 'Content-Type: application/json' \ -d '{ diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index ddc920aeb2d..a9597e45fd5 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -418,7 +418,7 @@ Specified using `--task embed`. | `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | | | `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | | | `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | | -| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, `intfloat/multilingual-e5-large` (see note), etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | | @@ -437,6 +437,9 @@ Specified using `--task embed`. !!! note The second-generation GTE model (mGTE-TRM) is named `NewModel`. The name `NewModel` is too generic, you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture. +!!! 
note + `intfloat/multilingual-e5-large` supports **long text embedding** with chunked processing. When input text exceeds the model's maximum length, the model automatically splits the input into chunks and processes them separately, then aggregates the results. Enable this feature with `--override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}'`. See the [Chunked Processing section](pooling_models.md#chunked-processing-for-long-text) for more details. + If your model is not in the above list, we will try to automatically convert the model using [as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings of the whole prompt are extracted from the normalized hidden state corresponding to the last token. diff --git a/examples/online_serving/openai_embedding_long_text.md b/examples/online_serving/openai_embedding_long_text.md new file mode 100644 index 00000000000..a974eab8c13 --- /dev/null +++ b/examples/online_serving/openai_embedding_long_text.md @@ -0,0 +1,137 @@ +# Long Text Embedding with Chunked Processing + +This directory contains examples for using vLLM's **chunked processing** feature to handle long text embedding that exceeds the model's maximum context length. + +## 🚀 Quick Start + +### 1. Start the Server + +Use the provided script to start a vLLM server with chunked processing enabled: + +```bash +# Basic usage +./openai_embedding_long_text_service.sh + +# Custom configuration +MODEL_NAME="intfloat/multilingual-e5-large" \ +PORT=31090 \ +MAX_MODEL_LEN=10240 \ +./openai_embedding_long_text_service.sh +``` + +### 2. Test Long Text Embedding + +Run the comprehensive test client: + +```bash +python openai_embedding_long_text_client.py +``` + +## 📁 Files + +| File | Description | +|------|-------------| +| `openai_embedding_long_text_service.sh` | Server startup script with chunked processing enabled | +| `openai_embedding_long_text_client.py` | Comprehensive test client for long text embedding | +| `openai_embedding_client.py` | Basic embedding client (updated with chunked processing info) | + +## ⚙️ Configuration + +### Server Configuration + +The key parameter for chunked processing is in the `--override-pooler-config`: + +```json +{ + "pooling_type": "CLS", + "normalize": true, + "enable_chunked_processing": true +} +``` + +### Environment Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use | +| `PORT` | `31090` | Server port | +| `GPU_COUNT` | `1` | Number of GPUs to use | +| `MAX_MODEL_LEN` | `10240` | Maximum model context length | +| `API_KEY` | `EMPTY` | API key for authentication | + +## 🔧 How It Works + +1. **Automatic Detection**: When input text exceeds `max_model_len`, chunked processing is triggered +2. **Smart Chunking**: Text is split at token boundaries to maintain semantic integrity +3. **Independent Processing**: Each chunk is processed separately through the model +4. **Weighted Aggregation**: Results are combined using token count-based weighted averaging +5. 
**Consistent Output**: Final embeddings maintain the same dimensionality as standard processing + +## 📊 Performance Characteristics + +| Text Length | Processing Method | Memory Usage | Speed | +|-------------|------------------|--------------|-------| +| ≤ max_len | Standard | Normal | Fast | +| > max_len | Chunked | Reduced per chunk | Slower (multiple inferences) | + +## 🧪 Test Cases + +The test client demonstrates: + +- ✅ **Short text**: Normal processing (baseline) +- ✅ **Medium text**: Single chunk processing +- ✅ **Long text**: Multi-chunk processing with aggregation +- ✅ **Very long text**: Many chunks processing +- ✅ **Batch processing**: Mixed-length inputs in one request +- ✅ **Consistency**: Reproducible results across runs + +## 🐛 Troubleshooting + +### Common Issues + +1. **Chunked processing not enabled**: + + ``` + ValueError: This model's maximum context length is 512 tokens... + ``` + + **Solution**: Ensure `enable_chunked_processing: true` in pooler config + +2. **Memory errors**: + +``` + RuntimeError: CUDA out of memory + ``` + +**Solution**: Reduce `MAX_MODEL_LEN` or use fewer GPUs + +1. **Slow processing**: + **Expected**: Long text takes more time due to multiple inference calls + +### Debug Information + +Server logs show chunked processing activity: + +``` +INFO: Input length 15000 exceeds max_model_len 10240, will use chunked processing +INFO: Split input of 15000 tokens into 2 chunks +``` + +## 📚 Additional Resources + +- [Pooling Models Documentation](../../docs/models/pooling_models.md#chunked-processing-for-long-text) +- [Supported Models List](../../docs/models/supported_models.md#text-embedding) +- [Original Feature Documentation](../../README_CHUNKED_PROCESSING.md) + +## 🤝 Contributing + +To extend chunked processing support to other embedding models: + +1. Check model compatibility with the pooling architecture +2. Test with various text lengths +3. Validate embedding quality compared to single-chunk processing +4. Submit PR with test cases and documentation updates + +--- + +**Note**: Chunked processing is currently supported for specific embedding models. See the [supported models documentation](../../docs/models/supported_models.md#chunked-processing-for-long-text) for the complete list. diff --git a/examples/online_serving/openai_embedding_long_text_client.py b/examples/online_serving/openai_embedding_long_text_client.py new file mode 100644 index 00000000000..cee268e4b77 --- /dev/null +++ b/examples/online_serving/openai_embedding_long_text_client.py @@ -0,0 +1,234 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +""" +Example script demonstrating long text embedding with chunked processing in vLLM. + +This example shows how to use vLLM's chunked processing feature to handle text +inputs that exceed the model's maximum token length. The feature automatically +splits long text into chunks and aggregates the results. + +Prerequisites: +1. Start vLLM server with chunked processing enabled: + + vllm serve intfloat/multilingual-e5-large \ + --task embed \ + --override-pooler-config \ + '{"pooling_type": "CLS", "normalize": true, \"enable_chunked_processing": true}' \ + --max-model-len 10240 \ + --served-model-name multilingual-e5-large \ + --trust-remote-code \ + --port 31090 \ + --api-key your-api-key + +2. 
Install required dependencies: + pip install openai requests +""" + +import time + +from openai import OpenAI + +# Configuration +API_KEY = "your-api-key" # Replace with your actual API key +BASE_URL = "http://localhost:31090/v1" +MODEL_NAME = "multilingual-e5-large" + + +def generate_long_text(base_text: str, repeat_count: int) -> str: + """Generate long text by repeating base text.""" + return base_text * repeat_count + + +def test_embedding_with_different_lengths(): + """Test embedding generation with different text lengths.""" + client = OpenAI(api_key=API_KEY, base_url=BASE_URL) + + # Test cases with different text lengths + test_cases = [ + { + "name": "Short Text", + "text": "Hello, this is a short text for embedding.", + "expected_chunks": 1, + }, + { + "name": "Medium Text", + "text": generate_long_text( + "This is a medium-length text that should fit within the " + "model's context window. " * 20, + 2, + ), + "expected_chunks": 1, + }, + { + "name": "Long Text (2 chunks)", + "text": generate_long_text( + "This is a very long text that will exceed the model's " + "maximum context length and trigger chunked processing. " * 50, + 5, + ), + "expected_chunks": 2, + }, + { + "name": "Very Long Text (3+ chunks)", + "text": generate_long_text( + "This text is extremely long and will definitely " + "require multiple chunks for processing. " * 100, + 10, + ), + "expected_chunks": 3, + }, + ] + + print("🧪 Testing vLLM Long Text Embedding with Chunked Processing") + print("=" * 70) + + for i, test_case in enumerate(test_cases, 1): + print(f"\n📝 Test {i}: {test_case['name']}") + print(f"Text length: {len(test_case['text'])} characters") + + try: + start_time = time.time() + + response = client.embeddings.create( + input=test_case["text"], model=MODEL_NAME, encoding_format="float" + ) + + end_time = time.time() + processing_time = end_time - start_time + + # Extract embedding data + embedding = response.data[0].embedding + embedding_dim = len(embedding) + + print("✅ Success!") + print(f" - Embedding dimension: {embedding_dim}") + print(f" - Processing time: {processing_time:.2f}s") + print(f" - Expected chunks: ~{test_case['expected_chunks']}") + print(f" - First 5 values: {embedding[:5]}") + + except Exception as e: + print(f"❌ Failed: {str(e)}") + + +def test_batch_embedding(): + """Test batch embedding with mixed-length inputs.""" + client = OpenAI(api_key=API_KEY, base_url=BASE_URL) + + print("\n🔄 Testing Batch Embedding with Mixed Lengths") + print("=" * 50) + + # Mix of short and long texts + batch_inputs = [ + "Short text 1", + generate_long_text("Medium length text that fits in one chunk. " * 20, 1), + "Another short text", + generate_long_text("Long text requiring chunked processing. 
" * 100, 5), + ] + + try: + start_time = time.time() + + response = client.embeddings.create( + input=batch_inputs, model=MODEL_NAME, encoding_format="float" + ) + + end_time = time.time() + processing_time = end_time - start_time + + print("✅ Batch processing successful!") + print(f" - Number of inputs: {len(batch_inputs)}") + print(f" - Number of embeddings: {len(response.data)}") + print(f" - Total processing time: {processing_time:.2f}s") + print( + f" - Average time per input: {processing_time / len(batch_inputs):.2f}s" + ) + + for i, data in enumerate(response.data): + input_length = len(batch_inputs[i]) + embedding_dim = len(data.embedding) + print( + f" - Input {i + 1}: {input_length} chars → {embedding_dim}D embedding" + ) + + except Exception as e: + print(f"❌ Batch processing failed: {str(e)}") + + +def test_embedding_consistency(): + """Test that chunked processing produces consistent results.""" + client = OpenAI(api_key=API_KEY, base_url=BASE_URL) + + print("\n🔍 Testing Embedding Consistency") + print("=" * 40) + + # Use the same long text multiple times + long_text = generate_long_text( + "Consistency test text for chunked processing validation. " * 50, 3 + ) + + embeddings = [] + + try: + for i in range(3): + response = client.embeddings.create( + input=long_text, model=MODEL_NAME, encoding_format="float" + ) + embeddings.append(response.data[0].embedding) + print(f" - Generated embedding {i + 1}") + + # Check consistency (embeddings should be identical) + if len(embeddings) >= 2: + # Calculate similarity between first two embeddings + import numpy as np + + emb1 = np.array(embeddings[0]) + emb2 = np.array(embeddings[1]) + + # Cosine similarity + cosine_sim = np.dot(emb1, emb2) / ( + np.linalg.norm(emb1) * np.linalg.norm(emb2) + ) + + print("✅ Consistency test completed!") + print(f" - Cosine similarity between runs: {cosine_sim:.6f}") + print(" - Expected: ~1.0 (identical embeddings)") + + if cosine_sim > 0.999: + print(" - ✅ High consistency achieved!") + else: + print(" - ⚠️ Consistency may vary due to numerical precision") + + except Exception as e: + print(f"❌ Consistency test failed: {str(e)}") + + +def main(): + """Main function to run all tests.""" + print("🚀 vLLM Long Text Embedding Client") + print(f"📡 Connecting to: {BASE_URL}") + print(f"🤖 Model: {MODEL_NAME}") + masked_key = "*" * (len(API_KEY) - 4) + API_KEY[-4:] if len(API_KEY) > 4 else "****" + print(f"🔑 API Key: {masked_key}") + + # Run all test cases + test_embedding_with_different_lengths() + test_batch_embedding() + test_embedding_consistency() + + print("\n" + "=" * 70) + print("🎉 All tests completed!") + print("\n💡 Key Features Demonstrated:") + print(" - ✅ Automatic chunked processing for long text") + print(" - ✅ Seamless handling of mixed-length batches") + print(" - ✅ Consistent embedding generation") + print(" - ✅ Backward compatibility with short text") + print("\n📚 For more information, see:") + print( + " - Documentation: https://docs.vllm.ai/en/latest/models/pooling_models.html" + ) + print(" - Chunked Processing Guide: openai_embedding_long_text.md") + + +if __name__ == "__main__": + main() diff --git a/examples/online_serving/openai_embedding_long_text_service.sh b/examples/online_serving/openai_embedding_long_text_service.sh new file mode 100644 index 00000000000..3012049002e --- /dev/null +++ b/examples/online_serving/openai_embedding_long_text_service.sh @@ -0,0 +1,80 @@ +#!/bin/bash + +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM 
project + +# vLLM Embedding Server with Chunked Processing +# This script starts a vLLM server with chunked processing enabled for long text embedding. + +set -euo pipefail + +# Configuration +MODEL_NAME=${MODEL_NAME:-"intfloat/multilingual-e5-large"} +PORT=${PORT:-31090} +GPU_COUNT=${GPU_COUNT:-1} +MAX_MODEL_LEN=${MAX_MODEL_LEN:-10240} +API_KEY=${API_KEY:-"your-api-key"} + +echo "🚀 Starting vLLM Embedding Server with Chunked Processing" +echo "================================================================" + +# Environment variables for optimization +export VLLM_WORKER_MULTIPROC_METHOD=spawn +export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 + +# Display configuration +echo "📋 Configuration:" +echo " - Model: $MODEL_NAME" +echo " - Port: $PORT" +echo " - GPU Count: $GPU_COUNT" +echo " - Max Model Length: $MAX_MODEL_LEN tokens" +echo " - Chunked Processing: ENABLED" +echo " - Pooling Type: CLS + Normalization" +echo "" + +# Validate GPU availability +if command -v nvidia-smi &> /dev/null; then + gpu_count=$(nvidia-smi --list-gpus | wc -l) + echo "🖥️ Available GPUs: $gpu_count" + if [ "$GPU_COUNT" -gt "$gpu_count" ]; then + echo "⚠️ Warning: Requested $GPU_COUNT GPUs but only $gpu_count available" + echo " Adjusting to use $gpu_count GPUs" + GPU_COUNT=$gpu_count + fi +else + echo "⚠️ Warning: nvidia-smi not found. GPU detection skipped." +fi + +echo "" +echo "🔧 Starting server with chunked processing configuration..." + +# Start vLLM server with chunked processing enabled +vllm serve "$MODEL_NAME" \ + --tensor-parallel-size "$GPU_COUNT" \ + --enforce-eager \ + --max-model-len "$MAX_MODEL_LEN" \ + --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}' \ + --served-model-name multilingual-e5-large \ + --task embed \ + --use-v2-block-manager \ + --api-key "$API_KEY" \ + --trust-remote-code \ + --port "$PORT" \ + --host 0.0.0.0 + +echo "" +echo "✅ vLLM Embedding Server started successfully!" +echo "" +echo "📡 Server Information:" +echo " - Base URL: http://localhost:$PORT" +echo " - Model Name: multilingual-e5-large" +echo " - API Key: $API_KEY" +echo "" +echo "🧪 Test the server with:" +echo " python examples/online_serving/openai_embedding_long_text_client.py" +echo "" +echo "📚 Features enabled:" +echo " ✅ Long text chunked processing" +echo " ✅ Automatic chunk aggregation" +echo " ✅ OpenAI-compatible API" +echo " ✅ GPU acceleration" \ No newline at end of file diff --git a/vllm/config.py b/vllm/config.py index b1f7f9e57a7..5bb24774e82 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -3240,6 +3240,15 @@ class PoolerConfig: ``math-shepherd-mistral-7b-prm`` model. """ + enable_chunked_processing: Optional[bool] = None + """ + Whether to enable chunked processing for long inputs that exceed the model's + maximum position embeddings. When enabled, long inputs will be split into + chunks, processed separately, and then aggregated using weighted averaging. + This allows embedding models to handle arbitrarily long text without CUDA + errors. Defaults to False. 
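+    Example: pass ``--override-pooler-config '{"pooling_type": "CLS",
+    "normalize": true, "enable_chunked_processing": true}'`` when serving
+    an embedding model, as shown in the accompanying documentation.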
+ """ + def compute_hash(self) -> str: """ WARNING: Whenever a new field is added to this config, diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index e87decfe636..300703c3ce9 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -2,9 +2,11 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import base64 +from collections.abc import AsyncGenerator from typing import Final, Literal, Optional, Union, cast import numpy as np +import torch from fastapi import Request from typing_extensions import assert_never, override @@ -13,17 +15,21 @@ from vllm.entrypoints.chat_utils import ChatTemplateContentFormatOption from vllm.entrypoints.logger import RequestLogger from vllm.entrypoints.openai.protocol import (EmbeddingChatRequest, + EmbeddingCompletionRequest, EmbeddingRequest, EmbeddingResponse, EmbeddingResponseData, ErrorResponse, UsageInfo) from vllm.entrypoints.openai.serving_engine import (EmbeddingServeContext, OpenAIServing, - ServeContext) + ServeContext, + TextTokensPrompt) from vllm.entrypoints.openai.serving_models import OpenAIServingModels +from vllm.inputs.data import EmbedsPrompt as EngineEmbedsPrompt +from vllm.inputs.data import TokensPrompt as EngineTokensPrompt from vllm.logger import init_logger from vllm.outputs import (EmbeddingOutput, EmbeddingRequestOutput, - PoolingRequestOutput) + PoolingOutput, PoolingRequestOutput, RequestOutput) logger = init_logger(__name__) @@ -133,6 +139,415 @@ def _build_response( usage=usage, ) + def _get_max_position_embeddings(self) -> int: + """Get the model's effective maximum sequence length for chunking. + + This uses the same logic as vLLM's _get_and_verify_max_len to determine + the actual sequence length limit, + considering both model config and tokenizer config. 
+ """ + hf_config = self.model_config.hf_config + + # Start with max_position_embeddings from model config + derived_max_len = getattr(hf_config, 'max_position_embeddings', 2048) + + # Get tokenizer config for pooling models (embedding models) + if self.model_config.runner_type == "pooling": + from vllm.transformers_utils.config import try_get_tokenizer_config + tokenizer_config = try_get_tokenizer_config( + self.model_config.tokenizer, + trust_remote_code=self.model_config.trust_remote_code, + revision=self.model_config.tokenizer_revision) + + # Consider model_max_length in tokenizer_config + # (same logic as _get_and_verify_max_len) + if tokenizer_config: + tokenizer_model_max_length = tokenizer_config.get( + 'model_max_length', derived_max_len) + derived_max_len = min(derived_max_len, + tokenizer_model_max_length) + + return int(derived_max_len) + + def _should_use_chunked_processing(self, request) -> bool: + """Check if chunked processing should be used for this request.""" + if not isinstance(request, + (EmbeddingChatRequest, EmbeddingCompletionRequest)): + return False + + pooler_config = getattr(self.model_config, 'pooler_config', None) + return (pooler_config is not None + and getattr(pooler_config, 'enable_chunked_processing', False)) + + def _chunk_token_ids(self, token_ids: list[int], + chunk_size: int) -> list[list[int]]: + """Split token IDs into chunks of specified size.""" + if len(token_ids) <= chunk_size: + return [token_ids] + + chunks = [] + for i in range(0, len(token_ids), chunk_size): + chunk = token_ids[i:i + chunk_size] + chunks.append(chunk) + return chunks + + async def _process_chunked_request( + self, + ctx: EmbeddingServeContext, + original_prompt: TextTokensPrompt, + pooling_params, + trace_headers, + ) -> list[AsyncGenerator[Union[RequestOutput, PoolingRequestOutput], + None]]: + """Process a single prompt using chunked processing.""" + generators = [] + token_ids = original_prompt["prompt_token_ids"] + + # Split into chunks using max_position_embeddings + max_pos_embeddings = self._get_max_position_embeddings() + chunks = self._chunk_token_ids(token_ids, max_pos_embeddings) + + logger.info( + "Split input of %s tokens into %s chunks (max_chunk_size: %s)", + len(token_ids), len(chunks), max_pos_embeddings) + + for chunk_idx, chunk_tokens in enumerate(chunks): + # Create a request ID for this chunk + chunk_request_id = f"{ctx.request_id}-chunk-{chunk_idx}" + + # Create engine prompt for this chunk + chunk_engine_prompt = EngineTokensPrompt( + prompt_token_ids=chunk_tokens) + + # Create chunk request prompt for logging + chunk_text = "" + chunk_request_prompt = TextTokensPrompt( + prompt=chunk_text, prompt_token_ids=chunk_tokens) + + # Log the chunk + self._log_inputs(chunk_request_id, + chunk_request_prompt, + params=pooling_params, + lora_request=ctx.lora_request, + prompt_adapter_request=ctx.prompt_adapter_request) + + # Create generator for this chunk + generator = self.engine_client.encode( + chunk_engine_prompt, + pooling_params, + chunk_request_id, + lora_request=ctx.lora_request, + trace_headers=trace_headers, + priority=getattr(ctx.request, "priority", 0), + ) + + generators.append(generator) + + return generators + + async def _aggregate_chunked_results( + self, + ctx: EmbeddingServeContext, + chunk_results: list[PoolingRequestOutput], + original_token_count: int, + original_prompt_token_ids: Optional[list[int]] = None, + ) -> PoolingRequestOutput: + """Aggregate results from multiple chunks + using vLLM-compatible weighted averaging.""" + if 
len(chunk_results) == 1: + return chunk_results[0] + + # Extract embeddings and use vLLM's token counting approach + chunk_embeddings = [] + chunk_weights = [] + + for result in chunk_results: + # PoolingRequestOutput.outputs is a PoolingOutput object + if hasattr(result, 'outputs') and hasattr(result.outputs, 'data'): + # Get the embedding tensor from PoolingOutput.data + embedding_data = result.outputs.data + if not isinstance(embedding_data, torch.Tensor): + embedding_data = torch.tensor(embedding_data, + dtype=torch.float32) + chunk_embeddings.append(embedding_data) + + # Use actual effective token count + # this is what vLLM uses internally + effective_token_count = len(result.prompt_token_ids) + chunk_weights.append(effective_token_count) + + if not chunk_embeddings: + raise ValueError("No valid embeddings found in chunk results") + + # Simple weighted averaging compatible with vLLM's approach + # This is similar to what MeanPool does for multiple sequences + device = chunk_embeddings[0].device + # Use float32 for precision, as done in vLLM's PoolerHead + dtype = torch.float32 + + # Weighted sum following vLLM's internal logic + weighted_sum = torch.zeros_like(chunk_embeddings[0], + dtype=dtype, + device=device) + total_weight = 0 + + for embedding, weight in zip(chunk_embeddings, chunk_weights): + embedding = embedding.to(dtype=dtype, device=device) + weighted_sum += embedding * weight + total_weight += weight + + # Final averaged embedding - let vLLM handle the rest + aggregated_embedding = weighted_sum / total_weight + + # NOTE: Don't manually normalize here + # let vLLM's PoolerHead handle normalization + # based on the model's pooler_config.normalize setting. + # This ensures consistency with vLLM's standard pooling behavior. + + # Create aggregated result using vLLM's standard output structure + first_result = chunk_results[0] + + # Create new PoolingOutput with aggregated embedding + aggregated_output = PoolingOutput(data=aggregated_embedding) + + # Preserve original prompt token ids for consistency + result_prompt_token_ids = (original_prompt_token_ids + if original_prompt_token_ids is not None + else first_result.prompt_token_ids) + + aggregated_result = PoolingRequestOutput( + request_id=first_result.request_id, + outputs=aggregated_output, + prompt_token_ids=result_prompt_token_ids, + finished=True, + ) + + return aggregated_result + + def _validate_input( + self, + request, + input_ids: list[int], + input_text: str, + ) -> TextTokensPrompt: + """Override to support chunked processing for embedding requests.""" + token_num = len(input_ids) + + # Note: EmbeddingRequest doesn't have max_tokens + if isinstance(request, + (EmbeddingChatRequest, EmbeddingCompletionRequest)): + # Check if chunked processing is enabled for pooling models + pooler_config = getattr(self.model_config, 'pooler_config', None) + enable_chunked = (pooler_config is not None and getattr( + pooler_config, 'enable_chunked_processing', False)) + + # Use max_position_embeddings for chunked processing decisions + max_pos_embeddings = self._get_max_position_embeddings() + + if token_num > max_pos_embeddings: + if enable_chunked: + # Allow long inputs when chunked processing is enabled + logger.info( + "Input length %s exceeds max_position_embeddings " + "%s, will use chunked processing", token_num, + max_pos_embeddings) + else: + raise ValueError( + f"This model's maximum position embeddings length is " + f"{max_pos_embeddings} tokens. 
However, you requested " + f"{token_num} tokens in the input for embedding " + f"generation. Please reduce the length of the input or " + f"enable chunked processing.") + + return TextTokensPrompt(prompt=input_text, + prompt_token_ids=input_ids) + + # For other request types, use the parent's implementation + return super()._validate_input(request, input_ids, input_text) + + async def _prepare_generators( + self, + ctx: ServeContext, + ) -> Optional[ErrorResponse]: + """Override to support chunked processing.""" + ctx = cast(EmbeddingServeContext, ctx) + generators: list[AsyncGenerator[Union[RequestOutput, + PoolingRequestOutput], + None]] = [] + + try: + trace_headers = (None if ctx.raw_request is None else await + self._get_trace_headers(ctx.raw_request.headers)) + + if not hasattr(ctx.request, "to_pooling_params"): + return self.create_error_response( + "Request type does not support pooling parameters") + + pooling_params = ctx.request.to_pooling_params() + + if ctx.engine_prompts is None: + return self.create_error_response( + "Engine prompts not available") + + if ctx.request_prompts is None: + return self.create_error_response( + "Request prompts not available") + + # Check if we should use chunked processing + use_chunked = self._should_use_chunked_processing(ctx.request) + + for i, engine_prompt in enumerate(ctx.engine_prompts): + request_prompt = ctx.request_prompts[i] + + # Check if this specific prompt needs chunked processing + max_pos_embeddings = self._get_max_position_embeddings() + if (use_chunked and isinstance(request_prompt, dict) + and "prompt_token_ids" in request_prompt + and len(request_prompt["prompt_token_ids"]) + > max_pos_embeddings): + + # Use chunked processing for this prompt + chunk_generators = await self._process_chunked_request( + ctx, request_prompt, pooling_params, trace_headers) + generators.extend(chunk_generators) + else: + # Normal processing for short prompts + request_id_item = f"{ctx.request_id}-{i}" + + self._log_inputs( + request_id_item, + request_prompt, + params=pooling_params, + lora_request=ctx.lora_request, + prompt_adapter_request=ctx.prompt_adapter_request) + + # Mypy has an existing bug related to inferring the variance + # of TypedDicts with `builtins.enumerate`: + # https://github.com/python/mypy/issues/8586#issuecomment-2867698435 + engine_prompt = cast( + Union[EngineTokensPrompt, EngineEmbedsPrompt], + engine_prompt) + generator = self.engine_client.encode( + engine_prompt, + pooling_params, + request_id_item, + lora_request=ctx.lora_request, + trace_headers=trace_headers, + priority=getattr(ctx.request, "priority", 0), + ) + + generators.append(generator) + + from vllm.utils import merge_async_iterators + ctx.result_generator = merge_async_iterators(*generators) + + return None + + except Exception as e: + # TODO: Use a vllm-specific Validation Error + return self.create_error_response(str(e)) + + async def _collect_batch( + self, + ctx: ServeContext, + ) -> Optional[ErrorResponse]: + """Override to support chunked processing.""" + ctx = cast(EmbeddingServeContext, ctx) + try: + if ctx.engine_prompts is None: + return self.create_error_response( + "Engine prompts not available") + + if ctx.request_prompts is None: + return self.create_error_response( + "Request prompts not available") + + if ctx.result_generator is None: + return self.create_error_response( + "Result generator not available") + + # Check if we used chunked processing + use_chunked = self._should_use_chunked_processing(ctx.request) + + # Collect all 
results first + all_results = [] + async for i, res in ctx.result_generator: + all_results.append((i, res)) + + # Group results by original prompt + if use_chunked: + # For chunked processing, we need to group chunk results by + # original prompt + final_res_batch = [] + + max_pos_embeddings = self._get_max_position_embeddings() + for prompt_idx, request_prompt in enumerate( + ctx.request_prompts): + if (isinstance(request_prompt, dict) + and "prompt_token_ids" in request_prompt + and len(request_prompt["prompt_token_ids"]) + > max_pos_embeddings): + + # This prompt was chunked, collect all its chunk results + chunk_results = [] + chunk_prefix = f"{ctx.request_id}-chunk-" + + for result_idx, result in all_results: + if result.request_id.startswith(chunk_prefix): + chunk_results.append(result) + + if chunk_results: + # Aggregate chunk results + original_token_count = len( + request_prompt["prompt_token_ids"]) + aggregated_result = await \ + self._aggregate_chunked_results( + ctx, chunk_results, original_token_count, + request_prompt["prompt_token_ids"]) + final_res_batch.append(aggregated_result) + else: + return self.create_error_response( + f"No chunk results found for prompt " + f"{prompt_idx}") + else: + # Normal prompt, find its result + expected_id = f"{ctx.request_id}-{prompt_idx}" + found = False + for result_idx, result in all_results: + if result.request_id == expected_id: + final_res_batch.append(result) + found = True + break + + if not found: + return self.create_error_response( + f"Result not found for prompt {prompt_idx}") + + ctx.final_res_batch = final_res_batch + else: + # Normal processing - original logic + num_prompts = len(ctx.engine_prompts) + final_res_batch: list[Optional[Union[RequestOutput, + PoolingRequestOutput]]] + final_res_batch = [None] * num_prompts + + for result_idx, result in all_results: + if result_idx < num_prompts: + final_res_batch[result_idx] = result + + if None in final_res_batch: + return self.create_error_response( + "Failed to generate results for all prompts") + + ctx.final_res_batch = [ + res for res in final_res_batch if res is not None + ] + + return None + + except Exception as e: + return self.create_error_response(str(e)) + class OpenAIServingEmbedding(EmbeddingMixin): request_id_prefix = "embd" From 50bfdf95ef48e8718624ab8abddceb9e299cf2d0 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sat, 12 Jul 2025 03:21:31 +0800 Subject: [PATCH 002/552] Rectify the code formatting issues, disable yapf to prevent conflicts with isort, and ensure the accuracy of docstrings. 
Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 300703c3ce9..08d6c792e96 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -14,12 +14,15 @@ from vllm.engine.protocol import EngineClient from vllm.entrypoints.chat_utils import ChatTemplateContentFormatOption from vllm.entrypoints.logger import RequestLogger +# yapf conflicts with isort for this docstring +# yapf: disable from vllm.entrypoints.openai.protocol import (EmbeddingChatRequest, EmbeddingCompletionRequest, EmbeddingRequest, EmbeddingResponse, EmbeddingResponseData, ErrorResponse, UsageInfo) +# yapf: enable from vllm.entrypoints.openai.serving_engine import (EmbeddingServeContext, OpenAIServing, ServeContext, From c475c83b1af77702a0e1e3c1da0c408f4886e418 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sat, 12 Jul 2025 03:40:23 +0800 Subject: [PATCH 003/552] Optimize the embedding processing logic, add checks for text token prompts, and improve the implementation of chunk processing to ensure accuracy and efficiency when handling long texts. Meanwhile, relevant type annotations have been updated to enhance code readability and type safety. Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 204 +++++++++++-------- 1 file changed, 115 insertions(+), 89 deletions(-) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 08d6c792e96..aee8a29792c 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -22,11 +22,11 @@ EmbeddingResponse, EmbeddingResponseData, ErrorResponse, UsageInfo) -# yapf: enable from vllm.entrypoints.openai.serving_engine import (EmbeddingServeContext, OpenAIServing, ServeContext, TextTokensPrompt) +# yapf: enable from vllm.entrypoints.openai.serving_models import OpenAIServingModels from vllm.inputs.data import EmbedsPrompt as EngineEmbedsPrompt from vllm.inputs.data import TokensPrompt as EngineTokensPrompt @@ -200,10 +200,9 @@ async def _process_chunked_request( original_prompt: TextTokensPrompt, pooling_params, trace_headers, - ) -> list[AsyncGenerator[Union[RequestOutput, PoolingRequestOutput], - None]]: + ) -> list[AsyncGenerator[PoolingRequestOutput, None]]: """Process a single prompt using chunked processing.""" - generators = [] + generators: list[AsyncGenerator[PoolingRequestOutput, None]] = [] token_ids = original_prompt["prompt_token_ids"] # Split into chunks using max_position_embeddings @@ -368,6 +367,11 @@ def _validate_input( # For other request types, use the parent's implementation return super()._validate_input(request, input_ids, input_text) + def _is_text_tokens_prompt(self, prompt) -> bool: + """Check if a prompt is a TextTokensPrompt (has prompt_token_ids).""" + return (isinstance(prompt, dict) and "prompt_token_ids" in prompt + and "prompt_embeds" not in prompt) + async def _prepare_generators( self, ctx: ServeContext, @@ -404,42 +408,46 @@ async def _prepare_generators( # Check if this specific prompt needs chunked processing max_pos_embeddings = self._get_max_position_embeddings() - if (use_chunked and isinstance(request_prompt, dict) - and "prompt_token_ids" in request_prompt - and len(request_prompt["prompt_token_ids"]) - > max_pos_embeddings): - - # Use chunked processing for this prompt - chunk_generators = await 
self._process_chunked_request( - ctx, request_prompt, pooling_params, trace_headers) - generators.extend(chunk_generators) - else: - # Normal processing for short prompts - request_id_item = f"{ctx.request_id}-{i}" - - self._log_inputs( - request_id_item, - request_prompt, - params=pooling_params, - lora_request=ctx.lora_request, - prompt_adapter_request=ctx.prompt_adapter_request) - - # Mypy has an existing bug related to inferring the variance - # of TypedDicts with `builtins.enumerate`: - # https://github.com/python/mypy/issues/8586#issuecomment-2867698435 - engine_prompt = cast( - Union[EngineTokensPrompt, EngineEmbedsPrompt], - engine_prompt) - generator = self.engine_client.encode( - engine_prompt, - pooling_params, - request_id_item, - lora_request=ctx.lora_request, - trace_headers=trace_headers, - priority=getattr(ctx.request, "priority", 0), - ) - - generators.append(generator) + if (use_chunked + and self._is_text_tokens_prompt(request_prompt)): + # Cast to TextTokensPrompt since we've + # verified prompt_token_ids + text_tokens_prompt = cast(TextTokensPrompt, request_prompt) + if len(text_tokens_prompt["prompt_token_ids"] + ) > max_pos_embeddings: + # Use chunked processing for this prompt + chunk_generators = await self._process_chunked_request( + ctx, text_tokens_prompt, pooling_params, + trace_headers) + generators.extend(chunk_generators) + continue + + # Normal processing for short prompts or non-token prompts + request_id_item = f"{ctx.request_id}-{i}" + + self._log_inputs( + request_id_item, + request_prompt, + params=pooling_params, + lora_request=ctx.lora_request, + prompt_adapter_request=ctx.prompt_adapter_request) + + # Mypy has an existing bug related to inferring the variance + # of TypedDicts with `builtins.enumerate`: + # https://github.com/python/mypy/issues/8586#issuecomment-2867698435 + engine_prompt = cast( + Union[EngineTokensPrompt, EngineEmbedsPrompt], + engine_prompt) + generator = self.engine_client.encode( + engine_prompt, + pooling_params, + request_id_item, + lora_request=ctx.lora_request, + trace_headers=trace_headers, + priority=getattr(ctx.request, "priority", 0), + ) + + generators.append(generator) from vllm.utils import merge_async_iterators ctx.result_generator = merge_async_iterators(*generators) @@ -481,70 +489,88 @@ async def _collect_batch( if use_chunked: # For chunked processing, we need to group chunk results by # original prompt - final_res_batch = [] + chunked_final_res_batch: list[PoolingRequestOutput] = [] max_pos_embeddings = self._get_max_position_embeddings() for prompt_idx, request_prompt in enumerate( ctx.request_prompts): - if (isinstance(request_prompt, dict) - and "prompt_token_ids" in request_prompt - and len(request_prompt["prompt_token_ids"]) - > max_pos_embeddings): - - # This prompt was chunked, collect all its chunk results - chunk_results = [] - chunk_prefix = f"{ctx.request_id}-chunk-" - - for result_idx, result in all_results: - if result.request_id.startswith(chunk_prefix): - chunk_results.append(result) - - if chunk_results: - # Aggregate chunk results - original_token_count = len( - request_prompt["prompt_token_ids"]) - aggregated_result = await \ - self._aggregate_chunked_results( - ctx, chunk_results, original_token_count, - request_prompt["prompt_token_ids"]) - final_res_batch.append(aggregated_result) - else: - return self.create_error_response( - f"No chunk results found for prompt " - f"{prompt_idx}") - else: - # Normal prompt, find its result - expected_id = f"{ctx.request_id}-{prompt_idx}" 
- found = False - for result_idx, result in all_results: - if result.request_id == expected_id: - final_res_batch.append(result) - found = True - break - - if not found: - return self.create_error_response( - f"Result not found for prompt {prompt_idx}") - - ctx.final_res_batch = final_res_batch + if self._is_text_tokens_prompt(request_prompt): + # Cast to TextTokensPrompt + # since we've verified prompt_token_ids + text_tokens_prompt = cast(TextTokensPrompt, + request_prompt) + if len(text_tokens_prompt["prompt_token_ids"] + ) > max_pos_embeddings: + # This prompt was chunked, collect all + # its chunk results + chunk_results: list[PoolingRequestOutput] = [] + chunk_prefix = f"{ctx.request_id}-chunk-" + + for result_idx, result in all_results: + if result.request_id.startswith(chunk_prefix): + # Cast to PoolingRequestOutput since + # we know chunked results are always pooling + chunk_results.append( + cast(PoolingRequestOutput, result)) + + if chunk_results: + # Aggregate chunk results + original_token_count = len( + text_tokens_prompt["prompt_token_ids"]) + aggregated_result = await \ + self._aggregate_chunked_results( + ctx, chunk_results, + original_token_count, + text_tokens_prompt["prompt_token_ids"]) + chunked_final_res_batch.append( + aggregated_result) + else: + return self.create_error_response( + f"No chunk results found for prompt " + f"{prompt_idx}") + continue + + # Normal prompt (short or embeds), find its result + expected_id = f"{ctx.request_id}-{prompt_idx}" + found = False + for result_idx, result in all_results: + if result.request_id == expected_id: + # Cast to PoolingRequestOutput for embedding results + chunked_final_res_batch.append( + cast(PoolingRequestOutput, result)) + found = True + break + + if not found: + return self.create_error_response( + f"Result not found for prompt {prompt_idx}") + + # Update the final result batch with proper type + ctx.final_res_batch = cast( + list[Union[RequestOutput, PoolingRequestOutput]], + chunked_final_res_batch) else: # Normal processing - original logic num_prompts = len(ctx.engine_prompts) - final_res_batch: list[Optional[Union[RequestOutput, - PoolingRequestOutput]]] - final_res_batch = [None] * num_prompts + normal_final_res_batch: list[ + Optional[PoolingRequestOutput]] = [None] * num_prompts for result_idx, result in all_results: if result_idx < num_prompts: - final_res_batch[result_idx] = result + # Cast to PoolingRequestOutput for embedding results + normal_final_res_batch[result_idx] = cast( + PoolingRequestOutput, result) - if None in final_res_batch: + if None in normal_final_res_batch: return self.create_error_response( "Failed to generate results for all prompts") - ctx.final_res_batch = [ - res for res in final_res_batch if res is not None + final_results = [ + res for res in normal_final_res_batch if res is not None ] + ctx.final_res_batch = cast( + list[Union[RequestOutput, PoolingRequestOutput]], + final_results) return None From c4925a98325af9df218c21f4a05757c2c6bd7c92 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sat, 12 Jul 2025 04:06:25 +0800 Subject: [PATCH 004/552] Added multiple long-text batch processing tests to verify the uniqueness of block IDs and fix the block ID conflicts in batch processing. Updated relevant examples to demonstrate the new features. 
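
A minimal sketch of the new per-prompt chunk ID scheme (the request ID value
here is illustrative):

    request_id, prompt_idx, chunk_idx = "embd-abc123", 1, 0
    chunk_request_id = f"{request_id}-prompt-{prompt_idx}-chunk-{chunk_idx}"
    # -> "embd-abc123-prompt-1-chunk-0"; the prompt index keeps chunks from
    #    different prompts in the same batch from sharing an ID.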
Signed-off-by: x22x22 --- .../openai_embedding_long_text_client.py | 114 ++++++++++++++++++ vllm/entrypoints/openai/serving_embedding.py | 10 +- 2 files changed, 121 insertions(+), 3 deletions(-) diff --git a/examples/online_serving/openai_embedding_long_text_client.py b/examples/online_serving/openai_embedding_long_text_client.py index cee268e4b77..1297a1f0d6c 100644 --- a/examples/online_serving/openai_embedding_long_text_client.py +++ b/examples/online_serving/openai_embedding_long_text_client.py @@ -155,6 +155,118 @@ def test_batch_embedding(): print(f"❌ Batch processing failed: {str(e)}") +def test_multiple_long_texts_batch(): + """Test batch processing with multiple long texts to verify chunk ID uniqueness.""" + client = OpenAI(api_key=API_KEY, base_url=BASE_URL) + + print("\n🔧 Testing Multiple Long Texts in Batch (Chunk ID Fix Verification)") + print("=" * 70) + + # Create multiple distinct long texts that will all require chunking + long_texts = [ + generate_long_text( + "First long document about artificial intelligence and machine learning. " + * 80, + 6, + ), + generate_long_text( + "Second long document about natural language processing and transformers. " + * 80, + 6, + ), + generate_long_text( + "Third long document about computer vision and neural networks. " * 80, 6 + ), + ] + + # Add some short texts to mix things up + batch_inputs = [ + "Short text before long texts", + long_texts[0], + "Short text between long texts", + long_texts[1], + long_texts[2], + "Short text after long texts", + ] + + print("📊 Batch composition:") + for i, text in enumerate(batch_inputs): + length = len(text) + text_type = "Long (will be chunked)" if length > 5000 else "Short" + print(f" - Input {i + 1}: {length} chars ({text_type})") + + try: + start_time = time.time() + + response = client.embeddings.create( + input=batch_inputs, model=MODEL_NAME, encoding_format="float" + ) + + end_time = time.time() + processing_time = end_time - start_time + + print("\n✅ Multiple long texts batch processing successful!") + print(f" - Number of inputs: {len(batch_inputs)}") + print(f" - Number of embeddings returned: {len(response.data)}") + print(f" - Total processing time: {processing_time:.2f}s") + + # Verify each embedding is different (no incorrect aggregation) + embeddings = [data.embedding for data in response.data] + + if len(embeddings) >= 3: + import numpy as np + + # Compare embeddings of the long texts (indices 1, 3, 4) + long_embeddings = [ + np.array(embeddings[1]), # First long text + np.array(embeddings[3]), # Second long text + np.array(embeddings[4]), # Third long text + ] + + print("\n🔍 Verifying embedding uniqueness:") + for i in range(len(long_embeddings)): + for j in range(i + 1, len(long_embeddings)): + cosine_sim = np.dot(long_embeddings[i], long_embeddings[j]) / ( + np.linalg.norm(long_embeddings[i]) + * np.linalg.norm(long_embeddings[j]) + ) + print( + f" - Similarity between long text {i + 1} and {j + 1}: " + f"{cosine_sim:.4f}" + ) + + if ( + cosine_sim < 0.9 + ): # Different content should have lower similarity + print(" ✅ Good: Embeddings are appropriately different") + else: + print( + " ⚠️ High similarity - may indicate chunk " + "aggregation issue" + ) + + print("\n📋 Per-input results:") + for i, data in enumerate(response.data): + input_length = len(batch_inputs[i]) + embedding_dim = len(data.embedding) + embedding_norm = np.linalg.norm(data.embedding) + print( + f" - Input {i + 1}: {input_length} chars → {embedding_dim}D " + f"embedding (norm: {embedding_norm:.4f})" + ) + + 
print( + "\n✅ This test verifies the fix for chunk ID collisions in " + "batch processing" + ) + print(" - Before fix: Multiple long texts would have conflicting chunk IDs") + print(" - After fix: Each prompt's chunks have unique IDs with prompt index") + + except Exception as e: + print(f"❌ Multiple long texts batch test failed: {str(e)}") + print(" This might indicate the chunk ID collision bug is present!") + + def test_embedding_consistency(): """Test that chunked processing produces consistent results.""" client = OpenAI(api_key=API_KEY, base_url=BASE_URL) @@ -214,6 +326,7 @@ def main(): # Run all test cases test_embedding_with_different_lengths() test_batch_embedding() + test_multiple_long_texts_batch() test_embedding_consistency() print("\n" + "=" * 70) @@ -221,6 +334,7 @@ def main(): print("\n💡 Key Features Demonstrated:") print(" - ✅ Automatic chunked processing for long text") print(" - ✅ Seamless handling of mixed-length batches") + print(" - ✅ Multiple long texts in single batch (chunk ID fix)") print(" - ✅ Consistent embedding generation") print(" - ✅ Backward compatibility with short text") print("\n📚 For more information, see:") diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index aee8a29792c..7ac9b525f77 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -200,6 +200,7 @@ async def _process_chunked_request( original_prompt: TextTokensPrompt, pooling_params, trace_headers, + prompt_idx: int, ) -> list[AsyncGenerator[PoolingRequestOutput, None]]: """Process a single prompt using chunked processing.""" generators: list[AsyncGenerator[PoolingRequestOutput, None]] = [] @@ -215,7 +216,8 @@ async def _process_chunked_request( for chunk_idx, chunk_tokens in enumerate(chunks): # Create a request ID for this chunk - chunk_request_id = f"{ctx.request_id}-chunk-{chunk_idx}" + chunk_request_id = (f"{ctx.request_id}-prompt-{prompt_idx}-" + f"chunk-{chunk_idx}") # Create engine prompt for this chunk chunk_engine_prompt = EngineTokensPrompt( @@ -418,7 +420,7 @@ async def _prepare_generators( # Use chunked processing for this prompt chunk_generators = await self._process_chunked_request( ctx, text_tokens_prompt, pooling_params, - trace_headers) + trace_headers, i) generators.extend(chunk_generators) continue @@ -504,7 +506,9 @@ async def _collect_batch( # This prompt was chunked, collect all # its chunk results chunk_results: list[PoolingRequestOutput] = [] - chunk_prefix = f"{ctx.request_id}-chunk-" + chunk_prefix = ( + f"{ctx.request_id}-prompt-{prompt_idx}-" + f"chunk-") for result_idx, result in all_results: if result.request_id.startswith(chunk_prefix): From ff7253a3edd5adc197864905d764bfac02b65258 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sat, 12 Jul 2025 04:07:45 +0800 Subject: [PATCH 005/552] Added multiple long-text batch processing tests to verify the uniqueness of block IDs and fix the block ID conflicts in batch processing. Updated relevant examples to demonstrate the new features. 
Signed-off-by: x22x22 --- examples/online_serving/openai_embedding_long_text_client.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/online_serving/openai_embedding_long_text_client.py b/examples/online_serving/openai_embedding_long_text_client.py index 1297a1f0d6c..b500a4707a9 100644 --- a/examples/online_serving/openai_embedding_long_text_client.py +++ b/examples/online_serving/openai_embedding_long_text_client.py @@ -27,6 +27,7 @@ import time +import numpy as np from openai import OpenAI # Configuration @@ -292,7 +293,6 @@ def test_embedding_consistency(): # Check consistency (embeddings should be identical) if len(embeddings) >= 2: # Calculate similarity between first two embeddings - import numpy as np emb1 = np.array(embeddings[0]) emb2 = np.array(embeddings[1]) From dc5b358499426811dba38543d9d09a2e5aabf07f Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sat, 12 Jul 2025 04:10:15 +0800 Subject: [PATCH 006/552] Rectify the numbering errors in the document by changing the number of the "Slow Processing" section from 1 to 3 to ensure the accuracy and consistency of the list. Signed-off-by: x22x22 --- examples/online_serving/openai_embedding_long_text.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/examples/online_serving/openai_embedding_long_text.md b/examples/online_serving/openai_embedding_long_text.md index a974eab8c13..029e12b17e2 100644 --- a/examples/online_serving/openai_embedding_long_text.md +++ b/examples/online_serving/openai_embedding_long_text.md @@ -99,13 +99,13 @@ The test client demonstrates: 2. **Memory errors**: -``` + ``` RuntimeError: CUDA out of memory ``` -**Solution**: Reduce `MAX_MODEL_LEN` or use fewer GPUs + **Solution**: Reduce `MAX_MODEL_LEN` or use fewer GPUs -1. **Slow processing**: +3. **Slow processing**: **Expected**: Long text takes more time due to multiple inference calls ### Debug Information From e657331b483984e9664a681077637e8c2fe9fe78 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sat, 12 Jul 2025 04:12:15 +0800 Subject: [PATCH 007/552] Update the long - text service script. Add a new variable named MODEL_CODE to enhance the flexibility of the model name, and use this variable to replace the hard - coded model name in the output information. Ensure that the configuration during service startup is more consistent and maintainable. Signed-off-by: x22x22 --- .../online_serving/openai_embedding_long_text_service.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/examples/online_serving/openai_embedding_long_text_service.sh b/examples/online_serving/openai_embedding_long_text_service.sh index 3012049002e..d85bc16be19 100644 --- a/examples/online_serving/openai_embedding_long_text_service.sh +++ b/examples/online_serving/openai_embedding_long_text_service.sh @@ -10,6 +10,7 @@ set -euo pipefail # Configuration MODEL_NAME=${MODEL_NAME:-"intfloat/multilingual-e5-large"} +MODEL_CODE=${MODEL_CODE:-"multilingual-e5-large"} PORT=${PORT:-31090} GPU_COUNT=${GPU_COUNT:-1} MAX_MODEL_LEN=${MAX_MODEL_LEN:-10240} @@ -54,7 +55,7 @@ vllm serve "$MODEL_NAME" \ --enforce-eager \ --max-model-len "$MAX_MODEL_LEN" \ --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}' \ - --served-model-name multilingual-e5-large \ + --served-model-name ${MODEL_CODE} \ --task embed \ --use-v2-block-manager \ --api-key "$API_KEY" \ @@ -67,7 +68,7 @@ echo "✅ vLLM Embedding Server started successfully!" 
echo "" echo "📡 Server Information:" echo " - Base URL: http://localhost:$PORT" -echo " - Model Name: multilingual-e5-large" +echo " - Model Code: ${MODEL_CODE}" echo " - API Key: $API_KEY" echo "" echo "🧪 Test the server with:" From 9008b2ed624398d5931ab3219304e0534e600f04 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sat, 12 Jul 2025 04:19:32 +0800 Subject: [PATCH 008/552] Multiple long - text batch processing tests have been newly added to verify the uniqueness of block IDs and resolve the block ID conflict issues in batch processing. Meanwhile, relevant documents and examples have been updated to ensure the accuracy and consistency of long - text processing. Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 121 +++++++++---------- 1 file changed, 58 insertions(+), 63 deletions(-) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 7ac9b525f77..e40ca3c8a88 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -482,84 +482,79 @@ async def _collect_batch( # Check if we used chunked processing use_chunked = self._should_use_chunked_processing(ctx.request) - # Collect all results first - all_results = [] - async for i, res in ctx.result_generator: - all_results.append((i, res)) - - # Group results by original prompt if use_chunked: - # For chunked processing, we need to group chunk results by - # original prompt - chunked_final_res_batch: list[PoolingRequestOutput] = [] + # Efficient single-pass processing for chunked requests + from collections import defaultdict + + # Group results by original prompt index + grouped_results = defaultdict(list) + short_prompts_results = {} + + async for result_idx, result in ctx.result_generator: + if "-chunk-" in result.request_id: + # Extract prompt_idx from chunked request_id + # e.g., from "req-id-prompt-2-chunk-0" -> 2 + parts = result.request_id.split("-") + try: + prompt_idx = int(parts[parts.index("prompt") + 1]) + grouped_results[prompt_idx].append( + cast(PoolingRequestOutput, result)) + except (ValueError, IndexError): + return self.create_error_response( + f"Invalid chunk request ID format: " + f"{result.request_id}") + else: + # Extract prompt_idx from non-chunked request_id + # e.g., from "req-id-2" -> 2 + try: + prompt_idx = int(result.request_id.split("-")[-1]) + short_prompts_results[prompt_idx] = cast( + PoolingRequestOutput, result) + except ValueError: + return self.create_error_response( + f"Invalid request ID format: " + f"{result.request_id}") + + # Build final result batch in prompt order + final_res_batch = [] - max_pos_embeddings = self._get_max_position_embeddings() for prompt_idx, request_prompt in enumerate( ctx.request_prompts): - if self._is_text_tokens_prompt(request_prompt): - # Cast to TextTokensPrompt - # since we've verified prompt_token_ids - text_tokens_prompt = cast(TextTokensPrompt, - request_prompt) - if len(text_tokens_prompt["prompt_token_ids"] - ) > max_pos_embeddings: - # This prompt was chunked, collect all - # its chunk results - chunk_results: list[PoolingRequestOutput] = [] - chunk_prefix = ( - f"{ctx.request_id}-prompt-{prompt_idx}-" - f"chunk-") - - for result_idx, result in all_results: - if result.request_id.startswith(chunk_prefix): - # Cast to PoolingRequestOutput since - # we know chunked results are always pooling - chunk_results.append( - cast(PoolingRequestOutput, result)) - - if chunk_results: - # Aggregate chunk results - original_token_count = len( + if 
prompt_idx in grouped_results: + # This was a chunked prompt - aggregate results + chunk_results = grouped_results[prompt_idx] + if self._is_text_tokens_prompt(request_prompt): + text_tokens_prompt = cast(TextTokensPrompt, + request_prompt) + original_token_count = len( + text_tokens_prompt["prompt_token_ids"]) + aggregated_result = await \ + self._aggregate_chunked_results( + ctx, chunk_results, original_token_count, text_tokens_prompt["prompt_token_ids"]) - aggregated_result = await \ - self._aggregate_chunked_results( - ctx, chunk_results, - original_token_count, - text_tokens_prompt["prompt_token_ids"]) - chunked_final_res_batch.append( - aggregated_result) - else: - return self.create_error_response( - f"No chunk results found for prompt " - f"{prompt_idx}") - continue - - # Normal prompt (short or embeds), find its result - expected_id = f"{ctx.request_id}-{prompt_idx}" - found = False - for result_idx, result in all_results: - if result.request_id == expected_id: - # Cast to PoolingRequestOutput for embedding results - chunked_final_res_batch.append( - cast(PoolingRequestOutput, result)) - found = True - break - - if not found: + final_res_batch.append(aggregated_result) + else: + return self.create_error_response( + f"Chunked prompt {prompt_idx} is not a " + f"text tokens prompt") + elif prompt_idx in short_prompts_results: + # This was a short prompt + final_res_batch.append( + short_prompts_results[prompt_idx]) + else: return self.create_error_response( f"Result not found for prompt {prompt_idx}") - # Update the final result batch with proper type ctx.final_res_batch = cast( list[Union[RequestOutput, PoolingRequestOutput]], - chunked_final_res_batch) + final_res_batch) else: - # Normal processing - original logic + # Normal processing for non-chunked requests num_prompts = len(ctx.engine_prompts) normal_final_res_batch: list[ Optional[PoolingRequestOutput]] = [None] * num_prompts - for result_idx, result in all_results: + async for result_idx, result in ctx.result_generator: if result_idx < num_prompts: # Cast to PoolingRequestOutput for embedding results normal_final_res_batch[result_idx] = cast( From 1a8c7c892986bb80fc6b4467ab5a7a7adac271b1 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sun, 13 Jul 2025 23:34:49 +0800 Subject: [PATCH 009/552] Update the documentation and examples to support the new `max_embed_len` parameter, enabling long - text input without the need to set the environment variable `VLLM_ALLOW_LONG_MAX_MODEL_LEN`. Modify the relevant configurations and processing logic to ensure clear error messages are provided when the input exceeds the maximum embedding length, while maintaining backward compatibility. Enhance the description of input validation and processing performance. Signed-off-by: x22x22 --- docs/models/pooling_models.md | 31 ++++++---- .../openai_embedding_long_text.md | 56 ++++++++++++++----- .../openai_embedding_long_text_service.sh | 12 ++-- vllm/config.py | 10 ++++ vllm/entrypoints/openai/serving_embedding.py | 28 ++++++++++ 5 files changed, 106 insertions(+), 31 deletions(-) diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md index 73f37f96cec..e20ebe406cf 100644 --- a/docs/models/pooling_models.md +++ b/docs/models/pooling_models.md @@ -43,24 +43,31 @@ vLLM supports **chunked processing** for embedding models to handle text inputs ### How Chunked Processing Works -1. **Automatic Detection**: When input text exceeds `max_model_len`, chunked processing is triggered -2. 
**Smart Chunking**: Text is split at token boundaries to maintain semantic integrity +1. **Flexible Input Validation**: Configure `max_embed_len` to accept inputs longer than `max_model_len` without environment variables +2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity 3. **Parallel Processing**: Each chunk is processed independently through the model 4. **Intelligent Aggregation**: Results are combined using weighted averaging based on chunk token counts 5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing ### Configuration -Enable chunked processing by setting `enable_chunked_processing: true` in the pooler configuration: +Enable chunked processing and configure maximum embedding input length: ```bash vllm serve intfloat/multilingual-e5-large \ --task embed \ - --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}' \ - --max-model-len 10240 \ + --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 10240}' \ --trust-remote-code ``` +#### Configuration Parameters + +- `enable_chunked_processing`: Enable chunked processing for long inputs (default: `false`) +- `max_embed_len`: Maximum input length allowed for embedding generation (default: `null`) + - When set, allows inputs longer than `max_model_len` without requiring `VLLM_ALLOW_LONG_MAX_MODEL_LEN` + - Inputs exceeding `max_embed_len` are rejected with clear error messages + - Chunking is triggered when inputs exceed `max_position_embeddings` + ### Aggregation Algorithm The chunked processing uses a FastChat-inspired weighted averaging algorithm: @@ -75,12 +82,13 @@ This ensures that longer chunks contribute proportionally more to the final repr ### Performance Characteristics -| Aspect | Short Text (≤ max_len) | Long Text (> max_len) | -|--------|------------------------|----------------------| +| Aspect | Short Text (≤ max_position_embeddings) | Long Text (> max_position_embeddings) | +|--------|----------------------------------------|---------------------------------------| | **Processing Time** | Standard | Increased (multiple inference calls) | | **Memory Usage** | Standard | Reduced (chunks processed separately) | | **Quality** | Standard | Maintains semantic representation | | **Compatibility** | Full | Full (backward compatible) | +| **Input Validation** | Standard max_model_len check | Extended max_embed_len check | ### Example Usage @@ -92,9 +100,10 @@ client = OpenAI( base_url="http://localhost:31090/v1" ) -# This will automatically use chunked processing if text is too long +# This will automatically use chunked processing for very long text +# max_embed_len=10240 allows inputs up to 10k tokens response = client.embeddings.create( - input="Very long text that exceeds the model's maximum context length..." * 1000, + input="Very long text that exceeds the model's position embeddings..." 
* 500, model="multilingual-e5-large" ) @@ -106,8 +115,8 @@ print(f"Embedding dimension: {len(response.data[0].embedding)}") When chunked processing is active, you'll see informative log messages: ``` -INFO: Input length 15000 exceeds max_model_len 10240, will use chunked processing -INFO: Split input of 15000 tokens into 2 chunks +INFO: Input length 10000 exceeds max_position_embeddings 512, will use chunked processing +INFO: Split input of 10000 tokens into 20 chunks (max_chunk_size: 512) ``` ### Limitations diff --git a/examples/online_serving/openai_embedding_long_text.md b/examples/online_serving/openai_embedding_long_text.md index 029e12b17e2..211e9854d95 100644 --- a/examples/online_serving/openai_embedding_long_text.md +++ b/examples/online_serving/openai_embedding_long_text.md @@ -15,7 +15,7 @@ Use the provided script to start a vLLM server with chunked processing enabled: # Custom configuration MODEL_NAME="intfloat/multilingual-e5-large" \ PORT=31090 \ -MAX_MODEL_LEN=10240 \ +MAX_EMBED_LEN=10240 \ ./openai_embedding_long_text_service.sh ``` @@ -39,13 +39,14 @@ python openai_embedding_long_text_client.py ### Server Configuration -The key parameter for chunked processing is in the `--override-pooler-config`: +The key parameters for chunked processing are in the `--override-pooler-config`: ```json { "pooling_type": "CLS", "normalize": true, - "enable_chunked_processing": true + "enable_chunked_processing": true, + "max_embed_len": 10240 } ``` @@ -56,23 +57,31 @@ The key parameter for chunked processing is in the `--override-pooler-config`: | `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use | | `PORT` | `31090` | Server port | | `GPU_COUNT` | `1` | Number of GPUs to use | -| `MAX_MODEL_LEN` | `10240` | Maximum model context length | +| `MAX_EMBED_LEN` | `10240` | Maximum embedding input length (allows longer inputs without VLLM_ALLOW_LONG_MAX_MODEL_LEN) | | `API_KEY` | `EMPTY` | API key for authentication | ## 🔧 How It Works -1. **Automatic Detection**: When input text exceeds `max_model_len`, chunked processing is triggered -2. **Smart Chunking**: Text is split at token boundaries to maintain semantic integrity +1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without environment variables +2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity 3. **Independent Processing**: Each chunk is processed separately through the model 4. **Weighted Aggregation**: Results are combined using token count-based weighted averaging 5. 
**Consistent Output**: Final embeddings maintain the same dimensionality as standard processing +### Input Length Handling + +- **Within max_embed_len**: Input is accepted and processed +- **Exceeds max_position_embeddings**: Chunked processing is automatically triggered +- **Exceeds max_embed_len**: Input is rejected with clear error message +- **No environment variables required**: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN` + ## 📊 Performance Characteristics | Text Length | Processing Method | Memory Usage | Speed | |-------------|------------------|--------------|-------| -| ≤ max_len | Standard | Normal | Fast | -| > max_len | Chunked | Reduced per chunk | Slower (multiple inferences) | +| ≤ max_position_embeddings | Standard | Normal | Fast | +| > max_position_embeddings, ≤ max_embed_len | Chunked | Reduced per chunk | Slower (multiple inferences) | +| > max_embed_len | Rejected | N/A | Error response | ## 🧪 Test Cases @@ -92,20 +101,28 @@ The test client demonstrates: 1. **Chunked processing not enabled**: ``` - ValueError: This model's maximum context length is 512 tokens... + ValueError: This model's maximum position embeddings length is 4096 tokens... ``` **Solution**: Ensure `enable_chunked_processing: true` in pooler config -2. **Memory errors**: +2. **Input exceeds max_embed_len**: + + ``` + ValueError: This model's maximum embedding input length is 10240 tokens... + ``` + + **Solution**: Increase `max_embed_len` in pooler config or reduce input length + +3. **Memory errors**: ``` RuntimeError: CUDA out of memory ``` - **Solution**: Reduce `MAX_MODEL_LEN` or use fewer GPUs + **Solution**: Reduce chunk size by adjusting model's `max_position_embeddings` or use fewer GPUs -3. **Slow processing**: +4. **Slow processing**: **Expected**: Long text takes more time due to multiple inference calls ### Debug Information @@ -113,8 +130,8 @@ The test client demonstrates: Server logs show chunked processing activity: ``` -INFO: Input length 15000 exceeds max_model_len 10240, will use chunked processing -INFO: Split input of 15000 tokens into 2 chunks +INFO: Input length 15000 exceeds max_position_embeddings 4096, will use chunked processing +INFO: Split input of 15000 tokens into 4 chunks (max_chunk_size: 4096) ``` ## 📚 Additional Resources @@ -132,6 +149,17 @@ To extend chunked processing support to other embedding models: 3. Validate embedding quality compared to single-chunk processing 4. Submit PR with test cases and documentation updates +## 🆕 Enhanced Features + +### max_embed_len Parameter + +The new `max_embed_len` parameter provides: + +- **Simplified Configuration**: No need for `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable +- **Flexible Input Validation**: Accept inputs longer than `max_model_len` up to `max_embed_len` +- **Clear Error Messages**: Better feedback when inputs exceed limits +- **Backward Compatibility**: Existing configurations continue to work + --- **Note**: Chunked processing is currently supported for specific embedding models. See the [supported models documentation](../../docs/models/supported_models.md#chunked-processing-for-long-text) for the complete list. 
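For reference, the token-count weighted aggregation that the documentation above describes for MEAN pooling can be reproduced on the client side. The sketch below is illustrative only and is not the server implementation: it assumes you already have the per-chunk embeddings and their token counts as NumPy arrays, and it applies L2 normalization explicitly because the example server configurations run with `"normalize": true`.

```python
# Illustrative client-side sketch of weighted chunk aggregation (MEAN pooling).
# Assumes per-chunk embeddings and token counts are already available.
import numpy as np

def aggregate_chunks(chunk_embeddings, chunk_token_counts):
    """Token-count weighted average of per-chunk embeddings, then L2-normalize."""
    embs = np.asarray(chunk_embeddings, dtype=np.float32)       # (num_chunks, dim)
    weights = np.asarray(chunk_token_counts, dtype=np.float32)  # (num_chunks,)
    weighted_sum = (embs * weights[:, None]).sum(axis=0)        # sum_i emb_i * n_i
    aggregated = weighted_sum / weights.sum()                   # divide by total tokens
    return aggregated / np.linalg.norm(aggregated)              # match "normalize": true

# Example: a 4096-token chunk dominates a 512-token remainder chunk
emb = aggregate_chunks([np.ones(1024), np.full(1024, 3.0)], [4096, 512])
print(emb.shape)  # (1024,)
```

Longer chunks therefore contribute proportionally more to the final embedding, which is the behavior the server-side chunked MEAN-pooling path is designed to provide.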
diff --git a/examples/online_serving/openai_embedding_long_text_service.sh b/examples/online_serving/openai_embedding_long_text_service.sh index d85bc16be19..613d94790ff 100644 --- a/examples/online_serving/openai_embedding_long_text_service.sh +++ b/examples/online_serving/openai_embedding_long_text_service.sh @@ -5,6 +5,7 @@ # vLLM Embedding Server with Chunked Processing # This script starts a vLLM server with chunked processing enabled for long text embedding. +# Uses max_embed_len to allow long inputs without VLLM_ALLOW_LONG_MAX_MODEL_LEN. set -euo pipefail @@ -13,7 +14,7 @@ MODEL_NAME=${MODEL_NAME:-"intfloat/multilingual-e5-large"} MODEL_CODE=${MODEL_CODE:-"multilingual-e5-large"} PORT=${PORT:-31090} GPU_COUNT=${GPU_COUNT:-1} -MAX_MODEL_LEN=${MAX_MODEL_LEN:-10240} +MAX_EMBED_LEN=${MAX_EMBED_LEN:-10240} API_KEY=${API_KEY:-"your-api-key"} echo "🚀 Starting vLLM Embedding Server with Chunked Processing" @@ -21,15 +22,14 @@ echo "================================================================" # Environment variables for optimization export VLLM_WORKER_MULTIPROC_METHOD=spawn -export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 # Display configuration echo "📋 Configuration:" echo " - Model: $MODEL_NAME" echo " - Port: $PORT" echo " - GPU Count: $GPU_COUNT" -echo " - Max Model Length: $MAX_MODEL_LEN tokens" echo " - Chunked Processing: ENABLED" +echo " - Max Embed Length: ${MAX_EMBED_LEN} tokens" echo " - Pooling Type: CLS + Normalization" echo "" @@ -53,8 +53,7 @@ echo "🔧 Starting server with chunked processing configuration..." vllm serve "$MODEL_NAME" \ --tensor-parallel-size "$GPU_COUNT" \ --enforce-eager \ - --max-model-len "$MAX_MODEL_LEN" \ - --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}' \ + --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true, "max_embed_len": '${MAX_EMBED_LEN}'}' \ --served-model-name ${MODEL_CODE} \ --task embed \ --use-v2-block-manager \ @@ -76,6 +75,7 @@ echo " python examples/online_serving/openai_embedding_long_text_client.py" echo "" echo "📚 Features enabled:" echo " ✅ Long text chunked processing" +echo " ✅ Enhanced max embedding length (${MAX_EMBED_LEN} tokens)" echo " ✅ Automatic chunk aggregation" echo " ✅ OpenAI-compatible API" -echo " ✅ GPU acceleration" \ No newline at end of file +echo " ✅ GPU acceleration" diff --git a/vllm/config.py b/vllm/config.py index 5bb24774e82..7f891e709af 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -3249,6 +3249,16 @@ class PoolerConfig: errors. Defaults to False. """ + max_embed_len: Optional[int] = None + """ + Maximum input length allowed for embedding generation. When set, allows + inputs longer than max_model_len to be accepted for embedding models. + This parameter enables accepting long inputs without requiring + VLLM_ALLOW_LONG_MAX_MODEL_LEN environment variable. When an input exceeds + max_embed_len, it will be handled according to the original max_model_len + validation logic. Defaults to None (use max_model_len validation). 
+ """ + def compute_hash(self) -> str: """ WARNING: Whenever a new field is added to this config, diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index e40ca3c8a88..b014020c8d6 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -345,9 +345,37 @@ def _validate_input( enable_chunked = (pooler_config is not None and getattr( pooler_config, 'enable_chunked_processing', False)) + # Get max_embed_len from pooler config if set + max_embed_len = (pooler_config.max_embed_len if pooler_config + and pooler_config.max_embed_len else None) + # Use max_position_embeddings for chunked processing decisions max_pos_embeddings = self._get_max_position_embeddings() + # Determine the effective max length for validation + if max_embed_len is not None: + # Use max_embed_len for validation instead of max_model_len + effective_max_len = max_embed_len + validation_error_msg = ( + f"This model's maximum embedding input length is " + f"{max_embed_len} tokens. However, you requested " + f"{token_num} tokens in the input for embedding " + f"generation. Please reduce the length of the input.") + else: + # Fall back to max_model_len validation (original behavior) + effective_max_len = self.max_model_len + validation_error_msg = ( + f"This model's maximum context length is " + f"{self.max_model_len} tokens. However, you requested " + f"{token_num} tokens in the input for embedding " + f"generation. Please reduce the length of the input.") + + # Check if input exceeds effective max length + if token_num > effective_max_len: + raise ValueError(validation_error_msg) + + # Check for chunked processing + # when exceeding max_position_embeddings if token_num > max_pos_embeddings: if enable_chunked: # Allow long inputs when chunked processing is enabled From 382b96225585df42e6bbd884c9000991f9de5024 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Sun, 13 Jul 2025 23:47:34 +0800 Subject: [PATCH 010/552] Update the example code to support the new `max_embed_len` parameter, ensuring the correctness of the configuration when dealing with long - text inputs. Adjust the format of the relevant configuration strings to better handle the embedding length limit. Signed-off-by: x22x22 --- examples/online_serving/openai_embedding_long_text_client.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/online_serving/openai_embedding_long_text_client.py b/examples/online_serving/openai_embedding_long_text_client.py index b500a4707a9..fb645ed975e 100644 --- a/examples/online_serving/openai_embedding_long_text_client.py +++ b/examples/online_serving/openai_embedding_long_text_client.py @@ -14,8 +14,8 @@ vllm serve intfloat/multilingual-e5-large \ --task embed \ --override-pooler-config \ - '{"pooling_type": "CLS", "normalize": true, \"enable_chunked_processing": true}' \ - --max-model-len 10240 \ + '{"pooling_type": "CLS", "normalize": true, ' \ + '"enable_chunked_processing": true, "max_embed_len": 10240}' \ --served-model-name multilingual-e5-large \ --trust-remote-code \ --port 31090 \ From 783a0517204aac13ca1d49bb795df7a9bf60495b Mon Sep 17 00:00:00 2001 From: x22x22 Date: Mon, 14 Jul 2025 23:58:57 +0800 Subject: [PATCH 011/552] The documentation and examples have been updated to support the enhanced chunk processing functionality. The logic for automatic detection and verification of pooling types has been optimized to ensure warnings are provided when non - MEAN pooling types are used. 
The relevant configurations and processing logic have been updated to improve user experience and compatibility. Signed-off-by: x22x22 --- docs/models/pooling_models.md | 52 ++++++- .../openai_embedding_long_text.md | 38 +++-- .../openai_embedding_long_text_service.sh | 92 +++++++++++-- vllm/config.py | 10 ++ vllm/entrypoints/openai/serving_embedding.py | 130 +++++++++++++++++- 5 files changed, 282 insertions(+), 40 deletions(-) diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md index e20ebe406cf..e4e1436c545 100644 --- a/docs/models/pooling_models.md +++ b/docs/models/pooling_models.md @@ -38,8 +38,14 @@ vLLM supports **chunked processing** for embedding models to handle text inputs ### Supported Models -- `intfloat/multilingual-e5-large` -- Other embedding models can be extended to support this feature +Chunked processing is supported for the following embedding models: + +- `intfloat/multilingual-e5-large` (Recommended pool type: `MEAN`) +- `jinaai/jina-embeddings-v3` (Recommended pool type: `MEAN`) +- `jinaai/jina-embeddings-v4-vllm-retrieval` (Recommended pool type: `MEAN`) +- `Qwen/Qwen3-Embedding-4B` (Recommended pool type: `MEAN`) + +Other embedding models can be extended to support this feature by ensuring proper pooling type compatibility. ### How Chunked Processing Works @@ -56,7 +62,7 @@ Enable chunked processing and configure maximum embedding input length: ```bash vllm serve intfloat/multilingual-e5-large \ --task embed \ - --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 10240}' \ + --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 3072000}' \ --trust-remote-code ``` @@ -90,8 +96,18 @@ This ensures that longer chunks contribute proportionally more to the final repr | **Compatibility** | Full | Full (backward compatible) | | **Input Validation** | Standard max_model_len check | Extended max_embed_len check | +#### Extreme Long Text Support + +With the enhanced `max_embed_len` configuration (up to 3M+ tokens), you can process: +- **Complete Documents**: Research papers, legal contracts, technical manuals +- **Large Codebases**: Entire repositories and documentation +- **Books and Literature**: Full chapters or small books +- **Multi-document Analysis**: Combined content for comprehensive understanding + ### Example Usage +#### Basic Configuration + ```python from openai import OpenAI @@ -101,22 +117,44 @@ client = OpenAI( ) # This will automatically use chunked processing for very long text -# max_embed_len=10240 allows inputs up to 10k tokens +# max_embed_len=3072000 allows inputs up to 3M+ tokens response = client.embeddings.create( - input="Very long text that exceeds the model's position embeddings..." * 500, + input="Very long text that exceeds the model's position embeddings..." 
* 5000, model="multilingual-e5-large" ) print(f"Embedding dimension: {len(response.data[0].embedding)}") ``` +#### Alternative Model Configurations + +```bash +# For Jina embeddings v3 (optimized for performance) +vllm serve jinaai/jina-embeddings-v3 \ + --task embed \ + --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1048576}' \ + --trust-remote-code + +# For Jina embeddings v4 (latest retrieval model) +vllm serve jinaai/jina-embeddings-v4-vllm-retrieval \ + --task embed \ + --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 2097152}' \ + --trust-remote-code + +# For Qwen3 Embedding (large-scale multilingual) +vllm serve Qwen/Qwen3-Embedding-4B \ + --task embed \ + --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1572864}' \ + --trust-remote-code +``` + ### Logging and Monitoring When chunked processing is active, you'll see informative log messages: ``` -INFO: Input length 10000 exceeds max_position_embeddings 512, will use chunked processing -INFO: Split input of 10000 tokens into 20 chunks (max_chunk_size: 512) +INFO: Input length 100000 exceeds max_position_embeddings 512, will use chunked processing +INFO: Split input of 100000 tokens into 196 chunks (max_chunk_size: 512) ``` ### Limitations diff --git a/examples/online_serving/openai_embedding_long_text.md b/examples/online_serving/openai_embedding_long_text.md index 211e9854d95..c1c044d916b 100644 --- a/examples/online_serving/openai_embedding_long_text.md +++ b/examples/online_serving/openai_embedding_long_text.md @@ -9,13 +9,17 @@ This directory contains examples for using vLLM's **chunked processing** feature Use the provided script to start a vLLM server with chunked processing enabled: ```bash -# Basic usage +# Basic usage (supports very long texts up to ~3M tokens) ./openai_embedding_long_text_service.sh -# Custom configuration +# Custom configuration with different models +MODEL_NAME="jinaai/jina-embeddings-v3" \ +MAX_EMBED_LEN=1048576 \ +./openai_embedding_long_text_service.sh + +# For extremely long documents MODEL_NAME="intfloat/multilingual-e5-large" \ -PORT=31090 \ -MAX_EMBED_LEN=10240 \ +MAX_EMBED_LEN=3072000 \ ./openai_embedding_long_text_service.sh ``` @@ -43,10 +47,10 @@ The key parameters for chunked processing are in the `--override-pooler-config`: ```json { - "pooling_type": "CLS", + "pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, - "max_embed_len": 10240 + "max_embed_len": 3072000 } ``` @@ -54,10 +58,10 @@ The key parameters for chunked processing are in the `--override-pooler-config`: | Variable | Default | Description | |----------|---------|-------------| -| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use | +| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) | | `PORT` | `31090` | Server port | | `GPU_COUNT` | `1` | Number of GPUs to use | -| `MAX_EMBED_LEN` | `10240` | Maximum embedding input length (allows longer inputs without VLLM_ALLOW_LONG_MAX_MODEL_LEN) | +| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) | | `API_KEY` | `EMPTY` | API key for authentication | ## 🔧 How It Works @@ -70,11 +74,19 @@ The key parameters for chunked processing are in the `--override-pooler-config`: ### Input Length Handling -- **Within max_embed_len**: Input is 
accepted and processed +- **Within max_embed_len**: Input is accepted and processed (up to 3M+ tokens) - **Exceeds max_position_embeddings**: Chunked processing is automatically triggered - **Exceeds max_embed_len**: Input is rejected with clear error message - **No environment variables required**: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN` +### Extreme Long Text Support + +With `MAX_EMBED_LEN=3072000`, you can process: +- **Academic papers**: Full research papers with references +- **Legal documents**: Complete contracts and legal texts +- **Books**: Entire chapters or small books +- **Code repositories**: Large codebases and documentation + ## 📊 Performance Characteristics | Text Length | Processing Method | Memory Usage | Speed | @@ -91,6 +103,7 @@ The test client demonstrates: - ✅ **Medium text**: Single chunk processing - ✅ **Long text**: Multi-chunk processing with aggregation - ✅ **Very long text**: Many chunks processing +- ✅ **Extreme long text**: Document-level processing (100K+ tokens) - ✅ **Batch processing**: Mixed-length inputs in one request - ✅ **Consistency**: Reproducible results across runs @@ -109,7 +122,7 @@ The test client demonstrates: 2. **Input exceeds max_embed_len**: ``` - ValueError: This model's maximum embedding input length is 10240 tokens... + ValueError: This model's maximum embedding input length is 3072000 tokens... ``` **Solution**: Increase `max_embed_len` in pooler config or reduce input length @@ -130,8 +143,8 @@ The test client demonstrates: Server logs show chunked processing activity: ``` -INFO: Input length 15000 exceeds max_position_embeddings 4096, will use chunked processing -INFO: Split input of 15000 tokens into 4 chunks (max_chunk_size: 4096) +INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing +INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096) ``` ## 📚 Additional Resources @@ -157,6 +170,7 @@ The new `max_embed_len` parameter provides: - **Simplified Configuration**: No need for `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable - **Flexible Input Validation**: Accept inputs longer than `max_model_len` up to `max_embed_len` +- **Extreme Length Support**: Process documents with millions of tokens - **Clear Error Messages**: Better feedback when inputs exceed limits - **Backward Compatibility**: Existing configurations continue to work diff --git a/examples/online_serving/openai_embedding_long_text_service.sh b/examples/online_serving/openai_embedding_long_text_service.sh index 613d94790ff..fa78385e782 100644 --- a/examples/online_serving/openai_embedding_long_text_service.sh +++ b/examples/online_serving/openai_embedding_long_text_service.sh @@ -3,34 +3,69 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -# vLLM Embedding Server with Chunked Processing +# vLLM Embedding Server with Enhanced Chunked Processing # This script starts a vLLM server with chunked processing enabled for long text embedding. -# Uses max_embed_len to allow long inputs without VLLM_ALLOW_LONG_MAX_MODEL_LEN. +# Now supports proper pooling type validation and model-specific configurations. 
set -euo pipefail # Configuration MODEL_NAME=${MODEL_NAME:-"intfloat/multilingual-e5-large"} MODEL_CODE=${MODEL_CODE:-"multilingual-e5-large"} + PORT=${PORT:-31090} GPU_COUNT=${GPU_COUNT:-1} -MAX_EMBED_LEN=${MAX_EMBED_LEN:-10240} +MAX_EMBED_LEN=${MAX_EMBED_LEN:-3072000} API_KEY=${API_KEY:-"your-api-key"} -echo "🚀 Starting vLLM Embedding Server with Chunked Processing" -echo "================================================================" +# Enhanced pooling configuration with model-specific defaults +POOLING_TYPE=${POOLING_TYPE:-"auto"} # auto, MEAN, CLS, LAST +ALLOW_NON_MEAN_CHUNKING=${ALLOW_NON_MEAN_CHUNKING:-"false"} +# export CUDA_VISIBLE_DEVICES=2,3,4,5 + +echo "🚀 Starting vLLM Embedding Server with Enhanced Chunked Processing" +echo "==================================================================" # Environment variables for optimization export VLLM_WORKER_MULTIPROC_METHOD=spawn +# Function to determine optimal pooling type for known models +get_optimal_pooling_type() { + local model="$1" + case "$model" in + *"e5-"* | *"multilingual-e5"*) + echo "MEAN" # E5 series uses mean pooling + ;; + *"bge-"*) + echo "CLS" # BGE series uses CLS pooling + ;; + *"gte-"*) + echo "MEAN" # GTE series uses mean pooling + ;; + *"sentence-t5"* | *"st5"*) + echo "MEAN" # Sentence-T5 uses mean pooling + ;; + *) + echo "MEAN" # Default to MEAN for unknown models + ;; + esac +} + +# Auto-detect pooling type if not explicitly set +if [ "$POOLING_TYPE" = "auto" ]; then + POOLING_TYPE=$(get_optimal_pooling_type "$MODEL_NAME") + echo "🔍 Auto-detected pooling type: $POOLING_TYPE for model $MODEL_NAME" +fi + # Display configuration echo "📋 Configuration:" echo " - Model: $MODEL_NAME" echo " - Port: $PORT" echo " - GPU Count: $GPU_COUNT" -echo " - Chunked Processing: ENABLED" +echo " - Enhanced Chunked Processing: ENABLED" echo " - Max Embed Length: ${MAX_EMBED_LEN} tokens" -echo " - Pooling Type: CLS + Normalization" +echo " - Pooling Type: $POOLING_TYPE + Normalization" +echo " - Allow Non-MEAN Chunking: $ALLOW_NON_MEAN_CHUNKING" echo "" # Validate GPU availability @@ -46,14 +81,35 @@ else echo "⚠️ Warning: nvidia-smi not found. GPU detection skipped." fi +# Warning for non-MEAN pooling types +if [ "$POOLING_TYPE" != "MEAN" ] && [ "$ALLOW_NON_MEAN_CHUNKING" != "true" ]; then + echo "" + echo "⚠️ IMPORTANT: Using $POOLING_TYPE pooling with chunked processing" + echo " This may produce different results than non-chunked processing." + echo " For BERT-type models with bidirectional attention, consider:" + echo " - Using MEAN pooling for mathematically equivalent results" + echo " - Setting ALLOW_NON_MEAN_CHUNKING=true to suppress this warning" + echo "" +fi + echo "" -echo "🔧 Starting server with chunked processing configuration..." +echo "🔧 Starting server with enhanced chunked processing configuration..." 
+ +# Build pooler config JSON +POOLER_CONFIG="{\"pooling_type\": \"$POOLING_TYPE\", \"normalize\": true, \"enable_chunked_processing\": true, \"max_embed_len\": ${MAX_EMBED_LEN}" + +# Add allow_non_mean_chunking if needed +if [ "$ALLOW_NON_MEAN_CHUNKING" = "true" ]; then + POOLER_CONFIG="${POOLER_CONFIG}, \"allow_non_mean_chunking\": true" +fi + +POOLER_CONFIG="${POOLER_CONFIG}}" -# Start vLLM server with chunked processing enabled +# Start vLLM server with enhanced chunked processing vllm serve "$MODEL_NAME" \ --tensor-parallel-size "$GPU_COUNT" \ --enforce-eager \ - --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true, "max_embed_len": '${MAX_EMBED_LEN}'}' \ + --override-pooler-config "$POOLER_CONFIG" \ --served-model-name ${MODEL_CODE} \ --task embed \ --use-v2-block-manager \ @@ -69,13 +125,21 @@ echo "📡 Server Information:" echo " - Base URL: http://localhost:$PORT" echo " - Model Code: ${MODEL_CODE}" echo " - API Key: $API_KEY" +echo " - Pooling Strategy: $POOLING_TYPE" echo "" echo "🧪 Test the server with:" echo " python examples/online_serving/openai_embedding_long_text_client.py" echo "" -echo "📚 Features enabled:" -echo " ✅ Long text chunked processing" +echo "📚 Enhanced features enabled:" +echo " ✅ Intelligent pooling type detection and validation" +echo " ✅ Long text chunked processing with proper aggregation" +echo " ✅ Model-specific pooling strategy optimization" echo " ✅ Enhanced max embedding length (${MAX_EMBED_LEN} tokens)" -echo " ✅ Automatic chunk aggregation" +echo " ✅ Automatic chunk aggregation (MEAN/CLS/LAST support)" echo " ✅ OpenAI-compatible API" -echo " ✅ GPU acceleration" +echo " ✅ GPU acceleration" +echo "" +echo "🔧 Advanced usage:" +echo " - Set POOLING_TYPE=MEAN|CLS|LAST to override auto-detection" +echo " - Set ALLOW_NON_MEAN_CHUNKING=true for non-MEAN pooling without warnings" +echo " - Set MAX_EMBED_LEN to adjust maximum input length" diff --git a/vllm/config.py b/vllm/config.py index 7f891e709af..344fe0142d2 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -3259,6 +3259,16 @@ class PoolerConfig: validation logic. Defaults to None (use max_model_len validation). """ + allow_non_mean_chunking: Optional[bool] = None + """ + Whether to allow chunked processing for non-MEAN pooling types without + warnings. By default (None or False), a warning will be shown when using + chunked processing with pooling types other than MEAN, as they may produce + different results than non-chunked processing. Set to True to explicitly + allow and suppress warnings for non-MEAN pooling types. Only applies when + enable_chunked_processing is True. 
+ """ + def compute_hash(self) -> str: """ WARNING: Whenever a new field is added to this config, diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index b014020c8d6..57b3e6698ed 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -152,7 +152,7 @@ def _get_max_position_embeddings(self) -> int: hf_config = self.model_config.hf_config # Start with max_position_embeddings from model config - derived_max_len = getattr(hf_config, 'max_position_embeddings', 2048) + derived_max_len = getattr(hf_config, 'max_position_embeddings', 512) # Get tokenizer config for pooling models (embedding models) if self.model_config.runner_type == "pooling": @@ -179,8 +179,38 @@ def _should_use_chunked_processing(self, request) -> bool: return False pooler_config = getattr(self.model_config, 'pooler_config', None) - return (pooler_config is not None - and getattr(pooler_config, 'enable_chunked_processing', False)) + if not (pooler_config is not None and getattr( + pooler_config, 'enable_chunked_processing', False)): + return False + + # Check pooling type compatibility for chunked processing + pooling_type = getattr(pooler_config, 'pooling_type', None) + if pooling_type: + pooling_type_upper = pooling_type.upper() + + # Warn about non-MEAN pooling types + if pooling_type_upper not in ['MEAN', 'AVG']: + # Check if user explicitly allowed non-mean chunking + allow_non_mean = getattr(pooler_config, + 'allow_non_mean_chunking', False) + if not allow_non_mean: + logger.warning( + "Chunked processing with pooling type '%s' " + "may produce different results than non-chunked " + "processing. Only MEAN pooling is mathematically " + "equivalent when using weighted averaging aggregation. " + "For other pooling types, different aggregation " + "strategies will be used that approximate the original " + "behavior. 
Set 'allow_non_mean_chunking: true' " + "in pooler config to suppress this warning.", + pooling_type) + # Still allow it but with warning + else: + logger.info( + "Using chunked processing with pooling type " + "'%s' (explicitly enabled)", pooling_type) + + return True def _chunk_token_ids(self, token_ids: list[int], chunk_size: int) -> list[list[int]]: @@ -211,8 +241,9 @@ async def _process_chunked_request( chunks = self._chunk_token_ids(token_ids, max_pos_embeddings) logger.info( - "Split input of %s tokens into %s chunks (max_chunk_size: %s)", - len(token_ids), len(chunks), max_pos_embeddings) + "Split input of %s tokens into %s chunks " + "(max_chunk_size: %s)", len(token_ids), len(chunks), + max_pos_embeddings) for chunk_idx, chunk_tokens in enumerate(chunks): # Create a request ID for this chunk @@ -256,11 +287,44 @@ async def _aggregate_chunked_results( original_token_count: int, original_prompt_token_ids: Optional[list[int]] = None, ) -> PoolingRequestOutput: - """Aggregate results from multiple chunks - using vLLM-compatible weighted averaging.""" + """Aggregate results from multiple chunks using + pooling-type-specific strategies.""" if len(chunk_results) == 1: return chunk_results[0] + # Get pooling type to determine aggregation strategy + pooler_config = getattr(self.model_config, 'pooler_config', None) + pooling_type = getattr(pooler_config, 'pooling_type', 'MEAN') + if pooling_type: + pooling_type = pooling_type.upper() + + # Route to appropriate aggregation method based on pooling type + if pooling_type in ['MEAN', 'AVG']: + return await self._aggregate_mean_pooling( + chunk_results, original_token_count, original_prompt_token_ids) + elif pooling_type == 'LAST': + return await self._aggregate_last_pooling( + chunk_results, original_prompt_token_ids) + elif pooling_type == 'CLS': + return await self._aggregate_cls_pooling( + chunk_results, original_prompt_token_ids) + else: + # For unsupported pooling types, + # fall back to mean aggregation with warning + logger.warning( + "Chunked aggregation for pooling type '%s' is not " + "specifically implemented. Falling back to weighted " + "averaging which may produce incorrect results.", pooling_type) + return await self._aggregate_mean_pooling( + chunk_results, original_token_count, original_prompt_token_ids) + + async def _aggregate_mean_pooling( + self, + chunk_results: list[PoolingRequestOutput], + original_token_count: int, + original_prompt_token_ids: Optional[list[int]] = None, + ) -> PoolingRequestOutput: + """Aggregate results using weighted averaging for MEAN pooling.""" # Extract embeddings and use vLLM's token counting approach chunk_embeddings = [] chunk_weights = [] @@ -328,6 +392,58 @@ async def _aggregate_chunked_results( return aggregated_result + async def _aggregate_last_pooling( + self, + chunk_results: list[PoolingRequestOutput], + original_prompt_token_ids: Optional[list[int]] = None, + ) -> PoolingRequestOutput: + """Aggregate results for LAST pooling by using the last chunk. + + For LAST pooling, we use the embedding from the last chunk since + it contains the final token's representation, which is what LAST + pooling extracts from the full sequence. 
+ """ + last_result = chunk_results[-1] + + # Preserve original prompt token ids for consistency + if original_prompt_token_ids is not None: + # Create a new result with updated prompt_token_ids + aggregated_result = PoolingRequestOutput( + request_id=last_result.request_id, + outputs=last_result.outputs, + prompt_token_ids=original_prompt_token_ids, + finished=True, + ) + return aggregated_result + + return last_result + + async def _aggregate_cls_pooling( + self, + chunk_results: list[PoolingRequestOutput], + original_prompt_token_ids: Optional[list[int]] = None, + ) -> PoolingRequestOutput: + """Aggregate results for CLS pooling by using the first chunk. + + For CLS pooling, we use the embedding from the first chunk since + it contains the CLS token's representation, which is what CLS + pooling extracts (typically the first token). + """ + first_result = chunk_results[0] + + # Preserve original prompt token ids for consistency + if original_prompt_token_ids is not None: + # Create a new result with updated prompt_token_ids + aggregated_result = PoolingRequestOutput( + request_id=first_result.request_id, + outputs=first_result.outputs, + prompt_token_ids=original_prompt_token_ids, + finished=True, + ) + return aggregated_result + + return first_result + def _validate_input( self, request, From 53decf588dc65dbe4e37a32b40e7a717a9a53b62 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Tue, 15 Jul 2025 11:10:41 +0800 Subject: [PATCH 012/552] fix(embedding): optimize LAST/CLS pooling in chunked processing - Process only relevant chunks (last for LAST, first for CLS pooling) - Disable chunked processing by default for these types due to semantic issues - Remove unused AVG pooling type references - Add explicit user override option with warnings Fixes computational waste identified in code review. Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 100 ++++++++++++++++--- 1 file changed, 84 insertions(+), 16 deletions(-) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 57b3e6698ed..843cdbebb9a 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -188,8 +188,35 @@ def _should_use_chunked_processing(self, request) -> bool: if pooling_type: pooling_type_upper = pooling_type.upper() - # Warn about non-MEAN pooling types - if pooling_type_upper not in ['MEAN', 'AVG']: + # For LAST and CLS pooling, chunked processing doesn't make + # semantic sense because only the last/first chunk + # contains the relevant token position + if pooling_type_upper in ['LAST', 'CLS']: + # Check if user explicitly allowed non-mean chunking + allow_non_mean = getattr(pooler_config, + 'allow_non_mean_chunking', False) + if not allow_non_mean: + logger.warning( + "Chunked processing with pooling type '%s' " + "is not recommended as it may produce semantically " + "incorrect results. %s pooling relies on specific " + "token positions that lose their meaning when the " + "sequence is chunked. Consider using MEAN pooling " + "or disable chunked processing. Set " + "'allow_non_mean_chunking: true' ", + "to override this warning.", pooling_type, + pooling_type_upper) + return False # Disable chunked processing by default + else: + logger.info( + "Using chunked processing with %s pooling " + "(explicitly enabled). 
Note: only the %s chunk " + "will be processed to avoid computational waste.", + pooling_type_upper, + "last" if pooling_type_upper == "LAST" else "first") + + # Warn about non-MEAN pooling types (for other pooling types) + elif pooling_type_upper != 'MEAN': # Check if user explicitly allowed non-mean chunking allow_non_mean = getattr(pooler_config, 'allow_non_mean_chunking', False) @@ -240,12 +267,39 @@ async def _process_chunked_request( max_pos_embeddings = self._get_max_position_embeddings() chunks = self._chunk_token_ids(token_ids, max_pos_embeddings) - logger.info( - "Split input of %s tokens into %s chunks " - "(max_chunk_size: %s)", len(token_ids), len(chunks), - max_pos_embeddings) + # Check pooling type to optimize chunk processing + pooler_config = getattr(self.model_config, 'pooler_config', None) + pooling_type = getattr(pooler_config, 'pooling_type', 'MEAN') + if pooling_type: + pooling_type = pooling_type.upper() - for chunk_idx, chunk_tokens in enumerate(chunks): + # For LAST pooling, only process the last chunk + # For CLS pooling, only process the first chunk + if pooling_type == 'LAST': + chunks_to_process = [chunks[-1]] + chunk_indices = [len(chunks) - 1] + logger.info( + "LAST pooling: processing only the last chunk (%d tokens) " + "out of %d total chunks to avoid computational waste", + len(chunks[-1]), len(chunks)) + elif pooling_type == 'CLS': + chunks_to_process = [chunks[0]] + chunk_indices = [0] + logger.info( + "CLS pooling: processing only the first chunk (%d tokens) " + "out of %d total chunks to avoid computational waste", + len(chunks[0]), len(chunks)) + else: + # For MEAN and other pooling types, process all chunks + chunks_to_process = chunks + chunk_indices = list(range(len(chunks))) + logger.info( + "Split input of %s tokens into %s chunks " + "(max_chunk_size: %s)", len(token_ids), len(chunks), + max_pos_embeddings) + + for i, (chunk_idx, chunk_tokens) in enumerate( + zip(chunk_indices, chunks_to_process)): # Create a request ID for this chunk chunk_request_id = (f"{ctx.request_id}-prompt-{prompt_idx}-" f"chunk-{chunk_idx}") @@ -299,7 +353,7 @@ async def _aggregate_chunked_results( pooling_type = pooling_type.upper() # Route to appropriate aggregation method based on pooling type - if pooling_type in ['MEAN', 'AVG']: + if pooling_type == 'MEAN': return await self._aggregate_mean_pooling( chunk_results, original_token_count, original_prompt_token_ids) elif pooling_type == 'LAST': @@ -397,12 +451,19 @@ async def _aggregate_last_pooling( chunk_results: list[PoolingRequestOutput], original_prompt_token_ids: Optional[list[int]] = None, ) -> PoolingRequestOutput: - """Aggregate results for LAST pooling by using the last chunk. + """Aggregate results for LAST pooling. - For LAST pooling, we use the embedding from the last chunk since - it contains the final token's representation, which is what LAST - pooling extracts from the full sequence. + For LAST pooling, when chunked processing is enabled, we only process + the last chunk to avoid computational waste, since only the last token's + representation is needed. This result is returned directly. """ + # When LAST pooling chunked processing is enabled, we only process + # the last chunk, so chunk_results should contain only one result + if len(chunk_results) != 1: + logger.warning( + "Expected exactly 1 chunk result for LAST pooling, " + "got %d. 
Using the last result.", len(chunk_results)) + last_result = chunk_results[-1] # Preserve original prompt token ids for consistency @@ -423,12 +484,19 @@ async def _aggregate_cls_pooling( chunk_results: list[PoolingRequestOutput], original_prompt_token_ids: Optional[list[int]] = None, ) -> PoolingRequestOutput: - """Aggregate results for CLS pooling by using the first chunk. + """Aggregate results for CLS pooling. - For CLS pooling, we use the embedding from the first chunk since - it contains the CLS token's representation, which is what CLS - pooling extracts (typically the first token). + For CLS pooling, when chunked processing is enabled, we only process + the first chunk to avoid computational waste, since only the CLS token's + representation (typically the first token) is needed. """ + # When CLS pooling chunked processing is enabled, we only process + # the first chunk, so chunk_results should contain only one result + if len(chunk_results) != 1: + logger.warning( + "Expected exactly 1 chunk result for CLS pooling, " + "got %d. Using the first result.", len(chunk_results)) + first_result = chunk_results[0] # Preserve original prompt token ids for consistency From 595a2ec76db763be6071b6c6bf3ce63e3b994a9f Mon Sep 17 00:00:00 2001 From: x22x22 Date: Tue, 15 Jul 2025 12:18:16 +0800 Subject: [PATCH 013/552] fix: implement online aggregation for chunked embedding processing Replace batch aggregation with streaming aggregation to prevent memory spikes and potential DoS attacks. Process chunk results incrementally instead of accumulating complete chunk lists in memory, ensuring near-constant memory usage regardless of input length. Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 410 +++++++++---------- 1 file changed, 191 insertions(+), 219 deletions(-) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 843cdbebb9a..c5a19bbe0e5 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -6,7 +6,6 @@ from typing import Final, Literal, Optional, Union, cast import numpy as np -import torch from fastapi import Request from typing_extensions import assert_never, override @@ -32,7 +31,7 @@ from vllm.inputs.data import TokensPrompt as EngineTokensPrompt from vllm.logger import init_logger from vllm.outputs import (EmbeddingOutput, EmbeddingRequestOutput, - PoolingOutput, PoolingRequestOutput, RequestOutput) + PoolingRequestOutput, RequestOutput) logger = init_logger(__name__) @@ -273,30 +272,21 @@ async def _process_chunked_request( if pooling_type: pooling_type = pooling_type.upper() - # For LAST pooling, only process the last chunk + # For LAST pooling, only process the last chunk # For CLS pooling, only process the first chunk if pooling_type == 'LAST': chunks_to_process = [chunks[-1]] chunk_indices = [len(chunks) - 1] - logger.info( - "LAST pooling: processing only the last chunk (%d tokens) " - "out of %d total chunks to avoid computational waste", - len(chunks[-1]), len(chunks)) + logger.info("LAST pooling: processing only the last chunk") elif pooling_type == 'CLS': chunks_to_process = [chunks[0]] chunk_indices = [0] - logger.info( - "CLS pooling: processing only the first chunk (%d tokens) " - "out of %d total chunks to avoid computational waste", - len(chunks[0]), len(chunks)) + logger.info("CLS pooling: processing only the first chunk") else: # For MEAN and other pooling types, process all chunks chunks_to_process = chunks chunk_indices = 
list(range(len(chunks))) - logger.info( - "Split input of %s tokens into %s chunks " - "(max_chunk_size: %s)", len(token_ids), len(chunks), - max_pos_embeddings) + logger.info("Using chunked processing for MEAN pooling") for i, (chunk_idx, chunk_tokens) in enumerate( zip(chunk_indices, chunks_to_process)): @@ -334,184 +324,6 @@ async def _process_chunked_request( return generators - async def _aggregate_chunked_results( - self, - ctx: EmbeddingServeContext, - chunk_results: list[PoolingRequestOutput], - original_token_count: int, - original_prompt_token_ids: Optional[list[int]] = None, - ) -> PoolingRequestOutput: - """Aggregate results from multiple chunks using - pooling-type-specific strategies.""" - if len(chunk_results) == 1: - return chunk_results[0] - - # Get pooling type to determine aggregation strategy - pooler_config = getattr(self.model_config, 'pooler_config', None) - pooling_type = getattr(pooler_config, 'pooling_type', 'MEAN') - if pooling_type: - pooling_type = pooling_type.upper() - - # Route to appropriate aggregation method based on pooling type - if pooling_type == 'MEAN': - return await self._aggregate_mean_pooling( - chunk_results, original_token_count, original_prompt_token_ids) - elif pooling_type == 'LAST': - return await self._aggregate_last_pooling( - chunk_results, original_prompt_token_ids) - elif pooling_type == 'CLS': - return await self._aggregate_cls_pooling( - chunk_results, original_prompt_token_ids) - else: - # For unsupported pooling types, - # fall back to mean aggregation with warning - logger.warning( - "Chunked aggregation for pooling type '%s' is not " - "specifically implemented. Falling back to weighted " - "averaging which may produce incorrect results.", pooling_type) - return await self._aggregate_mean_pooling( - chunk_results, original_token_count, original_prompt_token_ids) - - async def _aggregate_mean_pooling( - self, - chunk_results: list[PoolingRequestOutput], - original_token_count: int, - original_prompt_token_ids: Optional[list[int]] = None, - ) -> PoolingRequestOutput: - """Aggregate results using weighted averaging for MEAN pooling.""" - # Extract embeddings and use vLLM's token counting approach - chunk_embeddings = [] - chunk_weights = [] - - for result in chunk_results: - # PoolingRequestOutput.outputs is a PoolingOutput object - if hasattr(result, 'outputs') and hasattr(result.outputs, 'data'): - # Get the embedding tensor from PoolingOutput.data - embedding_data = result.outputs.data - if not isinstance(embedding_data, torch.Tensor): - embedding_data = torch.tensor(embedding_data, - dtype=torch.float32) - chunk_embeddings.append(embedding_data) - - # Use actual effective token count - # this is what vLLM uses internally - effective_token_count = len(result.prompt_token_ids) - chunk_weights.append(effective_token_count) - - if not chunk_embeddings: - raise ValueError("No valid embeddings found in chunk results") - - # Simple weighted averaging compatible with vLLM's approach - # This is similar to what MeanPool does for multiple sequences - device = chunk_embeddings[0].device - # Use float32 for precision, as done in vLLM's PoolerHead - dtype = torch.float32 - - # Weighted sum following vLLM's internal logic - weighted_sum = torch.zeros_like(chunk_embeddings[0], - dtype=dtype, - device=device) - total_weight = 0 - - for embedding, weight in zip(chunk_embeddings, chunk_weights): - embedding = embedding.to(dtype=dtype, device=device) - weighted_sum += embedding * weight - total_weight += weight - - # Final averaged embedding - let 
vLLM handle the rest - aggregated_embedding = weighted_sum / total_weight - - # NOTE: Don't manually normalize here - # let vLLM's PoolerHead handle normalization - # based on the model's pooler_config.normalize setting. - # This ensures consistency with vLLM's standard pooling behavior. - - # Create aggregated result using vLLM's standard output structure - first_result = chunk_results[0] - - # Create new PoolingOutput with aggregated embedding - aggregated_output = PoolingOutput(data=aggregated_embedding) - - # Preserve original prompt token ids for consistency - result_prompt_token_ids = (original_prompt_token_ids - if original_prompt_token_ids is not None - else first_result.prompt_token_ids) - - aggregated_result = PoolingRequestOutput( - request_id=first_result.request_id, - outputs=aggregated_output, - prompt_token_ids=result_prompt_token_ids, - finished=True, - ) - - return aggregated_result - - async def _aggregate_last_pooling( - self, - chunk_results: list[PoolingRequestOutput], - original_prompt_token_ids: Optional[list[int]] = None, - ) -> PoolingRequestOutput: - """Aggregate results for LAST pooling. - - For LAST pooling, when chunked processing is enabled, we only process - the last chunk to avoid computational waste, since only the last token's - representation is needed. This result is returned directly. - """ - # When LAST pooling chunked processing is enabled, we only process - # the last chunk, so chunk_results should contain only one result - if len(chunk_results) != 1: - logger.warning( - "Expected exactly 1 chunk result for LAST pooling, " - "got %d. Using the last result.", len(chunk_results)) - - last_result = chunk_results[-1] - - # Preserve original prompt token ids for consistency - if original_prompt_token_ids is not None: - # Create a new result with updated prompt_token_ids - aggregated_result = PoolingRequestOutput( - request_id=last_result.request_id, - outputs=last_result.outputs, - prompt_token_ids=original_prompt_token_ids, - finished=True, - ) - return aggregated_result - - return last_result - - async def _aggregate_cls_pooling( - self, - chunk_results: list[PoolingRequestOutput], - original_prompt_token_ids: Optional[list[int]] = None, - ) -> PoolingRequestOutput: - """Aggregate results for CLS pooling. - - For CLS pooling, when chunked processing is enabled, we only process - the first chunk to avoid computational waste, since only the CLS token's - representation (typically the first token) is needed. - """ - # When CLS pooling chunked processing is enabled, we only process - # the first chunk, so chunk_results should contain only one result - if len(chunk_results) != 1: - logger.warning( - "Expected exactly 1 chunk result for CLS pooling, " - "got %d. Using the first result.", len(chunk_results)) - - first_result = chunk_results[0] - - # Preserve original prompt token ids for consistency - if original_prompt_token_ids is not None: - # Create a new result with updated prompt_token_ids - aggregated_result = PoolingRequestOutput( - request_id=first_result.request_id, - outputs=first_result.outputs, - prompt_token_ids=original_prompt_token_ids, - finished=True, - ) - return aggregated_result - - return first_result - def _validate_input( self, request, @@ -676,7 +488,13 @@ async def _collect_batch( self, ctx: ServeContext, ) -> Optional[ErrorResponse]: - """Override to support chunked processing.""" + """Collect and aggregate batch results + with support for chunked processing. 
+ + For chunked requests, performs online aggregation to + minimize memory usage. + For regular requests, collects results normally. + """ ctx = cast(EmbeddingServeContext, ctx) try: if ctx.engine_prompts is None: @@ -695,29 +513,103 @@ async def _collect_batch( use_chunked = self._should_use_chunked_processing(ctx.request) if use_chunked: - # Efficient single-pass processing for chunked requests - from collections import defaultdict + # Online aggregation for chunked requests to + # minimize memory usage + import torch - # Group results by original prompt index - grouped_results = defaultdict(list) + # Track aggregation state for each prompt + prompt_aggregators = {} short_prompts_results = {} async for result_idx, result in ctx.result_generator: if "-chunk-" in result.request_id: # Extract prompt_idx from chunked request_id - # e.g., from "req-id-prompt-2-chunk-0" -> 2 parts = result.request_id.split("-") try: prompt_idx = int(parts[parts.index("prompt") + 1]) - grouped_results[prompt_idx].append( - cast(PoolingRequestOutput, result)) + + # Initialize aggregator for this prompt if needed + if prompt_idx not in prompt_aggregators: + # Get pooling type to determine + # aggregation strategy + pooler_config = getattr( + self.model_config, 'pooler_config', None) + pooling_type = getattr(pooler_config, + 'pooling_type', 'MEAN') + if pooling_type: + pooling_type = pooling_type.upper() + + prompt_aggregators[prompt_idx] = { + 'pooling_type': + pooling_type, + 'weighted_sum': + None, + 'total_weight': + 0, + 'first_result': + None, + 'last_result': + None, + 'chunk_count': + 0, + 'request_id': + result.request_id.split("-chunk-")[0] + } + + aggregator = prompt_aggregators[prompt_idx] + pooling_type = aggregator['pooling_type'] + + # Handle different pooling types with + # online aggregation + if pooling_type == 'MEAN': + # Online weighted averaging + embedding_data = result.outputs.data + if not isinstance(embedding_data, + torch.Tensor): + embedding_data = torch.tensor( + embedding_data, dtype=torch.float32) + + weight = len(result.prompt_token_ids) + + if aggregator['weighted_sum'] is None: + # First chunk + aggregator[ + 'weighted_sum'] = embedding_data.to( + dtype=torch.float32) * weight + else: + # Accumulate + aggregator[ + 'weighted_sum'] += embedding_data.to( + dtype=torch.float32) * weight + + aggregator['total_weight'] += weight + + elif pooling_type == 'LAST': + # Keep only the + # last result (highest chunk index) + chunk_idx = int(parts[parts.index("chunk") + + 1]) + if (aggregator['last_result'] is None + or chunk_idx > aggregator.get( + 'last_chunk_idx', -1)): + aggregator['last_result'] = result + aggregator['last_chunk_idx'] = chunk_idx + + elif pooling_type == 'CLS': + # Keep only the first result (chunk index 0) + chunk_idx = int(parts[parts.index("chunk") + + 1]) + if chunk_idx == 0: + aggregator['first_result'] = result + + aggregator['chunk_count'] += 1 + except (ValueError, IndexError): return self.create_error_response( f"Invalid chunk request ID format: " f"{result.request_id}") else: - # Extract prompt_idx from non-chunked request_id - # e.g., from "req-id-2" -> 2 + # Non-chunked result try: prompt_idx = int(result.request_id.split("-")[-1]) short_prompts_results[prompt_idx] = cast( @@ -727,28 +619,108 @@ async def _collect_batch( f"Invalid request ID format: " f"{result.request_id}") - # Build final result batch in prompt order + # Build final result batch final_res_batch = [] for prompt_idx, request_prompt in enumerate( ctx.request_prompts): - if prompt_idx in 
grouped_results: - # This was a chunked prompt - aggregate results - chunk_results = grouped_results[prompt_idx] - if self._is_text_tokens_prompt(request_prompt): - text_tokens_prompt = cast(TextTokensPrompt, - request_prompt) - original_token_count = len( - text_tokens_prompt["prompt_token_ids"]) - aggregated_result = await \ - self._aggregate_chunked_results( - ctx, chunk_results, original_token_count, - text_tokens_prompt["prompt_token_ids"]) - final_res_batch.append(aggregated_result) + if prompt_idx in prompt_aggregators: + # Finalize aggregation for this chunked prompt + aggregator = prompt_aggregators[prompt_idx] + pooling_type = aggregator['pooling_type'] + + if pooling_type == 'MEAN': + # Finalize weighted average + if aggregator[ + 'weighted_sum'] is not None and aggregator[ + 'total_weight'] > 0: + final_embedding = aggregator[ + 'weighted_sum'] / aggregator['total_weight'] + + # Create aggregated result + from vllm.outputs import PoolingOutput + aggregated_output = PoolingOutput( + data=final_embedding) + + # Get original prompt token ids + if self._is_text_tokens_prompt(request_prompt): + text_tokens_prompt = cast( + TextTokensPrompt, request_prompt) + original_token_ids = text_tokens_prompt[ + "prompt_token_ids"] + else: + return self.create_error_response( + f"Chunked prompt {prompt_idx} is not a " + f"text tokens prompt") + + aggregated_result = PoolingRequestOutput( + request_id=aggregator['request_id'], + outputs=aggregated_output, + prompt_token_ids=original_token_ids, + finished=True, + ) + final_res_batch.append(aggregated_result) + else: + return self.create_error_response( + f"No valid aggregation data for prompt " + f"{prompt_idx}") + + elif pooling_type == 'LAST': + if aggregator['last_result'] is not None: + # Use the last chunk result + last_result = aggregator['last_result'] + if self._is_text_tokens_prompt(request_prompt): + text_tokens_prompt = cast( + TextTokensPrompt, request_prompt) + original_token_ids = text_tokens_prompt[ + "prompt_token_ids"] + + aggregated_result = PoolingRequestOutput( + request_id=aggregator['request_id'], + outputs=last_result.outputs, + prompt_token_ids=original_token_ids, + finished=True, + ) + final_res_batch.append(aggregated_result) + else: + return self.create_error_response( + f"Chunked prompt {prompt_idx} is not a " + f"text tokens prompt") + else: + return self.create_error_response( + f"No LAST result found for prompt " + f"{prompt_idx}") + + elif pooling_type == 'CLS': + if aggregator['first_result'] is not None: + # Use the first chunk result + first_result = aggregator['first_result'] + if self._is_text_tokens_prompt(request_prompt): + text_tokens_prompt = cast( + TextTokensPrompt, request_prompt) + original_token_ids = text_tokens_prompt[ + "prompt_token_ids"] + + aggregated_result = PoolingRequestOutput( + request_id=aggregator['request_id'], + outputs=first_result.outputs, + prompt_token_ids=original_token_ids, + finished=True, + ) + final_res_batch.append(aggregated_result) + else: + return self.create_error_response( + f"Chunked prompt {prompt_idx} is not a " + f"text tokens prompt") + else: + return self.create_error_response( + f"No CLS result found for prompt " + f"{prompt_idx}") else: return self.create_error_response( - f"Chunked prompt {prompt_idx} is not a " - f"text tokens prompt") + f"Unsupported pooling type for chunked " + f"processing: {pooling_type}") + elif prompt_idx in short_prompts_results: # This was a short prompt final_res_batch.append( From 73b6b66ba12c517c211090a2fdf3f5062b901266 Mon Sep 17 
00:00:00 2001 From: x22x22 Date: Tue, 15 Jul 2025 15:27:32 +0800 Subject: [PATCH 014/552] fix pre-commit errors Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 120 +++++++++++++++---- 1 file changed, 98 insertions(+), 22 deletions(-) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index c5a19bbe0e5..26eae3b2b8f 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -3,9 +3,10 @@ import base64 from collections.abc import AsyncGenerator -from typing import Final, Literal, Optional, Union, cast +from typing import Any, Final, Literal, Optional, Union, cast import numpy as np +import torch from fastapi import Request from typing_extensions import assert_never, override @@ -515,11 +516,9 @@ async def _collect_batch( if use_chunked: # Online aggregation for chunked requests to # minimize memory usage - import torch - # Track aggregation state for each prompt - prompt_aggregators = {} - short_prompts_results = {} + prompt_aggregators: dict[int, dict[str, Any]] = {} + short_prompts_results: dict[int, PoolingRequestOutput] = {} async for result_idx, result in ctx.result_generator: if "-chunk-" in result.request_id: @@ -563,46 +562,86 @@ async def _collect_batch( # online aggregation if pooling_type == 'MEAN': # Online weighted averaging + # Ensure result is PoolingRequestOutput + # for embedding processing + if not isinstance(result, + PoolingRequestOutput): + return self.create_error_response( + f"Expected PoolingRequestOutput for " + f"chunked embedding, got " + f"{type(result).__name__}") + embedding_data = result.outputs.data if not isinstance(embedding_data, torch.Tensor): embedding_data = torch.tensor( embedding_data, dtype=torch.float32) + if result.prompt_token_ids is None: + return self.create_error_response( + "prompt_token_ids cannot be None for " + "chunked processing") weight = len(result.prompt_token_ids) + weighted_embedding = embedding_data.to( + dtype=torch.float32) * weight + if aggregator['weighted_sum'] is None: # First chunk aggregator[ - 'weighted_sum'] = embedding_data.to( - dtype=torch.float32) * weight + 'weighted_sum'] = weighted_embedding else: # Accumulate - aggregator[ - 'weighted_sum'] += embedding_data.to( - dtype=torch.float32) * weight + current_sum = aggregator['weighted_sum'] + if isinstance(current_sum, torch.Tensor): + aggregator['weighted_sum'] = ( + current_sum + weighted_embedding) - aggregator['total_weight'] += weight + total_weight = aggregator['total_weight'] + if isinstance(total_weight, (int, float)): + aggregator['total_weight'] = ( + total_weight + weight) elif pooling_type == 'LAST': # Keep only the # last result (highest chunk index) + if not isinstance(result, + PoolingRequestOutput): + return self.create_error_response( + f"Expected PoolingRequestOutput for " + f"chunked embedding, got " + f"{type(result).__name__}") + chunk_idx = int(parts[parts.index("chunk") + 1]) + last_chunk_idx = aggregator.get( + 'last_chunk_idx', -1) + # Ensure last_chunk_idx is an integer + # for comparison + if not isinstance(last_chunk_idx, int): + last_chunk_idx = -1 if (aggregator['last_result'] is None - or chunk_idx > aggregator.get( - 'last_chunk_idx', -1)): + or chunk_idx > last_chunk_idx): aggregator['last_result'] = result aggregator['last_chunk_idx'] = chunk_idx elif pooling_type == 'CLS': # Keep only the first result (chunk index 0) + if not isinstance(result, + PoolingRequestOutput): + return self.create_error_response( + 
f"Expected PoolingRequestOutput for " + f"chunked embedding, got " + f"{type(result).__name__}") + chunk_idx = int(parts[parts.index("chunk") + 1]) if chunk_idx == 0: aggregator['first_result'] = result - aggregator['chunk_count'] += 1 + chunk_count = aggregator['chunk_count'] + if isinstance(chunk_count, int): + aggregator['chunk_count'] = chunk_count + 1 except (ValueError, IndexError): return self.create_error_response( @@ -631,11 +670,13 @@ async def _collect_batch( if pooling_type == 'MEAN': # Finalize weighted average - if aggregator[ - 'weighted_sum'] is not None and aggregator[ - 'total_weight'] > 0: - final_embedding = aggregator[ - 'weighted_sum'] / aggregator['total_weight'] + weighted_sum = aggregator['weighted_sum'] + total_weight = aggregator['total_weight'] + if (weighted_sum is not None + and isinstance(weighted_sum, torch.Tensor) + and isinstance(total_weight, (int, float)) + and total_weight > 0): + final_embedding = weighted_sum / total_weight # Create aggregated result from vllm.outputs import PoolingOutput @@ -653,8 +694,15 @@ async def _collect_batch( f"Chunked prompt {prompt_idx} is not a " f"text tokens prompt") + # Ensure request_id is string + request_id = aggregator['request_id'] + if not isinstance(request_id, str): + return self.create_error_response( + f"Invalid request_id type: " + f"{type(request_id)}") + aggregated_result = PoolingRequestOutput( - request_id=aggregator['request_id'], + request_id=request_id, outputs=aggregated_output, prompt_token_ids=original_token_ids, finished=True, @@ -669,14 +717,28 @@ async def _collect_batch( if aggregator['last_result'] is not None: # Use the last chunk result last_result = aggregator['last_result'] + if not isinstance(last_result, + PoolingRequestOutput): + return self.create_error_response( + f"Expected PoolingRequestOutput for " + f"last_result, got " + f"{type(last_result).__name__}") + if self._is_text_tokens_prompt(request_prompt): text_tokens_prompt = cast( TextTokensPrompt, request_prompt) original_token_ids = text_tokens_prompt[ "prompt_token_ids"] + # Ensure request_id is string + request_id = aggregator['request_id'] + if not isinstance(request_id, str): + return self.create_error_response( + f"Invalid request_id type: " + f"{type(request_id)}") + aggregated_result = PoolingRequestOutput( - request_id=aggregator['request_id'], + request_id=request_id, outputs=last_result.outputs, prompt_token_ids=original_token_ids, finished=True, @@ -695,14 +757,28 @@ async def _collect_batch( if aggregator['first_result'] is not None: # Use the first chunk result first_result = aggregator['first_result'] + if not isinstance(first_result, + PoolingRequestOutput): + return self.create_error_response( + f"Expected PoolingRequestOutput for " + f"first_result, got " + f"{type(first_result).__name__}") + if self._is_text_tokens_prompt(request_prompt): text_tokens_prompt = cast( TextTokensPrompt, request_prompt) original_token_ids = text_tokens_prompt[ "prompt_token_ids"] + # Ensure request_id is string + request_id = aggregator['request_id'] + if not isinstance(request_id, str): + return self.create_error_response( + f"Invalid request_id type: " + f"{type(request_id)}") + aggregated_result = PoolingRequestOutput( - request_id=aggregator['request_id'], + request_id=request_id, outputs=first_result.outputs, prompt_token_ids=original_token_ids, finished=True, From bbba6000518e3a47dc78cf591b131482e756d08f Mon Sep 17 00:00:00 2001 From: x22x22 Date: Fri, 18 Jul 2025 22:22:14 +0800 Subject: [PATCH 015/552] Update the documentation 
and examples to support the enhanced chunk processing function, and elaborate on the processing methods and performance characteristics of different pooling types (MEAN, CLS, LAST). Optimize the configuration parameters to ensure that users receive clear warning messages when using non - MEAN pooling, and enhance the support for long - text input. Signed-off-by: x22x22 --- docs/models/pooling_models.md | 46 +++++++++++++++++-- .../openai_embedding_long_text.md | 34 ++++++++++---- .../openai_embedding_long_text_client.py | 27 +++++++++-- .../openai_embedding_long_text_service.sh | 43 +++++++++++------ vllm/entrypoints/openai/serving_embedding.py | 26 ++++++++--- 5 files changed, 142 insertions(+), 34 deletions(-) diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md index e4e1436c545..f9ebac8ed27 100644 --- a/docs/models/pooling_models.md +++ b/docs/models/pooling_models.md @@ -60,10 +60,17 @@ Other embedding models can be extended to support this feature by ensuring prope Enable chunked processing and configure maximum embedding input length: ```bash +# MEAN pooling (recommended for chunked processing) vllm serve intfloat/multilingual-e5-large \ --task embed \ --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 3072000}' \ --trust-remote-code + +# CLS pooling (processes only first chunk) +vllm serve BAAI/bge-large-en-v1.5 \ + --task embed \ + --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1048576, "allow_non_mean_chunking": true}' \ + --trust-remote-code ``` #### Configuration Parameters @@ -73,10 +80,17 @@ vllm serve intfloat/multilingual-e5-large \ - When set, allows inputs longer than `max_model_len` without requiring `VLLM_ALLOW_LONG_MAX_MODEL_LEN` - Inputs exceeding `max_embed_len` are rejected with clear error messages - Chunking is triggered when inputs exceed `max_position_embeddings` +- `allow_non_mean_chunking`: Allow non-MEAN pooling types with chunked processing (default: `false`) + - When `false`: CLS/LAST pooling types show warnings and may be disabled + - When `true`: Explicitly enables CLS/LAST pooling with performance optimizations + - Required to suppress warnings for non-MEAN pooling types ### Aggregation Algorithm -The chunked processing uses a FastChat-inspired weighted averaging algorithm: +The chunked processing uses different strategies based on pooling type: + +#### MEAN Pooling (Recommended) +Uses weighted averaging across all chunks: ```python # Weighted average: sum(embedding_i * token_count_i) / total_tokens @@ -86,13 +100,39 @@ final_embedding = weighted_sum / sum(weights) This ensures that longer chunks contribute proportionally more to the final representation. +#### CLS Pooling (Performance Optimized) +Only processes the **first chunk** to avoid computational waste: + +```python +# CLS pooling: only the first chunk contains the CLS token +final_embedding = first_chunk_embedding +``` + +Note: This may lose information from later parts of the text. + +#### LAST Pooling (Performance Optimized) +Only processes the **last chunk** to avoid computational waste: + +```python +# LAST pooling: only the last chunk contains the final token +final_embedding = last_chunk_embedding +``` + +Note: This may lose information from earlier parts of the text. 
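
To make the three strategies above concrete, here is a minimal, self-contained sketch of how per-chunk results could be combined once each chunk has been embedded. This is illustrative only and is not the serving code from this patch; the `chunk_embeddings` and `chunk_token_counts` names are hypothetical placeholders for the per-chunk embedding tensors and their token counts.

```python
import torch


def aggregate_chunks(chunk_embeddings: list[torch.Tensor],
                     chunk_token_counts: list[int],
                     pooling_type: str = "MEAN") -> torch.Tensor:
    """Combine per-chunk embeddings using the strategies described above.

    chunk_embeddings: one 1-D embedding tensor per chunk (hypothetical inputs).
    chunk_token_counts: number of tokens that produced each chunk embedding.
    """
    if pooling_type == "MEAN":
        # Weighted average: longer chunks contribute proportionally more.
        weights = torch.tensor(chunk_token_counts, dtype=torch.float32)
        stacked = torch.stack([e.to(torch.float32) for e in chunk_embeddings])
        return (stacked * weights.unsqueeze(1)).sum(dim=0) / weights.sum()
    if pooling_type == "CLS":
        # Only the first chunk contains the CLS token.
        return chunk_embeddings[0]
    if pooling_type == "LAST":
        # Only the last chunk contains the final token.
        return chunk_embeddings[-1]
    raise ValueError(f"Unsupported pooling type: {pooling_type}")


# Example: three chunks of 512, 512 and 128 tokens with 4-dim embeddings.
chunks = [torch.randn(4) for _ in range(3)]
print(aggregate_chunks(chunks, [512, 512, 128], "MEAN"))
```

As in the served implementation, any normalization is left to the model's pooler configuration rather than applied here.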
+ ### Performance Characteristics +| Pooling Type | Chunks Processed | Processing Time | Semantic Coverage | Best Use Case | +|--------------|------------------|-----------------|-------------------|---------------| +| **MEAN** | All chunks | Highest (all chunks) | Complete | General purpose, long documents | +| **CLS** | First chunk only | Lowest (1 chunk) | Limited to start | Classification, when start matters | +| **LAST** | Last chunk only | Lowest (1 chunk) | Limited to end | When ending matters | + | Aspect | Short Text (≤ max_position_embeddings) | Long Text (> max_position_embeddings) | |--------|----------------------------------------|---------------------------------------| -| **Processing Time** | Standard | Increased (multiple inference calls) | +| **Processing Time** | Standard | Varies by pooling type (CLS/LAST: minimal, MEAN: increased) | | **Memory Usage** | Standard | Reduced (chunks processed separately) | -| **Quality** | Standard | Maintains semantic representation | +| **Quality** | Standard | Depends on pooling type and content distribution | | **Compatibility** | Full | Full (backward compatible) | | **Input Validation** | Standard max_model_len check | Extended max_embed_len check | diff --git a/examples/online_serving/openai_embedding_long_text.md b/examples/online_serving/openai_embedding_long_text.md index c1c044d916b..a94bf95e534 100644 --- a/examples/online_serving/openai_embedding_long_text.md +++ b/examples/online_serving/openai_embedding_long_text.md @@ -50,10 +50,19 @@ The key parameters for chunked processing are in the `--override-pooler-config`: "pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, - "max_embed_len": 3072000 + "max_embed_len": 3072000, + "allow_non_mean_chunking": true } ``` +#### Pooling Type Behavior with Chunked Processing + +| Pooling Type | Chunks Processed | Performance | Semantic Coverage | Use Case | +|--------------|------------------|-------------|-------------------|----------| +| **MEAN** (recommended) | All chunks | Slower | Complete | General purpose, full documents | +| **CLS** | First chunk only | Fastest | Limited to start | Classification, when beginning matters | +| **LAST** | Last chunk only | Fastest | Limited to end | When ending/conclusion matters | + ### Environment Variables | Variable | Default | Description | @@ -62,14 +71,21 @@ The key parameters for chunked processing are in the `--override-pooler-config`: | `PORT` | `31090` | Server port | | `GPU_COUNT` | `1` | Number of GPUs to use | | `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) | +| `POOLING_TYPE` | `auto` | Pooling type: `auto`, `MEAN`, `CLS`, `LAST` | +| `ALLOW_NON_MEAN_CHUNKING` | `false` | Allow CLS/LAST pooling with chunked processing | | `API_KEY` | `EMPTY` | API key for authentication | ## 🔧 How It Works 1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without environment variables 2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity -3. **Independent Processing**: Each chunk is processed separately through the model -4. **Weighted Aggregation**: Results are combined using token count-based weighted averaging +3. **Pooling-Optimized Processing**: + - **MEAN pooling**: All chunks processed separately through the model + - **CLS pooling**: Only first chunk processed (contains CLS token) + - **LAST pooling**: Only last chunk processed (contains final token) +4. 
**Intelligent Aggregation**: + - **MEAN**: Results combined using token count-based weighted averaging + - **CLS/LAST**: Direct use of single chunk result (no aggregation needed) 5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing ### Input Length Handling @@ -89,11 +105,13 @@ With `MAX_EMBED_LEN=3072000`, you can process: ## 📊 Performance Characteristics -| Text Length | Processing Method | Memory Usage | Speed | -|-------------|------------------|--------------|-------| -| ≤ max_position_embeddings | Standard | Normal | Fast | -| > max_position_embeddings, ≤ max_embed_len | Chunked | Reduced per chunk | Slower (multiple inferences) | -| > max_embed_len | Rejected | N/A | Error response | +### By Pooling Type (for long text) + +| Pooling Type | Chunks Processed | Processing Time | Memory Usage | Semantic Quality | +|--------------|------------------|-----------------|--------------|------------------| +| **MEAN** | All chunks | Highest | Moderate | Complete coverage | +| **CLS** | First chunk only | Lowest | Minimal | Limited to beginning | +| **LAST** | Last chunk only | Lowest | Minimal | Limited to ending | ## 🧪 Test Cases diff --git a/examples/online_serving/openai_embedding_long_text_client.py b/examples/online_serving/openai_embedding_long_text_client.py index fb645ed975e..1909800e420 100644 --- a/examples/online_serving/openai_embedding_long_text_client.py +++ b/examples/online_serving/openai_embedding_long_text_client.py @@ -6,21 +6,34 @@ This example shows how to use vLLM's chunked processing feature to handle text inputs that exceed the model's maximum token length. The feature automatically -splits long text into chunks and aggregates the results. +splits long text into chunks and handles different pooling types optimally. Prerequisites: 1. Start vLLM server with chunked processing enabled: + # MEAN pooling (processes all chunks, recommended for complete coverage) vllm serve intfloat/multilingual-e5-large \ --task embed \ --override-pooler-config \ - '{"pooling_type": "CLS", "normalize": true, ' \ - '"enable_chunked_processing": true, "max_embed_len": 10240}' \ + '{"pooling_type": "MEAN", "normalize": true, ' \ + '"enable_chunked_processing": true, "max_embed_len": 3072000}' \ --served-model-name multilingual-e5-large \ --trust-remote-code \ --port 31090 \ --api-key your-api-key + # OR CLS pooling (processes only first chunk, faster but limited coverage) + vllm serve BAAI/bge-large-en-v1.5 \ + --task embed \ + --override-pooler-config \ + '{"pooling_type": "CLS", "normalize": true, ' \ + '"enable_chunked_processing": true, "max_embed_len": 1048576, ' \ + '"allow_non_mean_chunking": true}' \ + --served-model-name bge-large-en-v1.5 \ + --trust-remote-code \ + --port 31090 \ + --api-key your-api-key + 2. Install required dependencies: pip install openai requests """ @@ -164,6 +177,10 @@ def test_multiple_long_texts_batch(): print("=" * 70) # Create multiple distinct long texts that will all require chunking + # Note: Results depend on pooling type: + # - MEAN pooling: All chunks processed, full semantic coverage + # - CLS pooling: Only first chunk processed per text (performance optimized) + # - LAST pooling: Only last chunk processed per text (performance optimized) long_texts = [ generate_long_text( "First long document about artificial intelligence and machine learning. 
" @@ -335,6 +352,10 @@ def main(): print(" - ✅ Automatic chunked processing for long text") print(" - ✅ Seamless handling of mixed-length batches") print(" - ✅ Multiple long texts in single batch (chunk ID fix)") + print(" - ✅ Pooling-type optimized processing:") + print(" • MEAN: All chunks processed (complete coverage)") + print(" • CLS: Only first chunk processed (performance optimized)") + print(" • LAST: Only last chunk processed (performance optimized)") print(" - ✅ Consistent embedding generation") print(" - ✅ Backward compatibility with short text") print("\n📚 For more information, see:") diff --git a/examples/online_serving/openai_embedding_long_text_service.sh b/examples/online_serving/openai_embedding_long_text_service.sh index fa78385e782..0d9a613c2d3 100644 --- a/examples/online_serving/openai_embedding_long_text_service.sh +++ b/examples/online_serving/openai_embedding_long_text_service.sh @@ -20,8 +20,10 @@ API_KEY=${API_KEY:-"your-api-key"} # Enhanced pooling configuration with model-specific defaults POOLING_TYPE=${POOLING_TYPE:-"auto"} # auto, MEAN, CLS, LAST -ALLOW_NON_MEAN_CHUNKING=${ALLOW_NON_MEAN_CHUNKING:-"false"} +ALLOW_NON_MEAN_CHUNKING=${ALLOW_NON_MEAN_CHUNKING:-"true"} +export VLLM_ENABLE_CHUNKED_PROCESSING=true # export CUDA_VISIBLE_DEVICES=2,3,4,5 +# export VLLM_ATTENTION_BACKEND=XFORMERS echo "🚀 Starting vLLM Embedding Server with Enhanced Chunked Processing" echo "==================================================================" @@ -34,19 +36,25 @@ get_optimal_pooling_type() { local model="$1" case "$model" in *"e5-"* | *"multilingual-e5"*) - echo "MEAN" # E5 series uses mean pooling + echo "MEAN" # E5 series uses mean pooling (best for chunked processing) ;; *"bge-"*) - echo "CLS" # BGE series uses CLS pooling + echo "CLS" # BGE series uses CLS pooling (only first chunk processed when chunked) ;; *"gte-"*) - echo "MEAN" # GTE series uses mean pooling + echo "LAST" # GTE series uses LAST pooling (best for chunked processing) ;; *"sentence-t5"* | *"st5"*) - echo "MEAN" # Sentence-T5 uses mean pooling + echo "MEAN" # Sentence-T5 uses mean pooling (best for chunked processing) + ;; + *"jina-embeddings"*) + echo "MEAN" # Jina embeddings use mean pooling (optimal for chunked processing) + ;; + *"Qwen"*"Embedding"*) + echo "LAST" # Qwen embeddings use LAST pooling (optimal for chunked processing) ;; *) - echo "MEAN" # Default to MEAN for unknown models + echo "MEAN" # Default to MEAN for unknown models (best chunked processing compatibility) ;; esac } @@ -62,7 +70,7 @@ echo "📋 Configuration:" echo " - Model: $MODEL_NAME" echo " - Port: $PORT" echo " - GPU Count: $GPU_COUNT" -echo " - Enhanced Chunked Processing: ENABLED" +echo " - Enhanced Chunked Processing: ${VLLM_ENABLE_CHUNKED_PROCESSING}" echo " - Max Embed Length: ${MAX_EMBED_LEN} tokens" echo " - Pooling Type: $POOLING_TYPE + Normalization" echo " - Allow Non-MEAN Chunking: $ALLOW_NON_MEAN_CHUNKING" @@ -85,10 +93,19 @@ fi if [ "$POOLING_TYPE" != "MEAN" ] && [ "$ALLOW_NON_MEAN_CHUNKING" != "true" ]; then echo "" echo "⚠️ IMPORTANT: Using $POOLING_TYPE pooling with chunked processing" - echo " This may produce different results than non-chunked processing." 
- echo " For BERT-type models with bidirectional attention, consider:" - echo " - Using MEAN pooling for mathematically equivalent results" - echo " - Setting ALLOW_NON_MEAN_CHUNKING=true to suppress this warning" + echo " Chunked processing behavior for different pooling types:" + if [ "$POOLING_TYPE" = "CLS" ]; then + echo " - CLS pooling: Only the FIRST chunk will be processed (performance optimized)" + echo " - This avoids processing unnecessary chunks but may lose information" + elif [ "$POOLING_TYPE" = "LAST" ]; then + echo " - LAST pooling: Only the LAST chunk will be processed (performance optimized)" + echo " - This avoids processing unnecessary chunks but may lose information" + else + echo " - $POOLING_TYPE pooling: All chunks processed, results may differ from non-chunked" + fi + echo " - Each token only attends within its chunk (limited attention scope)" + echo " - Consider using MEAN pooling for full semantic coverage" + echo " - Set ALLOW_NON_MEAN_CHUNKING=true to suppress this warning" echo "" fi @@ -96,9 +113,9 @@ echo "" echo "🔧 Starting server with enhanced chunked processing configuration..." # Build pooler config JSON -POOLER_CONFIG="{\"pooling_type\": \"$POOLING_TYPE\", \"normalize\": true, \"enable_chunked_processing\": true, \"max_embed_len\": ${MAX_EMBED_LEN}" +POOLER_CONFIG="{\"pooling_type\": \"$POOLING_TYPE\", \"normalize\": true, \"enable_chunked_processing\": ${VLLM_ENABLE_CHUNKED_PROCESSING}, \"max_embed_len\": ${MAX_EMBED_LEN}" -# Add allow_non_mean_chunking if needed +# Add allow_non_mean_chunking if needed (suppresses warnings for non-MEAN pooling types) if [ "$ALLOW_NON_MEAN_CHUNKING" = "true" ]; then POOLER_CONFIG="${POOLER_CONFIG}, \"allow_non_mean_chunking\": true" fi diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 26eae3b2b8f..a5f816a66a8 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -148,6 +148,8 @@ def _get_max_position_embeddings(self) -> int: This uses the same logic as vLLM's _get_and_verify_max_len to determine the actual sequence length limit, considering both model config and tokenizer config. + When max_model_len is set and smaller than max_position_embeddings, + use max_model_len for chunking. """ hf_config = self.model_config.hf_config @@ -170,6 +172,12 @@ def _get_max_position_embeddings(self) -> int: derived_max_len = min(derived_max_len, tokenizer_model_max_length) + # Consider max_model_len when it's set and smaller than other limits + # max_model_len is set in OpenAIServing.__init__ + # from model_config.max_model_len + if self.max_model_len is not None: + derived_max_len = min(derived_max_len, self.max_model_len) + return int(derived_max_len) def _should_use_chunked_processing(self, request) -> bool: @@ -224,13 +232,17 @@ def _should_use_chunked_processing(self, request) -> bool: logger.warning( "Chunked processing with pooling type '%s' " "may produce different results than non-chunked " - "processing. Only MEAN pooling is mathematically " - "equivalent when using weighted averaging aggregation. " - "For other pooling types, different aggregation " - "strategies will be used that approximate the original " - "behavior. Set 'allow_non_mean_chunking: true' " - "in pooler config to suppress this warning.", - pooling_type) + "processing due to limited attention scope within " + "chunks. 
Each token can only attend to tokens within " + "its chunk (similar to sliding window attention), " + "which changes token representations before pooling. " + "While MEAN pooling provides a reasonable " + "approximation " + "through weighted averaging aggregation, other pooling " + "types use different aggregation strategies that " + "further approximate the original behavior. Set " + "'allow_non_mean_chunking: true' in pooler config " + "to suppress this warning.", pooling_type) # Still allow it but with warning else: logger.info( From e084b8ce4cd1b727156dcde2e7abc5a77cb1cb61 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Sat, 12 Jul 2025 01:05:33 +0900 Subject: [PATCH 016/552] [Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM (#20646) Signed-off-by: mgoin Signed-off-by: x22x22 --- .../fp4/nvfp4_scaled_mm_kernels.cu | 135 +++++++++++------- 1 file changed, 85 insertions(+), 50 deletions(-) diff --git a/csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu b/csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu index 7572a7eb312..5bc4c38a275 100644 --- a/csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu +++ b/csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu @@ -30,35 +30,40 @@ #include "cutlass/util/packed_stride.hpp" +#include "core/math.hpp" + using namespace cute; #if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) -// Kernel Perf config -template -struct KernelTraits; -template <> -struct KernelTraits { - using MmaTileShape = Shape<_128, _128, _256>; - using ClusterShape = Shape<_1, _1, _1>; - using PerSmTileShape_MNK = Shape<_128, _128, _256>; +// Configuration for M in (256, inf) +struct sm100_fp4_config_default { + using KernelSchedule = cutlass::gemm::collective::KernelScheduleAuto; + using EpilogueSchedule = cutlass::epilogue::collective::EpilogueScheduleAuto; + using TileShape = Shape<_256, _256, _256>; + using ClusterShape = Shape<_2, _1, _1>; + using PerSmTileShape_MNK = Shape<_128, _256, _256>; }; -template <> -struct KernelTraits { - using MmaTileShape = Shape<_256, _256, _256>; - using ClusterShape = Shape<_4, _4, _1>; - using PerSmTileShape_MNK = Shape<_128, _256, _256>; +// Configuration for M in (16, 256] +struct sm100_fp4_config_M256 { + using KernelSchedule = cutlass::gemm::collective::KernelScheduleAuto; + using EpilogueSchedule = cutlass::epilogue::collective::EpilogueScheduleAuto; + using TileShape = Shape<_256, _128, _256>; + using ClusterShape = Shape<_2, _1, _1>; + using PerSmTileShape_MNK = Shape<_128, _128, _256>; }; -template <> -struct KernelTraits { - using MmaTileShape = Shape<_256, _256, _256>; - using ClusterShape = Shape<_4, _4, _1>; - using PerSmTileShape_MNK = Shape<_128, _256, _256>; +// Configuration for M in [1, 16] +struct sm100_fp4_config_M16 { + using KernelSchedule = cutlass::gemm::collective::KernelScheduleAuto; + using EpilogueSchedule = cutlass::epilogue::collective::EpilogueScheduleAuto; + using TileShape = Shape<_128, _128, _256>; + using ClusterShape = Shape<_1, _1, _1>; + using PerSmTileShape_MNK = Shape<_128, _128, _256>; }; -template +template struct Fp4GemmSm100 { // A matrix configuration using ElementA = cutlass::nv_float4_t; @@ -71,21 +76,22 @@ struct Fp4GemmSm100 { static constexpr int AlignmentB = 32; // C/D matrix configuration - using ElementD = T; - using ElementC = T; + using ElementD = OutType; + using ElementC = OutType; using LayoutCTag = cutlass::layout::RowMajor; using LayoutDTag = cutlass::layout::RowMajor; static constexpr int AlignmentD = 128 / cutlass::sizeof_bits::value; static constexpr int AlignmentC = 128 / 
cutlass::sizeof_bits::value; + // Kernel functional config using ElementAccumulator = float; using ArchTag = cutlass::arch::Sm100; using OperatorClass = cutlass::arch::OpClassBlockScaledTensorOp; - // Kernel Perf config - using MmaTileShape = typename KernelTraits::MmaTileShape; - using ClusterShape = typename KernelTraits::ClusterShape; - using PerSmTileShape_MNK = typename KernelTraits::PerSmTileShape_MNK; + // Use config's tile shapes + using MmaTileShape = typename Config::TileShape; + using ClusterShape = typename Config::ClusterShape; + using PerSmTileShape_MNK = typename Config::PerSmTileShape_MNK; using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< @@ -119,22 +125,22 @@ struct Fp4GemmSm100 { using LayoutD = decltype(cute::make_layout(make_shape(0, 0, 0), StrideD{})); }; -template -typename T::Gemm::Arguments args_from_options( +template +typename Config::Gemm::Arguments args_from_options( at::Tensor& D, at::Tensor const& A, at::Tensor const& B, at::Tensor const& A_sf, at::Tensor const& B_sf, at::Tensor const& alpha, int64_t M, int64_t N, int64_t K) { - using ElementA = typename T::Gemm::ElementA; - using ElementB = typename T::Gemm::ElementB; + using ElementA = typename Config::Gemm::ElementA; + using ElementB = typename Config::Gemm::ElementB; using ElementSFA = cutlass::float_ue4m3_t; using ElementSFB = cutlass::float_ue4m3_t; - using ElementD = typename T::Gemm::ElementD; + using ElementD = typename Config::Gemm::ElementD; using ElementCompute = float; - using StrideA = typename T::StrideA; - using StrideB = typename T::StrideB; - using StrideD = typename T::StrideD; - using Sm100BlkScaledConfig = - typename T::Gemm::GemmKernel::CollectiveMainloop::Sm1xxBlkScaledConfig; + using StrideA = typename Config::StrideA; + using StrideB = typename Config::StrideB; + using StrideD = typename Config::StrideD; + using Sm100BlkScaledConfig = typename Config::Gemm::GemmKernel:: + CollectiveMainloop::Sm1xxBlkScaledConfig; int m = static_cast(M); int n = static_cast(N); @@ -148,7 +154,7 @@ typename T::Gemm::Arguments args_from_options( auto layout_SFB = Sm100BlkScaledConfig::tile_atom_to_shape_SFB( cute::make_shape(m, n, k, 1)); - typename T::Gemm::Arguments arguments{ + typename Config::Gemm::Arguments arguments{ cutlass::gemm::GemmUniversalMode::kGemm, {m, n, k, 1}, {// Mainloop arguments @@ -167,17 +173,17 @@ typename T::Gemm::Arguments args_from_options( return arguments; } -template +template void runGemm(at::Tensor& D, at::Tensor const& A, at::Tensor const& B, at::Tensor const& A_sf, at::Tensor const& B_sf, at::Tensor const& alpha, int64_t m, int64_t n, int64_t k, cudaStream_t stream) { - typename Fp4GemmSm100::Gemm gemm; + typename Config::Gemm gemm; auto arguments = - args_from_options>(D, A, B, A_sf, B_sf, alpha, m, n, k); + args_from_options(D, A, B, A_sf, B_sf, alpha, m, n, k); - size_t workspace_size = Fp4GemmSm100::Gemm::get_workspace_size(arguments); + size_t workspace_size = Config::Gemm::get_workspace_size(arguments); auto const workspace_options = torch::TensorOptions().dtype(torch::kUInt8).device(A.device()); auto workspace = torch::empty(workspace_size, workspace_options); @@ -188,12 +194,40 @@ void runGemm(at::Tensor& D, at::Tensor const& A, at::Tensor const& B, CUTLASS_CHECK(gemm.run(arguments, workspace.data_ptr(), stream)); } + +// Dispatch function to select appropriate config based on M +template +void cutlass_fp4_gemm_dispatch(torch::Tensor& D, torch::Tensor const& A, + torch::Tensor const& B, + torch::Tensor const& A_sf, + 
torch::Tensor const& B_sf, + torch::Tensor const& alpha, int64_t m, int64_t n, + int64_t k, cudaStream_t stream) { + uint32_t const mp2 = std::max(static_cast(16), next_pow_2(m)); + + if (mp2 <= 16) { + // m in [1, 16] + runGemm>( + D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + } else if (mp2 <= 256) { + // m in (16, 256] + runGemm>( + D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + } else { + // m in (256, inf) + runGemm>( + D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + } +} + #else -template -void runGemm(at::Tensor& D, at::Tensor const& A, at::Tensor const& B, - at::Tensor const& A_sf, at::Tensor const& B_sf, - at::Tensor const& alpha, int64_t m, int64_t n, int64_t k, - cudaStream_t stream) { +template +void cutlass_fp4_gemm_dispatch(torch::Tensor& D, torch::Tensor const& A, + torch::Tensor const& B, + torch::Tensor const& A_sf, + torch::Tensor const& B_sf, + torch::Tensor const& alpha, int64_t m, int64_t n, + int64_t k, cudaStream_t stream) { TORCH_CHECK(false, "Unsupported CUTLASS version. Set VLLM_CUTLASS_SRC_DIR to " "a CUTLASS 3.8 source directory to enable support."); @@ -271,12 +305,13 @@ void cutlass_scaled_fp4_mm_sm100a(torch::Tensor& D, torch::Tensor const& A, const cudaStream_t stream = at::cuda::getCurrentCUDAStream(A.get_device()); if (out_dtype == at::ScalarType::Half) { - runGemm(D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + cutlass_fp4_gemm_dispatch(D, A, B, A_sf, B_sf, alpha, m, n, + k, stream); } else if (out_dtype == at::ScalarType::BFloat16) { - runGemm(D, A, B, A_sf, B_sf, alpha, m, n, k, stream); - } else if (out_dtype == at::ScalarType::Float) { - runGemm(D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + cutlass_fp4_gemm_dispatch(D, A, B, A_sf, B_sf, alpha, + m, n, k, stream); } else { - TORCH_CHECK(false, "Unsupported output data type of nvfp4 mm"); + TORCH_CHECK(false, "Unsupported output data type of nvfp4 mm (", out_dtype, + ")"); } } From ecc9e745b09a654688bb615fbeafe8be8d34e961 Mon Sep 17 00:00:00 2001 From: Nick Hill Date: Fri, 11 Jul 2025 17:42:10 +0100 Subject: [PATCH 017/552] [Docs] Data Parallel deployment documentation (#20768) Signed-off-by: Nick Hill Signed-off-by: x22x22 --- README.md | 2 +- docs/README.md | 2 +- docs/assets/deployment/dp_external_lb.png | Bin 0 -> 86128 bytes docs/assets/deployment/dp_internal_lb.png | Bin 0 -> 69309 bytes docs/serving/data_parallel_deployment.md | 112 ++++++++++++++++++++++ docs/serving/distributed_serving.md | 4 + 6 files changed, 118 insertions(+), 2 deletions(-) create mode 100644 docs/assets/deployment/dp_external_lb.png create mode 100644 docs/assets/deployment/dp_internal_lb.png create mode 100644 docs/serving/data_parallel_deployment.md diff --git a/README.md b/README.md index 3e6ae2acab2..c4b14685526 100644 --- a/README.md +++ b/README.md @@ -69,7 +69,7 @@ vLLM is flexible and easy to use with: - Seamless integration with popular Hugging Face models - High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more -- Tensor parallelism and pipeline parallelism support for distributed inference +- Tensor, pipeline, data and expert parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron diff --git a/docs/README.md b/docs/README.md index 3483567f1a2..6823008ed33 100644 --- a/docs/README.md +++ b/docs/README.md @@ -36,7 +36,7 @@ vLLM is flexible and easy to use with: - Seamless integration with popular HuggingFace models - 
High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more -- Tensor parallelism and pipeline parallelism support for distributed inference +- Tensor, pipeline, data and expert parallelism support for distributed inference - Streaming outputs - OpenAI-compatible API server - Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators. diff --git a/docs/assets/deployment/dp_external_lb.png b/docs/assets/deployment/dp_external_lb.png new file mode 100644 index 0000000000000000000000000000000000000000..a5d3a2f31db7b1bbb48a1696014f9094efd54084 GIT binary patch literal 86128 zcmcG$2RPRM-#4t1C`3b)k%US{_9im23nh`r9x2(oh)S}OmB=O~dsLK?mAzL|_KHxp z`*rI7zpmf)T=#Q7_kAD7a~{`m{rz-0&+qp$-tX7?eD7kP4L+Uo^rD)5J4ffwOhsC?;kvdg;Qx`i z`P;0swczCp%|&-fX5R{<{$H*!{brDk zYTT!4BY~FABSYeeT7Lg0f6CKO{(o}4|64cpfBB~DQ8gl>q`lPC%&R5Y*&$Tr#!7!a zoYB{hee&c<-c61ZC&kl#GlM`uh6mugXW} zE}a}xVj=6-B=Nygi1w%+nG+BYkn6Xhn8%#^QaOw@P4~SiTbUM1e?=G)>BNg!SI)jN z>$Wqf{C-2&=;u4nwGxX&V{yiyI#XqbopBB9zKSu0(;G{7cZ9N5(Yi^C@=5%Cg#^ng#RES zN;}cYxA{YShDO8Y58fFv65BR^%GkN^cb@)#d6SMfqk883h7lo~$C(8M7rNh5P*C_% z;s43~WA)4ZPAe_F;^$7dp3(9jv$W)XzS{_+W0c!v^fOzhj=w;$o3NwyD>wa@L1}Jn zef&Fz@MrvZYv=@Bv)_v77op7?({R~Xo!jm6;f9@Ef*7@l?BCms*Ln2l5jn}V&ew*m zX?JMF#KbU(LuPH6LbhK|8kF|-_Uc`}{KoCi^pz`DTwGl2*;&c{wdPEUoPK_NrT=EM zUM4glA%U7jo?hI!@#|O5AD;s?Q`IEA)>rRQv)I|$jjB!gBwbM?Ton;fGw-QWrvwEl z$;hGvte)iO3-pcg+s+NwH9V6Lvmb5zUJ;N^GL%`QwN#Y-WCu`?L z0|T$Ys*vU7<-5BM2nq@+NiKfy*&!(@X=-Y!tE($7FTa;fMNv(SMk2nTz!BHcn#r?k zR{+H!LH@oYDu=1)MeIH`HJwveS65I-_1ajUXn)DU$+@P{_~&PTWktm$U0p8SJjKq@ z!NI}l>1R<g(xg_6g^YS?7){1115>eQi1BO$kAg%cHg2W%F}$ z6V1uBabD|m`}dn%yB7B3iM-9Nw?-8Y(lt{bTp8@nw^DdSf2qK_D@bZxG5@5_>#K`1 z1GN#{{n^(le^2$qeD?S?NXGoByj&);uemvt>eu4z55f5>w|OKtmsb4ISTQF_`%w)C zhx{A2A4DnyF%At6Q!+{F>3pG*5vH&I62)g_X^F)mPg?A{aDw?$ON-{>7a9tR!2KK- z?p(QDTB^kw**!KDPc~a(|U*CnLSMKzm{afpY zZaO)|NqMc${;2%;@nekI9=F9AW@h7=Zy$`Nw(X!i>bttoVK6AQ?KsBO*44Wa|C6qx zy0@yKAtLxn%MOE`#CJ@4ddU7&29)y68lK^PF-g~(lg_53q$JpCYim=>h&qfPbyOOu zk0oQqjFd~=_2Ju>t(i7)Mqb`=qFr_JJX3}WYNg_>A3Xc_?=N@iVgC0gnwXnAe|vvh z) ze#tN>^Bk1&kxGbWptO{F?YY_Jq`Sn$#ZBv?!-IowjJFSSk7a zZJ*vGhK7dvCH7?G(Qh3W6dVzzGS{s==m~= zdU|@QsuT`HaSi=g*P|64?(U?_E5D}_*;3Da|NdP?Rkh;LK9r)Tm>&$PTwz$L^Xzx- zvQ=J-=A@LBXZd?0ZU5fWw(+r3yAE)j0?)?bo+hzgqH8BzBW0tOJY@BI-dHMG3 z+fLKHT_p|^Z7(ij7^i!SubG&z{wb~9C2Q{-a9&T(+Q{fgoA%c-udKvGmbT;Wl5te! 
zMrVbjkIMchSUG1LRJy)A(b(Ad`0>7R!*_%_JM|VTr#xSeiH)2>7a7iY9>7#XW`dk3-QZLXRaP`acC+ z|CiY0e}(T(G6u1?e;S#K5mdTx`7(#iH$OiTnR|YIehGge<7mj2+xL#RX&D%B%Sz_e z*463fTZWtO^8a{0FmS)k?Ch+|!g%=1(DK3ryK3}Ncl=kIZP^-4>JR&;k6Lbus#T5| zr51VtcYL_2C>*c)$Kej!g9i_y@ok!#EK|o9HeX7lN2UJ-VHM-67`N}(u~+8&`SVHY z57f$5e+6`AF#~yIUt69ZV{w&sb}kxy6A}_4%>c) z_9rW%e$s~zZ%iyqbkI}!EBaLa$nLUER~wjH@};5)Nd5;(O{ikL>+4IrV@GOIl1ABK zEjd?L(TLi<91~?lMbf@vOjTNtmHTCse0s?L2 z<#(pfnEU*#BGTG)w6yj!W4)cb@i%yFf^rQ%j2_ngojY4g_7YZA{vEviF*7@RWm}to zxcJlQGx(+S%XmO~llRv{@gEN~baZIx>7SxQIUUJ64Ukgf-FFm!yYG2?{NbZVj|vEc z|L52K{Tlokk%qSOIyuoXpMZc}Y%zGdcdw!Q-x{M$ckp0IQIU?WZgVWD&1T1%F`%ZV zwofHuXkA^J%O*Qo^}fBmfW`K@JgFg|r*39?Mn(qkCpKF}=qr9pYwM%FJW+MzRHtak zn_P}3a4)%+qDAE8UT9>w!$>XDnyi2lbzYx_C}|HJT~|mU+17(~aZ+*9`-t`pVLC%= z7cl(U!t`n_&Z>u4(WC!^5uNFNYeM>;{O7@B$g5Yc@~Vlv^JXS5YOrZF#s3zje>CVy z*oTCoSrZ%+^Tx961RxS#CAYSi)0wV&^UBG~H^*-E-ucm!h$x=i#KdH5Y;0+1$)26C zb1k?31+?B7O6uxnJhZ1zpSEY0IrQHVaQWt6)6mf19eP|5o0HSJGvDQ`!q;+Qe8w@K z;$oK@H+cFk;ZxHNjZ?ccOCD2?CCuLD3^+PErl+Ul@Bi`f88z?U$NQ*{e8y)JUYmg1 zn^#LqD=9g7|Gs^Vv84EI+J0P3Jh_O72yTC1VBjAYer%g#7%}G8x4gU@bv7@Lzhv)o z3e*-2l1Vl)d?4xe?c14|neom4_}ci4|0|$#|G3o`Crkw zqNb*1cWuEAE-tP}Ypk-yQ`udUG#!mK?gsu#rnn#JnD_6cNqlauEjx@S2L@6IQf>); z0IC@f5RjbA+IPQ^WOGU++OrHAsGg^%TNoMfntq7{P!;!B)|&i=uE`hO*w)6tYpkfD z(fG#cr!M)$OP7k>{#@sZure}ANKUR8xIs@(FC!}p6!a=`z1nE9GkY()+T))BFJ8Rx z@$oTu=gQh&yESibK|z6qg~j^%I$?Oy*}vlQyy;aVc?>_@qxKC3mJSUKot&JU>M8sQ z9N3bg`r*R|BcJ^MOmTc>lnV~Bva;vSodaqyH#bMm1IPyMmE6b5!UEv+1It_gjibZO zn}GGl(HXydA>X;P`<SwmfUcJyBQ}C( zu3Y(A$>gP{t&K&LXZVrC#>OV&U3T^k?bsS4#>&P z?Q!1jI=6$Unn1@NGYbht)DFp?IfHfRz4L~nqnffZpP=B^e5(!&04q~wdU~swzcg<=QV_ ztkKW`9;n_Oj%(N$@MoHwnsUI)-hTPd&obbS=H~MQTa+WXOn7MVHmEg?EiIRGjLF(S zKE}q!Gcc7|SwDw|pM-=k3lg!Gcn$+elaqjcoRpC0tEiAOFzCm4b#*Ou73{jb^B za1LaFY!Wn;$Imk}@$nK;s9seOXQ(eUAeVa?;?dv*$r9|10FAgv#6%{SY?jR+;1ExPS zGb3Qtj?h91sX#Gb<}=Mm~<}`}Fj66O-_$s7vS1%aDpVjPGMM zu6VFNL9^lsw}Bd|h|@IvLa1vka5RA-zPj=*BRN^XX6}A@a#SeG32L=YBmZ8VgUGaA%wY_%jn$;ai_r+kUX8UVI84u#y zJNeyIAVn<9&ofH69_{Wd^YR+#u;e>zIhTISJB>$?MmzNq6r2ZdZ)17SEfe(5R8fFVo)M z{`KqE)(&zKAKieT)6-uufxGon!JK`AD}$IYo~UfhNn=;8v@W0uurg_0x^xMUH&X{& z!1w3*5kZf2r+^i|bDCw6##lgygsx9^Wc5Q&c=_^W`irIIWkC<`APuYKPJSh z7>KEQCgDb41KrQU!dR5ENSQaHKDHOy8Z;RVkF})Uy?b|0y`?JC7Oi`+vHD;{($Yl# zZotwjc4N(0Xn#;ECER}B5Pn)qKHY;S5he&WQQc12V+g97WQ+qbt+)9o_K)=A*L+@e6V-{|Vq z_d|9S7or`GZL($QRH#zWcJ|5;Z#|8|b>YH=D_5SSqzvQkpK$4)h@g^ioiB2nicU=( zK{35%Wb`yP7JGjd`puUwUr-~lVZ@l3pR317`UXRa@!VMZ1DwHxKYOsBqr`D)`KQ;0 z*vXS$s;bQUOFdfKQdFY8Pfal#Iy5moEs4DewFZ+eCnrb2B>D8j14&~DY!O`gx;e(x zD9Ae~4$h;N#|hswLS3$|uFg>nH{^|iI9K_IVQyw7Qq1Z4&6~5J#LyIRK}U`p!8E>o z`}XrgI&fY&(xveiS&*`ZM@NNS=SOK)NQq8tf=*dExr%pge^BQ{f9+5tsw+XS2|Of} zmh}dz#EXkrs6y6F339(Y*w!ErV3!_XU|@iL7!%|6<1+;$Hx7;$Xn$W~gw*XX4xxKQ z#l(2w-cjh%U%e6(5Lo#A`|Dem+)=5oVeIOU4ho#WrSb4+9Hs((F@vUN;65wYGPIr` z|A^jYtT{P8J|68If{L5l3bYG|4lE)fm#p{bPE4Uf0B794e;@Povn915lvVk{OWlQ; zfsc^&Wc?{E%*`Lt@v-sn%%EM@)zrBD8mz{S#isAW{_xJY%HJ2OU+(&0x68)*JLs}n zsWVINzv7s$-X-}@Yeeq%=tNc-B?X1**|STHUo)rnKKs2rqr9$;Q$Qdenj(uW4Y876 zoT$dN)*vSW{b&05FY@<4hte;`%-U4gG`n}d7U1Vkl8~dlLiR5YCT(ruUbR+r5cBJp zqu8-W_U+r}>+6dq@Ow2_2=c%ZUmZRB)P7~Y;?ZDS+$ zYuNPPI&XZ*e-Y&goTqlkj^)UaQ8h3DqSl=R^zWCGlY^SHy6XN9YKec9)=o}M<)eKF zw)iiiK)9rTSVIF#1gi1>WH~+`sMfnq5wr&r9wDLKl$0rAyYP1oEL5bvUXCjNGM2Q$ zH)}}7-MiaSL;Na!%zlbqboXqLueP+ZN~Y*-Y?LN#YHZZ7|1aXq_{4-s)efSh=2Ug2 z5=|l^^Znbl`TUlOjj8-`_8D~tm_JQS9gSO1g=I>%T=O#+DCs&6{^5#&6%y!X#CD=| z#l=X?!&Ll#UxI=U>ZcolN_&IX1 zvL;)gS6;*Z#fHPj$a)+(_US4Qs6CrXWPB#^SRU{mz^6NkC*Hy4dz6i>{;5byP0je% zcW&sqjEszs?h?c*caOKEs&g=nSs5GKuPrY;m-Rm(%1RU*Ksjr;hp3V3pUObeVJcE3 
zWo3vdStz{FU{S*afIcPsJ`Oa7KI4TTyR@?6^}EZg_0Q4c$1S?^Ecndoq5Q6|{A?{W zfeLS*db}FwLHAZ9D3+t6-~&3M{dXWBZ&JsM!QNzGM3Z)dlKuM8I`pV8-y8x_; z^{%fN$R!@8jr*%+#M@z-e0%&A7n@A7b#M#X7tZrLL!pj8?c-ZP&YT1sGG+*s9~?~S zi?V1`RMbn6NmvEg`RKiKBMkt?r3txPg9A`0!JXi6{l}8NNkT#b41tfp2Vfxga(UU- zeEM`oe+MIzl;`A+AJ?p`F6-%??EY%)AY|2k(qxD=0MpgZq<8*2M2ay~S+n}5BMosJ zCyx^a@7|Sm;OnkQ zOeNrf8@wa5%Zui&M2$2$Qh*rWzJI?{Z5rP^zpwyW2Iaq|w)Spd>#I_J)8xbGcX5E% za2{yM<*n9-EX~dNFHh$06=(eUvuxvfU5M1$d`sE-0{8d?xE~uNA2zmUBDX&G_ZuiE zRMyrihgfX&ez>XQG~U4n^%mb!kd^(2j)RTYdEMGN(^FmY7#9}+9Dn|t9fy=3h^zU~ z*8RH1W@d#sEEu@*EbxlSkr5}fa@;MH=^TlljzkPk|4Gs1`#g8S-Kf$(efk493kL@W z)IT)Ta~Cei=f^}xH^)Twu3@Rca-wp4z3@t}nAdgQTrR2A5lHQ&_%AFi)kD~0cnw8U zdy98Tf=2Vs1Nj>`+ipu-tfsOC=Q|8vXiCEVjBDv(^g`Y<$d=lsSUi z;E9{$ujXafl(aO(=TRuae^I$hcz8Iz24tK?;n6-dzLwV3NwD{`XV0P|=jG>*)oDD! zx?X^v1|B;BfI2bZG%k}f6Y%dWKRV(@{&IE)3*ki@KCPjIF{# zg0!A)Hs=8REhhE`?=-6Tg4y&{R!wWlKw(0WXbyH#{xvtJtfX}2GF+d`RkY5uKui}3b>o%Y=YG1r4bmQB57$9eh z-b{65#qs7QRG`Mb$jBIOh~w_v@f(=BmCY3IL>a-KdERW$I$JOkqsMzkMryz`6>kC zhEB?J6;lFq%MX9EsSs9;a|!h=c%R|dj!Nkb!k)tVKR2NhZE>{%GJ zF*hm&YhaarEH77kDtyYM=JD=Wf`o&``LU&C4*#O2rj|3r17QQ87Vz;8AZJrkQ^za) zl|N-0?_et3hjtLnZ&4R5kfTo{eQ()0Io7M{*Ybi@HC@42saoS1@DKw0u(V8)NG>J3 z0N;rJoueV!*OmC~+f^AwC8Y=RXSclHG3)#mo3Z-$PyM>b5fQs0BJ|!mhfLe=Gl%!+ z8yp(ChpZBHB!J4u($aNdT)UWf?MILlL`EBH>$IOYu+pLCIL!^K7VFWAvW*)?14KU; z3mJxwaX#OF@FOw`e+5wJ1x0yz`+_txK?8yO;ZSV)b4*^Vr~2rQQ)w6ImwD=!pNjMN z-F1804%h_Z5)xsdq1}x3Fij1hv%su&fjk0+-_hFomu91aPIqP}qUTL9$0ugz<|@O6 zBqfa(qwXeX(B70U@lSw-$R#Ao2Lfp9?i*|TQdv3DoUGjX<5C*KmsZ+*{$oJvzYIAp zWC}Y@>bSTRr>3UTADD*=U+GWDfJP;T7M*l9Y`VV;QVxNSqKqUbB~@|kv?oUI(Bmo` z3(NWIhBQ7mLGr-FgpeotN)*pBGdCxOO$4O|H$*~7pZ1&S&m#u0_U>-c6DKk%Or*AK z*#ez3KR+L7idH_C!DK9a?2C6`GIcRRLe`x#sDiOlUY=MG01O}*$?eCtcpGn8GAB-) zXl`!q?$I?QlDQWiu8o?Pz2iM9d}2aEihA7i;NU!J_&}m*YY$e|S!K=oV~`ji=77;d z{*~HT&F$r*1GItALTv+x z1kHpj+Z?;A4my*^U==lJGuGy(iVCO&>1*G}O|UJJ92Ur#jlt-EoA`Km7CXw;^HC7w z1{1Tgq_9&=_?>3V46^VPWqDX+DK4nV9iHD6By=)tJ9 zV6ot`ppm3P7F5M@KvJtuB__ zdX6g4h>KyMr#CV(f>e{XFm1#1lup!s6slFl!9v^qE0-==z~i-suK=cKbZED+scEVE z()r0=h|cTKC7}pn8>4i?7FSbId5R3eo$b4jrs$pYsnD&itYqispNEABVZ^d-R*&ls z!Ugz)LxR@Mz_I!Ml0<Kr%w>$ z*M8M<<4T~gQM>u2kb|mzYWMW@_b+o_0zVA{^(g<0bk7+%x%+K=SiZ<80IHThfvifn zT);vxDEv>J(6j+Fx8wf^@+{o@0PfOYkv0(SyMBIGAoeZ<71-O`i{Bc47#C*%r!uIr zC;OWJm!>W8Q=8E^QRG1=I5|19Q!AXLv5mTlZYE-xTU(zLFm3++jo`OK(}#yzvV7d!+`PQMkZL$z@qKOy zf(Y;rIzVFQ0)qxVe27Q} ze1~*-=pcY#N>@a}A|ez>=>#kvYYp7-@gZm-5Izpl(%yS#a`h??E

on4z-0eO_W> zVsdiwYl5wC)Ysj7XD^ftGG-fC$z3@nnD@kS(R3Xh9pD0(9s^Kc@H;DIO1lFvaARXt zooM38;myM(OW3d!JxJgf@#~<=C=UMQbTWPGj>cF{yJ${c&FzAsg~53pQqI$xhY`6M_i%l5*Bc3dneGioR*8wL~knTE&khire_XS@SiSvY>b z#APleHI+TATHJHB$U5de0$7WSi`e|ngm0E3c~rnZ_zJ=qiX=KUIE)l3KB}XvlYG4N ze{st;uXq@>91VIFa{xpQB;6ECS(lcbUQ{GJelGN-6Us0!NjixcddJ%8Du&wU&K*hn zFOQFFA{msBm`KJ<$f9Iq{QC84kHmXKE~tQBnd(YyfBzo7Aq;2_Eee~w6Wt8#eXM?$G&2F9dPiU5q1}3$sOPR` zTF*4@QpxQ*${GrjJanaH%_p3fe?tSroRaNb1vNm+QB@TQxPPAxy9E1wbPFmTt8ec+ zx5BQLjkTX%du7nNI&od9&%~ps?BG;%sU2u;YO;_5X(c4AFdrNzPrkh|a0liTB*fhe zqWrgpRd(#~=e%6N-$!{!I(bt_sj90RvMVN7?d)3_YO%h-=0)i1fpwy}zK*^I5Mudw zshR89ebmXoz`jCTChcIwZxa(MP{;a;Z?O|(wH!KTEKbpwrG}(C*`vesjeL}e< zd}CmG+7q`3NGb3vplaX-)Z7vP0yHRgR#uegqfpA(*}YI>AaB4mvq4oGt*36K530n> z@1ZpRpE%ZMQW#8)t0*KwIX837NQ?{=kjD!kG8#~UT+;zk;%3hWTk5c6!5s4n3a;0E zfZmXmk#P`L2|)JgYfeGIGakb`=DWhGpQ0XO^-GA0M{}OQa(WUT?tpwCe!JSgv1S0x z2lWeb5AYkWReL7H*_z(tGF7mfurlnier#=j7U(IkKBPTm1cM<)&_+p3t@Or#jJ^Fl zz6+)tk9Fs3cz1<`oZa!^%`Gj(deiux?m1cNPV0g}SHHZ0(pR#!I%n#_zYD7qIkc6P6>i-; z^QrFqZ6xIJVhTjZH)9VAz7#}*$N2al%H?^^ssK(7ZiA7}ynJAjE<^yt?O&O|G!d#L z36-_AHTux?uW!4$Coh)|+-P+s3nYf5g;yM!2=ZZR<-;hl*p@J0>{Ce&Z=c*j1kc9M zkb2*~)+YjQh+YEl0~ZHikvwFQDn;qQT9oAF<0E6{6~B3&8aNsPkUMCni~f!8pYEey zddv+6zU4}@1IsSLe;NQ}p;qVE9N5C2;CHM?SR=i8v&+H9gCqf#4@Fhr52A!4gM+yH zt1|iS63#Q;R7TgX$vPeJrP@PF3q@u0$B(_>{|LPNzFnj)iIh_4)2AncCba-5x^RIY zJ7f=r|iUHzMAJVJ7<_cnMwa9ffEP-<6u4 zzPGcp^YgCrEPHIcc2W>=BA{?yQ}c3eM0`nko#39mrDI88I)9ShaG?}^r+aEq}BWO?bAEr*1ZYgQ4JS(ZXd4N50267boeu^FH;3Q zZdC2e?cTmVD=RB(E+r@ohXqjl|{7rm>k4lr|nxcFI=Fp z5DGBcXAHbTtXXejjDQ&dvo=kMPVOPUq~--Cd`lah#oqP6KQO)rXF z5>Gtby62!ZS))9P|6BX9&^)u3dd0u+B@e20h>~tzLZ|AU!AIRhh$yvx{+t5#W}hmX z3UYp8ei?wn)zx)eLQ<9#D;M&pwr&wJGN2M1oKNlO#T=~x;NWnt{qC{7WoM`B@&xS- zsW64F+;0LQS|TQbROqOhuiz%`hs<9UA$_>WW5rQT?GX~xSUO{B{%S}Pq2Pj)YU8DP2%>yeY=5`uNotW zXpf7tGrfdsUd3L5D>d202Lb`6cAQy#K@%~wJ$pi6q^tuFou0;S<56$+D^!MH4=A<`Z1F{$**9 zKI~5L;7|cg?Q&yYK3?9mtX(^Iy1TpY6D8>c=73e;0oIGS6@Jprr7K2X13SDI6y${_ zhi(Vu(zdT;q01}|S(N%{fe8p6)QoVtPpm>XZ9$6tIg5dY;pWy3pGQsr#btPS7&#;M z+tS~D{I~@LK}e_&$U|Sh@4_?1sZ*mvL&;T>TAf%cDDmJ4WX!loxT%9uJ=z@iuuBn- zK;uNE!B}0=)J!bWvCq!Q(apEiHZ>haXTf}$SzA914wklgO&E{%*4BINFtI^K2w2hH zQUbN?6JwOkFevh21MNq^)4)z?&QJ}r&PdS0A0bLhGeU#KnE9sOd2!#lB zKiei}`_?T!A9UV2--Jh?fA)^0VYbf6Y&N#AuDo-+Sg022)P5IbHV@zsOIUhI81c@Y zBDNRCSAx-C0*2?77~H#5u3r6yV*{wzV8Hm+w&lHZb8}D=paM=$OuQ(XZ+!?X2Tc;X z9u6FR9hL9`Zns|A7dZNP{vC?ZdJG;7mExLIPc8t9V zsof`|?4v;qc|)a?Lh`8ANMP%6T@+!ubtf9$dJPn5;G6b#sM~x{g2tdPz{+14O$@>p z5aPLQ#RMMt;>FyrU*qt|?lfyEql5ScLwmX;GVq%aFgakbisFfFfjuqtc<*C*>_bO8 zJ4;7#rNdOMZEdR2{AWefiMNKcDbo=X9e%m#O4`GIPO*iacM%y(V^9X)VABn*Zg#e? 
ziFxZ+l&{QSu!LSEmFCV-ff$Y+hk}(@;eBX*U>=bTFi0=mlIajnulL<8A6 z008K78ma%+rVgZAA?I8YY0#u#cU2h^fFQ&a4=T_`vCowX?dme^YW$^MWNIhSI_4r% zmiU(36kMg65ZSg$Au+?sQF2pyES!s`EX#SLDqtWa2qpE7NKHfIeD+m@b^+$}6SONH z#DQCrrqT+=uvxr1)h~%~ju#mW!|9ws?GpiWmVOS(8imT_ft7POUrR4R9$0xr#vAzR zEj>Ll#D-Hx%naI@TbRrEDNLEFWiOrw#vhspSU}c;@^hd5Ohd&c#wMl2Y4n~2joEYP+?2;nG#$W`~`As_y(^n&x?zx1I)lL1*Ym=eydM5TyDI(5#5`sr0r zEXDwtgJDjH(r(}6XoR&FkXz?Ald>8Syu}K4hfP%eOcqZMLq0BC>yI~`jM))Lm@}IE zPL12BKLc$2aSJiRJYRlj|CRrq?yp~LzWPw5nce5LVV(W}zF(WBc9+Zf`wxy{+JcOV zz^DW@)^DZv?(VK|6J`Jl(yq^qOGm_thl9*=$o}t56(vTe2vy)T)c&Xb>XvbfsLoZt zmSjKHuMM9gsuJuCVxd8V#0F<;_ay#J*B&s@m%!v%z#I~gZJz>CI1iGVGYGvG{< zrl5p|r`=f*;N#mrzl~35d=>mMqv+oXGBPsQxZpcc!PxMyJvv|`x;BoVWj8n9pN|Q> z$t`KHA7ZS(Dc~{cKvnhj78tt*;U4t#i^|KRk6wd!0?_t`v9KE6NFg8t{gdCmWLB#f zZrZq+tONG9l=4)OYQdU7^0&W0-riS#g8%@YZZ^W(z%yX>5hdXMv!TJ=!=tIBq@A;_r7zG$&5?mB`v(eKSm3P>fu>4(%)$l+yBAe1D_Bf@i*nA-al zLPlQeY$evOg`5al5Iqc~&lpTqD!=o)!P)@HdP52?)Ggri>$|VQKEKi`!6{NBIHI4i zvo)oHxylxW<+7^PE~Of6+C(CW#*_2Y^y7u8V}s=a-QZ2oYwW?B=G_3~3fUmuDz8}w zh7PgGi-!k(x9|CiLXYf=8sL_*%R9X&OGcJnlqIY)+CXDi^&>+GH2FjLZP^KK)jpbZ zXgjgaXhc;htb-T_`hHkLkmR?zP`%s=h$%z0ofOHC3hz$<6^}KuOP;ud?YOl z+Q=;T^Zjln_(}ZDWSDhbVG^1No0f)#28HICCdlUOPs3g_4c1?^pNPlN`JYlntn~WX zi{+_=XK)z}hZ?o&=+I$Yz43Vc|H$og>K5;Gtl``qx+IYpqv*Lj^%IHe~Bj z@cvZ>7FzGkN2#{RG%y=F--fk#X?-Bv!bh?620|VJJUs9n%U z{aPqw$<1EV5=x`=U_j4B&f`It!EKKx_J43jq9FK_&NyN+ziW(zUs5 zunvAfliY_lQo1$jf_^a{uwid|7-|bEoa~Tt5x;{Dub218xMtEX-BF606iMdl|U9=h{41AqWQDaiiIM_VWpIKv6*m zK@|Y&LUcPYjE+l&Djwj#47C_V6}BM$0zz@6rtkqm(2i6FSS{fCA6K~5?(Os7`-rjK zAyiGweF1?k)&vEo$e*`cG|-cGcjpwBS@l9PcUTT z4TeRR5Wyh0xkRaKIgxkX`MZ;qQ=W|A#et2SETs9J`fKW2ZSFIcN!zPSq~FTuvG(rht390` z=hoMA8FXghvo2C9o)KV*iqg@Qfptj|5w}WqgVo_eQ1PjmlZA!spSV`*&01u)oSiTR z(o~@&4ZdX1l5O9mX5seRiNUWnBP_O6kZJ-1Hqn5};nEE4$nIUcuUV73pf1tT)~2BQ zG&W=2iO0eU7n0I$U|omc?dn=5N8N?T$jUV62s}4Hx?T#uJ@c|0W#r|T0IljbJQh76 zr_IeyPRva`7s1y;HLkP~iV?%mF8^pB>IB3~rvSfVw1YVt&d_S)N`PuZZvI`jUcU&& zh`QQK;Y7clJU=Zz(@SZ*g1B#|yB#&{m%GOVys>`%+NfYaiANca0ucuAO_#r4#c#-; zPAhhTOAl{rYx+AHJ+Mfr6JS#)aFW|@)7ow?CKUmnKwm*YA;RNl13(;ocm9w&8)>!x z>W%~fTj0VGpa5KOVtnLAh(&x@M|ZihDg&-8j?etf$HSweE&Vwj zUi;9Xw?DyI&-|%LO!oq@Do2fuRnbpL1PKuAq@|?`mqn{6uLM(qQS~5xD^X;HjKV#7 zAX5QbwZC7geVsCk*dOGp@g1oY5k^P{;TA=k3E^)g{(kEPuK2$9^}g8mjzJvgN7XXu zs4|2x_`{XuOR`)8yfJg%enSUM!%o+rG)Q@Ik7ncC-`x#2oF2pRv;EDN0Mbpuq6nn3K{;DKdLMPiyeWtYD6MZCtub zG%AeV=Dx-^epP7%b;CpYO2f9hZ_%jA0OjQGU(LmmD6AQX7@KCAn1&n_P=C{J+56?~ zz{A7ePFYf6aJN(FfsTo$OyUQItBUBba0Ct8#D|kN$eN-bZM*M=K!E~goZgamBgCQ1 zFbuVl^0Z%+P)zF}2~#KhL}Dk)cYtM**EC-FVJ;rD0_T7z8HL~+6hv0#7d%Z01fA-6 zDjg~uR$|l%poCxXlYl0t2s5cPGki}f=?%`VA*9601Su+TvX&wxHY-s2w?o~L*VWX# zitKSfm)wBagOL5!Y!5gcxFK(F%=uU%FDtvhw{TV`t>umc0J9|IFCOYq4o$rV}q_FL_6TyR0GoCcRAO`JC|&JGO<_SopOJ}W!&qw-?6WRu27TP z!L`S@#O-E`YbX7v^Jiir;3pCy6SiNbX8ZeLJnZlHToSbYm#e0mFu*b@z5|@Lm}m$O zU==|6Q}4LE@nY z8qyv=(s7hvb0R8iZwv-Ofl*Mjs-i230Q3eIbh3u_N>C0|CYVV^C5#iUsf*cB(OUht zR1sZMCb@u3$&FlNTa@JcHyxdUm^B@*gy0_78T6q z*Bm?1?Mv8pe-9A+HV2%7d#_gaXA5iGi|*+8>17M{mu#34v5H@=!A=@%5tu~8zxHBv z<#-;1RQbiUNL5)NcD(0gT9^=Oz}$ox&M8%T#}Y1OV@y-}5fn88X`BExx10o}$_vgB zo{*TBXs>_DBf!DoG6#41G(N*%W7bvZ$?{$98>d2pNHBQF_FOC2li$7#LCCEdo<-*D z#fI&pL|OJj@ni>?l?`Rd#S%M-dZ3@kQvf=jMEFvnfD#9!o7mNzO~Ko|3J)kV$ao!i zk@8SqAJ?=#gA7Ppd;XWm$3@n!n#7X1m{@C~4Zcdz_F^~)_T|)B^N>7zpm$J4csqLa zF4xJ+eQu?9R^PWJE6c#dgi2Uk19Zm_r6~_HL9rj-!Ar4cgoLnfPqvd~3mpt2$k^4x z?&{{|d(r`qc;qaxq6l^fs5bEIV;RQ>Fr&kiy($x*-Z^ZDN99GSF|620B1j6`uRq`W zK39Mju(yw`WdM851z6axBAsH+yau~o(%@f>re2&8ACbWdg`&E7|2K)yF5p1`)B|D! 
zL^k5j3>xI6P+(8Df*HjCHcDpX*fSG%;a)lBBPN=3zuZ%YxC6EofPYXc z&BsWt-qT|nyv;Q=2lJX2Ypvf5b|7YRhG{c|%|~NkJ+JfvGIUl`)7{I!T;LS~1{FZ9 z0A#7t3lUV(faD*b8&`SPE)1KR50E~51;NLQNqc=TPIEx+QG=PlWzM4k}^Msa%O+%3KkYP zfFUm^0FZnD@-q>?yHeDC`3EYjfz6?mb-s|7g7Wvnq1=ymLj(OxwvIQ%9wqFIAd3b`IQU{@l>PLu!wd&xc`PToKJ#vA z0wOJdN_Q~siR|>N`WN_eL>|f%V3mXC(6$s^Og(v6m0B&!8^Puxy|^SE9x3D}X3wgZ znT(&1S#B!g@NQzG=P28G&F5+IKgvn~RRH=d$=Dh&;ZJ*w1bg@TjFec%0CBC1Q~0z0 zh*@MO5m^{9Inhluq#X<tWmQ|_o*{J4Y^m&mHd@K`(h*isE=Fg zezF!g4OSwqSjCh#dZq^E`O6Vu#J;HU||2X#Mbe3M}p90ZS) zG^X=8`tn22Tf4w8c;fd}VX&R^n4-VzmRnQwJa*4EhdT6^To*pGIy02*G<;F?$Z26X z$@IWSVMf0undxPLtJI;3B(W{H;y{o3(qh|uX&QJv;@DL^b29cG%1;VzA6q2g_o)DV zFQ2ssqTp9BX#=zaK&wDp8RS1t`s3`6K;h}|gYU#Ot-dthO#e%7k(u>Q(hrrx@{*RA zx{Q7Fg`t~RZ1QMOU7Ij8h zk+NM9arK8dXV?LEHia=mNZ?0px^gUcDB+RXI7rvgC|XTC~&(;Vk#Lxwga%3F@bB{e!I5w^kW2{9{cLZR`*hs zgoLCt{fnh0Y$3*xca|BDO2HB)k1BLXsp1BX;1cK^oK1KlpF2g8JYKF*VysD@n*TcQ zH?&@rP5(Nr{e>6uR{dD;k5}=bpCB^Ck3{fv9rP?8yj&EXMhEjFN&M1C#wgq1@s9b= z3=CfgspmRx#NdR){)@tX&k>Xt4rsr`PvB~qTT6wwdPl_bH1HbRKHge~#9g1SRV`^ptKa0#_$y!jBCPmRZ)WF{ z-*Mi5vV*>{0yrw3m5-vEPoWWede(&Q#UcsNC#T_8F{t;~gGI@h_VO~$?<47~)L_i5lL@8%jA^RX6}Ua~%``#7z#0h6asF;D zcl?!5khuU-8Z+iFi&oI5ubzwIb&49%sv?6OZt%+xP}ZHY;mr9e_Zv#+a!OUBTo>{B z@cCBgc^e3ru&*N2WD4hMU9whBL&D)b4@0Ny^BT8Wjh&B1?t*df1Sboc{`h=E%TP_) z+fKc^BA)i(wF=Bk6!1|z&6?h0Wj{z-X7_8Bu>!OlJol=z_P(T{b6C+TH>xs_6e09lBv?kQ#=NR6-h&1_2R}?otG4L6B~S z4y6QX>2B$66i_;(yO9!vx5xkey?fVN_pSBf-qEE?oH%pNKKomr$Z&6HIH@SQ(~=1a zkLeCL6pgl_U-54s3|kBA3&_=H=jXlSfkubtj^w1R*L~`wlJmUB8iJ1O_*(i*>qvkJYPB*R@3|;To?8 z(?u8WM7oC0gvbtm@v8a-{4RH2gW_3_%&8&w zU!}43k4MGo-Hq=e?rqEL8@~d;TCI~j{`tRuJYgYU1+3DSc3E2_L4X^*MXztgYEiq8qPN)wJ-cim#mn(dL?j&;mu9s|dIQX_x} z!KC;B9Ep%}uIJo*7lHoEkr^6%{;9j0cW3?*4%ado`$IX?8(?B>GB(BR{{fU#Y>%Ov z9z^pfk_$mM$JHvyGdN zgk|JoC|%b{J!`ppb+wd*!>ziLwzkhP=?EG<#QNz7{p;)6gmZ!#S9c`_4n11R1bObF zr{ippBmX`DDP{g#aGp@(N54qMyH+ zS1wtljzStWyzb@rn?gl{_3b2xAeOK}W|(E1OguNXpZ9S%9fTdRYX5$dE@@y;v^j^@6r@z7i#*xN;Ff3 zSyuRTq=oQ$&}2SCLCeIrHw6a-&+fg0z>yIP0jno&O-Z-a5(n0vo&$J|nPqOTN|sZcZ89CElM)i1CNv8^`qQU=^L?9Ew1`ODs5U?QC?~pF zM^xJ}4T$Q99;OS3Zn)g$g~Z5K+>|~QAb&{vgov zAen4RrP>j^e$ZSb^3k|z<+<=@FV?A(=Suz^u-f#?vx+<3KSa;2GbGF|URpl2u8(j% zIc~VmU&UwND)|rZSj~Or@DH4@Bl}ee6oHNMvGvRbWlweA6sWMiU?W@6jQmpSkbX38 zqP=zz`-TB{oQ`}}cc=PWxxW;cfc2cn(+lvO7Cr=ACMR|FNYkwF1*uEk4GOH`wYSaf>NfgyHv=O>~t{-}i_6Z6e zbr^2}VV|Y^{En)6Fc>0}r=tBN-LvTuX^(`i|=dUNc)t*iU@MEo{V&hwf|1U{vu&8Kzt zd9HgZ$=JnxVY$X{$rqit=AfJ!7_Q3uAaIaG0ALp)U%xz~m$~|OR2FyRuy5!yvbwX# zkm;JU#b)&1VTfBUis}}`6z}DJhzT4}MpRW*{d|s3vuIn=q182Y_4-`nPw3Cw8eDw6 zi!}Gq!%UmD2@~dj;KM~vo58Nv(rn?VfMI%{m38>lDbPFH{*j;Ibuv(JW{`XkKwv`F z9Zf+(h6g&2Kh04R(IW~VOyBLhv_8GIvqScN0pLb*AYPGF0MqFQt)kL1ptdpVE1@z+ zw%;*jvs{~csj(qCz4AF_WJf|FDU5 z?-JjfoK6IP$=3uF`T%B&{|R_xSLUUpI1~pz&|`>7OsY0_>@}ICA{RhiRHmsq!%jN> z+6UXXtrk?Eh2swU_v=GJ%Mc79!{ril4yfP4s(izj?ffK!ctx?WZ=q%$YS@Pb9UBN) zb%=~FdwA81Er2vy8al%WTS^iR>`ov2aukpPpH_^|l?Pp=z_r6owl_*PLxHq^QBLaS zWrHR<>g^iu{E#pn_%dT6#mCQI8AM`3c+?LBbjpoe?nS&BdcpulWC|9Pl>r=gv>edo z0UHBhs0>>$%6BDF+YL3~*Fcz_XC2o}+sQL#n5cDGGl2&A;B80O6|BGSG4uI8vn(GB zdo>(;MVR(=E$>T7CP)6G3pu?x`r2Q<&DKN?fx-R2?3$Pb?I$g8LIiGn=Vxbf>IgP~ z9pjkHgebo!G^g2v2OuyCjwqG5BvUeLYbWgMa{?@9lvhv1L_|{J<4LO788y}t9@25% zn=k+NZD2?_&nOLJ=iNU%xIl(>4^$T!7hJ*<7blNWHg(pu_lF(k&$gBLLgFt z7brdmSkSB=Ii5LfD(oG)AT6L>AAQsWPRB}Lq$?pYnJ+pAKapVLDB!Ykr+@zZk_vOz zA*2gTzlI!#d4@sv_96lBm6JwVn?ceJ517MB4W?g-^owlr(z@ix)*3Na z#^+%i1vH8Lhj{h90TrkfoF02dIJ=Gj0U`*nFJ5AupJ1j@-QukQ$&E`JUI=Juy}zHi z)K75MewzYbC-DF})R+ra^8T`H^_c!XV`V$#7ueO*=(1#N8JHSNNlESfg=#ZVi3PIz znICeu6A{C8Xfm5Q5xP(#aa2xHFuU~bR^zJB<&x!s1qCW%2Bo&wI|9-9y=(rTk~IL@ 
zW1kL4cc8QbO@2HC_-!$-CGe|B*AoyFu9JZEG!hzw02^J zIZ864NyE0F4?YSt0Eg-BWf&Mmg2y={u{JwJgj|6QD%*9HFGs>4gj7a2h#U`I%XAsa z^dL1G#!S^L*^Sy+1s(8%qJ9_%a$2i&3pU7}_NG*`gPEVnNt<`sRGS}iOcTY$)cb>v zgB;Z}$r49XRgtzBKkr6GWk6U;V*C4*0aC5}%Cws&0O|0hs33@mwOO{l!WETP0iyYe z6DLuN2KxcU)|rQ(0U8#{1aXe>&oEs{1U8%+Djj&pCAHY8`-=5lC|ZijF&lc4E zo>`v>6Oce%^NY1Xi==}{fky7=?ino|IAfJV=oQQqz)J&80lfrwkF>5MCCkTWQqX@N zJvDW|-yOS{*su-5DFEkOAN%MD^f?!{mBJB`f+ZE}J2YBOidfseAOtKP#UGC=4~77P z&vo6%V{ZjG*`$axPRZ6B1GKe2@A<1;d>l>+$7$kAO_k{1`rvxYDhyg>cw<5&G8_O8 zXI>8*Xb4s~fzoF9`}eo`LO{O+FhJH{AN^-e6O<6BRf&7z+;PG6{8uqp79{0UV#30T zBn+hnxHPALOba0Ha?VH$WmctBF5dNHBnH z;`q4A%{4=;ro{?SRy@e}z)1QI^KTD0SFfJo`gp9B|D$KcvE9V#sJ7g@G$&{E$=&Ib zoQ6Rgq`nEbTra6M&ZbOaE_BUT&gUF9e^JR8!Qvlvbz}yO0UPM?@y>hKGar=#xGG;O zfDSL)d&HX60vK4>;F$FUAYtRR&x>T2z*_-%aWF4M2RC6m<%fY>jY0q(2=X`s!3qrM zwv2cAQfvX(Pw(oViVpzwE_&>izV4$+yt;ZH)AvDGo*?7&np{=sZ5-u`5491@0Zcnmiq-X=x|D!(fL$JLRYT`~awsLZV$?K*u;4jPw0{Cor`g_nZ?%oy0Z3So| zcOtVdW}DA{{vffymQj^{#M#X!{$T)Jxudc2cNYQfWVUet3>z%ZW4iW(g(eyrnlA6% zwrJ8PXwv9M>`y6)Z3#En#HqlwMZefUqvq^*1$3Y306PU@K zFx=|L25K-%a)r;Tl&+s1su%=GK(p`N%8gS%yG?;M zIB}w7yd(R8=_UNZKJh@(Ee3`(F3MN}p&2Kuj+R#CpGbrUa-I+YV0`8%QjYPI6v&5~ ze?|L$3nwxpurkab?mRzJ*9FCEu+{<6(AHL^Jx&&P(g~(c;)H%bw`*CFNH5(Aqv5#P z2hs8QREGvVh{?FcBY_|jibjSvt%!^2Pb5C&lxh7bZ~I}}4bV^=PrKRly|rp6Z#F&z zs(b_9e<)47IuzaF!U6GLgYug~4@PS+#?4fGiI@P!v`KlFx89aN60s zt@N`Uq$HL1m>YN*9MA-{h(qWrk}Br*o9$uFYUEO0c-X|1o)-tcm!ZB()zPFxveJ)0 z5lr>DI-n!TDEHxy!otD^PgEcaA41m!Kq3wf4qkoLi{K6Z70E|Va`@~NAfb71DO}%f z2Gk`+ViRzt%TS}Sv3I7CcQe_Fzao?}XKc$+7~Ms#TO!9%xWVVI$2=WjD`;C*2uS_8 zz6N82Rk8FlOamHgy=pSI2QYB7r?IU9wmoZLK#Q5n-T6J>KA$vCVmuWw_DTQ}zXXJ! zj!oz?dpBGh{TYRHu;LPhG0dCi#3#CWb}Vv_Ok@f9P}}`Pr&|7ONg>sr6z1T^!-SL1w`lvn85(}4l?SXyzPXxM~qa0 zZx~belcVsK7=AO9m9Viwv97CjEPfUiqlVMID&cMMjo1 zI|A^BH)%6lTY$x&nsa}NObrBAmtSBAeLFB zWkBG^MaDoCQ!N^l;&H~;MN4fL;X1upAdUuh#DcvCR0c^XP^9j;?G^lys~qv~Lx)Jo z6v^@_3s>2aL8=4UxIJu-mUIM05P1m5XOc}5?#x|=+Eh+||8_AkF)_5oB?-&@7ife4 zHVVud*Rdo`kDu!X5pt@pp#m8!D!0HaEP%M}7!7osiG_l-;ASX5#r`Hd2Z_peLr09+)20#N~4Flo9q|t?aX-TOO3q*^%^Ix|H_Vx$i zM!bGtOxnFW9)>olE8W#RU>UCYO71brd7wvt-!~b|hX^wTm^lh#bgpT`)ZSjn_f2Kh z4H7b>>{w3-H{d9NnBMnh1c`eMzO*YfRD0}y)ALpMd!1HR71fA|1)fv>+_f}7ViyMBp&b7KRHf4d*& zIp#AJ7f7=(0vmLow~Guj3nI&@`v0`=UV8RoWB59B15H*M{H`=n*TMaFc)~@pOx)c2 zfJUPQ$X-Bz=&ySW3oEDyh-ihO%0zqN-#n-?XK_fO!MR}me5P@`i~u@Le5oTHFQT<* zmUIC70*%3(`YnWw0(&g}(8^ME}O#tV^#lkSyoJ|X|h|GqjQwsQde_Hj^z zdiM-li3I^M9(EWS$UCIr^1K-6XZ1s`gh8riWMl*u7Nr*?D2OGuExm0P0^y>-`$KBF zh0?_{q?7_sv4%EGg7g|}<@dm4Bf$k34?3nvM*skVd@*kt3a8UA)dsLIA6IGf&dR*1 z8GPQi#1BsAJy3n89p{(I)qtrdZGh|!Nu^)xYmha-cwPsS59IkVZp?S06cKbch%$~X z8NYCniSY;kV27VK3e45!jq84a$$Zo;<`x;7Ak|@pA*<$?0n3$dJPddRAC?x+dx|4b z%~efYn}X!<6Fb=FIbR#F7HJGlYtkmTYxA zNEn>X6`hV{WQ+F~i{SW9<}C|_o>;(LeOwU%GF&kstI%iv1(pEm0un6I4S-il4ds_< zt+r?y9q*$2d5z#-;VaiIfazpt{sq8Q;-^h(Nh_*}>hr4DaqDbnNNA%G%1w9Tk(yb- zGk7=q*tbjKYcu3vu^c0D5hXb2#Jnjg#MbsqsKJ4bgfD^4AsBk8=D&zkCe?od?$YrHEV_CP^YF!DU1!=6;$}lnOo_w%*oPGw;?v`AZ{bq?uTRx8v5ORR(fWLi*4^$7O%c>QeP^qzYprT$TI ziUomtaL67Gz-%DWCW}K7JBb0Z@);|sw$j>+Sw%7I`?S6wJ)`6<6cRcM=|LUNYkymV zF71UpHkZT&^&Umx&_K{EDk8ZUo{Vb#=pGs^VFg*bViFSqn41A3rIROB?@wQq1a-IVNbxWd z7|ipN5w@3Ox=4n<=oBEwT*B?Z^M7|igY+SoZ`2$n6iGB#fRBX05?fvgU6Nuzc?8iu=x1W*BN1mp!OaStVa*AW%JS#5yctFD)kVZ@P~Z@K)j0M-Pf4JxGx!U8bm0#1Zn!HK{X6#`Up zepS81g8<^Jgjx7syOvgYpi7eygWW|fj80bb3<`F#WI(?PoD%&8YGR&8 z&%S^E&Y$M>XFGqn6(Icbtv_A=2EgqD%+2o^cN;=&0}CM)vmIM+n-(6}W-#VV7lp;(HTpkMF_Y0BBc7hTdWD0bIglz;d;U(d{{FxceRsIH zFg;CmU~uP}rb~*=Ie7Pd@fxdM%lPW9uP&|sYOJ5x%oY*07Vl@sR*98Yt6Y&=@*|4B zhLmPH>X|t^;dV%_SL0GuQ&fbdptajmX#+%?3JN2Q%?~@9Po 
zz7$Yo_EPeIC<6%Sq)jI-NyjP+t#+!b$}&^5ZEr6s<__Ji|9%_rx8ksS{_{Y14O?VD z*6sA~{;L^Q9FB1-|B^#;#w2CindAnAa>QxEc zzuYGkX((mu@J7F_06guEXO#d7H>``lb->mx<~hkWGkRs zp~8%hww78&D-FZrKGW;2Hyqj@6WNCbId%<<>7E{=$y*AnQ9}nXBbru>I|?RjuCK2% zJP$-#eI0La-p47281Fbw3J5urJQROjgY7$}L@rP&Hu;rH;dw~f`d-C;yvEyd5(oB# zsRId*@sk`Sc1VM5OH6sWT4DnG5XWSli|-aHq@uo@z2w)at#?Yc&87@_u>rk^`&*1RB`$Ozs%z3vH*r~Ru={38vxLpyw1Aplgl_V2lv_d~ zN4T|NFI-v-i|B{XT}|L5e6ILLYAm!Os1xrER)__ zy0}|jt`$d>Q6nwNlDjke-l^ z90G|W`(&`Y>p=6m#ch}V!}}u{EQn$fTeu|zBICK6id7tWKO|H^^<62=dzE_MZZsIi zkuFzsBFbF5|D8_x)*-J&>*qI_7toSlx}2$73ze00^mX{)pfBnvS`F)0W6>n_Vbo&d z59fp)c~~~VhZ&?Vb7mr+%ZMPNq@<*V55IIX>b`7GZuN5xzX@hoKrNmaT&Lh+ooifNY{{wj`t_q0 zc>Bg`PjZwR>D-APp6OeD;`l9eJvaAx1S@@%>?s72d(!Crd?H%hqWRAgTB&ZD6MO_2 z)S6}VF38mk=Z1#R*ZRhzRjV9gzpGx5q|OrGVU;3o&3A6fVV588Qh#v1|oAC&n}LLzDRDkmOIJ^(r_gKubbhJaiySvX5{<>^ut zJ?~H%{l^Jf;VY--r&ULZ=r1b28ow);bH-`sx4enUv9D%bLL2!kR-cn2Q>vZx@gsBD z+QS9`{^6fY=nO>iD9R^$~)=DxLQbKCs5JRnq*>BSml zDr+hnt+UKi-Ig{E6A}_)3ui)?kK2Rk^&2R3gWtIk&dU478lh`*xxF z2@Ewhf;4tKA{&l+^yO?u>iUh@H8ssWa3>Y>!23^mG07oht_0+0p&q>*FGXG_+PyCM z7+_&O|58YhL;1M#&t_k4^iwP#ty0IFo6~hP7Je^X=3bz2*tG1-oE^)h1lmDIgQ zb7?VmdLPp0LANVpWE2jTdP^fAxgttCT6CCGQi+;iweZuI>a19b4tDm2PTlU)%O3f~ zl=o5(@zXX36FBiGxFCK2Fm<~^*~a$Z!7Y-gAuE2rgDwHkc_(aoR#zf`5$>i37e*7t z{7p7}ulperH&bd<)1@y;m}Yh5&YD#95dZXqiFRAr(T+$s0dyc(91Sk8q^y;V5$W_? z%I_kwmw!az+jG@#H`TUoE-jC$dzy+|O*byLM&rCwyN!R+F(iAI&iA&}Y<+87cv6x? zND%EF9+9N@CWD&(uRGaYR20D>-VSBy^r)Is_C>$-+rzQ!KOxSLSZ?eT<)P-wF>QMPFR)=4Ii?{(@1aMl6@w9X7bUI7zaq_#*CAkhG{ zYRn;pjO3eFwtyx4a=HalM)QKL9ug!-cR6kR+P>xYbi`c&zVHJBqx>tG=ZDMbC@&Wx zQ!#t$+09-%1|2dqyE)4mtE>L)M0880qr~x%ET}L(dN6dl5be2J;lfOBHpK+6Uq!Yzp1swlb%G;{KL8aNHBbWmW<1 z%l9vrfYFo$$MMAInm+&Pa9T2oay|<_O^`F5q#^N5;9rA`xc%F^PZ$zynlDT&<}TJG zd;Xf-mNRMz2?aXc-^fO75BhRu;*mfkgmtZjPE6U)57P)~u=GYv;h`aVy1P?mgrzev zHQ*5$3jcPIXvahUUf#m9(OAz=x1mO@(j)ZiYn?;o5jn{ElRfLUk#R#~KGKk(IhZ*TE#cYNcyPU`)P@sDL}+(wnH3t5FG+<2{AD>3hCO5Qg)FEP$9 z6wGegNF3!TK_FpOSy|@854sR?isAQWP3+UXs)_(>@ahP$afPOHiZ8xV$|53irE^FA z=yxLoLc2h8M9A?@|MI~Y>a#9U$HfW5hh#V@Vs7^BaT&M2^L^Ux0H+o+kg@ed%J9vr zpR6lZu#o+lAYRO4yJ)o;%#D0EqDZ2;$&U#O(RUnm^f%*I#a@zVWkrn>H%1{RMp!04 zZ1&lPfvjzuf> zF7jjaD;hShkHUIX0GCwc^6@b`JDptnn!`OZ0WA5|H)@*A=d*9{C?DXfz4!|fWNH&@ zE(zUCwt_dzY=~~70FrwFXu7OnBp1N8I{`2QYqfzy(V)@K%xm?Hc35%^Nk3Z|=amZD+1XhN3?P4W^HD~s zhq>eFN*xYdUEJu9zoMFQ(okU4h*A17Zc`hUoMMi-lIB%aRiLPT zy;?uM#`3#%%D8DcgyvE@QBF>-Si3YJ=qaHEfpWo=2Z*ecr@gZ1t&gg-iwUl`dPQs{ zp<~iJ{j#iajhfvvs#h^sy}j}=;)Q0amWsevZn$G1NUKxBYPUP z#%FhIj2?HZHteT}7x2;wF@yh(VaZ{f20sL_qH0AIiQ)QSECi%v|4=uy<4O25EWoxX zU;xIO#wN{nB<;~0rs5BFTkfS?3k^H$IP;BfpCgEUmRJhRonR5e5T<0E{+E>DWTWp~ zTz;ROfv>4H)Mkno0@*rd<--G6gbQQV2pIxD;MY!Uc|sVy3^?FOBAf#zFm}C_>X`Zy z9UWa)CkWU?wfZvLcT}306)17fkf9Rq>FYd6OLUYJbkKA%nLqwGVPp5^O+=4~WIasS z0+#7FY08sn;J5fxQA)-SI1N@PO9T^u4*~}pwUi$~iVkn9vl9}<0^HfL7GJDArlaLU zJWxe_`=;uGcf|EdDulz#v`_;1dF;r~qXaEuPbp?>X32+&Uo@7l-Ex;%Kn+0nz!tD9 z0lfXv9%WdV4Q>`#LC?MIL7am69#htX_$WXT3jG+44bd+Z4A;N_FObkLrRc&$u}29Y z3}5ORh!5?Th=ge=0`14L1W*kMgo?2o*Zh+e- zIJXvbMq#P_>ex5{?eZy+m)eIwmN`l0BS~mjh!fp?S4%jZlvL{UCZp@x@CtIJ0`&pu z&vIX2t-`$T*1&|u%xYs678WU`gSJVwz05pr9k!lDa7%_BN2^Ge(r+u%B2M+G4L~NNTzo2{yVX@LX;2! 
zS!wNL1XLWtNzYv(E(LfnSy+>(8z*#;6Jg{~7{> zO9*&=X_y5m9sCV8i-*4f=DgFqLi>hFM&#QRkK4dJMPA4CvoarktMo_(;&>?otDe{IY2WH45yfUia!T zf(W#l;Z|XK(Mn%-0P|{VYt*K8PsbaWRf4vq3z!Y9g6RTtg3?z7U|kM834oY|1Yzjv zg4?eako$n9riPueiZE?7P*)1{7EcQAHwi@AYOS=J@_@SeK$FZK1(InG_Pvb$!KkN1 zH?|624Wk93%%(i?XLy`5A^OUC(By%l?P4GrhzP7H`U;E&P~DgYYZ$^p3>#RJ1+8&VBPNsesQd@xJ7 zyd6p409_}9xPE(whB1|)BDFSWm@juocQ$_vY=sc4J66!~#08TPt8f2E?!nZUSXd8@ zo3SClXEI_n3W4s=S*)7h$=RXSnE!bj@RGNp zpMlKLEt2jG6qZ2f1sWT`d1rop9%#S_FCr9a3$#jpY=hm59>uSMJ4tuL3ZLc(k3IF% zCn_csGC|llF4(p4nHm(i-f5Q^7OJr|w4i}5cHcl4Gc3b@Xmzy-16aBg7P4y&v7BFZ zmCYOeM)i#<5J3eu$8Th)=u(Vbx|wP{uLW{2K9ly~QNU*eUugG6S1nFHq*#Ikc0h~P zNtjkjLAUsGB2_&j6qg3NjRwBNMqK7ATox7Pgb-D~xunlRb;OD1drwTQ1W z--6>*zh(p8>G`02dY@t}lbVYo{00wgjSL5VwU#F4MhFq7e{O7?-XIF@=p@i@eCUq5 z{-Jw43Q!apM4f$o?|_Ny$##AQU-C031b>7KD6Ju7=71cWP>S}KprNFPK&%g71oCEM zlXV+itbKi-c>Y3HJwjd)rw#%PLN903MvjVKHQ>sYy!F$I73M2>j6#{+L>g>|&e#cL zLj>-PfDRn`cjM~l95Q$z6KRT~A79y$A6Qz<`XBiKZ+c@S^@W{W2{PxapLfPvHN_{t zwE7r!rP0mh(FKyp2T2NMxdWEw&9OgDdl{x3I>JOrWoa2*EzPz4-YSwEfBLU1?Q6-iU=6vaanQu z*PX$J#+MSnkK4szo@IaoiJRn=&n?KB+6DUuh~Mx*nYAhcIA0`e1p_R#-Y)N}yjdvx zsp;&KyiDMZ(ZTJTcqXixXVgp<>bxIgYdN%HOo1W13a_@6RWwp!`Id60KiaH@B}lMOi25Cr2Fdf;;lH z$&>L4DJv^0pz@)}ntAi#6$;4Px2gbSZbUx)8Er4D{jxuc zdW)js;rZ5~1_+%JVA1NKdPjZ3Jn}U`#uD3a)N(^CKBvqS{70dP8y#>vu`^C>o?dxL zaGCuMgY)^VpHy7_z16%EKn$}=fqd*EG)*fU#TR6=2NaUb$GM6I59EH*5sxhWPNin_ zO+q}3+;uNYr3Px262jp_g|&3VOp#d@a3*1&8{uF4?X2P2#R9lbEqYm=glhyNW@$46 z0v7CJ3djOLG@~^$>;q&{pb2s?|4+7q->Qra%*TJSCk_EApe{niGlr;scODnoqi0It zyea*@f2Vx^g2>dJ0dSiYB#gM=D@EQnE0hMAt%cy5+)gyMJ~DNBcnCrE0nt;a7f9a+ z`C(PK55A{~tz2BQhsQd(xVWgOr~pl^Rp;nNd15;_dV8&~IdvF$Sfs-aam$a@HDVl^ zC2qJgD>hsnWFA!n_;ha%b$!}0T4adgSZSa5&_Z%GticDq!suP}gwo+9nnDsec)$|U z=Zz@(tR6&Az?P(V4|(RX6JzG~&rNPtC=2U?5txP;Mh_@;a;Q94I}mGOa1Jd*fjLE& zeq5bgX0{#`|W=wtrQbC{F(~PdYHv%8G!I6V>^FRlvSp3)qpj84) znMigf;xB)ylQ?9F1rase$c@!f8Z1E;)4}=<(}ug-YffMh?)%r(p7UaFrkY(K%5)uw z%|QHC`(jQ%icMZqMtH z9w(QV?nT^ic&~Y7`>=KZkPNxoIRj;msECL)4^?*AvnA|r>f{WOUvSMn2P$53tzv~K z7Wje7JdR9l=+~UDfc72#hk+W)fp`o&3dcS$I?{Me-tleV!+ii6hH&Xs0kHt!Q|c~) z?giI3Uw8gG@(ORxIjrT}pC*R9j#vZ>1f6)~?q8QRVe>pqxDu%DwZ4^%$E5b}T24++ zpWHpeBBHOE+XpizVz*^YV(JXs-bCfNbG_c@lyNrq!Rc8;-l4&!M@vV`UCK!hp4D>8 zTCQzu+#a~K0dU}t-Lqo?@}savSg~)uGV`%!&tfGHdUkG_IPz3l>8P*%e8 z_VxWUfWV6|p?}v7&>}^5_hhxRGzx(p4PXZOnPa9vT15c|p6JO9)VG(m1^uss6<-Q` z5JlF`!fDg5MSy?&=TCG97eR)wnkzE1bKxml^?ODJbf%_u&mxdIENc19#^ponByCO1?B7R8Qfyj-!uaX`R(B?`%2bEPznYREm(xK*sh6f(Ld}_0}L|C zBr-~nRQfi1GM}O1qB3ybdR63`FMC)>k|V4tUdoxRX2AlDbxmSlRNMx?k#8atpxC6; z+^=0O2PUfX>|wH~5a+MWZEa7a=5XWe=T?2d#+R>HC@ou>;v};SW@BLmS@4%Mps*1~ zGpBglIOLvZ|4T6c_)FHC=wDwwr}rNL-;YU2 z{J7)_O1Xsyo)>*=psrDW^D5c82|6~3c}B)ULhv55d++7emj_lxZ%=#wbOy>t94S4} z`dM?JyIDp1nA$n9x~ooEX{1HT@K^51?1I8`eIHHWaFA4baePKTSjG2JuV*&#fXD~} zIX@Q&KJ!d3efUCi->V?IqRg{*8+2*8xfkbmL07G%&B;|SrK&_lZ=%6%6a|u=HF+{# z>UmM3lUU~ajn(#PL(GP|m9NE1z3uRE4LQAu#PclWx_8gaCKlsTB5bd9D9hg4&%0K% zI@%Q> z#l+O~b3OhWe3n`y-`IroyFD^<`yTpf<2oXjr0~tCxbsCT{i(tI(wL{uN_&YQtc`BA z2kbsOYXx?DYd^vlxkq~0f>>CzTxl{d){9%i_b1BYKFN=$*lh#YI83Fx+v2#Li!E1BQHaOH+6uQ1$wkApYWjn$JG3$6U! 
z*}K#TntLCm=N#sqie5ySJ(FHXq%7P#z>Hci^ zV0ZKP_74W{-^3TbkpMm!Bk>1Fl1VhWAH(TIKn|$@{tI|Q;P|$HpAb3$&%0|ZERx3U_?fklX83p1PMO4>j8ZtPDTUd22hC76}FOvj1?}rD!8b^dza2lHn zS}RX14@r#@$^OYfixNhhCVH-w;s*L4{tvWJ&o{M z=QPo)c|8WEO2<=Q-)=t82mi>HoA(2Szgz@`H~8-k6Oe#AwSz{(Q9tr&itUiHi9Z1t zct_ZSCy~A9E-h6J{|1$3gm1^n z7Lz_jKn8uMfB2++`Py6qpaVVU{@Db(!NzefYiR_~uto->qI9MPSX}qZ1)ocNVMk6V z&AppsUbN{`g#LdXj_{ulU5Q{2v^;=76fV%31H(U4v?|v~{-_{Ka}$^fjSf2M!qhIV zEiA+_+3I1zwqOp|lwC@Yz%=PEv*bz|0&-GL75%JoJRqjQQKyar0#I*U7n|QV-~vPu zs-&vww+Xp?~W zqUjOHln*?@|Em~WqCk^^KDF2~DGqn|Wz%%gb51Z!L-BMC+-KIHvwOiQij`cF1l#oc z@gy3aCFMv7%YK=fQ3E<~)MDIDbJWDM%ER)yQ^%TfPoc7dY)YnAwdnClS6WFKfdig$)WIDQF7Ts~dncW+w}8NaB;v5AQdk8FgmQ zuJG3`UcZL!eNIH~D8wiD*a7|G4MDQt5^5K(wpK%3IpD_NnK)CPy%i`X82$?iSrsV6 z3SfuDhBI}zIqhjsUDO2Xf}3s1IV|;A@dTd|98;5};+hSt5oEgnUCNT90kZ2NT0G%# zb9sDuz-7u|{u4J4f|LYZRD@R3y+qSlV1CJG2c9q53mg1+pifO+$8A4A_Y_h z+Qna^(dg*tY-;cAQktjDTk-)-P9vE}%o`nlp2)i8^7Q6uJYU|EnCK3VyED}p2LcemOm1MoL zqLorG+Hy2>TtIB-3?)dP2>L0*nSdEQJ2m>9DZrzsqQ$vN+ z&q;lNf&~2x3wSppH5)Bvr6LOv*gEpWu`S>6O)JFt8sA)f;AtIQ-Jutc zto%U&3gE4OD}#$!c-c=Z`X%bvK3BK0<)|qSQ0+`C!Dumlc)9o;%z=;<>Aw$uM>2X8 zOQbVOB`0sT2Tp?h`=HMXJVeCq<3lTqC?J1ThaF#e@#MG#JqeYzgbb}qWIWuv!N(+( z?FPZ4ciSazHXJ=ZBg1#WbqY9V9fC+)SjXZ4l>3vo^hkb*)avgaul8ZMqKOpB21^qc zYMMtDA<|GlH!LT|0tjQSl-~j{b-ucR92Fe#i0I$%G*)E z9$>&|Cw~;)+;O^TY3;*CsFTEY(3II9f62V!LljJajckQso7X{=ZrR+ECu~Z~%3eJc zYtQzh(jw-4>3?d3PU&tqE3<(8@2rsG zdPh=?g1j}8jg~y$gBok--at!t#35IAdN_hlR#=A3$m-)OlI=hk(Udh#7iOih80pNj zZvYGo0bDyE>^%fH(}JM;qI=fvcMO3lMU1|x2JJ3gO=73o zjc%w@r|EuYg`1SOJh8f5hjmy<=p!gj0IU;Qs2SH`3R*t8qio6Oe$)Z#_wTTn`9sD{ zUxT58LkZUd9|%7#?ilscVa$J5DX|moOXh_`@dy{h=5Q+IlQz(z-sHqiZ>8aM2@3fDlii2EZbf2_pkUA)@=(~G~jxQjh)GNs@);? zPqoqh0SY;*NW%v@g_?k?@<6GhO-^@HQrwSmye%$y9l21uZSRK%p&dxK7&BBOD{V>cm-?^^qJg?(Cj^nJLNw6Ef2^DGnm*~g(B|AkV z&AYh7CIQ6(up9DKf*)0-^K=_H1jB<`6RIq)w0WQ~Kju=U{ihGRI{a_Ny-*n>?I!T@7J=B3R?Xpy7d(i{($beR|AeXx#}iI$IMLjplxzhI8|-69b4>_ z7=u=Mu*pECM5u8-!XBS^(%cw7xszr@3@y{jxoCT2I@S&X;PbImm+0wI12OoT+SQcgiN@38B~yXR*CXfsSb{HV@0xdCZT6G*EQu5`yr|{llTGwYs*Xa4 zJhvmOuztb!0i$|pD%`QV3a=zu7aUK7gFag`2Oav8CSwnpg3*cNk=s#U8^A|0W7i)G zGFa^(nCvW_3B&ZO5_&lSG{1mv6yD|lQXHXr6KE2?*e?{xD)ntm0mnp|4S`hTC^qJ$ zpq85T5w|6O{WUuHt5k;!cJ}#^pAGd=$(*U*fCN}!*NCa?iWWwIc-S_RUJeQvh@-c| z1m}-x@i|wX8SUHA3Rw^eZfBh62)l{OT&Fv4E9H3nX81&;%}Zy1EAnj4oJ!N4&7FF} z6mR&xNs3Uc1RHEeN1IGAUn9r3&yy*Gyk6k{1ox0)vO+6xw(*{vq*P7^6c%}uJ7Q>` zFO^dvwzNJcC|g{T`k>-x$a@mO6K#;96+X`I;qzBj^#ERApCe;1ThXYyu6uqdZ zC|xREeQyWO45Or1O1J;G0fnO&??Y;ge9S#4a3EIP!~pD^;0j2%Xq+bLT{CvRc652e z9olEKPD6?%DQRpWC z1aJ9Y1+^02{tVl14OiZi<8>8D1><2f-VdmD!{In|$Z`B*AP75xM4%&v7!%e|U53Q5 z(oic`$Kn7q3?Z0k4F=jy@^sd#LXB%ViYFQLI^c=6GeZIG0@vi_<6qS!AMLNU)UJj* zh=Fsl=8z*;vV$rK>l(UurDl~RrKBAD*xDOC^tyE`HJA>BdCP}8>4$XLF-YbRkS`SG zV;@TcR_T8h9lVD|Mjiq7O*H7^9RxO?lAsLQ-$PedSA!S8LiY=WxhO?S7y-fF1%8Jg zAFx||rtP-#|DIp6$p1ZX82!&E#R!rx%PfDAbwwzD0c1dckhv^k_HI1zv=PA_vjE0u zki=BbK}XpZ0xu%3p)Bw!ijq(o9+fWqO5qaV{%504P9iX!DTIBn?vDHi+YIa;7)+p7 zyMFFwD#mW<2ATKNudwM~}Co%lig#b?`C;ssArydV0@&ljH1)yI5?2eT=S z4e%I?O8|}aI}ir!GMJt~(p>Ta&v3sRAu6O9YdG&oq4XUFf0r5$CWHu$sE-TB^E)_1 zr8e7S74eqc*4lb^2(0ThxZk@OV|WawEKBtBAS6vp$oXv$I9GZ?kAy5cKGEZuY&Oy3 zKzKdef3zK%hr|3}dYysVLmzYcKRZh_t#3S%U7g3c#_&TTWmMkL znniIGZ5dDu#7T9@YH(?a5h6tSNQj9Y6@)*@p;VBf7Xs~OFcus~n?42_Z-FYSNcf~5 zp$|cV3;CB!P2Pm^va-kS$qMJEjSqMG_WgN6Q1z+tA2)r@7E^?D&HB?_$J@V3h20Cc zc05gIBb)H!>3Zt%i=>$0IMZ>i#Bkw>^ z1;n8uJ!GfbM94X4qw~SCGwJi^&v1@DWx)Q{sCpM1zd!0f9y1VIZKMPmoF{|*`?<3( zx?2a%Z@X5^y-Ws6Q4WxV!bv%#WIWAd3EE=#U3sp6zAMC+gu2MU{z^XVkhduUA;c+UgofQF!yP`XHbZ%R8SD@}mMYH?G0-zG zz^-HLK9>!vi!jNz+-*AU#wGgBBJ-B`i^g6JBv-(eg`W5^xK5rJeg68jH57&%syAo` 
z8`7uZ4eTLt2em83zNqIfgVFaeVn|NFb$`$i)>_TV$5a0q?G=VE*I3TK$r^ykEz&O) ziE*|yxx~h3o1q}nh)+fNZosyCfILHvwvAYgt9#lRXom>sXyM0}{&0p4a$wJ4>(aJq z=O;}R^UFTC!MbnaN?5;z!=g)qMc3Gl4@Fu6zhFE)T=LR=)C(!xb8!_OTlO{intY(5 z^y7$nRq)|ZtwSVp z3*BRpnVUNYUl$?ucpb?TagWWFNXXrZ7&<79B33ZN&oc`nRaf47scaYX@>+($q#icm zd!~Ap;DW7>bx7tsE64$HSabrQiNuflU$5)MP@LjZ5i2hq==EH|07Bzx*1iKFG)=@oKoJ?b z&^;MSO;3;XKVfyhYGlIL4naQe%q|a+=_pDbh0HH(QJ1iRXPu4T`(PhV#9$+)Kdgq< z0vbeLPmeQr;yQ|>lzb?{{^kPoLr0nOTwK<7@0!+j!>g7nCP@))P*j5c-^}?2<@V71 z&21jyRPf+8Sm@PjZQO>$=HLW=$0#d3Sdki{JG66!?U~1TFnVfGo*c6~abLv7+FHv_ zj>Z|NOmDu=Uoa>$Nn;VM-5Xi{X8i#8dXs@nkvSzLl&2TKE_`pPN2^RS&iTBt?Avxg;yt-=yg z4YVaJl}KFq2Xa(81fNyLo?Ou`lXf|R_b?wr-j7oBk_fy+CNULfM<%ZwHuG|@LD*1hAk^$?Tb==1B+q#Ba9-_Pj)DAH;v zQB7B2*MlGn7?VAbI~kQwtH9Bop~%(EaRw6`8hom3jJ@LNi+kiQR?-@iiOVlF`dPS- zKUZ!5(YK)LP;qM1=>?Nn|8|+N)0aY%fl}b7!n=dTE1y16gcArZwqufMAhkFX87XNq zmh@@ADtZmPWKMl*`X_cw2(q{Q9pWYg8|~rLLCp+!ry(5OFZZ2|I*}iU(?$|PsZ8K& zGhKIKv0|BJg4smL4^Wo@+mt`TlFgEQw&cC9i;DQ;lsWQyt`){4C+|V7;*GlXaXb6M zI8wVoxehERH0ECe-UCEbLl5wJ3#1MXZ^Ea+##ed4#*_aULx>kUKr)RaEEA{7E5IC} z((toe9GH}FQbF7_{X=NK#VXZF9=Wx#F;0x{Fa75G>+kbGQAt*PsSyvYgWwVXRr?)< zD<>AA!pxIM+0@h|k-lstxktC@#A3!ODS6nfz#P8L)qBwna_X(UTmLs2g;bbO;j<`G z>?T4g1Z(jOc5%KV0S1!o-Mc8}$*nyPAtmXTS65aO6jDYW6er=Cn-;9^{qZ{dDM&x? z{`9#4{Jqo{3I}p8p3*D|H@ivCK!O86Bqm{Ow^M-P5otXgu00s3Ku|a2g_h)*ma=k4 zIMX^dT9@wr@rqN<{%+lnge|vhje4gyxyhtJkAPd58kqY^pl0yJJ1LVo?X9AL%>YrX z*Qo|vB+cG&>`J6%{wdv9)VNv0-tDT);a>g2zFZ;Jv2JwIijFSj9WBdqVzFFI)I0yN z=lC^lg7M15{md>lVMnAzB#JKd!XAB#zJcmweNJg zHKw4?)y1Wl{m<$~*p6VvJ#t5gg2AUR;TyDnzq#^&G*~qVn<1hx`agE>PqxNHJ~;Vy zW^zQ7{ykB8VORt)8CePnQ`<1Jr^j|qjYT`$h^Orv6p{4h!ZoL>P7gkLIR{Q@$)NI$P7l9}xK7h2m3 zeZZHtmCC|$Ho>bv;K4872eNsQTWge`4F4$bsLUtXgdz|Yah+tG_!0WeU8KZm`LBjD zVsYyzu1!`epAmCx6|1yY_?@SPc>3s`)#M^xwF427)sj%Nm3eN%o#r~`p++P}K)Wo} z2Z*jA3p_|Y?)15L7*qeZKx_45c`o(xp>w)rVfOS&LX9uT=s37zmGTetIk^KqttthX zQ&iHfF%=@U5t^4zroh zNuRm+@CIhwGvT9!q<{8|Ao`Q ziMnY+i&K!mK>buhE5yB;2_bwJ`Qd~$1145ODU0>!#&Q(zZ7tV~46k(%9B^GP=o5Ai zR{${=>66tfLY>w00|M8w_bry|QJ}O;t|B$gyZdTTgIj#qUThI8xXTv?dV1{Co2(Si zRcyqqmNK-IWTA;3lQ>eJcExOKh2XdnYB*16G0WWMBd0@RcXZ4(&4#a4v6O7jbO==F z!$!hmBy+i{s^7jXZqn_$?Wu?py%s;KZ*kFvDzrGXb7t-2B`w5hAD!xgUm+7qB)7A< zl~umyHMH&O@08xF9nmT)%rF_>b*akxBMK2R@VbXW^Z`yxI6!5Wh;G(#`4{E_g)}em zrmj1fOcpoi#`}PN-)p`hkt5*prG8 zdne5fT00YdNRdr0$$CDX`X^`Zj%uh1n6z4zFnaz zm||F+Hiazy*?BsaxYoIIP=s8l0}%nZ46hJBp}8R9G=ZF4BtR+RMFG%7gUa1UnRB6O zNRe5mOqkGBra-uJv+^CXAJbw+QpX2lg+OpPe8+~~VQbdGz29eSkI5z-rc=39jCAsHCc66Wu z`UogQM!m#*b09qV#6gN+Q!l!G+VmpvW8*KaaqN0|y&w=$TO(Kw)DNGcvmif;>sjZA z_1lYWe~U7QlAmEExc;@(S)qe(%!Q_cm~c1?&_0OQo-rwER;_cv?~cd49;hvC`Eb@U zN{@z^q4BwO$AdeKd(P3sv<>ibz(Jt@k*7eZ$!CJ&;BJGSFK>JxK}APtC`^8r$qYfu z1?T=sXrpr$>t=c|=3EI4+fIFcz|wP!pYehxyLBN;9T!8sq^W040WM=<`X z5wFdKn@ex|er2nv&;>vzzxuErsDm>^Q8+vaz$h&I-C~BfD)o|Fg#nXqhU3Af#UV3@jo^SSA$$(06MO)T6jpVV zFCSvum{$CEaD-E5$y^jm7fa_j1!KL~zKt3T<%wPelceCwX7{=GUETLsh} zJC+DHxSG>!WO&-2BZ74@rYbz>f~?go7H(eSk;300y^n{o_&c77(bm43tHPca%}vyW z36eu;hm&sd9kLK%sjUGP)QZ-|XR74qN9m%#yEl5R_aF-IGJpSA;k$2lHdFHNh+8Dz z7dfZCzF;V{t#|Fw-W=_Ws^ro4-b%ZztJKXOn5g zI7fEhI?dw!>eGpji|;G1odsF#fIG9XaV8J9cKr3Xc5??`s2vNUPU{cd>bx~lW$h57 zgp<|Y#*nr&|LAoQt}OoMkKSdSSn!jGy?vc*p(qRwp;tzIL+pL+?FV<~X6#QNyE|Zi zeY@>-bG3g&)}KW5@vK+%11QHug(6puIQ|+jDLV#9M9Tx-X5U|5M--aWOHkEQnHOpDf})2(rwH4c%x_(nV6S6=h5u+7cP&27q}?|oFu{XGS-ebI0-y&$Ts-W$AQ%sj7$ziCtw zBW(A|FB#a|Fw@d8F%`Yz;S-YR5hq3T;0|@iwB3+bFt&fHw!YNgPfBL+SmANdp1~>O z{2?b!#z%NZr)2@+W-k?CRWVzikw{Dg#`>A=8*#Ulu`geqc3SQ=FdwW8-|GUY#iO5` zjiD_cv^DEq6dqg$eLK=({{4G20s(wWH#6G~ymNaeiLB}8LlF?s-&HD`h}``S{p*p4 
ztlP>b!XmGV$<7!hTMk96oj|Z$m-zPf>Z7c!Bs)I{@=q_26!|;#+Dc@;9o0IIhv0lY zq;MF#+9>h?(z!B4l;X4!IzN4MaL$R zWJi`X>}tAd^f1`(N)nvVJ#$EWb>2(OJF6P~S?t;F=ZJsvU&ZrP@(;CM=ALI3wY_9q zf;Yw|S#OxSFEqbf=el_gt`Lk5fj>o%~u{u83eBk>Ggsrw3IbGkxa_09q z`3LLF07$I6^LU#W;ullM6kN+cXC^G@BacsaJI;s9!U=IgQWBqdS2sQ$r;SLgbhMk8X5=$5aUfbOt9>=!m%(YtGxYS^Qt`x zJ~kGwG~xd{8S8;-w(dv_Rw)&T8T<%HdrQE4Vn9A(md@n1bQnpz)~AuYlT7yZxta^U zxO!TZlW`-Zr#>3TZt!{C)jopgoBcogR_vqt%=o|E^islAxiY&h|8gfGEvBqUbLi*K zko=YJn<*KLPnv!q{69= zi~;$Lhr9%6)KG0J4l4W;&o`j)#e+as8zc^z+#HmN#}&>2Je67w)b&Q62xaF=F-D+w zAdj&WT~yg9#TBU)s@ltf2g;`Y10ehZOl07(?i=bZ`6(o^pL(XA!0B@Yy=rHRndVCs zfx!}(v9HM)h{tQ|!z)vFK8(^t20g>NCifHrj$R+3UAauIB@q8Y+U-zxahr|?MxIYB z{o$6MNh)qLVMQl_i>W`UD92VV9nel*ARS3+OPO6z!qM$N1D}FEt7=V zZ+HG9hAAqX1#cg%}?o5JoFLXHMc ze@}Ef@-4664!IAbpdJJ9E3ukFT_KL^b-f3263{a52e1dBMtIO7dC)!kdwO=3n<^CU zY|Fl8u%79D&&zToB-gh1czBspk$lueB_)CtjZwyNK6@zBCsfd4tQz`sLxV(kbR=IP zBm0~#=?&&?{EzU_l}9jd)O7Hc0pzhO@>Vq~aCegL`xaPByc6WJm=cOvn) zFg_6JR?AUsG}_&bTDY>(6pc59y1FfabNbv^#=4*Canb3fFA(#%{KvEnxy)9acx_n{ z+WF7>itfQkQbitc#i}C^uIR{)rc6Ws`ulc4gG_9$0Gj_xFR69@sS-w?DhTdO;yg-uRYHr#2gLy-_q>uz^vHkZ0M0!!y~9r zV9TZn7YrUVoD}Z2E;U;i`%SbLQ>^z!49}PD=Kd z*%?3`C_%l>A(dREA4@NC5=akz4CUcs&*QzXk%xn2d<@83*y&n)yS)aGaumWUWkf{`>uizNDLP zzH4=`a7V$-MyvovSj+598rX`6Z+|3Ve6;XyKmKnEA6^Hd_l{}stpT1<^<;!7DA^ko z%xXD6=}SUY;oooN8zVIF5nr#*N0NPwlHAE3BHM&Z6%OK%n|FO?!``}`Zc2-vn_7dW*}25 zME0DdGjHZFBF88|v%f?=SP?S6Q1>yZHUmUMLE)q;xjCU>8*7;^bQqbOrwuuV*n)}A z)=1!->~jdA6dcnc$HmXsM~w!t2rwR}|Itw9fk(w2@(j+Oc$PZ}zNZ+Vp5FafM(FsR z9@p<6K?AQ6pk9YcarDWf6sE)G8Lm%TasN8p38y0v6!VlU6}P|i^|b&$4yJar?}&Yl zSfedihrp*IW~nOiQC;x9r>$3Q_A=un)NPmfD#QN3Ah=f=jI$@m$f zxNkO=&>HE!sU88*IK8cs&4)CdbzWp5}2!vGp&q$5=l!>O^3b+l3T` z5J9oA6R-l^AD6#j;esO=z3>^Ugsd~vf}yev;QCJm^u>$n!!qcrS()lSNTA4Qp(>5L z&ylYZ&w%@-`r*2|>#$P`Y$nk*9yDk`vp~^{nWw`B(Fqe4X@G16@BX56gundgj$9}W zZUI`FTF$ATlV|E+bK8T0q4>?SRu}%ly(lkToyPLQvS_^`#*;4FXgyOWr>WUTNm8R4 z@S##H+KetZDX5k$Ngg<2j?dx_(l^6w49R? 
z-eHTqG^T^%xAlp_%OApx3OSCPI@=QUudlxmXoc@|@O);pw6->kZPJ+{w_fYzaqbn} zikt2uo2`{8ivX_|8ia_52t-=&+{n*An>`hFJc|A5+iMELWIxdv1X{02 z*lPk{9|9t8WDu4~RC@pg9Y@Hpp~T5!_4GgIiYrIw3bRcyiiv!}dVOR6^3~9oF35q zc(&jLP-$@Vp$^LmdggRoCdZkg4EA4B0K*_)rA0b25b3l$!&(E8BkPf~l!g_KzMg&> zV`;b|IELCYK#5Pq3WrnF&Yfq;mW zZMU{LO+E2Zk}V_9WDdz2@O|LZ2-bP2q)?!}nZa_(UNP=ElI`d;PL9|hZ>%gG6X$>{LRF0cjBmBYn4AZ@nf_!snz4S1!`Zr9w0}H*48y>v)1Cu1)V-rGrPH zrw3lE>@-76b3zG1y%Aw+@gQJqW^k?OK1;# zlfeQAIhL|T$}ADK{AGYddELUD_25R$T3idWx7bda@1w^Wcz3R_24aVRElgucM+m5I z!VJ5}o3;zYukkSxPl@!Q#WYdZ{MKo;9BA1rW$|w$@p#Yqix42wctHwKbs({AIUH^} zOu(sQd}#E*<+Z(C`OuTl(B_tw6J7Fjg4-tQO}Fx7^KBxOY)Jbf7CFt*8wJ1V8;Kr# z>wog=IsZ7<{|qI5+#fx6O|Qa{wLL-THVXx^zRi|^mYg4*fMraLjb#`WT;vKh`A!;{ zwb{fk$!vuI_Zh`DgFOY(Z;dAY4Lw>f2JXFozVXzqXD&soyX4-$nEB<6#T@fzf`5*V zfPTS(Dc&Xmi4zOVNYjTE?YkQRD;Whc+uyPv)a~ zb2@o_GOIT_{jWDuHZ-uEtLhCGdpRUIF+O_NvrZ(dqP)f#=g>rR`sL%c+X$1P+Ux;m zgT1=em6hP^+rCAsQAE1_n*7AG`qO<3y2Ab4Ug(?JC8)j&i!OK zTDQ{dHwabF;AJU z9cyU?zf^WdJIZ^}F?$ntp7*AR*{o09sbzlLICWLMhMVl13j4{hu1)K9f-I9XY zM26=a*wMJvrQF+~%mu@%EO7E(J47?*$^Kn z=4_Q`j;TALpD!%5dK~;ncT59Ag#|5<)O;XY(1#mUI^Q;!_qY8SEBB&1e`lF2)%b#) zMDM!1_u=@ec987_`$G-?f)`%DO&KZJ&{+V@}fqiSZO`KE_=N0r8)hhN0QsSy^^yx z86p)AKMrKtSlu2qVc=hScQyA)#kzLQ30vNTn8$V^FQ6I3hsvyv#60}pp8E1XrWYWO zZ=8I`u0ba?4Q_f3 zG@>GNaHuKgRr6(nX_?C@VwdL0NzpTGHNoiR?BI_|em!_|vmo&qoEgs5X}fdjb7$vG zp7@XIFS;wh&a=PQtO)u@UrqfChDu-vU%ft7kC)r!I$oeP*Isz%lKa-ut)Pa?pBK1| ze=-VF{=8~kEmf+sy!3u}HjlfuJHMsktM1lL6u22EZZ3~{kJaIzffowyUfp-zWoEKM z8C9Zh#D1%>5d*x$OZ0IPBO!{f5y5wyoS>H?w0dC(Z`R=|HC>Yl)U2SDetm(dMo9Wa zH&7h6>U{A+M?dK=KlI+7*=wtrwJ(xA09(?hR|I>${;(|>wAD5EmCT)!5Tr(*)O9VN zxT1T=R)V^XqLZwQPUND|(gyP{sGr$v8WE`E1a$?uN0#=#wJa&k{>?o8UAy_~YP!3K zq0n!Md%ufLQNCfj_@S%^S-zN>v2HY1AWE`WeR}k6(be}2JXRkaYQe-NE*MN#y6<>` zLEHtJCnNSiT$g`Ql$Uoc=c|&dSu8&IaKYUZhgN0Kio#t9%r6&R0Q)G?gEozDEA)7J zVlX;x@N=aC_yUxcdB0;(BZWw&ypP?;R|qaVME{A2`dXw+7pZF3kIH*Ct@qyK{7Nt3 zXmJ)xCk>hr^_L8bP>=vs_X&p8VO9ftKaqP|z^ljTLEIBp*dHbW#-R@zL1FzMWK5;SU;=|iR8SD3j;#yrAvl>pi`yZ_S$eF+Yesgov zI>$#@{Kp$wgiDZH;a=m@jrNso?z7MD|I8}g8|Ob%X&-qnUz=ZQ=Pp4pnU}ovHLs6h zySMIpQU2pR#hR)y55Ex{1xkd;tby{`V)r)<7|bVz`pl5~X6IAS3M6+F9m+?;kG^bW zKI_bh%SJ`x>b8Ii!4wp8)1u{PlP2-3OKu{_C z`gS?mz{8CF(U(z=_bw~Jc}*kphi${Zi$?d&U0cq@rL+#*E6nh@A1g$)EbpezqW+l zcS78w^BD4gpaSc-;OEcUr>8s$crRXTSIGU(PU0^>SI8dwsgBmXDggrPkxx8B4-_5hCUEGa)1n6<%}jw7`2YD_dU3t;_q@2<^nS z#`BI(B#7(H6<=qvHmTr*bAIS2sFd>;V@gfsrODfboVZJ|+hiDao8)-BPuylv7p=jA zrxLbkL;-$4YlJu^Ip)K6C$S+BL6r$^@L*AZjll?wfHrtWglf>ivlMo2A`qN1x3NEy zdqoV-83b_<%MesfS_ZwUXJfi%FVly2e6bZWcpYRTk$xH}eUK`U80N`VNhDaCMIpz5jPqTs$fI%6Nt%3$`+$Vj zb#S;BiS)UZ_wpqOVs%AN>`UxzF1~RIcmN+_5WRS3Xe%;AkQ{E_-?#Ap>%ManUuKe9is69ION~!e@X+D&P?H$wD|H<}gRk z&=Brwwtz*+0GOJz*`}%GJWLy{`8!nqH3!e)nTHjt$mDz#-x|J`EkLzN@iT%Lk@el$ z7>V z-dz;um9s>&&phXhPpyq6<`IPHV}3M!;70uq7h!pAb%cIWm>%^Vn!-v$HTtr-Fh3Tt zwVCLDX0>&zlpcj1#VEhq(^7$d236k{d+D^(U_QNa5sbTVE HO#J>2ggS5{ literal 0 HcmV?d00001 diff --git a/docs/serving/data_parallel_deployment.md b/docs/serving/data_parallel_deployment.md new file mode 100644 index 00000000000..484443fdc5a --- /dev/null +++ b/docs/serving/data_parallel_deployment.md @@ -0,0 +1,112 @@ +# Data Parallel Deployment + +vLLM supports Data Parallel deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. + +This will work with both dense and MoE models. + +For MoE models, particularly those like DeepSeek that employ MLA (Multi-head Latent Attention), it can be advantageous to use data parallel for the attention layers and expert or tensor parallel (EP or TP) for the expert layers. 
+
+In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned, and expert layers across all ranks are required to synchronize during every forward pass, even when there are fewer requests to be processed than DP ranks.
+
+The expert layers will by default form a (DP x TP) sized tensor parallel group. To enable expert parallelism, include the `--enable-expert-parallel` CLI arg (on all nodes in the multi-node case).
+
+In vLLM, each DP rank is deployed as a separate "core engine" process that communicates with front-end process(es) via ZMQ sockets. Data Parallel attention can be combined with Tensor Parallel attention, in which case each DP engine owns a number of per-GPU worker processes equal to the configured TP size.
+
+For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP Coordinator process that communicates with all ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
+
+In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
+
+This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class); for an example, see the offline data-parallel example script shipped in the vLLM repository.
+
+There are two distinct modes supported for online deployments - self-contained with internal load balancing, or external per-rank process deployment and load balancing.
+
+## Internal Load Balancing
+
+vLLM supports "self-contained" data parallel deployments that expose a single API endpoint.
+
+It can be configured by simply including e.g. `--data-parallel-size=4` in the `vllm serve` command line arguments. This will require 4 GPUs. It can be combined with tensor parallel, for example `--data-parallel-size=4 --tensor-parallel-size=2`, which would require 8 GPUs.
+
+Running a single data parallel deployment across multiple nodes requires a different `vllm serve` to be run on each node, specifying which DP ranks should run on that node. In this case, there will still be a single HTTP entrypoint - the API server(s) will run only on one node, but it doesn't necessarily need to be co-located with the DP ranks.
+
+This will run DP=4, TP=2 on a single 8-GPU node:
+
+```bash
+vllm serve $MODEL --data-parallel-size 4 --tensor-parallel-size 2
+```
+
+This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on the second node:
+
+```bash
+# Node 0 (with ip address 10.99.48.128)
+vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
+    --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
+# Node 1
+vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 2 \
+    --data-parallel-start-rank 2 \
+    --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
+```
+
+This will run DP=4 with only the API server on the first node and all engines on the second node:
+
+```bash
+# Node 0 (with ip address 10.99.48.128)
+vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
+    --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
+# Node 1
+vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 \
+    --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
+```
+
+This DP mode can also be used with Ray, in which case only a single launch command is needed irrespective of the number of nodes:
+
+```bash
+vllm serve $MODEL --data-parallel-size 16 --tensor-parallel-size 2 --data-parallel-backend=ray
+```
+
+Currently, the internal DP load balancing is done within the API server process(es) and is based on the running and waiting queues in each of the engines. This could be made more sophisticated in future by incorporating KV cache aware logic.
+
+When deploying large DP sizes using this method, the API server process can become a bottleneck. In this case, the orthogonal `--api-server-count` command line option can be used to scale this out (for example `--api-server-count=4`). This is transparent to users - a single HTTP endpoint / port is still exposed. Note that this API server scale-out is "internal" and still confined to the "head" node.
+
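To make the queue-based balancing concrete, the following is a deliberately simplified sketch of the selection policy described above. It is not vLLM's implementation; the class and function names are invented for illustration only.

```python
# Illustrative only: pick the DP engine with the fewest queued requests,
# breaking ties by the number of requests it is already running.
from dataclasses import dataclass


@dataclass
class EngineState:
    rank: int
    num_running: int   # requests currently being processed by this engine
    num_waiting: int   # requests queued behind them


def pick_engine(engines: list[EngineState]) -> int:
    """Return the rank of the least-loaded engine."""
    best = min(engines, key=lambda e: (e.num_waiting, e.num_running))
    return best.rank


states = [EngineState(0, 5, 2), EngineState(1, 3, 0), EngineState(2, 4, 1)]
print(pick_engine(states))  # -> 1 (no waiting requests, shortest running queue)
```

A KV-cache-aware variant would extend the sort key with, for example, an estimated prefix-cache hit rate per engine, which is the kind of refinement hinted at above.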

+![DP Internal LB Diagram](../assets/deployment/dp_internal_lb.png) +
+
+## External Load Balancing
+
+For larger scale deployments especially, it can make sense to handle the orchestration and load balancing of data parallel ranks externally.
+
+In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
+
+This can already be done trivially for non-MoE models, since each deployed server is fully independent. No data parallel CLI options need to be used for this.
+
+We support an equivalent topology for MoE DP+EP which can be configured via the following CLI arguments.
+
+If DP ranks are co-located (same node / ip address), a default RPC port is used, but a different HTTP server port must be specified for each rank:
+
+```bash
+# Rank 0
+CUDA_VISIBLE_DEVICES=0 vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 0 \
+    --port 8000
+# Rank 1
+CUDA_VISIBLE_DEVICES=1 vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 1 \
+    --port 8001
+```
+
+For multi-node cases, the address/port of rank 0 must also be specified:
+
+```bash
+# Rank 0 (with ip address 10.99.48.128)
+vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 0 \
+    --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
+# Rank 1
+vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 1 \
+    --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
+```
+
+The coordinator process also runs in this scenario, co-located with the DP rank 0 engine.
+
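To illustrate the external topology, here is a minimal sketch of a router sitting in front of the per-rank endpoints. The URLs are placeholders, the `requests` dependency is assumed, and the policy is plain round-robin rather than the telemetry-driven routing described above; it is not part of vLLM.

```python
# Toy external router over per-rank vLLM OpenAI-compatible endpoints.
# Placeholder URLs; a real router would also consult per-server load telemetry.
import itertools

import requests

RANK_ENDPOINTS = [
    "http://10.99.48.128:8000/v1",  # DP rank 0 (hypothetical address)
    "http://10.99.48.128:8001/v1",  # DP rank 1 (hypothetical address)
]
_next_endpoint = itertools.cycle(RANK_ENDPOINTS)


def route_completion(payload: dict) -> dict:
    """Forward one completion request to the next DP rank endpoint."""
    base_url = next(_next_endpoint)
    resp = requests.post(f"{base_url}/completions", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()


print(route_completion({"model": "my-model", "prompt": "Hello", "max_tokens": 16}))
```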
+![DP External LB Diagram](../assets/deployment/dp_external_lb.png) +
+ +In the above diagram, each of the dotted boxes corresponds to a separate launch of `vllm serve` - these could be separate Kubernetes pods, for example. diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index 8012500dfbf..a1f522cc5f1 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -15,6 +15,10 @@ After adding enough GPUs and nodes to hold the model, you can run vLLM first, wh !!! note There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs. +### Distributed serving of MoE (Mixture of Experts) models + +It is often advantageous to exploit the inherent parallelism of experts by using a separate parallelism strategy for the expert layers. vLLM supports large-scale deployment combining Data Parallel attention with Expert or Tensor Parallel MoE layers. See the page on [Data Parallel Deployment](data_parallel_deployment.md) for more information. + ## Running vLLM on a single node vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inference currently requires Ray. From 8a2eff5af18a29fa1d5c72416516d3f0352f61e0 Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Sat, 12 Jul 2025 02:21:52 +0800 Subject: [PATCH 018/552] [Bugfix] Fix OOM in language generation test (#20814) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 --- tests/models/language/generation/test_common.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/models/language/generation/test_common.py b/tests/models/language/generation/test_common.py index 8aba68829b1..ea240d22788 100644 --- a/tests/models/language/generation/test_common.py +++ b/tests/models/language/generation/test_common.py @@ -90,7 +90,7 @@ marks=[pytest.mark.core_model], ), pytest.param( - "Qwen/Qwen1.5-MoE-A2.7B-Chat", + "allenai/OLMoE-1B-7B-0924-Instruct", marks=[pytest.mark.cpu_model], ) ]) From 5756c7729b9496b3f68674fb23785d1f60a3ab5a Mon Sep 17 00:00:00 2001 From: bigmoyan Date: Sat, 12 Jul 2025 04:16:14 +0800 Subject: [PATCH 019/552] Update kimi-k2 tool calling docs, enable unit tests (#20821) Signed-off-by: wangzhengtao Co-authored-by: wangzhengtao Co-authored-by: wangzhengtao Signed-off-by: x22x22 --- docs/features/tool_calling.md | 8 ++++++++ tests/tool_use/test_kimi_k2_tool_parser.py | 2 -- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/docs/features/tool_calling.md b/docs/features/tool_calling.md index d3caeaba65f..35e01861c5d 100644 --- a/docs/features/tool_calling.md +++ b/docs/features/tool_calling.md @@ -282,6 +282,14 @@ Supported models: Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}` +### Kimi-K2 Models (`kimi_k2`) + +Supported models: + +* `moonshotai/Kimi-K2-Instruct` + +Flags: `--tool-call-parser kimi_k2` + ### Models with Pythonic Tool Calls (`pythonic`) A growing number of 
models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models. diff --git a/tests/tool_use/test_kimi_k2_tool_parser.py b/tests/tool_use/test_kimi_k2_tool_parser.py index 8768203a711..bd030632f16 100644 --- a/tests/tool_use/test_kimi_k2_tool_parser.py +++ b/tests/tool_use/test_kimi_k2_tool_parser.py @@ -10,8 +10,6 @@ from vllm.entrypoints.openai.tool_parsers import KimiK2ToolParser from vllm.transformers_utils.tokenizer import get_tokenizer -pytest.skip("skip kimi_k2 parser test", allow_module_level=True) - # Use a common model that is likely to be available MODEL = "moonshotai/Kimi-K2-Instruct" From f4586db0e46c868e6f13340b38997ebe3ddd122d Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Fri, 11 Jul 2025 21:57:24 -0400 Subject: [PATCH 020/552] [CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' (#20845) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- vllm/entrypoints/utils.py | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/vllm/entrypoints/utils.py b/vllm/entrypoints/utils.py index 423b99dbe56..6c37ce818e6 100644 --- a/vllm/entrypoints/utils.py +++ b/vllm/entrypoints/utils.py @@ -33,10 +33,12 @@ async def listen_for_disconnect(request: Request) -> None: while True: message = await request.receive() if message["type"] == "http.disconnect": - if request.app.state.enable_server_load_tracking: - # on timeout/cancellation the BackgroundTask in load_aware_call - # cannot decrement the server load metrics. - # Must be decremented by with_cancellation instead. + # If load tracking is enabled *and* the counter exists, decrement + # it. Combines the previous nested checks into a single condition + # to satisfy the linter rule. 
+ if (getattr(request.app.state, "enable_server_load_tracking", + False) + and hasattr(request.app.state, "server_load_metrics")): request.app.state.server_load_metrics -= 1 break @@ -101,9 +103,14 @@ async def wrapper(*args, **kwargs): raise ValueError( "raw_request required when server load tracking is enabled") - if not raw_request.app.state.enable_server_load_tracking: + if not getattr(raw_request.app.state, "enable_server_load_tracking", + False): return await func(*args, **kwargs) + # ensure the counter exists + if not hasattr(raw_request.app.state, "server_load_metrics"): + raw_request.app.state.server_load_metrics = 0 + raw_request.app.state.server_load_metrics += 1 try: response = await func(*args, **kwargs) From 06333ce6ab3966fd5bb8263ecf188cd18cf98d92 Mon Sep 17 00:00:00 2001 From: Ilya Markov Date: Sat, 12 Jul 2025 03:58:15 +0200 Subject: [PATCH 021/552] Integration SM100 FlashInfer fused allreduce RMSNorm (#20691) Signed-off-by: ilmarkov Co-authored-by: ilmarkov Signed-off-by: x22x22 --- tests/compile/test_fusion_all_reduce.py | 152 ++++++++++ vllm/compilation/collective_fusion.py | 356 +++++++++++++++++++++++- vllm/compilation/pass_manager.py | 8 +- vllm/config.py | 4 + 4 files changed, 514 insertions(+), 6 deletions(-) create mode 100644 tests/compile/test_fusion_all_reduce.py diff --git a/tests/compile/test_fusion_all_reduce.py b/tests/compile/test_fusion_all_reduce.py new file mode 100644 index 00000000000..7101857210a --- /dev/null +++ b/tests/compile/test_fusion_all_reduce.py @@ -0,0 +1,152 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from importlib.util import find_spec + +import pytest +import torch + +import vllm.envs as envs +from vllm.compilation.collective_fusion import AllReduceFusionPass +from vllm.config import (CompilationConfig, CompilationLevel, DeviceConfig, + ModelConfig, PassConfig, VllmConfig) +from vllm.distributed import tensor_model_parallel_all_reduce +from vllm.distributed.parallel_state import (init_distributed_environment, + initialize_model_parallel) +from vllm.model_executor.layers.layernorm import RMSNorm +from vllm.platforms import current_platform +from vllm.utils import update_environment_variables + +from ..utils import multi_gpu_test +from .backend import TestBackend + + +class TestAllReduceRMSNormModel(torch.nn.Module): + + def __init__(self, hidden_size=16, eps=1e-6): + super().__init__() + self.hidden_size = hidden_size + self.eps = eps + self.norm = RMSNorm(hidden_size, eps) + + def forward(self, hidden_states, residual): + view = hidden_states.reshape(-1, self.hidden_size) + all_reduce = tensor_model_parallel_all_reduce(view) + norm = self.norm(all_reduce) + return norm + + def ops_in_model_before(self): + return [torch.ops.vllm.all_reduce.default] + + def ops_in_model_after(self): + return [torch.ops.vllm.flashinfer_trtllm_fused_allreduce_norm.default] + + +class TestAllReduceFusedAddRMSNormModel(torch.nn.Module): + + def __init__(self, hidden_size=16, eps=1e-6): + super().__init__() + self.hidden_size = hidden_size + self.eps = eps + self.norm = RMSNorm(hidden_size, eps) + + def forward(self, hidden_states, residual): + view = hidden_states.reshape(-1, self.hidden_size) + all_reduce = tensor_model_parallel_all_reduce(view) + norm, _ = self.norm(all_reduce, residual) + return norm + + def ops_in_model_before(self): + return [torch.ops.vllm.all_reduce.default] + + def ops_in_model_after(self): + return [torch.ops.vllm.flashinfer_trtllm_fused_allreduce_norm.default] + + 
+@multi_gpu_test(num_gpus=2) +@pytest.mark.parametrize( + "test_model", + [TestAllReduceRMSNormModel, TestAllReduceFusedAddRMSNormModel]) +@pytest.mark.parametrize("batch_size", [8]) +@pytest.mark.parametrize("seq_len", [8]) +@pytest.mark.parametrize("hidden_size", [4096]) +@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16]) +@pytest.mark.skipif(envs.VLLM_TARGET_DEVICE not in ["cuda"], + reason="Only test on CUDA") +@pytest.mark.skipif(not find_spec("flashinfer"), + reason="flashinfer is not installed") +@pytest.mark.skipif(not current_platform.is_device_capability(100), + reason="Only test on SM100") +def test_all_reduce_fusion_pass_replace(test_model: torch.nn.Module, + batch_size: int, seq_len: int, + hidden_size: int, dtype: torch.dtype): + num_processes = 2 + + def run_torch_spawn(fn, nprocs): + torch.multiprocessing.spawn(fn, + args=(num_processes, test_model, + batch_size, seq_len, hidden_size, + dtype), + nprocs=nprocs) + + run_torch_spawn(all_reduce_fusion_pass_on_test_model, num_processes) + + +def all_reduce_fusion_pass_on_test_model(local_rank: int, world_size: int, + test_model_cls: torch.nn.Module, + batch_size: int, seq_len: int, + hidden_size: int, dtype: torch.dtype): + current_platform.seed_everything(0) + + device = torch.device(f"cuda:{local_rank}") + torch.cuda.set_device(device) + torch.set_default_device(device) + torch.set_default_dtype(dtype) + + update_environment_variables({ + 'RANK': str(local_rank), + 'LOCAL_RANK': str(local_rank), + 'WORLD_SIZE': str(world_size), + 'MASTER_ADDR': 'localhost', + 'MASTER_PORT': '12345', + }) + + init_distributed_environment() + initialize_model_parallel(tensor_model_parallel_size=world_size) + + vllm_config = VllmConfig( + compilation_config=CompilationConfig(level=CompilationLevel.PIECEWISE, + custom_ops=["+rms_norm"], + compile_sizes=[2, 4, 8])) + vllm_config.compilation_config.pass_config = PassConfig( + enable_fi_allreduce_fusion=True) + vllm_config.device_config = DeviceConfig(device=torch.device("cuda")) + + # this is a fake model name to construct the model config + # in the vllm_config, it's not really used. + model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e" + vllm_config.model_config = ModelConfig(model=model_name, + task="auto", + tokenizer=model_name, + tokenizer_mode="auto", + trust_remote_code=True, + dtype=dtype, + seed=42) + + all_reduce_fusion_pass = AllReduceFusionPass( + vllm_config, vllm_config.compilation_config.pass_config. 
+ fi_allreduce_fusion_max_token_num) + backend = TestBackend(all_reduce_fusion_pass) + + model = test_model_cls(hidden_size) + + hidden_states = torch.randn((batch_size * seq_len, hidden_size), + requires_grad=False) + residual = torch.randn((batch_size * seq_len, hidden_size), + requires_grad=False) + + compiled_model = torch.compile(model, backend=backend) + compiled_model(hidden_states, residual) + + backend.check_before_ops(model.ops_in_model_before(), fully_replaced=False) + backend.check_after_ops(model.ops_in_model_after()) + del all_reduce_fusion_pass diff --git a/vllm/compilation/collective_fusion.py b/vllm/compilation/collective_fusion.py index f754fc2388b..5892669a3a9 100644 --- a/vllm/compilation/collective_fusion.py +++ b/vllm/compilation/collective_fusion.py @@ -1,23 +1,39 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from importlib.util import find_spec from typing import Optional import torch import torch._inductor.pattern_matcher as pm import torch.fx as fx +from torch._higher_order_ops.auto_functionalize import auto_functionalized from torch._inductor.pattern_matcher import PatternMatcherPass from torch.distributed._symmetric_memory import enable_symm_mem_for_group from vllm.config import VllmConfig -from vllm.distributed import get_tp_group +from vllm.distributed import get_tp_group, tensor_model_parallel_all_reduce from vllm.distributed.parallel_state import ( - get_tensor_model_parallel_world_size) + get_tensor_model_parallel_rank, get_tensor_model_parallel_world_size) from vllm.logger import init_logger +from vllm.utils import direct_register_custom_op from .vllm_inductor_pass import VllmInductorPass +if find_spec("flashinfer"): + import flashinfer.comm as flashinfer_comm + + flashinfer_comm = (flashinfer_comm if hasattr( + flashinfer_comm, "trtllm_allreduce_fusion") else None) +else: + flashinfer_comm = None +from vllm.platforms import current_platform + logger = init_logger(__name__) +ALLREDUCE_OP = torch.ops.vllm.all_reduce.default +RMS_OP = torch.ops._C.rms_norm.default +RMS_ADD_OP = torch.ops._C.fused_add_rms_norm.default + class BasePattern: @@ -43,7 +59,8 @@ def pattern(mul: torch.Tensor, mm_weight: torch.Tensor): mm, dim=0, world_size=self.tp_size, - group_name=self.tp.unique_name) + group_name=self.tp.unique_name, + ) return reduce_scatter def replacement(mul: torch.Tensor, mm_weight: torch.Tensor): @@ -79,7 +96,8 @@ def pattern( x, dim=0, world_size=self.tp_size, - group_name=self.tp.unique_name) + group_name=self.tp.unique_name, + ) return torch.ops.aten.mm.default(all_gather, weight) @@ -125,3 +143,333 @@ def __call__(self, graph: fx.Graph): logger.debug("Replaced %s patterns", count) self.dump_graph(graph, "after_async_tp_pass") self.end_and_log() + + +if flashinfer_comm is not None: + _FI_WORKSPACE_TENSOR = None + + MiB = 1024 * 1024 + # Max size of the input tensor per world size + # to use flashinfer fused allreduce + _FI_MAX_SIZES = { + 2: MiB, # 1MB + 4: MiB, # 1MB + 6: MiB // 2, # 512KB + 8: MiB // 2, # 512KB + } + + def call_trtllm_fused_allreduce_norm( + allreduce_in: torch.Tensor, + residual: torch.Tensor, + rms_gamma: torch.Tensor, + rms_eps: float, + world_rank: int, + world_size: int, + launch_with_pdl: bool, + trigger_completion_at_end: bool, + fp32_acc: bool, + max_token_num: int, + norm_out: Optional[torch.Tensor] = None, + ) -> None: + use_flashinfer = allreduce_in.shape[0] * allreduce_in.shape[ + 1] * allreduce_in.element_size() <= min( + _FI_MAX_SIZES[world_size], + 
max_token_num * allreduce_in.shape[0] * + allreduce_in.element_size(), + ) + if use_flashinfer: + assert (_FI_WORKSPACE_TENSOR is not None + ), "Flashinfer must be enabled when using flashinfer" + if norm_out is None: + norm_out = allreduce_in + residual_out = residual + else: + # return residual_out as allreduce_out with zeroed residual_in + # as flashinfer does not support rms_norm + # and allreduce_out together + residual_out = allreduce_in + # For the sizes that are smaller than the max size, + # we only use flashinfer one shot allreduce + flashinfer_comm.trtllm_allreduce_fusion( + allreduce_in=allreduce_in, + token_num=allreduce_in.shape[0], + residual_in=residual, + residual_out=residual_out, + norm_out=norm_out, + rms_gamma=rms_gamma, + rms_eps=rms_eps, + world_rank=world_rank, + world_size=world_size, + hidden_dim=allreduce_in.shape[-1], + workspace_ptrs=_FI_WORKSPACE_TENSOR, + launch_with_pdl=launch_with_pdl, + use_oneshot=True, + trigger_completion_at_end=trigger_completion_at_end, + fp32_acc=fp32_acc, + pattern_code=flashinfer_comm.AllReduceFusionPattern. + kARResidualRMSNorm, + allreduce_out=None, + quant_out=None, + scale_out=None, + layout_code=None, + scale_factor=None, + ) + else: + allreduce_out = tensor_model_parallel_all_reduce(allreduce_in) + if norm_out is None: + torch.ops._C.fused_add_rms_norm(allreduce_out, residual, + rms_gamma, rms_eps) + else: + torch.ops._C.rms_norm(norm_out, allreduce_out, rms_gamma, + rms_eps) + allreduce_in.copy_(allreduce_out) + + def call_trtllm_fused_allreduce_norm_fake( + allreduce_in: torch.Tensor, + residual: torch.Tensor, + rms_gamma: torch.Tensor, + rms_eps: float, + world_rank: int, + world_size: int, + launch_with_pdl: bool, + trigger_completion_at_end: bool, + fp32_acc: bool, + max_token_num: int, + norm_out: Optional[torch.Tensor] = None, + ) -> None: + pass + + direct_register_custom_op( + op_name="flashinfer_trtllm_fused_allreduce_norm", + op_func=call_trtllm_fused_allreduce_norm, + mutates_args=[ + "allreduce_in", + "residual", + "norm_out", + ], + fake_impl=call_trtllm_fused_allreduce_norm_fake, + dispatch_key=current_platform.dispatch_key, + ) + flashinfer_trtllm_fused_allreduce_norm = ( + torch.ops.vllm.flashinfer_trtllm_fused_allreduce_norm.default) + + +class FlashInferFusedAllReduceParams: + """Parameters for FlashInfer fused allreduce operations.""" + + def __init__( + self, + rank: int, + world_size: int, + use_fp32_lamport: bool = False, + max_token_num: int = 1024, + ): + self.rank = rank + self.world_size = world_size + self.use_fp32_lamport = use_fp32_lamport + self.trigger_completion_at_end = True + self.launch_with_pdl = True + self.fp32_acc = True + self.use_oneshot = False + self.max_token_num = max_token_num + + def get_trtllm_fused_allreduce_kwargs(self): + return { + "world_rank": self.rank, + "world_size": self.world_size, + "launch_with_pdl": self.launch_with_pdl, + "trigger_completion_at_end": self.trigger_completion_at_end, + "fp32_acc": self.fp32_acc, + "max_token_num": self.max_token_num, + } + + +class AllReduceRMSNORMPattern(BasePattern): + + def __init__( + self, + epsilon: float, + dtype: torch.dtype, + device: str, + allreduce_params: FlashInferFusedAllReduceParams, + ): + super().__init__(dtype, device) + self.epsilon = epsilon + self.allreduce_params = allreduce_params + + def get_inputs(self): + input = torch.empty([1, 8, 4], device=self.device, dtype=self.dtype) + rms_result = torch.empty([1, 8, 4], + device=self.device, + dtype=self.dtype) + weight = torch.empty([4], device=self.device, 
dtype=self.dtype) + + return [input, rms_result, weight] + + def register(self, pm_pass: PatternMatcherPass): + + def pattern(input: torch.Tensor, rms_result: torch.Tensor, + weight: torch.Tensor): + all_reduce_output = tensor_model_parallel_all_reduce(input) + rms = auto_functionalized( + RMS_OP, + result=rms_result, + input=all_reduce_output, + weight=weight, + epsilon=self.epsilon, + ) + return rms[1], all_reduce_output + + def replacement(input: torch.Tensor, rms_result: torch.Tensor, + weight: torch.Tensor): + residual = torch.zeros_like(input) + allreduce = auto_functionalized( + torch.ops.vllm.flashinfer_trtllm_fused_allreduce_norm.default, + allreduce_in=input, + residual=residual, + norm_out=rms_result, + rms_gamma=weight, + rms_eps=self.epsilon, + **self.allreduce_params.get_trtllm_fused_allreduce_kwargs(), + ) + + return allreduce[3], allreduce[1] + + pm.register_replacement(pattern, replacement, self.get_inputs(), + pm.fwd_only, pm_pass) + + +class AllReduceFusedAddRMSNormPattern(BasePattern): + + def __init__( + self, + epsilon: float, + dtype: torch.dtype, + device: str, + allreduce_params: FlashInferFusedAllReduceParams, + ): + super().__init__(dtype, device) + self.epsilon = epsilon + self.allreduce_params = allreduce_params + + def get_inputs(self): + input = torch.empty([4, 4], device=self.device, dtype=self.dtype) + residual = torch.empty([4, 4], device=self.device, dtype=self.dtype) + weight = torch.empty([4, 4], device=self.device, dtype=self.dtype) + return [ + residual, + input, + weight, + ] + + def register(self, pm_pass: PatternMatcherPass): + + def pattern(residual: torch.Tensor, input: torch.Tensor, + weight: torch.Tensor): + all_reduce_output = tensor_model_parallel_all_reduce(input) + rms = auto_functionalized( + RMS_ADD_OP, + input=all_reduce_output, + residual=residual, + weight=weight, + epsilon=self.epsilon, + ) + return rms[1], rms[2] + + def replacement(residual: torch.Tensor, input: torch.Tensor, + weight: torch.Tensor): + allreduce = auto_functionalized( + torch.ops.vllm.flashinfer_trtllm_fused_allreduce_norm.default, + allreduce_in=input, + residual=residual, + rms_gamma=weight, + rms_eps=self.epsilon, + norm_out=None, + **self.allreduce_params.get_trtllm_fused_allreduce_kwargs(), + ) + return allreduce[1], allreduce[2] + + pm.register_replacement(pattern, replacement, self.get_inputs(), + pm.fwd_only, pm_pass) + + +class AllReduceFusionPass(VllmInductorPass): + + def __init__(self, config: VllmConfig, max_token_num: int): + super().__init__(config) + self.disabled = True + self.tp_size = get_tensor_model_parallel_world_size() + if self.tp_size <= 1: + return + self.patterns: PatternMatcherPass = PatternMatcherPass( + pass_name="all_reduce_fusion_pass") + if config.model_config is None: + return + self.hidden_dim = config.model_config.get_hidden_size() + self.group = get_tp_group().device_group + rank = get_tensor_model_parallel_rank() + use_fp32_lamport = self.model_dtype == torch.float32 + if flashinfer_comm is None: + logger.warning( + "Flashinfer is not installed, skipping allreduce fusion pass") + return + # Check if the world size is supported + if self.tp_size not in _FI_MAX_SIZES: + logger.warning( + "Flashinfer allreduce fusion is not " + "supported for world size %s", + self.tp_size, + ) + return + + self.ipc_handles, workspace_tensor = ( + flashinfer_comm.trtllm_create_ipc_workspace_for_all_reduce_fusion( + tp_rank=rank, + tp_size=self.tp_size, + max_token_num=max_token_num, + hidden_dim=self.hidden_dim, + group=self.group, + 
use_fp32_lamport=use_fp32_lamport, + )) + + global _FI_WORKSPACE_TENSOR + _FI_WORKSPACE_TENSOR = workspace_tensor + self.allreduce_params = FlashInferFusedAllReduceParams( + rank=rank, + world_size=self.tp_size, + use_fp32_lamport=use_fp32_lamport, + max_token_num=max_token_num, + ) + + for epsilon in [1e-5, 1e-6]: + AllReduceRMSNORMPattern( + epsilon, + self.model_dtype, + self.device, + self.allreduce_params, + ).register(self.patterns) + AllReduceFusedAddRMSNormPattern( + epsilon, + self.model_dtype, + self.device, + self.allreduce_params, + ).register(self.patterns) + + self.disabled = False + + def __call__(self, graph: fx.Graph): + if self.disabled: + return + self.begin() + self.dump_graph(graph, "before_all_reduce_fusion_pass") + count = self.patterns.apply(graph) + logger.debug("Replaced %s patterns", count) + self.dump_graph(graph, "after_all_reduce_fusion_pass") + self.end_and_log() + + def __del__(self): + if self.disabled: + return + if flashinfer_comm is not None: + flashinfer_comm.trtllm_destroy_ipc_workspace( + self.ipc_handles, self.group) diff --git a/vllm/compilation/pass_manager.py b/vllm/compilation/pass_manager.py index 3ce00e3610c..078188854f0 100644 --- a/vllm/compilation/pass_manager.py +++ b/vllm/compilation/pass_manager.py @@ -7,7 +7,7 @@ from vllm.logger import init_logger from .activation_quant_fusion import ActivationQuantFusionPass -from .collective_fusion import AsyncTPPass +from .collective_fusion import AllReduceFusionPass, AsyncTPPass from .fix_functionalization import FixFunctionalizationPass from .fusion import FusionPass from .fusion_attn import AttnFusionPass @@ -62,7 +62,11 @@ def configure(self, config: VllmConfig): if self.pass_config.enable_attn_fusion: self.passes += [AttnFusionPass(config)] - + if self.pass_config.enable_fi_allreduce_fusion: + self.passes += [ + AllReduceFusionPass( + config, self.pass_config.fi_allreduce_fusion_max_token_num) + ] self.fix_functionalization = FixFunctionalizationPass(config) def add(self, pass_: InductorPass): diff --git a/vllm/config.py b/vllm/config.py index 344fe0142d2..d3774a18b06 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -3991,6 +3991,10 @@ class PassConfig: """Whether to enable sequence parallelism.""" enable_async_tp: bool = False """Whether to enable async TP.""" + enable_fi_allreduce_fusion: bool = False + """Whether to enable flashinfer allreduce fusion.""" + fi_allreduce_fusion_max_token_num: int = 1024 + """Max number of tokens to used in flashinfer allreduce fusion.""" # TODO(luka) better pass enabling system. 
From 0c3a2b8be2d011d79223893547d218b5be259efb Mon Sep 17 00:00:00 2001 From: Trevor Morris Date: Fri, 11 Jul 2025 18:59:23 -0700 Subject: [PATCH 022/552] Add pynccl all-gatherv and reducescatterv (#20154) Signed-off-by: Trevor Morris Signed-off-by: mgoin Co-authored-by: mgoin Signed-off-by: x22x22 --- tests/distributed/test_pynccl.py | 70 ++++++++++++++++ .../base_device_communicator.py | 16 +++- .../device_communicators/cuda_communicator.py | 83 ++++++++++++++++++- .../device_communicators/pynccl.py | 72 ++++++++++++++++ .../device_communicators/pynccl_wrapper.py | 33 ++++++++ vllm/distributed/parallel_state.py | 12 +++ 6 files changed, 284 insertions(+), 2 deletions(-) diff --git a/tests/distributed/test_pynccl.py b/tests/distributed/test_pynccl.py index 5b32b90f3cf..abfad9ebfe7 100644 --- a/tests/distributed/test_pynccl.py +++ b/tests/distributed/test_pynccl.py @@ -4,6 +4,7 @@ import multiprocessing import os +import numpy as np import pytest import torch import torch.distributed @@ -177,6 +178,38 @@ def test_pynccl_all_gather(): distributed_run(all_gather_worker_fn, 2) +@worker_fn_wrapper +def all_gatherv_worker_fn(): + pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group, + device=get_world_group().device) + + rank = pynccl_comm.rank + world_size = pynccl_comm.world_size + device = f'cuda:{pynccl_comm.rank}' + + assert world_size <= 8 + sizes = [81, 20, 57, 52, 81, 5, 49, 49][:world_size] + num_elems = sizes[rank] + tensor = torch.arange(num_elems, dtype=torch.float32, + device=device) + rank * 100 + result = torch.zeros(sum(sizes), dtype=torch.float32, device=device) + + expected = torch.cat([ + torch.arange(sizes[r], dtype=torch.float32) + r * 100 + for r in range(world_size) + ]).to(device) + + pynccl_comm.all_gatherv(result, tensor, sizes=sizes) + torch.cuda.synchronize() + torch.testing.assert_close(result, expected, rtol=1e-5, atol=1e-8) + + +@pytest.mark.skipif(torch.cuda.device_count() < 2, + reason="Need at least 2 GPUs to run the test.") +def test_pynccl_all_gatherv(): + distributed_run(all_gatherv_worker_fn, 2) + + @worker_fn_wrapper def reduce_scatter_worker_fn(): pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group, @@ -214,6 +247,43 @@ def test_pynccl_reduce_scatter(): distributed_run(reduce_scatter_worker_fn, 2) +@worker_fn_wrapper +def reduce_scatterv_worker_fn(): + pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group, + device=get_world_group().device) + + rank = pynccl_comm.rank + world_size = pynccl_comm.world_size + device = f'cuda:{pynccl_comm.rank}' + + assert world_size <= 8 + sizes = [81, 20, 57, 52, 81, 5, 49, 49][:world_size] + num_elems = sum(sizes) + tensor = torch.arange(num_elems, dtype=torch.float32, + device=device) + rank * 100 + result = torch.zeros(sizes[rank], dtype=torch.float32, device=device) + + # Calculate expected result for this rank's chunk + all_tensors = [ + torch.arange(num_elems, dtype=torch.float32) + r * 100 + for r in range(world_size) + ] + sizes_cumsum = np.cumsum(sizes) + start = 0 if rank == 0 else sizes_cumsum[rank - 1] + end = sizes_cumsum[rank] + expected = sum(tensor[start:end] for tensor in all_tensors).to(device) + + pynccl_comm.reduce_scatterv(result, tensor, sizes=sizes) + torch.cuda.synchronize() + torch.testing.assert_close(result, expected, rtol=1e-5, atol=1e-8) + + +@pytest.mark.skipif(torch.cuda.device_count() < 2, + reason="Need at least 2 GPUs to run the test.") +def test_pynccl_reduce_scatterv(): + distributed_run(reduce_scatterv_worker_fn, 2) + + 
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="Need at least 2 GPUs to run the test.") def test_pynccl_with_cudagraph(): diff --git a/vllm/distributed/device_communicators/base_device_communicator.py b/vllm/distributed/device_communicators/base_device_communicator.py index eb467bb0736..dc5923cdc5a 100644 --- a/vllm/distributed/device_communicators/base_device_communicator.py +++ b/vllm/distributed/device_communicators/base_device_communicator.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import threading -from typing import Optional +from typing import Optional, Union from weakref import WeakValueDictionary import torch @@ -138,6 +138,14 @@ def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor: input_size[dim + 1:]) return output_tensor + def all_gatherv( + self, + input_: Union[torch.Tensor, list[torch.Tensor]], + dim: int = 0, + sizes: Optional[list[int]] = None + ) -> Union[torch.Tensor, list[torch.Tensor]]: + raise NotImplementedError + def reduce_scatter(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor: @@ -172,6 +180,12 @@ def reduce_scatter(self, # Reshape before returning return output_tensor.movedim(0, dim).contiguous() + def reduce_scatterv(self, + input_: torch.Tensor, + dim: int = -1, + sizes: Optional[list[int]] = None) -> torch.Tensor: + raise NotImplementedError + def gather(self, input_: torch.Tensor, dst: int = 0, diff --git a/vllm/distributed/device_communicators/cuda_communicator.py b/vllm/distributed/device_communicators/cuda_communicator.py index 3958d566b17..e4804691f0f 100644 --- a/vllm/distributed/device_communicators/cuda_communicator.py +++ b/vllm/distributed/device_communicators/cuda_communicator.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from typing import Optional, Union import torch from torch.distributed import ProcessGroup @@ -142,6 +142,42 @@ def reduce_scatter(self, input_: torch.Tensor, dim: int = -1): # Reshape before returning return output.movedim(0, dim).contiguous() + def reduce_scatterv(self, + input_: torch.Tensor, + dim: int = -1, + sizes: Optional[list[int]] = None): + world_size = self.world_size + pynccl_comm = self.pynccl_comm + assert pynccl_comm is not None + if dim < 0: + # Convert negative dim to positive. + dim += input_.dim() + + # Note: This will produce an incorrect answer if we don't make + # the input_tensor contiguous. Possible bug in reduce_scatter_tensor? 
+ input_tensor = input_.movedim(0, dim).contiguous() + + if sizes is not None: + assert len(sizes) == world_size + assert input_tensor.shape[0] == sum(sizes) + chunk_size = sizes[self.rank_in_group] + else: + assert input_tensor.shape[0] % world_size == 0 + chunk_size = input_tensor.shape[0] // world_size + output_shape = (chunk_size, ) + input_tensor.shape[1:] + + output = torch.empty(output_shape, + dtype=input_tensor.dtype, + device=input_tensor.device) + + if sizes is not None: + pynccl_comm.reduce_scatterv(output, input_, sizes=sizes) + else: + pynccl_comm.reduce_scatter(output, input_) + + # Reshape before returning + return output.movedim(0, dim).contiguous() + def send(self, tensor: torch.Tensor, dst: Optional[int] = None) -> None: """Sends a tensor to the destination rank in a non-blocking way""" """NOTE: `dst` is the local rank of the destination rank.""" @@ -180,6 +216,51 @@ def destroy(self): self.all2all_manager.destroy() self.all2all_manager = None + def all_gatherv(self, + input_: Union[torch.Tensor, list[torch.Tensor]], + dim: int = 0, + sizes: Optional[list[int]] = None): + if dim != 0: + raise NotImplementedError("only dim 0 all-gatherv is supported") + world_size = self.world_size + pynccl_comm = self.pynccl_comm + assert pynccl_comm is not None and not pynccl_comm.disabled + + # 'sizes' is not needed if all inputs in the same group have the same + # shape + if sizes is not None and all(s == sizes[0] for s in sizes): + sizes = None + + def _all_gather_single(input_: torch.Tensor, + sizes: Optional[list[int]] = None): + input_size = input_.size() + if sizes is not None: + assert len(sizes) == world_size + assert input_.shape[dim] == sizes[self.rank_in_group] + output_size = (sum(sizes), ) + input_size[1:] + else: + output_size = (input_size[0] * world_size, ) + input_size[1:] + # Allocate output tensor. 
+ output_tensor = torch.empty(output_size, + dtype=input_.dtype, + device=input_.device) + if sizes is not None: + pynccl_comm.all_gatherv(output_tensor, input_, sizes=sizes) + else: + pynccl_comm.all_gather(output_tensor, input_) + return output_tensor + + if isinstance(input_, torch.Tensor): + return _all_gather_single(input_, sizes) + + output_list = [] + pynccl_comm.group_start() + for inp in input_: + output_list.append(_all_gather_single(inp, sizes=sizes)) + pynccl_comm.group_end() + + return output_list + def dispatch( self, hidden_states: torch.Tensor, router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: diff --git a/vllm/distributed/device_communicators/pynccl.py b/vllm/distributed/device_communicators/pynccl.py index 29486292996..502bfd39005 100644 --- a/vllm/distributed/device_communicators/pynccl.py +++ b/vllm/distributed/device_communicators/pynccl.py @@ -152,6 +152,40 @@ def all_gather(self, ncclDataTypeEnum.from_torch(input_tensor.dtype), self.comm, cudaStream_t(stream.cuda_stream)) + def all_gatherv( + self, + output_tensor: torch.Tensor, + input_tensor: torch.Tensor, + sizes: list[int], + stream=None, + ): + if self.disabled: + return + # nccl communicator created on a specific device + # will only work on tensors on the same device + # otherwise it will cause "illegal memory access" + assert input_tensor.device == self.device, ( + f"this nccl communicator is created to work on {self.device}, " + f"but the input tensor is on {input_tensor.device}") + if stream is None: + stream = current_stream() + assert output_tensor.shape[0] == sum(sizes) + split_offset = 0 + self.nccl.ncclGroupStart() + for root, split_size in enumerate(sizes): + dst_slice = output_tensor[split_offset:split_offset + split_size] + self.nccl.ncclBroadcast( + buffer_type(input_tensor.data_ptr()), + buffer_type(dst_slice.data_ptr()), + dst_slice.numel(), + ncclDataTypeEnum.from_torch(input_tensor.dtype), + root, + self.comm, + cudaStream_t(stream.cuda_stream), + ) + split_offset += split_size + self.nccl.ncclGroupEnd() + def reduce_scatter(self, output_tensor: torch.Tensor, input_tensor: torch.Tensor, @@ -174,6 +208,38 @@ def reduce_scatter(self, ncclRedOpTypeEnum.from_torch(op), self.comm, cudaStream_t(stream.cuda_stream)) + def reduce_scatterv( + self, + output_tensor: torch.Tensor, + input_tensor: torch.Tensor, + sizes: list[int], + op: ReduceOp = ReduceOp.SUM, + stream=None, + ): + if self.disabled: + return + # nccl communicator created on a specific device + # will only work on tensors on the same device + # otherwise it will cause "illegal memory access" + assert input_tensor.device == self.device, ( + f"this nccl communicator is created to work on {self.device}, " + f"but the input tensor is on {input_tensor.device}") + if stream is None: + stream = current_stream() + + split_offset = 0 + self.nccl.ncclGroupStart() + for root, split_size in enumerate(sizes): + chunk = input_tensor[split_offset:split_offset + split_size, ...] 
+ self.nccl.ncclReduce( + buffer_type(chunk.data_ptr()), + buffer_type(output_tensor.data_ptr()), chunk.numel(), + ncclDataTypeEnum.from_torch(input_tensor.dtype), + ncclRedOpTypeEnum.from_torch(op), root, self.comm, + cudaStream_t(stream.cuda_stream)) + split_offset += split_size + self.nccl.ncclGroupEnd() + def send(self, tensor: torch.Tensor, dst: int, stream=None): if self.disabled: return @@ -216,3 +282,9 @@ def broadcast(self, tensor: torch.Tensor, src: int, stream=None): self.nccl.ncclBroadcast(sendbuff, recvbuff, tensor.numel(), ncclDataTypeEnum.from_torch(tensor.dtype), src, self.comm, cudaStream_t(stream.cuda_stream)) + + def group_start(self): + self.nccl.ncclGroupStart() + + def group_end(self): + self.nccl.ncclGroupEnd() diff --git a/vllm/distributed/device_communicators/pynccl_wrapper.py b/vllm/distributed/device_communicators/pynccl_wrapper.py index 3018a92da07..a930b63bc26 100644 --- a/vllm/distributed/device_communicators/pynccl_wrapper.py +++ b/vllm/distributed/device_communicators/pynccl_wrapper.py @@ -154,6 +154,17 @@ class NCCLLibrary: ncclRedOp_t, ncclComm_t, cudaStream_t ]), + # ncclResult_t ncclReduce( + # const void* sendbuff, void* recvbuff, size_t count, + # ncclDataType_t datatype, ncclRedOp_t op, int root, + # ncclComm_t comm, cudaStream_t stream); + # note that cudaStream_t is a pointer type, so the last argument + # is a pointer + Function("ncclReduce", ncclResult_t, [ + buffer_type, buffer_type, ctypes.c_size_t, ncclDataType_t, + ncclRedOp_t, ctypes.c_int, ncclComm_t, cudaStream_t + ]), + # ncclResult_t ncclAllGather( # const void* sendbuff, void* recvbuff, size_t count, # ncclDataType_t datatype, ncclComm_t comm, @@ -207,6 +218,10 @@ class NCCLLibrary: # it is better not to call it at all. # ncclResult_t ncclCommDestroy(ncclComm_t comm); Function("ncclCommDestroy", ncclResult_t, [ncclComm_t]), + # ncclResult_t ncclGroupStart(); + Function("ncclGroupStart", ncclResult_t, []), + # ncclResult_t ncclGroupEnd(); + Function("ncclGroupEnd", ncclResult_t, []), ] # class attribute to store the mapping from the path to the library @@ -300,6 +315,18 @@ def ncclAllReduce(self, sendbuff: buffer_type, recvbuff: buffer_type, datatype, op, comm, stream)) + def ncclReduce(self, sendbuff: buffer_type, recvbuff: buffer_type, + count: int, datatype: int, op: int, root: int, + comm: ncclComm_t, stream: cudaStream_t) -> None: + # `datatype` actually should be `ncclDataType_t` + # and `op` should be `ncclRedOp_t` + # both are aliases of `ctypes.c_int` + # when we pass int to a function, it will be converted to `ctypes.c_int` + # by ctypes automatically + self.NCCL_CHECK(self._funcs["ncclReduce"](sendbuff, recvbuff, count, + datatype, op, root, comm, + stream)) + def ncclReduceScatter(self, sendbuff: buffer_type, recvbuff: buffer_type, count: int, datatype: int, op: int, comm: ncclComm_t, stream: cudaStream_t) -> None: @@ -342,6 +369,12 @@ def ncclBroadcast(self, sendbuff: buffer_type, recvbuff: buffer_type, def ncclCommDestroy(self, comm: ncclComm_t) -> None: self.NCCL_CHECK(self._funcs["ncclCommDestroy"](comm)) + def ncclGroupStart(self) -> None: + self.NCCL_CHECK(self._funcs["ncclGroupStart"]()) + + def ncclGroupEnd(self) -> None: + self.NCCL_CHECK(self._funcs["ncclGroupEnd"]()) + __all__ = [ "NCCLLibrary", "ncclDataTypeEnum", "ncclRedOpTypeEnum", "ncclUniqueId", diff --git a/vllm/distributed/parallel_state.py b/vllm/distributed/parallel_state.py index 495a758e606..1bb0ca79cc1 100644 --- a/vllm/distributed/parallel_state.py +++ b/vllm/distributed/parallel_state.py @@ -383,6 
+383,12 @@ def _all_gather_out_place(self, input_: torch.Tensor, dim: int) -> torch.Tensor: return self.device_communicator.all_gather(input_, dim) + def all_gatherv(self, + input_: Union[torch.Tensor, list[torch.Tensor]], + dim: int = 0, + sizes: Optional[list[int]] = None): + return self.device_communicator.all_gatherv(input_, dim, sizes) + def reduce_scatter(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor: @@ -401,6 +407,12 @@ def reduce_scatter(self, else: return self._reduce_scatter_out_place(input_, dim) + def reduce_scatterv(self, + input_: torch.Tensor, + dim: int = -1, + sizes: Optional[list[int]] = None) -> torch.Tensor: + return self.device_communicator.reduce_scatterv(input_, dim, sizes) + def _reduce_scatter_out_place(self, input_: torch.Tensor, dim: int) -> torch.Tensor: return self.device_communicator.reduce_scatter(input_, dim) From d74c98c2e8bfe65dce73642837cd662714060026 Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Sat, 12 Jul 2025 11:50:42 +0800 Subject: [PATCH 023/552] [Misc] Restrict deep_gemm's log output (#20827) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/deep_gemm_moe.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py index 4c0e6665bdc..433f957a843 100644 --- a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py @@ -43,7 +43,7 @@ def _valid_deep_gemm(hidden_states: torch.Tensor, w1: torch.Tensor, aligned by `dg.get_m_alignment_for_contiguous_layout()`. """ if not has_deep_gemm(): - logger.debug("DeepGemm disabled: deep_gemm not available.") + logger.debug_once("DeepGemm disabled: deep_gemm not available.") return False M = hidden_states.size(0) From ea19b9230c41d08f80230737858f06003dc4637a Mon Sep 17 00:00:00 2001 From: "Li, Jiang" Date: Sat, 12 Jul 2025 11:52:05 +0800 Subject: [PATCH 024/552] [Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices (#20822) Signed-off-by: jiang1.li Signed-off-by: x22x22 --- vllm/model_executor/layers/quantization/bitsandbytes.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/model_executor/layers/quantization/bitsandbytes.py b/vllm/model_executor/layers/quantization/bitsandbytes.py index 20625f587f5..92a46ad65cb 100644 --- a/vllm/model_executor/layers/quantization/bitsandbytes.py +++ b/vllm/model_executor/layers/quantization/bitsandbytes.py @@ -5,7 +5,6 @@ import torch -from vllm.model_executor.layers.fused_moe import fused_experts from vllm.model_executor.layers.fused_moe.layer import (FusedMoE, FusedMoEMethodBase) from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase, @@ -467,6 +466,7 @@ def apply( logical_to_physical_map: Optional[torch.Tensor] = None, logical_replica_count: Optional[torch.Tensor] = None, ) -> torch.Tensor: + from vllm.model_executor.layers.fused_moe import fused_experts if enable_eplb: raise NotImplementedError( From 675d5ed0fd3aad82b5a05bd245d37759c8045faf Mon Sep 17 00:00:00 2001 From: yurhett <46419702+yurhett@users.noreply.github.com> Date: Sat, 12 Jul 2025 11:52:43 +0800 Subject: [PATCH 025/552] [Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading (#20682) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com> Signed-off-by: x22x22 --- tests/models/language/pooling/mteb_utils.py | 5 ++-- 
.../language/pooling/test_qwen3_reranker.py | 27 +++++++++++++++++++ vllm/model_executor/models/adapters.py | 13 +++++---- 3 files changed, 38 insertions(+), 7 deletions(-) diff --git a/tests/models/language/pooling/mteb_utils.py b/tests/models/language/pooling/mteb_utils.py index 847ea5f623f..6c4fde5fdfa 100644 --- a/tests/models/language/pooling/mteb_utils.py +++ b/tests/models/language/pooling/mteb_utils.py @@ -268,7 +268,8 @@ def mteb_test_rerank_models(hf_runner, model_info: RerankModelInfo, vllm_extra_kwargs=None, hf_model_callback=None, - vllm_mteb_encoder=VllmMtebEncoder): + vllm_mteb_encoder=VllmMtebEncoder, + atol=MTEB_RERANK_TOL): if not model_info.enable_test: # A model family has many models with the same architecture, # and we don't need to test each one. @@ -301,4 +302,4 @@ def mteb_test_rerank_models(hf_runner, print("SentenceTransformers:", st_dtype, st_main_score) print("Difference:", st_main_score - vllm_main_score) - assert st_main_score == pytest.approx(vllm_main_score, abs=MTEB_RERANK_TOL) + assert st_main_score == pytest.approx(vllm_main_score, abs=atol) diff --git a/tests/models/language/pooling/test_qwen3_reranker.py b/tests/models/language/pooling/test_qwen3_reranker.py index 9f040639c78..9c6a833b413 100644 --- a/tests/models/language/pooling/test_qwen3_reranker.py +++ b/tests/models/language/pooling/test_qwen3_reranker.py @@ -6,6 +6,7 @@ import torch from tests.conftest import HfRunner +from tests.utils import multi_gpu_test from .mteb_utils import RerankModelInfo, mteb_test_rerank_models @@ -87,3 +88,29 @@ def test_rerank_models_mteb(vllm_runner, model_info: RerankModelInfo) -> None: mteb_test_rerank_models(Qwen3RerankerHfRunner, vllm_runner, model_info, vllm_extra_kwargs) + + +@pytest.mark.parametrize("model_info", RERANK_MODELS) +@multi_gpu_test(num_gpus=2) +def test_rerank_models_mteb_tp(vllm_runner, + model_info: RerankModelInfo) -> None: + + assert model_info.architecture == "Qwen3ForSequenceClassification" + + vllm_extra_kwargs: dict[str, Any] = { + "hf_overrides": { + "architectures": ["Qwen3ForSequenceClassification"], + "classifier_from_token": ["no", "yes"], + "is_original_qwen3_reranker": True, + }, + "tensor_parallel_size": 2, + } + + if model_info.name == "Qwen/Qwen3-Reranker-4B": + vllm_extra_kwargs["max_num_seqs"] = 1 + + mteb_test_rerank_models(Qwen3RerankerHfRunner, + vllm_runner, + model_info, + vllm_extra_kwargs, + atol=1.2e-2) diff --git a/vllm/model_executor/models/adapters.py b/vllm/model_executor/models/adapters.py index 6584c84436c..dcdf69f773a 100644 --- a/vllm/model_executor/models/adapters.py +++ b/vllm/model_executor/models/adapters.py @@ -322,6 +322,8 @@ def load_weights_using_from_2_way_softmax( # refer to https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3 from vllm.model_executor.layers.vocab_parallel_embedding import ( ParallelLMHead) + from vllm.model_executor.model_loader.weight_utils import ( + default_weight_loader) from vllm.model_executor.models.utils import AutoWeightsLoader model_config = model.vllm_config.model_config @@ -329,8 +331,6 @@ def load_weights_using_from_2_way_softmax( tokens = cast(list[int], tokens) assert len(tokens) == 2 - device = model.score.weight.device - if model.config.tie_word_embeddings: model.lm_head = model.model.embed_tokens else: @@ -349,10 +349,13 @@ def load_weights_using_from_2_way_softmax( false_id = tokenizer.convert_tokens_to_ids(tokens[0]) true_id = tokenizer.convert_tokens_to_ids(tokens[1]) - weight = model.lm_head.weight.data[true_id].to(device).to( - torch.float32) - 
model.lm_head.weight.data[false_id].to(device).to( + weight = model.lm_head.weight.data[[true_id]].to( + torch.float32) - model.lm_head.weight.data[[false_id]].to( torch.float32) - model.score.weight.data.copy_(weight) + + param = model.score.weight + weight_loader = getattr(param, "weight_loader", default_weight_loader) + weight_loader(param, weight) del model.lm_head loaded_weights.add("score.weight") From 70b4321782568365bac75d10e294f794dad74115 Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Sat, 12 Jul 2025 11:53:07 +0800 Subject: [PATCH 026/552] [CI/Build] Ensure compatability with Transformers v4.53 (#20541) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Isotr0py Signed-off-by: x22x22 --- requirements/test.in | 2 +- requirements/test.txt | 2 +- .../multimodal/generation/test_common.py | 4 +-- .../multimodal/processing/test_common.py | 1 + tests/models/test_initialization.py | 12 +++++++-- vllm/inputs/registry.py | 8 +----- vllm/model_executor/models/commandr.py | 7 ++++-- vllm/model_executor/models/fuyu.py | 25 +++++++++++++------ vllm/model_executor/models/gemma3.py | 9 ++++--- vllm/model_executor/models/minicpmo.py | 21 ++++++++-------- vllm/model_executor/models/paligemma.py | 2 +- .../models/qwen2_5_omni_thinker.py | 10 +++++++- vllm/model_executor/models/whisper.py | 9 ++++++- 13 files changed, 74 insertions(+), 38 deletions(-) diff --git a/requirements/test.in b/requirements/test.in index 907d90201a2..1c725df7e60 100644 --- a/requirements/test.in +++ b/requirements/test.in @@ -34,7 +34,7 @@ opencv-python-headless >= 4.11.0 # required for video test datamodel_code_generator # required for minicpm3 test lm-eval[api]==0.4.8 # required for model evaluation test mteb[bm25s]>=1.38.11, <2 # required for mteb test -transformers==4.52.4 +transformers==4.53.2 tokenizers==0.21.1 huggingface-hub[hf_xet]>=0.33.0 # Required for Xet downloads. schemathesis>=3.39.15 # Required for openai schema test. 
diff --git a/requirements/test.txt b/requirements/test.txt index 2f3ccc4f61d..6f500992bb5 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -800,7 +800,7 @@ tqdm==4.66.6 # transformers tqdm-multiprocess==0.0.11 # via lm-eval -transformers==4.52.4 +transformers==4.53.2 # via # -r requirements/test.in # genai-perf diff --git a/tests/models/multimodal/generation/test_common.py b/tests/models/multimodal/generation/test_common.py index ce449489965..98461676aa4 100644 --- a/tests/models/multimodal/generation/test_common.py +++ b/tests/models/multimodal/generation/test_common.py @@ -318,6 +318,7 @@ num_logprobs=10, image_size_factors=[(), (0.25,), (0.25, 0.25, 0.25), (0.25, 0.2, 0.15)], auto_cls=AutoModelForImageTextToText, + marks=[large_gpu_mark(min_gb=32)], ), "glm4_1v-video": VLMTestInfo( models=["THUDM/GLM-4.1V-9B-Thinking"], @@ -331,8 +332,7 @@ inputs=custom_inputs.video_with_metadata_glm4_1v(), limit_mm_per_prompt={"video": 1}, )], - # This is needed to run on machine with 24GB VRAM - vllm_runner_kwargs={"gpu_memory_utilization": 0.95}, + marks=[large_gpu_mark(min_gb=32)], ), "h2ovl": VLMTestInfo( models = [ diff --git a/tests/models/multimodal/processing/test_common.py b/tests/models/multimodal/processing/test_common.py index 0f33225eda2..ab21941fae9 100644 --- a/tests/models/multimodal/processing/test_common.py +++ b/tests/models/multimodal/processing/test_common.py @@ -159,6 +159,7 @@ def _test_processing_correctness( _ADD_SPECIAL_TOKENS_OVERRIDES = { "mllama": False, "ovis": False, + "paligemma": False, "ultravox": False, "whisper": False, } diff --git a/tests/models/test_initialization.py b/tests/models/test_initialization.py index 76726c0c820..07ded1e5880 100644 --- a/tests/models/test_initialization.py +++ b/tests/models/test_initialization.py @@ -31,7 +31,8 @@ def test_can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch): model_info.check_transformers_version(on_fail="skip") # FIXME: Possible memory leak in the previous tests? 
- if model_arch in ("GraniteSpeechForConditionalGeneration", + if model_arch in ("Glm4vForConditionalGeneration", + "GraniteSpeechForConditionalGeneration", "KimiVLForConditionalGeneration"): pytest.skip("Avoid OOM") @@ -46,9 +47,14 @@ def hf_overrides(hf_config: PretrainedConfig) -> PretrainedConfig: n_group = getattr(text_config, 'n_group', None) num_experts = n_group * 2 if n_group is not None else 2 + # we use three layers for Gemma-3n to check + # both normal layer and kv_shared_layer + num_hidden_layers = (3 if model_arch + == "Gemma3nForConditionalGeneration" else 1) + text_config.update({ "num_layers": 1, - "num_hidden_layers": 1, + "num_hidden_layers": num_hidden_layers, "num_experts": num_experts, "num_experts_per_tok": 2, "num_local_experts": num_experts, @@ -56,6 +62,8 @@ def hf_overrides(hf_config: PretrainedConfig) -> PretrainedConfig: "first_k_dense_replace": 0, # To avoid OOM on DeepSeek-V3 "n_routed_experts": num_experts, + # For Gemma-3n + "num_kv_shared_layers": 1, }) if hasattr(hf_config, "vision_config"): diff --git a/vllm/inputs/registry.py b/vllm/inputs/registry.py index 082e52aff9e..652136fbbfe 100644 --- a/vllm/inputs/registry.py +++ b/vllm/inputs/registry.py @@ -5,9 +5,7 @@ from typing import TYPE_CHECKING, Any, NamedTuple, Optional, Union import torch -from packaging.version import Version from transformers import BatchFeature, PretrainedConfig, ProcessorMixin -from transformers import __version__ as TRANSFORMERS_VERSION from typing_extensions import TypeVar from vllm.jsontree import JSONTree, json_map_leaves @@ -137,13 +135,9 @@ def get_hf_processor( /, **kwargs: object, ) -> _P: - # Transformers 4.53.0 has issue with passing tokenizer to - # initialize processor. We disable it for this version. - # See: https://github.com/vllm-project/vllm/issues/20224 - if Version(TRANSFORMERS_VERSION) != Version("4.53.0"): - kwargs["tokenizer"] = self.tokenizer return super().get_hf_processor( typ, + tokenizer=self.tokenizer, **kwargs, ) diff --git a/vllm/model_executor/models/commandr.py b/vllm/model_executor/models/commandr.py index 817c6bb9a7f..c4f6144ed91 100644 --- a/vllm/model_executor/models/commandr.py +++ b/vllm/model_executor/models/commandr.py @@ -189,10 +189,13 @@ def __init__( layer_idx = extract_layer_index(prefix) layer_has_sliding_window = ( - getattr(config, "sliding_window_pattern", False) - and (layer_idx + 1) % self.config.sliding_window_pattern != 0) + getattr(config, "sliding_window_pattern", False) and + (layer_idx + 1) % self.config.sliding_window_pattern + != 0) or (getattr(config, "layer_types", False) + and config.layer_types[layer_idx] == "sliding_attention") self.sliding_window = (interleaved_sliding_window + or config.sliding_window if layer_has_sliding_window else None) self.attn = Attention(self.num_heads, diff --git a/vllm/model_executor/models/fuyu.py b/vllm/model_executor/models/fuyu.py index 26c8f80d5a0..558d4fbb4de 100644 --- a/vllm/model_executor/models/fuyu.py +++ b/vllm/model_executor/models/fuyu.py @@ -175,12 +175,21 @@ def _call_hf_processor( # Original output: (1, num_images, Pn, Px * Py * C) # New output: (num_images, Pn, Px * Py * C) - assert (isinstance(image_patches, list) - and len(image_patches) == 1) - assert (isinstance(image_patches[0], torch.Tensor) - and len(image_patches[0]) == len(images)) - - processed_outputs["image_patches"] = image_patches[0] + # image_patches is a list with shape: + # (1, num_images, Pn, Px * Py * C) + # before Transformers 4.53 + if isinstance(image_patches, list): + assert 
len(image_patches) == 1 + assert (isinstance(image_patches[0], torch.Tensor) + and len(image_patches[0]) == len(images)) + processed_outputs["image_patches"] = image_patches[0] + # image_patches is a tensor with shape: + # (num_images, Pn, Px * Py * C) + # after Transformers 4.53 + elif isinstance(image_patches, torch.Tensor): + assert len(image_patches) == len(images) + else: + raise AssertionError("This line should be unreachable.") return processed_outputs @@ -193,8 +202,10 @@ def _apply_hf_processor_tokens_only( vocab = tokenizer.get_vocab() boa_token_id = vocab["<0x04>"] + if prompt_tokens[-1] != boa_token_id: + prompt_tokens.append(boa_token_id) - return prompt_tokens + [boa_token_id] + return prompt_tokens def _get_mm_fields_config( self, diff --git a/vllm/model_executor/models/gemma3.py b/vllm/model_executor/models/gemma3.py index 954e48d25f6..1a2ce65d1e4 100644 --- a/vllm/model_executor/models/gemma3.py +++ b/vllm/model_executor/models/gemma3.py @@ -149,14 +149,17 @@ def __init__(self, # TODO(woosuk): Add reference to the original HF implementation. layer_idx = extract_layer_index(prefix) self.is_sliding = (getattr( - config, "interleaved_sliding_window", None) is not None and bool( - (layer_idx + 1) % config.sliding_window_pattern)) + config, "interleaved_sliding_window", None) is not None and (bool( + (layer_idx + 1) % config.sliding_window_pattern))) or ( + getattr(config, "layer_types", None) is not None + and config.layer_types[layer_idx] == "sliding_attention") # Initialize the rotary embedding. if self.is_sliding: # Local attention. Override the values in config.json. self.rope_theta = config.rope_local_base_freq self.rope_scaling = {"rope_type": "default"} - self.sliding_window = config.interleaved_sliding_window + self.sliding_window = (config.interleaved_sliding_window + or config.sliding_window) else: # Global attention. Use the values in config.json. 
self.rope_theta = config.rope_theta diff --git a/vllm/model_executor/models/minicpmo.py b/vllm/model_executor/models/minicpmo.py index 71593d4bb89..4e4fc3d5c76 100644 --- a/vllm/model_executor/models/minicpmo.py +++ b/vllm/model_executor/models/minicpmo.py @@ -30,8 +30,10 @@ from torch import nn from transformers import BatchFeature, PretrainedConfig from transformers.modeling_outputs import BaseModelOutputWithPast -from transformers.models.whisper.modeling_whisper import ( - ACT2FN, WHISPER_ATTENTION_CLASSES, WhisperConfig, WhisperEncoder) +from transformers.models.whisper.modeling_whisper import (ACT2FN, + WhisperAttention, + WhisperConfig, + WhisperEncoder) from vllm.config import VllmConfig from vllm.model_executor.layers.quantization import QuantizationConfig @@ -378,14 +380,13 @@ class MiniCPMWhisperEncoderLayer(nn.Module): def __init__(self, config: WhisperConfig, layer_idx: int): super().__init__() self.embed_dim = config.d_model - self.self_attn = WHISPER_ATTENTION_CLASSES[ - config._attn_implementation]( - embed_dim=self.embed_dim, - num_heads=config.encoder_attention_heads, - dropout=config.attention_dropout, - config=config, - layer_idx=layer_idx, - ) + self.self_attn = WhisperAttention( + embed_dim=self.embed_dim, + num_heads=config.encoder_attention_heads, + dropout=config.attention_dropout, + config=config, + layer_idx=layer_idx, + ) self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) self.dropout = config.dropout self.activation_fn = ACT2FN[config.activation_function] diff --git a/vllm/model_executor/models/paligemma.py b/vllm/model_executor/models/paligemma.py index 77197abe571..b1f2e53b0c7 100644 --- a/vllm/model_executor/models/paligemma.py +++ b/vllm/model_executor/models/paligemma.py @@ -125,7 +125,7 @@ def _call_hf_processor( ) -> BatchFeature: tokenizer = self.info.get_tokenizer() if not mm_data: - prompt_ids = tokenizer.encode(prompt) + prompt_ids = tokenizer.encode(prompt, add_special_tokens=False) return BatchFeature(dict(input_ids=[prompt_ids]), tensor_type="pt") return super()._call_hf_processor( diff --git a/vllm/model_executor/models/qwen2_5_omni_thinker.py b/vllm/model_executor/models/qwen2_5_omni_thinker.py index 377a34f2088..c5a5c10d950 100644 --- a/vllm/model_executor/models/qwen2_5_omni_thinker.py +++ b/vllm/model_executor/models/qwen2_5_omni_thinker.py @@ -144,8 +144,16 @@ def get_hf_processor( ) -> Qwen2_5OmniProcessor: if fps is not None: kwargs["fps"] = fps + + # Monkey patch for Transformers v4.53 + processor_class = Qwen2_5OmniProcessor + if processor_class.image_processor_class != "AutoImageProcessor": + processor_class.image_processor_class = "AutoImageProcessor" + if processor_class.video_processor_class != "AutoVideoProcessor": + processor_class.video_processor_class = "AutoVideoProcessor" + processor = self.ctx.get_hf_processor( - Qwen2_5OmniProcessor, + processor_class, image_processor=self.get_image_processor(min_pixels=min_pixels, max_pixels=max_pixels, size=size, diff --git a/vllm/model_executor/models/whisper.py b/vllm/model_executor/models/whisper.py index 344d6fc8f45..ee1cfd7d713 100644 --- a/vllm/model_executor/models/whisper.py +++ b/vllm/model_executor/models/whisper.py @@ -634,7 +634,14 @@ def get_hf_config(self) -> WhisperConfig: def get_hf_processor(self, sampling_rate: Optional[int] = None ) -> WhisperProcessor: - return self.ctx.get_hf_processor(WhisperProcessor) + # HACK: Transformers 4.53.0 has issue with whisper tokenizer to + # initialize processor. We use a monkeypatch to fix it here. 
+ # See: https://github.com/vllm-project/vllm/issues/20224 + processor_class = WhisperProcessor + tokenizer_class = ("WhisperTokenizer", "WhisperTokenizerFast") + if processor_class.tokenizer_class != tokenizer_class: + processor_class.tokenizer_class = tokenizer_class + return self.ctx.get_hf_processor(processor_class) def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]: return {"audio": 1} From 3723fb7c77fd32222d4150cb12f7e88cac13de6e Mon Sep 17 00:00:00 2001 From: Varun Sundar Rabindranath Date: Sat, 12 Jul 2025 07:56:24 +0400 Subject: [PATCH 027/552] [Bugfix] : Fix typo - logger.warn_once -> logger.warning_once (#20852) Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py index 46f1231a617..4cd68608f02 100644 --- a/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py @@ -111,7 +111,7 @@ def prepare( # topk_indices_dtype() int32 # if expert_map is not None: - logger.warn_once( + logger.warning_once( "The PPLX backend does not support expert mapping. " "The provided `expert_map` will be ignored.") expert_map = None #noqa: F841 From b41376cb852e554c52d26f859bbc379f15a8f267 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nicol=C3=B2=20Lucchesi?= Date: Sat, 12 Jul 2025 06:33:26 +0200 Subject: [PATCH 028/552] [Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models (#20637) Signed-off-by: NickLucche Signed-off-by: x22x22 --- vllm/config.py | 31 +++++++++ vllm/entrypoints/openai/speech_to_text.py | 83 +++++++++-------------- vllm/model_executor/models/interfaces.py | 32 ++++++++- vllm/model_executor/models/whisper.py | 55 +++++++++++++-- 4 files changed, 141 insertions(+), 60 deletions(-) diff --git a/vllm/config.py b/vllm/config.py index d3774a18b06..90cea63dd14 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -4987,3 +4987,34 @@ def get_layers_from_vllm_config(vllm_config: VllmConfig, vllm_config.compilation_config.static_forward_context.items() if isinstance(layer, layer_type) } + + +@config +@dataclass +class SpeechToTextConfig: + """Configuration for speech-to-text models.""" + + sample_rate: float = 16_000 + """Sample rate (Hz) to resample input audio to. Most speech models expect + 16kHz audio input. The input audio will be automatically resampled to this + rate before processing.""" + + max_audio_clip_s: int = 30 + """Maximum duration in seconds for a single audio clip without chunking. + Audio longer than this will be split into smaller chunks if + `allow_audio_chunking` evaluates to True, otherwise it will be rejected.""" + + overlap_chunk_second: int = 1 + """Overlap duration in seconds between consecutive audio chunks when + splitting long audio. This helps maintain context across chunk boundaries + and improves transcription quality at split points.""" + + min_energy_split_window_size: Optional[int] = 1600 + """Window size in samples for finding low-energy (quiet) regions to split + audio chunks. The algorithm looks for the quietest moment within this + window to minimize cutting through speech. Default 1600 samples ≈ 100ms + at 16kHz. 
If None, no chunking will be done.""" + + @property + def allow_audio_chunking(self) -> bool: + return self.min_energy_split_window_size is not None \ No newline at end of file diff --git a/vllm/entrypoints/openai/speech_to_text.py b/vllm/entrypoints/openai/speech_to_text.py index 0ab029e5305..c70355b2ae4 100644 --- a/vllm/entrypoints/openai/speech_to_text.py +++ b/vllm/entrypoints/openai/speech_to_text.py @@ -6,7 +6,6 @@ import time from collections.abc import AsyncGenerator from functools import cached_property -from math import ceil from typing import Callable, Literal, Optional, TypeVar, Union, cast import numpy as np @@ -28,7 +27,6 @@ from vllm.model_executor.model_loader import get_model_cls from vllm.model_executor.models import SupportsTranscription from vllm.outputs import RequestOutput -from vllm.transformers_utils.processor import cached_get_processor from vllm.utils import PlaceholderModule try: @@ -44,9 +42,6 @@ # As per https://platform.openai.com/docs/guides/speech-to-text#overview. # TODO configurable MAX_AUDIO_CLIP_FILESIZE_MB = 25 -MAX_AUDIO_CLIP_SECONDS = 30 -OVERLAP_CHUNK_SECOND = 1 -MIN_ENERGY_WINDOW_SIZE = 1600 # 1600 ~ 100ms for 16000 Hz audio class OpenAISpeechToText(OpenAIServing): @@ -71,36 +66,32 @@ def __init__( self.default_sampling_params = ( self.model_config.get_diff_sampling_param()) - processor = cached_get_processor(model_config.model) - self.max_audio_clip_s = processor.feature_extractor.chunk_length \ - if hasattr(processor.feature_extractor, 'chunk_length') \ - else MAX_AUDIO_CLIP_SECONDS - self.model_sr = processor.feature_extractor.sampling_rate - self.hop_length = processor.feature_extractor.hop_length self.task_type = task_type + self.asr_config = self.model_cls.get_speech_to_text_config( + model_config, task_type) + if self.default_sampling_params: logger.info( "Overwriting default completion sampling param with: %s", self.default_sampling_params) @cached_property - def model_cls(self): - return get_model_cls(self.model_config) + def model_cls(self) -> type[SupportsTranscription]: + model_cls = get_model_cls(self.model_config) + return cast(type[SupportsTranscription], model_cls) async def _preprocess_speech_to_text( self, request: SpeechToTextRequest, audio_data: bytes, ) -> tuple[list[PromptType], float]: - model_cls = cast(SupportsTranscription, self.model_cls) - # Validate request # TODO language should be optional and can be guessed. # For now we default to en. See # https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/generation_whisper.py#L1520 lang = request.language or "en" - model_cls.validate_language(lang) + self.model_cls.validate_language(lang) if len(audio_data) / 1024**2 > MAX_AUDIO_CLIP_FILESIZE_MB: raise ValueError("Maximum file size exceeded.") @@ -108,26 +99,23 @@ async def _preprocess_speech_to_text( with io.BytesIO(audio_data) as bytes_: # NOTE resample to model SR here for efficiency. This is also a # pre-requisite for chunking, as it assumes Whisper SR. 
- y, sr = librosa.load(bytes_, sr=self.model_sr) + y, sr = librosa.load(bytes_, sr=self.asr_config.sample_rate) duration = librosa.get_duration(y=y, sr=sr) - chunks = [y - ] if duration < self.max_audio_clip_s else self._split_audio( - y, int(sr)) + do_split_audio = (self.asr_config.allow_audio_chunking + and duration > self.asr_config.max_audio_clip_s) + chunks = [y] if not do_split_audio else self._split_audio(y, int(sr)) prompts = [] for chunk in chunks: - prompt = { - "encoder_prompt": { - "prompt": "", - "multi_modal_data": { - "audio": (chunk, sr), - }, - }, - "decoder_prompt": - model_cls.get_decoder_prompt(lang, self.task_type, - request.prompt) - } - prompts.append(cast(PromptType, prompt)) + # The model has control over the construction, as long as it + # returns a valid PromptType. + prompt = self.model_cls.get_generation_prompt( + audio=chunk, + stt_config=self.asr_config, + language=lang, + task_type=self.task_type, + request_prompt=request.prompt) + prompts.append(prompt) return prompts, duration async def _create_speech_to_text( @@ -196,7 +184,8 @@ async def _create_speech_to_text( self._log_inputs( request_id, - prompts[0]['decoder_prompt'], # type: ignore + # It will not display special tokens like <|startoftranscript|> + request.prompt, params=sampling_params, lora_request=None, prompt_adapter_request=None) @@ -261,17 +250,11 @@ async def _speech_to_text_stream_generator( async for res in result_generator: # On first result. if res.prompt_token_ids is not None: - # Do not account the 4-tokens `<|startoftranscript|>..` - # Could be negative when language token - # is not specified. - num_prompt_tokens = max( - len(res.prompt_token_ids) - 4, 0) - # NOTE(NickLucche) user can't pass encoder - # prompts directly at least not to Whisper. - # One indicator of the encoder amount of processing - # is the log-mel spectogram length. 
- num_prompt_tokens += ceil( - audio_duration_s * self.model_sr / self.hop_length) + num_prompt_tokens = len(res.prompt_token_ids) + if audio_tokens := self.model_cls.get_num_audio_tokens( + audio_duration_s, self.asr_config, + self.model_config): + num_prompt_tokens += audio_tokens # We need to do it here, because if there are exceptions in # the result_generator, it needs to be sent as the FIRST @@ -347,8 +330,8 @@ async def _speech_to_text_stream_generator( def _split_audio(self, audio_data: np.ndarray, sample_rate: int) -> list[np.ndarray]: - chunk_size = sample_rate * self.max_audio_clip_s - overlap_size = sample_rate * OVERLAP_CHUNK_SECOND + chunk_size = sample_rate * self.asr_config.max_audio_clip_s + overlap_size = sample_rate * self.asr_config.overlap_chunk_second chunks = [] i = 0 while i < audio_data.shape[-1]: @@ -384,10 +367,10 @@ def _find_split_point(self, wav: np.ndarray, start_idx: int, # Calculate RMS energy in small windows min_energy = math.inf quietest_idx = 0 - for i in range(0, - len(segment) - MIN_ENERGY_WINDOW_SIZE, - MIN_ENERGY_WINDOW_SIZE): - window = segment[i:i + MIN_ENERGY_WINDOW_SIZE] + min_energy_window = self.asr_config.min_energy_split_window_size + assert min_energy_window is not None + for i in range(0, len(segment) - min_energy_window, min_energy_window): + window = segment[i:i + min_energy_window] energy = (window**2).mean()**0.5 if energy < min_energy: quietest_idx = i + start_idx diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 50314736710..99669a23363 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -5,11 +5,14 @@ from typing import (TYPE_CHECKING, ClassVar, Literal, Optional, Protocol, Union, overload, runtime_checkable) +import numpy as np import torch from torch import Tensor from typing_extensions import Self, TypeIs +from vllm.config import ModelConfig, SpeechToTextConfig from vllm.inputs import TokensPrompt +from vllm.inputs.data import PromptType from vllm.logger import init_logger from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) @@ -692,9 +695,13 @@ class SupportsTranscription(Protocol): supports_transcription: ClassVar[Literal[True]] = True @classmethod - def get_decoder_prompt(cls, language: str, task_type: str, - prompt: str) -> str: - """Get the decoder prompt for the ASR model.""" + def get_generation_prompt(cls, audio: np.ndarray, + stt_config: SpeechToTextConfig, language: str, + task_type: str, + request_prompt: str) -> PromptType: + """Get the prompt for the ASR model. + The model has control over the construction, as long as it + returns a valid PromptType.""" ... @classmethod @@ -702,6 +709,25 @@ def validate_language(cls, language: str) -> bool: """Check if the model supports a specific ISO639_1 language.""" ... + @classmethod + def get_speech_to_text_config( + cls, model_config: ModelConfig, + task_type: Literal["transcribe", + "translate"]) -> SpeechToTextConfig: + """Get the speech to text config for the ASR model.""" + ... + + @classmethod + def get_num_audio_tokens(cls, audio_duration_s: float, + stt_config: SpeechToTextConfig, + model_config: ModelConfig) -> Optional[int]: + """ + Map from audio duration to number of audio tokens produced by the ASR + model, without running a forward pass. + This is used for estimating the amount of processing for this audio. 
+ """ + return None + @overload def supports_transcription( diff --git a/vllm/model_executor/models/whisper.py b/vllm/model_executor/models/whisper.py index ee1cfd7d713..1a7982e48e4 100644 --- a/vllm/model_executor/models/whisper.py +++ b/vllm/model_executor/models/whisper.py @@ -3,8 +3,9 @@ import math from collections.abc import Iterable, Mapping, Sequence -from typing import Optional, TypedDict, Union +from typing import Optional, TypedDict, Union, cast +import numpy as np import torch from torch import nn from transformers import (BatchFeature, WhisperConfig, WhisperFeatureExtractor, @@ -12,8 +13,10 @@ from transformers.models.whisper.modeling_whisper import sinusoids from vllm.attention import Attention, AttentionType -from vllm.config import CacheConfig, VllmConfig +from vllm.config import (CacheConfig, ModelConfig, SpeechToTextConfig, + VllmConfig) from vllm.distributed import get_tensor_model_parallel_world_size +from vllm.inputs.data import PromptType from vllm.logger import init_logger from vllm.model_executor.layers.activation import get_act_fn from vllm.model_executor.layers.linear import (ColumnParallelLinear, @@ -33,6 +36,7 @@ EncDecMultiModalProcessor, PromptReplacement, PromptUpdate) from vllm.multimodal.profiling import BaseDummyInputsBuilder +from vllm.transformers_utils.processor import cached_get_processor from .interfaces import (MultiModalEmbeddings, SupportsMultiModal, SupportsTranscription, SupportsV0Only) @@ -785,11 +789,24 @@ def validate_language(cls, language: str) -> bool: f"or {list(ISO639_1_OTHER_LANGS.values())}") @classmethod - def get_decoder_prompt(cls, language: str, task_type: str, - prompt: str) -> str: - return ((f"<|prev|>{prompt}" if prompt else "") + - f"<|startoftranscript|><|{language}|>" + - f"<|{task_type}|><|notimestamps|>") + def get_generation_prompt(cls, audio: np.ndarray, + stt_config: SpeechToTextConfig, language: str, + task_type: str, + request_prompt: str) -> PromptType: + prompt = { + "encoder_prompt": { + # Whisper does not support encoder prompt. + "prompt": "", + "multi_modal_data": { + "audio": (audio, stt_config.sample_rate), + }, + }, + "decoder_prompt": + ((f"<|prev|>{request_prompt}" if request_prompt else "") + + f"<|startoftranscript|><|{language}|>" + + f"<|{task_type}|><|notimestamps|>") + } + return cast(PromptType, prompt) @classmethod def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]: @@ -798,6 +815,30 @@ def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]: raise ValueError("Only audio modality is supported") + @classmethod + def get_speech_to_text_config(cls, model_config: ModelConfig, + task_type: str) -> SpeechToTextConfig: + processor = cached_get_processor(model_config.model) + + return SpeechToTextConfig( + max_audio_clip_s=processor.feature_extractor.chunk_length, + sample_rate=processor.feature_extractor.sampling_rate, + ) + + @classmethod + def get_num_audio_tokens(cls, audio_duration_s: float, + stt_config: SpeechToTextConfig, + model_config: ModelConfig) -> Optional[int]: + processor = cached_get_processor(model_config.model) + hop_length = processor.feature_extractor.hop_length + assert hop_length is not None + # NOTE(NickLucche) user can't pass encoder + # prompts directly at least not to Whisper. + # One indicator of the encoder amount of processing + # is the log-mel spectogram length. 
+ return math.ceil(audio_duration_s * stt_config.sample_rate / + hop_length) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() config = vllm_config.model_config.hf_config From 015e09078215176d7f139e62586285189ef57221 Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Sat, 12 Jul 2025 13:25:39 +0800 Subject: [PATCH 029/552] [Bugfix] Replace unavailable video url in multimodal test (#20854) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: x22x22 --- tests/multimodal/test_utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/multimodal/test_utils.py b/tests/multimodal/test_utils.py index b642e5c0ad4..3fdf7e33ca5 100644 --- a/tests/multimodal/test_utils.py +++ b/tests/multimodal/test_utils.py @@ -39,7 +39,7 @@ TEST_VIDEO_URLS = [ "https://www.bogotobogo.com/python/OpenCV_Python/images/mean_shift_tracking/slow_traffic_small.mp4", - "https://filesamples.com/samples/video/avi/sample_640x360.avi", + "https://github.com/opencv/opencv/raw/refs/tags/4.12.0/samples/data/vtest.avi", ] From 6e82fbdf45f5133b8f5251cdd21528a5fd1fcbb5 Mon Sep 17 00:00:00 2001 From: lkchen Date: Fri, 11 Jul 2025 23:04:45 -0700 Subject: [PATCH 030/552] [Misc] Respect `no_use_tqdm_on_load` flag while capturing CUDA graph (#20834) Signed-off-by: Linkun Signed-off-by: x22x22 --- vllm/v1/worker/gpu_model_runner.py | 6 ++++-- vllm/worker/model_runner.py | 1 + 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index f3279fa5fa8..44de1469d1b 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -2270,8 +2270,10 @@ def capture_model(self) -> None: # Only rank 0 should print progress bar during capture compilation_cases = reversed(self.cudagraph_batch_sizes) if is_global_first_rank(): - compilation_cases = tqdm(list(compilation_cases), - desc="Capturing CUDA graph shapes") + compilation_cases = tqdm( + list(compilation_cases), + disable=not self.load_config.use_tqdm_on_load, + desc="Capturing CUDA graph shapes") for num_tokens in compilation_cases: # We skip EPLB here since we don't want to record dummy metrics for _ in range( diff --git a/vllm/worker/model_runner.py b/vllm/worker/model_runner.py index 9d936f3dbf0..4fe70a0abf8 100644 --- a/vllm/worker/model_runner.py +++ b/vllm/worker/model_runner.py @@ -1587,6 +1587,7 @@ def capture_model(self, kv_caches: List[List[torch.Tensor]]) -> None: if get_tensor_model_parallel_rank() == 0: compilation_cases = tqdm( list(compilation_cases), + disable=not self.load_config.use_tqdm_on_load, desc="Capturing CUDA graph shapes") for batch_size, use_inputs_embeds in compilation_cases: attn_metadata = ( From f27ea0e414780b2d6ca855aae6dc00dc3e311597 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Sat, 12 Jul 2025 02:05:12 -0400 Subject: [PATCH 031/552] [Bug] Fix DeepGemm for EP low latency case (#20833) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- .../layers/fused_moe/batched_deep_gemm_moe.py | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py index 70ac6688deb..70a580b9c4c 100644 --- a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py @@ -11,7 +11,8 @@ TopKWeightAndReduceDelegate) from 
vllm.model_executor.layers.fused_moe.utils import _resize_cache from vllm.triton_utils import tl, triton -from vllm.utils.deep_gemm import fp8_m_grouped_gemm_nt_masked +from vllm.utils.deep_gemm import (fp8_m_grouped_gemm_nt_masked, + is_blackwell_deep_gemm_used) logger = init_logger(__name__) @@ -50,6 +51,7 @@ def _silu_mul_fp8_quant_deep_gemm( eps: tl.constexpr, fp8_min: tl.constexpr, fp8_max: tl.constexpr, + use_ue8m0: tl.constexpr, # Meta --------------------------------------------------------------- BLOCK: tl.constexpr, @@ -92,7 +94,9 @@ def _silu_mul_fp8_quant_deep_gemm( y = x * y2 _absmax = tl.maximum(tl.max(tl.abs(y)), eps) - y_s = _absmax / fp8_max + scale_raw = _absmax / fp8_max + y_s = tl.math.exp2(tl.ceil( + tl.log2(scale_raw))) if use_ue8m0 else scale_raw y_q = tl.clamp(y / y_s, fp8_min, fp8_max).to(y_q_ptr.dtype.element_ty) tl.store(y_q_ptr + base_yq_offset + cols * stride_yq_h, y_q, mask=mask) @@ -174,6 +178,7 @@ def silu_mul_fp8_quant_deep_gemm( eps, fp8_min, fp8_max, + is_blackwell_deep_gemm_used(), BLOCK=group_size, num_warps=4, ) @@ -290,14 +295,10 @@ def apply( # may lead to better performance. expected_m = max_num_tokens fp8_m_grouped_gemm_nt_masked((a1q, a1q_scale), (w1, w1_scale), - out=workspace1, - masked_m=expert_num_tokens, - expected_m=expected_m) + workspace1, expert_num_tokens, expected_m) a2q, a2q_scale = silu_mul_fp8_quant_deep_gemm(workspace1, expert_num_tokens) - fp8_m_grouped_gemm_nt_masked((a2q, a2q_scale), (w2, w2_scale), - out=output, - masked_m=expert_num_tokens, - expected_m=expected_m) + fp8_m_grouped_gemm_nt_masked((a2q, a2q_scale), (w2, w2_scale), output, + expert_num_tokens, expected_m) From acbd35aabafb7447f9219fb7049ef7c9fb0fff11 Mon Sep 17 00:00:00 2001 From: Lucia Fang <116399278+luccafong@users.noreply.github.com> Date: Sat, 12 Jul 2025 14:05:32 +0800 Subject: [PATCH 032/552] [Docs] Update basic.md (#20846) Signed-off-by: x22x22 --- docs/contributing/model/basic.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/contributing/model/basic.md b/docs/contributing/model/basic.md index 542351fd66b..edd9a47e132 100644 --- a/docs/contributing/model/basic.md +++ b/docs/contributing/model/basic.md @@ -73,6 +73,8 @@ def forward( self, input_ids: torch.Tensor, positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, ) -> torch.Tensor: ... ``` From 6cadf4a3f5a2e1d095020c3a6f4c156991b50b11 Mon Sep 17 00:00:00 2001 From: Richard Zou Date: Sat, 12 Jul 2025 02:06:04 -0400 Subject: [PATCH 033/552] [Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 (#20823) Signed-off-by: rzou Signed-off-by: x22x22 --- vllm/lora/layers.py | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/vllm/lora/layers.py b/vllm/lora/layers.py index 3d0c5831750..39b45027bd5 100644 --- a/vllm/lora/layers.py +++ b/vllm/lora/layers.py @@ -240,17 +240,19 @@ def set_lora( def forward(self, x: torch.Tensor) -> torch.Tensor: added_tokens_mask = torch.where(x > self.base_layer.org_vocab_size - 1, 1, 0) - embeddings_indices = torch.narrow( - self.punica_wrapper._embeddings_indices, 1, 0, x.size(0)) - indices = embeddings_indices[1] + # NB: Don't use torch.narrow here. 
torch.narrow triggers some + # Dynamic Shape specialization in torch.compile + num_tokens = x.shape[0] + indices_1 = self.punica_wrapper._embeddings_indices[1][:num_tokens] + indices_0 = self.punica_wrapper._embeddings_indices[0][:num_tokens] + full_lora_a_embeddings = F.embedding( - x + indices, + x + indices_1, self.lora_a_stacked_2d, ) - indices = embeddings_indices[0] full_output = self.base_layer.forward(x + - (indices * added_tokens_mask)) + (indices_0 * added_tokens_mask)) full_output_org = full_output if full_output.ndim == 3: From 739d2e1a4d4dc70b6ee9880d0035ee52cacd2c01 Mon Sep 17 00:00:00 2001 From: Boyuan Feng Date: Fri, 11 Jul 2025 23:06:13 -0700 Subject: [PATCH 034/552] [cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile (#20790) Signed-off-by: Boyuan Feng Signed-off-by: x22x22 --- vllm/compilation/wrapper.py | 16 +++++++++++++--- vllm/envs.py | 6 ++++++ 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/vllm/compilation/wrapper.py b/vllm/compilation/wrapper.py index 2a261c84c3f..4fd00f0c75b 100644 --- a/vllm/compilation/wrapper.py +++ b/vllm/compilation/wrapper.py @@ -95,16 +95,26 @@ def bytecode_hook(self, old_code: CodeType, new_code: CodeType): self.compiled_codes.append(new_code) local_cache_dir = self.vllm_config.compilation_config.local_cache_dir if isinstance(local_cache_dir, str): + decompiled_file_name = ("transformed_code.py" + if envs.VLLM_COMPILE_DEPYF else + "transformed_code_README.txt") + decompiled_file = os.path.join(local_cache_dir, - "transformed_code.py") + decompiled_file_name) if not os.path.exists(decompiled_file): try: # usually the decompilation will succeed for most models, # as we guarantee a full-graph compilation in Dynamo. # but there's no 100% guarantee, since decompliation is # not a reversible process. - import depyf - src = depyf.decompile(new_code) + if envs.VLLM_COMPILE_DEPYF: + import depyf + src = depyf.decompile(new_code) + else: + src = ( + "To get a transformed_code.py file, re-run with " + "VLLM_COMPILE_DEPYF=1") + with open(decompiled_file, "w") as f: f.write(src) diff --git a/vllm/envs.py b/vllm/envs.py index 7bff6ade815..7fd5abed700 100644 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -97,6 +97,7 @@ VLLM_ENABLE_V1_MULTIPROCESSING: bool = True VLLM_LOG_BATCHSIZE_INTERVAL: float = -1 VLLM_DISABLE_COMPILE_CACHE: bool = False + VLLM_COMPILE_DEPYF: bool = False Q_SCALE_CONSTANT: int = 200 K_SCALE_CONSTANT: int = 200 V_SCALE_CONSTANT: int = 100 @@ -741,6 +742,11 @@ def get_vllm_port() -> Optional[int]: "VLLM_DISABLE_COMPILE_CACHE": lambda: bool(int(os.getenv("VLLM_DISABLE_COMPILE_CACHE", "0"))), + # If set, vllm will decompile the torch compiled code and dump to + # transformed_code.py. This is useful for debugging. + "VLLM_COMPILE_DEPYF": + lambda: bool(int(os.getenv("VLLM_COMPILE_DEPYF", "0"))), + # If set, vllm will run in development mode, which will enable # some additional endpoints for developing and debugging, # e.g. 
`/reset_prefix_cache` From aff322335943248be35d352ed805a76960f279b5 Mon Sep 17 00:00:00 2001 From: Maximilien de Bayser Date: Sat, 12 Jul 2025 03:06:34 -0300 Subject: [PATCH 035/552] Remove extra tensor on CPU (#20693) Signed-off-by: Max de Bayser Signed-off-by: x22x22 --- vllm/v1/sample/logits_processor.py | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/vllm/v1/sample/logits_processor.py b/vllm/v1/sample/logits_processor.py index 16bd2b9ffd8..3a4c25964e7 100644 --- a/vllm/v1/sample/logits_processor.py +++ b/vllm/v1/sample/logits_processor.py @@ -234,10 +234,16 @@ def __init__(self, max_num_reqs: int, pin_memory: bool, device="cpu", pin_memory=pin_memory) self.min_p_cpu = self.min_p_cpu_tensor.numpy() - # Pre-allocated device tensor - self.min_p_device: torch.Tensor = torch.empty((max_num_reqs, ), - dtype=torch.float32, - device=device) + + self.use_double_tensor = torch.device("cpu") != torch.device(device) + + if self.use_double_tensor: + # Pre-allocated device tensor + self.min_p_device: torch.Tensor = torch.empty((max_num_reqs, ), + dtype=torch.float32, + device=device) + else: + self.min_p_device = self.min_p_cpu_tensor # Current slice of the device tensor self.min_p: torch.Tensor = self.min_p_device[:0] @@ -284,7 +290,9 @@ def update_state(self, batch_update: Optional[BatchUpdate]): size = batch_update.batch_size if self.min_p_count and (needs_update or self.min_p.shape[0] != size): self.min_p = self.min_p_device[:size] - self.min_p.copy_(self.min_p_cpu_tensor[:size], non_blocking=True) + if self.use_double_tensor: + self.min_p.copy_(self.min_p_cpu_tensor[:size], + non_blocking=True) self.min_p.unsqueeze_(1) def apply(self, logits: torch.Tensor) -> torch.Tensor: From 029b2fad3d3d04ed05db8c00c0c94decb9bd8ef3 Mon Sep 17 00:00:00 2001 From: Zhiyu Date: Fri, 11 Jul 2025 23:07:16 -0700 Subject: [PATCH 036/552] Enable ModelOpt Llama4 fp8 checkpoint deployment (#20419) Signed-off-by: Zhiyu Cheng Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/layer.py | 37 ++- .../layers/quantization/modelopt.py | 266 +++++++++++++++++- .../model_loader/weight_utils.py | 10 + vllm/model_executor/models/llama4.py | 59 +++- vllm/model_executor/models/mllama4.py | 164 +++++++++-- 5 files changed, 501 insertions(+), 35 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/layer.py b/vllm/model_executor/layers/fused_moe/layer.py index eeff4379cf1..da772c11155 100644 --- a/vllm/model_executor/layers/fused_moe/layer.py +++ b/vllm/model_executor/layers/fused_moe/layer.py @@ -81,6 +81,16 @@ def create_weights(self, layer: torch.nn.Module, num_experts: int, params_dtype: torch.dtype, **extra_weight_attrs): raise NotImplementedError + def uses_weight_scale_2_pattern(self) -> bool: + """ + Returns True if this quantization method uses 'weight_scale_2' pattern + for per-tensor weight scales (e.g., FP4 variants), False otherwise. + + This method should be overridden by subclasses that use the + 'weight_scale_2' pattern instead of the standard 'weight_scale' pattern. 
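+        For example, the NVFP4 fused-MoE method (ModelOptNvFp4FusedMoE)
+        overrides this to return True because its per-tensor weight scales
+        are stored under 'weight_scale_2', while the FP8 method keeps the
+        default 'weight_scale' naming.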
+ """ + return False + @staticmethod def maybe_make_prepare_finalize( moe: FusedMoEConfig) -> Optional[FusedMoEPrepareAndFinalize]: @@ -1081,12 +1091,23 @@ def weight_loader(self, # TODO @dsikka: ModelOpt should follow the proper MoE loading pattern if "ModelOpt" in quant_method_name: - if ('weight_scale_2' in weight_name - or 'input_scale' in weight_name): - self._load_per_tensor_weight_scale(shard_id=shard_id, - param=param, - loaded_weight=loaded_weight, - expert_id=expert_id) + # Determine per-tensor weight scale patterns based on variant + # Use the dedicated method instead of brittle string matching + uses_weight_scale_2 = self.quant_method.uses_weight_scale_2_pattern( + ) + + # For per-tensor, FP4 uses "weight_scale_2", FP8 uses "weight_scale" + per_tensor_conditions = ( + "weight_scale_2" in weight_name if uses_weight_scale_2 else + "weight_scale" in weight_name) or "input_scale" in weight_name + + if per_tensor_conditions: + self._load_per_tensor_weight_scale( + shard_id=shard_id, + param=param, + loaded_weight=loaded_weight, + expert_id=expert_id, + ) elif "weight" in weight_name: self._load_model_weight_or_group_weight_scale( shard_id=shard_id, @@ -1558,3 +1579,7 @@ def moe_forward_fake(hidden_states: torch.Tensor, router_logits: torch.Tensor, dispatch_key=current_platform.dispatch_key, tags=(torch.Tag.needs_fixed_stride_order, ), ) + +# Mark the FusedMoE weight_loader as supporting MoE-specific parameters +# to avoid expensive runtime reflection in model loading code +FusedMoE.weight_loader.supports_moe_loading = True # type: ignore[attr-defined] diff --git a/vllm/model_executor/layers/quantization/modelopt.py b/vllm/model_executor/layers/quantization/modelopt.py index 0a4e36f19bf..788f0a9116f 100644 --- a/vllm/model_executor/layers/quantization/modelopt.py +++ b/vllm/model_executor/layers/quantization/modelopt.py @@ -42,9 +42,13 @@ class ModelOptFp8Config(QuantizationConfig): def __init__( self, is_checkpoint_fp8_serialized: bool = False, + kv_cache_quant_method: Optional[str] = None, + exclude_modules: Optional[list[str]] = None, ) -> None: super().__init__() self.is_checkpoint_fp8_serialized = is_checkpoint_fp8_serialized + self.kv_cache_quant_method = kv_cache_quant_method + self.exclude_modules = exclude_modules if is_checkpoint_fp8_serialized: logger.warning("Detected ModelOpt fp8 checkpoint. Please note that" " the format is experimental and could change.") @@ -69,6 +73,11 @@ def get_config_filenames(cls) -> list[str]: def from_config(cls, config: dict[str, Any]) -> "ModelOptFp8Config": quant_config = cls.get_from_keys(config, ["quantization"]) quant_method = quant_config["quant_algo"] + kv_cache_quant_method = cls.get_from_keys( + config, ["quantization"]).get("kv_cache_quant_algo") + exclude_modules = cls.get_from_keys( + config, ["quantization"]).get("exclude_modules") + if quant_method not in QUANT_ALGOS: raise ValueError(f"ModelOpt currently only supports: {QUANT_ALGOS}" " quantizations in vLLM. Please check the " @@ -76,27 +85,51 @@ def from_config(cls, config: dict[str, Any]) -> "ModelOptFp8Config": "quant configuration.") is_checkpoint_fp8_serialized = ("FP8" in quant_method) - return cls(is_checkpoint_fp8_serialized) + return cls(is_checkpoint_fp8_serialized, kv_cache_quant_method, + exclude_modules) + + def is_layer_excluded(self, prefix: str) -> bool: + """ + Check if a layer should be excluded from quantization. + + This method handles both regular models and multimodal models that use + the language_model prefix. 
For multimodal models, it checks if the + module name (without the language_model prefix) is in the exclude list. + """ + if self.exclude_modules is None: + return False + + # Check if any excluded module matches the prefix + for module in self.exclude_modules: + if (module in prefix + or (prefix.startswith("language_model.") + and module in prefix.removeprefix("language_model."))): + return True + return False def get_quant_method(self, layer: torch.nn.Module, prefix: str) -> Optional["QuantizeMethodBase"]: from vllm.attention.layer import Attention # Avoid circular import if isinstance(layer, LinearBase): + if self.is_layer_excluded(prefix): + return UnquantizedLinearMethod() return ModelOptFp8LinearMethod(self) elif isinstance(layer, Attention): return ModelOptFp8KVCacheMethod(self) + elif isinstance(layer, FusedMoE): + return ModelOptFp8MoEMethod(self) return None class ModelOptFp8LinearMethod(LinearMethodBase): """Linear method for Model Optimizer static quantization. Supports loading FP8 checkpoints with static weight scale and - activation scale. Future support might be added for dynamic + activation scale. Future support might be added for dynamic scales. Limitations: 1. Only support per-tensor quantization due to torch._scaled_mm support. - 2. Only support float8_e4m3fn datatype + 2. Only support float8_e4m3fn datatype Args: quant_config: The ModelOpt quantization config. """ @@ -172,6 +205,223 @@ def apply( bias=bias) +class ModelOptFp8MoEMethod(FusedMoEMethodBase): + """MoE method for ModelOpt FP8. + Supports loading FP8 checkpoints with static weight scale and + activation scale. + Args: + quant_config: The ModelOpt quantization config. + """ + + def __init__(self, quant_config: ModelOptFp8Config): + self.quant_config = quant_config + from vllm.model_executor.layers.quantization.utils.w8a8_utils import ( + cutlass_fp8_supported) + self.cutlass_fp8_supported = cutlass_fp8_supported() + + def create_weights( + self, + layer: torch.nn.Module, + num_experts: int, + hidden_size: int, + intermediate_size_per_partition: int, + params_dtype: torch.dtype, + **extra_weight_attrs, + ): + + # Use FP8 dtype if checkpoint is serialized + weight_dtype = (torch.float8_e4m3fn + if self.quant_config.is_checkpoint_fp8_serialized else + params_dtype) + weight_loader = extra_weight_attrs.get("weight_loader") + + w13_weight = ModelWeightParameter( + data=torch.empty(num_experts, + 2 * intermediate_size_per_partition, + hidden_size, + dtype=weight_dtype), + input_dim=2, + output_dim=1, + weight_loader=weight_loader, + ) + layer.register_parameter("w13_weight", w13_weight) + + w2_weight = ModelWeightParameter( + data=torch.empty(num_experts, + hidden_size, + intermediate_size_per_partition, + dtype=weight_dtype), + input_dim=2, + output_dim=1, + weight_loader=weight_loader, + ) + layer.register_parameter("w2_weight", w2_weight) + + if self.quant_config.is_checkpoint_fp8_serialized: + # WEIGHT SCALES - Per-tensor scaling for ModelOpts + # Allocate 2 scales for w1 and w3 respectively. + # They will be combined to a single scale after weight loading. 
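+            # w13_weight stacks w1 and w3 along dim 1, but the fused FP8 MoE
+            # kernel expects one scale per expert, so
+            # process_weights_after_loading() later folds the two shard
+            # scales into their max and requantizes the affected shard.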
+ w13_weight_scale = PerTensorScaleParameter( + data=torch.full( + (num_experts, 2), + 1.0, + dtype=torch.float32, + ), + weight_loader=weight_loader, + ) + w2_weight_scale = PerTensorScaleParameter( + data=torch.full((num_experts, ), 1.0, dtype=torch.float32), + weight_loader=weight_loader, + ) + layer.register_parameter("w13_weight_scale", w13_weight_scale) + layer.register_parameter("w2_weight_scale", w2_weight_scale) + + # Set weight loader attributes for scales + extra_weight_attrs.update( + {"quant_method": FusedMoeWeightScaleSupported.TENSOR.value}) + + # INPUT SCALES - Per-tensor scaling for ModelOpt + w13_input_scale = PerTensorScaleParameter( + data=torch.full((num_experts, ), 1.0, dtype=torch.float32), + weight_loader=weight_loader, + ) + w2_input_scale = PerTensorScaleParameter( + data=torch.full((num_experts, ), 1.0, dtype=torch.float32), + weight_loader=weight_loader, + ) + layer.register_parameter("w13_input_scale", w13_input_scale) + layer.register_parameter("w2_input_scale", w2_input_scale) + + def process_weights_after_loading(self, layer: torch.nn.Module) -> None: + """Process FP8 MoE weights after loading from serialized checkpoint. + Only supports pre-quantized checkpoints with FP8 weights and scales. + """ + + layer.w13_weight = Parameter(layer.w13_weight.data, + requires_grad=False) + layer.w2_weight = Parameter(layer.w2_weight.data, requires_grad=False) + + from vllm._custom_ops import scaled_fp8_quant + from vllm.model_executor.layers.quantization.utils.w8a8_utils import ( + per_tensor_dequantize) + + # Handle scale parameters + if hasattr(layer, + "w13_weight_scale") and layer.w13_weight_scale is not None: + # Fp8 moe kernel needs single weight scale for w13 per expert. + # We take the max of the w1 and w3 scales + # then dequant and requant each expert. + if layer.w13_weight_scale.dim() == 2: + + # Get the maximum scale across w1 and w3 for each expert + max_w13_scales = layer.w13_weight_scale.max(dim=1).values + + # Requantize each expert's weights using the combined scale + # w13_weight (num_experts, 2 * intermediate_size, hidden_size) + # where the first intermediate_size rows are w1, the next are w3 + intermediate_size = layer.w13_weight.shape[1] // 2 + for expert_id in range(layer.w13_weight.shape[0]): + start = 0 + for shard_id in range(2): # w1 and w3 + # Dequantize using the original scale for this shard + dq_weight = per_tensor_dequantize( + layer.w13_weight[expert_id][start:start + + intermediate_size, :], + layer.w13_weight_scale[expert_id][shard_id], + ) + # Requantize using the combined max scale + + ( + layer.w13_weight[expert_id][start:start + + intermediate_size, :], + _, + ) = scaled_fp8_quant(dq_weight, + max_w13_scales[expert_id]) + + start += intermediate_size + + # Update the scale parameter to be per-expert + layer.w13_weight_scale = Parameter(max_w13_scales, + requires_grad=False) + else: + layer.w13_weight_scale = Parameter(layer.w13_weight_scale.data, + requires_grad=False) + + if hasattr(layer, + "w2_weight_scale") and layer.w2_weight_scale is not None: + layer.w2_weight_scale = Parameter(layer.w2_weight_scale.data, + requires_grad=False) + # Input scales must be equal for each expert in fp8 MoE layers. 
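+        # The fused kernel uses a single static activation scale for all
+        # experts, so the per-expert input scales from the checkpoint are
+        # collapsed to their maximum, a conservative upper bound.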
+ if hasattr(layer, + "w13_input_scale") and layer.w13_input_scale is not None: + layer.w13_input_scale = Parameter(layer.w13_input_scale.max(), + requires_grad=False) + if hasattr(layer, + "w2_input_scale") and layer.w2_input_scale is not None: + layer.w2_input_scale = Parameter(layer.w2_input_scale.max(), + requires_grad=False) + + def apply( + self, + layer: torch.nn.Module, + x: torch.Tensor, + router_logits: torch.Tensor, + top_k: int, + renormalize: bool, + use_grouped_topk: bool = False, + topk_group: Optional[int] = None, + num_expert_group: Optional[int] = None, + global_num_experts: int = -1, + expert_map: Optional[torch.Tensor] = None, + custom_routing_function: Optional[Callable] = None, + scoring_func: str = "softmax", + e_score_correction_bias: Optional[torch.Tensor] = None, + apply_router_weight_on_input: bool = False, + activation: str = "silu", + enable_eplb: bool = False, + expert_load_view: Optional[torch.Tensor] = None, + logical_to_physical_map: Optional[torch.Tensor] = None, + logical_replica_count: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + if enable_eplb: + raise NotImplementedError( + "EPLB not supported for `ModelOptFp8MoEMethod` yet.") + + # Expert selection + topk_weights, topk_ids = FusedMoE.select_experts( + hidden_states=x, + router_logits=router_logits, + use_grouped_topk=use_grouped_topk, + top_k=top_k, + renormalize=renormalize, + topk_group=topk_group, + num_expert_group=num_expert_group, + custom_routing_function=custom_routing_function, + scoring_func=scoring_func, + e_score_correction_bias=e_score_correction_bias, + ) + from vllm.model_executor.layers.fused_moe.fused_moe import ( + fused_experts) + return fused_experts( + x, + layer.w13_weight, + layer.w2_weight, + topk_weights=topk_weights, + topk_ids=topk_ids, + inplace=True, + activation=activation, + use_fp8_w8a8=True, + per_channel_quant=False, + global_num_experts=global_num_experts, + expert_map=expert_map, + w1_scale=layer.w13_weight_scale, + w2_scale=layer.w2_weight_scale, + a1_scale=layer.w13_input_scale, + a2_scale=layer.w2_input_scale, + apply_router_weight_on_input=apply_router_weight_on_input, + ) + + class ModelOptNvFp4Config(QuantizationConfig): """Config class for ModelOpt FP4.""" @@ -274,7 +524,7 @@ def __init__(self, quant_config: Union[ModelOptFp8Config, class ModelOptNvFp4LinearMethod(LinearMethodBase): """Linear method for Model Optimizer NVFP4. Supports loading NVFP4 checkpoints with the following structure: - + input_scale: torch.float32, scalar , weight: NVFP4(represented as byte) Shape: [1, X, y/2] weight_scale: FP8-E4M3, Shape: [X, Y], aka per block scale, @@ -455,7 +705,7 @@ def apply( class ModelOptNvFp4FusedMoE(FusedMoEMethodBase): """ MoE Method for FP4 Quantization. - Args: + Args: quant_config: NVFP4 Quant Config """ @@ -472,6 +722,12 @@ def __init__(self, quant_config: ModelOptNvFp4Config): " quantization. Please use Blackwell and" " above.") + def uses_weight_scale_2_pattern(self) -> bool: + """ + FP4 variants use 'weight_scale_2' pattern for per-tensor weight scales. 
+ """ + return True + def create_weights(self, layer: torch.nn.Module, num_experts: int, hidden_size: int, intermediate_size_per_partition: int, params_dtype: torch.dtype, **extra_weight_attrs): diff --git a/vllm/model_executor/model_loader/weight_utils.py b/vllm/model_executor/model_loader/weight_utils.py index 1058ae140b5..178b37d7d70 100644 --- a/vllm/model_executor/model_loader/weight_utils.py +++ b/vllm/model_executor/model_loader/weight_utils.py @@ -762,6 +762,10 @@ def maybe_remap_kv_scale_name(name: str, params_dict: dict) -> Optional[str]: modelopt_scale_names = [ ".self_attn.k_proj.k_scale", ".self_attn.v_proj.v_scale" ] + # Also support qkv_proj scale parameters (from stacked parameter processing) + qkv_proj_scale_names = [ + ".self_attn.qkv_proj.k_scale", ".self_attn.qkv_proj.v_scale" + ] for scale_name in possible_scale_names: if name.endswith(scale_name): if any(mo_scale_name in name @@ -769,6 +773,12 @@ def maybe_remap_kv_scale_name(name: str, params_dict: dict) -> Optional[str]: remapped_name = name.replace( f".self_attn.{scale_name[1]}_proj{scale_name}", f".self_attn.attn{scale_name}") + elif any(qkv_scale_name in name + for qkv_scale_name in qkv_proj_scale_names): + # Handle qkv_proj scale parameters + remapped_name = name.replace( + f".self_attn.qkv_proj{scale_name}", + f".self_attn.attn{scale_name}") else: remapped_name = name.replace(scale_name, f".attn{scale_name}") if remapped_name not in params_dict: diff --git a/vllm/model_executor/models/llama4.py b/vllm/model_executor/models/llama4.py index 0c9baab1f2e..fab1c163ac2 100644 --- a/vllm/model_executor/models/llama4.py +++ b/vllm/model_executor/models/llama4.py @@ -35,7 +35,8 @@ RowParallelLinear) from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.rotary_embedding import get_rope -from vllm.model_executor.model_loader.weight_utils import default_weight_loader +from vllm.model_executor.model_loader.weight_utils import ( + default_weight_loader, maybe_remap_kv_scale_name) from .llama import LlamaForCausalLM, LlamaMLP, LlamaModel from .utils import (AutoWeightsLoader, extract_layer_index, fast_topk, @@ -432,12 +433,24 @@ def load_weights(self, weights: Iterable[tuple[str, for param_name, weight_name, shard_id in stacked_params_mapping: if weight_name not in name or "experts" in name: continue - name = name.replace(weight_name, param_name) + # This check is for ModelOpt ckpts with kv cache quant enabled + if not (name.endswith( + (".k_scale", ".v_scale")) and "self_attn" in name): + name = name.replace(weight_name, param_name) if is_pp_missing_parameter(name, self): continue + if name.endswith("scale") and "expert" not in name: + # Remapping the name of FP8 kv-scale. + name = maybe_remap_kv_scale_name(name, params_dict) + if name is None: + continue param = params_dict[name] - weight_loader = param.weight_loader - weight_loader(param, loaded_weight, shard_id) + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + if weight_loader == default_weight_loader: + weight_loader(param, loaded_weight) + else: + weight_loader(param, loaded_weight, shard_id) loaded_params.add(name) break else: @@ -452,6 +465,44 @@ def load_weights(self, weights: Iterable[tuple[str, if not moe_loaded: if is_pp_missing_parameter(name, self): continue + + # Handle flat expert scale parameters that + # don't match per-expert patterns + if ("experts." 
in name and ("w13_input_scale" in name + or "w13_weight_scale" in name + or "w2_input_scale" in name + or "w2_weight_scale" in name)): + # These are flat expert scales that apply to all experts + param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + + # Check for MoE-specific loading support via + # attribute instead of expensive runtime reflection + supports_moe = getattr(weight_loader, + 'supports_moe_loading', False) + + if supports_moe: + # This is a MoE weight loader + if "w13_" in name: + shard_id = "w1" + elif "w2_" in name: + shard_id = "w2" + else: + shard_id = "w1" + + weight_loader(param, + loaded_weight, + name, + shard_id=shard_id, + expert_id=0) + else: + # Regular weight loader (handles both + # param.weight_loader and default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + continue + param = params_dict[name] weight_loader = getattr(param, "weight_loader", default_weight_loader) diff --git a/vllm/model_executor/models/mllama4.py b/vllm/model_executor/models/mllama4.py index 1276d626a7c..dea85d320ad 100644 --- a/vllm/model_executor/models/mllama4.py +++ b/vllm/model_executor/models/mllama4.py @@ -717,6 +717,7 @@ class Llama4ForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsPP): packed_modules_mapping = { "qkv_proj": ["q_proj", "k_proj", "v_proj"], + "gate_up_proj": ["gate_proj", "up_proj"], } @classmethod @@ -902,32 +903,109 @@ def _consolidate_qkv_weights( qkv_weight = torch.cat(weight, dim=0) yield key, qkv_weight - def load_weights(self, weights: Iterable[tuple[str, - torch.Tensor]]) -> set[str]: + def _rename_weight_for_modelopt_checkpoint(self, name: str) -> str: + """Rename weights from ModelOpt llama4 fp8 checkpoints to vLLM + format.""" + if name.startswith("model."): + # Handle expert scale parameters with flat naming + if "feed_forward.experts." in name and ("_input_scale" in name or + "_weight_scale" in name): + renamed = name.replace("model.", "language_model.model.", 1) + # Map checkpoint naming to vLLM's expected naming + if "down_proj_input_scale" in renamed: + return renamed.replace("down_proj_input_scale", + "w2_input_scale") + elif "down_proj_weight_scale" in renamed: + return renamed.replace("down_proj_weight_scale", + "w2_weight_scale") + elif "gate_up_proj_input_scale" in renamed: + return renamed.replace("gate_up_proj_input_scale", + "w13_input_scale") + elif "gate_up_proj_weight_scale" in renamed: + return renamed.replace("gate_up_proj_weight_scale", + "w13_weight_scale") + return renamed + + # Handle attention scale parameters + elif "self_attn." 
in name and (".k_scale" in name + or ".v_scale" in name): + renamed = name.replace("model.", "language_model.model.", 1) + if ".k_proj.k_scale" in renamed: + return renamed.replace(".k_proj.k_scale", ".attn.k_scale") + elif ".v_proj.v_scale" in renamed: + return renamed.replace(".v_proj.v_scale", ".attn.v_scale") + return renamed + + # Standard model.* to language_model.model.* renaming + return name.replace("model.", "language_model.model.", 1) + + elif name.startswith("lm_head.weight"): + return name.replace("lm_head.weight", + "language_model.lm_head.weight") + + return name + + def _separate_and_rename_weights( + self, weights: Iterable[tuple[str, torch.Tensor]] + ) -> tuple[list[tuple[str, torch.Tensor]], list[tuple[str, torch.Tensor]]]: + """Rename weights and separate them into language_model and other + weights.""" + language_model_weights = [] + other_weights = [] - stacked_params_mapping = [ - # (param_name, shard_name, shard_id) - (".self_attn.qkv_proj", ".self_attn.q_proj", "q"), - (".self_attn.qkv_proj", ".self_attn.k_proj", "k"), - (".self_attn.qkv_proj", ".self_attn.v_proj", "v"), - ] - params_dict = dict(self.named_parameters()) - updated_params: set[str] = set() + for name, weight in weights: + renamed = self._rename_weight_for_modelopt_checkpoint(name) - # language_model is an Llama4ForCausalLM instance. We load it's - # using llama4's load_weights routine. - language_model_weights, other_weights = self.separate_weights( - weights, prefix="language_model.") - loader = AutoWeightsLoader(self) - loaded_language_model_params = loader.load_weights( - language_model_weights) - assert loaded_language_model_params is not None - updated_params.update(loaded_language_model_params) + if renamed.startswith("language_model."): + language_model_weights.append((renamed, weight)) + else: + other_weights.append((renamed, weight)) + + return language_model_weights, other_weights + + def _handle_expert_scale_broadcasting( + self, weights: list[tuple[str, torch.Tensor]], params_dict: dict + ) -> tuple[list[tuple[str, torch.Tensor]], set[str]]: + """Handle expert scale parameters that need broadcasting. + + ModelOpt checkpoints use a single value tensor scalar for BMM style + experts, vLLM expects the scale to be broadcasted across all experts. + """ + regular_weights = [] + expert_scale_weights = [] + updated_params = set() + + for name, weight in weights: + # Check if this is an expert scale parameter that needs broadcasting + if ("feed_forward.experts." 
in name and "scale" in name + and ".shared_expert" not in name): + if name in params_dict: + param = params_dict[name] + if (hasattr(param, 'data') and param.data.numel() > 1 + and weight.numel() == 1): + # Broadcast single value to all experts + param.data.fill_(weight.item()) + updated_params.add(name) + continue + + expert_scale_weights.append((name, weight)) + else: + regular_weights.append((name, weight)) + + return regular_weights, expert_scale_weights, updated_params + + def _load_other_weights(self, other_weights: Iterable[tuple[str, + torch.Tensor]], + params_dict: dict, + stacked_params_mapping: list) -> set[str]: + """Load non-language-model weights with stacking support.""" + updated_params = set() if self.use_data_parallel: other_weights = self._consolidate_qkv_weights(other_weights) for name, loaded_weight in other_weights: + # Try stacked parameter mapping first for param_name, weight_name, shard_id in stacked_params_mapping: if weight_name not in name or self.use_data_parallel: continue @@ -938,10 +1016,56 @@ def load_weights(self, weights: Iterable[tuple[str, weight_loader(param, loaded_weight, shard_id) break else: + # Use regular weight loading param = params_dict[name] weight_loader = getattr(param, "weight_loader", default_weight_loader) - weight_loader(param, loaded_weight) updated_params.add(name) + + return updated_params + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + (".self_attn.qkv_proj", ".self_attn.q_proj", "q"), + (".self_attn.qkv_proj", ".self_attn.k_proj", "k"), + (".self_attn.qkv_proj", ".self_attn.v_proj", "v"), + # Shared expert gate_up_proj stacking + (".shared_expert.gate_up_proj", ".shared_expert.gate_proj", 0), + (".shared_expert.gate_up_proj", ".shared_expert.up_proj", 1), + # Feed forward gate_up_proj stacking (for non-MoE layers if any) + (".feed_forward.gate_up_proj", ".feed_forward.gate_proj", 0), + (".feed_forward.gate_up_proj", ".feed_forward.up_proj", 1), + ] + params_dict = dict(self.named_parameters()) + updated_params: set[str] = set() + + # Separate and rename weights + language_model_weights, other_weights = ( + self._separate_and_rename_weights(weights)) + + # Handle expert scale parameters + regular_weights, expert_scale_weights, updated_params_from_experts = ( + self._handle_expert_scale_broadcasting(language_model_weights, + params_dict)) + updated_params.update(updated_params_from_experts) + + loader = AutoWeightsLoader(self) + loaded_language_model_params = loader.load_weights(regular_weights) + assert loaded_language_model_params is not None + updated_params.update(loaded_language_model_params) + + if expert_scale_weights: + loaded_expert_scale_params = loader.load_weights( + expert_scale_weights) + if loaded_expert_scale_params: + updated_params.update(loaded_expert_scale_params) + + updated_params.update( + self._load_other_weights(other_weights, params_dict, + stacked_params_mapping)) + return updated_params From 72943c9b63de12c5007d4732aa4753392c469f23 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Sat, 12 Jul 2025 15:07:35 +0900 Subject: [PATCH 037/552] Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694" (#20853) Signed-off-by: mgoin Signed-off-by: x22x22 --- CMakeLists.txt | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 538f9adcb24..e59e912a991 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -171,16 +171,6 @@ if(NVCC_THREADS AND 
VLLM_GPU_LANG STREQUAL "CUDA") list(APPEND VLLM_GPU_FLAGS "--threads=${NVCC_THREADS}") endif() -# -# Set nvcc fatbin compression. -# -if(VLLM_GPU_LANG STREQUAL "CUDA") - if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) - list(APPEND VLLM_GPU_FLAGS "-Xfatbin" "-compress-all" "-compress-mode=size") - endif() -endif() - - # # Use FetchContent for C++ dependencies that are compiled as part of vLLM's build process. # setup.py will override FETCHCONTENT_BASE_DIR to play nicely with sccache. From 6166a25491c69b2c5cf203d49ec213a45d982f12 Mon Sep 17 00:00:00 2001 From: Congcong Chen Date: Sat, 12 Jul 2025 06:02:10 -0700 Subject: [PATCH 038/552] [Model] New model support for microsoft/Phi-4-mini-flash-reasoning (#20702) Signed-off-by: Congcong Chen Signed-off-by: x22x22 --- csrc/mamba/mamba_ssm/selective_scan_fwd.cu | 49 +- docs/models/supported_models.md | 1 + tests/models/registry.py | 4 + tests/models/test_initialization.py | 3 + tests/test_utils.py | 25 + vllm/attention/backends/blocksparse_attn.py | 3 +- .../backends/differential_flash_attn.py | 1000 +++++++++++++++++ .../backends/dual_chunk_flash_attn.py | 3 +- vllm/attention/backends/flash_attn.py | 3 +- vllm/attention/backends/flashinfer.py | 3 +- vllm/attention/backends/hpu_attn.py | 3 +- vllm/attention/backends/rocm_flash_attn.py | 3 +- vllm/attention/backends/xformers.py | 3 +- vllm/attention/layer.py | 4 - .../model_executor/layers/logits_processor.py | 3 +- vllm/model_executor/models/phi4flash.py | 746 ++++++++++++ vllm/model_executor/models/registry.py | 1 + vllm/platforms/cuda.py | 4 + vllm/platforms/interface.py | 1 + vllm/utils/__init__.py | 18 +- vllm/worker/model_runner.py | 4 + vllm/worker/worker.py | 26 +- 22 files changed, 1869 insertions(+), 41 deletions(-) create mode 100644 vllm/attention/backends/differential_flash_attn.py create mode 100644 vllm/model_executor/models/phi4flash.py diff --git a/csrc/mamba/mamba_ssm/selective_scan_fwd.cu b/csrc/mamba/mamba_ssm/selective_scan_fwd.cu index 785d316025e..5f920997934 100644 --- a/csrc/mamba/mamba_ssm/selective_scan_fwd.cu +++ b/csrc/mamba/mamba_ssm/selective_scan_fwd.cu @@ -312,19 +312,20 @@ void selective_scan_fwd_launch(SSMParamsBase ¶ms, cudaStream_t stream) { // kIsVariableB, kIsVariableC and kHasZ are all set to True to reduce binary size constexpr bool kIsVariableB = true; constexpr bool kIsVariableC = true; - constexpr bool kHasZ = true; BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen, [&] { - BOOL_SWITCH(params.query_start_loc_ptr != nullptr , kVarlen, [&] { - using Ktraits = Selective_Scan_fwd_kernel_traits; - constexpr int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t); - dim3 grid(params.batch, params.dim / kNRows); - auto kernel = &selective_scan_fwd_kernel; - if (kSmemSize >= 48 * 1024) { - C10_CUDA_CHECK(cudaFuncSetAttribute( - (void *) kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize)); - } - kernel<<>>(params); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + BOOL_SWITCH(params.z_ptr != nullptr , kHasZ, [&] { + BOOL_SWITCH(params.query_start_loc_ptr != nullptr , kVarlen, [&] { + using Ktraits = Selective_Scan_fwd_kernel_traits; + constexpr int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t); + dim3 grid(params.batch, params.dim / kNRows); + auto kernel = &selective_scan_fwd_kernel; + if (kSmemSize >= 48 * 1024) { + C10_CUDA_CHECK(cudaFuncSetAttribute( + kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize)); + } + kernel<<>>(params); + 
C10_CUDA_KERNEL_LAUNCH_CHECK(); + }); }); }); } @@ -612,19 +613,20 @@ void selective_scan_fwd(const torch::Tensor &u, const torch::Tensor &delta, at::Tensor z, out_z; const bool has_z = z_.has_value(); - TORCH_CHECK(has_z, "has_z = False is disabled in favor of reduced binary size") - z = z_.value(); - TORCH_CHECK(z.scalar_type() == input_type); - TORCH_CHECK(z.is_cuda()); - TORCH_CHECK(z.stride(-1) == 1 || z.size(-1) == 1); - if (varlen){ - CHECK_SHAPE(z, dim, seqlen); - } else { - CHECK_SHAPE(z, batch_size, dim, seqlen); + if (has_z) { + z = z_.value(); + TORCH_CHECK(z.scalar_type() == input_type); + TORCH_CHECK(z.is_cuda()); + TORCH_CHECK(z.stride(-1) == 1 || z.size(-1) == 1); + if (varlen){ + CHECK_SHAPE(z, dim, seqlen); + } else { + CHECK_SHAPE(z, batch_size, dim, seqlen); + } + + out_z = z; } - out_z = z; - // Right now u has BHL layout and delta has HBL layout, and we want out to have HBL layout at::Tensor out = delta; TORCH_CHECK(ssm_states.scalar_type() == input_type); @@ -653,4 +655,3 @@ void selective_scan_fwd(const torch::Tensor &u, const torch::Tensor &delta, selective_scan_fwd_cuda(params, stream); }); } - diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index a9597e45fd5..9e70e46fabe 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -374,6 +374,7 @@ Specified using `--task generate`. | `Phi3ForCausalLM` | Phi-4, Phi-3 | `microsoft/Phi-4-mini-instruct`, `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Phi3SmallForCausalLM` | Phi-3-Small | `microsoft/Phi-3-small-8k-instruct`, `microsoft/Phi-3-small-128k-instruct`, etc. | | ✅︎ | ✅︎ | | `PhiMoEForCausalLM` | Phi-3.5-MoE | `microsoft/Phi-3.5-MoE-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `Phi4FlashForCausalLM` | Phi-4-mini-flash-reasoning | `microsoft/microsoft/Phi-4-mini-instruct`, etc. | | | | | `PersimmonForCausalLM` | Persimmon | `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc. | | ✅︎ | ✅︎ | | `Plamo2ForCausalLM` | PLaMo2 | `pfnet/plamo-2-1b`, `pfnet/plamo-2-8b`, etc. | | | | | `QWenLMHeadModel` | Qwen | `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc. 
| ✅︎ | ✅︎ | ✅︎ | diff --git a/tests/models/registry.py b/tests/models/registry.py index fa10857313a..c10d375683e 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -248,6 +248,10 @@ def check_available_online( "Phi3SmallForCausalLM": _HfExamplesInfo("microsoft/Phi-3-small-8k-instruct", trust_remote_code=True, v0_only=True), + "Phi4FlashForCausalLM": _HfExamplesInfo("microsoft/Phi-4-mini-flash-reasoning", # noqa: E501 + trust_remote_code=True, + v0_only=True, + max_model_len=10240), "PhiMoEForCausalLM": _HfExamplesInfo("microsoft/Phi-3.5-MoE-instruct", trust_remote_code=True), "Plamo2ForCausalLM": _HfExamplesInfo("pfnet/plamo-2-1b", diff --git a/tests/models/test_initialization.py b/tests/models/test_initialization.py index 07ded1e5880..ea6a2cc37cc 100644 --- a/tests/models/test_initialization.py +++ b/tests/models/test_initialization.py @@ -103,6 +103,9 @@ def _initialize_kv_caches_v1(self, vllm_config): _initialize_kv_caches_v1), monkeypatch.context() as m): if model_info.v0_only: m.setenv("VLLM_USE_V1", "0") + if model_arch == "Phi4FlashForCausalLM": + # Phi4FlashForCausalLM only supports DIFFERENTIAL_FLASH_ATTN backend + m.setenv("VLLM_ATTENTION_BACKEND", "DIFFERENTIAL_FLASH_ATTN") LLM( model_info.default, tokenizer=model_info.tokenizer, diff --git a/tests/test_utils.py b/tests/test_utils.py index f90715fd751..28acacd2519 100644 --- a/tests/test_utils.py +++ b/tests/test_utils.py @@ -458,6 +458,31 @@ def test_bind_kv_cache(): assert ctx['layers.2.self_attn'].kv_cache[0] is kv_cache[2] assert ctx['layers.3.self_attn'].kv_cache[0] is kv_cache[3] +def test_bind_kv_cache_kv_sharing(): + from vllm.attention import Attention + + ctx = { + 'layers.0.self_attn': Attention(32, 128, 0.1), + 'layers.1.self_attn': Attention(32, 128, 0.1), + 'layers.2.self_attn': Attention(32, 128, 0.1), + 'layers.3.self_attn': Attention(32, 128, 0.1), + } + kv_cache = [ + torch.zeros((1, )), + torch.zeros((1, )), + torch.zeros((1, )), + torch.zeros((1, )), + ] + shared_kv_cache_layers = { + 'layers.2.self_attn': 'layers.1.self_attn', + 'layers.3.self_attn': 'layers.0.self_attn' + } + bind_kv_cache(ctx, [kv_cache], shared_kv_cache_layers) + assert ctx['layers.0.self_attn'].kv_cache[0] is kv_cache[0] + assert ctx['layers.1.self_attn'].kv_cache[0] is kv_cache[1] + assert ctx['layers.2.self_attn'].kv_cache[0] is kv_cache[1] + assert ctx['layers.3.self_attn'].kv_cache[0] is kv_cache[0] + def test_bind_kv_cache_non_attention(): from vllm.attention import Attention diff --git a/vllm/attention/backends/blocksparse_attn.py b/vllm/attention/backends/blocksparse_attn.py index fe9738d804c..e4338805f56 100644 --- a/vllm/attention/backends/blocksparse_attn.py +++ b/vllm/attention/backends/blocksparse_attn.py @@ -308,7 +308,8 @@ def __init__( kv_sharing_target_layer_name: Optional[str] = None, ) -> None: if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in V0.") + raise NotImplementedError("KV sharing is not supported in V0 " + "BLOCK_SPARSE_FLASH_ATTN Backend.") assert blocksparse_params is not None assert alibi_slopes is None, ValueError( "Alibi not support for blocksparse flash attention.") diff --git a/vllm/attention/backends/differential_flash_attn.py b/vllm/attention/backends/differential_flash_attn.py new file mode 100644 index 00000000000..7c35e58967d --- /dev/null +++ b/vllm/attention/backends/differential_flash_attn.py @@ -0,0 +1,1000 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project 
+"""" An implementation of https://arxiv.org/pdf/2410.05258 """ +from collections import defaultdict +from dataclasses import dataclass +from itertools import accumulate +from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Type + +import torch +from einops import rearrange + +from vllm import _custom_ops as ops +# yapf conflicts with isort for this block +# yapf: disable +from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl, + AttentionLayer, + AttentionMetadata, + AttentionMetadataBuilder, + AttentionType, + is_quantized_kv_cache) +from vllm.attention.backends.flash_attn import FlashAttentionBackend +# yapf: enable +from vllm.attention.backends.utils import (PAD_SLOT_ID, CommonAttentionState, + compute_slot_mapping, + compute_slot_mapping_start_idx, + is_all_cross_attn_metadata_set, + is_all_encoder_attn_metadata_set, + is_block_tables_empty) +from vllm.attention.utils.fa_utils import (flash_attn_supports_fp8, + get_flash_attn_version) +from vllm.logger import init_logger +from vllm.multimodal import MultiModalPlaceholderMap +from vllm.utils import async_tensor_h2d, make_tensor_with_pad +from vllm.vllm_flash_attn import (flash_attn_varlen_func, + flash_attn_with_kvcache) + +if TYPE_CHECKING: + from vllm.worker.model_runner import (ModelInputForGPUBuilder, + ModelInputForGPUWithSamplingMetadata) + +logger = init_logger(__name__) + + +class DifferentialFlashAttentionBackend(AttentionBackend): + accept_output_buffer = False + + @staticmethod + def get_supported_head_sizes() -> List[int]: + return [32, 64, 96, 128, 160, 192, 224, 256] + + @staticmethod + def get_kv_cache_shape( + num_blocks: int, + block_size: int, + num_kv_heads: int, + head_size: int, + ) -> Tuple[int, ...]: + if block_size % 16 != 0: + raise ValueError("Block size must be a multiple of 16.") + assert num_kv_heads % 2 == 0, "num_kv_heads must be divisible by 2" + return (2, 2, num_blocks, block_size, num_kv_heads // 2, head_size) + + @staticmethod + def get_name() -> str: + return "DIFFERENTIAL_FLASH_ATTN" + + @staticmethod + def get_impl_cls() -> Type["DifferentialFlashAttentionImpl"]: + return DifferentialFlashAttentionImpl + + @staticmethod + def get_metadata_cls() -> Type["DifferentialFlashAttentionMetadata"]: + return DifferentialFlashAttentionMetadata + + @staticmethod + def get_builder_cls() -> Type["DifferentialFlashAttentionMetadataBuilder"]: + return DifferentialFlashAttentionMetadataBuilder + + @staticmethod + def get_state_cls() -> Type["CommonAttentionState"]: + return CommonAttentionState + + @staticmethod + def swap_blocks( + src_kv_cache: torch.Tensor, + dst_kv_cache: torch.Tensor, + src_to_dst: torch.Tensor, + ) -> None: + src_key_cache = src_kv_cache[0] + dst_key_cache = dst_kv_cache[0] + ops.swap_blocks(src_key_cache, dst_key_cache, src_to_dst) + src_value_cache = src_kv_cache[1] + dst_value_cache = dst_kv_cache[1] + ops.swap_blocks(src_value_cache, dst_value_cache, src_to_dst) + + @staticmethod + def copy_blocks( + kv_caches: List[torch.Tensor], + src_to_dists: torch.Tensor, + ) -> None: + key_caches = [kv_cache[0] for kv_cache in kv_caches] + value_caches = [kv_cache[1] for kv_cache in kv_caches] + + ops.copy_blocks(key_caches, value_caches, src_to_dists) + + +@dataclass +class DifferentialFlashAttentionMetadata(AttentionMetadata): + """Metadata for FlashAttentionBackend. + + NOTE: Any python object stored here is not updated when it is + cuda-graph replayed. If you have values that need to be changed + dynamically, it should be stored in tensor. 
The tensor has to be + updated from `CUDAGraphRunner.forward` API. + """ + # (batch_size,). The sequence length per sequence. Sequence length means + # the computed tokens + new tokens None if it is a decoding. + seq_lens: Optional[List[int]] + # seq_lens stored as a tensor. + seq_lens_tensor: Optional[torch.Tensor] + + # NOTE(sang): Definition of context_len, query_len, and seq_len. + # |---------- N-1 iteration --------| + # |---------------- N iteration ---------------------| + # |- tokenA -|......................|-- newTokens ---| + # |---------- context_len ----------| + # |-------------------- seq_len ---------------------| + # |-- query_len ---| + + # Maximum sequence length among prefill batch. 0 if there are decoding + # requests only. + max_prefill_seq_len: int + # Maximum sequence length among decode batch. 0 if there are prefill + # requests only. + max_decode_seq_len: int + # (batch_size,) A tensor of context lengths (tokens that are computed + # so far). + context_lens_tensor: Optional[torch.Tensor] + + # (batch_size, max_blocks_per_seq). + # Block addresses per sequence. (Seq id -> list of physical block) + # E.g., [0, 1, 2] means tokens are stored in 0th, 1st, and 2nd blocks + # in the kv cache. Each block can contain up to block_size tokens. + # 2nd dimensions are padded up to max_blocks_per_seq if it is cuda-graph + # captured. + block_tables: Optional[torch.Tensor] + + # Whether or not if cuda graph is enabled. + # Cuda-graph is currently enabled for decoding only. + # TODO(woosuk): Move `use_cuda_graph` out since it's unrelated to attention. + + use_cuda_graph: bool + + # Maximum query length in the batch. + max_query_len: Optional[int] = None + + # Max number of query tokens among request in the batch. + max_decode_query_len: Optional[int] = None + + # (batch_size + 1,). The cumulative subquery lengths of the sequences in + # the batch, used to index into subquery. E.g., if the subquery length + # is [4, 6], it is [0, 4, 10]. + query_start_loc: Optional[torch.Tensor] = None + # (batch_size + 1,). The cumulative sequence lengths of the sequences in + # the batch, used to index into sequence. E.g., if the sequence length is + # [4, 6], it is [0, 4, 10]. + seq_start_loc: Optional[torch.Tensor] = None + + _cached_prefill_metadata: Optional[ + "DifferentialFlashAttentionMetadata"] = None + _cached_decode_metadata: Optional[ + "DifferentialFlashAttentionMetadata"] = None + + # Begin encoder attn & enc/dec cross-attn fields... + + # Encoder sequence lengths representation + encoder_seq_lens: Optional[List[int]] = None + encoder_seq_lens_tensor: Optional[torch.Tensor] = None + # (batch_size + 1,). The cumulative sequence lengths of the sequences in + # the batch, used to index into sequence. E.g., if the sequence length is + # [4, 6], it is [0, 4, 10]. + encoder_seq_start_loc: Optional[torch.Tensor] = None + # Maximum sequence length among encoder sequences + max_encoder_seq_len: Optional[int] = None + # Number of tokens input to encoder + num_encoder_tokens: Optional[int] = None + + # Cross-attention memory-mapping data structures: slot mapping + # and block tables + cross_slot_mapping: Optional[torch.Tensor] = None + cross_block_tables: Optional[torch.Tensor] = None + + # Cross-layer shared attention block tables + cross_layer_shared_block_tables: Optional[torch.Tensor] = None + + @property + def is_all_encoder_attn_metadata_set(self): + ''' + All attention metadata required for encoder attention is set. 
+ ''' + return is_all_encoder_attn_metadata_set(self) + + @property + def is_all_cross_attn_metadata_set(self): + ''' + All attention metadata required for enc/dec cross-attention is set. + + Superset of encoder attention required metadata. + ''' + return is_all_cross_attn_metadata_set(self) + + @property + def prefill_metadata( + self) -> Optional["DifferentialFlashAttentionMetadata"]: + if self.num_prefills == 0: + return None + + if self._cached_prefill_metadata is not None: + return self._cached_prefill_metadata + + assert ((self.seq_lens is not None) + or (self.encoder_seq_lens is not None)) + assert ((self.seq_lens_tensor is not None) + or (self.encoder_seq_lens_tensor is not None)) + + # Compute some attn_metadata fields which default to None + query_start_loc = (None if self.query_start_loc is None else + self.query_start_loc[:self.num_prefills + 1]) + slot_mapping = (None if self.slot_mapping is None else + self.slot_mapping[:self.num_prefill_tokens]) + seq_lens = (None if self.seq_lens is None else + self.seq_lens[:self.num_prefills]) + seq_lens_tensor = (None if self.seq_lens_tensor is None else + self.seq_lens_tensor[:self.num_prefills]) + seq_start_loc = (None if self.seq_start_loc is None else + self.seq_start_loc[:self.num_prefills + 1]) + context_lens_tensor = (None if self.context_lens_tensor is None else + self.context_lens_tensor[:self.num_prefills]) + block_tables = (None if self.block_tables is None else + self.block_tables[:self.num_prefills]) + cross_layer_shared_block_tables = ( + None if self.cross_layer_shared_block_tables is None else + self.cross_layer_shared_block_tables[:self.num_prefills]) + + self._cached_prefill_metadata = DifferentialFlashAttentionMetadata( + num_prefills=self.num_prefills, + num_prefill_tokens=self.num_prefill_tokens, + num_decode_tokens=0, + slot_mapping=slot_mapping, + multi_modal_placeholder_index_maps=self. + multi_modal_placeholder_index_maps, + enable_kv_scales_calculation=self.enable_kv_scales_calculation, + seq_lens=seq_lens, + seq_lens_tensor=seq_lens_tensor, + max_query_len=self.max_query_len, + max_prefill_seq_len=self.max_prefill_seq_len, + max_decode_query_len=0, + max_decode_seq_len=0, + query_start_loc=query_start_loc, + seq_start_loc=seq_start_loc, + context_lens_tensor=context_lens_tensor, + block_tables=block_tables, + cross_layer_shared_block_tables=cross_layer_shared_block_tables, + use_cuda_graph=False, + # Begin encoder & cross attn fields below... 
+ encoder_seq_lens=self.encoder_seq_lens, + encoder_seq_lens_tensor=self.encoder_seq_lens_tensor, + encoder_seq_start_loc=self.encoder_seq_start_loc, + max_encoder_seq_len=self.max_encoder_seq_len, + cross_slot_mapping=self.cross_slot_mapping, + cross_block_tables=self.cross_block_tables) + return self._cached_prefill_metadata + + @property + def decode_metadata( + self) -> Optional["DifferentialFlashAttentionMetadata"]: + if self.num_decode_tokens == 0: + return None + + if self._cached_decode_metadata is not None: + return self._cached_decode_metadata + assert ((self.seq_lens_tensor is not None) + or (self.encoder_seq_lens_tensor is not None)) + + # Compute some attn_metadata fields which default to None + slot_mapping = (None if self.slot_mapping is None else + self.slot_mapping[self.num_prefill_tokens:]) + seq_lens_tensor = (None if self.seq_lens_tensor is None else + self.seq_lens_tensor[self.num_prefills:]) + block_tables = (None if self.block_tables is None else + self.block_tables[self.num_prefills:]) + cross_layer_shared_block_tables = ( + None if self.cross_layer_shared_block_tables is None else + self.cross_layer_shared_block_tables[self.num_prefills:]) + self._cached_decode_metadata = DifferentialFlashAttentionMetadata( + num_prefills=0, + num_prefill_tokens=0, + num_decode_tokens=self.num_decode_tokens, + slot_mapping=slot_mapping, + multi_modal_placeholder_index_maps=None, + enable_kv_scales_calculation=True, + seq_lens=None, + seq_lens_tensor=seq_lens_tensor, + max_decode_query_len=self.max_decode_query_len, + max_query_len=self.max_query_len, + max_prefill_seq_len=0, + max_decode_seq_len=self.max_decode_seq_len, + # Batch may be composed of prefill|decodes, adjust query start + # indices to refer to the start of decodes. E.g. + # in tokens:[3 prefills|6 decodes], query_start_loc=[3,9] => [0,6]. + query_start_loc=(self.query_start_loc[self.num_prefills:] - + self.query_start_loc[self.num_prefills]) + if self.query_start_loc is not None else None, + seq_start_loc=self.seq_start_loc[self.num_prefills:] + if self.seq_start_loc is not None else None, + context_lens_tensor=None, + block_tables=block_tables, + cross_layer_shared_block_tables=cross_layer_shared_block_tables, + use_cuda_graph=self.use_cuda_graph, + # Begin encoder & cross attn fields below... + encoder_seq_lens=self.encoder_seq_lens, + encoder_seq_lens_tensor=self.encoder_seq_lens_tensor, + encoder_seq_start_loc=self.encoder_seq_start_loc, + max_encoder_seq_len=self.max_encoder_seq_len, + cross_slot_mapping=self.cross_slot_mapping, + cross_block_tables=self.cross_block_tables) + return self._cached_decode_metadata + + def advance_step(self, + model_input: "ModelInputForGPUWithSamplingMetadata", + sampled_token_ids: Optional[torch.Tensor], + block_size: int, + num_seqs: int, + num_queries: int, + turn_prefills_into_decodes: bool = False): + """ + Update metadata in-place to advance one decode step. + """ + # When using cudagraph, the num_seqs is padded to the next captured + # batch sized, but num_queries tracks the actual number of requests in + # the batch. For --enforce-eager mode, num_seqs == num_queries + if num_seqs != num_queries: + assert num_seqs > num_queries + assert self.use_cuda_graph + + if turn_prefills_into_decodes: + # When Multi-Step is enabled with Chunked-Prefill, prefills and + # decodes are scheduled together. In the first step, all the + # prefills turn into decodes. This update reflects that + # conversion. 
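+            # From here on the batch is decode-only: every sequence
+            # contributes exactly one query token, so max_query_len drops to
+            # 1 and the prefill counters are zeroed.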
+ assert self.num_decode_tokens + self.num_prefills == num_seqs + self.num_decode_tokens += self.num_prefills + self.num_prefills = 0 + self.num_prefill_tokens = 0 + self.max_prefill_seq_len = 0 + self.max_query_len = 1 + + self.slot_mapping = self.slot_mapping[:num_seqs] + else: + assert self.seq_lens is not None + assert self.max_decode_seq_len == max(self.seq_lens) + + assert self.num_prefills == 0 + assert self.num_prefill_tokens == 0 + assert self.num_decode_tokens == num_seqs + assert self.slot_mapping.shape == (num_seqs, ) + + assert self.seq_lens is not None + assert len(self.seq_lens) == num_seqs + assert self.seq_lens_tensor is not None + assert self.seq_lens_tensor.shape == (num_seqs, ) + assert self.max_query_len == 1 + assert self.max_prefill_seq_len == 0 + + assert self.query_start_loc is not None + assert self.query_start_loc.shape == (num_queries + 1, ) + assert self.seq_start_loc is not None + assert self.seq_start_loc.shape == (num_seqs + 1, ) + + assert self.context_lens_tensor is not None + assert self.context_lens_tensor.shape == (num_queries, ) + + assert self.block_tables is not None + assert self.block_tables.shape[0] == num_seqs + + # Update query lengths. Note that we update only queries and not seqs, + # since tensors may be padded due to captured cuda graph batch size + for i in range(num_queries): + self.seq_lens[i] += 1 + self.max_decode_seq_len = max(self.seq_lens) + + ops.advance_step_flashattn(num_seqs=num_seqs, + num_queries=num_queries, + block_size=block_size, + input_tokens=model_input.input_tokens, + sampled_token_ids=sampled_token_ids, + input_positions=model_input.input_positions, + seq_lens=self.seq_lens_tensor, + slot_mapping=self.slot_mapping, + block_tables=self.block_tables) + + +class DifferentialFlashAttentionMetadataBuilder( + AttentionMetadataBuilder[DifferentialFlashAttentionMetadata]): + + def __init__(self, input_builder: "ModelInputForGPUBuilder"): + self.input_builder = input_builder + self.runner = input_builder.runner + self.sliding_window = input_builder.sliding_window + self.block_size = input_builder.block_size + + def prepare(self): + self.slot_mapping: List[int] = [] + self.prefill_seq_lens: List[int] = [] + self.context_lens: List[int] = [] + self.block_tables: List[List[int]] = [] + self.cross_layer_shared_block_tables: List[List[int]] = [] + self.curr_seq_lens: List[int] = [] + self.multimodal_placeholder_maps: Dict[ + str, + MultiModalPlaceholderMap] = defaultdict(MultiModalPlaceholderMap) + self.num_prefills = 0 + self.num_prefill_tokens = 0 + self.num_decode_tokens = 0 + self.has_prefix_cache_hit = False + + def _add_seq_group( + self, inter_data: "ModelInputForGPUBuilder.InterDataForSeqGroup", + chunked_prefill_enabled: bool, prefix_cache_hit: bool): + """Add a sequence group to the metadata. Specifically update/append + 1. context length. + 2. block table. + 3. slot mapping. + """ + # TODO: add support for chunked prefill and prefix caching. 
+ assert not chunked_prefill_enabled, \ + "chunked prefill is not supported for now" + assert not prefix_cache_hit, "prefix caching is not supported for now" + + is_prompt = inter_data.is_prompt + block_tables = inter_data.block_tables + + for (seq_id, token_len, seq_len, curr_seq_len, query_len, context_len, + curr_sliding_window_block) in zip( + inter_data.seq_ids, [len(t) for t in inter_data.input_tokens], + inter_data.orig_seq_lens, inter_data.seq_lens, + inter_data.query_lens, inter_data.context_lens, + inter_data.curr_sliding_window_blocks): + self.context_lens.append(context_len) + + if is_prompt: + mm_maps = inter_data.multi_modal_placeholder_maps + if mm_maps: + for modality, placeholders in mm_maps.items(): + self.multimodal_placeholder_maps[modality].extend( + placeholders) + + self.num_prefills += 1 + self.num_prefill_tokens += token_len + self.prefill_seq_lens.append(seq_len) + else: + self.num_decode_tokens += query_len + self.curr_seq_lens.append(curr_seq_len) + + # Compute block table. + # TODO(sang): Combine chunked prefill and prefix caching by + # only allowing multiple of block_size chunk size. + # NOTE: This only works for oooooooxxx style attention. + block_table = [] + if prefix_cache_hit: + # NOTE(woosuk): For flash-attn, the block table should + # include the entries for the incoming prefill tokens. + block_table = block_tables[seq_id] + elif ((chunked_prefill_enabled or not is_prompt) + and block_tables is not None): + if curr_sliding_window_block == 0: + block_table = block_tables[seq_id] + else: + block_table = block_tables[seq_id][ + -curr_sliding_window_block:] + self.block_tables.append(block_table) + + cross_layer_shared_block_table = [] + if prefix_cache_hit: + cross_layer_shared_block_table = block_tables[seq_id] + elif block_tables is not None: + if curr_sliding_window_block == 0: + cross_layer_shared_block_table = block_tables[seq_id] + else: + cross_layer_shared_block_table = block_tables[seq_id][ + -curr_sliding_window_block:] + self.cross_layer_shared_block_tables.append( + cross_layer_shared_block_table) + + # Compute slot mapping. + is_profile_run = is_block_tables_empty(block_tables) + start_idx = compute_slot_mapping_start_idx(is_prompt, query_len, + context_len, + self.sliding_window) + compute_slot_mapping(is_profile_run, self.slot_mapping, seq_id, + seq_len, context_len, start_idx, + self.block_size, inter_data.block_tables) + + def _get_graph_runner_block_tables(self, num_seqs: int, + block_tables: List[List[int]], + graph_block_tables) -> torch.Tensor: + # The shape of graph_block_tables is + # [max batch size, max context len // block size]. + # max_batch_size, max_blocks = self.runner.graph_block_tables.shape + max_batch_size, max_blocks = graph_block_tables.shape + assert max_batch_size >= num_seqs + + # graph_block_tables = self.runner.graph_block_tables[:num_seqs] + graph_block_tables = graph_block_tables[:num_seqs] + for i, block_table in enumerate(block_tables): + if block_table: + num_blocks = len(block_table) + if num_blocks <= max_blocks: + graph_block_tables[i, :num_blocks] = block_table + else: + # It may be possible to have more blocks allocated due + # to lookahead slots of multi-step, however, they are + # not used anyway, so can be safely ignored. 
+ graph_block_tables[ + i, :max_blocks] = block_table[:max_blocks] + + return torch.from_numpy(graph_block_tables).to( + device=self.runner.device, non_blocking=True) + + def build(self, seq_lens: List[int], query_lens: List[int], + cuda_graph_pad_size: int, batch_size: int): + """Build attention metadata with on-device tensors. + + Args: + seq_lens: The maybe padded sequence lengths of the input sequences. + query_lens: The query lengths of the input sequences. + cuda_graph_pad_size: The padding size for cuda graph. + -1 if cuda graph is not used. + batch_size: The maybe padded batch size. + """ + prefix_cache_hit = any([ + inter_data.prefix_cache_hit + for inter_data in self.input_builder.inter_data_list + ]) + for inter_data in self.input_builder.inter_data_list: + self._add_seq_group(inter_data, + self.input_builder.chunked_prefill_enabled, + prefix_cache_hit) + + device = self.runner.device + use_captured_graph = cuda_graph_pad_size != -1 + + max_query_len = max(query_lens) + decode_query_lens = query_lens[self.num_prefills:] + if len(decode_query_lens) > 0: + max_decode_query_len = max(decode_query_lens) + else: + max_decode_query_len = 1 + max_prefill_seq_len = max(self.prefill_seq_lens, default=0) + max_decode_seq_len = max(self.curr_seq_lens, default=0) + num_decode_tokens = self.num_decode_tokens + query_start_loc = list(accumulate(query_lens, initial=0)) + seq_start_loc = list(accumulate(seq_lens, initial=0)) + + num_seqs = len(seq_lens) + if use_captured_graph: + self.slot_mapping.extend([PAD_SLOT_ID] * cuda_graph_pad_size) + self.block_tables.extend([] * cuda_graph_pad_size) + + self.cross_layer_shared_block_tables.extend([] * + cuda_graph_pad_size) + + num_decode_tokens = batch_size - self.num_prefill_tokens + block_tables = self._get_graph_runner_block_tables( + num_seqs, self.block_tables, self.runner.graph_block_tables) + cross_layer_shared_block_tables = \ + self._get_graph_runner_block_tables( + num_seqs, self.cross_layer_shared_block_tables, + self.runner.cross_layer_shared_graph_block_tables) + else: + block_tables = make_tensor_with_pad( + self.block_tables, + pad=0, + dtype=torch.int, + device=device, + ) + cross_layer_shared_block_tables = make_tensor_with_pad( + self.cross_layer_shared_block_tables, + pad=0, + dtype=torch.int, + device=device, + ) + assert max_query_len > 0, ("query_lens: {}".format(query_lens)) + + assert device is not None + context_lens_tensor = async_tensor_h2d(self.context_lens, torch.int, + device, self.runner.pin_memory) + seq_lens_tensor = async_tensor_h2d(seq_lens, torch.int, device, + self.runner.pin_memory) + slot_mapping_tensor = async_tensor_h2d(self.slot_mapping, torch.long, + device, self.runner.pin_memory) + query_start_loc_tensor = async_tensor_h2d(query_start_loc, torch.int32, + device, + self.runner.pin_memory) + seq_start_loc_tensor = async_tensor_h2d(seq_start_loc, torch.int32, + device, self.runner.pin_memory) + placeholder_index_maps = { + modality: placeholder_map.index_map() + for modality, placeholder_map in + self.multimodal_placeholder_maps.items() + } + + return DifferentialFlashAttentionMetadata( + num_prefills=self.num_prefills, + slot_mapping=slot_mapping_tensor, + num_prefill_tokens=self.num_prefill_tokens, + num_decode_tokens=num_decode_tokens, + seq_lens=seq_lens, + multi_modal_placeholder_index_maps=placeholder_index_maps, + enable_kv_scales_calculation=True, + seq_lens_tensor=seq_lens_tensor, + max_query_len=max_query_len, + max_decode_query_len=max_decode_query_len, + max_prefill_seq_len=max_prefill_seq_len, + 
max_decode_seq_len=max_decode_seq_len, + query_start_loc=query_start_loc_tensor, + seq_start_loc=seq_start_loc_tensor, + context_lens_tensor=context_lens_tensor, + block_tables=block_tables, + cross_layer_shared_block_tables=cross_layer_shared_block_tables, + use_cuda_graph=use_captured_graph, + ) + + +class DifferentialFlashAttentionImpl(AttentionImpl): + """ + If the input tensors contain prompt tokens, the layout is as follows: + |<--------------- num_prefill_tokens ----------------->| + |<--prefill_0-->|<--prefill_1-->|...|<--prefill_N-1--->| + + Otherwise, the layout is as follows: + |<----------------- num_decode_tokens ------------------>| + |<--decode_0-->|..........|<--decode_M-1-->|<--padding-->| + + Generation tokens can contain padding when cuda-graph is used. + Currently, prompt tokens don't contain any padding. + + The prompts might have different lengths, while the generation tokens + always have length 1. + + If chunked prefill is enabled, prefill tokens and decode tokens can be + batched together in a flattened 1D query. + + |<----- num_prefill_tokens ---->|<------- num_decode_tokens --------->| + |<-prefill_0->|...|<-prefill_N-1->|<--decode_0-->|...|<--decode_M-1-->| + + Currently, cuda graph is disabled for chunked prefill, meaning there's no + padding between prefill and decode tokens. + """ + + def __init__( + self, + num_heads: int, + head_size: int, + scale: float, + num_kv_heads: int, + alibi_slopes: Optional[List[float]], + sliding_window: Optional[int], + kv_cache_dtype: str, + blocksparse_params: Optional[Dict[str, Any]] = None, + logits_soft_cap: Optional[float] = None, + attn_type: str = AttentionType.DECODER, + kv_sharing_target_layer_name: Optional[str] = None, + use_irope: bool = False, + differential_flash_attention_config: Optional[Dict[str, Any]] = None, + ) -> None: + if differential_flash_attention_config is None: + differential_flash_attention_config = {} + self.differential_flash_attention_config = \ + differential_flash_attention_config + self.used_shared_kv_cache = kv_sharing_target_layer_name is not None + self.kv_sharing_target_layer_name = kv_sharing_target_layer_name + if blocksparse_params is not None: + raise ValueError( + "FlashAttention does not support block-sparse attention.") + if use_irope: + logger.warning( + "Using irope in V0 is not supported yet, it will fall back " + "to global attention for long context.") + self.num_heads = num_heads + self.head_size = head_size + self.scale = float(scale) + self.num_kv_heads = num_kv_heads + if alibi_slopes is not None: + alibi_slopes = torch.tensor(alibi_slopes, dtype=torch.float32) + self.alibi_slopes = alibi_slopes + self.sliding_window = ((sliding_window - 1, + 0) if sliding_window is not None else (-1, -1)) + self.kv_cache_dtype = kv_cache_dtype + self.vllm_flash_attn_version = get_flash_attn_version( + requires_alibi=self.alibi_slopes is not None) + if is_quantized_kv_cache(self.kv_cache_dtype) and ( + not self.kv_cache_dtype.startswith("fp8") + or not flash_attn_supports_fp8()): + raise NotImplementedError( + f"FlashAttention does not support {self.kv_cache_dtype} " + "kv-cache on this device " + f"(FA supports fp8 = {flash_attn_supports_fp8()}).") + if logits_soft_cap is None: + # In flash-attn, setting logits_soft_cap as 0 means no soft cap. 
+ logits_soft_cap = 0 + self.logits_soft_cap = logits_soft_cap + + assert self.num_heads % self.num_kv_heads == 0 + self.num_queries_per_kv = self.num_heads // self.num_kv_heads + + support_head_sizes = FlashAttentionBackend.get_supported_head_sizes() + if head_size not in support_head_sizes: + raise ValueError( + f"Head size {head_size} is not supported by FlashAttention. " + f"Supported head sizes are: {support_head_sizes}.") + self.attn_type = attn_type + + self.lambda_full = None + self.subln = self.differential_flash_attention_config["subln"] + + def split_heads(self, x): + # split by num_heads, the stripe pattern is friendly to tensor parallel. + x = rearrange(x, "... (H two) D -> ... H two D", two=2) + x1 = x[..., 0, :] + x2 = x[..., 1, :] + return x1.contiguous(), x2.contiguous() + + def split_kv_cache(self, x): + # split by num_heads, the stripe pattern is friendly to tensor parallel. + if x.numel() == 0: + return torch.empty(0), torch.empty(0) + + x1, x2 = x[0], x[1] + return x1, x2 + + def populate_kv_cache(self, layer: AttentionLayer, key: torch.Tensor, + value: torch.Tensor, kv_cache: torch.Tensor, + attn_metadata: DifferentialFlashAttentionMetadata): + if kv_cache.numel() > 0 and key is not None and value is not None: + updated_slot_mapping = attn_metadata.slot_mapping + torch.ops._C_cache_ops.reshape_and_cache_flash( + key, + value, + kv_cache[0], + kv_cache[1], + updated_slot_mapping.flatten(), + self.kv_cache_dtype, + layer._k_scale, + layer._v_scale, + ) + + def forward_generate_kv_cache( + self, query: torch.Tensor, key: Optional[torch.Tensor], + value: Optional[torch.Tensor], k_cache: torch.Tensor, + v_cache: torch.Tensor, + attn_metadata: DifferentialFlashAttentionMetadata) -> torch.Tensor: + + head_size = self.head_size + num_heads = self.num_heads // 2 + num_kv_heads = self.num_kv_heads // 2 + + query = query.view(-1, num_heads, head_size) + if key is not None: + assert value is not None + key = key.view(-1, num_kv_heads, head_size) + value = value.view(-1, num_kv_heads, head_size) + else: + assert value is None + + num_prefill_tokens = attn_metadata.num_prefill_tokens + num_decode_tokens = attn_metadata.num_decode_tokens + assert key.shape[ + 0] == num_prefill_tokens + num_decode_tokens, "key shape mismatch" + assert value.shape[ + 0] == num_prefill_tokens + num_decode_tokens, "value shape mismatch" + + output = torch.empty_like(query) + # Query for decode. KV is not needed because it is already cached. + decode_query = query[num_prefill_tokens:] + # QKV for prefill. + query = query[:num_prefill_tokens] + if key is not None and value is not None: + key = key[:num_prefill_tokens] + value = value[:num_prefill_tokens] + + assert query.shape[0] == num_prefill_tokens, "query shape mismatch" + assert decode_query.shape[ + 0] == num_decode_tokens, "decode query shape mismatch" + + if prefill_meta := attn_metadata.prefill_metadata: + # Prompt run. 
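+            # Regular prefills (and profiling runs) have no paged KV cache to
+            # read from, so attend directly over this batch's freshly
+            # computed K/V with varlen flash attention; prefix-cache hits are
+            # rejected in the else branch below.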
+ if k_cache.numel() == 0 \ + or prefill_meta.block_tables is None \ + or prefill_meta.block_tables.numel() == 0: + # normal attention + prefill_output = flash_attn_varlen_func( + q=query, + k=key, + v=value, + cu_seqlens_q=prefill_meta.seq_start_loc, + cu_seqlens_k=prefill_meta.seq_start_loc, + max_seqlen_q=prefill_meta.max_prefill_seq_len, + max_seqlen_k=prefill_meta.max_prefill_seq_len, + softmax_scale=self.scale, + causal=True, + window_size=self.sliding_window, + alibi_slopes=self.alibi_slopes, + softcap=self.logits_soft_cap, + ) + assert prefill_output.shape == output[: + num_prefill_tokens].shape + output[:num_prefill_tokens] = prefill_output + else: + raise Exception("prefix caching not supported") + + if decode_meta := attn_metadata.decode_metadata: + block_tables_arg = decode_meta.block_tables + try: + output[num_prefill_tokens:] = flash_attn_with_kvcache( + q=decode_query.unsqueeze(1), + k_cache=k_cache, + v_cache=v_cache, + block_table=block_tables_arg, + cache_seqlens=decode_meta.seq_lens_tensor, + softmax_scale=self.scale, + causal=True, + window_size=self.sliding_window, + alibi_slopes=self.alibi_slopes, + softcap=self.logits_soft_cap, + ).squeeze(1) + except Exception as e: + logger.error("Error in PagedAttention.forward_decode: %s", + str(e)) + raise e + + # Reshape the output tensor. + return output.view(-1, num_heads, head_size) + + def forward_with_kv_cache_only( + self, + query: torch.Tensor, + k_cache: torch.Tensor, + v_cache: torch.Tensor, + attn_metadata: DifferentialFlashAttentionMetadata, + ): + if not attn_metadata.decode_metadata: + block_tables_arg = attn_metadata.cross_layer_shared_block_tables + else: + block_tables_arg = attn_metadata.block_tables + + output = flash_attn_with_kvcache( + q=query.unsqueeze(1), + k_cache=k_cache, + v_cache=v_cache, + block_table=block_tables_arg, + cache_seqlens=attn_metadata.seq_lens_tensor, + softmax_scale=self.scale, + causal=True, + window_size=self.sliding_window, + alibi_slopes=self.alibi_slopes, + softcap=self.logits_soft_cap, + ).squeeze(1) + return output + + def forward( + self, + layer: AttentionLayer, + q: torch.Tensor, + k: torch.Tensor, + v: torch.Tensor, + kv_cache: torch.Tensor, + attn_metadata: DifferentialFlashAttentionMetadata, + output: Optional[torch.Tensor] = None, + output_scale: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + """Forward pass with FlashAttention. + + Args: + query: shape = [num_tokens, num_heads, head_size] + key: shape = [num_tokens, num_kv_heads, head_size] + value: shape = [num_tokens, num_kv_heads, head_size] + output: shape = [num_tokens, num_heads, head_size] + kv_cache = [2, num_blocks, block_size, num_kv_heads, head_size] + NOTE: kv_cache will be an empty tensor with shape [0] + for profiling run. + attn_metadata: Metadata for attention. + NOTE: It in-place updates the output tensor. + NOTE: FP8 quantization, flash-attn expect the size of + {q,k,v}_descale to be (num_sequences, num_kv_heads). 
+ We use torch's .expand() to avoid duplicating values + """ + if self.lambda_full is None: + self.lambda_init = self.differential_flash_attention_config[ + "lambda_init"] + lambda_q1 = self.differential_flash_attention_config["lambda_q1"] + lambda_k1 = self.differential_flash_attention_config["lambda_k1"] + lambda_q2 = self.differential_flash_attention_config["lambda_q2"] + lambda_k2 = self.differential_flash_attention_config["lambda_k2"] + lambda_1 = torch.exp( + torch.sum(lambda_q1 * lambda_k1, dim=-1).float()).type_as(q) + lambda_2 = torch.exp( + torch.sum(lambda_q2 * lambda_k2, dim=-1).float()).type_as(q) + self.lambda_full = lambda_1 - lambda_2 + self.lambda_init + + if not self.used_shared_kv_cache: # need to generate kv-cache + q = q.view(-1, self.num_heads, self.head_size) + k = k.view(-1, self.num_kv_heads, self.head_size) + v = v.view(-1, self.num_kv_heads, self.head_size) + + q1, q2 = self.split_heads(q) + k1, k2 = self.split_heads(k) + v1, v2 = self.split_heads(v) + + # kv_cache shape is (2, 2, num_blocks, block_size, num_kv_heads // 2, head_size) # noqa: E501 + # Split by half along the first dimension. + kv_cache1, kv_cache2 = self.split_kv_cache(kv_cache) + assert kv_cache1.is_contiguous(), "kv_cache1 is not contiguous" + assert kv_cache2.is_contiguous(), "kv_cache2 is not contiguous" + + if kv_cache1.numel() != 0: + self.populate_kv_cache(layer, k1, v1, kv_cache1, attn_metadata) + self.populate_kv_cache(layer, k2, v2, kv_cache2, attn_metadata) + + key_cache1, value_cache1 = self.split_kv_cache(kv_cache1) + key_cache2, value_cache2 = self.split_kv_cache(kv_cache2) + else: + key_cache1, value_cache1 = torch.empty(0), torch.empty(0) + key_cache2, value_cache2 = torch.empty(0), torch.empty(0) + attn11 = self.forward_generate_kv_cache(q1, k1, v1, key_cache1, + value_cache1, + attn_metadata) + attn12 = self.forward_generate_kv_cache(q1, k1, v2, key_cache1, + value_cache2, + attn_metadata) + attn11 = attn11.view(q1.shape) + attn12 = attn12.view(q1.shape) + attn1 = torch.cat([attn11, attn12], dim=-1) + + attn21 = self.forward_generate_kv_cache(q2, k2, v1, key_cache2, + value_cache1, + attn_metadata) + attn22 = self.forward_generate_kv_cache(q2, k2, v2, key_cache2, + value_cache2, + attn_metadata) + attn21 = attn21.view(q2.shape) + attn22 = attn22.view(q2.shape) + attn2 = torch.cat([attn21, attn22], dim=-1) + + attn = attn1 - self.lambda_full * attn2 + # attn shape (-1, self.num_heads // 2, 2 * self.head_dim) + attn = self.subln(attn) + attn = attn * (1 - self.lambda_init) + # reshape back to 2 * num_head + attn_output = rearrange(attn, + "... H (two D) -> ... 
(H two) D", + two=2) + + else: # re-use the kv cache, full attention + q = q.view(-1, self.num_heads, self.head_size) + q1, q2 = self.split_heads(q) + # kv_cache shape is (2, num_blocks, block_size, num_kv_heads, head_size) # noqa: E501 + kv_cache1, kv_cache2 = self.split_kv_cache(kv_cache) + key_cache1, value_cache1 = kv_cache1[0], kv_cache1[1] + key_cache2, value_cache2 = kv_cache2[0], kv_cache2[1] + + attn11 = self.forward_with_kv_cache_only(q1, key_cache1, + value_cache1, + attn_metadata) + attn12 = self.forward_with_kv_cache_only(q1, key_cache1, + value_cache2, + attn_metadata) + attn11 = attn11.view(q1.shape) + attn12 = attn12.view(q1.shape) + attn1 = torch.cat([attn11, attn12], dim=-1) + + attn21 = self.forward_with_kv_cache_only(q2, key_cache2, + value_cache1, + attn_metadata) + attn22 = self.forward_with_kv_cache_only(q2, key_cache2, + value_cache2, + attn_metadata) + attn21 = attn21.view(q2.shape) + attn22 = attn22.view(q2.shape) + attn2 = torch.cat([attn21, attn22], dim=-1) + + attn = attn1 - self.lambda_full * attn2 + attn = self.subln(attn) + attn = attn * (1 - self.lambda_init) + # reshape back to 2 * num_head + attn_output = rearrange(attn, + "... H (two D) -> ... (H two) D", + two=2) + attn_output = attn_output.view(-1, self.num_heads * self.head_size) + return attn_output diff --git a/vllm/attention/backends/dual_chunk_flash_attn.py b/vllm/attention/backends/dual_chunk_flash_attn.py index f62a43b441f..40557a4e8f8 100644 --- a/vllm/attention/backends/dual_chunk_flash_attn.py +++ b/vllm/attention/backends/dual_chunk_flash_attn.py @@ -295,7 +295,8 @@ def __init__( dual_chunk_attention_config: Optional[Dict[str, Any]] = None, ) -> None: if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in V0.") + raise NotImplementedError("KV sharing is not supported in V0 " + "DUAL_CHUNK_FLASH_ATTN backend.") self.num_heads = num_heads self.head_size = head_size self.scale = float(scale) diff --git a/vllm/attention/backends/flash_attn.py b/vllm/attention/backends/flash_attn.py index bf8e373802f..20e67eb9b40 100755 --- a/vllm/attention/backends/flash_attn.py +++ b/vllm/attention/backends/flash_attn.py @@ -622,7 +622,8 @@ def __init__( use_irope: bool = False, ) -> None: if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in V0.") + raise NotImplementedError("KV sharing is not supported in V0 " + "FLASH_ATTN backend.") if blocksparse_params is not None: raise ValueError( "FlashAttention does not support block-sparse attention.") diff --git a/vllm/attention/backends/flashinfer.py b/vllm/attention/backends/flashinfer.py index 5bbe340b143..1f913ad8952 100644 --- a/vllm/attention/backends/flashinfer.py +++ b/vllm/attention/backends/flashinfer.py @@ -1006,7 +1006,8 @@ def __init__( use_irope: bool = False, ) -> None: if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in V0.") + raise NotImplementedError("KV sharing is not supported in V0 " + "FLASHINFER backend.") if use_irope: logger.warning_once( "Using irope in FlashInfer is not supported yet, it will fall" diff --git a/vllm/attention/backends/hpu_attn.py b/vllm/attention/backends/hpu_attn.py index bf778a1e501..b8fdf763a04 100644 --- a/vllm/attention/backends/hpu_attn.py +++ b/vllm/attention/backends/hpu_attn.py @@ -115,7 +115,8 @@ def __init__( ) -> None: super(AttentionImpl, self).__init__() if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in 
V0.") + raise NotImplementedError("KV sharing is not supported in V0 " + "HPU_ATTN backend.") if use_irope: logger.warning_once( "Using irope in HPU is not supported yet, it will fall back " diff --git a/vllm/attention/backends/rocm_flash_attn.py b/vllm/attention/backends/rocm_flash_attn.py index 0b7783758dd..4653d5267e1 100644 --- a/vllm/attention/backends/rocm_flash_attn.py +++ b/vllm/attention/backends/rocm_flash_attn.py @@ -501,7 +501,8 @@ def __init__( use_irope: bool = False, ) -> None: if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in V0.") + raise NotImplementedError("KV sharing is not supported in V0 " + "ROCM_FLASH backend.") if use_irope: logger.warning_once( "Using irope in ROCm Flash Attention is not supported yet, it " diff --git a/vllm/attention/backends/xformers.py b/vllm/attention/backends/xformers.py index b583240c73c..3ef79bb6212 100644 --- a/vllm/attention/backends/xformers.py +++ b/vllm/attention/backends/xformers.py @@ -394,7 +394,8 @@ def __init__( use_irope: bool = False, ) -> None: if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in V0.") + raise NotImplementedError("KV sharing is not supported in V0 " + "XFORMERS backend.") if blocksparse_params is not None: raise ValueError( "XFormers does not support block-sparse attention.") diff --git a/vllm/attention/layer.py b/vllm/attention/layer.py index 3d5746837be..f9c2d4f4983 100644 --- a/vllm/attention/layer.py +++ b/vllm/attention/layer.py @@ -160,10 +160,6 @@ def __init__( self.attn_type = attn_type if kv_sharing_target_layer_name is not None: - if not envs.VLLM_USE_V1: - raise NotImplementedError( - "Cross-layer KV sharing is not supported in V0.") - validate_kv_sharing_target( prefix, kv_sharing_target_layer_name, diff --git a/vllm/model_executor/layers/logits_processor.py b/vllm/model_executor/layers/logits_processor.py index 3d01253447c..e93be9bfb16 100644 --- a/vllm/model_executor/layers/logits_processor.py +++ b/vllm/model_executor/layers/logits_processor.py @@ -59,11 +59,12 @@ def forward( hidden_states: torch.Tensor, sampling_metadata: Optional[SamplingMetadata] = None, embedding_bias: Optional[torch.Tensor] = None, + prune_hidden_states: bool = True, ) -> Optional[torch.Tensor]: if self.logits_as_input: logits = hidden_states else: - if sampling_metadata is not None: + if sampling_metadata is not None and prune_hidden_states: hidden_states = _prune_hidden_states(hidden_states, sampling_metadata) diff --git a/vllm/model_executor/models/phi4flash.py b/vllm/model_executor/models/phi4flash.py new file mode 100644 index 00000000000..10f8b6552af --- /dev/null +++ b/vllm/model_executor/models/phi4flash.py @@ -0,0 +1,746 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import math +from collections.abc import Iterable +from typing import Optional, Union + +import torch +import torch.nn as nn +from transformers.activations import ACT2FN + +import vllm.envs as envs +from vllm.attention import Attention, AttentionMetadata, AttentionType +from vllm.attention.selector import _Backend +from vllm.config import CacheConfig, VllmConfig +from vllm.distributed import get_pp_group, get_tensor_model_parallel_world_size +from vllm.forward_context import ForwardContext, get_forward_context +from vllm.logger import init_logger +from vllm.model_executor.layers.linear import (ColumnParallelLinear, + MergedColumnParallelLinear, + RowParallelLinear) +from 
vllm.model_executor.layers.logits_processor import LogitsProcessor +from vllm.model_executor.layers.mamba.ops.causal_conv1d import ( + causal_conv1d_fn, causal_conv1d_update) +from vllm.model_executor.layers.mamba.ops.mamba_ssm import ( + selective_scan_fn, selective_state_update) +from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler +from vllm.model_executor.layers.vocab_parallel_embedding import ( + DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) +from vllm.model_executor.models.interfaces import (HasInnerState, IsHybrid, + SupportsV0Only) +from vllm.model_executor.models.mamba_cache import (MambaCacheManager, + MambaCacheParams) +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.sequence import IntermediateTensors + +from .utils import make_layers, maybe_prefix + +logger = init_logger(__name__) + + +class SwiGLUActivation(nn.Module): + + def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor: + return x1 * nn.functional.silu(x2) + + +class SambaYMLP(nn.Module): + """Gated Linear Unit. + + Reference: + Language Modeling with Gated Convolutional Networks. + https://arxiv.org/pdf/1612.08083v3.pdf. + + """ + + def __init__(self, config): + super().__init__() + + self.config = config + self.fc1 = nn.Linear(config.hidden_size, + 2 * config.intermediate_size, + bias=False) + self.fc2 = nn.Linear(config.intermediate_size, + config.hidden_size, + bias=False) + + self.activation_fn = ACT2FN[config.hidden_act] + + def forward(self, hidden_states): + y = self.fc1(hidden_states) + gate, y = y.chunk(2, dim=-1) + y = y * self.activation_fn(gate) + return self.fc2(y) + + +def get_virtual_engine(): + forward_context: ForwardContext = get_forward_context() + return forward_context.virtual_engine + + +class SambaYAttention(nn.Module): + + def __init__(self, + config, + layer_idx: Optional[int] = None, + yoco_cross: bool = False, + cache_config: Optional[CacheConfig] = None, + prefix: str = ""): + super().__init__() + if layer_idx is None: + logger.warning_once( + f"Instantiating {self.__class__.__name__} without passing " + "a `layer_idx` is not recommended and will lead to errors " + "during the forward call if caching is used. 
Please make " + "sure to provide a `layer_idx` when creating this class.") + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.hidden_size // self.num_heads + self.num_key_value_heads = config.num_key_value_heads + self.yoco_cross = yoco_cross + + if (self.head_dim * self.num_heads) != self.hidden_size: + raise ValueError("hidden_size must be divisible by num_heads " + f"(got `hidden_size`: {self.hidden_size} and " + f"`num_heads`: {self.num_heads}).") + + op_size = self.num_heads * self.head_dim + 2 * ( + self.num_key_value_heads * self.head_dim) + self.out_proj = nn.Linear(self.num_heads * self.head_dim, + self.hidden_size, + bias=True) + if yoco_cross: + self.Wqkv = nn.Linear(self.hidden_size, + self.num_heads * self.head_dim, + bias=True) + else: + self.Wqkv = nn.Linear(self.hidden_size, op_size, bias=True) + + # disable sliding window for the second half of the model + sliding_window = config.interleaved_sliding_window[layer_idx] + if layer_idx >= config.num_hidden_layers // 2: + assert sliding_window is None, \ + "sliding_window must be none for the second decoder" + else: + assert sliding_window is not None, \ + "sliding_window must be set for the first decoder" + + assert self.num_heads % 2 == 0, 'num_heads should be even' + assert self.num_key_value_heads % 2 == 0, 'num_heads should be even' + + self.lambda_init = self.lambda_init_fn(layer_idx) + self.lambda_q1 = nn.Parameter( + torch.zeros(self.head_dim, dtype=torch.float32).normal_(mean=0, + std=0.1)) + self.lambda_k1 = nn.Parameter( + torch.zeros(self.head_dim, dtype=torch.float32).normal_(mean=0, + std=0.1)) + self.lambda_q2 = nn.Parameter( + torch.zeros(self.head_dim, dtype=torch.float32).normal_(mean=0, + std=0.1)) + self.lambda_k2 = nn.Parameter( + torch.zeros(self.head_dim, dtype=torch.float32).normal_(mean=0, + std=0.1)) + self.subln = nn.RMSNorm(2 * self.head_dim, + eps=1e-5, + elementwise_affine=True) + + params = { + 'differential_flash_attention_config': { + 'lambda_init': self.lambda_init, + 'lambda_q1': self.lambda_q1, + 'lambda_k1': self.lambda_k1, + 'lambda_q2': self.lambda_q2, + 'lambda_k2': self.lambda_k2, + "subln": self.subln, + } + } + + if yoco_cross: + kv_shared_layer_index = config.num_hidden_layers // 2 + 1 + kv_sharing_target_layer_name = \ + f"model.layers.{kv_shared_layer_index}.self_attn.attn" + else: + kv_sharing_target_layer_name = None + + self.attn = Attention( + self.num_heads, + self.head_dim, + self.head_dim**-0.5, + num_kv_heads=self.num_key_value_heads, + cache_config=cache_config, + per_layer_sliding_window=sliding_window, + prefix=f"{prefix}.attn", + attn_type=AttentionType.DECODER, + kv_sharing_target_layer_name=kv_sharing_target_layer_name, + **params) + assert self.attn.backend == _Backend.DIFFERENTIAL_FLASH_ATTN,\ + "DIFFERENTIAL_FLASH_ATTN required" + + def lambda_init_fn(self, depth): + return 0.8 - 0.6 * math.exp(-0.3 * depth) + + def forward( + self, + hidden_states: torch.Tensor, + ): + + if not self.yoco_cross: # need to generate kv-cache + qkv = self.Wqkv(hidden_states) + q, k, v = qkv.split([ + self.hidden_size, self.num_key_value_heads * self.head_dim, + self.num_key_value_heads * self.head_dim + ], + dim=-1) + attn_output = self.attn(q, k, v) + else: # re-use the kv cache, full attention + q = self.Wqkv(hidden_states) + attn_output = self.attn(q, None, None) + attn_output = attn_output.view(-1, self.num_heads * self.head_dim) + return self.out_proj(attn_output) + + +class Phi4Mamba(nn.Module): + + def __init__( + self, + 
d_model, + d_state=16, + d_conv=4, + expand=2, + dt_rank="auto", + dt_min=0.001, + dt_max=0.1, + dt_init="random", # difference + dt_scale=1.0, # difference + dt_init_floor=1e-4, + conv_bias=True, + bias=False, + use_fast_path=True, # Fused kernel options + layer_idx=None, + device=None, + dtype=None, + yoco_cross=False, + yoco_kv=False, + ): + factory_kwargs = {"params_dtype": dtype} # difference + super().__init__() + self.yoco_cross = yoco_cross + self.yoco_kv = yoco_kv + self.d_model = d_model + self.d_state = d_state + self.d_conv = d_conv + self.expand = expand + self.d_inner = int(self.expand * self.d_model) + self.dt_rank = math.ceil(self.d_model / + 16) if dt_rank == "auto" else dt_rank + self.use_fast_path = use_fast_path + self.layer_idx = layer_idx + self.swiGluActivation = SwiGLUActivation() + if self.yoco_cross: + self.in_proj = MergedColumnParallelLinear(self.d_model, + [self.d_inner], + bias=bias, + **factory_kwargs) + self.out_proj = RowParallelLinear(self.d_inner, + self.d_model, + bias=bias, + **factory_kwargs) + return + self.conv1d = ColumnParallelLinear( + input_size=d_conv, + output_size=self.d_inner, + bias=conv_bias, + params_dtype=dtype, + ) + # unsqueeze to fit conv1d weights shape into the linear weights shape. + # Can't do this in `weight_loader` since it already exists in + # `ColumnParallelLinear` and `set_weight_attrs` + # doesn't allow to override it + self.conv1d.weight.data = self.conv1d.weight.data.unsqueeze(1) + + self.in_proj = MergedColumnParallelLinear( + self.d_model, + [self.d_inner] * 2, + bias=bias, + params_dtype=dtype, + ) + + # selective projection used to make dt, B and C input dependent + self.x_proj = RowParallelLinear( + self.d_inner, + self.dt_rank + self.d_state * 2, + bias=False, + params_dtype=dtype, + ) + + # time step projection (discretization) - + # In the forward we need to apply dt_proj without the bias, + # as the bias is added in the selective scan kernel. + self.dt_proj = ColumnParallelLinear( + self.dt_rank, + self.d_inner, + bias=True, + skip_bias_add=True, + params_dtype=dtype, + ) + + # # D "skip" parameter + # self.D = nn.Parameter(torch.ones(self.d_inner)) # Keep in fp32 + self.A = nn.Parameter( + torch.empty( + self.d_inner, + self.d_state, + dtype=torch.float32, + )) + self.D = nn.Parameter(torch.ones(self.d_inner, dtype=torch.float32)) + + self.out_proj = RowParallelLinear( + self.d_inner, + self.d_model, + bias=bias, + input_is_parallel=True, + params_dtype=dtype, + ) + self.activation = "silu" + + def forward(self, + hidden_states: torch.Tensor, + attn_metadata: AttentionMetadata, + mamba_cache_params: MambaCacheParams, + yoco_key_values=None) -> torch.Tensor: + + if self.yoco_cross: + out = self.in_proj(hidden_states)[0] + out = self.swiGluActivation(yoco_key_values, out) + out = self.out_proj(out) + return out[0], yoco_key_values + + # 1. Gated MLP's linear projection + # projected_states = self.in_proj(hidden_states)[0].transpose(-2, -1) + projected_states = self.in_proj( + hidden_states.to(self.in_proj.weight.dtype))[0].transpose(-2, -1) + hidden_states, gate = projected_states.chunk(2, dim=-2) + + # 2. 
Convolution sequence transformation + conv_weights = self.conv1d.weight.view(self.conv1d.weight.size(0), + self.conv1d.weight.size(2)) + + if attn_metadata.query_start_loc is not None \ + and attn_metadata.context_lens_tensor is not None: + # |---------- N-1 iteration --------| + # |---------------- N iteration ---------------------| + # |- tokenA -|......................|-- newTokens ---| + # |---------- context_len ----------| + # |-------------------- seq_len ---------------------| + # |-- query_len ---| + hidden_states = causal_conv1d_fn( + hidden_states, + conv_weights, + self.conv1d.bias, + activation=self.activation, + conv_states=mamba_cache_params.conv_state, + has_initial_state=attn_metadata.context_lens_tensor > 0, + cache_indices=mamba_cache_params.state_indices_tensor, + query_start_loc=attn_metadata.query_start_loc) + else: + hidden_states = causal_conv1d_update( + hidden_states.transpose(0, 1), + mamba_cache_params.conv_state, + conv_weights, + self.conv1d.bias, + self.activation, + conv_state_indices=mamba_cache_params.state_indices_tensor) + hidden_states = hidden_states.transpose(0, 1) + + # 3. State Space Model sequence transformation + # 3.a. input varying initialization of time_step, B and C + ssm_parameters = self.x_proj(hidden_states.transpose(-2, -1))[0] + + time_step, B, C = torch.split( + ssm_parameters, + [self.dt_rank, self.d_state, self.d_state], + dim=-1, + ) + + # Note that Jamba normalizes B, C, and time_step here but Mamba doesn't. + + discrete_time_step = self.dt_proj(time_step)[0].transpose(-2, -1) + # 3.c perform the recurrence y ← SSM(A, B, C)(x) + time_proj_bias = (self.dt_proj.bias.float() if hasattr( + self.dt_proj, "bias") else None) + + if attn_metadata.query_start_loc is not None \ + and attn_metadata.context_lens_tensor is not None: + scan_outputs = selective_scan_fn( + hidden_states, + mamba_cache_params.ssm_state, + discrete_time_step, + self.A, + B.transpose(-2, -1), + C.transpose(-2, -1), + self.D.float(), + # z, + None if self.yoco_kv else gate, + time_proj_bias, + delta_softplus=True, + cache_indices=mamba_cache_params.state_indices_tensor, + has_initial_state=attn_metadata.context_lens_tensor > 0, + query_start_loc=attn_metadata.query_start_loc) + else: + scan_outputs = selective_state_update( + mamba_cache_params.ssm_state, + hidden_states.transpose(0, 1), + discrete_time_step.transpose(0, 1), + self.A, + B, + C, + self.D, + # z + # gate.transpose(0, 1), + None if self.yoco_kv else gate.transpose(0, 1), + time_proj_bias, + dt_softplus=True, + state_batch_indices=mamba_cache_params.state_indices_tensor) + scan_outputs = scan_outputs.transpose(0, 1) + + # 4. 
Final linear projection + if self.yoco_kv: + # gate = gate.transpose(-1,-2).contiguous() + yoco_key_values = scan_outputs.transpose(-2, -1) + scan_outputs = self.swiGluActivation(scan_outputs, gate) + + contextualized_states = self.out_proj(scan_outputs.transpose(-2, + -1))[0] + + return contextualized_states, yoco_key_values + + +class SambaYDecoderLayer(nn.Module): + + def __init__( + self, + config, + layer_idx, + cache_config, + prefix: str = "", + ) -> None: + super().__init__() + + self.config = config + self.layer_idx = layer_idx + + self.mlp = SambaYMLP(config) + self.input_layernorm = nn.LayerNorm(config.hidden_size, + eps=config.layer_norm_eps) + + self.yoco_mb = False + self.yoco_cross = False + if layer_idx >= config.num_hidden_layers // 2: + self.yoco_mb = True + self.yoco_cross = (layer_idx + >= (config.num_hidden_layers // 2 + 2)) + self.use_mamba = config.mb_per_layer > 0 and \ + layer_idx % config.mb_per_layer == 0 + if self.use_mamba: + factory_kwargs = {"dtype": None} + self.attn = Phi4Mamba(config.hidden_size, + layer_idx=layer_idx, + yoco_cross=self.yoco_cross, + yoco_kv=self.yoco_mb, + **factory_kwargs) + else: + self.attn = SambaYAttention(config, + layer_idx=layer_idx, + yoco_cross=self.yoco_cross, + cache_config=cache_config, + prefix=f"{prefix}.self_attn") + self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, + eps=config.layer_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + positions: torch.Tensor, + attn_metadata: AttentionMetadata, + mamba_cache_params: MambaCacheParams, + ssm_output: Optional[torch.LongTensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + if self.use_mamba: + assert mamba_cache_params is not None + else: + assert mamba_cache_params is None + + residual = hidden_states + hidden_states = self.input_layernorm( + hidden_states.to(dtype=self.input_layernorm.weight.dtype)) + + if self.use_mamba: + attn_outputs, ssm_output = self.attn(hidden_states, + attn_metadata, + mamba_cache_params, + yoco_key_values=ssm_output) + residual = residual.to(torch.float32) + else: + attn_outputs = self.attn(hidden_states, ) + hidden_states = residual + attn_outputs + residual = hidden_states + hidden_states = self.post_attention_layernorm( + hidden_states.to(dtype=self.post_attention_layernorm.weight.dtype)) + hidden_states = self.mlp(hidden_states) + hidden_states = residual + hidden_states + + return hidden_states, ssm_output + + +class SambaYModel(nn.Module): + + def __init__(self, + config, + cache_config=None, + quant_config=None, + lora_config=None, + prefix: str = "") -> None: + super().__init__() + self.config = config + self.vocab_size = config.vocab_size + self.embed_tokens = VocabParallelEmbedding( + self.vocab_size, + config.hidden_size, + org_num_embeddings=config.vocab_size, + ) + + # Pipeline parallel is not supported since the second half of + # the layers share the kv cache. 
+ if get_pp_group().world_size != 1: + raise ValueError("Pipeline Parallel not supported") + + self.start_layer, self.end_layer, self.layers = make_layers( + config.num_hidden_layers, + lambda prefix: SambaYDecoderLayer(config, + int(prefix.split('.')[-1]), + cache_config, + prefix=prefix), + prefix=f"{prefix}.layers") + self.final_layernorm = nn.LayerNorm(config.hidden_size, + eps=config.layer_norm_eps) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.embed_tokens(input_ids) + + def forward( + self, + input_ids: Optional[torch.Tensor], + positions: torch.Tensor, + attn_metadata: AttentionMetadata, + mamba_cache_params: MambaCacheParams, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + + if get_pp_group().is_first_rank: + if inputs_embeds is not None: + hidden_states = inputs_embeds + else: + hidden_states = self.get_input_embeddings(input_ids) + else: + assert intermediate_tensors is not None + hidden_states = intermediate_tensors["hidden_states"] + + mamba_state_idx = 0 + ssm_output = None + for i in range(self.start_layer, self.end_layer): + layer = self.layers[i] + if i == self.config.num_hidden_layers // 2 + 2: + # profile run + kv_cache_idx = self.config.num_hidden_layers // 2 + 1 + cache_layer = self.layers[kv_cache_idx] + kv_cache = cache_layer.attn.attn.kv_cache + if kv_cache[0].numel() == 0: + break + + # Starting from this layer, we do not need to calculate + # the kv cache since we reuse the kv cache from last layer. + # If in prefill phase, we can prune> truncate + # the hidden state to save computation cost. + if attn_metadata.prefill_metadata and not envs.VLLM_USE_V1: + selected_token_indices = torch.cumsum( + attn_metadata.seq_lens_tensor, dim=0) - 1 + hidden_states = hidden_states.index_select( + 0, selected_token_indices) + ssm_output = ssm_output.index_select( + 0, selected_token_indices) + + if layer.use_mamba: + if i < self.config.num_hidden_layers // 2 or \ + not layer.yoco_cross: + mamba_cache = mamba_cache_params.at_layer_idx( + mamba_state_idx) + mamba_state_idx += 1 + else: + mamba_cache = mamba_cache_params.at_layer_idx( + mamba_state_idx - 1) + + hidden_states, ssm_output = layer(hidden_states, + positions, + attn_metadata, + mamba_cache, + ssm_output=ssm_output) + else: + hidden_states, ssm_output = layer( + hidden_states, + positions, + attn_metadata, + None, # mamba_cache_params + ssm_output=ssm_output) + + hidden_states = self.final_layernorm( + hidden_states.to(dtype=self.final_layernorm.weight.dtype)) + return hidden_states + + +class Phi4FlashForCausalLM(nn.Module, HasInnerState, IsHybrid, SupportsV0Only): + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + config = vllm_config.model_config.hf_config + cache_config = vllm_config.cache_config + lora_config = vllm_config.lora_config + quant_config = vllm_config.quant_config + scheduler_config = vllm_config.scheduler_config + self.compilation_config = vllm_config.compilation_config + self.vllm_config = vllm_config + # Prefix caching and chunked prefill is not supported for this model. 
+        assert not cache_config.enable_prefix_caching, \
+            "Phi4flash currently does not support prefix caching"
+        assert not scheduler_config.chunked_prefill_enabled, \
+            "Phi4Flash currently does not support chunked prefill"
+        super().__init__()
+        self.config = config
+        self.model_config = vllm_config.model_config
+        self.scheduler_config = scheduler_config
+        self.model = SambaYModel(config,
+                                 cache_config=cache_config,
+                                 prefix=maybe_prefix(prefix, "model"))
+        self.unpadded_vocab_size = config.vocab_size
+        if lora_config:
+            self.unpadded_vocab_size += lora_config.lora_extra_vocab_size
+        self.lm_head = ParallelLMHead(
+            self.unpadded_vocab_size,
+            config.hidden_size,
+            org_num_embeddings=config.vocab_size,
+            padding_size=(
+                DEFAULT_VOCAB_PADDING_SIZE
+                # We need bigger padding if using lora for kernel
+                # compatibility
+                if not lora_config else lora_config.lora_vocab_padding_size),
+            quant_config=quant_config,
+        )
+        self.embedding_bias = None
+        # Used to track and store state for the Mamba cache between steps.
+        self.mamba_cache: Optional[MambaCacheManager] = None
+        self.logits_processor = LogitsProcessor(self.unpadded_vocab_size,
+                                                config.vocab_size,
+                                                logits_as_input=False)
+        self.sampler = get_sampler()
+
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        positions: torch.Tensor,
+        intermediate_tensors: Optional[IntermediateTensors] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> Union[torch.Tensor, IntermediateTensors]:
+        if self.mamba_cache is None:
+            num_mamba_layers = self.config.num_hidden_layers \
+                // 2 // self.config.mb_per_layer + 1
+            self.mamba_cache = MambaCacheManager(
+                self.vllm_config, self.lm_head.weight.dtype, num_mamba_layers,
+                *self._get_mamba_cache_shape())
+        mamba_cache_params = self.mamba_cache.current_run_tensors(**kwargs)
+
+        attn_metadata = get_forward_context().attn_metadata
+        # input_ids and hidden_states aren't a one-to-one mapping in the
+        # prefill stage due to the YOCO optimization.
+        hidden_states = self.model(input_ids, positions, attn_metadata,
+                                   mamba_cache_params, intermediate_tensors,
+                                   inputs_embeds)
+        return hidden_states
+
+    def _get_mamba_cache_shape(
+            self
+    ) -> tuple[Optional[tuple[int, int]], Optional[tuple[int, int]]]:
+        world_size = get_tensor_model_parallel_world_size()
+        hidden_size = self.config.hidden_size
+        mamba_expand = self.config.mamba_expand  # 2
+        mamba_d_conv = self.config.mamba_d_conv  # 4
+        mamba_d_state = self.config.mamba_d_state  # 16
+        conv_state_shape = (
+            mamba_expand * hidden_size // world_size,
+            mamba_d_conv - 1,
+        )
+        temporal_state_shape = (
+            mamba_expand * hidden_size // world_size,
+            mamba_d_state,
+        )
+        return conv_state_shape, temporal_state_shape
+
+    def copy_inputs_before_cuda_graphs(self, input_buffers, **kwargs):
+        return self.mamba_cache.copy_inputs_before_cuda_graphs(
+            input_buffers, **kwargs)
+
+    def get_seqlen_agnostic_capture_inputs(self, batch_size: int):
+        return self.mamba_cache.get_seqlen_agnostic_capture_inputs(batch_size)
+
+    def compute_logits(
+        self,
+        hidden_states: torch.Tensor,
+        sampling_metadata: SamplingMetadata,
+    ) -> Optional[torch.Tensor]:
+        # If the shape is the same, it means that we have already
+        # pruned the hidden states manually.
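+        # (During prefill SambaYModel.forward keeps only the last token of
+        # each sequence once the YOCO layers start reusing the KV cache, so
+        # the sizes can already match here.)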
+ prune_hidden_states = hidden_states.size( + 0) != sampling_metadata.selected_token_indices.size(0) + processed_logits = self.logits_processor( + self.lm_head, + hidden_states, + sampling_metadata, + self.embedding_bias, + prune_hidden_states=prune_hidden_states) + return processed_logits + + def sample( + self, + logits: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[SamplerOutput]: + next_tokens = self.sampler(logits, sampling_metadata) + return next_tokens + + def load_weights( + self, + weights: Iterable[tuple[str, torch.Tensor]], + ): + weights = {name: weight for name, weight in weights} + adjusted_weights = {} + for name, weight in weights.items(): + if "A_log" in name: + name = name.replace("A_log", "A") + weight = -torch.exp(weight.float()) + if "inner_cross_attn." in name: + name = name.replace("inner_cross_attn.", "") + adjusted_weights[name] = weight + adjusted_weights["lm_head.weight"] = weights[ + "model.embed_tokens.weight"] + loaded_params: set[str] = set() + for name, param in self.named_parameters(): + weight = adjusted_weights.get(name) + if weight is not None and weight.shape != param.shape: + logger.warning("Shape mismatch: %s %s %s", name, weight.shape, + param.shape) + loaded_params.add(name) + missing_keys, unexpected_keys = self.load_state_dict(adjusted_weights, + strict=False) + assert len(unexpected_keys) == 0, f"Unexpected keys: {unexpected_keys}" + assert len(missing_keys) == 0, f"Missing keys: {missing_keys}" + return loaded_params diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 17d44fa71d5..5f9b145b661 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -110,6 +110,7 @@ "Phi3ForCausalLM": ("phi3", "Phi3ForCausalLM"), "Phi3SmallForCausalLM": ("phi3_small", "Phi3SmallForCausalLM"), "PhiMoEForCausalLM": ("phimoe", "PhiMoEForCausalLM"), + "Phi4FlashForCausalLM": ("phi4flash", "Phi4FlashForCausalLM"), "Plamo2ForCausalLM": ("plamo2", "Plamo2ForCausalLM"), "QWenLMHeadModel": ("qwen", "QWenLMHeadModel"), "Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"), diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index 00151296a75..878f8f77edf 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -316,6 +316,10 @@ def get_attn_backend_cls(cls, selected_backend, head_size, dtype, logger.info("Using DualChunkFlashAttention backend.") return ("vllm.attention.backends.dual_chunk_flash_attn." "DualChunkFlashAttentionBackend") + elif selected_backend == _Backend.DIFFERENTIAL_FLASH_ATTN: + logger.info("Using DifferentialFlashAttention backend.") + return ("vllm.attention.backends.differential_flash_attn." 
+ "DifferentialFlashAttentionBackend") elif selected_backend == _Backend.FLASH_ATTN: pass elif selected_backend: diff --git a/vllm/platforms/interface.py b/vllm/platforms/interface.py index d3060685e98..ae675bcc8d2 100644 --- a/vllm/platforms/interface.py +++ b/vllm/platforms/interface.py @@ -60,6 +60,7 @@ class _Backend(enum.Enum): IPEX = enum.auto() BLOCK_SPARSE_FLASH_ATTN = enum.auto() DUAL_CHUNK_FLASH_ATTN = enum.auto() + DIFFERENTIAL_FLASH_ATTN = enum.auto() NO_ATTENTION = enum.auto() FLEX_ATTENTION = enum.auto() diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index 48346c7d6e5..495e359aa6d 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -2888,8 +2888,9 @@ def get_mp_context(): def bind_kv_cache( - ctx: dict[str, Any], - kv_cache: list[list[torch.Tensor]], # [virtual_engine][layer_index] + ctx: dict[str, Any], + kv_cache: list[list[torch.Tensor]], # [virtual_engine][layer_index] + shared_kv_cache_layers: Optional[dict[str, str]] = None ) -> None: # Bind the kv_cache tensor to Attention modules, similar to # ctx[layer_name].kv_cache[ve]=kv_cache[ve][extract_layer_index(layer_name)] @@ -2901,12 +2902,17 @@ def bind_kv_cache( # attention of the same layer (e.g., bart's decoder.layers.1.self_attn # and decoder.layers.1.encoder_attn) is mapped to the same kv cache # tensor + # 5. Some models have attention layers that share kv cache with previous + # layers, this is specified through shared_kv_cache_layers + if shared_kv_cache_layers is None: + shared_kv_cache_layers = {} from vllm.attention import AttentionType from vllm.model_executor.models.utils import extract_layer_index layer_need_kv_cache = [ layer_name for layer_name in ctx if (hasattr(ctx[layer_name], 'attn_type') and ctx[layer_name].attn_type - in (AttentionType.DECODER, AttentionType.ENCODER_DECODER)) + in (AttentionType.DECODER, AttentionType.ENCODER_DECODER)) \ + and ctx[layer_name].kv_sharing_target_layer_name is None ] layer_index_sorted = sorted( set( @@ -2919,6 +2925,12 @@ def bind_kv_cache( assert len(forward_ctx.kv_cache) == len(kv_cache) for ve, ve_kv_cache in enumerate(kv_cache): forward_ctx.kv_cache[ve] = ve_kv_cache[kv_cache_idx] + if shared_kv_cache_layers is not None: + for layer_name, target_layer_name in shared_kv_cache_layers.items(): + assert extract_layer_index(target_layer_name) < \ + extract_layer_index(layer_name), \ + "v0 doesn't support interleaving kv sharing" + ctx[layer_name].kv_cache = ctx[target_layer_name].kv_cache def run_method(obj: Any, method: Union[str, bytes, Callable], args: tuple[Any], diff --git a/vllm/worker/model_runner.py b/vllm/worker/model_runner.py index 4fe70a0abf8..bced3ba9ba1 100644 --- a/vllm/worker/model_runner.py +++ b/vllm/worker/model_runner.py @@ -1112,6 +1112,10 @@ def __init__( (self.max_batchsize_to_capture, self.get_max_block_per_batch()), dtype=np.int32) + self.cross_layer_shared_graph_block_tables = np.zeros( + (self.max_batchsize_to_capture, self.get_max_block_per_batch()), + dtype=np.int32) + # Attention-free but stateful models like Mamba need a placeholder attn # backend, as the attention metadata is needed to manage internal state. 
# However we must bypass attention selection altogether for some models diff --git a/vllm/worker/worker.py b/vllm/worker/worker.py index 21e684a3fb5..b2926dbd185 100644 --- a/vllm/worker/worker.py +++ b/vllm/worker/worker.py @@ -9,7 +9,8 @@ import torch.distributed import vllm.envs as envs -from vllm.config import VllmConfig +from vllm.attention.layer import Attention +from vllm.config import VllmConfig, get_layers_from_vllm_config from vllm.device_allocator.cumem import CuMemAllocator from vllm.distributed import (ensure_model_parallel_initialized, init_distributed_environment, @@ -345,8 +346,29 @@ def _init_cache_engine(self): self.cache_engine[ve].gpu_cache for ve in range(self.parallel_config.pipeline_parallel_size) ] + + # Layer pairings for cross-layer KV sharing. + # If an Attention layer `layer_name` is in the keys of this dict, it + # means this layer will perform attention using the keys and values + # from the KV cache of `shared_kv_cache_layers[layer_name]`. + shared_kv_cache_layers: dict[str, str] = {} + + attn_layers = get_layers_from_vllm_config(self.vllm_config, Attention) + + for layer_name, attn_module in attn_layers.items(): + if (kv_tgt_layer := + attn_module.kv_sharing_target_layer_name) is not None: + # The layer doesn't need its own KV cache and will use that of + # the target layer. We skip creating a KVCacheSpec for it, so + # that KV cache management logic will act as this layer does + # not exist, and doesn't allocate KV cache for the layer. This + # enables the memory saving of cross-layer kv sharing, allowing + # a given amount of memory to accommodate longer context lengths + # or enable more requests to be processed simultaneously. + shared_kv_cache_layers[layer_name] = kv_tgt_layer + bind_kv_cache(self.compilation_config.static_forward_context, - self.gpu_cache) + self.gpu_cache, shared_kv_cache_layers) def _warm_up_model(self) -> None: # warm up sizes that are not in cudagraph capture sizes, From a9bd1cdba2976efc24eb8397ba82e344cfa38b4a Mon Sep 17 00:00:00 2001 From: Alex Brooks Date: Sat, 12 Jul 2025 07:11:30 -0600 Subject: [PATCH 039/552] [Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models (#20843) Signed-off-by: Alex-Brooks Signed-off-by: x22x22 --- vllm/model_executor/models/granite.py | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/vllm/model_executor/models/granite.py b/vllm/model_executor/models/granite.py index bd4d5d0b6b2..507a9206c42 100644 --- a/vllm/model_executor/models/granite.py +++ b/vllm/model_executor/models/granite.py @@ -273,6 +273,10 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.vocab_size, config.hidden_size, org_num_embeddings=config.vocab_size, + padding_size=DEFAULT_VOCAB_PADDING_SIZE + # We need bigger padding if using lora for kernel + # compatibility + if not lora_config else lora_config.lora_vocab_padding_size, quant_config=quant_config, ) else: From a5552d6a2813b6c773b9645b35baf94aa5c6cd26 Mon Sep 17 00:00:00 2001 From: Reid <61492567+reidliu41@users.noreply.github.com> Date: Sat, 12 Jul 2025 21:54:50 +0800 Subject: [PATCH 040/552] [docs] convert supported configs to table (#20858) Signed-off-by: reidliu41 Signed-off-by: x22x22 --- .../installation/intel_gaudi.md | 44 ++++++------------- 1 file changed, 14 insertions(+), 30 deletions(-) diff --git a/docs/getting_started/installation/intel_gaudi.md b/docs/getting_started/installation/intel_gaudi.md index 061599cb1b6..09cffb29cb3 100644 --- a/docs/getting_started/installation/intel_gaudi.md +++ 
b/docs/getting_started/installation/intel_gaudi.md @@ -133,36 +133,20 @@ docker run \ The following configurations have been validated to function with Gaudi2 devices. Configurations that are not listed may or may not work. -- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) - on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 - datatype with random or greedy sampling -- [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) - on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 - datatype with random or greedy sampling -- [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) - on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 - datatype with random or greedy sampling -- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) - on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 - datatype with random or greedy sampling -- [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) - on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 - datatype with random or greedy sampling -- [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) - on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 - datatype with random or greedy sampling -- [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) - with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling -- [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) - with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling -- [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) - with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling -- [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) - with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling -- [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) - with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling -- [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) - with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling +| Model | TP Size| dtype | Sampling | +|-------|--------|--------|----------| +| [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) | 1, 2, 8 | BF16 | Random / Greedy | +| [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | 1, 2, 8 | BF16 | Random / Greedy | +| [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | 1, 2, 8 | BF16 | Random / Greedy | +| [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 1, 2, 8 | BF16 | Random / Greedy | +| [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | 1, 2, 8 | BF16 | Random / Greedy | +| [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) | 1, 2, 8 | BF16 | Random / Greedy | +| [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) | 8 | BF16 | Random / Greedy | +| [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | 8 | BF16 | Random / Greedy | +| 
[meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | 8 | BF16 | Random / Greedy | +| [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 8 | BF16 | Random / Greedy | +| [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) | 8 | BF16 | Random / Greedy | +| [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 8 | BF16 | Random / Greedy | ## Performance tuning From 219fb234865e526fbd79650390d30b593f26dce9 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Sun, 13 Jul 2025 02:34:40 +0900 Subject: [PATCH 041/552] [Bugfix] Restrict Machete to only run on Hopper (#20830) Signed-off-by: mgoin Signed-off-by: x22x22 --- .../layers/quantization/kernels/mixed_precision/machete.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py b/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py index 851fd155465..ed81b02bc4a 100644 --- a/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py +++ b/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py @@ -32,6 +32,9 @@ def can_implement(cls, if not current_platform.is_cuda(): return False, "Machete only supported on CUDA" + if not current_platform.is_device_capability(90): + return False, "Machete requires compute capability of 90 (Hopper)" + if c.has_g_idx and\ c.partition_weight_shape[0] != c.full_weight_shape[0]: return False, "Act reordering currently not supported by Machete, "\ From 02f04a2fdf387d78a3a18ac9969616035198a207 Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Sat, 12 Jul 2025 15:33:13 -0700 Subject: [PATCH 042/552] [Sched] Enhance the logic to remove stopped requests from queues (#20739) Signed-off-by: x22x22 --- requirements/common.txt | 2 +- tests/v1/core/test_scheduler.py | 62 +++++++++++++++++++++++++++++++++ vllm/v1/core/sched/scheduler.py | 45 +++++++++++++++--------- 3 files changed, 92 insertions(+), 17 deletions(-) diff --git a/requirements/common.txt b/requirements/common.txt index f97fe35d28b..526ed514ac0 100644 --- a/requirements/common.txt +++ b/requirements/common.txt @@ -7,7 +7,7 @@ requests >= 2.26.0 tqdm blake3 py-cpuinfo -transformers >= 4.51.1 +transformers >= 4.53.2 huggingface-hub[hf_xet] >= 0.33.0 # Required for Xet downloads. tokenizers >= 0.21.1 # Required for fast incremental detokenization. protobuf # Required by LlamaTokenizer. 
diff --git a/tests/v1/core/test_scheduler.py b/tests/v1/core/test_scheduler.py index 02d2c83ab15..2d3657b334b 100644 --- a/tests/v1/core/test_scheduler.py +++ b/tests/v1/core/test_scheduler.py @@ -451,6 +451,7 @@ def test_stop_via_update_from_output(): req.num_computed_tokens = req.num_tokens scheduler.requests[req.request_id] = req scheduler.running.append(req) + req.status = RequestStatus.RUNNING scheduler_output = SchedulerOutput( scheduled_new_reqs=[], @@ -504,6 +505,7 @@ def test_stop_via_update_from_output(): req.num_computed_tokens = req.num_tokens scheduler.requests[req.request_id] = req scheduler.running.append(req) + req.status = RequestStatus.RUNNING scheduler_output = SchedulerOutput( scheduled_new_reqs=[], @@ -556,6 +558,7 @@ def test_stop_via_update_from_output(): req.num_computed_tokens = req.num_tokens scheduler.requests[req.request_id] = req scheduler.running.append(req) + req.status = RequestStatus.RUNNING scheduler_output = SchedulerOutput( scheduled_new_reqs=[], @@ -703,6 +706,65 @@ def test_schedule_concurrent_batches(enable_prefix_caching: Optional[bool], scheduler.update_from_output(scheduler_output1, model_runner_output) +def test_preempt_during_execution(): + # NOTE(woosuk): The actual number of available blocks is 10 instead of 11 + # because block 0 is reserved as the null block. + scheduler = create_scheduler(max_num_batched_tokens=100, + block_size=16, + num_blocks=11, + enable_prefix_caching=False) + requests = create_requests(num_requests=2, num_tokens=80) + + # Schedule the first request. + scheduler.add_request(requests[0]) + scheduler_output0 = scheduler.schedule() + assert len(scheduler_output0.num_scheduled_tokens) == 1 + assert len(scheduler_output0.scheduled_new_reqs[0].block_ids[0]) == 5 + + # Schedule the second request while the first request is still running. + # This scenario can occur in certain cases, when max_concurrent_batches > 1 + # (e.g., when pipeline parallelism is used). + scheduler.add_request(requests[1]) + scheduler_output1 = scheduler.schedule() + assert len(scheduler_output1.num_scheduled_tokens) == 1 + assert len(scheduler_output1.scheduled_new_reqs[0].block_ids[0]) == 5 + + # Get the output of the first request. + model_runner_output0 = ModelRunnerOutput( + req_ids=[requests[0].request_id], + req_id_to_index={requests[0].request_id: 0}, + sampled_token_ids=[[0]], + spec_token_ids=None, + logprobs=None, + prompt_logprobs_dict={}, + pooler_output=[], + ) + scheduler.update_from_output(scheduler_output0, model_runner_output0) + + # Schedule the first request again. This will cause the preemption + # of the second request because the KV cache is full. + _ = scheduler.schedule() + assert len(scheduler.running) == 1 + assert scheduler.running[0] == requests[0] + assert requests[1].status == RequestStatus.PREEMPTED + + model_runner_output1 = ModelRunnerOutput( + req_ids=[requests[1].request_id], + req_id_to_index={requests[1].request_id: 0}, + sampled_token_ids=[[42]], + spec_token_ids=None, + logprobs=None, + prompt_logprobs_dict={}, + pooler_output=[], + ) + scheduler.update_from_output(scheduler_output1, model_runner_output1) + + # The second request (that is preempted) should be updated with the + # sampled token id. 
+ assert len(requests[1].output_token_ids) == 1 + assert requests[1].output_token_ids[0] == 42 + + # Note - these test cases mirror some of those in test_rejection_sampler.py @pytest.mark.parametrize( "spec_tokens,output_tokens,expected", diff --git a/vllm/v1/core/sched/scheduler.py b/vllm/v1/core/sched/scheduler.py index b2d90614c29..f81bb9fc13a 100644 --- a/vllm/v1/core/sched/scheduler.py +++ b/vllm/v1/core/sched/scheduler.py @@ -747,19 +747,21 @@ def update_from_output( pooler_outputs = model_runner_output.pooler_output num_nans_in_logits = model_runner_output.num_nans_in_logits - new_running: list[Request] = [] outputs: dict[int, list[EngineCoreOutput]] = defaultdict(list) spec_decoding_stats: Optional[SpecDecodingStats] = None - # NOTE(woosuk): As len(self.running) can be up to 1K or more, the below - # loop can be a performance bottleneck. We should do our best to avoid - # expensive operations inside the loop. - for request in self.running: - req_id = request.request_id - num_tokens_scheduled = num_scheduled_tokens.get(req_id, 0) - if num_tokens_scheduled == 0: - # The request was not scheduled in this step. - new_running.append(request) + # NOTE(woosuk): As len(num_scheduled_tokens) can be up to 1K or more, + # the below loop can be a performance bottleneck. We should do our best + # to avoid expensive operations inside the loop. + stopped_running_reqs: set[Request] = set() + stopped_preempted_reqs: set[Request] = set() + for req_id, num_tokens_scheduled in num_scheduled_tokens.items(): + assert num_tokens_scheduled > 0 + request = self.requests.get(req_id) + if request is None: + # The request is already finished. This can happen if the + # request is aborted while the model is executing it (e.g., + # in pipeline parallelism). continue req_index = model_runner_output.req_id_to_index[req_id] @@ -792,6 +794,7 @@ def update_from_output( new_logprobs = None new_token_ids = generated_token_ids kv_transfer_params = None + status_before_stop = request.status # Append generated tokens and check for stop. Note that if # a request is still being prefilled, we expect the model runner @@ -803,17 +806,22 @@ def update_from_output( # This must be called before we make the EngineCoreOutput. stopped = check_stop(request, self.max_model_len) if stopped: - kv_transfer_params = self._free_request(request) del new_token_ids[num_new:] # Trim new tokens if needed. break + # Stop checking for pooler models. pooler_output = None if pooler_outputs: pooler_output = pooler_outputs[req_index] stopped = check_stop(request, self.max_model_len, pooler_output) - if stopped: - kv_transfer_params = self._free_request(request) + + if stopped: + kv_transfer_params = self._free_request(request) + if status_before_stop == RequestStatus.RUNNING: + stopped_running_reqs.add(request) + else: + stopped_preempted_reqs.add(request) # Extract sample logprobs if needed. if request.sampling_params is not None \ @@ -868,9 +876,14 @@ def update_from_output( # Invariant: EngineCore returns no partial prefill outputs. assert not prompt_logprobs_tensors - if not stopped: - new_running.append(request) - self.running = new_running + # Remove the stopped requests from the running and waiting queues. + if stopped_running_reqs: + self.running = [ + req for req in self.running if req not in stopped_running_reqs + ] + if stopped_preempted_reqs: + # This is a rare case and unlikely to impact performance. + self.waiting.remove_requests(stopped_preempted_reqs) # KV Connector: update state for finished KV Transfers. 
self._update_from_kv_xfer_finished(model_runner_output) From 85cd7d9d979d631a6fd1ea3579cf1d889d013cfd Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Sat, 12 Jul 2025 22:38:45 -0400 Subject: [PATCH 043/552] [Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant (#20841) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- tests/kernels/moe/test_deepgemm.py | 7 ++++--- tests/kernels/quantization/test_block_fp8.py | 5 ++--- .../layers/fused_moe/deep_gemm_moe.py | 13 ++++++------ vllm/model_executor/layers/fused_moe/utils.py | 7 +------ .../layers/quantization/utils/fp8_utils.py | 15 ++++++++++--- vllm/utils/deep_gemm.py | 21 ------------------- 6 files changed, 26 insertions(+), 42 deletions(-) diff --git a/tests/kernels/moe/test_deepgemm.py b/tests/kernels/moe/test_deepgemm.py index 6a04edafd96..1460fdd3aea 100644 --- a/tests/kernels/moe/test_deepgemm.py +++ b/tests/kernels/moe/test_deepgemm.py @@ -13,9 +13,10 @@ # vLLM fused-expert reference (Triton fallback + DeepGEMM option) from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts +from vllm.model_executor.layers.quantization.utils.fp8_utils import ( + per_token_group_quant_fp8) from vllm.utils import has_deep_gemm -from vllm.utils.deep_gemm import (calc_diff, per_block_cast_to_fp8, - per_token_group_cast_to_fp8) +from vllm.utils.deep_gemm import calc_diff, per_block_cast_to_fp8 BLOCK_SIZE = [128, 128] @@ -81,7 +82,7 @@ def run_single_case(m, n, k, topk, num_experts, block_size): """ tokens_bf16 = torch.randn( m, k, device="cuda", dtype=torch.bfloat16).clamp_min_(-1).clamp_max_(1) - _, a1_scale = per_token_group_cast_to_fp8(tokens_bf16, block_size[1]) + _, a1_scale = per_token_group_quant_fp8(tokens_bf16, block_size[1]) # expert weight tensors w1, w2, w1_s, w2_s = make_block_quant_fp8_weights(num_experts, n, k, diff --git a/tests/kernels/quantization/test_block_fp8.py b/tests/kernels/quantization/test_block_fp8.py index 97b5102dd47..26aa8d652e6 100644 --- a/tests/kernels/quantization/test_block_fp8.py +++ b/tests/kernels/quantization/test_block_fp8.py @@ -15,8 +15,7 @@ w8a8_block_fp8_matmul) from vllm.platforms import current_platform from vllm.utils import has_deep_gemm -from vllm.utils.deep_gemm import (fp8_gemm_nt, per_block_cast_to_fp8, - per_token_group_cast_to_fp8) +from vllm.utils.deep_gemm import fp8_gemm_nt, per_block_cast_to_fp8 if current_platform.get_device_capability() < (9, 0): pytest.skip("FP8 Triton requires CUDA 9.0 or higher", @@ -117,7 +116,7 @@ def test_w8a8_block_fp8_deep_gemm_matmul(M, N, K, block_size, out_dtype, seed): A_fp32 = (torch.rand(M, K, dtype=torch.float32) - 0.5) * 2 * fp8_max B_fp32 = (torch.rand(N, K, dtype=torch.float32) - 0.5) * 2 * fp8_max - A_fp8, As_fp8 = per_token_group_cast_to_fp8(A_fp32, block_size[1]) + A_fp8, As_fp8 = per_token_group_quant_fp8(A_fp32, block_size[1]) B_fp8, Bs_fp8 = per_block_cast_to_fp8(B_fp32) As = As_fp8.to(torch.float32) diff --git a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py index 433f957a843..b1107a1f479 100644 --- a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py @@ -15,9 +15,10 @@ from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( TopKWeightAndReduceDelegate) from vllm.model_executor.layers.fused_moe.utils import _resize_cache +from vllm.model_executor.layers.quantization.utils.fp8_utils import ( + per_token_group_quant_fp8) from 
vllm.utils import has_deep_gemm, round_up -from vllm.utils.deep_gemm import (m_grouped_fp8_gemm_nt_contiguous, - per_token_group_cast_to_fp8) +from vllm.utils.deep_gemm import m_grouped_fp8_gemm_nt_contiguous logger = init_logger(__name__) @@ -170,10 +171,10 @@ def apply( self.activation(activation, act_out, mm1_out.view(-1, N)) a2q_scale: Optional[torch.Tensor] = None - a2q, a2q_scale = per_token_group_cast_to_fp8(act_out, - self.block_shape[1], - column_major_scales=True, - out_q=quant_out) + a2q, a2q_scale = per_token_group_quant_fp8(act_out, + self.block_shape[1], + column_major_scales=True, + out_q=quant_out) m_grouped_fp8_gemm_nt_contiguous((a2q, a2q_scale), (w2, w2_scale), mm2_out, expert_ids) diff --git a/vllm/model_executor/layers/fused_moe/utils.py b/vllm/model_executor/layers/fused_moe/utils.py index 6638f423a32..c120d964b3c 100644 --- a/vllm/model_executor/layers/fused_moe/utils.py +++ b/vllm/model_executor/layers/fused_moe/utils.py @@ -15,8 +15,6 @@ from vllm.platforms import current_platform from vllm.triton_utils import tl, triton from vllm.utils import cdiv -from vllm.utils.deep_gemm import (is_blackwell_deep_gemm_used, - per_token_group_cast_to_fp8) @triton.jit @@ -119,10 +117,7 @@ def _fp8_quantize( assert not per_act_token assert len(block_shape) == 2 _, block_k = block_shape[0], block_shape[1] - if is_blackwell_deep_gemm_used(): - A, A_scale = per_token_group_cast_to_fp8(A, block_k) - else: - A, A_scale = per_token_group_quant_fp8(A, block_k) + A, A_scale = per_token_group_quant_fp8(A, block_k) assert cdiv(A.size(-1), block_k) == A_scale.size(-1) return A, A_scale diff --git a/vllm/model_executor/layers/quantization/utils/fp8_utils.py b/vllm/model_executor/layers/quantization/utils/fp8_utils.py index 1780cc5de2d..9c78dea17e5 100644 --- a/vllm/model_executor/layers/quantization/utils/fp8_utils.py +++ b/vllm/model_executor/layers/quantization/utils/fp8_utils.py @@ -20,6 +20,7 @@ from vllm.platforms import current_platform from vllm.triton_utils import tl, triton from vllm.utils import cdiv, direct_register_custom_op, has_deep_gemm +from vllm.utils.deep_gemm import is_blackwell_deep_gemm_used logger = init_logger(__name__) @@ -256,6 +257,7 @@ def _per_token_group_quant_fp8( # Information for float8 fp8_min, fp8_max, + use_ue8m0: tl.constexpr, # Meta-parameters BLOCK: tl.constexpr, ): @@ -285,7 +287,8 @@ def _per_token_group_quant_fp8( y = tl.load(y_ptr + cols, mask=mask, other=0.0).to(tl.float32) # Quant _absmax = tl.maximum(tl.max(tl.abs(y)), eps) - y_s = _absmax / fp8_max + scale_raw = _absmax / fp8_max + y_s = tl.math.exp2(tl.ceil(tl.log2(scale_raw))) if use_ue8m0 else scale_raw y_q = tl.clamp(y / y_s, fp8_min, fp8_max).to(y_q_ptr.dtype.element_ty) tl.store(y_q_ptr + cols, y_q, mask=mask) @@ -309,6 +312,7 @@ def _per_token_group_quant_fp8_colmajor( # Information for float8 fp8_min, fp8_max, + use_ue8m0: tl.constexpr, # Meta-parameters BLOCK: tl.constexpr, ): @@ -347,7 +351,8 @@ def _per_token_group_quant_fp8_colmajor( y = tl.load(y_ptr + cols, mask=mask, other=0.0).to(tl.float32) # Quant _absmax = tl.maximum(tl.max(tl.abs(y)), eps) - y_s = _absmax / fp8_max + scale_raw = _absmax / fp8_max + y_s = tl.math.exp2(tl.ceil(tl.log2(scale_raw))) if use_ue8m0 else scale_raw y_q = tl.clamp(y / y_s, fp8_min, fp8_max).to(y_q_ptr.dtype.element_ty) tl.store(y_q_ptr + cols, y_q, mask=mask) @@ -373,9 +378,11 @@ def per_token_group_quant_fp8( is supported for now. column_major_scales: Outputs scales in column major. out_q: Optional output tensor. If not provided, function will create. 
- Returns: tuple[torch.Tensor, torch.Tensor]: The quantized tensor and the scaling factor for quantization. + Returns: + tuple[torch.Tensor, torch.Tensor]: The quantized tensor and the + scaling factor. """ dtype = current_platform.fp8_dtype() if dtype is None else dtype assert (x.shape[-1] % group_size == 0), ( @@ -418,6 +425,7 @@ def per_token_group_quant_fp8( eps, fp8_min=fp8_min, fp8_max=fp8_max, + use_ue8m0=is_blackwell_deep_gemm_used(), BLOCK=BLOCK, num_warps=num_warps, num_stages=num_stages, @@ -433,6 +441,7 @@ def per_token_group_quant_fp8( eps, fp8_min=fp8_min, fp8_max=fp8_max, + use_ue8m0=is_blackwell_deep_gemm_used(), BLOCK=BLOCK, num_warps=num_warps, num_stages=num_stages, diff --git a/vllm/utils/deep_gemm.py b/vllm/utils/deep_gemm.py index 1684d6754f5..56326c9315b 100644 --- a/vllm/utils/deep_gemm.py +++ b/vllm/utils/deep_gemm.py @@ -49,7 +49,6 @@ def _resolve_symbol(module, new: str, old: str) -> Callable[..., Any] | None: _fp8_gemm_nt_impl: Callable[..., Any] | None = None _grouped_impl: Callable[..., Any] | None = None _grouped_masked_impl: Callable[..., Any] | None = None - _per_token_cast_impl: Callable[..., Any] | None = None _per_block_cast_impl: Callable[..., Any] | None = None else: _dg = importlib.import_module("deep_gemm") # type: ignore @@ -74,12 +73,9 @@ def _resolve_symbol(module, new: str, old: str) -> Callable[..., Any] | None: try: _math_mod = importlib.import_module( "deep_gemm.utils.math") # type: ignore - _per_token_cast_impl = getattr(_math_mod, "per_token_cast_to_fp8", - None) _per_block_cast_impl = getattr(_math_mod, "per_block_cast_to_fp8", None) except ModuleNotFoundError: - _per_token_cast_impl = None _per_block_cast_impl = None @@ -101,22 +97,6 @@ def fp8_m_grouped_gemm_nt_masked(*args, **kwargs): return _grouped_masked_impl(*args, **kwargs) -def per_token_group_cast_to_fp8(x, group_size, *args, **kwargs): - """Wrapper for token-wise FP8 quantisation. - - • If DeepGEMM provides ``per_token_cast_to_fp8`` (new API), use it. 
- • Otherwise, fall back to vLLM's ``per_token_group_quant_fp8`` - """ - - if _per_token_cast_impl is not None and is_blackwell_deep_gemm_used(): - assert group_size == 128, "group_size must be 128 for deepgemm" - return _per_token_cast_impl(x) - - from vllm.model_executor.layers.quantization.utils.fp8_utils import ( - per_token_group_quant_fp8 as _ptg) - return _ptg(x, group_size, *args, **kwargs) - - def per_block_cast_to_fp8(x, *args, **kwargs): if _per_block_cast_impl is not None and is_blackwell_deep_gemm_used(): return _per_block_cast_impl(x) @@ -146,7 +126,6 @@ def calc_diff(x: torch.Tensor, y: torch.Tensor): "fp8_gemm_nt", "m_grouped_fp8_gemm_nt_contiguous", "fp8_m_grouped_gemm_nt_masked", - "per_token_group_cast_to_fp8", "per_block_cast_to_fp8", "is_blackwell_deep_gemm_used", ] From ca894d43cfd123657d3766c6f141e8966302f23f Mon Sep 17 00:00:00 2001 From: ElizaWszola Date: Sun, 13 Jul 2025 04:39:14 +0200 Subject: [PATCH 044/552] [Bugfix] Fix a couple PPLX+CUTLASS MoE bugs (#20825) Signed-off-by: ElizaWszola Signed-off-by: x22x22 --- .../layers/fused_moe/pplx_prepare_finalize.py | 4 +- .../compressed_tensors_moe.py | 53 ++++++++++++------- 2 files changed, 37 insertions(+), 20 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py index 4cd68608f02..5a23a9f1ab0 100644 --- a/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py @@ -204,7 +204,7 @@ def prepare( out_expert_x_scale=expert_x_scale, dp_x=a1q, dp_x_scale=a1q_scale, - indices=topk_ids, + indices=topk_ids.view(dtype=torch.uint32), bound_m=bound_m, ) @@ -249,7 +249,7 @@ def finalize( topk_weights = torch.ones_like(topk_weights) self.a2a.combine(out_tokens=output, - indices=topk_ids, + indices=topk_ids.view(dtype=torch.uint32), weights=topk_weights, expert_y=fused_expert_output, bound_m=bound_m) diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py index c17a390dba5..baf4fec3cc6 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py @@ -737,10 +737,8 @@ def __init__( "For FP8 Fused MoE layer, we require either per tensor or " "channelwise, dynamic per token quantization.") - from vllm.model_executor.layers.fused_moe.cutlass_moe import ( - cutlass_moe_fp8) self.topk_indices_dtype = None - self.fused_experts = cutlass_moe_fp8 # type: ignore + self.fused_experts = None # type: ignore self.disable_expert_map = False def create_weights(self, layer: torch.nn.Module, num_experts: int, @@ -936,21 +934,40 @@ def apply( per_act_token = a1_scale.numel() != 1 if a1_scale is not None else ( a2_scale.numel() != 1 if a2_scale is not None else False) - return self.fused_experts( - x, - layer.w13_weight, - layer.w2_weight, - topk_weights, - topk_ids, - per_act_token=per_act_token, - activation=activation, - global_num_experts=global_num_experts, - expert_map=None if self.disable_expert_map else expert_map, - w1_scale=layer.w13_weight_scale, - w2_scale=layer.w2_weight_scale, - a1_scale=a1_scale, - a2_scale=a2_scale, - ) + if self.fused_experts is None: + # If no modular kernel is provided, use cutlass_moe_fp8 + from vllm.model_executor.layers.fused_moe.cutlass_moe import ( + cutlass_moe_fp8) + return 
cutlass_moe_fp8( + x, + layer.w13_weight, + layer.w2_weight, + topk_weights, + topk_ids, + per_act_token=per_act_token, + activation=activation, + global_num_experts=global_num_experts, + expert_map=None if self.disable_expert_map else expert_map, + w1_scale=layer.w13_weight_scale, + w2_scale=layer.w2_weight_scale, + a1_scale=a1_scale, + a2_scale=a2_scale, + ) + else: + return self.fused_experts( + x, + layer.w13_weight, + layer.w2_weight, + topk_weights, + topk_ids, + activation=activation, + global_num_experts=global_num_experts, + expert_map=None if self.disable_expert_map else expert_map, + w1_scale=layer.w13_weight_scale, + w2_scale=layer.w2_weight_scale, + a1_scale=layer.w13_input_scale, + a2_scale=layer.w2_input_scale, + ) class CompressedTensorsW8A8Int8MoEMethod(CompressedTensorsMoEMethod): From 5ca84e19bc38a4cbdc59873cbc931b98f9c81aba Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Sat, 12 Jul 2025 22:39:55 -0400 Subject: [PATCH 045/552] [Refactor] Change the way of import triton (#20774) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- tests/kernels/moe/test_batched_moe.py | 2 +- vllm/attention/ops/triton_unified_attention.py | 3 +-- vllm/lora/ops/triton_ops/lora_expand_op.py | 3 +-- vllm/lora/ops/triton_ops/lora_shrink_op.py | 3 +-- vllm/model_executor/layers/fused_moe/fused_batched_moe.py | 3 +-- 5 files changed, 5 insertions(+), 9 deletions(-) diff --git a/tests/kernels/moe/test_batched_moe.py b/tests/kernels/moe/test_batched_moe.py index c9a4375ac93..69317405d48 100644 --- a/tests/kernels/moe/test_batched_moe.py +++ b/tests/kernels/moe/test_batched_moe.py @@ -6,7 +6,6 @@ import pytest import torch -import triton.language as tl from tests.kernels.moe.utils import (batched_moe, make_quantized_test_activations, @@ -18,6 +17,7 @@ invoke_moe_batched_triton_kernel) from vllm.model_executor.layers.fused_moe.fused_moe import fused_topk from vllm.platforms import current_platform +from vllm.triton_utils import tl MNK_FACTORS = [ (1, 128, 128), diff --git a/vllm/attention/ops/triton_unified_attention.py b/vllm/attention/ops/triton_unified_attention.py index f9645f65135..eb9c4f1c103 100644 --- a/vllm/attention/ops/triton_unified_attention.py +++ b/vllm/attention/ops/triton_unified_attention.py @@ -8,10 +8,9 @@ # - Thomas Parnell import torch -import triton -import triton.language as tl from vllm.logger import init_logger +from vllm.triton_utils import tl, triton logger = init_logger(__name__) diff --git a/vllm/lora/ops/triton_ops/lora_expand_op.py b/vllm/lora/ops/triton_ops/lora_expand_op.py index eaef8e2c190..b1ab84e08ba 100644 --- a/vllm/lora/ops/triton_ops/lora_expand_op.py +++ b/vllm/lora/ops/triton_ops/lora_expand_op.py @@ -8,12 +8,11 @@ """ import torch -import triton -import triton.language as tl from vllm.lora.ops.triton_ops.kernel_utils import do_expand_kernel from vllm.lora.ops.triton_ops.utils import _get_lora_b_ptr from vllm.platforms import current_platform +from vllm.triton_utils import tl, triton from vllm.utils import direct_register_custom_op diff --git a/vllm/lora/ops/triton_ops/lora_shrink_op.py b/vllm/lora/ops/triton_ops/lora_shrink_op.py index d299fa5e8e1..1e7075ab071 100644 --- a/vllm/lora/ops/triton_ops/lora_shrink_op.py +++ b/vllm/lora/ops/triton_ops/lora_shrink_op.py @@ -8,12 +8,11 @@ """ import torch -import triton -import triton.language as tl from vllm.lora.ops.triton_ops.kernel_utils import do_shrink_kernel from vllm.lora.ops.triton_ops.utils import _get_lora_a_ptr from vllm.platforms import current_platform 
+from vllm.triton_utils import tl, triton from vllm.utils import direct_register_custom_op diff --git a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py index 34f8c124759..61247e93091 100644 --- a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py @@ -4,8 +4,6 @@ from typing import Optional import torch -import triton -import triton.language as tl import vllm.model_executor.layers.fused_moe.modular_kernel as mk from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig @@ -18,6 +16,7 @@ normalize_scales_shape) from vllm.model_executor.layers.quantization.utils.quant_utils import ( group_broadcast) +from vllm.triton_utils import tl, triton @triton.jit From a47c4431945fe4ea02a0b41cde30d6a99770a160 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nicol=C3=B2=20Lucchesi?= Date: Sun, 13 Jul 2025 04:40:11 +0200 Subject: [PATCH 046/552] [Core] Support multiple tasks per model (#20771) Signed-off-by: NickLucche Signed-off-by: DarkLight1337 Co-authored-by: DarkLight1337 Signed-off-by: x22x22 --- tests/test_config.py | 49 ++++- vllm/config.py | 256 ++++++++++++++--------- vllm/entrypoints/llm.py | 61 +++--- vllm/entrypoints/openai/api_server.py | 26 +-- vllm/entrypoints/openai/run_batch.py | 14 +- vllm/model_executor/models/interfaces.py | 6 + vllm/model_executor/models/registry.py | 10 + vllm/model_executor/models/whisper.py | 3 + 8 files changed, 278 insertions(+), 147 deletions(-) diff --git a/tests/test_config.py b/tests/test_config.py index 6ed7ef9e6a4..a160b08f28a 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -54,7 +54,7 @@ def test_get_field(): ("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"), ("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify"), ("Qwen/Qwen2.5-Math-RM-72B", "pooling", "reward"), - ("openai/whisper-small", "transcription", "transcription"), + ("openai/whisper-small", "generate", "transcription"), ], ) def test_auto_task(model_id, expected_runner_type, expected_task): @@ -69,7 +69,11 @@ def test_auto_task(model_id, expected_runner_type, expected_task): ) assert config.runner_type == expected_runner_type - assert config.task == expected_task + + if config.runner_type == "pooling": + assert config.task == expected_task + else: + assert expected_task in config.supported_tasks @pytest.mark.parametrize( @@ -98,11 +102,50 @@ def test_score_task(model_id, expected_runner_type, expected_task): assert config.task == expected_task +@pytest.mark.parametrize(("model_id", "expected_runner_type", "expected_task"), + [ + ("Qwen/Qwen2.5-1.5B-Instruct", "draft", "auto"), + ]) +def test_draft_task(model_id, expected_runner_type, expected_task): + config = ModelConfig( + model_id, + runner="draft", + tokenizer=model_id, + seed=0, + dtype="float16", + ) + + assert config.runner_type == expected_runner_type + assert config.task == expected_task + + +@pytest.mark.parametrize( + ("model_id", "expected_runner_type", "expected_task"), + [ + ("openai/whisper-small", "generate", "transcription"), + ], +) +def test_transcription_task(model_id, expected_runner_type, expected_task): + config = ModelConfig( + model_id, + task="transcription", + tokenizer=model_id, + tokenizer_mode="auto", + trust_remote_code=False, + seed=0, + dtype="float16", + ) + + assert config.runner_type == expected_runner_type + assert config.task == expected_task + + @pytest.mark.parametrize(("model_id", "bad_task"), [ ("Qwen/Qwen2.5-Math-RM-72B", 
"generate"), + ("Qwen/Qwen3-0.6B", "transcription"), ]) def test_incorrect_task(model_id, bad_task): - with pytest.raises(ValueError, match=r"does not support the .* task"): + with pytest.raises(ValueError, match=r"does not support task=.*"): ModelConfig( model_id, task=bad_task, diff --git a/vllm/config.py b/vllm/config.py index 90cea63dd14..69b64e1dcbe 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -91,24 +91,19 @@ ConfigT = TypeVar("ConfigT", bound=ConfigType) TaskOption = Literal["auto", "generate", "embedding", "embed", "classify", - "score", "reward", "transcription"] + "score", "reward", "transcription", "draft"] -_ResolvedTask = Literal["generate", "embed", "classify", "reward", "draft", - "transcription"] +_ResolvedTask = Literal["generate", "transcription", "pooling", "embed", + "classify", "reward", "draft"] -RunnerType = Literal["generate", "pooling", "draft", "transcription"] +RunnerOption = Literal["auto", "generate", "pooling", "draft"] -_RUNNER_TASKS: dict[RunnerType, list[_ResolvedTask]] = { - "generate": ["generate"], - "pooling": ["embed", "classify", "reward"], - "draft": ["draft"], - "transcription": ["transcription"], -} +RunnerType = Literal["generate", "pooling", "draft"] -_TASK_RUNNER: dict[_ResolvedTask, RunnerType] = { - task: runner - for runner, tasks in _RUNNER_TASKS.items() - for task in tasks +_RUNNER_TASKS: dict[RunnerType, list[_ResolvedTask]] = { + "generate": ["generate", "transcription"], + "pooling": ["pooling", "embed", "classify", "reward"], + "draft": [], } @@ -234,11 +229,14 @@ class ModelConfig: """Name or path of the Hugging Face model to use. It is also used as the content for `model_name` tag in metrics output when `served_model_name` is not specified.""" - task: Literal[TaskOption, Literal["draft"]] = "auto" - """The task to use the model for. Each vLLM instance only supports one - task, even if the same model can be used for multiple tasks. When the model - only supports one task, "auto" can be used to select it; otherwise, you - must specify explicitly which task to use.""" + runner: RunnerOption = "auto" + """The type of model runner to use. Each vLLM instance only supports one + model runner, even if the same model can be used for multiple types.""" + task: TaskOption = "auto" + """The task to use the model for. If the model supports more than one + model runner, this is used to select which model runner to run. + + Note that the model may support other tasks using the same model runner.""" tokenizer: SkipValidation[str] = None # type: ignore """Name or path of the Hugging Face tokenizer to use. If unspecified, model name or path will be used.""" @@ -553,10 +551,41 @@ def __post_init__(self) -> None: self.hf_image_processor_config = get_hf_image_processor_config( self.model, hf_token=self.hf_token, revision=self.revision) - supported_tasks, task = self._resolve_task(self.task) - self.supported_tasks = supported_tasks - self.task = task - if self.task in ("draft", "generate"): + # For pooling models, self.task is used to indicate the + # user-selected task + if self.task == "score": + if self.registry.is_cross_encoder_model(self.architectures): + self.task = "classify" + else: + self.task = "embed" + elif self.task == "embedding": + msg = ("The 'embedding' task has been renamed to 'embed', please " + "use the new name. 
The old name will be removed in v1.0.") + warnings.warn(msg, DeprecationWarning, stacklevel=2) + + self.task = "embed" + + all_supported_tasks = self._get_supported_tasks(self.task) + logger.debug("Tasks supported by runner type: %s", all_supported_tasks) + supported_runner_types = self._get_supported_runner_types( + all_supported_tasks) + runner_type = self._resolve_runner(self.runner, self.task, + supported_runner_types, + all_supported_tasks) + + logger.debug("Selected runner type: %s", runner_type) + # For pooling models, self.task is used to indicate the + # user-selected task + if runner_type == "pooling" and self.task == "auto": + selected_task = all_supported_tasks[runner_type][-1] + assert selected_task != "pooling" + self.task = selected_task + self.supported_runner_types = supported_runner_types + self.runner_type = runner_type + self.supported_tasks = all_supported_tasks[runner_type] + + if self.runner_type in ("draft", + "generate") and self.task != "transcription": self.truncation_side = "left" else: self.truncation_side = "right" @@ -780,11 +809,10 @@ def _verify_tokenizer_mode(self) -> None: f"one of {get_args(TokenizerMode)}.") self.tokenizer_mode = tokenizer_mode - def _get_preferred_task( + def _get_preferred_pooling_task( self, architectures: list[str], - supported_tasks: set[_ResolvedTask], - ) -> Optional[_ResolvedTask]: + ) -> _ResolvedTask: model_id = self.model if get_pooling_config(model_id, self.revision): return "embed" @@ -795,92 +823,136 @@ def _get_preferred_task( suffix_to_preferred_task: list[tuple[str, _ResolvedTask]] = [ # Other models follow this pattern - ("ForCausalLM", "generate"), - ("ForConditionalGeneration", "generate"), ("ForSequenceClassification", "classify"), - ("ChatModel", "generate"), - ("LMHeadModel", "generate"), ("EmbeddingModel", "embed"), ("RewardModel", "reward"), ] _, arch = self.registry.inspect_model_cls(architectures) for suffix, pref_task in suffix_to_preferred_task: - if arch.endswith(suffix) and pref_task in supported_tasks: + if arch.endswith(suffix): return pref_task - return None + return "embed" - def _resolve_task( + def _get_supported_generation_tasks( self, - task_option: Literal[TaskOption, Literal["draft"]], - ) -> tuple[set[_ResolvedTask], _ResolvedTask]: - if task_option == "draft": - return {"draft"}, "draft" + task_option: TaskOption, + ) -> list[_ResolvedTask]: + registry = self.registry + architectures = self.architectures + + if registry.is_transcription_only_model(architectures): + return ["transcription"] + + supported_tasks = list[_ResolvedTask]() + if registry.is_text_generation_model(architectures): + supported_tasks.append("generate") + + if registry.is_transcription_model(architectures): + supported_tasks.append("transcription") + + return supported_tasks + def _get_supported_pooling_tasks( + self, + task_option: TaskOption, + ) -> list[_ResolvedTask]: registry = self.registry architectures = self.architectures - runner_support: dict[RunnerType, bool] = { - # NOTE: Listed from highest to lowest priority, - # in case the model supports multiple of them - "transcription": registry.is_transcription_model(architectures), - "generate": registry.is_text_generation_model(architectures), - "pooling": registry.is_pooling_model(architectures), + supported_tasks = list[_ResolvedTask]() + if registry.is_pooling_model(architectures): + supported_tasks.append("pooling") + + # For now, users must specify the task (other than "pooling") + # to use for pooling models + if task_option == "auto": + preferred_task = 
self._get_preferred_pooling_task( + architectures) + + supported_tasks.append(preferred_task) + elif task_option in _RUNNER_TASKS["pooling"]: + supported_tasks.append(cast(_ResolvedTask, task_option)) + + return supported_tasks + + def _get_supported_tasks( + self, + task_option: TaskOption, + ) -> dict[RunnerType, list[_ResolvedTask]]: + return { + "generate": self._get_supported_generation_tasks(task_option), + "pooling": self._get_supported_pooling_tasks(task_option), + "draft": ["draft"] } - supported_runner_types_lst: list[RunnerType] = [ - runner_type - for runner_type, is_supported in runner_support.items() - if is_supported - ] - supported_tasks_lst: list[_ResolvedTask] = [ - task for runner_type in supported_runner_types_lst - for task in _RUNNER_TASKS[runner_type] - ] - supported_tasks = set(supported_tasks_lst) + def _get_supported_runner_types( + self, + supported_tasks: dict[RunnerType, list[_ResolvedTask]], + ) -> set[RunnerType]: + return { + runner + for runner, runner_tasks in supported_tasks.items() + if len(runner_tasks) > 0 + } - if task_option == "auto": - selected_task = next(iter(supported_tasks_lst)) + def _resolve_runner( + self, + runner_option: RunnerOption, + task_option: TaskOption, + supported_runner_types: set[RunnerType], + supported_tasks: dict[RunnerType, list[_ResolvedTask]], + ) -> RunnerType: + if not supported_runner_types: + raise ValueError("This model does not support any model runners!") + + if runner_option != "auto": + if runner_option not in supported_runner_types: + raise ValueError( + f"This model does not support runner={runner_option!r}. " + f"Available runners: {supported_runner_types}") - if len(supported_tasks_lst) > 1: - preferred_task = self._get_preferred_task( - architectures, supported_tasks) - if preferred_task is not None: - selected_task = preferred_task + return runner_option - logger.info( - "This model supports multiple tasks: %s. " - "Defaulting to '%s'.", supported_tasks, selected_task) - else: - if task_option == "score": - if not runner_support["pooling"]: - msg = (f"This model does not support the '{task_option}' " - f"task. Supported tasks: {supported_tasks}") - raise ValueError(msg) - if self.registry.is_cross_encoder_model(architectures): - task_option = "classify" - else: - task_option = "embed" + if task_option != "auto": + for runner, runner_tasks in supported_tasks.items(): + if task_option in runner_tasks: + return runner else: - # Aliases - if task_option == "embedding": - msg = ("The 'embedding' task has been renamed to " - "'embed', please use the new name. The old name " - "will be removed in v1.0.") - warnings.warn(msg, DeprecationWarning, stacklevel=2) + task_runner: RunnerType = next( + runner for runner, tasks in _RUNNER_TASKS.items() + if task_option in tasks) + raise ValueError( + f"This model does not support task={task_option!r}. " + f"Available tasks for runner={task_runner!r}: " + f"{supported_tasks[task_runner]}") - task_option = "embed" + suffix_to_preferred_runner: list[tuple[str, RunnerType]] = [ + ("ForCausalLM", "generate"), + ("ForConditionalGeneration", "generate"), + ("ChatModel", "generate"), + ("LMHeadModel", "generate"), + ("ForSequenceClassification", "pooling"), + ("EmbeddingModel", "pooling"), + ("RewardModel", "pooling"), + ] + _, arch = self.registry.inspect_model_cls(self.architectures) - if task_option not in supported_tasks: - msg = ( - f"This model does not support the '{task_option}' task. 
" - f"Supported tasks: {supported_tasks}") - raise ValueError(msg) + for suffix, pref_runner in suffix_to_preferred_runner: + if arch.endswith(suffix) and pref_runner in supported_runner_types: + return pref_runner - selected_task = task_option + if "classify" in supported_tasks.get("pooling", []): + # When multiple pooling tasks are present, default to + # pooling (eg cross-encoder) for non-standard architectures. + return "pooling" + if "generate" in supported_runner_types: + return "generate" + if "pooling" in supported_runner_types: + return "pooling" - return supported_tasks, selected_task + raise AssertionError("This line should not be reached") def _parse_quant_hf_config(self): quant_cfg = getattr(self.hf_config, "quantization_config", None) @@ -1449,14 +1521,6 @@ def is_cross_encoder(self) -> bool: def use_mla(self) -> bool: return self.is_deepseek_mla and not envs.VLLM_MLA_DISABLE - @property - def supported_runner_types(self) -> set[RunnerType]: - return {_TASK_RUNNER[task] for task in self.supported_tasks} - - @property - def runner_type(self) -> RunnerType: - return _TASK_RUNNER[cast(_ResolvedTask, self.task)] - @property def is_v1_compatible(self) -> bool: architectures = getattr(self.hf_config, "architectures", []) @@ -2694,7 +2758,7 @@ def __post_init__(self): if self.model is not None: self.draft_model_config = ModelConfig( model=self.model, - task="draft", + runner="draft", tokenizer=self.target_model_config.tokenizer, tokenizer_mode=self.target_model_config.tokenizer_mode, trust_remote_code=self.target_model_config. diff --git a/vllm/entrypoints/llm.py b/vllm/entrypoints/llm.py index c60a566f585..e7398ecc23c 100644 --- a/vllm/entrypoints/llm.py +++ b/vllm/entrypoints/llm.py @@ -454,20 +454,19 @@ def generate( considered legacy and may be deprecated in the future. You should instead pass them via the `inputs` parameter. """ - runner_type = self.llm_engine.model_config.runner_type - if runner_type not in ["generate", "transcription"]: + model_config = self.llm_engine.model_config + runner_type = model_config.runner_type + if runner_type != "generate": messages = [ - "LLM.generate() is only supported for (conditional) generation " - "models (XForCausalLM, XForConditionalGeneration).", + "LLM.generate() is only supported for generative models." ] - supported_runner_types = self.llm_engine.model_config \ - .supported_runner_types - if "generate" in supported_runner_types: + if "generate" in model_config.supported_runner_types: messages.append( "Your model supports the 'generate' runner, but is " f"currently initialized for the '{runner_type}' runner. " - "Please initialize vLLM using `--task generate`.") + "Please initialize vLLM using `--task generate` or " + "`--task transcription`.") raise ValueError(" ".join(messages)) @@ -1091,13 +1090,12 @@ def encode( considered legacy and may be deprecated in the future. You should instead pass them via the `inputs` parameter. """ - runner_type = self.llm_engine.model_config.runner_type + model_config = self.llm_engine.model_config + runner_type = model_config.runner_type if runner_type != "pooling": messages = ["LLM.encode() is only supported for pooling models."] - supported_runner_types = self.llm_engine.model_config \ - .supported_runner_types - if "pooling" in supported_runner_types: + if "pooling" in model_config.supported_runner_types: messages.append( "Your model supports the 'pooling' runner, but is " f"currently initialized for the '{runner_type}' runner. " @@ -1119,13 +1117,13 @@ def encode( # Use default pooling params. 
pooling_params = PoolingParams() elif isinstance(pooling_params, PoolingParams): - pooling_params.verify(self.llm_engine.model_config) + pooling_params.verify(model_config) else: for pooling_param in pooling_params: - pooling_param.verify(self.llm_engine.model_config) + pooling_param.verify(model_config) - tokenization_kwargs: dict[str, Any] = {} - _validate_truncation_size(self.llm_engine.model_config.max_model_len, + tokenization_kwargs = dict[str, Any]() + _validate_truncation_size(model_config.max_model_len, truncate_prompt_tokens, tokenization_kwargs) self._validate_and_add_requests( @@ -1178,9 +1176,10 @@ def embed( A list of `EmbeddingRequestOutput` objects containing the embedding vectors in the same order as the input prompts. """ - if self.llm_engine.model_config.task != "embed": - raise ValueError( - "Embedding API is only enabled for `--task embed`") + model_config = self.llm_engine.model_config + if "embed" not in model_config.supported_tasks: + raise ValueError("Embedding API is not supported by this model. " + "Please set `--task embed`.") items = self.encode(prompts, truncate_prompt_tokens=truncate_prompt_tokens, @@ -1223,9 +1222,11 @@ def classify( A list of `ClassificationRequestOutput` objects containing the embedding vectors in the same order as the input prompts. """ - if self.llm_engine.model_config.task != "classify": + model_config = self.llm_engine.model_config + if "classify" not in model_config.supported_tasks: raise ValueError( - "Classification API is only enabled for `--task classify`") + "Classification API is not supported by this model. " + "Please set `--task classify`.") items = self.encode(prompts, use_tqdm=use_tqdm, @@ -1392,13 +1393,12 @@ def score( A list of `ScoringRequestOutput` objects containing the generated scores in the same order as the input prompts. """ - runner_type = self.llm_engine.model_config.runner_type + model_config = self.llm_engine.model_config + runner_type = model_config.runner_type if runner_type != "pooling": messages = ["LLM.score() is only supported for pooling models."] - supported_runner_types = self.llm_engine.model_config \ - .supported_runner_types - if "pooling" in supported_runner_types: + if "pooling" in model_config.supported_runner_types: messages.append( "Your model supports the 'pooling' runner, but is " f"currently initialized for the '{runner_type}' runner. " @@ -1407,12 +1407,13 @@ def score( raise ValueError(" ".join(messages)) - if self.llm_engine.model_config.task not in ("embed", "classify"): - raise ValueError("Score API is only enabled for " - "`--task embed or --task classify`.") + if all(t not in model_config.supported_tasks + for t in ("embed", "classify")): + raise ValueError("Score API is not supported by this model. 
" + "Please set `--task embed` or `--task classify`.") - if (self.llm_engine.model_config.task == "classify" - and self.llm_engine.model_config.hf_config.num_labels != 1): + if (model_config.task == "classify" + and getattr(model_config.hf_config, "num_labels", 0) != 1): raise ValueError("Score API is only enabled for num_labels == 1.") # the tokenizer for models such as diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 2f53357e1d4..049a90fea15 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1520,7 +1520,7 @@ async def init_app_state( reasoning_parser=args.reasoning_parser, enable_prompt_tokens_details=args.enable_prompt_tokens_details, enable_force_include_usage=args.enable_force_include_usage, - ) if model_config.runner_type == "generate" else None + ) if "generate" in model_config.supported_tasks else None state.openai_serving_chat = OpenAIServingChat( engine_client, model_config, @@ -1537,7 +1537,7 @@ async def init_app_state( reasoning_parser=args.reasoning_parser, enable_prompt_tokens_details=args.enable_prompt_tokens_details, enable_force_include_usage=args.enable_force_include_usage, - ) if model_config.runner_type == "generate" else None + ) if "generate" in model_config.supported_tasks else None state.openai_serving_completion = OpenAIServingCompletion( engine_client, model_config, @@ -1545,7 +1545,7 @@ async def init_app_state( request_logger=request_logger, return_tokens_as_token_ids=args.return_tokens_as_token_ids, enable_force_include_usage=args.enable_force_include_usage, - ) if model_config.runner_type == "generate" else None + ) if "generate" in model_config.supported_tasks else None state.openai_serving_pooling = OpenAIServingPooling( engine_client, model_config, @@ -1553,7 +1553,7 @@ async def init_app_state( request_logger=request_logger, chat_template=resolved_chat_template, chat_template_content_format=args.chat_template_content_format, - ) if model_config.runner_type == "pooling" else None + ) if "pooling" in model_config.supported_tasks else None state.openai_serving_embedding = OpenAIServingEmbedding( engine_client, model_config, @@ -1561,22 +1561,24 @@ async def init_app_state( request_logger=request_logger, chat_template=resolved_chat_template, chat_template_content_format=args.chat_template_content_format, - ) if model_config.task == "embed" else None + ) if "embed" in model_config.supported_tasks else None state.openai_serving_classification = ServingClassification( engine_client, model_config, state.openai_serving_models, request_logger=request_logger, - ) if model_config.task == "classify" else None + ) if "classify" in model_config.supported_tasks else None - enable_serving_reranking = (model_config.task == "classify" and getattr( - model_config.hf_config, "num_labels", 0) == 1) + enable_serving_reranking = ("classify" in model_config.supported_tasks + and getattr(model_config.hf_config, + "num_labels", 0) == 1) state.openai_serving_scores = ServingScores( engine_client, model_config, state.openai_serving_models, - request_logger=request_logger) if ( - model_config.task == "embed" or enable_serving_reranking) else None + request_logger=request_logger, + ) if ("embed" in model_config.supported_tasks + or enable_serving_reranking) else None state.openai_serving_tokenization = OpenAIServingTokenization( engine_client, @@ -1591,13 +1593,13 @@ async def init_app_state( model_config, state.openai_serving_models, request_logger=request_logger, - ) if 
model_config.runner_type == "transcription" else None + ) if "transcription" in model_config.supported_tasks else None state.openai_serving_translation = OpenAIServingTranslation( engine_client, model_config, state.openai_serving_models, request_logger=request_logger, - ) if model_config.runner_type == "transcription" else None + ) if "transcription" in model_config.supported_tasks else None state.task = model_config.task state.enable_server_load_tracking = args.enable_server_load_tracking diff --git a/vllm/entrypoints/openai/run_batch.py b/vllm/entrypoints/openai/run_batch.py index e112e2f893a..3dc5826909a 100644 --- a/vllm/entrypoints/openai/run_batch.py +++ b/vllm/entrypoints/openai/run_batch.py @@ -348,7 +348,7 @@ async def main(args): chat_template=None, chat_template_content_format="auto", enable_prompt_tokens_details=args.enable_prompt_tokens_details, - ) if model_config.runner_type == "generate" else None + ) if "generate" in model_config.supported_tasks else None openai_serving_embedding = OpenAIServingEmbedding( engine, model_config, @@ -356,17 +356,19 @@ async def main(args): request_logger=request_logger, chat_template=None, chat_template_content_format="auto", - ) if model_config.task == "embed" else None + ) if "embed" in model_config.supported_tasks else None - enable_serving_reranking = (model_config.task == "classify" and getattr( - model_config.hf_config, "num_labels", 0) == 1) + enable_serving_reranking = ("classify" in model_config.supported_tasks + and getattr(model_config.hf_config, + "num_labels", 0) == 1) - openai_serving_scores = (ServingScores( + openai_serving_scores = ServingScores( engine, model_config, openai_serving_models, request_logger=request_logger, - ) if (model_config.task == "embed" or enable_serving_reranking) else None) + ) if ("embed" in model_config.supported_tasks + or enable_serving_reranking) else None tracker = BatchProgressTracker() logger.info("Reading batch from %s...", args.input_file) diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 99669a23363..3a97641aa2f 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -694,6 +694,12 @@ class SupportsTranscription(Protocol): supports_transcription: ClassVar[Literal[True]] = True + supports_transcription_only: ClassVar[bool] = False + """ + Transcription models can opt out of text generation by setting this to + `True`. 
+ """ + @classmethod def get_generation_prompt(cls, audio: np.ndarray, stt_config: SpeechToTextConfig, language: str, diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 5f9b145b661..e8530a555d2 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -284,6 +284,7 @@ class _ModelInfo: is_hybrid: bool has_noops: bool supports_transcription: bool + supports_transcription_only: bool supports_v0_only: bool @staticmethod @@ -299,6 +300,8 @@ def from_model_cls(model: type[nn.Module]) -> "_ModelInfo": is_attention_free=is_attention_free(model), is_hybrid=is_hybrid(model), supports_transcription=supports_transcription(model), + supports_transcription_only=(supports_transcription(model) and + model.supports_transcription_only), supports_v0_only=supports_v0_only(model), has_noops=has_noops(model), ) @@ -573,6 +576,13 @@ def is_transcription_model( model_cls, _ = self.inspect_model_cls(architectures) return model_cls.supports_transcription + def is_transcription_only_model( + self, + architectures: Union[str, list[str]], + ) -> bool: + model_cls, _ = self.inspect_model_cls(architectures) + return model_cls.supports_transcription_only + def is_v1_compatible( self, architectures: Union[str, list[str]], diff --git a/vllm/model_executor/models/whisper.py b/vllm/model_executor/models/whisper.py index 1a7982e48e4..08aed2205e0 100644 --- a/vllm/model_executor/models/whisper.py +++ b/vllm/model_executor/models/whisper.py @@ -772,6 +772,9 @@ class WhisperForConditionalGeneration(nn.Module, SupportsTranscription, ".fc2.": ".mlp.fc2." }) + # Whisper only supports audio-conditioned generation. + supports_transcription_only = True + @classmethod def validate_language(cls, language: str) -> bool: if language in ISO639_1_SUPPORTED_LANGS: From aea90f6225b9bb2d4f131b7a788d128b9836913d Mon Sep 17 00:00:00 2001 From: QiliangCui Date: Sat, 12 Jul 2025 21:48:56 -0700 Subject: [PATCH 047/552] Renable google/gemma-3-1b-it accuracy test. (#20866) Signed-off-by: Qiliang Cui Signed-off-by: x22x22 --- tests/entrypoints/llm/test_accuracy.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/tests/entrypoints/llm/test_accuracy.py b/tests/entrypoints/llm/test_accuracy.py index 7e6bd3664eb..30a666d4c39 100644 --- a/tests/entrypoints/llm/test_accuracy.py +++ b/tests/entrypoints/llm/test_accuracy.py @@ -71,9 +71,8 @@ def test_lm_eval_accuracy_v1_engine(model, monkeypatch: pytest.MonkeyPatch): # Limit compilation time for TPU V1 if model == "google/gemma-3-1b-it": - pytest.skip( - "Temporarily disabled due to test failures" - "(timeout or accuracy mismatch). Re-enable once fixed.") + # TPU + google/gemma-3-1b-it + xet doesn't work well. 
+ m.setenv("HF_HUB_DISABLE_XET", "1") more_args = "max_model_len=2048,max_num_seqs=64" From f14aa9dcc1b0ed8143650c75535cc451a1bf566a Mon Sep 17 00:00:00 2001 From: Minkyu Kim Date: Sun, 13 Jul 2025 16:09:34 +0900 Subject: [PATCH 048/552] Support for LlamaForSequenceClassification (#20807) Signed-off-by: thechaos16 Signed-off-by: x22x22 --- tests/models/registry.py | 1 + vllm/model_executor/models/llama.py | 4 ++++ vllm/model_executor/models/registry.py | 3 ++- 3 files changed, 7 insertions(+), 1 deletion(-) diff --git a/tests/models/registry.py b/tests/models/registry.py index c10d375683e..1207a928c92 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -330,6 +330,7 @@ def check_available_online( hf_overrides={"architectures": ["GemmaForSequenceClassification"], # noqa: E501 "classifier_from_token": ["Yes"], # noqa: E501 "method": "no_post_processing"}), # noqa: E501 + "LlamaForSequenceClassification": _HfExamplesInfo("Skywork/Skywork-Reward-V2-Llama-3.2-1B"), # noqa: E501 "ModernBertForSequenceClassification": _HfExamplesInfo("Alibaba-NLP/gte-reranker-modernbert-base", v0_only=True), # noqa: E501 "RobertaForSequenceClassification": _HfExamplesInfo("cross-encoder/quora-roberta-base", v0_only=True), # noqa: E501 "XLMRobertaForSequenceClassification": _HfExamplesInfo("BAAI/bge-reranker-v2-m3", v0_only=True), # noqa: E501 diff --git a/vllm/model_executor/models/llama.py b/vllm/model_executor/models/llama.py index 48ec611df12..2434ac9d205 100644 --- a/vllm/model_executor/models/llama.py +++ b/vllm/model_executor/models/llama.py @@ -49,6 +49,7 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors +from .adapters import as_seq_cls_model from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, PPMissingLayer, extract_layer_index, is_pp_missing_parameter, @@ -645,3 +646,6 @@ def permute(w: torch.Tensor, n_heads: int): name = name.replace(item, mapping[item]) return name, loaded_weight + + +LlamaForSequenceClassification = as_seq_cls_model(LlamaForCausalLM) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index e8530a555d2..b7d4789549a 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -183,7 +183,8 @@ "GemmaForSequenceClassification": ("gemma", "GemmaForSequenceClassification"), # noqa: E501 "Qwen2ForSequenceClassification": ("qwen2", "Qwen2ForSequenceClassification"), # noqa: E501 "Qwen3ForSequenceClassification": ("qwen3", "Qwen3ForSequenceClassification"), # noqa: E501 - "JinaVLForRanking": ("jina_vl", "JinaVLForSequenceClassification"), # noqa: E501 + "LlamaForSequenceClassification": ("llama", "LlamaForSequenceClassification"), # noqa: E501 + "JinaVLForRanking": ("jina_vl", "JinaVLForSequenceClassification"), # noqa: E501, } _MULTIMODAL_MODELS = { From fa35d0a8441990f7149ea67e59bdde3a08a19787 Mon Sep 17 00:00:00 2001 From: Wang Siyuan Date: Sun, 13 Jul 2025 15:13:25 +0800 Subject: [PATCH 049/552] [Bugfix] Fix: add patch_rope_scaling after hf override (#20857) Signed-off-by: Wang Siyuan Signed-off-by: Wang Siyuan Signed-off-by: x22x22 --- vllm/config.py | 18 +++++++----------- vllm/transformers_utils/config.py | 10 ++++++++++ 2 files changed, 17 insertions(+), 11 deletions(-) diff --git a/vllm/config.py b/vllm/config.py index 69b64e1dcbe..f2381ffa232 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -532,16 +532,12 @@ def __post_init__(self) -> None: self.config_format = 
ConfigFormat(self.config_format) hf_config = get_config(self.hf_config_path or self.model, - self.trust_remote_code, self.revision, - self.code_revision, self.config_format) - - if hf_overrides_kw: - logger.debug("Overriding HF config with %s", hf_overrides_kw) - hf_config.update(hf_overrides_kw) - if hf_overrides_fn: - logger.debug("Overriding HF config with %s", hf_overrides_fn) - hf_config = hf_overrides_fn(hf_config) - + self.trust_remote_code, + self.revision, + self.code_revision, + self.config_format, + hf_overrides_kw=hf_overrides_kw, + hf_overrides_fn=hf_overrides_fn) self.hf_config = hf_config self.hf_text_config = get_hf_text_config(self.hf_config) @@ -5081,4 +5077,4 @@ class SpeechToTextConfig: @property def allow_audio_chunking(self) -> bool: - return self.min_energy_split_window_size is not None \ No newline at end of file + return self.min_energy_split_window_size is not None diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py index 411c970b2f0..cf3f519b027 100644 --- a/vllm/transformers_utils/config.py +++ b/vllm/transformers_utils/config.py @@ -305,6 +305,9 @@ def get_config( revision: Optional[str] = None, code_revision: Optional[str] = None, config_format: ConfigFormat = ConfigFormat.AUTO, + hf_overrides_kw: Optional[dict[str, Any]] = None, + hf_overrides_fn: Optional[Callable[[PretrainedConfig], + PretrainedConfig]] = None, **kwargs, ) -> PretrainedConfig: # Separate model folder from file path for GGUF models @@ -423,6 +426,13 @@ def get_config( model_type = MODEL_FOR_CAUSAL_LM_MAPPING_NAMES[config.model_type] config.update({"architectures": [model_type]}) + if hf_overrides_kw: + logger.debug("Overriding HF config with %s", hf_overrides_kw) + config.update(hf_overrides_kw) + if hf_overrides_fn: + logger.debug("Overriding HF config with %s", hf_overrides_fn) + config = hf_overrides_fn(config) + patch_rope_scaling(config) if trust_remote_code: From fbb5590434ce02bc28f2441a0cd2626f4f6bb56c Mon Sep 17 00:00:00 2001 From: Liuchenlong Date: Sun, 13 Jul 2025 22:32:40 +0800 Subject: [PATCH 050/552] [Bugfix] fix define of RerankDocument (#20877) Signed-off-by: liuchenlong Co-authored-by: liuchenlong Signed-off-by: x22x22 --- vllm/entrypoints/openai/protocol.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py index 26c23a48e1d..fdac6ccd19e 100644 --- a/vllm/entrypoints/openai/protocol.py +++ b/vllm/entrypoints/openai/protocol.py @@ -30,7 +30,8 @@ from vllm import envs from vllm.entrypoints.chat_utils import (ChatCompletionMessageParam, random_tool_call_id) -from vllm.entrypoints.score_utils import ScoreMultiModalParam +from vllm.entrypoints.score_utils import (ScoreContentPartParam, + ScoreMultiModalParam) from vllm.logger import init_logger from vllm.pooling_params import PoolingParams from vllm.sampling_params import (BeamSearchParams, GuidedDecodingParams, @@ -1354,7 +1355,7 @@ def to_pooling_params(self, *, use_cross_encoder: bool = False): class RerankDocument(BaseModel): text: Optional[str] = None - multi_modal: Optional[ScoreMultiModalParam] = None + multi_modal: Optional[ScoreContentPartParam] = None class RerankResult(BaseModel): From 1ad5bd7c06c6c391e02e5f4c32d24320d7112153 Mon Sep 17 00:00:00 2001 From: TJian Date: Sun, 13 Jul 2025 08:19:32 -0700 Subject: [PATCH 051/552] [V1] [ROCm] [AITER] Upgrade AITER to commit `916bf3c` and bugfix APIs (#20880) Signed-off-by: tjtanaa Signed-off-by: x22x22 --- docker/Dockerfile.rocm_base | 2 +- 
.../quantization/kernels/scaled_mm/aiter.py | 49 +++++++++++++++++-- .../layers/quantization/utils/fp8_utils.py | 2 +- 3 files changed, 48 insertions(+), 5 deletions(-) diff --git a/docker/Dockerfile.rocm_base b/docker/Dockerfile.rocm_base index dc8ec5f1a15..3414c0aa845 100644 --- a/docker/Dockerfile.rocm_base +++ b/docker/Dockerfile.rocm_base @@ -12,7 +12,7 @@ ARG PYTORCH_REPO="https://github.com/pytorch/pytorch.git" ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git" ARG FA_BRANCH="1a7f4dfa" ARG FA_REPO="https://github.com/Dao-AILab/flash-attention.git" -ARG AITER_BRANCH="6487649" +ARG AITER_BRANCH="916bf3c" ARG AITER_REPO="https://github.com/ROCm/aiter.git" FROM ${BASE_IMAGE} AS base diff --git a/vllm/model_executor/layers/quantization/kernels/scaled_mm/aiter.py b/vllm/model_executor/layers/quantization/kernels/scaled_mm/aiter.py index 165548a0601..7f808fa92a9 100644 --- a/vllm/model_executor/layers/quantization/kernels/scaled_mm/aiter.py +++ b/vllm/model_executor/layers/quantization/kernels/scaled_mm/aiter.py @@ -8,11 +8,55 @@ import vllm.envs as envs from vllm import _custom_ops as ops from vllm.platforms import current_platform +from vllm.utils import direct_register_custom_op from .cutlass import CutlassScaledMMLinearKernel from .ScaledMMLinearKernel import ScaledMMLinearLayerConfig +def rocm_aiter_gemm_w8a8_impl( + A: torch.Tensor, + B: torch.Tensor, + As: torch.Tensor, + Bs: torch.Tensor, + bias: Optional[torch.Tensor] = None, + output_dtype: torch.dtype = torch.float16, +) -> torch.Tensor: + + from aiter import gemm_a8w8_CK + + # gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects + # a to be [M, K] + # b to be [N, K] + # CutlassScaledMMLinearKernel prepare weight `w_q` in [K, N] format + return gemm_a8w8_CK(A, B, As, Bs, bias, output_dtype) + + +def rocm_aiter_gemm_w8a8_fake( + A: torch.Tensor, + B: torch.Tensor, + As: torch.Tensor, + Bs: torch.Tensor, + bias: Optional[torch.Tensor] = None, + output_dtype: torch.dtype = torch.float16, +) -> torch.Tensor: + + m = A.shape[0] + n = B.shape[0] + Y = torch.empty(m, n, dtype=output_dtype, device=A.device) + return Y + + +if current_platform.is_rocm(): + direct_register_custom_op( + op_name="rocm_aiter_gemm_w8a8", + op_func=rocm_aiter_gemm_w8a8_impl, + mutates_args=[], + fake_impl=rocm_aiter_gemm_w8a8_fake, + dispatch_key=current_platform.dispatch_key, + ) + + class AiterScaledMMLinearKernel(CutlassScaledMMLinearKernel): @classmethod @@ -111,10 +155,9 @@ def apply_weights(self, " w8a8 scaled gemm. 
`AiterScaledMMLinearKernel` " + "does not support AITER block scaled GEMM.") - from aiter import gemm_a8w8_CK - # gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects # a to be [M, K] # b to be [N, K] # CutlassScaledMMLinearKernel prepare weight `w_q` in [K, N] format - return gemm_a8w8_CK(x_q, w_q.t(), x_s, w_s, bias).to(out_dtype) + return torch.ops.vllm.rocm_aiter_gemm_w8a8(x_q, w_q.t(), x_s, w_s, + bias, out_dtype) diff --git a/vllm/model_executor/layers/quantization/utils/fp8_utils.py b/vllm/model_executor/layers/quantization/utils/fp8_utils.py index 9c78dea17e5..c093a9bfc4a 100644 --- a/vllm/model_executor/layers/quantization/utils/fp8_utils.py +++ b/vllm/model_executor/layers/quantization/utils/fp8_utils.py @@ -56,7 +56,7 @@ def rocm_aiter_gemm_w8a8_blockscale_impl( ) -> torch.Tensor: import aiter as rocm_aiter - return rocm_aiter.gemm_a8w8_blockscale_CK(A, B, As, Bs, dtype=output_dtype) + return rocm_aiter.gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype) def rocm_aiter_gemm_w8a8_blockscale_fake( From 2cd14d01d18e5beeddf01c892b6d6e49efacf415 Mon Sep 17 00:00:00 2001 From: nopperl <54780682+nopperl@users.noreply.github.com> Date: Mon, 14 Jul 2025 01:55:14 +0900 Subject: [PATCH 052/552] [V1] Hybrid allocator without prefix caching (#20661) Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/v1/core/kv_cache_coordinator.py | 33 ++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/vllm/v1/core/kv_cache_coordinator.py b/vllm/v1/core/kv_cache_coordinator.py index 38de00625e3..de72e60434a 100644 --- a/vllm/v1/core/kv_cache_coordinator.py +++ b/vllm/v1/core/kv_cache_coordinator.py @@ -171,6 +171,35 @@ def find_longest_cache_hit( pass +class KVCacheCoordinatorNoPrefixCache(KVCacheCoordinator): + """ + KV cache coordinator to use if prefix caching is disabled or unsupported. + In contrast to UnitaryKVCacheCoordinator and HybridKVCacheCoordinator, + supports arbitrary numbers of KV cache groups (including 0 groups). + Does not implement any features related to prefix caching. + """ + + def __init__(self, kv_cache_config: KVCacheConfig, max_model_len: int, + use_eagle: bool, caching_hash_fn: Callable, + enable_kv_cache_events: bool): + super().__init__(kv_cache_config, max_model_len, use_eagle, False, + caching_hash_fn, enable_kv_cache_events) + self.num_single_type_manager = len(self.single_type_managers) + + def get_num_common_prefix_blocks(self, request_id: str, + num_running_requests: int) -> list[int]: + return [0] * self.num_single_type_manager + + def find_longest_cache_hit( + self, + block_hashes: list[BlockHash], + max_cache_hit_length: int, + ) -> tuple[tuple[list[KVCacheBlock], ...], int]: + blocks: tuple[list[KVCacheBlock], ...] = tuple( + [] for _ in range(self.num_single_type_manager)) + return blocks, 0 + + class UnitaryKVCacheCoordinator(KVCacheCoordinator): """ KV cache coordinator for models with only one KV cache group. 
This is the @@ -359,6 +388,10 @@ def get_kv_cache_coordinator( kv_cache_config: KVCacheConfig, max_model_len: int, use_eagle: bool, enable_caching: bool, caching_hash_fn: Callable, enable_kv_cache_events: bool) -> KVCacheCoordinator: + if not enable_caching: + return KVCacheCoordinatorNoPrefixCache(kv_cache_config, max_model_len, + use_eagle, caching_hash_fn, + enable_kv_cache_events) if len(kv_cache_config.kv_cache_groups) == 1: return UnitaryKVCacheCoordinator(kv_cache_config, max_model_len, use_eagle, enable_caching, From 70bf3f0006fe1dc6540cc8d4fdd34f8d089e6a9c Mon Sep 17 00:00:00 2001 From: 22quinn <33176974+22quinn@users.noreply.github.com> Date: Sun, 13 Jul 2025 17:49:18 -0700 Subject: [PATCH 053/552] [Core] Add `update_config` RPC method (#20095) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: x22x22 --- tests/test_config.py | 30 +++++++++++++++++++++++- tests/v1/worker/test_gpu_model_runner.py | 16 +++++++++++-- vllm/config.py | 21 ++++++++++++++++- vllm/v1/worker/gpu_model_runner.py | 12 +++++++++- vllm/v1/worker/gpu_worker.py | 5 +++- vllm/v1/worker/tpu_model_runner.py | 17 ++++++++++++-- vllm/v1/worker/tpu_worker.py | 5 +++- 7 files changed, 97 insertions(+), 9 deletions(-) diff --git a/tests/test_config.py b/tests/test_config.py index a160b08f28a..015baef9181 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -7,7 +7,7 @@ from vllm.compilation.backends import VllmBackend from vllm.config import (LoadConfig, ModelConfig, PoolerConfig, VllmConfig, - get_field) + get_field, update_config) from vllm.model_executor.layers.pooler import PoolingType from vllm.platforms import current_platform @@ -46,6 +46,34 @@ def test_get_field(): assert c.default_factory is MISSING +@dataclass +class _TestNestedConfig: + a: _TestConfigFields = field( + default_factory=lambda: _TestConfigFields(a=0)) + + +def test_update_config(): + # Simple update + config1 = _TestConfigFields(a=0) + new_config1 = update_config(config1, {"a": 42}) + assert new_config1.a == 42 + # Nonexistent field + with pytest.raises(AssertionError): + new_config1 = update_config(config1, {"nonexistent": 1}) + # Nested update with dataclass + config2 = _TestNestedConfig() + new_inner_config = _TestConfigFields(a=1, c="new_value") + new_config2 = update_config(config2, {"a": new_inner_config}) + assert new_config2.a == new_inner_config + # Nested update with dict + config3 = _TestNestedConfig() + new_config3 = update_config(config3, {"a": {"c": "new_value"}}) + assert new_config3.a.c == "new_value" + # Nested update with invalid type + with pytest.raises(AssertionError): + new_config3 = update_config(config3, {"a": "new_value"}) + + @pytest.mark.parametrize( ("model_id", "expected_runner_type", "expected_task"), [ diff --git a/tests/v1/worker/test_gpu_model_runner.py b/tests/v1/worker/test_gpu_model_runner.py index d13df553db6..0bdf1f9820d 100644 --- a/tests/v1/worker/test_gpu_model_runner.py +++ b/tests/v1/worker/test_gpu_model_runner.py @@ -434,16 +434,28 @@ def rnd_stride_order(): assert all(not kv.is_contiguous() for kv in model_runner.kv_caches) +def test_update_config(model_runner): + # Simple update + model_runner.update_config({"load_config": {"load_format": "dummy"}}) + assert model_runner.load_config.load_format == "dummy" + # Raise error on non-existing config + with pytest.raises(AssertionError): + model_runner.update_config({"do_not_exist_config": "dummy"}) + + def test_load_model_weights_inplace(dist_init, model_runner, model_runner_2): # In this test, model_runner loads 
model + weights in one go, while # model_runner_2 loads dummy weights first then load real weights inplace model_runner.load_model() original_load_format = model_runner_2.load_config.load_format - model_runner_2.load_config.load_format = "dummy" + model_runner_2.update_config({"load_config": {"load_format": "dummy"}}) model_runner_2.load_model() # Initial model loading with dummy weights assert str(model_runner.get_model().state_dict()) != str( model_runner_2.get_model().state_dict()) - model_runner_2.load_config.load_format = original_load_format + model_runner_2.update_config( + {"load_config": { + "load_format": original_load_format + }}) model_runner_2.load_model() # Load real weights inplace assert str(model_runner.get_model().state_dict()) == str( model_runner_2.get_model().state_dict()) diff --git a/vllm/config.py b/vllm/config.py index f2381ffa232..ba599ada8eb 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -71,6 +71,7 @@ ConfigType = type[DataclassInstance] HfOverrides = Union[dict, Callable[[type], type]] else: + DataclassInstance = Any PlacementGroup = Any PretrainedConfig = Any ExecutorBase = Any @@ -87,7 +88,7 @@ "vllm.model_executor.models") logger = init_logger(__name__) - +DataclassInstanceT = TypeVar("DataclassInstanceT", bound=DataclassInstance) ConfigT = TypeVar("ConfigT", bound=ConfigType) TaskOption = Literal["auto", "generate", "embedding", "embed", "classify", @@ -5078,3 +5079,21 @@ class SpeechToTextConfig: @property def allow_audio_chunking(self) -> bool: return self.min_energy_split_window_size is not None + + +def update_config(config: DataclassInstanceT, + overrides: dict[str, Any]) -> DataclassInstanceT: + processed_overrides = {} + for field_name, value in overrides.items(): + assert hasattr( + config, field_name), f"{type(config)} has no field `{field_name}`" + current_value = getattr(config, field_name) + if is_dataclass(current_value) and not is_dataclass(value): + assert isinstance(value, dict), ( + f"Overrides to {type(config)}.{field_name} must be a dict" + f" or {type(current_value)}, but got {type(value)}") + value = update_config( + current_value, # type: ignore[type-var] + value) + processed_overrides[field_name] = value + return replace(config, **processed_overrides) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 44de1469d1b..4551cb2df98 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -19,7 +19,7 @@ from vllm.attention.layer import Attention from vllm.compilation.counter import compilation_counter from vllm.config import (CompilationLevel, VllmConfig, - get_layers_from_vllm_config) + get_layers_from_vllm_config, update_config) from vllm.distributed.eplb.eplb_state import EplbState from vllm.distributed.kv_transfer import (get_kv_transfer_group, has_kv_transfer_group) @@ -1728,6 +1728,16 @@ def propose_ngram_draft_token_ids( draft_token_ids.append(drafter_output.tolist()) return draft_token_ids + def update_config(self, overrides: dict[str, Any]) -> None: + allowed_config_names = {"load_config", "model_config"} + for config_name, config_overrides in overrides.items(): + assert config_name in allowed_config_names, \ + f"Config `{config_name}` not supported. 
" \ + f"Allowed configs: {allowed_config_names}" + config = getattr(self, config_name) + new_config = update_config(config, config_overrides) + setattr(self, config_name, new_config) + def load_model(self) -> None: logger.info("Starting to load model %s...", self.model_config.model) with DeviceMemoryProfiler() as m: # noqa: SIM117 diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py index 3c764bcdcb2..6458b55777a 100644 --- a/vllm/v1/worker/gpu_worker.py +++ b/vllm/v1/worker/gpu_worker.py @@ -4,7 +4,7 @@ import copy import gc import os -from typing import TYPE_CHECKING, Optional +from typing import TYPE_CHECKING, Any, Optional import torch import torch.distributed @@ -193,6 +193,9 @@ def load_model(self) -> None: with context: self.model_runner.load_model() + def update_config(self, overrides: dict[str, Any]) -> None: + self.model_runner.update_config(overrides) + @torch.inference_mode() def determine_available_memory(self) -> int: """Profiles the peak memory usage of the model to determine how much diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 5af052e6851..eb96e56f495 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -3,7 +3,7 @@ import bisect import gc import time -from typing import TYPE_CHECKING, Optional, cast +from typing import TYPE_CHECKING, Any, Optional, cast from unittest.mock import patch import numpy as np @@ -18,7 +18,8 @@ from vllm.attention.backends.abstract import AttentionType from vllm.attention.layer import Attention from vllm.compilation.wrapper import TorchCompileWrapperWithCustomDispatcher -from vllm.config import ParallelConfig, VllmConfig, get_layers_from_vllm_config +from vllm.config import (ParallelConfig, VllmConfig, + get_layers_from_vllm_config, update_config) from vllm.forward_context import set_forward_context from vllm.logger import init_logger from vllm.lora.layers import BaseLayerWithLoRA @@ -1111,6 +1112,18 @@ def concat_lists(input_lists): return model_runner_output + def update_config(self, overrides: dict[str, Any]) -> None: + # TODO: TPU config may need extra validation + # https://github.com/vllm-project/vllm/pull/20095#discussion_r2201497754 + allowed_config_names = {"load_config", "model_config"} + for config_name, config_overrides in overrides.items(): + assert config_name in allowed_config_names, \ + f"Config `{config_name}` not supported. 
" \ + f"Allowed configs: {allowed_config_names}" + config = getattr(self, config_name) + new_config = update_config(config, config_overrides) + setattr(self, config_name, new_config) + def load_model(self) -> None: self.device = self.device_config.device diff --git a/vllm/v1/worker/tpu_worker.py b/vllm/v1/worker/tpu_worker.py index ade4d082116..c5336e9ad51 100644 --- a/vllm/v1/worker/tpu_worker.py +++ b/vllm/v1/worker/tpu_worker.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """A TPU worker class.""" import os -from typing import Optional +from typing import Any, Optional import torch import torch.distributed @@ -260,6 +260,9 @@ def add_lora(self, lora_request: LoRARequest) -> bool: def load_model(self) -> None: self.model_runner.load_model() + def update_config(self, overrides: dict[str, Any]) -> None: + self.model_runner.update_config(overrides) + def compile_or_warm_up_model(self) -> None: if not self.model_config.enforce_eager: self.model_runner.capture_model() From 6e0a6e4c0ee74470382b296aa56a9692da16fca0 Mon Sep 17 00:00:00 2001 From: Maroon Ayoub Date: Mon, 14 Jul 2025 05:45:31 +0300 Subject: [PATCH 054/552] [Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) (#20511) Signed-off-by: Maroon Ayoub Signed-off-by: x22x22 --- requirements/common.txt | 1 + requirements/docs.txt | 1 + tests/v1/core/test_kv_cache_utils.py | 30 ++++++++++++++++++---------- tests/v1/core/test_prefix_caching.py | 14 ++++++++----- vllm/config.py | 9 +++++++-- vllm/utils/__init__.py | 24 ++++++++++++++++++++++ vllm/v1/core/kv_cache_manager.py | 9 ++++++--- vllm/v1/core/kv_cache_utils.py | 28 ++++++++++++++++++-------- 8 files changed, 88 insertions(+), 28 deletions(-) diff --git a/requirements/common.txt b/requirements/common.txt index 526ed514ac0..c211cb5dc10 100644 --- a/requirements/common.txt +++ b/requirements/common.txt @@ -47,3 +47,4 @@ python-json-logger # Used by logging as per examples/others/logging_configuratio scipy # Required for phi-4-multimodal-instruct ninja # Required for xgrammar, rocm, tpu, xpu pybase64 # fast base64 implementation +cbor2 # Required for cross-language serialization of hashable objects diff --git a/requirements/docs.txt b/requirements/docs.txt index e20b6f6e34d..ec988d79471 100644 --- a/requirements/docs.txt +++ b/requirements/docs.txt @@ -11,6 +11,7 @@ ruff # Required for argparse hook only -f https://download.pytorch.org/whl/cpu cachetools +cbor2 cloudpickle fastapi msgspec diff --git a/tests/v1/core/test_kv_cache_utils.py b/tests/v1/core/test_kv_cache_utils.py index e80ad8a6815..0676cb3eb65 100644 --- a/tests/v1/core/test_kv_cache_utils.py +++ b/tests/v1/core/test_kv_cache_utils.py @@ -8,7 +8,7 @@ from vllm.config import ModelConfig, SchedulerConfig, VllmConfig from vllm.multimodal.inputs import MultiModalKwargs, PlaceholderRange from vllm.sampling_params import SamplingParams -from vllm.utils import GiB_bytes, sha256 +from vllm.utils import GiB_bytes, sha256, sha256_cbor_64bit from vllm.v1.core.kv_cache_manager import KVCacheManager # disable yapf here as it formats differently than isort such that both fail # yapf: disable @@ -16,7 +16,8 @@ FreeKVCacheBlockQueue, KVCacheBlock, PrefixCachingMetrics, estimate_max_model_len, generate_block_hash_extra_keys, get_kv_cache_config, get_max_concurrency_for_kv_cache_config, - hash_block_tokens, hash_request_tokens, unify_kv_cache_configs) + hash_block_tokens, hash_request_tokens, init_none_hash, + unify_kv_cache_configs) from vllm.v1.kv_cache_interface 
import (FullAttentionSpec, KVCacheConfig, KVCacheGroupSpec, KVCacheTensor, SlidingWindowSpec) @@ -78,24 +79,27 @@ def new_sliding_window_spec(block_size=16, sliding_window=sliding_window) -def test_none_hash(monkeypatch): +@pytest.mark.parametrize("hash_fn", [sha256, sha256_cbor_64bit, hash]) +def test_none_hash(monkeypatch, hash_fn): import vllm.v1.core.kv_cache_utils # case 1: PYTHONHASHSEED is not set, use random with monkeypatch.context() as m: m.delenv('PYTHONHASHSEED', raising=False) reloaded_kv_cache_utils = importlib.reload(vllm.v1.core.kv_cache_utils) + reloaded_kv_cache_utils.init_none_hash(hash_fn) assert reloaded_kv_cache_utils.NONE_HASH is not None assert isinstance(reloaded_kv_cache_utils.NONE_HASH, int) assert reloaded_kv_cache_utils.NONE_HASH != 0 - # case 2: PYTHONHASHSEED is set, use the seed + # case 2: PYTHONHASHSEED is set, use the seed and hash_fn with monkeypatch.context() as m: m.setenv('PYTHONHASHSEED', 'python hash seed') reloaded_kv_cache_utils = importlib.reload(vllm.v1.core.kv_cache_utils) + reloaded_kv_cache_utils.init_none_hash(hash_fn) assert reloaded_kv_cache_utils.NONE_HASH is not None assert isinstance(reloaded_kv_cache_utils.NONE_HASH, int) - assert sha256('python hash seed') == reloaded_kv_cache_utils.NONE_HASH + assert hash_fn('python hash seed') == reloaded_kv_cache_utils.NONE_HASH def test_kv_cache_block(): @@ -287,9 +291,10 @@ def test_generate_block_hash_extra_keys_cache_salt(): assert next_mm_idx == 1 -@pytest.mark.parametrize("hash_fn", [sha256, hash]) +@pytest.mark.parametrize("hash_fn", [sha256, sha256_cbor_64bit, hash]) def test_hash_block_tokens(hash_fn): import vllm.v1.core.kv_cache_utils + init_none_hash(hash_fn) parent_block_hash = 123 curr_block_token_ids = (1, 2, 3) extra_keys = ("key1", "key2") @@ -303,9 +308,10 @@ def test_hash_block_tokens(hash_fn): assert block_hash.extra_keys == extra_keys -@pytest.mark.parametrize("hash_fn", [sha256, hash]) +@pytest.mark.parametrize("hash_fn", [sha256, sha256_cbor_64bit, hash]) def test_hash_request_tokens(hash_fn): import vllm.v1.core.kv_cache_utils + init_none_hash(hash_fn) request = make_request( request_id=0, prompt_token_ids=[_ for _ in range(6)], @@ -332,8 +338,10 @@ def test_hash_request_tokens(hash_fn): assert block_hashes[1].extra_keys == ("hash2", ) -@pytest.mark.parametrize("hash_fn", [sha256, hash]) +@pytest.mark.parametrize("hash_fn", [sha256, sha256_cbor_64bit, hash]) def test_hash_tokens_different_mm_input(hash_fn): + init_none_hash(hash_fn) + request1 = make_request( request_id=0, prompt_token_ids=[_ for _ in range(6)], @@ -359,8 +367,10 @@ def test_hash_tokens_different_mm_input(hash_fn): assert block_hashes1[1] != block_hashes2[1] -@pytest.mark.parametrize("hash_fn", [sha256, hash]) +@pytest.mark.parametrize("hash_fn", [sha256, sha256_cbor_64bit, hash]) def test_hash_request_tokens_no_mm_inputs(hash_fn): + init_none_hash(hash_fn) + request = make_request( request_id=0, prompt_token_ids=[_ for _ in range(6)], @@ -916,4 +926,4 @@ def test_get_kv_cache_config(): ], kv_cache_groups=[ KVCacheGroupSpec(["layer_1", "layer_2"], new_kv_cache_spec()) - ]) \ No newline at end of file + ]) diff --git a/tests/v1/core/test_prefix_caching.py b/tests/v1/core/test_prefix_caching.py index 7a42778831c..f31bdf74f4a 100644 --- a/tests/v1/core/test_prefix_caching.py +++ b/tests/v1/core/test_prefix_caching.py @@ -11,11 +11,12 @@ from vllm.distributed.kv_events import AllBlocksCleared, BlockRemoved from vllm.multimodal.inputs import MultiModalKwargs, PlaceholderRange from vllm.sampling_params import 
SamplingParams -from vllm.utils import sha256 +from vllm.utils import sha256, sha256_cbor_64bit from vllm.v1.core.block_pool import BlockPool from vllm.v1.core.kv_cache_manager import KVCacheManager, Request from vllm.v1.core.kv_cache_utils import (BlockHash, BlockHashWithGroupId, - KVCacheBlock, hash_block_tokens) + KVCacheBlock, hash_block_tokens, + init_none_hash) from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheConfig, KVCacheGroupSpec, SlidingWindowSpec) @@ -91,7 +92,7 @@ def make_kv_cache_config_hybrid_model(block_size: int, ) -@pytest.mark.parametrize("hash_algo", ["sha256", "hash"]) +@pytest.mark.parametrize("hash_algo", ["sha256", "sha256_cbor_64bit", "hash"]) def test_prefill(hash_algo): manager = KVCacheManager( make_kv_cache_config(16, 11), @@ -101,7 +102,8 @@ def test_prefill(hash_algo): ) # choose the hash function according to the parameter - hash_fn = sha256 if hash_algo == "sha256" else hash + hash_fn = (sha256_cbor_64bit if hash_algo == "sha256_cbor_64bit" else + sha256 if hash_algo == "sha256" else hash) # Complete 3 blocks (48 tokens) common_token_ids = [i for i in range(3) for _ in range(16)] @@ -696,12 +698,14 @@ def test_basic_prefix_caching_disabled(): assert not blocks -@pytest.mark.parametrize("hash_fn", [sha256, hash]) +@pytest.mark.parametrize("hash_fn", [sha256, sha256_cbor_64bit, hash]) def test_cache_blocks(hash_fn): """ This is a unit test that tests the correctness of the _cache_full_blocks function of KVCacheManager. """ + init_none_hash(hash_fn) + block_size = 4 block_pool = BlockPool( num_gpu_blocks=5, diff --git a/vllm/config.py b/vllm/config.py index ba599ada8eb..6f7aefab0a3 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -1564,7 +1564,7 @@ def get_and_verify_max_len(self, max_model_len: int): BlockSize = Literal[1, 8, 16, 32, 64, 128] CacheDType = Literal["auto", "fp8", "fp8_e4m3", "fp8_e5m2"] -PrefixCachingHashAlgo = Literal["builtin", "sha256"] +PrefixCachingHashAlgo = Literal["builtin", "sha256", "sha256_cbor_64bit"] @config @@ -1609,7 +1609,12 @@ class CacheConfig: prefix_caching_hash_algo: PrefixCachingHashAlgo = "builtin" """Set the hash algorithm for prefix caching:\n - "builtin" is Python's built-in hash.\n - - "sha256" is collision resistant but with certain overheads.""" + - "sha256" is collision resistant but with certain overheads. + This option uses Pickle for object serialization before hashing.\n + - "sha256_cbor_64bit" provides a reproducible, cross-language compatible + hash. It serializes objects using canonical CBOR and hashes them with + SHA-256. The resulting hash consists of the lower 64 bits of the SHA-256 + digest.""" cpu_offload_gb: float = 0 """The space in GiB to offload to CPU, per GPU. Default is 0, which means no offloading. Intuitively, this argument can be seen as a virtual way to diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index 495e359aa6d..0bc2341b7b4 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -52,6 +52,7 @@ from uuid import uuid4 import cachetools +import cbor2 import cloudpickle import numpy as np import numpy.typing as npt @@ -3177,6 +3178,29 @@ def sha256(input) -> int: byteorder="big") +def sha256_cbor_64bit(input) -> int: + """ + Hash objects using CBOR serialization and SHA-256, then truncate to 64bits. + + This option is useful for non-Python-dependent serialization and hashing. + + Args: + input: Object to be serialized and hashed. Supported types include + basic Python types and complex structures like lists, tuples, and + dictionaries. 
+ Custom classes must implement CBOR serialization methods. + + Returns: + An integer in the range [0, 2^64-1] representing the lower 64 bits + of the SHA-256 hash of the CBOR serialized input. + """ + input_bytes = cbor2.dumps(input, canonical=True) + full_hash = int.from_bytes(hashlib.sha256(input_bytes).digest(), + byteorder="big") + + return full_hash & ((1 << 64) - 1) + + def is_torch_equal_or_newer(target: str) -> bool: """Check if the installed torch version is >= the target version. diff --git a/vllm/v1/core/kv_cache_manager.py b/vllm/v1/core/kv_cache_manager.py index 3d5f85d2eac..cbc787e8dd5 100644 --- a/vllm/v1/core/kv_cache_manager.py +++ b/vllm/v1/core/kv_cache_manager.py @@ -7,10 +7,10 @@ from vllm.distributed.kv_events import KVCacheEvent from vllm.logger import init_logger -from vllm.utils import sha256 +from vllm.utils import sha256, sha256_cbor_64bit from vllm.v1.core.kv_cache_coordinator import get_kv_cache_coordinator from vllm.v1.core.kv_cache_utils import (BlockHash, KVCacheBlock, - hash_request_tokens) + hash_request_tokens, init_none_hash) from vllm.v1.kv_cache_interface import KVCacheConfig from vllm.v1.metrics.stats import PrefixCacheStats from vllm.v1.request import Request, RequestStatus @@ -79,7 +79,10 @@ def __init__( self.max_model_len = max_model_len self.enable_caching = enable_caching - self.caching_hash_fn = sha256 if caching_hash_algo == "sha256" else hash + self.caching_hash_fn = ( + sha256_cbor_64bit if caching_hash_algo == "sha256_cbor_64bit" else + sha256 if caching_hash_algo == "sha256" else hash) + init_none_hash(self.caching_hash_fn) self.use_eagle = use_eagle self.log_stats = log_stats # FIXME: make prefix cache stats conditional on log_stats diff --git a/vllm/v1/core/kv_cache_utils.py b/vllm/v1/core/kv_cache_utils.py index 2fbcb569e3d..544b9f59932 100644 --- a/vllm/v1/core/kv_cache_utils.py +++ b/vllm/v1/core/kv_cache_utils.py @@ -10,7 +10,7 @@ from vllm.config import VllmConfig from vllm.logger import init_logger -from vllm.utils import GiB_bytes, cdiv, sha256 +from vllm.utils import GiB_bytes, cdiv, sha256_cbor_64bit from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheConfig, KVCacheGroupSpec, KVCacheSpec, KVCacheTensor, SlidingWindowSpec) @@ -46,18 +46,30 @@ def get_hash_value(self) -> int: return self.block_hash.hash_value -# The hash seed for the first block of the prefix block sequence. -# -# Even if the hash function is the builtin hash(), we use sha256 to generate -# the initial hash to simplify the code. This is not performance critical -# as it is done one per process. +# The hash seed for the first block of any prefix block sequence. # # We use a random value to avoid hash collisions or PYTHONHASHSEED environment # variable if set such that processes can share the seed if needed. # This aligns with the behavior of Python's hash() function, which also uses # a random seed if PYTHONHASHSEED is not set. -NONE_HASH = int.from_bytes(os.urandom(32), byteorder="big") if os.getenv( - "PYTHONHASHSEED") is None else sha256(os.getenv("PYTHONHASHSEED")) +# +# The function `init_none_hash` initializes this variable globally. +NONE_HASH: int + + +def init_none_hash(hash_fn: Callable): + global NONE_HASH + + hash_seed = os.getenv("PYTHONHASHSEED") + if hash_seed is None and hash_fn is sha256_cbor_64bit: + logger.warning( + "PYTHONHASHSEED is not set. This will lead to non-reproducible " + "block-hashes when using sha256_cbor_64bit as the hash function." 
+ "Consider setting PYTHONHASHSEED to a fixed value for " + "reproducibility.") + + NONE_HASH = (int.from_bytes(os.urandom(32), byteorder="big") + if hash_seed is None else hash_fn(hash_seed)) class PrefixCachingMetrics: From cecae80481a379ec637c124a1c8816fc3d7b397d Mon Sep 17 00:00:00 2001 From: Daniel song Date: Mon, 14 Jul 2025 02:15:05 -0400 Subject: [PATCH 055/552] Removing redundant python version check (#20888) Signed-off-by: Dannyso05 Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_engine.py | 5 ----- 1 file changed, 5 deletions(-) diff --git a/vllm/entrypoints/openai/serving_engine.py b/vllm/entrypoints/openai/serving_engine.py index 7581ab6e63b..dab5ac03253 100644 --- a/vllm/entrypoints/openai/serving_engine.py +++ b/vllm/entrypoints/openai/serving_engine.py @@ -18,11 +18,6 @@ from starlette.datastructures import Headers from typing_extensions import TypeIs -if sys.version_info >= (3, 12): - from typing import TypedDict -else: - from typing_extensions import TypedDict - if sys.version_info >= (3, 12): from typing import TypedDict else: From e650b0f742a04f348f13af606f3e6780515c8bd2 Mon Sep 17 00:00:00 2001 From: Reid <61492567+reidliu41@users.noreply.github.com> Date: Mon, 14 Jul 2025 15:09:57 +0800 Subject: [PATCH 056/552] Fix: Add missing EOFError handling in CLI complete command (#20896) Signed-off-by: reidliu41 Signed-off-by: x22x22 --- vllm/entrypoints/cli/openai.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/vllm/entrypoints/cli/openai.py b/vllm/entrypoints/cli/openai.py index 5ddaee5b52a..e71f77ba806 100644 --- a/vllm/entrypoints/cli/openai.py +++ b/vllm/entrypoints/cli/openai.py @@ -55,7 +55,7 @@ def chat(system_prompt: str | None, model_name: str, client: OpenAI) -> None: try: input_message = input("> ") except EOFError: - return + break conversation.append({"role": "user", "content": input_message}) chat_completion = client.chat.completions.create(model=model_name, @@ -118,7 +118,7 @@ def cmd(args: argparse.Namespace) -> None: try: input_message = input("> ") except EOFError: - return + break conversation.append({"role": "user", "content": input_message}) chat_completion = client.chat.completions.create( @@ -170,7 +170,10 @@ def cmd(args: argparse.Namespace) -> None: print("Please enter prompt to complete:") while True: - input_prompt = input("> ") + try: + input_prompt = input("> ") + except EOFError: + break completion = client.completions.create(model=model_name, prompt=input_prompt) output = completion.choices[0].text From ff00fe817b118381c3ce6ce477fd91f58bae485c Mon Sep 17 00:00:00 2001 From: TJian Date: Mon, 14 Jul 2025 00:23:28 -0700 Subject: [PATCH 057/552] [ROCm] [Bugfix] [Critical]: Fix mamba compilation bug (#20883) Signed-off-by: tjtanaa Co-authored-by: vllmellm Signed-off-by: x22x22 --- csrc/mamba/mamba_ssm/selective_scan_fwd.cu | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/csrc/mamba/mamba_ssm/selective_scan_fwd.cu b/csrc/mamba/mamba_ssm/selective_scan_fwd.cu index 5f920997934..5766fbab4e8 100644 --- a/csrc/mamba/mamba_ssm/selective_scan_fwd.cu +++ b/csrc/mamba/mamba_ssm/selective_scan_fwd.cu @@ -7,7 +7,11 @@ #include #include -#include // For C10_CUDA_CHECK and C10_CUDA_KERNEL_LAUNCH_CHECK +#ifdef USE_ROCM + #include // For C10_HIP_CHECK and C10_HIP_KERNEL_LAUNCH_CHECK +#else + #include // For C10_CUDA_CHECK and C10_CUDA_KERNEL_LAUNCH_CHECK +#endif #ifndef USE_ROCM #include @@ -320,8 +324,13 @@ void selective_scan_fwd_launch(SSMParamsBase ¶ms, cudaStream_t stream) { dim3 
grid(params.batch, params.dim / kNRows); auto kernel = &selective_scan_fwd_kernel; if (kSmemSize >= 48 * 1024) { +#ifdef USE_ROCM + C10_HIP_CHECK(hipFuncSetAttribute( + reinterpret_cast(kernel), hipFuncAttributeMaxDynamicSharedMemorySize, kSmemSize)); +#else C10_CUDA_CHECK(cudaFuncSetAttribute( kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize)); +#endif } kernel<<>>(params); C10_CUDA_KERNEL_LAUNCH_CHECK(); From 0c1d32e0844a44a8ecbc4574f3d65419608cae7e Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Mon, 14 Jul 2025 15:34:34 +0800 Subject: [PATCH 058/552] [Quantization] add BNB for MixtralForCausalLM (#20893) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- vllm/model_executor/model_loader/utils.py | 7 +- vllm/model_executor/models/granitemoe.py | 105 +++++++++++++++++- .../model_executor/models/granitemoeshared.py | 5 +- vllm/model_executor/models/mixtral.py | 21 ++-- vllm/model_executor/models/olmoe.py | 3 +- vllm/model_executor/models/qwen2_moe.py | 3 +- vllm/model_executor/models/qwen3_moe.py | 4 +- 7 files changed, 128 insertions(+), 20 deletions(-) diff --git a/vllm/model_executor/model_loader/utils.py b/vllm/model_executor/model_loader/utils.py index 792a1044a56..8e5f332ba7c 100644 --- a/vllm/model_executor/model_loader/utils.py +++ b/vllm/model_executor/model_loader/utils.py @@ -227,7 +227,12 @@ def get_model_architecture( # Special handling for quantized Mixtral. # FIXME(woosuk): This is a temporary hack. mixtral_supported = [ - "fp8", "compressed-tensors", "gptq_marlin", "awq_marlin", "quark" + "fp8", + "compressed-tensors", + "gptq_marlin", + "awq_marlin", + "quark", + "bitsandbytes", ] vllm_supported_archs = ModelRegistry.get_supported_archs() diff --git a/vllm/model_executor/models/granitemoe.py b/vllm/model_executor/models/granitemoe.py index 5a70f3a616c..142b0e96729 100644 --- a/vllm/model_executor/models/granitemoe.py +++ b/vllm/model_executor/models/granitemoe.py @@ -45,12 +45,14 @@ from vllm.model_executor.layers.rotary_embedding import get_rope from vllm.model_executor.layers.vocab_parallel_embedding import ( DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) +from vllm.model_executor.model_loader.weight_utils import ( + default_weight_loader, maybe_remap_kv_scale_name) from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from . import mixtral from .interfaces import SupportsLoRA, SupportsPP -from .utils import AutoWeightsLoader, make_layers, maybe_prefix +from .utils import (AutoWeightsLoader, is_pp_missing_parameter, make_layers, + maybe_prefix) class GraniteMoeMoE(nn.Module): @@ -307,6 +309,103 @@ def forward( hidden_states = self.norm(hidden_states) return hidden_states + def _load_weights(self, + weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: + """ + This function is copied from `MixtralModel.load_weights`, mainly to + decouple from mixtral, avoiding impact on support like BNB + quantization. 
+ """ + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + ("qkv_proj", "q_proj", "q"), + ("qkv_proj", "k_proj", "k"), + ("qkv_proj", "v_proj", "v"), + ] + + # Params for weights, fp8 weight scales, fp8 activation scales + # (param_name, weight_name, expert_id, shard_id) + expert_params_mapping = FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="w1", + ckpt_down_proj_name="w2", + ckpt_up_proj_name="w3", + num_experts=self.config.num_local_experts) + + params_dict = dict(self.named_parameters()) + loaded_params: set[str] = set() + for name, loaded_weight in weights: + if (self.quant_config is not None and + (scale_name := self.quant_config.get_cache_scale(name))): + # Loading kv cache quantization scales + param = params_dict[scale_name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + loaded_weight = (loaded_weight if loaded_weight.dim() == 0 else + loaded_weight[0]) + weight_loader(param, loaded_weight) + loaded_params.add(scale_name) + continue + + for (param_name, weight_name, shard_id) in stacked_params_mapping: + if weight_name not in name: + continue + name = name.replace(weight_name, param_name) + # Skip loading extra bias for GPTQ models. + if ((name.endswith(".bias") or name.endswith("_bias")) + and name not in params_dict): + continue + # Skip layers on other devices. + if is_pp_missing_parameter(name, self): + continue + if name.endswith("scale"): + # Remapping the name of FP8 kv-scale. + name = maybe_remap_kv_scale_name(name, params_dict) + if name is None: + continue + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, loaded_weight, shard_id) + break + else: + for mapping in expert_params_mapping: + param_name, weight_name, expert_id, shard_id = mapping + if weight_name not in name: + continue + name = name.replace(weight_name, param_name) + # Skip layers on other devices. + if is_pp_missing_parameter(name, self): + continue + if ((name.endswith(".bias") or name.endswith("_bias")) + and name not in params_dict): + continue + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, + loaded_weight, + name, + shard_id=shard_id, + expert_id=expert_id) + break + else: + # Skip loading extra bias for GPTQ models. + if ((name.endswith(".bias") or name.endswith("_bias")) + and name not in params_dict): + continue + # Skip layers on other devices. + if is_pp_missing_parameter(name, self): + continue + # Remapping the name of FP8 kv-scale. 
+ name = maybe_remap_kv_scale_name(name, params_dict) + if name is None: + continue + + param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + return loaded_params + def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: new_weights = {} @@ -339,7 +438,7 @@ def load_weights(self, weights: Iterable[tuple[str, new_weights[gate_name] = p else: new_weights[n] = p - return mixtral.MixtralModel.load_weights(self, new_weights.items()) + return self._load_weights(new_weights.items()) class GraniteMoeForCausalLM(nn.Module, SupportsLoRA, SupportsPP): diff --git a/vllm/model_executor/models/granitemoeshared.py b/vllm/model_executor/models/granitemoeshared.py index bb160dbce45..7303f485378 100644 --- a/vllm/model_executor/models/granitemoeshared.py +++ b/vllm/model_executor/models/granitemoeshared.py @@ -27,8 +27,7 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from . import mixtral -from .granitemoe import GraniteMoeAttention, GraniteMoeMoE +from .granitemoe import GraniteMoeAttention, GraniteMoeModel, GraniteMoeMoE from .interfaces import SupportsLoRA, SupportsPP from .utils import AutoWeightsLoader, make_layers, maybe_prefix @@ -242,7 +241,7 @@ def load_weights(self, weights: Iterable[tuple[str, new_weights[gate_name] = p else: new_weights[n] = p - return mixtral.MixtralModel.load_weights(self, new_weights.items()) + return GraniteMoeModel._load_weights(self, new_weights.items()) class GraniteMoeSharedForCausalLM(nn.Module, SupportsLoRA, SupportsPP): diff --git a/vllm/model_executor/models/mixtral.py b/vllm/model_executor/models/mixtral.py index dec365119c7..30de83da49e 100644 --- a/vllm/model_executor/models/mixtral.py +++ b/vllm/model_executor/models/mixtral.py @@ -317,6 +317,15 @@ def forward( hidden_states, _ = self.norm(hidden_states, residual) return hidden_states + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + # Params for weights, fp8 weight scales, fp8 activation scales + # (param_name, weight_name, expert_id, shard_id) + return FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="w1", + ckpt_down_proj_name="w2", + ckpt_up_proj_name="w3", + num_experts=self.config.num_local_experts) + def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: stacked_params_mapping = [ @@ -326,16 +335,9 @@ def load_weights(self, weights: Iterable[tuple[str, ("qkv_proj", "v_proj", "v"), ] - # Params for weights, fp8 weight scales, fp8 activation scales - # (param_name, weight_name, expert_id, shard_id) - expert_params_mapping = FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="w1", - ckpt_down_proj_name="w2", - ckpt_up_proj_name="w3", - num_experts=self.config.num_local_experts) - params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: if (self.quant_config is not None and (scale_name := self.quant_config.get_cache_scale(name))): @@ -486,3 +488,6 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: loader = AutoWeightsLoader(self) return loader.load_weights(weights) + + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return self.model.get_expert_mapping() diff --git a/vllm/model_executor/models/olmoe.py b/vllm/model_executor/models/olmoe.py index 33438216ac1..7552f64c423 100644 --- 
a/vllm/model_executor/models/olmoe.py +++ b/vllm/model_executor/models/olmoe.py @@ -352,6 +352,7 @@ def load_weights(self, weights: Iterable[tuple[str, params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: for (param_name, weight_name, shard_id) in stacked_params_mapping: # Skip non-stacked layers and experts (experts handled below). @@ -380,7 +381,7 @@ def load_weights(self, weights: Iterable[tuple[str, weight_loader(param, loaded_weight, shard_id) break else: - for mapping in self.get_expert_mapping(): + for mapping in expert_params_mapping: param_name, weight_name, expert_id, shard_id = mapping if weight_name not in name: continue diff --git a/vllm/model_executor/models/qwen2_moe.py b/vllm/model_executor/models/qwen2_moe.py index 597f4c7e120..84bae87804c 100644 --- a/vllm/model_executor/models/qwen2_moe.py +++ b/vllm/model_executor/models/qwen2_moe.py @@ -413,6 +413,7 @@ def load_weights(self, weights: Iterable[tuple[str, params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: for (param_name, weight_name, shard_id) in stacked_params_mapping: # Skip non-stacked layers and experts (experts handled below). @@ -442,7 +443,7 @@ def load_weights(self, weights: Iterable[tuple[str, weight_loader(param, loaded_weight, shard_id) break else: - for mapping in self.get_expert_mapping(): + for mapping in expert_params_mapping: param_name, weight_name, expert_id, shard_id = mapping if weight_name not in name: continue diff --git a/vllm/model_executor/models/qwen3_moe.py b/vllm/model_executor/models/qwen3_moe.py index c87f41fa7c0..0f749b3e38f 100644 --- a/vllm/model_executor/models/qwen3_moe.py +++ b/vllm/model_executor/models/qwen3_moe.py @@ -400,11 +400,9 @@ def load_weights(self, weights: Iterable[tuple[str, ".v_scale", "_v_scale", ".weight_scale", "_weight_scale", ".input_scale", "_input_scale") - # Params for weights, fp8 weight scales, fp8 activation scales - # (param_name, weight_name, expert_id, shard_id) - expert_params_mapping = self.get_expert_mapping() params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: for (param_name, weight_name, shard_id) in stacked_params_mapping: # Skip non-stacked layers and experts (experts handled below). 
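With the `bitsandbytes` entry added to `mixtral_supported` in `get_model_architecture` above, Mixtral-architecture checkpoints can take the in-flight BnB quantization path. A minimal usage sketch follows; the checkpoint name and memory settings are illustrative assumptions and not part of the patch:

```python
# Sketch only: exercises the "bitsandbytes" path that this patch enables for
# MixtralForCausalLM. The model name below is an assumed example checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative checkpoint
    quantization="bitsandbytes",  # quantization method allowed for Mixtral by this patch
    max_model_len=4096,           # assumed setting to keep the sketch small
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

The equivalent online-serving form would pass `--quantization bitsandbytes` to `vllm serve`.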
From 61b99a930fcf35fe46ca24bc378207f0f490e040 Mon Sep 17 00:00:00 2001 From: Aaron Pham Date: Mon, 14 Jul 2025 03:58:35 -0400 Subject: [PATCH 059/552] [Refactor][V1] Move outlines utils for V1 imports (#20878) Signed-off-by: Aaron Pham Signed-off-by: x22x22 --- vllm/v1/structured_output/backend_outlines.py | 9 +- vllm/v1/structured_output/utils.py | 200 +++++++++++++++++- 2 files changed, 204 insertions(+), 5 deletions(-) diff --git a/vllm/v1/structured_output/backend_outlines.py b/vllm/v1/structured_output/backend_outlines.py index e1e4ea431d9..572e4984480 100644 --- a/vllm/v1/structured_output/backend_outlines.py +++ b/vllm/v1/structured_output/backend_outlines.py @@ -13,13 +13,14 @@ import torch from regex import escape as regex_escape -from vllm.model_executor.guided_decoding.outlines_logits_processors import ( - OutlinesVocabulary, get_cache, get_vocabulary) from vllm.sampling_params import SamplingParams from vllm.utils import LazyLoader from vllm.v1.structured_output.backend_types import (StructuredOutputBackend, StructuredOutputGrammar, StructuredOutputOptions) +from vllm.v1.structured_output.utils import (OutlinesVocabulary, + get_outlines_cache, + get_outlines_vocabulary) if TYPE_CHECKING: import outlines_core as oc @@ -47,8 +48,8 @@ class OutlinesBackend(StructuredOutputBackend): def __post_init__(self): - self.vocabulary = get_vocabulary(self.tokenizer) - self.cache = get_cache() + self.vocabulary = get_outlines_vocabulary(self.tokenizer) + self.cache = get_outlines_cache() def _compile_index(self, regex_string: str, vocabulary: OutlinesVocabulary) -> oc.Index: diff --git a/vllm/v1/structured_output/utils.py b/vllm/v1/structured_output/utils.py index 7adee7237bd..95319831d51 100644 --- a/vllm/v1/structured_output/utils.py +++ b/vllm/v1/structured_output/utils.py @@ -3,7 +3,205 @@ from __future__ import annotations +import hashlib +import importlib.metadata +import os +from typing import TYPE_CHECKING + import regex as re +from cachetools import LRUCache +from diskcache import Cache + +import vllm.envs as envs +from vllm.logger import init_logger +from vllm.utils import LazyLoader + +if TYPE_CHECKING: + import outlines_core as oc + import transformers.file_utils as file_utils + import transformers.models.gpt2.tokenization_gpt2 as tokenization_gpt2 + + from vllm.transformers_utils.tokenizer import AnyTokenizer +else: + oc = LazyLoader("oc", globals(), "outlines_core") + file_utils = LazyLoader("file_utils", globals(), "transformers.file_utils") + tokenization_gpt2 = LazyLoader( + "tokenization_gpt2", + globals(), + "transformers.models.gpt2.tokenization_gpt2", + ) + +logger = init_logger(__name__) + +CACHE = None + + +class OutlinesVocabulary: + """ + Wrapper class for `outlines_core.Vocabulary`, + which allows us to store a hash with the vocabulary + """ + + def __init__(self, vocabulary: oc.Vocabulary) -> None: + # Actual vocabulary object + self.inner = vocabulary + # Have to do abs(hash()) because python hashes can + # be negative, and we are using hash as a cache key. 
+ hex_str = hashlib.sha256( + vocabulary.__repr__().encode('utf-8')).hexdigest() + hash_int = int(hex_str, 16) + self._hash = hash_int + + +def get_outlines_cache_path() -> str: + """Get the context object that contains previously-computed return values""" + outlines_cache_dir = os.getenv("OUTLINES_CACHE_DIR") + xdg_cache_home = os.getenv("XDG_CACHE_HOME") + home_dir = os.path.expanduser("~") + + if outlines_cache_dir: + # OUTLINES_CACHE_DIR takes precedence + return outlines_cache_dir + elif xdg_cache_home: + return os.path.join(xdg_cache_home, ".cache", "outlines") + # If homedir is "/", we may be inside a container, and thus writing to + # root would be problematic, so we fallback to using a tempfile. + # Also validate the path exists, since os.path.expanduser does + # not garuntee existence. + elif os.path.isdir(home_dir) and home_dir != "/": + # Default Unix fallback: ~/.cache/outlines + return os.path.join(home_dir, ".cache", "outlines") + else: + import tempfile + + # home_dir may be / inside a docker container without existing user + tempdir = tempfile.gettempdir() + return os.path.join(tempdir, ".cache", "outlines") + + +def get_outlines_cache(): + """Get the Cache instance to be used for index caching""" + + cache_dir = get_outlines_cache_path() + if envs.VLLM_V1_USE_OUTLINES_CACHE: + logger.warning("Enabling outlines cache. This is an unbounded on-disk " + "cache. It may consume a lot of disk space and should " + "not be used with untrusted clients.") + cache = Cache(cache_dir, eviction_policy="none", cull_limit=0) + outlines_version = importlib.metadata.version("outlines_core") + + cached_version = cache.get('__version__', None) + if cached_version != outlines_version: + cache.clear() + cache.set('__version__', outlines_version) + return cache + else: + return LRUCache(maxsize=128) + + +re_llama_byte_token = re.compile(r"^<0x[0-9A-F]{2}>$") +re_replacement_seq = re.compile(r"^.{0,6}�+.{0,6}$") + + +def _reduced_vocabulary( + tokenizer: AnyTokenizer, + eos_token_id: int, +) -> dict[bytes, list[int]]: + """Create a map from vocabulary tokens to lists of equivalent token ids. + + Returns: + A Dict of token string -> equivalent token ids + """ + + unicode_to_bytes = { + v: k + for k, v in tokenization_gpt2.bytes_to_unicode().items() + } + + def convert_token_to_string(token: str) -> str: + + string = tokenizer.convert_tokens_to_string([token]) + + # A hack to handle missing spaces to HF's Llama tokenizers + if (type(token) is str + and token.startswith(file_utils.SPIECE_UNDERLINE) + or token == "<0x20>"): + return " " + string + + return string + + vocabulary: dict[bytes, list[int]] = {} + empty_token_ids: list[int] = [] + for token, token_idx in tokenizer.get_vocab().items(): + if token in tokenizer.all_special_tokens: # type: ignore + continue + + token_str = convert_token_to_string(token) + if token_str: + if isinstance(token, (bytes, bytearray)): + # For BPE tokenizers where tokens are stored as bytes. + + # safe to ignore since token_str is of type (bytearray, bytes) + # by this point. + token_bytes = bytes(token_str) # type: ignore[arg-type] + + elif "\ufffd" in token_str and not re_replacement_seq.match( + token_str): + # Handle tokens with invalid UTF-8 sequences. + if re_llama_byte_token.match(token): + # Llama-like tokenizers use <0xXX> for incomplete sequences. 
+ token_bytes = bytes([int(token[3:5], 16)]) + else: + # GPT2 tokenizers: map each byte back using unicode_to_bytes + byte_vals = [unicode_to_bytes.get(c) for c in token] + if None in byte_vals: + raise RuntimeError( + f"Cannot convert token `{token}`" + f" ({token_idx}) to bytes: {token_str}") + # safe to ignore, since if None in byte_vals, + # an error is thrown. + token_bytes = bytes(byte_vals) # type: ignore[arg-type] + else: + token_bytes = token_str.encode('utf-8') + + if token_idx != eos_token_id: + vocabulary.setdefault(token_bytes, []).append(token_idx) + else: + empty_token_ids.append(token_idx) + + return vocabulary + + +def get_outlines_vocabulary(tokenizer: AnyTokenizer) -> oc.Vocabulary: + """Get the `Vocabulary` object for a given tokenizer. + """ + if hasattr(tokenizer, "_outlines_vocabulary"): + return tokenizer._outlines_vocabulary # type: ignore + + try: + if hasattr( + tokenizer, + "eos_token_id", + ) and tokenizer.eos_token_id is not None: + eos_token_id = tokenizer.eos_token_id + else: + raise ValueError( + f"Error during structured outputs setup for outlines: Tokenizer ({type(tokenizer)}) has no `eos_token_id` property, but `eos_token_id` is required for structured outputs to work properly." # noqa: E501 + ) + + reduced_vocab = _reduced_vocabulary( + tokenizer, + eos_token_id #type: ignore + ) + vocabulary = OutlinesVocabulary( + oc.Vocabulary(eos_token_id, reduced_vocab)) + tokenizer._outlines_vocabulary = vocabulary # type: ignore + + return vocabulary + except AttributeError as e: + raise ValueError(f"Cannot get the vocabulary of the tokenizer " + f"({type(tokenizer)}). The tokenizer should have a " + "get_vocab method.") from e def grammar_is_likely_lark(grammar_str: str) -> bool: @@ -77,7 +275,7 @@ def check_quotes(text: str, rule_name: str, line_num: int) -> None: raise ValueError( f"Mismatched quotes in {rule_name} on line {line_num}") - def extract_references(text: str) -> set: + def extract_references(text: str) -> set[str]: """Extract rule references from text.""" # Remove quoted strings and special characters text = re.sub(r'"[^"]*"', '', text) From c6866a95ec5ec2d787d71cd73b7bc75a3d74fc4c Mon Sep 17 00:00:00 2001 From: wangxiyuan Date: Mon, 14 Jul 2025 17:40:00 +0800 Subject: [PATCH 060/552] [MISC] Move bind_kv_cache to worker module (#20900) Signed-off-by: wangxiyuan Signed-off-by: x22x22 --- tests/v1/test_utils.py | 2 +- vllm/v1/utils.py | 48 --------------------------- vllm/v1/worker/gpu_model_runner.py | 4 +-- vllm/v1/worker/tpu_model_runner.py | 3 +- vllm/v1/worker/tpu_worker.py | 3 +- vllm/v1/worker/utils.py | 52 +++++++++++++++++++++++++++++- 6 files changed, 57 insertions(+), 55 deletions(-) diff --git a/tests/v1/test_utils.py b/tests/v1/test_utils.py index a3df882a9e2..fd0e630ce17 100644 --- a/tests/v1/test_utils.py +++ b/tests/v1/test_utils.py @@ -3,7 +3,7 @@ import torch -from vllm.v1.utils import bind_kv_cache +from vllm.v1.worker.utils import bind_kv_cache def test_bind_kv_cache(): diff --git a/vllm/v1/utils.py b/vllm/v1/utils.py index 6b40cf6fd36..97fec4704b4 100644 --- a/vllm/v1/utils.py +++ b/vllm/v1/utils.py @@ -4,7 +4,6 @@ import multiprocessing import time import weakref -from collections import defaultdict from collections.abc import Sequence from multiprocessing import connection from multiprocessing.process import BaseProcess @@ -14,14 +13,12 @@ import torch from vllm.logger import init_logger -from vllm.model_executor.models.utils import extract_layer_index from vllm.usage.usage_lib import (UsageContext, is_usage_stats_enabled, 
usage_message) from vllm.utils import (get_open_port, get_open_zmq_ipc_path, get_tcp_uri, kill_process_tree) if TYPE_CHECKING: - from vllm.attention.layer import Attention from vllm.v1.engine.coordinator import DPCoordinator from vllm.v1.engine.utils import (CoreEngineActorManager, CoreEngineProcManager) @@ -275,51 +272,6 @@ def shutdown(procs: list[BaseProcess]): kill_process_tree(pid) -def bind_kv_cache( - kv_caches: dict[str, torch.Tensor], - forward_context: dict[str, "Attention"], - runner_kv_caches: list[torch.Tensor], -) -> None: - """ - Bind the allocated KV cache to both ModelRunner and forward context so - that the KV cache can be used in the forward pass. - - This function: - 1) Fills the ModelRunner's kv cache list (`runner_kv_caches`) with - kv_caches. - 2) Associates each attention layer in the `forward_context` with its - corresponding KV cache in kv_caches. - - Args: - kv_caches: The allocated kv_caches with layer names as keys. - forward_context: The global forward context containing all Attention - layers with layer names as keys. - runner_kv_caches: The kv_cache declared by ModelRunner. - """ - # Bind kv_caches to ModelRunner - assert len(runner_kv_caches) == 0 - - # Convert kv_caches dict to a list of tensors in the order of layer_index. - index2name = defaultdict(list) - for layer_name in kv_caches: - index2name[extract_layer_index(layer_name)].append(layer_name) - - for layer_index in sorted(index2name.keys()): - layer_names = index2name[layer_index] - if len(layer_names) > 1: - # One typical case is encoder-decoder model, e.g., bart. - # The cross attention and self attention in the same decoder layer - # has different layer_name but the same layer_index. - raise NotImplementedError - layer_name = layer_names[0] - runner_kv_caches.append(kv_caches[layer_name]) - - # Bind kv_caches to forward context - for layer_name, kv_cache in kv_caches.items(): - # NOTE: Use list because of v0 PP virtual engine. 
- forward_context[layer_name].kv_cache = [kv_cache] - - def copy_slice(from_tensor: torch.Tensor, to_tensor: torch.Tensor, length: int) -> torch.Tensor: """ diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 4551cb2df98..734df82589a 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -62,13 +62,13 @@ from vllm.v1.spec_decode.medusa import MedusaProposer from vllm.v1.spec_decode.metadata import SpecDecodeMetadata from vllm.v1.spec_decode.ngram_proposer import NgramProposer -from vllm.v1.utils import bind_kv_cache from vllm.v1.worker.block_table import BlockTable from vllm.v1.worker.gpu_input_batch import CachedRequestState, InputBatch from vllm.v1.worker.lora_model_runner_mixin import LoRAModelRunnerMixin from ..sample.logits_processor import LogitsProcessorManager -from .utils import (gather_mm_placeholders, initialize_kv_cache_for_kv_sharing, +from .utils import (bind_kv_cache, gather_mm_placeholders, + initialize_kv_cache_for_kv_sharing, sanity_check_mm_encoder_outputs, scatter_mm_placeholders) if TYPE_CHECKING: diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index eb96e56f495..82a203caf2b 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -42,11 +42,10 @@ LogprobsTensors, ModelRunnerOutput) from vllm.v1.sample.tpu.metadata import TPUSupportedSamplingMetadata from vllm.v1.sample.tpu.sampler import Sampler as TPUSampler -from vllm.v1.utils import bind_kv_cache from vllm.v1.worker.lora_model_runner_mixin import LoRAModelRunnerMixin from vllm.v1.worker.tpu_input_batch import CachedRequestState, InputBatch -from .utils import (initialize_kv_cache_for_kv_sharing, +from .utils import (bind_kv_cache, initialize_kv_cache_for_kv_sharing, sanity_check_mm_encoder_outputs) if TYPE_CHECKING: diff --git a/vllm/v1/worker/tpu_worker.py b/vllm/v1/worker/tpu_worker.py index c5336e9ad51..c4bf40d6654 100644 --- a/vllm/v1/worker/tpu_worker.py +++ b/vllm/v1/worker/tpu_worker.py @@ -25,8 +25,9 @@ from vllm.v1.kv_cache_interface import (AttentionSpec, KVCacheConfig, KVCacheSpec) from vllm.v1.outputs import ModelRunnerOutput -from vllm.v1.utils import bind_kv_cache, report_usage_stats +from vllm.v1.utils import report_usage_stats from vllm.v1.worker.tpu_model_runner import TPUModelRunner +from vllm.v1.worker.utils import bind_kv_cache logger = init_logger(__name__) diff --git a/vllm/v1/worker/utils.py b/vllm/v1/worker/utils.py index 70339ff2f00..3ecb1d7dd65 100644 --- a/vllm/v1/worker/utils.py +++ b/vllm/v1/worker/utils.py @@ -1,12 +1,17 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from collections import defaultdict +from typing import TYPE_CHECKING, Optional import torch from vllm.model_executor.models.interfaces import MultiModalEmbeddings +from vllm.model_executor.models.utils import extract_layer_index from vllm.v1.kv_cache_interface import KVCacheGroupSpec +if TYPE_CHECKING: + from vllm.attention.layer import Attention + def sanity_check_mm_encoder_outputs( mm_embeddings: MultiModalEmbeddings, @@ -110,3 +115,48 @@ def initialize_kv_cache_for_kv_sharing( kv_caches[layer_name] = kv_caches[target_layer_name] group_idx = layer_to_kv_cache_group_idx[target_layer_name] kv_cache_groups[group_idx].layer_names.append(layer_name) + + +def bind_kv_cache( + kv_caches: dict[str, torch.Tensor], + forward_context: dict[str, "Attention"], + runner_kv_caches: 
list[torch.Tensor], +) -> None: + """ + Bind the allocated KV cache to both ModelRunner and forward context so + that the KV cache can be used in the forward pass. + + This function: + 1) Fills the ModelRunner's kv cache list (`runner_kv_caches`) with + kv_caches. + 2) Associates each attention layer in the `forward_context` with its + corresponding KV cache in kv_caches. + + Args: + kv_caches: The allocated kv_caches with layer names as keys. + forward_context: The global forward context containing all Attention + layers with layer names as keys. + runner_kv_caches: The kv_cache declared by ModelRunner. + """ + # Bind kv_caches to ModelRunner + assert len(runner_kv_caches) == 0 + + # Convert kv_caches dict to a list of tensors in the order of layer_index. + index2name = defaultdict(list) + for layer_name in kv_caches: + index2name[extract_layer_index(layer_name)].append(layer_name) + + for layer_index in sorted(index2name.keys()): + layer_names = index2name[layer_index] + if len(layer_names) > 1: + # One typical case is encoder-decoder model, e.g., bart. + # The cross attention and self attention in the same decoder layer + # has different layer_name but the same layer_index. + raise NotImplementedError + layer_name = layer_names[0] + runner_kv_caches.append(kv_caches[layer_name]) + + # Bind kv_caches to forward context + for layer_name, kv_cache in kv_caches.items(): + # NOTE: Use list because of v0 PP virtual engine. + forward_context[layer_name].kv_cache = [kv_cache] From 64e54a3e5f5b109c41cac6e4c06eb92b9daf72de Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Mon, 14 Jul 2025 18:32:35 +0800 Subject: [PATCH 061/552] [CI/Build] Fix OOM issue in Jina-VL test (#20907) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- .../pooling/test_jinavl_reranker.py | 143 +++++++++++------- 1 file changed, 85 insertions(+), 58 deletions(-) diff --git a/tests/models/multimodal/pooling/test_jinavl_reranker.py b/tests/models/multimodal/pooling/test_jinavl_reranker.py index 83d6ab8e403..50c91f1f81c 100644 --- a/tests/models/multimodal/pooling/test_jinavl_reranker.py +++ b/tests/models/multimodal/pooling/test_jinavl_reranker.py @@ -1,9 +1,15 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from typing import Union import pytest from transformers import AutoModel +from vllm.entrypoints.chat_utils import ChatCompletionContentPartImageParam +from vllm.entrypoints.score_utils import ScoreMultiModalParam + +from ....conftest import HfRunner, VllmRunner + model_name = "jinaai/jina-reranker-m0" mm_processor_kwargs = { @@ -14,73 +20,90 @@ limit_mm_per_prompt = {"image": 2} -def vllm_reranker(model_name, - query, - documents, - query_type="text", - doc_type="text"): - from vllm import LLM - - model = LLM( - model=model_name, - task="score", - max_model_len=32768, - mm_processor_kwargs=mm_processor_kwargs, - limit_mm_per_prompt=limit_mm_per_prompt, - ) +def vllm_reranker( + vllm_runner: type[VllmRunner], + model_name: str, + dtype: str, + query_strs: list[str], + document_strs: list[str], + query_type: str = "text", + doc_type: str = "text", +): - def create_image_param(url: str): + def create_image_param(url: str) -> ChatCompletionContentPartImageParam: return {"type": "image_url", "image_url": {"url": f"{url}"}} - if query_type == "image": - query = {"content": [create_image_param(url) for url in query]} - - if doc_type == "image": - documents = {"content": [create_image_param(url) for url in documents]} - - outputs = model.score(query, documents) + 
query: Union[list[str], ScoreMultiModalParam] + if query_type == "text": + query = query_strs + elif query_type == "image": + query = ScoreMultiModalParam( + content=[create_image_param(url) for url in query_strs]) + + documents: Union[list[str], ScoreMultiModalParam] + if doc_type == "text": + documents = document_strs + elif doc_type == "image": + documents = ScoreMultiModalParam( + content=[create_image_param(url) for url in document_strs]) + + with vllm_runner( + model_name, + task="score", + dtype=dtype, + max_num_seqs=2, + max_model_len=2048, + mm_processor_kwargs=mm_processor_kwargs, + limit_mm_per_prompt=limit_mm_per_prompt, + ) as vllm_model: + outputs = vllm_model.model.score(query, documents) return [output.outputs.score for output in outputs] -def hf_reranker(model_name, - query, - documents, - query_type="text", - doc_type="text"): - +def hf_reranker( + hf_runner: type[HfRunner], + model_name: str, + dtype: str, + query_strs: list[str], + document_strs: list[str], + query_type: str = "text", + doc_type: str = "text", +): checkpoint_to_hf_mapper = { "visual.": "model.visual.", "model.": "model.language_model.", } - model = AutoModel.from_pretrained( - model_name, - torch_dtype="auto", - trust_remote_code=True, - key_mapping=checkpoint_to_hf_mapper).to("cuda").eval() + data_pairs = [[query_strs[0], d] for d in document_strs] - data_pairs = [[query[0], d] for d in documents] - - scores = model.compute_score(data_pairs, - max_length=2048, - query_type=query_type, - doc_type=doc_type) - return scores + with hf_runner( + model_name, + dtype=dtype, + trust_remote_code=True, + auto_cls=AutoModel, + model_kwargs={"key_mapping": checkpoint_to_hf_mapper}, + ) as hf_model: + return hf_model.model.compute_score(data_pairs, + max_length=2048, + query_type=query_type, + doc_type=doc_type) # Visual Documents Reranking @pytest.mark.parametrize("model_name", [model_name]) -def test_model_text_image(model_name): - +@pytest.mark.parametrize("dtype", ["half"]) +def test_model_text_image(hf_runner, vllm_runner, model_name, dtype): query = ["slm markdown"] documents = [ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png", "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png", ] - hf_outputs = hf_reranker(model_name, query, documents, "text", "image") - vllm_outputs = vllm_reranker(model_name, query, documents, "text", "image") + hf_outputs = hf_reranker(hf_runner, model_name, dtype, query, documents, + "text", "image") + vllm_outputs = vllm_reranker(vllm_runner, model_name, dtype, query, + documents, "text", "image") assert hf_outputs[0] == pytest.approx(vllm_outputs[0], rel=0.02) assert hf_outputs[1] == pytest.approx(vllm_outputs[1], rel=0.02) @@ -88,8 +111,8 @@ def test_model_text_image(model_name): # Textual Documents Reranking @pytest.mark.parametrize("model_name", [model_name]) -def test_model_text_text(model_name): - +@pytest.mark.parametrize("dtype", ["half"]) +def test_model_text_text(hf_runner, vllm_runner, model_name, dtype): query = ["slm markdown"] documents = [ """We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient @@ -104,9 +127,10 @@ def test_model_text_text(model_name): lower computational requirements.""", # noqa: E501 "数据提取么?为什么不用正则啊,你用正则不就全解决了么?", ] - - hf_outputs = hf_reranker(model_name, query, documents, "text", "text") - vllm_outputs = vllm_reranker(model_name, query, documents, "text", "text") + hf_outputs = hf_reranker(hf_runner, model_name, dtype, query, 
documents, + "text", "text") + vllm_outputs = vllm_reranker(vllm_runner, model_name, dtype, query, + documents, "text", "text") assert hf_outputs[0] == pytest.approx(vllm_outputs[0], rel=0.02) assert hf_outputs[1] == pytest.approx(vllm_outputs[1], rel=0.02) @@ -114,8 +138,8 @@ def test_model_text_text(model_name): # Image Querying for Textual Documents @pytest.mark.parametrize("model_name", [model_name]) -def test_model_image_text(model_name): - +@pytest.mark.parametrize("dtype", ["half"]) +def test_model_image_text(hf_runner, vllm_runner, model_name, dtype): query = [ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png" ] @@ -133,8 +157,10 @@ def test_model_image_text(model_name): "数据提取么?为什么不用正则啊,你用正则不就全解决了么?", ] - hf_outputs = hf_reranker(model_name, query, documents, "image", "text") - vllm_outputs = vllm_reranker(model_name, query, documents, "image", "text") + hf_outputs = hf_reranker(hf_runner, model_name, dtype, query, documents, + "image", "text") + vllm_outputs = vllm_reranker(vllm_runner, model_name, dtype, query, + documents, "image", "text") assert hf_outputs[0] == pytest.approx(vllm_outputs[0], rel=0.02) assert hf_outputs[1] == pytest.approx(vllm_outputs[1], rel=0.02) @@ -142,8 +168,8 @@ def test_model_image_text(model_name): # Image Querying for Image Documents @pytest.mark.parametrize("model_name", [model_name]) -def test_model_image_image(model_name): - +@pytest.mark.parametrize("dtype", ["half"]) +def test_model_image_image(hf_runner, vllm_runner, model_name, dtype): query = [ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png" ] @@ -152,9 +178,10 @@ def test_model_image_image(model_name): "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png", ] - hf_outputs = hf_reranker(model_name, query, documents, "image", "image") - vllm_outputs = vllm_reranker(model_name, query, documents, "image", - "image") + hf_outputs = hf_reranker(hf_runner, model_name, dtype, query, documents, + "image", "image") + vllm_outputs = vllm_reranker(vllm_runner, model_name, dtype, query, + documents, "image", "image") assert hf_outputs[0] == pytest.approx(vllm_outputs[0], rel=0.02) assert hf_outputs[1] == pytest.approx(vllm_outputs[1], rel=0.02) From b234c85861973a5b796c937a2d4b7f9b48c95043 Mon Sep 17 00:00:00 2001 From: 22quinn <33176974+22quinn@users.noreply.github.com> Date: Mon, 14 Jul 2025 03:45:03 -0700 Subject: [PATCH 062/552] [Bugfix] Bump up mistral_common to support v13 tokenizer (#20905) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: x22x22 --- requirements/test.in | 2 +- requirements/test.txt | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/requirements/test.in b/requirements/test.in index 1c725df7e60..673120258b1 100644 --- a/requirements/test.in +++ b/requirements/test.in @@ -28,7 +28,7 @@ torchvision==0.22.0 transformers_stream_generator # required for qwen-vl test mamba_ssm # required for plamo2 test matplotlib # required for qwen-vl test -mistral_common[opencv] >= 1.6.2 # required for pixtral test +mistral_common[opencv] >= 1.7.0 # required for pixtral test num2words # required for smolvlm test opencv-python-headless >= 4.11.0 # required for video test datamodel_code_generator # required for minicpm3 test diff --git a/requirements/test.txt b/requirements/test.txt index 6f500992bb5..3828efae381 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -305,7 +305,7 @@ mbstrdecoder==1.1.3 # typepy mdurl==0.1.2 # via 
markdown-it-py -mistral-common==1.6.2 +mistral-common==1.7.0 # via -r requirements/test.in more-itertools==10.5.0 # via lm-eval From 3b06d86341b6659cafa3b7c3fdf72814ce2d8100 Mon Sep 17 00:00:00 2001 From: Reid <61492567+reidliu41@users.noreply.github.com> Date: Mon, 14 Jul 2025 18:48:55 +0800 Subject: [PATCH 063/552] [Misc] Remove unused function (#20909) Signed-off-by: reidliu41 Signed-off-by: x22x22 --- vllm/entrypoints/cli/main.py | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/vllm/entrypoints/cli/main.py b/vllm/entrypoints/cli/main.py index 3e09d45b2ed..fed3ea65040 100644 --- a/vllm/entrypoints/cli/main.py +++ b/vllm/entrypoints/cli/main.py @@ -7,17 +7,6 @@ from __future__ import annotations import importlib.metadata -import signal -import sys - - -def register_signal_handlers(): - - def signal_handler(sig, frame): - sys.exit(0) - - signal.signal(signal.SIGINT, signal_handler) - signal.signal(signal.SIGTSTP, signal_handler) def main(): From bf678e693eb7903866b9eba3ef5e3a230d12578d Mon Sep 17 00:00:00 2001 From: Chauncey Date: Mon, 14 Jul 2025 19:06:45 +0800 Subject: [PATCH 064/552] [Bugfix]: Fix messy code when using logprobs (#20910) Signed-off-by: chaunceyjiang Signed-off-by: x22x22 --- vllm/transformers_utils/detokenizer_utils.py | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/vllm/transformers_utils/detokenizer_utils.py b/vllm/transformers_utils/detokenizer_utils.py index 6812cda7110..be1040c3e01 100644 --- a/vllm/transformers_utils/detokenizer_utils.py +++ b/vllm/transformers_utils/detokenizer_utils.py @@ -78,7 +78,6 @@ def convert_prompt_ids_to_tokens( def convert_ids_list_to_tokens( tokenizer: AnyTokenizer, token_ids: list[int], - skip_special_tokens: bool = False, ) -> list[str]: """Detokenize the input ids individually. @@ -92,10 +91,8 @@ def convert_ids_list_to_tokens( """ token_str_lst = [] for token_id in token_ids: - token_str = tokenizer.decode( - [token_id], - skip_special_tokens=skip_special_tokens, - ) + # use default skip_special_tokens. 
+ token_str = tokenizer.decode([token_id]) if token_str is None: token_str = "" token_str_lst.append(token_str) From 83ec2c7fda92ecf71bd03afc8818c0ff3e838d54 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Mon, 14 Jul 2025 19:16:51 +0800 Subject: [PATCH 065/552] [Misc] Log the reason for falling back to FlexAttention (#20699) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/attention/selector.py | 49 +++++++++++++--- vllm/platforms/cuda.py | 57 ++++++++++++------- .../hunyuan_a13b_reasoning_parser.py | 2 +- vllm/v1/attention/backends/cpu_attn.py | 4 ++ vllm/v1/attention/backends/flash_attn.py | 4 ++ vllm/v1/attention/backends/flashinfer.py | 4 ++ vllm/v1/attention/backends/flex_attention.py | 4 ++ vllm/v1/attention/backends/mla/common.py | 4 ++ vllm/v1/attention/backends/rocm_aiter_fa.py | 4 ++ vllm/v1/attention/backends/triton_attn.py | 4 ++ 10 files changed, 104 insertions(+), 32 deletions(-) diff --git a/vllm/attention/selector.py b/vllm/attention/selector.py index df14aea729f..4d4886d02b7 100644 --- a/vllm/attention/selector.py +++ b/vllm/attention/selector.py @@ -3,6 +3,7 @@ import os from contextlib import contextmanager +from dataclasses import dataclass from functools import cache from typing import Generator, Optional, Union @@ -79,31 +80,61 @@ def get_global_forced_attn_backend() -> Optional[_Backend]: return forced_attn_backend -def supports_head_size( +@dataclass(frozen=True) +class _IsSupported: + can_import: bool + head_size: bool + dtype: bool + + def __bool__(self) -> bool: + return self.can_import and self.head_size and self.dtype + + +def is_attn_backend_supported( attn_backend: Union[str, type[AttentionBackend]], head_size: int, -) -> bool: + dtype: torch.dtype, + *, + allow_import_error: bool = True, +) -> _IsSupported: if isinstance(attn_backend, str): try: attn_backend = resolve_obj_by_qualname(attn_backend) except ImportError: - return False + if not allow_import_error: + raise + + return _IsSupported(can_import=False, head_size=False, dtype=False) assert isinstance(attn_backend, type) # TODO: Update the interface once V0 is removed if get_supported_head_sizes := getattr(attn_backend, "get_supported_head_sizes", None): - return head_size in get_supported_head_sizes() - if validate_head_size := getattr(attn_backend, "validate_head_size", None): + is_head_size_supported = head_size in get_supported_head_sizes() + elif validate_head_size := getattr(attn_backend, "validate_head_size", + None): try: validate_head_size(head_size) - return True + is_head_size_supported = True except Exception: - return False + is_head_size_supported = False + else: + raise NotImplementedError(f"{attn_backend.__name__} does not support " + "head size validation") + + if get_supported_dtypes := getattr(attn_backend, "get_supported_dtypes", + None): + is_dtype_supported = dtype in get_supported_dtypes() + else: + raise NotImplementedError(f"{attn_backend.__name__} does not support " + "dtype validation") - raise NotImplementedError(f"{attn_backend.__name__} does not support " - "head size validation") + return _IsSupported( + can_import=True, + head_size=is_head_size_supported, + dtype=is_dtype_supported, + ) def get_attn_backend( diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index 878f8f77edf..75b10643c2b 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -259,43 +259,56 @@ def get_attn_backend_cls(cls, selected_backend, head_size, dtype, logger.info_once("Using Flash Attention backend on V1 engine.") return FLASH_ATTN_V1 - from 
vllm.attention.selector import supports_head_size + from vllm.attention.selector import is_attn_backend_supported # Default backends for V1 engine - # FP32 is only supported by FlexAttention - if dtype not in (torch.float16, torch.bfloat16): - logger.info_once( - "Using FlexAttention backend for %s on V1 engine.", - dtype, - ) - return FLEX_ATTENTION_V1 - # Prefer FlashInfer for Blackwell GPUs if installed - if cls.is_device_capability(100) and \ - supports_head_size(FLASHINFER_V1, head_size): - try: - import flashinfer # noqa: F401 - + if cls.is_device_capability(100): + if is_default_backend_supported := is_attn_backend_supported( + FLASHINFER_V1, head_size, dtype): from vllm.v1.attention.backends.utils import ( set_kv_cache_layout) + logger.info_once( "Using FlashInfer backend with HND KV cache layout on " "V1 engine by default for Blackwell (SM 10.0) GPUs.") set_kv_cache_layout("HND") + return FLASHINFER_V1 - except ImportError: - logger.info_once( + + if not is_default_backend_supported.can_import: + logger.warning_once( "FlashInfer failed to import for V1 engine on " "Blackwell (SM 10.0) GPUs; it is recommended to " "install FlashInfer for better performance.") - pass + # FlashAttention is the default for SM 8.0+ GPUs - if cls.has_device_capability(80) and \ - supports_head_size(FLASH_ATTN_V1, head_size): - logger.info_once("Using Flash Attention backend on V1 engine.") - return FLASH_ATTN_V1 + if cls.has_device_capability(80): + if is_default_backend_supported := is_attn_backend_supported( + FLASH_ATTN_V1, head_size, dtype, + allow_import_error=False): + logger.info_once("Using Flash Attention backend on " + "V1 engine.") + return FLASH_ATTN_V1 + + # FlexAttention is the default for older GPUs + else: + logger.info_once("Using FlexAttention backend on V1 engine.") + return FLEX_ATTENTION_V1 + + assert not is_default_backend_supported + + use_flex_attention_reason = {} + if not is_default_backend_supported.head_size: + use_flex_attention_reason["head_size"] = head_size + if not is_default_backend_supported.dtype: + use_flex_attention_reason["dtype"] = dtype - logger.info_once("Using FlexAttention backend on V1 engine.") + logger.info_once( + "Using FlexAttention backend for %s on V1 engine.", + ", ".join(f"{k}={v}" + for k, v in use_flex_attention_reason.items()), + ) return FLEX_ATTENTION_V1 # Backends for V0 engine diff --git a/vllm/reasoning/hunyuan_a13b_reasoning_parser.py b/vllm/reasoning/hunyuan_a13b_reasoning_parser.py index 598a0e97e51..fb29d51eae8 100644 --- a/vllm/reasoning/hunyuan_a13b_reasoning_parser.py +++ b/vllm/reasoning/hunyuan_a13b_reasoning_parser.py @@ -1,10 +1,10 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import re from collections.abc import Sequence from typing import Optional, Union +import regex as re from transformers import PreTrainedTokenizerBase from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, diff --git a/vllm/v1/attention/backends/cpu_attn.py b/vllm/v1/attention/backends/cpu_attn.py index d6270fbf319..f1c6bdfc1c9 100644 --- a/vllm/v1/attention/backends/cpu_attn.py +++ b/vllm/v1/attention/backends/cpu_attn.py @@ -37,6 +37,10 @@ class TorchSDPABackend(AttentionBackend): accept_output_buffer: bool = False + @classmethod + def get_supported_dtypes(cls) -> list[torch.dtype]: + return [torch.float16, torch.bfloat16, torch.float32] + @classmethod def validate_head_size(cls, head_size: int) -> None: attn_impl = _get_paged_attn_impl() diff --git 
a/vllm/v1/attention/backends/flash_attn.py b/vllm/v1/attention/backends/flash_attn.py index fbc13c06c65..552c2caf2fa 100755 --- a/vllm/v1/attention/backends/flash_attn.py +++ b/vllm/v1/attention/backends/flash_attn.py @@ -44,6 +44,10 @@ class FlashAttentionBackend(AttentionBackend): accept_output_buffer: bool = True + @classmethod + def get_supported_dtypes(cls) -> list[torch.dtype]: + return [torch.float16, torch.bfloat16] + @classmethod def get_supported_head_sizes(cls) -> list[int]: return [32, 64, 96, 128, 160, 192, 224, 256] diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index 4ae595c976b..f922e6e4c9e 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -42,6 +42,10 @@ class FlashInferBackend(AttentionBackend): accept_output_buffer: bool = True cached_sm100a_supported: Optional[bool] = None + @classmethod + def get_supported_dtypes(cls) -> list[torch.dtype]: + return [torch.float16, torch.bfloat16] + @classmethod def get_supported_head_sizes(cls) -> list[int]: # https://github.com/flashinfer-ai/flashinfer/blob/3d55c71a62052c590c130897d3a3db49b14fcc34/include/flashinfer/utils.cuh#L157 diff --git a/vllm/v1/attention/backends/flex_attention.py b/vllm/v1/attention/backends/flex_attention.py index a8c5f464aa3..f0f54c28831 100644 --- a/vllm/v1/attention/backends/flex_attention.py +++ b/vllm/v1/attention/backends/flex_attention.py @@ -42,6 +42,10 @@ def _offsets_to_doc_ids_tensor(offsets: torch.Tensor) -> torch.Tensor: class FlexAttentionBackend(AttentionBackend): accept_output_buffer: bool = True + @classmethod + def get_supported_dtypes(cls) -> list[torch.dtype]: + return [torch.float16, torch.bfloat16, torch.float32] + @classmethod def validate_head_size(cls, head_size: int) -> None: return # FlexAttention supports any head size diff --git a/vllm/v1/attention/backends/mla/common.py b/vllm/v1/attention/backends/mla/common.py index 970de229e13..1232f73430f 100644 --- a/vllm/v1/attention/backends/mla/common.py +++ b/vllm/v1/attention/backends/mla/common.py @@ -262,6 +262,10 @@ def get_kv_cache_shape( ) -> tuple[int, ...]: return (num_blocks, block_size, head_size) + @classmethod + def get_supported_dtypes(cls) -> list[torch.dtype]: + return [torch.float16, torch.bfloat16] + @classmethod def get_supported_head_sizes(cls) -> list[int]: return [576] diff --git a/vllm/v1/attention/backends/rocm_aiter_fa.py b/vllm/v1/attention/backends/rocm_aiter_fa.py index 6a78b03dce8..dd86e56885e 100644 --- a/vllm/v1/attention/backends/rocm_aiter_fa.py +++ b/vllm/v1/attention/backends/rocm_aiter_fa.py @@ -314,6 +314,10 @@ class AiterFlashAttentionBackend(AttentionBackend): accept_output_buffer: bool = True + @classmethod + def get_supported_dtypes(cls) -> list[torch.dtype]: + return [torch.float16, torch.bfloat16] + @classmethod def get_supported_head_sizes(cls) -> list[int]: return [32, 64, 96, 128, 160, 192, 224, 256] diff --git a/vllm/v1/attention/backends/triton_attn.py b/vllm/v1/attention/backends/triton_attn.py index cdaff2f6a40..7dc90a6a97e 100644 --- a/vllm/v1/attention/backends/triton_attn.py +++ b/vllm/v1/attention/backends/triton_attn.py @@ -190,6 +190,10 @@ class TritonAttentionBackend(AttentionBackend): accept_output_buffer: bool = True + @classmethod + def get_supported_dtypes(cls) -> list[torch.dtype]: + return [torch.float16, torch.bfloat16] + @classmethod def get_supported_head_sizes(cls) -> list[int]: return [32, 64, 96, 128, 160, 192, 224, 256] From 
06492f26aefad4166bc01e1c4b08bd0183d20952 Mon Sep 17 00:00:00 2001 From: ant-yy Date: Mon, 14 Jul 2025 22:10:32 +0800 Subject: [PATCH 066/552] [Model] Add Ling implementation (#20680) Signed-off-by: vito.yy Signed-off-by: x22x22 --- docs/models/supported_models.md | 1 + tests/models/registry.py | 2 + vllm/model_executor/models/bailing_moe.py | 530 ++++++++++++++++++++++ vllm/model_executor/models/registry.py | 1 + 4 files changed, 534 insertions(+) create mode 100644 vllm/model_executor/models/bailing_moe.py diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 9e70e46fabe..444a65314e6 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -316,6 +316,7 @@ Specified using `--task generate`. | `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ | | `BaiChuanForCausalLM` | Baichuan2, Baichuan | `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `BailingMoeForCausalLM` | Ling | `inclusionAI/Ling-lite-1.5`, `inclusionAI/Ling-plus`, etc. | | ✅︎ | ✅︎ | | `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | ✅︎ | | `BloomForCausalLM` | BLOOM, BLOOMZ, BLOOMChat | `bigscience/bloom`, `bigscience/bloomz`, etc. | | ✅︎ | | | `BartForConditionalGeneration` | BART | `facebook/bart-base`, `facebook/bart-large-cnn`, etc. | | | | diff --git a/tests/models/registry.py b/tests/models/registry.py index 1207a928c92..9d3fc8a1b1c 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -141,6 +141,8 @@ def check_available_online( trust_remote_code=True), "BaichuanForCausalLM": _HfExamplesInfo("baichuan-inc/Baichuan2-7B-chat", trust_remote_code=True), + "BailingMoeForCausalLM": _HfExamplesInfo("inclusionAI/Ling-lite-1.5", + trust_remote_code=True), "BambaForCausalLM": _HfExamplesInfo("ibm-ai-platform/Bamba-9B", extras={"tiny": "hmellor/tiny-random-BambaForCausalLM"}), # noqa: E501 "BloomForCausalLM": _HfExamplesInfo("bigscience/bloom-560m", diff --git a/vllm/model_executor/models/bailing_moe.py b/vllm/model_executor/models/bailing_moe.py new file mode 100644 index 00000000000..325ba7bbad8 --- /dev/null +++ b/vllm/model_executor/models/bailing_moe.py @@ -0,0 +1,530 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +# Adapted from +# https://github.com/inclusionAI/Ling/blob/master/models/modeling_bailing_moe.py +# Copyright 2023 The vLLM team. +# Copyright 2023 Antgroup and The HuggingFace Inc. team. All rights reserved. +# +# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX +# and OPT implementations in this library. It has been modified from its +# original forms to accommodate minor architectural differences compared +# to GPT-NeoX and OPT used by the Meta AI team that trained the model. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +"""Inference-only BailingMoE model compatible with HuggingFace weights.""" +from collections.abc import Iterable +from typing import Optional, Union + +import torch +import torch.nn.functional as F +from torch import nn +from transformers.configuration_utils import PretrainedConfig + +from vllm.attention import Attention +from vllm.config import CacheConfig, VllmConfig +from vllm.distributed import (get_pp_group, get_tensor_model_parallel_rank, + get_tensor_model_parallel_world_size, + tensor_model_parallel_all_reduce) +from vllm.model_executor.layers.activation import SiluAndMul +from vllm.model_executor.layers.fused_moe import FusedMoE +from vllm.model_executor.layers.layernorm import RMSNorm +from vllm.model_executor.layers.linear import (MergedColumnParallelLinear, + QKVParallelLinear, + ReplicatedLinear, + RowParallelLinear) +from vllm.model_executor.layers.logits_processor import LogitsProcessor +from vllm.model_executor.layers.quantization.base_config import ( + QuantizationConfig) +from vllm.model_executor.layers.rotary_embedding import get_rope +from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler +from vllm.model_executor.layers.vocab_parallel_embedding import ( + ParallelLMHead, VocabParallelEmbedding) +from vllm.model_executor.model_loader.weight_utils import default_weight_loader +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.sequence import IntermediateTensors + +from .interfaces import SupportsPP +from .utils import (AutoWeightsLoader, PPMissingLayer, is_pp_missing_parameter, + make_empty_intermediate_tensors_factory, make_layers, + maybe_prefix) + + +class BailingAttention(nn.Module): + + def __init__( + self, + config: PretrainedConfig, + cache_config: Optional[CacheConfig] = None, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + ): + super().__init__() + self.hidden_size = config.hidden_size + self.total_num_heads = config.num_attention_heads + self.total_kv_heads = config.num_key_value_heads + tp_size = get_tensor_model_parallel_world_size() + + assert self.total_num_heads % tp_size == 0 + assert self.total_kv_heads % tp_size == 0 + assert self.total_num_heads >= self.total_kv_heads + + self.num_heads = self.total_num_heads // tp_size + self.head_dim = config.head_dim or (self.hidden_size // + self.total_num_heads) + self.q_size_per_rank = self.head_dim * self.num_heads + + self.num_kv_heads = self.total_kv_heads // tp_size + self.kv_size_per_rank = self.num_kv_heads * self.head_dim + self.scale = self.head_dim**-0.5 + + self.query_key_value = QKVParallelLinear( + self.hidden_size, + self.head_dim, + self.total_num_heads, + self.total_kv_heads, + bias=(config.use_bias or config.use_qkv_bias), + quant_config=quant_config, + prefix=f"{prefix}.query_key_value", + ) + + self.dense = RowParallelLinear( + self.total_num_heads * self.head_dim, + self.hidden_size, + bias=config.use_bias, + quant_config=quant_config, + prefix=f"{prefix}.dense", + ) + + self.attn = Attention(self.num_heads, + self.head_dim, + self.scale, + num_kv_heads=self.num_kv_heads, + cache_config=cache_config, + prefix=f"{prefix}.attn") + + self.rotary_emb = get_rope( + self.head_dim, + rotary_dim=self.head_dim, + max_position=config.max_position_embeddings, + base=config.rope_theta, + is_neox_style=True, + rope_scaling=config.rope_scaling, + ) + + def forward( + self, + hidden_states: torch.Tensor, + position_ids: 
torch.Tensor, + ) -> torch.Tensor: + + qkv, _ = self.query_key_value(hidden_states) + q, k, v = qkv.split([ + self.q_size_per_rank, self.kv_size_per_rank, self.kv_size_per_rank + ], + dim=-1) + + q, k = self.rotary_emb(position_ids, q, k) + + context_layer = self.attn(q, k, v) + + attn_output, _ = self.dense(context_layer) + return attn_output + + +class BailingMLP(nn.Module): + + def __init__( + self, + intermediate_size: int, + config: PretrainedConfig, + quant_config: Optional[QuantizationConfig] = None, + reduce_results: Optional[bool] = True, + prefix: str = "", + ) -> None: + super().__init__() + self.gate_up_proj = MergedColumnParallelLinear( + config.hidden_size, + [intermediate_size] * 2, + bias=config.use_bias, + quant_config=quant_config, + prefix=f"{prefix}.gate_up_proj", + ) + self.down_proj = RowParallelLinear( + intermediate_size, + config.hidden_size, + bias=config.use_bias, + quant_config=quant_config, + reduce_results=reduce_results, + prefix=f"{prefix}.down_proj", + ) + self.act_fn = SiluAndMul() + + def forward(self, x): + x, _ = self.gate_up_proj(x) + x = self.act_fn(x) + x, _ = self.down_proj(x) + return x + + +class BailingMoE(nn.Module): + + def __init__( + self, + intermediate_size: int, + config: PretrainedConfig, + quant_config: Optional[QuantizationConfig] = None, + reduce_results: Optional[bool] = True, + prefix: str = "", + ): + super().__init__() + + self.tp_size = get_tensor_model_parallel_world_size() + self.tp_rank = get_tensor_model_parallel_rank() + self.num_experts = config.num_experts + self.top_k = config.num_experts_per_tok + self.norm_expert_prob = config.norm_topk_prob + self.hidden_size = config.hidden_size + self.quant_config = quant_config + self.num_shared_experts = config.num_shared_experts + # Gate always runs at half / full precision for now. 
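+        # The replicated router projection below therefore passes quant_config=None.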
+ self.gate = ReplicatedLinear(self.hidden_size, + self.num_experts, + bias=False, + quant_config=None) + + self.experts = FusedMoE(num_experts=self.num_experts, + top_k=self.top_k, + hidden_size=self.hidden_size, + intermediate_size=config.moe_intermediate_size, + reduce_results=False, + renormalize=self.norm_expert_prob, + quant_config=quant_config, + prefix=f"{prefix}.experts") + + if self.num_shared_experts > 0: + intermediate_size = (config.moe_intermediate_size * + self.num_shared_experts) + self.shared_experts = BailingMLP( + intermediate_size=intermediate_size, + config=config, + quant_config=quant_config, + reduce_results=False, + prefix=f"{prefix}.shared_experts") + else: + self.shared_experts = None + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + num_tokens, hidden_size = hidden_states.shape + hidden_states = hidden_states.view(-1, hidden_size) + if self.num_shared_experts > 0: + shared_output = self.shared_experts(hidden_states) + # router_logits: (num_tokens, n_experts) + router_logits, _ = self.gate(hidden_states) + final_hidden_states = self.experts(hidden_states=hidden_states, + router_logits=router_logits) + + if self.num_shared_experts > 0: + final_hidden_states = final_hidden_states + shared_output + + if self.tp_size > 1: + final_hidden_states = tensor_model_parallel_all_reduce( + final_hidden_states) + return final_hidden_states.view(num_tokens, hidden_size) + + +class BailingMoeBlock(nn.Module): + + def __init__( + self, + config: PretrainedConfig, + cache_config: Optional[CacheConfig] = None, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + ): + super().__init__() + hidden_size = config.hidden_size + intermediate_size = config.intermediate_size + self.input_layernorm = RMSNorm(hidden_size, eps=config.rms_norm_eps) + self.attention = BailingAttention(config, + cache_config, + quant_config, + prefix=f"{prefix}.attention") + self.post_attention_layernorm = RMSNorm(hidden_size, + eps=config.rms_norm_eps) + self.mlp = BailingMoE(intermediate_size, + config, + quant_config, + True, + prefix=f"{prefix}.mlp") + + def forward( + self, + hidden_states: torch.Tensor, + position_ids: torch.Tensor, + residual: Optional[torch.Tensor], + ) -> torch.Tensor: + if residual is None: + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + else: + hidden_states, residual = self.input_layernorm( + hidden_states, residual) + + hidden_states = self.attention( + hidden_states=hidden_states, + position_ids=position_ids, + ) + + hidden_states, residual = self.post_attention_layernorm( + hidden_states, residual) + hidden_states = self.mlp(hidden_states) + return hidden_states, residual + + +class BailingMoeModel(nn.Module): + + def __init__( + self, + *, + vllm_config: VllmConfig, + prefix: str = "", + ): + super().__init__() + config = vllm_config.model_config.hf_config + cache_config = vllm_config.cache_config + quant_config = vllm_config.quant_config + + self.config = config + self.vocab_size = config.vocab_size + self.embed_dim = config.hidden_size + + if get_pp_group().is_first_rank or (config.tie_word_embeddings + and get_pp_group().is_last_rank): + self.word_embeddings = VocabParallelEmbedding( + self.vocab_size, self.embed_dim) + else: + self.word_embeddings = PPMissingLayer() + + self.embedding_dropout = torch.nn.Dropout(config.embedding_dropout) + + self.start_layer, self.end_layer, self.layers = make_layers( + config.num_hidden_layers, + lambda prefix: BailingMoeBlock( + config=config, + 
cache_config=cache_config, + quant_config=quant_config, + prefix=prefix, + ), + prefix=f"{prefix}.layers") + + self.make_empty_intermediate_tensors = ( + make_empty_intermediate_tensors_factory( + ["hidden_states", "residual"], config.hidden_size)) + + if get_pp_group().is_last_rank: + self.norm = RMSNorm(self.embed_dim, eps=config.rms_norm_eps) + else: + self.norm = PPMissingLayer() + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.word_embeddings(input_ids) + + def forward( + self, + input_ids: torch.Tensor, + position_ids: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors], + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + if get_pp_group().is_first_rank: + if inputs_embeds is not None: + hidden_states = inputs_embeds + else: + hidden_states = self.get_input_embeddings(input_ids) + residual = None + else: + assert intermediate_tensors is not None + hidden_states = intermediate_tensors["hidden_states"] + residual = intermediate_tensors["residual"] + + for i in range(self.start_layer, self.end_layer): + layer = self.layers[i] + hidden_states, residual = layer( + hidden_states, + position_ids, + residual, + ) + + if not get_pp_group().is_last_rank: + return IntermediateTensors({ + "hidden_states": hidden_states, + "residual": residual + }) + + hidden_states, _ = self.norm(hidden_states, residual) + return hidden_states + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + ("gate_up_proj", "gate_proj", 0), + ("gate_up_proj", "up_proj", 1), + ] + expert_params_mapping = FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.num_experts) + + params_dict = dict(self.named_parameters(remove_duplicate=False)) + loaded_params: set[str] = set() + for name, loaded_weight in weights: + if self.config.norm_head and "lm_head.weight" in name: + loaded_weight = F.normalize(loaded_weight, + dim=0, + p=2, + eps=1e-7) + + for (param_name, weight_name, shard_id) in stacked_params_mapping: + if weight_name not in name: + continue + if "mlp.experts" in name: + continue + name = name.replace(weight_name, param_name) + # Skip loading extra bias for GPTQ models. 
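+                # (such bias tensors have no matching entry in params_dict)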
+ if name.endswith(".bias") and name not in params_dict: + continue + if name not in params_dict: + continue + + if is_pp_missing_parameter(name, self): + continue + + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, loaded_weight, shard_id) + break + else: + for mapping in expert_params_mapping: + param_name, weight_name, expert_id, shard_id = mapping + if weight_name not in name: + continue + name = name.replace(weight_name, param_name) + + if is_pp_missing_parameter(name, self): + continue + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, + loaded_weight, + name, + shard_id=shard_id, + expert_id=expert_id) + break + else: + if name.endswith(".bias") and name not in params_dict: + continue + if name not in params_dict: + continue + + if is_pp_missing_parameter(name, self): + continue + + param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + return loaded_params + + +class BailingMoeForCausalLM(nn.Module, SupportsPP): + + packed_modules_mapping = { + "query_key_value": ["query_key_value"], + "gate_up_proj": [ + "gate_proj", + "up_proj", + ], + } + + def __init__( + self, + *, + vllm_config: VllmConfig, + prefix: str = "", + ) -> None: + super().__init__() + + config = vllm_config.model_config.hf_config + quant_config = vllm_config.quant_config + + self.config = config + self.quant_config = quant_config + self.max_position_embeddings = config.max_position_embeddings + self.model = BailingMoeModel(vllm_config=vllm_config, + prefix=maybe_prefix(prefix, "model")) + if get_pp_group().is_last_rank: + self.lm_head = (self.word_embeddings if config.tie_word_embeddings + else ParallelLMHead(config.vocab_size, + config.hidden_size, + quant_config=quant_config)) + self.logits_processor = LogitsProcessor(config.vocab_size) + else: + self.lm_head = PPMissingLayer() + + self.sampler = get_sampler() + self.make_empty_intermediate_tensors = ( + self.model.make_empty_intermediate_tensors) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.model.get_input_embeddings(input_ids) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + model_output = self.model(input_ids, positions, intermediate_tensors, + inputs_embeds) + return model_output + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + logits = self.logits_processor(self.lm_head, hidden_states, + sampling_metadata) + return logits + + def sample( + self, + logits: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[SamplerOutput]: + next_tokens = self.sampler(logits, sampling_metadata) + return next_tokens + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + loader = AutoWeightsLoader( + self, + skip_prefixes=(["lm_head."] + if self.config.tie_word_embeddings else None), + ) + return loader.load_weights(weights) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index b7d4789549a..79190860ac9 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -41,6 +41,7 @@ "BaiChuanForCausalLM": ("baichuan", "BaiChuanForCausalLM"), # 
baichuan-13b, lower case 'c' in the class name "BaichuanForCausalLM": ("baichuan", "BaichuanForCausalLM"), + "BailingMoeForCausalLM": ("bailing_moe", "BailingMoeForCausalLM"), "BambaForCausalLM": ("bamba", "BambaForCausalLM"), "BloomForCausalLM": ("bloom", "BloomForCausalLM"), "ChatGLMModel": ("chatglm", "ChatGLMForCausalLM"), From 33f7efd1b1a0dbbc25bd9b86cfd9719156f05c3f Mon Sep 17 00:00:00 2001 From: Richard Zou Date: Mon, 14 Jul 2025 10:52:17 -0400 Subject: [PATCH 067/552] [CI] cc folks on changes to vllm/compilation (#20925) Signed-off-by: Richard Zou Signed-off-by: x22x22 --- .github/CODEOWNERS | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 2acb03d52a6..6f6e3dc79da 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -16,6 +16,7 @@ /vllm/lora @jeejeelee /vllm/reasoning @aarnphm /vllm/entrypoints @aarnphm +/vllm/compilation @zou3519 CMakeLists.txt @tlrmchlsmth @LucasWilkinson # Any change to the VllmConfig changes can have a large user-facing impact, From b40289008529b8362223959085fe5bbf3722b137 Mon Sep 17 00:00:00 2001 From: Lu Fang <30275821+houseroad@users.noreply.github.com> Date: Mon, 14 Jul 2025 08:33:19 -0700 Subject: [PATCH 068/552] [CI] Update codeowner for compilation code (#20929) Signed-off-by: Lu Fang Signed-off-by: x22x22 --- .github/CODEOWNERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 6f6e3dc79da..7def035b792 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -16,7 +16,7 @@ /vllm/lora @jeejeelee /vllm/reasoning @aarnphm /vllm/entrypoints @aarnphm -/vllm/compilation @zou3519 +/vllm/compilation @zou3519 @youkaichao CMakeLists.txt @tlrmchlsmth @LucasWilkinson # Any change to the VllmConfig changes can have a large user-facing impact, From e44f579aa24fb11c1bfc054d82da411e7c1499cb Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Mon, 14 Jul 2025 23:36:43 +0800 Subject: [PATCH 069/552] [Misc] Clean up Aimv2 config registration in Ovis config (#20921) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: x22x22 --- vllm/transformers_utils/configs/ovis.py | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/vllm/transformers_utils/configs/ovis.py b/vllm/transformers_utils/configs/ovis.py index c2728f0ed64..021d402a71f 100644 --- a/vllm/transformers_utils/configs/ovis.py +++ b/vllm/transformers_utils/configs/ovis.py @@ -73,8 +73,6 @@ def __init__( IMAGE_ATOM_ID = -300 IMAGE_INDICATOR_IDS = [-301, -302, -303, -304, -305] -AutoConfig.register("aimv2", AIMv2Config) - # ---------------------------------------------------------------------- # Visual Tokenizer Configuration @@ -105,9 +103,11 @@ def __init__(self, f"expect `backbone_config` to be instance of PretrainedConfig or dict, but got {type(backbone_config)} type" if not isinstance(backbone_config, PretrainedConfig): model_type = backbone_config['model_type'] - backbone_config.pop('model_type') - backbone_config = AutoConfig.for_model(model_type, - **backbone_config) + if model_type != "aimv2": + backbone_config.pop('model_type') + backbone_config = AutoConfig.for_model(model_type, **backbone_config) + else: + backbone_config = AIMv2Config(**backbone_config) self.backbone_config = backbone_config self.hidden_stride = hidden_stride From e03fb4dd45f90626514e8cc92960cca039e60114 Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Tue, 15 Jul 2025 00:33:17 +0800 Subject: [PATCH 070/552] [CI/Build] Add Transformers nightly tests in CI (#20924) Signed-off-by: Isotr0py 
<2037008807@qq.com> Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index af0bf2ae364..4440187c36e 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -630,6 +630,18 @@ steps: # e.g. pytest -v -s models/encoder_decoder/vision_language/test_mllama.py # *To avoid merge conflicts, remember to REMOVE (not just comment out) them before merging the PR* +- label: Transformers Nightly Models Test + working_dir: "/vllm-workspace/" + optional: true + commands: + - pip install --upgrade git+https://github.com/huggingface/transformers + - pytest -v -s models/test_initialization.py + - pytest -v -s tests/models/multimodal/processing/ + - pytest -v -s tests/models/multimodal/test_mapping.py + - python3 examples/offline_inference/basic/chat.py + - python3 examples/offline_inference/audio_language.py --model-type whisper + - python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl + ##### 1 GPU test ##### ##### multi gpus test ##### From 13f36b35c306fd4af6edc97d782e639bfd3802f1 Mon Sep 17 00:00:00 2001 From: Tyler Michael Smith Date: Mon, 14 Jul 2025 12:54:52 -0400 Subject: [PATCH 071/552] Change default model to Qwen3-0.6B (#20335) Signed-off-by: Tyler Michael Smith Signed-off-by: x22x22 --- vllm/config.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/config.py b/vllm/config.py index 6f7aefab0a3..ce81fea2d64 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -226,7 +226,7 @@ def is_init_field(cls: ConfigType, name: str) -> bool: class ModelConfig: """Configuration for the model.""" - model: str = "facebook/opt-125m" + model: str = "Qwen/Qwen3-0.6B" """Name or path of the Hugging Face model to use. It is also used as the content for `model_name` tag in metrics output when `served_model_name` is not specified.""" From 4e96c303818039abc1d30588f4023f145cb0b961 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Tue, 15 Jul 2025 04:10:07 +0900 Subject: [PATCH 072/552] Add benchmark dataset for mlperf llama tasks (#20338) Signed-off-by: mgoin Signed-off-by: x22x22 --- vllm/benchmarks/datasets.py | 82 +++++++++++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) diff --git a/vllm/benchmarks/datasets.py b/vllm/benchmarks/datasets.py index fdc4e9175a7..45b58035ebe 100644 --- a/vllm/benchmarks/datasets.py +++ b/vllm/benchmarks/datasets.py @@ -654,6 +654,9 @@ def get_samples(args, tokenizer) -> list[SampleRequest]: elif args.dataset_path in ASRDataset.SUPPORTED_DATASET_PATHS: dataset_class = ASRDataset args.hf_split = "train" + elif args.dataset_path in MLPerfDataset.SUPPORTED_DATASET_PATHS: + dataset_class = MLPerfDataset + args.hf_split = "train" else: supported_datasets = set([ dataset_name for cls in HuggingFaceDataset.__subclasses__() @@ -1447,3 +1450,82 @@ def sample( ) self.maybe_oversample_requests(sampled_requests, num_requests) return sampled_requests + + +# ----------------------------------------------------------------------------- +# MLPerf Dataset Implementation +# ----------------------------------------------------------------------------- + + +class MLPerfDataset(HuggingFaceDataset): + """ + MLPerf Inference Dataset. + + Dataset on HF: + https://huggingface.co/datasets/mgoin/mlperf-inference-llama2-data + https://huggingface.co/datasets/mgoin/mlperf-inference-llama3.1-data + + Each record contains: + - "system_prompt": system role instruction. + - "question": user question. 
+ - "output": reference answer. + + We combine the system prompt and question into a chat-formatted prompt + (using the tokenizer's chat template) and set the expected output length to + the tokenized length of the provided reference answer. + """ + + SUPPORTED_DATASET_PATHS = { + "mgoin/mlperf-inference-llama2-data", + "mgoin/mlperf-inference-llama3.1-data", + } + + def sample( + self, + tokenizer: PreTrainedTokenizerBase, + num_requests: int, + output_len: Optional[int] = None, + **kwargs, + ) -> list[SampleRequest]: + # Force dynamic output length based on reference completion. + dynamic_output = output_len is None + sampled_requests: list[SampleRequest] = [] + + for item in self.data: + if len(sampled_requests) >= num_requests: + break + + system_prompt = item["system_prompt"] + question = item["question"] + reference_answer = item["output"] + + # Build chat-style prompt using tokenizer template, if available. + messages = [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": question}, + ] + prompt_formatted = tokenizer.apply_chat_template( + messages, add_generation_prompt=True, tokenize=False + ) + prompt_len = len(tokenizer(prompt_formatted).input_ids) + + # Determine output length from reference answer tokens. + ref_out_len = len( + tokenizer(reference_answer, add_special_tokens=False).input_ids + ) + expected_output_len = ref_out_len if dynamic_output else output_len + + # Validate sequence lengths. + if not is_valid_sequence(prompt_len, expected_output_len): + continue + + sampled_requests.append( + SampleRequest( + prompt=prompt_formatted, + prompt_len=prompt_len, + expected_output_len=expected_output_len, + ) + ) + + self.maybe_oversample_requests(sampled_requests, num_requests) + return sampled_requests From 3eba4186994252a122277cf69901e171bd1eed92 Mon Sep 17 00:00:00 2001 From: Varun Sundar Rabindranath Date: Tue, 15 Jul 2025 01:17:16 +0530 Subject: [PATCH 073/552] [Misc] ModularKernel : Perform WeightAndReduce inside TritonExperts & DeepGemmExperts (#20725) Signed-off-by: Varun Sundar Rabindranath Co-authored-by: Varun Sundar Rabindranath Signed-off-by: x22x22 --- .../layers/fused_moe/batched_deep_gemm_moe.py | 2 + .../batched_triton_or_deep_gemm_moe.py | 40 ++--- .../layers/fused_moe/cutlass_moe.py | 31 ++-- .../layers/fused_moe/deep_gemm_moe.py | 31 ++-- .../layers/fused_moe/fused_batched_moe.py | 14 +- .../layers/fused_moe/fused_moe.py | 71 +++++---- .../layers/fused_moe/modular_kernel.py | 150 +++++++++++------- .../fused_moe/topk_weight_and_reduce.py | 17 +- .../layers/fused_moe/triton_deep_gemm_moe.py | 4 + 9 files changed, 203 insertions(+), 157 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py index 70a580b9c4c..0b394329215 100644 --- a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py @@ -260,6 +260,7 @@ def apply( hidden_states: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, activation: str, global_num_experts: int, @@ -273,6 +274,7 @@ def apply( workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool, ): assert expert_tokens_meta is not None expert_num_tokens = expert_tokens_meta.expert_num_tokens diff --git a/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py 
b/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py index 41faced58f1..12df9bb34d2 100644 --- a/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py @@ -129,30 +129,22 @@ def workspace_shapes( return self.batched_triton_experts.workspace_shapes( a, aq, M, N, K, topk, global_num_experts, local_num_experts) - def apply( - self, - output: torch.Tensor, - hidden_states: torch.Tensor, - w1: torch.Tensor, - w2: torch.Tensor, - topk_ids: torch.Tensor, - activation: str, - global_num_experts: int, - expert_map: Optional[torch.Tensor], - w1_scale: Optional[torch.Tensor], - w2_scale: Optional[torch.Tensor], - w1_zp: Optional[torch.Tensor], - w2_zp: Optional[torch.Tensor], - a1q_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - workspace13: torch.Tensor, - workspace2: torch.Tensor, - expert_tokens_meta: Optional[mk.ExpertTokensMetadata], - ): + def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, + w1: torch.Tensor, w2: torch.Tensor, topk_weights: torch.Tensor, + topk_ids: torch.Tensor, activation: str, global_num_experts: int, + expert_map: Optional[torch.Tensor], + w1_scale: Optional[torch.Tensor], + w2_scale: Optional[torch.Tensor], w1_zp: Optional[torch.Tensor], + w2_zp: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, + workspace2: torch.Tensor, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool): experts = (self.batched_deep_gemm_experts if self.allow_deep_gemm else self.batched_triton_experts) assert experts is not None - experts.apply(output, hidden_states, w1, w2, topk_ids, activation, - global_num_experts, expert_map, w1_scale, w2_scale, - w1_zp, w2_zp, a1q_scale, a2_scale, workspace13, - workspace2, expert_tokens_meta) + experts.apply(output, hidden_states, w1, w2, topk_weights, topk_ids, + activation, global_num_experts, expert_map, w1_scale, + w2_scale, w1_zp, w2_zp, a1q_scale, a2_scale, workspace13, + workspace2, expert_tokens_meta, + apply_router_weight_on_input) diff --git a/vllm/model_executor/layers/fused_moe/cutlass_moe.py b/vllm/model_executor/layers/fused_moe/cutlass_moe.py index d6a30e34269..e479f1b4044 100644 --- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py +++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py @@ -291,26 +291,17 @@ def workspace_shapes( return (workspace1, workspace2, output, self.out_dtype if self.out_dtype is not None else a.dtype) - def apply( - self, - output: torch.Tensor, - hidden_states: torch.Tensor, - w1: torch.Tensor, - w2: torch.Tensor, - topk_ids: torch.Tensor, - activation: str, - global_num_experts: int, - expert_map: Optional[torch.Tensor], - w1_scale: Optional[torch.Tensor], - w2_scale: Optional[torch.Tensor], - w1_zp: Optional[torch.Tensor], - w2_zp: Optional[torch.Tensor], - a1q_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - workspace13: torch.Tensor, - workspace2: torch.Tensor, - expert_tokens_meta: Optional[mk.ExpertTokensMetadata], - ): + def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, + w1: torch.Tensor, w2: torch.Tensor, topk_weights: torch.Tensor, + topk_ids: torch.Tensor, activation: str, global_num_experts: int, + expert_map: Optional[torch.Tensor], + w1_scale: Optional[torch.Tensor], + w2_scale: Optional[torch.Tensor], w1_zp: Optional[torch.Tensor], + w2_zp: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], + 
a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, + workspace2: torch.Tensor, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool): assert w1_zp is None, "w1_zp is not supported in CUTLASS MoE" assert w2_zp is None, "w2_zp is not supported in CUTLASS MoE" diff --git a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py index b1107a1f479..cc5e7cf5714 100644 --- a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py @@ -13,7 +13,7 @@ from vllm.model_executor.layers.fused_moe.prepare_finalize import ( MoEPrepareAndFinalizeNoEP) from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( - TopKWeightAndReduceDelegate) + TopKWeightAndReduceContiguous, TopKWeightAndReduceNoOP) from vllm.model_executor.layers.fused_moe.utils import _resize_cache from vllm.model_executor.layers.quantization.utils.fp8_utils import ( per_token_group_quant_fp8) @@ -90,8 +90,7 @@ def supports_expert_map(self) -> bool: return True def finalize_weight_and_reduce_impl(self) -> mk.TopKWeightAndReduce: - # Let PrepareAndFinalize::finalize() decide the impl. - return TopKWeightAndReduceDelegate() + return TopKWeightAndReduceNoOP() def workspace_shapes( self, a: torch.Tensor, aq: torch.Tensor, M: int, N: int, K: int, @@ -104,9 +103,9 @@ def workspace_shapes( block_m = self.block_shape[0] M_sum = (M * topk) + num_experts * (block_m - 1) M_sum = round_up(M_sum, block_m) - workspace1 = (M_sum, max(N * 2, K)) + workspace1 = (M_sum, max(N // 2, K)) workspace2 = (M_sum, max(N, K)) - output = (M, topk, K) + output = (M, K) return (workspace1, workspace2, output, a.dtype) def apply( @@ -115,6 +114,7 @@ def apply( hidden_states: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, activation: str, global_num_experts: int, @@ -128,11 +128,14 @@ def apply( workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool, ): assert self.block_shape is not None a1q = hidden_states _, N, K = w1.size() + M, _ = output.size() + num_topk = topk_ids.size(1) if global_num_experts == -1: global_num_experts = w1.size(0) @@ -159,11 +162,12 @@ def apply( # Note: M_sum is different than the pre-permuted shape of a1q. 
M_sum = a1q.size(0) - mm1_out = _resize_cache(workspace13, (M_sum, N)) - act_out = _resize_cache(workspace2, (M_sum, N // 2)) - quant_out = _resize_cache(workspace13.view(dtype=torch.float8_e4m3fn), + mm1_out = _resize_cache(workspace2, (M_sum, N)) + act_out = _resize_cache(workspace13, (M_sum, N // 2)) + quant_out = _resize_cache(workspace2.view(dtype=torch.float8_e4m3fn), (M_sum, N // 2)) - mm2_out = _resize_cache(workspace2, (M_sum, K)) + mm2_out = _resize_cache(workspace13, (M_sum, K)) + perm_out = _resize_cache(workspace2, (M * num_topk, K)) m_grouped_fp8_gemm_nt_contiguous((a1q, a1q_scale), (w1, w1_scale), mm1_out, expert_ids) @@ -179,7 +183,14 @@ def apply( m_grouped_fp8_gemm_nt_contiguous((a2q, a2q_scale), (w2, w2_scale), mm2_out, expert_ids) - torch.index_select(mm2_out, 0, inv_perm, out=output.view((-1, K))) + torch.index_select(mm2_out, 0, inv_perm, out=perm_out) + + TopKWeightAndReduceContiguous().apply( + output=output, + fused_expert_output=perm_out, + topk_weights=topk_weights, + topk_ids=topk_ids, + apply_router_weight_on_input=apply_router_weight_on_input) def deep_gemm_moe_fp8( diff --git a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py index 61247e93091..b311ef1ac1c 100644 --- a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py @@ -696,15 +696,16 @@ def dequant(self, t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor: return t.to(f32) * group_broadcast(scale, t.shape) def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, - w1: torch.Tensor, w2: torch.Tensor, topk_ids: torch.Tensor, - activation: str, global_num_experts: int, + w1: torch.Tensor, w2: torch.Tensor, topk_weights: torch.Tensor, + topk_ids: torch.Tensor, activation: str, global_num_experts: int, expert_map: Optional[torch.Tensor], w1_scale: Optional[torch.Tensor], w2_scale: Optional[torch.Tensor], w1_zp: Optional[torch.Tensor], w2_zp: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, workspace2: torch.Tensor, - expert_tokens_meta: Optional[mk.ExpertTokensMetadata]): + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool): assert hidden_states.dim() == 3 assert expert_tokens_meta is not None expert_num_tokens = expert_tokens_meta.expert_num_tokens @@ -899,15 +900,16 @@ def workspace_shapes( return (workspace13, workspace2, output, a.dtype) def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, - w1: torch.Tensor, w2: torch.Tensor, topk_ids: torch.Tensor, - activation: str, global_num_experts: int, + w1: torch.Tensor, w2: torch.Tensor, topk_weights: torch.Tensor, + topk_ids: torch.Tensor, activation: str, global_num_experts: int, expert_map: Optional[torch.Tensor], w1_scale: Optional[torch.Tensor], w2_scale: Optional[torch.Tensor], w1_zp: Optional[torch.Tensor], w2_zp: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, workspace2: torch.Tensor, - expert_tokens_meta: Optional[mk.ExpertTokensMetadata]): + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool): # Check constraints. 
if self.use_int4_w4a16: assert hidden_states.size(-1) // 2 == w1.size(2), ( diff --git a/vllm/model_executor/layers/fused_moe/fused_moe.py b/vllm/model_executor/layers/fused_moe/fused_moe.py index 6a9767fc6f3..f0bffc7dae2 100644 --- a/vllm/model_executor/layers/fused_moe/fused_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_moe.py @@ -26,7 +26,7 @@ from vllm.model_executor.layers.fused_moe.prepare_finalize import ( MoEPrepareAndFinalizeNoEP) from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( - TopKWeightAndReduceDelegate) + TopKWeightAndReduceNoOP) from vllm.model_executor.layers.fused_moe.utils import ( _resize_cache, moe_kernel_quantize_input) from vllm.model_executor.layers.quantization.utils.mxfp4_utils import ( @@ -1606,8 +1606,7 @@ def supports_expert_map(self) -> bool: return True def finalize_weight_and_reduce_impl(self) -> mk.TopKWeightAndReduce: - # Let PrepareAndFinalize::finalize() decide the impl. - return TopKWeightAndReduceDelegate() + return TopKWeightAndReduceNoOP() def workspace_shapes( self, @@ -1620,9 +1619,9 @@ def workspace_shapes( global_num_experts: int, local_num_experts: int, ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: - workspace1 = (M, topk, max(N * 2, K)) - workspace2 = (M, topk, N) - output = (M, topk, K) + workspace1 = (M, topk, max(N // 2, K)) + workspace2 = (M, topk, max(N, K)) + output = (M, K) return (workspace1, workspace2, output, a.dtype) def apply( @@ -1631,6 +1630,7 @@ def apply( hidden_states: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, activation: str, global_num_experts: int, @@ -1644,6 +1644,7 @@ def apply( workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool, ): # Check constraints. 
if self.use_int4_w4a16: @@ -1696,37 +1697,39 @@ def apply( raise ValueError( f"Unsupported compute_type: {hidden_states.dtype}") - # We can reuse the memory between these because by the time we need - # cache3, we're done with cache1 - intermediate_cache1 = _resize_cache(workspace13, + # Note that the output tensor might be in workspace1 + intermediate_cache1 = _resize_cache(workspace2, (num_tokens, top_k_num, N)) - intermediate_cache2 = _resize_cache(workspace2, + intermediate_cache2 = _resize_cache(workspace13, (num_tokens * top_k_num, N // 2)) + intermediate_cache3 = _resize_cache(workspace2, + (num_tokens, top_k_num, K)) sorted_token_ids, expert_ids, num_tokens_post_padded = ( moe_align_block_size(topk_ids, config['BLOCK_SIZE_M'], global_num_experts, expert_map)) - invoke_fused_moe_kernel(hidden_states, - w1, - intermediate_cache1, - a1q_scale, - w1_scale, - w1_zp, - None, - sorted_token_ids, - expert_ids, - num_tokens_post_padded, - False, - top_k_num, - config, - compute_type=compute_type, - use_fp8_w8a8=self.use_fp8_w8a8, - use_int8_w8a8=self.use_int8_w8a8, - use_int8_w8a16=self.use_int8_w8a16, - use_int4_w4a16=self.use_int4_w4a16, - per_channel_quant=self.per_act_token_quant, - block_shape=self.block_shape) + invoke_fused_moe_kernel( + hidden_states, + w1, + intermediate_cache1, + a1q_scale, + w1_scale, + w1_zp, + None, # topk_weights + sorted_token_ids, + expert_ids, + num_tokens_post_padded, + False, # mul_routed_weights + top_k_num, + config, + compute_type=compute_type, + use_fp8_w8a8=self.use_fp8_w8a8, + use_int8_w8a8=self.use_int8_w8a8, + use_int8_w8a16=self.use_int8_w8a16, + use_int4_w4a16=self.use_int4_w4a16, + per_channel_quant=self.per_act_token_quant, + block_shape=self.block_shape) self.activation(activation, intermediate_cache2, intermediate_cache1.view(-1, N)) @@ -1739,15 +1742,15 @@ def apply( invoke_fused_moe_kernel(qintermediate_cache2, w2, - output, + intermediate_cache3, a2q_scale, w2_scale, w2_zp, - None, + topk_weights, sorted_token_ids, expert_ids, num_tokens_post_padded, - False, + not apply_router_weight_on_input, 1, config, compute_type=compute_type, @@ -1758,6 +1761,8 @@ def apply( per_channel_quant=self.per_act_token_quant, block_shape=self.block_shape) + ops.moe_sum(intermediate_cache3, output) + def modular_triton_fused_moe( use_fp8_w8a8: bool, diff --git a/vllm/model_executor/layers/fused_moe/modular_kernel.py b/vllm/model_executor/layers/fused_moe/modular_kernel.py index d0d8c7d6f41..028eee24178 100644 --- a/vllm/model_executor/layers/fused_moe/modular_kernel.py +++ b/vllm/model_executor/layers/fused_moe/modular_kernel.py @@ -360,6 +360,7 @@ def apply( hidden_states: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, activation: str, global_num_experts: int, @@ -373,6 +374,7 @@ def apply( workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[ExpertTokensMetadata], + apply_router_weight_on_input: bool, ): """ This function computes the intermediate result of a Mixture of Experts @@ -384,6 +386,8 @@ def apply( layer. - w1 (torch.Tensor): The first set of expert weights. - w2 (torch.Tensor): The second set of expert weights. + - topk_weights: A map of row to expert weights. Some implementations + choose to do weight application. - topk_ids (torch.Tensor): A map of row to expert id. - activation (str): The activation function to apply after the first MoE layer. 
@@ -409,6 +413,9 @@ def apply( ExpertTokensMetadata object containing gpu/cpu tensors as big as the number of local experts with the information about the number of tokens assigned to each local expert. + - apply_router_weight_on_input: True if router weights are already + applied on the input. This is relevant if the implementation + chooses to do weight application. """ raise NotImplementedError @@ -452,17 +459,21 @@ def __init__( f"{fused_experts.__class__.__name__}." f"{fused_experts.activation_formats[0]}") - def _do_fused_experts( - self, fused_out: Optional[torch.Tensor], a1: torch.Tensor, - a1q: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, - topk_ids: torch.Tensor, activation: str, global_num_experts: int, - local_num_experts: int, expert_map: Optional[torch.Tensor], - w1_scale: Optional[torch.Tensor], w2_scale: Optional[torch.Tensor], - w1_zp: Optional[torch.Tensor], w2_zp: Optional[torch.Tensor], - a1q_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - expert_tokens_meta: Optional[ExpertTokensMetadata] - ) -> torch.Tensor: + def _do_fused_experts(self, fused_out: Optional[torch.Tensor], + a1: torch.Tensor, a1q: torch.Tensor, + w1: torch.Tensor, w2: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, + activation: str, global_num_experts: int, + local_num_experts: int, + expert_map: Optional[torch.Tensor], + w1_scale: Optional[torch.Tensor], + w2_scale: Optional[torch.Tensor], + w1_zp: Optional[torch.Tensor], + w2_zp: Optional[torch.Tensor], + a1q_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], + expert_tokens_meta: Optional[ExpertTokensMetadata], + apply_router_weight_on_input: bool) -> torch.Tensor: _, M, N, K, top_k = _moe_problem_size(a1q, w1, w2, topk_ids) @@ -485,36 +496,49 @@ def _do_fused_experts( # reuse workspace13 for the output fused_out = _resize_cache(workspace13, fused_out_shape) - self.fused_experts.apply(fused_out, - a1q, - w1, - w2, - topk_ids=topk_ids, - activation=activation, - global_num_experts=global_num_experts, - expert_map=expert_map, - w1_scale=w1_scale, - w2_scale=w2_scale, - w1_zp=w1_zp, - w2_zp=w2_zp, - a1q_scale=a1q_scale, - a2_scale=a2_scale, - workspace13=workspace13, - workspace2=workspace2, - expert_tokens_meta=expert_tokens_meta) + self.fused_experts.apply( + fused_out, + a1q, + w1, + w2, + topk_weights=topk_weights, + topk_ids=topk_ids, + activation=activation, + global_num_experts=global_num_experts, + expert_map=expert_map, + w1_scale=w1_scale, + w2_scale=w2_scale, + w1_zp=w1_zp, + w2_zp=w2_zp, + a1q_scale=a1q_scale, + a2_scale=a2_scale, + workspace13=workspace13, + workspace2=workspace2, + expert_tokens_meta=expert_tokens_meta, + apply_router_weight_on_input=apply_router_weight_on_input) return fused_out def _maybe_chunk_fused_experts( - self, a1: torch.Tensor, a1q: torch.Tensor, w1: torch.Tensor, - w2: torch.Tensor, topk_ids: torch.Tensor, activation: str, - global_num_experts: int, local_num_experts: int, - expert_map: Optional[torch.Tensor], - w1_scale: Optional[torch.Tensor], w2_scale: Optional[torch.Tensor], - w1_zp: Optional[torch.Tensor], w2_zp: Optional[torch.Tensor], - a1q_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - expert_tokens_meta: Optional[ExpertTokensMetadata] + self, + a1: torch.Tensor, + a1q: torch.Tensor, + w1: torch.Tensor, + w2: torch.Tensor, + topk_weights: torch.Tensor, + topk_ids: torch.Tensor, + activation: str, + global_num_experts: int, + local_num_experts: int, + expert_map: Optional[torch.Tensor], + w1_scale: Optional[torch.Tensor], 
+ w2_scale: Optional[torch.Tensor], + w1_zp: Optional[torch.Tensor], + w2_zp: Optional[torch.Tensor], + a1q_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], + expert_tokens_meta: Optional[ExpertTokensMetadata], + apply_router_weight_on_input: bool, ) -> torch.Tensor: _, M, N, K, top_k = _moe_problem_size(a1q, w1, w2, topk_ids) @@ -529,6 +553,7 @@ def _maybe_chunk_fused_experts( a1q=a1q, w1=w1, w2=w2, + topk_weights=topk_weights, topk_ids=topk_ids, activation=activation, global_num_experts=global_num_experts, @@ -540,7 +565,8 @@ def _maybe_chunk_fused_experts( w2_zp=w2_zp, a1q_scale=a1q_scale, a2_scale=a2_scale, - expert_tokens_meta=expert_tokens_meta) + expert_tokens_meta=expert_tokens_meta, + apply_router_weight_on_input=apply_router_weight_on_input) # Chunking required case assert num_chunks > 1 @@ -557,11 +583,12 @@ def _maybe_chunk_fused_experts( def slice_input_tensors( chunk_idx: int ) -> tuple[torch.Tensor, Optional[torch.Tensor], - Optional[torch.Tensor], torch.Tensor]: + Optional[torch.Tensor], torch.Tensor, torch.Tensor]: s = chunk_idx * CHUNK_SIZE e = min(s + CHUNK_SIZE, M) return (a1q[s:e], _chunk_scales(a1q_scale, s, e), - _chunk_scales(a2_scale, s, e), topk_ids[s:e]) + _chunk_scales(a2_scale, s, + e), topk_ids[s:e], topk_weights[s:e]) def slice_output_tensor(chunk_idx: int) -> torch.Tensor: assert fused_out.size(0) % M == 0, ( @@ -594,7 +621,7 @@ def slice_expert_tokens_metadata( expert_num_tokens_cpu=c_expert_num_tokens_cpu) for chunk_idx in range(num_chunks): - c_a1q, c_a1q_scale, c_a2_scale, c_topk_ids = ( + c_a1q, c_a1q_scale, c_a2_scale, c_topk_ids, c_topk_weights = ( slice_input_tensors(chunk_idx)) c_expert_tokens_meta = None @@ -603,23 +630,26 @@ def slice_expert_tokens_metadata( expert_tokens_meta, c_topk_ids, local_num_experts, expert_map) - self._do_fused_experts(fused_out=slice_output_tensor(chunk_idx), - a1=a1, - a1q=c_a1q, - w1=w1, - w2=w2, - topk_ids=c_topk_ids, - activation=activation, - global_num_experts=global_num_experts, - local_num_experts=local_num_experts, - expert_map=expert_map, - w1_scale=w1_scale, - w2_scale=w2_scale, - w1_zp=w1_zp, - w2_zp=w2_zp, - a1q_scale=c_a1q_scale, - a2_scale=c_a2_scale, - expert_tokens_meta=c_expert_tokens_meta) + self._do_fused_experts( + fused_out=slice_output_tensor(chunk_idx), + a1=a1, + a1q=c_a1q, + w1=w1, + w2=w2, + topk_weights=c_topk_weights, + topk_ids=c_topk_ids, + activation=activation, + global_num_experts=global_num_experts, + local_num_experts=local_num_experts, + expert_map=expert_map, + w1_scale=w1_scale, + w2_scale=w2_scale, + w1_zp=w1_zp, + w2_zp=w2_zp, + a1q_scale=c_a1q_scale, + a2_scale=c_a2_scale, + expert_tokens_meta=c_expert_tokens_meta, + apply_router_weight_on_input=apply_router_weight_on_input) return fused_out @@ -719,6 +749,7 @@ def forward( a1q=a1q, w1=w1, w2=w2, + topk_weights=topk_weights, topk_ids=topk_ids, activation=activation, global_num_experts=global_num_experts, @@ -730,7 +761,8 @@ def forward( w2_zp=w2_zp, a1q_scale=a1q_scale, a2_scale=a2_scale, - expert_tokens_meta=expert_tokens_meta) + expert_tokens_meta=expert_tokens_meta, + apply_router_weight_on_input=apply_router_weight_on_input) self.prepare_finalize.finalize( output, fused_out, topk_weights, topk_ids, diff --git a/vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py b/vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py index 9a5315b8b6f..fb398eec119 100644 --- a/vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py +++ 
b/vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py @@ -48,11 +48,18 @@ def apply(self, output: Optional[torch.Tensor], fused_expert_output: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, apply_router_weight_on_input: bool) -> torch.Tensor: - # Relax this if an explicit copy is necessary. Note that, - # if a copy is employed we have to make sure that the - # tensors don't overlap - assert output is None - return fused_expert_output + # Weight application and reduction operations are already done. + if output is None: + return fused_expert_output + + # MoEPrepareAndFinalizeNoEP needs the output to be in the `output` + # tensor. + assert output.size() == fused_expert_output.size(), ( + "output shape is expected to match the fused_expert_output shape. " + f"But got output={output.size()}, " + f"used_expert_output={fused_expert_output.size()}") + output.copy_(fused_expert_output, non_blocking=True) + return output class TopKWeightAndReduceContiguous(mk.TopKWeightAndReduce): diff --git a/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py index fefe74cc4ae..2f35c19b705 100644 --- a/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py @@ -122,6 +122,7 @@ def apply( hidden_states: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, activation: str, global_num_experts: int, @@ -135,6 +136,7 @@ def apply( workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool, ): use_deep_gemm = (self.allow_deep_gemm and (_valid_deep_gemm(hidden_states, w1, w2) @@ -148,6 +150,7 @@ def apply( hidden_states, w1, w2, + topk_weights, topk_ids, activation, global_num_experts, @@ -161,4 +164,5 @@ def apply( workspace13, workspace2, expert_tokens_meta, + apply_router_weight_on_input, ) From 04d357144433648de12f3de4d43b33012bf7e057 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nicol=C3=B2=20Lucchesi?= Date: Mon, 14 Jul 2025 22:08:36 +0200 Subject: [PATCH 074/552] [Misc] Relax translations tests (#20856) Signed-off-by: NickLucche Signed-off-by: x22x22 --- tests/entrypoints/openai/test_translation_validation.py | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/tests/entrypoints/openai/test_translation_validation.py b/tests/entrypoints/openai/test_translation_validation.py index 0c2cb367f33..79e769e3a1a 100644 --- a/tests/entrypoints/openai/test_translation_validation.py +++ b/tests/entrypoints/openai/test_translation_validation.py @@ -39,8 +39,8 @@ async def test_basic_audio(foscolo): # TODO remove once language detection is implemented extra_body=dict(language="it"), temperature=0.0) - out = json.loads(translation)['text'].strip() - assert "Nor will I ever touch the sacred" in out + out = json.loads(translation)['text'].strip().lower() + assert "greek sea" in out @pytest.mark.asyncio @@ -168,5 +168,4 @@ async def test_long_audio_request(foscolo): response_format="text", temperature=0.0) out = json.loads(translation)['text'].strip().lower() - # TODO investigate higher model uncertainty in for longer translations. 
- assert out.count("nor will i ever") == 2 + assert out.count("greek sea") == 2 From db574f5173cd966a267981f306047c64641b7524 Mon Sep 17 00:00:00 2001 From: Thomas Parnell Date: Mon, 14 Jul 2025 23:43:07 +0200 Subject: [PATCH 075/552] Fix overflow indexing in causal_conv1d kernel (#20938) Signed-off-by: Thomas Parnell Signed-off-by: x22x22 --- vllm/model_executor/layers/mamba/ops/causal_conv1d.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/vllm/model_executor/layers/mamba/ops/causal_conv1d.py b/vllm/model_executor/layers/mamba/ops/causal_conv1d.py index 6793f6def2b..a8bd0067bf4 100644 --- a/vllm/model_executor/layers/mamba/ops/causal_conv1d.py +++ b/vllm/model_executor/layers/mamba/ops/causal_conv1d.py @@ -92,7 +92,8 @@ def _causal_conv1d_fwd_kernel( # continuous batching if IS_CONTINUOUS_BATCHING: # cache_idx - conv_state_batch_coord = tl.load(conv_state_indices_ptr + idx_seq) + conv_state_batch_coord = tl.load(conv_state_indices_ptr + idx_seq).to( + tl.int64) else: # cache_idx conv_state_batch_coord = idx_seq From bb0f7b9da2e624b509fac0f176c5fb84a68235fa Mon Sep 17 00:00:00 2001 From: Kuntai Du Date: Mon, 14 Jul 2025 15:14:17 -0700 Subject: [PATCH 076/552] [Docs] remove outdated performance benchmark (#20935) Signed-off-by: Kuntai Du Signed-off-by: x22x22 --- README.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/README.md b/README.md index c4b14685526..dc2f0afbe35 100644 --- a/README.md +++ b/README.md @@ -63,8 +63,6 @@ vLLM is fast with: - Speculative decoding - Chunked prefill -**Performance benchmark**: We include a performance benchmark at the end of [our blog post](https://blog.vllm.ai/2024/09/05/perf-update.html). It compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [SGLang](https://github.com/sgl-project/sglang) and [LMDeploy](https://github.com/InternLM/lmdeploy)). The implementation is under [nightly-benchmarks folder](.buildkite/nightly-benchmarks/) and you can [reproduce](https://github.com/vllm-project/vllm/issues/8176) this benchmark using our one-click runnable script. 
- vLLM is flexible and easy to use with: - Seamless integration with popular Hugging Face models From 6c11418813664b7b1aabf4ff49f8c9b2906c4284 Mon Sep 17 00:00:00 2001 From: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com> Date: Mon, 14 Jul 2025 16:11:18 -0700 Subject: [PATCH 077/552] Fall back if flashinfer comm module not found (#20936) Signed-off-by: Yong Hoon Shin Signed-off-by: x22x22 --- vllm/compilation/collective_fusion.py | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/vllm/compilation/collective_fusion.py b/vllm/compilation/collective_fusion.py index 5892669a3a9..97cb2995cb3 100644 --- a/vllm/compilation/collective_fusion.py +++ b/vllm/compilation/collective_fusion.py @@ -20,10 +20,12 @@ from .vllm_inductor_pass import VllmInductorPass if find_spec("flashinfer"): - import flashinfer.comm as flashinfer_comm - - flashinfer_comm = (flashinfer_comm if hasattr( - flashinfer_comm, "trtllm_allreduce_fusion") else None) + try: + import flashinfer.comm as flashinfer_comm + flashinfer_comm = (flashinfer_comm if hasattr( + flashinfer_comm, "trtllm_allreduce_fusion") else None) + except ImportError: + flashinfer_comm = None else: flashinfer_comm = None from vllm.platforms import current_platform @@ -411,7 +413,8 @@ def __init__(self, config: VllmConfig, max_token_num: int): use_fp32_lamport = self.model_dtype == torch.float32 if flashinfer_comm is None: logger.warning( - "Flashinfer is not installed, skipping allreduce fusion pass") + "Flashinfer is not installed or comm module not found, " + "skipping allreduce fusion pass") return # Check if the world size is supported if self.tp_size not in _FI_MAX_SIZES: From 0cf6392100ec63085dd76fe36f9902fc612375e9 Mon Sep 17 00:00:00 2001 From: Alexander Matveev <59768536+alexm-redhat@users.noreply.github.com> Date: Mon, 14 Jul 2025 21:06:38 -0400 Subject: [PATCH 078/552] SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP (#20769) Signed-off-by: Alexander Matveev Signed-off-by: x22x22 --- CMakeLists.txt | 3 +- .../cutlass_sm100_mla/device/sm100_mla.hpp | 372 +++ .../kernel/sm100_fmha_mla_reduction.hpp | 203 ++ .../sm100_fmha_mla_tma_warpspecialized.hpp | 2023 +++++++++++++++++ .../kernel/sm100_mla_tile_scheduler.hpp | 165 ++ .../attention/mla/sm100_cutlass_mla_kernel.cu | 273 +++ csrc/ops.h | 13 + csrc/torch_bindings.cpp | 17 + vllm/_custom_ops.py | 20 + vllm/platforms/cuda.py | 7 + vllm/v1/attention/backends/mla/common.py | 5 + vllm/v1/attention/backends/mla/cutlass_mla.py | 184 +- 12 files changed, 3283 insertions(+), 2 deletions(-) create mode 100644 csrc/attention/mla/cutlass_sm100_mla/device/sm100_mla.hpp create mode 100644 csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_fmha_mla_reduction.hpp create mode 100644 csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_fmha_mla_tma_warpspecialized.hpp create mode 100644 csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_mla_tile_scheduler.hpp create mode 100644 csrc/attention/mla/sm100_cutlass_mla_kernel.cu diff --git a/CMakeLists.txt b/CMakeLists.txt index e59e912a991..513f4a87f8f 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -553,7 +553,8 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") cuda_archs_loose_intersection(MLA_ARCHS "10.0a" "${CUDA_ARCHS}") if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND MLA_ARCHS) set(SRCS - "csrc/attention/mla/cutlass_mla_kernels.cu") + "csrc/attention/mla/cutlass_mla_kernels.cu" + "csrc/attention/mla/sm100_cutlass_mla_kernel.cu") set_gencode_flags_for_srcs( SRCS "${SRCS}" CUDA_ARCHS 
"${MLA_ARCHS}") diff --git a/csrc/attention/mla/cutlass_sm100_mla/device/sm100_mla.hpp b/csrc/attention/mla/cutlass_sm100_mla/device/sm100_mla.hpp new file mode 100644 index 00000000000..95e32559cd5 --- /dev/null +++ b/csrc/attention/mla/cutlass_sm100_mla/device/sm100_mla.hpp @@ -0,0 +1,372 @@ +/*************************************************************************************************** + * Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + *this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + *ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE + *LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + *CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + *SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + *INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + *CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + *ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + *POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/* + * Taken from SGLANG PR https://github.com/sgl-project/sglang/pull/6929 + * by Alcanderian JieXin Liang + */ + +/*! + \file + \brief An universal device layer for cutlass 3.x-style kernels. 
+*/ + +// clang-format off +#pragma once + +// common +#include "cutlass/cutlass.h" +#include "cutlass/device_kernel.h" + +#if !defined(__CUDACC_RTC__) +#include "cutlass/cluster_launch.hpp" +#include "cutlass/trace.h" +#endif // !defined(__CUDACC_RTC__) + +#include "../kernel/sm100_fmha_mla_tma_warpspecialized.hpp" +#include "../kernel/sm100_fmha_mla_reduction.hpp" + +//////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::fmha::device { + +using namespace cute; +using namespace cutlass::fmha::kernel; + + +//////////////////////////////////////////////////////////////////////////////// +////////////////////////////// CUTLASS 3.x API ///////////////////////////////// +//////////////////////////////////////////////////////////////////////////////// + +template< + class Kernel_ +> +class MLA { +public: + + using Kernel = Kernel_; + + using ReductionKernel = cutlass::fmha::kernel::Sm100FmhaMlaReductionKernel< + typename Kernel::ElementOut, + typename Kernel::ElementAcc, + typename Kernel::ElementAcc, + Kernel::TileShapeH::value, + Kernel::TileShapeL::value, + 256 /*Max split*/ + >; + + /// Argument structure: User API + using KernelArguments = typename Kernel::Arguments; + using ReductionArguments = typename ReductionKernel::Arguments; + + using Arguments = KernelArguments; + + /// Argument structure: Kernel API + using KernelParams = typename Kernel::Params; + using ReductionParams = typename ReductionKernel::Params; + struct Params { + KernelParams fmha_params; + ReductionParams reduction_params; + }; + +private: + + /// Kernel API parameters object + Params params_; + + bool is_initialized(bool set = false) { + static bool initialized = false; + if (set) initialized = true; + return initialized; + } + + static ReductionArguments to_reduction_args(Arguments const& args) { + auto [H, K, D, B] = args.problem_shape; + return ReductionArguments{ + nullptr, args.epilogue.ptr_o, nullptr, args.epilogue.ptr_lse, + args.mainloop.softmax_scale, B, args.split_kv, K, args.mainloop.ptr_seq, + args.ptr_split_kv, Kernel::TileShapeS::value + }; + } + +public: + + /// Access the Params structure + Params const& params() const { + return params_; + } + + static void set_split_kv (KernelArguments& args) { + // printf("set_split_kv start"); + if (args.split_kv >= 1) return; + auto [H, K, D, B] = args.problem_shape; + // std::cout << H << " " << K << " " << D << " " << B << "\n"; + int sm_count = args.hw_info.sm_count; + // printf(" sm_count = %d\n", sm_count); + int max_splits = ceil_div(K, 128); + max_splits = min(16, max_splits); + // printf(" max_splits = %d\n", max_splits); + int sms_per_batch = max(1, sm_count / B); + // printf(" sms_per_batch = %d\n", sms_per_batch); + int split_heur = min(max_splits, sms_per_batch); + int waves = ceil_div(B * split_heur, sm_count); + int k_waves = ceil_div(max_splits, split_heur); + int split_wave_aware = ceil_div(max_splits, k_waves); + args.split_kv = split_wave_aware; + // printf(" args.split_kv = %d\n", args.split_kv); + + } + + /// Determines whether the GEMM can execute the given problem. + static Status + can_implement(Arguments const& args) { + if (! Kernel::can_implement(args)) { + return Status::kInvalid; + } + if (! 
ReductionKernel::can_implement(to_reduction_args(args))) { + return Status::kInvalid; + } + return Status::kSuccess; + } + + /// Gets the workspace size + static size_t + get_workspace_size(Arguments const& args) { + size_t workspace_bytes = 0; + workspace_bytes += Kernel::get_workspace_size(args); + workspace_bytes += ReductionKernel::get_workspace_size(to_reduction_args(args)); + return workspace_bytes; + } + + /// Computes the maximum number of active blocks per multiprocessor + static int maximum_active_blocks(int /* smem_capacity */ = -1) { + CUTLASS_TRACE_HOST("MLA::maximum_active_blocks()"); + int max_active_blocks = -1; + int smem_size = Kernel::SharedStorageSize; + + // first, account for dynamic smem capacity if needed + cudaError_t result; + if (smem_size >= (48 << 10)) { + CUTLASS_TRACE_HOST(" Setting smem size to " << smem_size); + result = cudaFuncSetAttribute( + device_kernel, + cudaFuncAttributeMaxDynamicSharedMemorySize, + smem_size); + if (cudaSuccess != result) { + result = cudaGetLastError(); // to clear the error bit + CUTLASS_TRACE_HOST( + " cudaFuncSetAttribute() returned error: " + << cudaGetErrorString(result)); + return -1; + } + } + + // query occupancy after setting smem size + result = cudaOccupancyMaxActiveBlocksPerMultiprocessor( + &max_active_blocks, + device_kernel, + Kernel::MaxThreadsPerBlock, + smem_size); + + if (cudaSuccess != result) { + result = cudaGetLastError(); // to clear the error bit + CUTLASS_TRACE_HOST( + " cudaOccupancyMaxActiveBlocksPerMultiprocessor() returned error: " + << cudaGetErrorString(result)); + return -1; + } + + CUTLASS_TRACE_HOST(" max_active_blocks: " << max_active_blocks); + return max_active_blocks; + } + + /// Initializes GEMM state from arguments. + Status + initialize(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr) { + CUTLASS_TRACE_HOST("MLA::initialize() - workspace " + << workspace << ", stream: " << (stream ? 
"non-null" : "null")); + + // Initialize the workspace + Status status = Kernel::initialize_workspace(args, workspace, stream); + if (status != Status::kSuccess) { + return status; + } + status = ReductionKernel::initialize_workspace(to_reduction_args(args), workspace, stream); + if (status != Status::kSuccess) { + return status; + } + KernelParams kernel_params = Kernel::to_underlying_arguments(args, workspace); + + ReductionArguments reduction_args = to_reduction_args(args); + if (reduction_args.split_kv > 1) { + reduction_args.ptr_oaccum = kernel_params.epilogue.ptr_o_acc; + reduction_args.ptr_lseaccum = kernel_params.epilogue.ptr_lse_acc; + } + ReductionParams reduction_params = ReductionKernel::to_underlying_arguments(reduction_args, workspace); + // Initialize the Params structure + params_ = Params {kernel_params, reduction_params}; + + if (is_initialized()) return Status::kSuccess; + + // account for dynamic smem capacity if needed + // no dynamic smem is needed for reduction kernel + int smem_size = Kernel::SharedStorageSize; + if (smem_size >= (48 << 10)) { + CUTLASS_TRACE_HOST(" Setting smem size to " << smem_size); + cudaError_t result = cudaFuncSetAttribute( + device_kernel, + cudaFuncAttributeMaxDynamicSharedMemorySize, + smem_size); + if (cudaSuccess != result) { + result = cudaGetLastError(); // to clear the error bit + CUTLASS_TRACE_HOST(" cudaFuncSetAttribute() returned error: " << cudaGetErrorString(result)); + return Status::kErrorInternal; + } + } + + is_initialized(true); + + return Status::kSuccess; + } + + /// Update API is preserved in 3.0, but does not guarantee a lightweight update of params. + Status + update(Arguments const& args, void* workspace = nullptr) { + CUTLASS_TRACE_HOST("MLA()::update() - workspace: " << workspace); + + size_t workspace_bytes = get_workspace_size(args); + if (workspace_bytes > 0 && nullptr == workspace) { + return Status::kErrorWorkspaceNull; + } + + auto fmha_params = Kernel::to_underlying_arguments(args, workspace); + + ReductionArguments reduction_args = to_reduction_args(args); + if (reduction_args.split_kv > 1) { + reduction_args.ptr_oaccum = fmha_params.epilogue.ptr_o_acc; + reduction_args.ptr_lseaccum = fmha_params.epilogue.ptr_lse_acc; + } + ReductionParams reduction_params = ReductionKernel::to_underlying_arguments(reduction_args, workspace); + // Initialize the Params structure + params_ = Params {fmha_params, reduction_params}; + + return Status::kSuccess; + } + + /// Primary run() entry point API that is static allowing users to create and manage their own params. 
+ /// Supplied params struct must be construct by calling Kernel::to_underling_arguments() + static Status + run(Params& params, cudaStream_t stream = nullptr) { + CUTLASS_TRACE_HOST("MLA::run()"); + dim3 const block = Kernel::get_block_shape(); + dim3 const grid = Kernel::get_grid_shape(params.fmha_params); + + // configure smem size and carveout + int smem_size = Kernel::SharedStorageSize; + + Status launch_result; + // Use extended launch API only for mainloops that use it + if constexpr(Kernel::ArchTag::kMinComputeCapability >= 90) { + dim3 cluster(cute::size<0>(typename Kernel::ClusterShape{}), + cute::size<1>(typename Kernel::ClusterShape{}), + cute::size<2>(typename Kernel::ClusterShape{})); + void const* kernel = (void const*) device_kernel; + void* kernel_params[] = {¶ms.fmha_params}; + launch_result = ClusterLauncher::launch(grid, cluster, block, smem_size, stream, kernel, kernel_params); + } + else { + launch_result = Status::kSuccess; + device_kernel<<>>(params.fmha_params); + } + + cudaError_t result = cudaGetLastError(); + if (cudaSuccess != result or Status::kSuccess != launch_result) { + //return Status::kSuccess; + CUTLASS_TRACE_HOST(" Kernel launch failed. Reason: " << result); + return Status::kErrorInternal; + } + if (params.reduction_params.split_kv > 1) { + // launch reduction kernel + dim3 const block = ReductionKernel::get_block_shape(); + dim3 const grid = ReductionKernel::get_grid_shape(params.reduction_params); + device_kernel<<>>(params.reduction_params); + cudaError_t result = cudaGetLastError(); + if (cudaSuccess == result) { + return Status::kSuccess; + } + else { + CUTLASS_TRACE_HOST(" Kernel launch failed. Reason: " << result); + return Status::kErrorInternal; + } + } + else { + return Status::kSuccess; + } + } + + // + // Non-static launch overloads that first create and set the internal params struct of this kernel handle. + // + + /// Launches the kernel after first constructing Params internal state from supplied arguments. + Status + run(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr) { + Status status = initialize(args, workspace, stream); + if (Status::kSuccess == status) { + status = run(params_, stream); + } + return status; + } + + /// Launches the kernel after first constructing Params internal state from supplied arguments. + Status + operator()(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr) { + return run(args, workspace, stream); + } + + /// Overload that allows a user to re-launch the same kernel without updating internal params struct. + Status + run(cudaStream_t stream = nullptr) { + return run(params_, stream); + } + + /// Overload that allows a user to re-launch the same kernel without updating internal params struct. 
+ Status + operator()(cudaStream_t stream = nullptr) { + return run(params_, stream); + } +}; + +//////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::fmha::device + +//////////////////////////////////////////////////////////////////////////////// diff --git a/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_fmha_mla_reduction.hpp b/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_fmha_mla_reduction.hpp new file mode 100644 index 00000000000..7b6e1dd2657 --- /dev/null +++ b/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_fmha_mla_reduction.hpp @@ -0,0 +1,203 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights + *reserved. SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + *this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + *ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE + *LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + *CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + *SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + *INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + *CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + *ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + *POSSIBILITY OF SUCH DAMAGE. 
+ * + **************************************************************************************************/ +/* + * Taken from SGLANG PR https://github.com/sgl-project/sglang/pull/6929 + * by Alcanderian JieXin Liang + */ + +// clang-format off +#pragma once + +#include "cutlass/cutlass.h" +#include "cutlass/arch/arch.h" +#include "cute/tensor.hpp" + +namespace cutlass::fmha::kernel { + +using namespace cute; +template< + class ElementOut, + class ElementAcc, + class ElementScale, + size_t kNumHeads, + size_t kHeadDimLatent, + int kMaxSplits +> +struct Sm100FmhaMlaReductionKernel { + + static const int SharedStorageSize = 0; + static const int MaxThreadsPerBlock = 128; + static const int MinBlocksPerMultiprocessor = 1; + + using ArchTag = cutlass::arch::Sm100; + + static_assert(kHeadDimLatent % MaxThreadsPerBlock == 0); + struct Arguments { + ElementAcc* ptr_oaccum = nullptr; + ElementOut* ptr_o = nullptr; + ElementAcc* ptr_lseaccum = nullptr; + ElementAcc* ptr_lse = nullptr; + ElementScale scale = 1.f; + int num_batches = 0; + int split_kv = -1; + int dim_k = -1; + int* ptr_seq = nullptr; + int* ptr_split_kv = nullptr; + int tile_shape_s = 128; + }; + using Params = Arguments; + + static Params to_underlying_arguments(Arguments const& args, void* workspace) { + return {args.ptr_oaccum, args.ptr_o, args.ptr_lseaccum, args.ptr_lse, + args.scale, args.num_batches, args.split_kv, args.dim_k, args.ptr_seq, + args.ptr_split_kv, args.tile_shape_s}; + } + + static size_t get_workspace_size(Arguments const& /*args*/) { + return 0; + } + + static Status initialize_workspace( + Arguments const& /*args*/, void* /*ws*/, cudaStream_t /*stream*/) { + return Status::kSuccess; + } + + static dim3 get_grid_shape(Params const& params) { + return dim3(kNumHeads, 1, params.num_batches); + } + + static dim3 get_block_shape() { + return dim3(MaxThreadsPerBlock, 1, 1); + } + + static bool can_implement(Arguments const& args) { + if (args.num_batches <= 0) return false; + if (args.split_kv <= 0) return false; + return true; + } + + CUTLASS_DEVICE void operator() (Params const& params, char* smem_raw) { + if (params.split_kv <= 1) return; + auto blk_coord = make_coord(blockIdx.x, _0{}, blockIdx.z); + + __shared__ ElementAcc sLseScale[kMaxSplits]; + const size_t offset_lseaccum = get<0>(blk_coord) + kNumHeads * params.split_kv * get<2>(blk_coord); + const size_t offset_lse = get<0>(blk_coord) + kNumHeads * get<2>(blk_coord); + + Tensor gLSEaccum = make_tensor(make_gmem_ptr(params.ptr_lseaccum + offset_lseaccum), + make_shape(params.split_kv), Stride>{}); + + Tensor gLSE = make_tensor(make_gmem_ptr(params.ptr_lse + offset_lse), + Shape<_1>{}, Stride<_1>{}); + + auto dim_k = params.ptr_seq == nullptr ? params.dim_k : params.ptr_seq[get<2>(blk_coord)]; + auto local_split_kv = params.ptr_split_kv == nullptr ? params.split_kv : params.ptr_split_kv[get<2>(blk_coord)]; + auto k_tile_total = ceil_div(dim_k, params.tile_shape_s); + auto k_tile_per_cta = ceil_div(k_tile_total, local_split_kv); + local_split_kv = ceil_div(k_tile_total, k_tile_per_cta); + + int warp_idx = cutlass::canonical_warp_idx_sync(); + if (warp_idx == 0) { + constexpr int kNLsePerThread = cute::ceil_div(kMaxSplits, 32); + + ElementAcc local_lse[kNLsePerThread]; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < kNLsePerThread; ++i) { + const int split = i * 32 + threadIdx.x; + local_lse[i] = split < local_split_kv ? 
gLSEaccum(split) : -std::numeric_limits::infinity(); + } + + ElementAcc lse_max = -std::numeric_limits::infinity(); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < kNLsePerThread; ++i) { + lse_max = max(lse_max, local_lse[i]); + } + CUTLASS_PRAGMA_UNROLL + for (int offset = 16; offset >= 1; offset /= 2) { + lse_max = max(lse_max, __shfl_xor_sync(0xffffffff, lse_max, offset)); + } + lse_max = lse_max == -std::numeric_limits::infinity() ? 0.0f : lse_max; // In case all local LSEs are -inf + lse_max = __shfl_sync(0xffffffff, lse_max, 0); + + ElementAcc sum_lse = 0; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < kNLsePerThread; ++i) { + sum_lse = sum_lse + expf(local_lse[i] - lse_max); + } + + CUTLASS_PRAGMA_UNROLL + for (int offset = 16; offset >= 1; offset /= 2) { + sum_lse = sum_lse + __shfl_xor_sync(0xffffffff, sum_lse, offset); + } + + sum_lse = __shfl_sync(0xffffffff, sum_lse, 0); + + ElementAcc global_lse = (sum_lse == 0.f || sum_lse != sum_lse) ? std::numeric_limits::infinity() : logf(sum_lse) + lse_max; + if (threadIdx.x == 0 and params.ptr_lse != nullptr) { + gLSE(0) = global_lse; + } + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < kNLsePerThread; ++i) { + const int split = i * 32 + threadIdx.x; + if (split < local_split_kv) { + sLseScale[split] = expf(local_lse[i] - global_lse); + } + } + } + __syncthreads(); + + constexpr int Elements = kHeadDimLatent / MaxThreadsPerBlock; + const size_t offset_oaccum = kHeadDimLatent * params.split_kv * (get<0>(blk_coord) + kNumHeads * get<2>(blk_coord)); + Tensor gOaccum = make_tensor(make_gmem_ptr(params.ptr_oaccum + offset_oaccum), + Shape>{}, Stride<_1>{}); + ElementAcc local_val[Elements] = {0}; + for (int split = 0; split < local_split_kv; ++split) { + ElementAcc lse_scale = sLseScale[split]; + CUTLASS_PRAGMA_UNROLL + for(int i = 0; i < Elements; ++i) { + local_val[i] += lse_scale * gOaccum(threadIdx.x + MaxThreadsPerBlock * i); + } + gOaccum.data() = gOaccum.data() + kHeadDimLatent; + } + auto ptr_o_local = params.ptr_o + (get<0>(blk_coord) + get<2>(blk_coord) * kNumHeads) * kHeadDimLatent; + Tensor gO = make_tensor(make_gmem_ptr(ptr_o_local), Shape>{}, Stride<_1>{}); + + CUTLASS_PRAGMA_UNROLL + for(int i = 0; i < Elements; ++i) { + gO(threadIdx.x + MaxThreadsPerBlock * i) = static_cast(local_val[i]); + } + } +}; + +} // namespace cutlass::fmha::kernel diff --git a/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_fmha_mla_tma_warpspecialized.hpp b/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_fmha_mla_tma_warpspecialized.hpp new file mode 100644 index 00000000000..2cbc2379579 --- /dev/null +++ b/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_fmha_mla_tma_warpspecialized.hpp @@ -0,0 +1,2023 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights + *reserved. SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + *this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. 
Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + *ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE + *LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + *CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + *SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + *INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + *CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + *ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + *POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/* + * Taken from SGLANG PR https://github.com/sgl-project/sglang/pull/6929 + * by Alcanderian JieXin Liang + */ + +// clang-format off +#pragma once + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cute/arch/simd_sm100.hpp" + +#include "cutlass/arch/arch.h" +#include "cutlass/arch/memory_sm80.h" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/gemm/collective/collective_builder.hpp" + +#include "gather_tensor.hpp" // from examples/common +#include "common/pow_2.hpp" + +namespace cutlass::fmha::kernel { + +using namespace cute; + +template< + class TileShape, + class Element_, + class ElementAcc_, + class ElementOut_, + class ElementLSE_, + class TileScheduler, +#ifdef CPASYNC + bool kIsCpAsync = true +#else + bool kIsCpAsync = false +#endif +> +struct Sm100FmhaMlaKernelTmaWarpspecialized { + + using Element = Element_; + using ElementAcc = ElementAcc_; + using ElementOut = ElementOut_; + using ElementLSE = ElementLSE_; + + // only 2Sm mode is supported + static const bool kIs2Sm = true; + static const int MaxThreadsPerBlock = 256; + static const int MinBlocksPerMultiprocessor = 1; + static const int TotalSNum = 2; + static const int TotalPNum = 2; + using ArchTag = cutlass::arch::Sm100; + + using ClusterShape = cute::conditional_t, Shape<_1, _1, _1>>; + + using TileShapeH = tuple_element_t<0, TileShape>; + using TileShapeS = tuple_element_t<1, TileShape>; + using TileShapeD = tuple_element_t<2, TileShape>; + + using TileShapeL = tuple_element_t<0, TileShapeD>; + using TileShapeR = tuple_element_t<1, TileShapeD>; + static_assert(TileShapeL{} % TileShapeR{} == 0, "Rope head dim must divide latent head dim"); + + using ProblemShape = Shape; + using TensorStride = Stride; + using TmemAllocator = cute::conditional_t; + + static_assert(TileShapeH{} == 128); + static const int kWarpsInN = kIs2Sm ? 2 : 1; + + static const int kNumComputeWarps = 4; + static const int kNumLoadWarps = kIsCpAsync ? 2 : 1; + + enum class WarpRole { + kMma = 0x1, kLoad = 0x2, kCompute = 0x3, kLoadPageTable = 0x4, kEmpty=0x0 + }; + + static const long long unsigned int kWarpAssignment = kIsCpAsync ? 
0x4221'3333ull : 0x0021'3333ull; + + static CUTLASS_DEVICE WarpRole warp_idx_to_role(int warp_idx) { + return static_cast((kWarpAssignment >> (4 * warp_idx)) & 0xF); + } + + static const int Alignment = 128 / sizeof_bits_v; + static const int AlignmentOut = 128 / sizeof_bits_v; + + using TileShapeQK = Shape; + static const int StagesQK = 24 / sizeof(Element); // free parameter + static const int IterationsQKLatent = decltype(TileShapeL{} / get<2>(TileShapeQK{}))::value; + static const int IterationsQKRope = decltype(TileShapeR{} / get<2>(TileShapeQK{}))::value; + static const int IterationsQK = IterationsQKLatent + IterationsQKRope; + + using Schedule = cute::conditional_t; + using CollectiveMmaQK = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + Element, TensorStride, Alignment, + Element, TensorStride, Alignment, + ElementAcc, + TileShapeQK, ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using TiledMmaQK = typename CollectiveMmaQK::TiledMma; + using CtaShapeQK = typename CollectiveMmaQK::CtaShape_MNK; + + // chosen for unified smem staging between K and V + using TileShapePV = Shape; + using TransposeTensorStride = decltype(select<1,0,2>(TensorStride{})); + static const int StagesPV = StagesQK; // not sure why, but must be at least two. check pipes + static const int IterationsPV_K = decltype(TileShapeS{} / get<2>(TileShapePV{}))::value; + static const int IterationsPV_N = decltype(TileShapeL{} / get<1>(TileShapePV{}))::value; + + using CollectiveMmaPV = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + Element, TensorStride, Alignment, + Element, TransposeTensorStride, Alignment, + ElementAcc, + TileShapePV, ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using CtaShapePV = typename CollectiveMmaPV::CtaShape_MNK; + static_assert(std::is_same_v); + + using TiledMmaPV = typename CollectiveMmaPV::TiledMma; + + using AtomThrShapeMNK = typename CollectiveMmaQK::AtomThrShapeMNK; + static_assert(typename CollectiveMmaQK::AtomThrShapeMNK{} == typename CollectiveMmaPV::AtomThrShapeMNK{}, "schedule must match"); + + static const int StagesPageTable = kIsCpAsync ? 
StagesPV : 1; + + // pipelines from load to mma, PipelineTmaUmmaAsync, stages tbd + // use expect_tx for Q load + using PipelineLoadQK = cute::conditional_t, PipelineTmaUmmaAsync>; + using PipelineLoadPV = PipelineLoadQK; + // pipeline from mma (Q@K) to softmax, PipelineUmmaAsync, 2 stages + using PipelineS = PipelineUmmaAsync; + // pipeline from softmax (P) to mma (bmm2), PipelineUmmaAsync, 2 stages + using PipelineP = PipelineUmmaConsumerAsync; + // pipeline from mma to softmax (for rescale), PipelineUmmaAsync, 1 stage + using PipelineO = PipelineUmmaAsync<1, AtomThrShapeMNK>; + + using PipelinePT = PipelineAsync; + + struct PipelineStorage { + alignas(16) typename PipelineLoadQK::SharedStorage load_qk; + alignas(16) typename PipelineS::SharedStorage mma_s; + alignas(16) typename PipelineP::SharedStorage p_mma; + alignas(16) typename PipelineO::SharedStorage mma_o; + alignas(16) typename PipelinePT::SharedStorage load_page_table; + }; + + template + static CUTE_DEVICE constexpr auto unstageSmemLayout(Layout const& layout, Stages stages = {}) { + return composition(layout, make_tuple(_, _, _, make_layout(stages))); + } + + using SmemLayoutQ = decltype(unstageSmemLayout(typename CollectiveMmaQK::SmemLayoutA{}, Int{})); + using SmemLayoutKC = typename CollectiveMmaQK::SmemLayoutB; + using SmemLayoutVC = typename CollectiveMmaPV::SmemLayoutB; + using SmemLayoutP = decltype(unstageSmemLayout(typename CollectiveMmaPV::SmemLayoutA{}, make_shape(Int{}, _2{}))); + + static const int kBytesLoadQ = size(AtomThrShapeMNK{}) * cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutQ{})) * cute::sizeof_bits_v); + static const int kBytesLoadKC = size(AtomThrShapeMNK{}) * cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutKC{})) * cute::sizeof_bits_v); + static const int kBytesLoadVC = size(AtomThrShapeMNK{}) * cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutVC{})) * cute::sizeof_bits_v); + // pre-condition for overlapped smem staging + static_assert(kBytesLoadKC == kBytesLoadVC); + static_assert(StagesQK == StagesPV); + + static const int kTransactionsBytesLoadQK = kBytesLoadKC; + static const int kTransactionsBytesLoadExtraQ = kBytesLoadQ; + static const int kTransactionsBytesLoadPV = kBytesLoadVC; + + static const int kNamedBarrierExchange = (int) cutlass::arch::ReservedNamedBarriers::TransformBarrier; + // This Named Barrier is introduced to solve Q tile loading overwritten issue when enable persistent + // tile scheduler for FP8 MLA. 
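+  // The load warps arrive_and_wait on this barrier after each tile and the
+  // compute warps arrive on it before running the epilogue, so smem_q is not
+  // overwritten while a previous tile is still being consumed.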
+ static const int kNamedBarrierEpilogue = (int) cutlass::arch::ReservedNamedBarriers::EpilogueBarrier; + // + static const int kNamedBarrierTmemDealloc = (int) cutlass::arch::ReservedNamedBarriers::TmemAllocBarrier; + + enum class TmemAllocation : uint32_t { + kSizeS = TileShapeS::value / kWarpsInN, + // Overall + kSizeO = TileShapeL::value / kWarpsInN, + // Between accumulators we loop over + kSizeAccO = decltype(get<1>(TileShapePV{}))::value / kWarpsInN, + kNumS = TotalSNum, + kNumP = TotalPNum, + kNumO = 1, + kS0 = 0, + kS1 = kS0 + kSizeS, + kO0 = kS1 + kSizeS, + kTotal = kO0 + kSizeO + }; + + static_assert(static_cast(TmemAllocation::kTotal) <= TmemAllocator::Sm100TmemCapacityColumns, "using too much tmem"); + + struct TensorStorage { + // to communicate max and row_sum + cute::array smem_exchange; + cute::array smem_page_table; + alignas(2048) cute::array> smem_q; + union { + alignas(2048) cute::array> smem_kc; + alignas(2048) cute::array> smem_vc; + }; + alignas(2048) cute::array> smem_p; + }; + + struct SharedStorage { + PipelineStorage pipelines; + TensorStorage tensors; + uint32_t tmem_base_ptr; + }; + + static const int SharedStorageSize = sizeof(SharedStorage); + static_assert(SharedStorageSize <= cutlass::arch::sm100_smem_capacity_bytes, "using too much smem"); + + struct MainloopArguments { + ElementAcc softmax_scale; + + // all tensors strides are (num_heads or seqlen, head_dim, batch) + // head_dim stride is always 1 + Element* ptr_q_latent; + TensorStride stride_q_latent; + Element* ptr_q_rope; + TensorStride stride_q_rope; + + Element* ptr_c_latent; + TensorStride stride_c_latent; + Element* ptr_k_rope; + TensorStride stride_k_rope; + + // for paged attention, we interpret what was previously [batch, seqlen] + // as [page_count, page_size], and index according to page_table + int* ptr_seq = nullptr; + int* ptr_page_table = nullptr; + // page table is [batch, seqlen or similar] + Stride<_1, int> stride_page_table = {}; + int page_count = 0; + int page_size = TileShapeS{}; // powers of two if kIsCpAsync, otherwise TileShapeS + }; + + struct EpilogueArguments { + ElementOut* ptr_o = nullptr; + TensorStride stride_o; + ElementLSE* ptr_lse = nullptr; + Stride<_1, int> stride_lse; + ElementAcc output_scale = 1.0f; + }; + + struct Arguments { + // (num_heads=128, seqlen, (d_latent=512, d_rope=64), batch_count) + // for paged attention, seqlen is max seqlen + ProblemShape problem_shape; + MainloopArguments mainloop; + EpilogueArguments epilogue; + KernelHardwareInfo hw_info; + int split_kv = -1; + int* ptr_split_kv = nullptr; + }; + + using TmaLoadQLatent = typename CollectiveMmaQK::Params::TMA_A; + using TmaLoadQRope = typename CollectiveMmaQK::Params::TMA_A; + using TmaLoadCLatent = typename CollectiveMmaQK::Params::TMA_B; + using TmaLoadKRope = typename CollectiveMmaQK::Params::TMA_B; + using TmaLoadCLatentTranspose = typename CollectiveMmaPV::Params::TMA_B; + + struct MainloopParams { + TmaLoadQLatent tma_load_q_latent; + TmaLoadQRope tma_load_q_rope; + TmaLoadCLatent tma_load_c_latent; + TmaLoadKRope tma_load_k_rope; + TmaLoadCLatentTranspose tma_load_c_latent_transpose; + }; + + struct EpilogueParams { + ElementOut* ptr_o = nullptr; + ElementAcc* ptr_o_acc = nullptr; + TensorStride stride_o; + TensorStride stride_o_acc; + ElementLSE* ptr_lse = nullptr; + ElementLSE* ptr_lse_acc = nullptr; + Stride<_1, int> stride_lse; + Stride<_1, int> stride_lse_acc; + ElementAcc output_scale = 1.0f; + }; + + struct Params { + ProblemShape problem_shape; + MainloopArguments mainloop; + 
EpilogueParams epilogue; + MainloopParams mainloop_params; + typename TileScheduler::Params tile_scheduler; + int split_kv = -1; + int* ptr_split_kv = nullptr; + }; + + static Params to_underlying_arguments(Arguments const& args, void* workspace) { + //workspace = nullptr; // let's get an error if one of these needs workspace + + auto [H, K, D, B] = args.problem_shape; + auto [L, R] = D; + + int paged_B = B; + int paged_K = K; + if (args.mainloop.ptr_page_table != nullptr) { + paged_B = args.mainloop.page_count; + paged_K = args.mainloop.page_size; + } + + auto params_qk_latent = CollectiveMmaQK::to_underlying_arguments( + make_shape(H, K, L, B), + typename CollectiveMmaQK::Arguments { + args.mainloop.ptr_q_latent, args.mainloop.stride_q_latent, + args.mainloop.ptr_c_latent, args.mainloop.stride_c_latent, + }, nullptr); + + auto params_qk_latent_paged = CollectiveMmaQK::to_underlying_arguments( + make_shape(H, paged_K, L, paged_B), + typename CollectiveMmaQK::Arguments { + args.mainloop.ptr_q_latent, args.mainloop.stride_q_latent, + args.mainloop.ptr_c_latent, args.mainloop.stride_c_latent, + }, nullptr); + + auto params_qk_rope = CollectiveMmaQK::to_underlying_arguments( + make_shape(H, K, R, B), + typename CollectiveMmaQK::Arguments { + args.mainloop.ptr_q_rope, args.mainloop.stride_q_rope, + args.mainloop.ptr_k_rope, args.mainloop.stride_k_rope, + }, nullptr); + + auto params_qk_rope_paged = CollectiveMmaQK::to_underlying_arguments( + make_shape(H, paged_K, R, paged_B), + typename CollectiveMmaQK::Arguments { + args.mainloop.ptr_q_rope, args.mainloop.stride_q_rope, + args.mainloop.ptr_k_rope, args.mainloop.stride_k_rope, + }, nullptr); + + + auto stride_c_latent_transpose = select<1,0,2>(args.mainloop.stride_c_latent); + auto params_pv_latent = CollectiveMmaPV::to_underlying_arguments( + make_shape(H, L, paged_K, paged_B), + typename CollectiveMmaPV::Arguments { + args.mainloop.ptr_q_latent, args.mainloop.stride_q_latent, // dummy, never used + args.mainloop.ptr_c_latent, stride_c_latent_transpose, + }, nullptr); + + MainloopParams mainloop_params { + params_qk_latent.tma_load_a, + params_qk_rope.tma_load_a, + params_qk_latent_paged.tma_load_b, + params_qk_rope_paged.tma_load_b, + params_pv_latent.tma_load_b + }; + + EpilogueParams epilogue_params; + + epilogue_params.ptr_o = args.epilogue.ptr_o; + epilogue_params.stride_o = args.epilogue.stride_o; + epilogue_params.ptr_lse = args.epilogue.ptr_lse; + epilogue_params.stride_lse = args.epilogue.stride_lse; + epilogue_params.output_scale = args.epilogue.output_scale; + + if (args.split_kv > 1) { + ElementAcc* ptr_o_acc = reinterpret_cast(workspace); + ElementLSE* ptr_lse_acc = reinterpret_cast(ptr_o_acc + H * L * args.split_kv * B); + epilogue_params.ptr_o_acc = ptr_o_acc; + epilogue_params.ptr_lse_acc = ptr_lse_acc; + + epilogue_params.stride_o_acc = make_tuple(static_cast(0 + L) * args.split_kv, _1{}, static_cast(0 + H * L) * args.split_kv); + epilogue_params.stride_lse_acc = make_tuple(_1{}, (0 + H) * args.split_kv); + } + + return {args.problem_shape, args.mainloop, epilogue_params, mainloop_params, + TileScheduler::to_underlying_arguments(args.problem_shape, args.hw_info, ClusterShape{}, args.split_kv), args.split_kv, args.ptr_split_kv}; + } + + static size_t get_workspace_size(Arguments const& args) { + ProblemShape problem_shape = args.problem_shape; + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + auto split_kv = args.split_kv; + return (sizeof(ElementAcc) * D_latent + sizeof(ElementLSE)) * H * split_kv * B; + 
} + static Status initialize_workspace( + Arguments const& /*args*/, void* /*ws*/, cudaStream_t /*stream*/) { + return Status::kSuccess; + } + + static dim3 get_grid_shape(Params const& params) { + return TileScheduler::get_grid_shape(params.tile_scheduler); + } + + static dim3 get_block_shape() { + dim3 block(MaxThreadsPerBlock, 1, 1); + return block; + } + + static bool can_implement(Arguments const& args) { + if (kIsCpAsync) { + if ((args.mainloop.page_size & (args.mainloop.page_size - 1)) != 0) { + return false; + } + if (args.mainloop.page_size > TileShapeS{}) { + return false; + } + } + else { + if (args.mainloop.ptr_page_table != nullptr && args.mainloop.page_size != TileShapeS{}) { + return false; + } + } + if (get<0>(args.problem_shape) != 128) { + return false; + } + if (get<1>(args.problem_shape) <= 0) { + return false; + } + if (args.split_kv <= 0) { + return false; + } + return true; + } + + + CUTLASS_DEVICE void operator()(Params const& params, char* smem_raw) { + + TileScheduler tile_scheduler(params.tile_scheduler); + + int warp_idx = cutlass::canonical_warp_idx_sync(); + auto role = warp_idx_to_role(warp_idx); + uint32_t lane_predicate = cute::elect_one_sync(); + + uint32_t cta_rank_in_cluster = cute::block_rank_in_cluster(); + int cta_coord_v = cta_rank_in_cluster % size<0>(AtomThrShapeMNK{}); + bool is_mma_leader_cta = cta_coord_v == 0; + + if (role == WarpRole::kLoad && lane_predicate && ! kIsCpAsync) { + prefetch_tma_descriptor(params.mainloop_params.tma_load_q_latent.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_c_latent.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_q_rope.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_k_rope.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_c_latent_transpose.get_tma_descriptor()); + } + SharedStorage& shared_storage = *reinterpret_cast(smem_raw); + + typename PipelineLoadQK::Params pipeline_load_qk_params; + if (role == WarpRole::kLoad) { + pipeline_load_qk_params.role = PipelineLoadQK::ThreadCategory::Producer; + } + if (role == WarpRole::kMma) { + pipeline_load_qk_params.role = PipelineLoadQK::ThreadCategory::Consumer; + } + if constexpr (kIsCpAsync) { + // we can make our life easier by unconditionally loading blocks + // since we know it'll always be legal + pipeline_load_qk_params.producer_arv_count = kNumLoadWarps * cutlass::NumThreadsPerWarp * size(AtomThrShapeMNK{}); + } + else { + pipeline_load_qk_params.is_leader = lane_predicate && (role == WarpRole::kLoad) && is_mma_leader_cta; + pipeline_load_qk_params.transaction_bytes = kTransactionsBytesLoadQK; + } + pipeline_load_qk_params.initializing_warp = 0; + PipelineLoadQK pipeline_load_qk(shared_storage.pipelines.load_qk, pipeline_load_qk_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineS::Params pipeline_mma_s_params; + if (role == WarpRole::kMma) { + pipeline_mma_s_params.role = PipelineS::ThreadCategory::Producer; + } + if (role == WarpRole::kCompute) { + pipeline_mma_s_params.role = PipelineS::ThreadCategory::Consumer; + } + pipeline_mma_s_params.consumer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp * size(AtomThrShapeMNK{}); + pipeline_mma_s_params.initializing_warp = 1; + PipelineS pipeline_mma_s( + shared_storage.pipelines.mma_s, + pipeline_mma_s_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename 
PipelineP::Params pipeline_p_mma_params; + if (role == WarpRole::kMma) { + pipeline_p_mma_params.role = PipelineP::ThreadCategory::Consumer; + } + if (role == WarpRole::kCompute) { + pipeline_p_mma_params.role = PipelineP::ThreadCategory::Producer; + } + pipeline_p_mma_params.producer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp * size(AtomThrShapeMNK{}); + pipeline_p_mma_params.consumer_arv_count = 1; + pipeline_p_mma_params.initializing_warp = 2; + PipelineP pipeline_p_mma( + shared_storage.pipelines.p_mma, + pipeline_p_mma_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineO::Params pipeline_mma_o_params; + if (role == WarpRole::kMma) { + pipeline_mma_o_params.role = PipelineO::ThreadCategory::Producer; + } + if (role == WarpRole::kCompute) { + pipeline_mma_o_params.role = PipelineO::ThreadCategory::Consumer; + } + pipeline_mma_o_params.consumer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp * size(AtomThrShapeMNK{}); + pipeline_mma_o_params.initializing_warp = 3; + PipelineO pipeline_mma_o( + shared_storage.pipelines.mma_o, + pipeline_mma_o_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelinePT::Params pipeline_pt_params; + if (role == WarpRole::kLoad) { + pipeline_pt_params.role = PipelinePT::ThreadCategory::Consumer; + } + if (role == WarpRole::kLoadPageTable) { + pipeline_pt_params.role = PipelinePT::ThreadCategory::Producer; + } + pipeline_pt_params.consumer_arv_count = kNumLoadWarps * cutlass::NumThreadsPerWarp; + pipeline_pt_params.producer_arv_count = cutlass::NumThreadsPerWarp; + pipeline_pt_params.initializing_warp = 4; + PipelinePT pipeline_page_table( + shared_storage.pipelines.load_page_table, + pipeline_pt_params); + + TmemAllocator tmem_allocator; + + pipeline_init_arrive_relaxed(size(ClusterShape{})); + + pipeline_load_qk.init_masks(ClusterShape{}); // do we need an update here for 2Sm? 
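+    // Initialize the multicast masks of the remaining pipelines as well.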
+ pipeline_mma_s.init_masks(ClusterShape{}); + pipeline_p_mma.init_masks(ClusterShape{}); + pipeline_mma_o.init_masks(ClusterShape{}); + + typename PipelineLoadQK::PipelineState pipeline_load_qk_consumer_state; + typename PipelineLoadQK::PipelineState pipeline_load_qk_producer_state = cutlass::make_producer_start_state(); + + typename PipelineS::PipelineState pipeline_mma_s_consumer_state; + typename PipelineS::PipelineState pipeline_mma_s_producer_state = cutlass::make_producer_start_state(); + + typename PipelineP::PipelineState pipeline_p_mma_consumer_state; + typename PipelineP::PipelineState pipeline_p_mma_producer_state = cutlass::make_producer_start_state(); + + typename PipelineO::PipelineState pipeline_mma_o_consumer_state; + typename PipelineO::PipelineState pipeline_mma_o_producer_state = cutlass::make_producer_start_state(); + + typename PipelinePT::PipelineState pipeline_pt_consumer_state; + typename PipelinePT::PipelineState pipeline_pt_producer_state = cutlass::make_producer_start_state(); + + pipeline_init_wait(size(ClusterShape{})); + + if (role == WarpRole::kLoadPageTable) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + load_page_table( + blk_coord, + problem_shape, + params.mainloop, + shared_storage.tensors, + pipeline_page_table, pipeline_pt_producer_state, + local_split_kv + ); + } + } + else if (role == WarpRole::kLoad) { + if constexpr (kIsCpAsync) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + load_cpasync( + blk_coord, + problem_shape, + params.mainloop, + params.mainloop_params, + shared_storage.tensors, + pipeline_load_qk, pipeline_load_qk_producer_state, + local_split_kv, + /* must be shared pipe */ + pipeline_page_table, pipeline_pt_consumer_state + ); + cutlass::arch::NamedBarrier((kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, kNamedBarrierEpilogue).arrive_and_wait(); + } + } + else { + if (params.mainloop.ptr_page_table != nullptr) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + load_tma( + blk_coord, + problem_shape, + params.mainloop, + params.mainloop_params, + shared_storage.tensors, + pipeline_load_qk, pipeline_load_qk_producer_state, + pipeline_load_qk, pipeline_load_qk_producer_state, + local_split_kv + ); 
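+            // Wait until the compute warps signal the end of this tile before
+            // the next iteration reloads smem_q.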
+ cutlass::arch::NamedBarrier((kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, kNamedBarrierEpilogue).arrive_and_wait(); + } + } + else { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + load_tma( + blk_coord, + problem_shape, + params.mainloop, + params.mainloop_params, + shared_storage.tensors, + pipeline_load_qk, pipeline_load_qk_producer_state, + pipeline_load_qk, pipeline_load_qk_producer_state, + local_split_kv + ); + cutlass::arch::NamedBarrier((kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, kNamedBarrierEpilogue).arrive_and_wait(); + } + } + } + } + else if (role == WarpRole::kMma) { + tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); + __syncwarp(); + + if (is_mma_leader_cta) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + mma(blk_coord, + problem_shape, + shared_storage.tensors, + pipeline_load_qk, pipeline_load_qk_consumer_state, + pipeline_load_qk, pipeline_load_qk_consumer_state, + pipeline_mma_s, pipeline_mma_s_producer_state, + pipeline_p_mma, pipeline_p_mma_consumer_state, + pipeline_mma_o, pipeline_mma_o_producer_state, + local_split_kv + ); + } + } + + //cutlass::arch::NamedBarrier((kNumComputeWarps + 1) * NumThreadsPerWarp, kNamedBarrierTmemDealloc).arrive_and_wait(); + + //uint32_t free_stage_ptr = shared_storage.tmem_base_ptr; + //tmem_allocator.free(free_stage_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } + else if (role == WarpRole::kCompute) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto split_kv = params.split_kv; + auto local_split_kv = split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + compute( + blk_coord, + problem_shape, + params.mainloop, // for softmax_scale + params.epilogue, + shared_storage.tensors, // for smem_comm + pipeline_mma_s, pipeline_mma_s_consumer_state, + pipeline_p_mma, pipeline_p_mma_producer_state, + pipeline_mma_o, pipeline_mma_o_consumer_state, + local_split_kv + ); + } + + //cutlass::arch::NamedBarrier((kNumComputeWarps + 1) * NumThreadsPerWarp, kNamedBarrierTmemDealloc).arrive(); + } + + cute::cluster_sync(); + cutlass::arch::NamedBarrier((kNumComputeWarps + 1) * NumThreadsPerWarp, kNamedBarrierTmemDealloc).arrive(); + if (role == WarpRole::kMma) { + uint32_t free_stage_ptr = shared_storage.tmem_base_ptr; + 
tmem_allocator.free(free_stage_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } + } + + template + CUTLASS_DEVICE void load_page_table( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + TensorStorage& shared_tensors, + PipelinePT& pipeline_page_table, + typename PipelinePT::PipelineState& pipeline_pt_producer_state, int const& split_kv) { + + auto [H, K, D, B] = problem_shape; + int batch_coord = get<2>(blk_coord); + + auto mPT_l = make_tensor(make_gmem_ptr(mainloop_args.ptr_page_table), + make_shape(mainloop_args.page_count, B), + mainloop_args.stride_page_table); + auto mPT = mPT_l(_, batch_coord); + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(blk_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + return; + } + + auto page_size = Pow2{mainloop_args.page_size}; + auto pages_per_tile = Pow2{TileShapeS{} / page_size}; + int thread_idx = threadIdx.x % cutlass::NumThreadsPerWarp; + +#if 1 + for (; k_tile_count > 0; ++k_index, --k_tile_count) { + pipeline_page_table.producer_acquire(pipeline_pt_producer_state); + + // assume a single warp + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < TileShapeS{}; i += cutlass::NumThreadsPerWarp) { + int idx = i + thread_idx; + bool guard = idx < pages_per_tile; + int smem_idx = pipeline_pt_producer_state.index() * TileShapeS::value + idx; + int pt_idx = pages_per_tile * k_index + idx; + + cutlass::arch::cp_async_zfill( + &shared_tensors.smem_page_table[smem_idx], &mPT(pt_idx), guard + ); + } + + pipeline_page_table.producer_commit(pipeline_pt_producer_state, cutlass::arch::cpasync_barrier_arrive); + ++pipeline_pt_producer_state; + } +#endif + } + + + struct Gather { + int& page_table_stage; + Pow2 pages_per_tile; + const int * __restrict__ smem_page_table; + + CUTLASS_DEVICE int operator()(int idx) const { + return smem_page_table[page_table_stage * TileShapeS::value + idx % pages_per_tile]; + } + + CUTLASS_DEVICE friend void print(Gather const&) { + printf(""); + } + + }; + + + template + CUTLASS_DEVICE void load_cpasync( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + MainloopParams const& mainloop_params, + TensorStorage& shared_tensors, + PipelineLoadQK& pipeline_load, + typename PipelineLoadQK::PipelineState& pipeline_load_producer_state, + int const& split_kv, + PipelinePT& pipeline_page_table, + typename PipelinePT::PipelineState& pipeline_pt_consumer_state) { + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + using X = Underscore; + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(blk_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + return; + } + + // partition all tensors + auto mQL = make_tensor(make_gmem_ptr(mainloop_args.ptr_q_latent), make_shape(H, D_latent, B), mainloop_args.stride_q_latent); + auto mQR = make_tensor(make_gmem_ptr(mainloop_args.ptr_q_rope), make_shape(H, D_rope, B), mainloop_args.stride_q_rope); + + int paged_B = mainloop_args.page_count; + auto paged_K = Pow2{mainloop_args.page_size}; + auto mPT_l = make_tensor(make_gmem_ptr(mainloop_args.ptr_page_table), make_shape(paged_B, B), mainloop_args.stride_page_table); + + int 
batch_coord = get<2>(blk_coord); + auto mPT = mPT_l(_, batch_coord); + + auto gQL = local_tile(mQL, TileShapeQK{}, make_coord(_,_,_), Step<_1, X, _1>{}); + auto gQR = local_tile(mQR, TileShapeQK{}, make_coord(_,_,_), Step<_1, X, _1>{}); + + ThrMMA cta_mma_qk = TiledMmaQK{}.get_slice(get<0>(blk_coord) % size(AtomThrShapeMNK{})); + ThrMMA cta_mma_pv = TiledMmaPV{}.get_slice(get<0>(blk_coord) % size(AtomThrShapeMNK{})); + + auto tSgQL = cta_mma_qk.partition_A(gQL); + auto tSgQR = cta_mma_qk.partition_A(gQR); + + Tensor sQ = make_tensor(make_smem_ptr(shared_tensors.smem_q.begin()), SmemLayoutQ{}); + Tensor sKC = make_tensor(make_smem_ptr(shared_tensors.smem_kc.begin()), SmemLayoutKC{}); + Tensor sVC = make_tensor(make_smem_ptr(shared_tensors.smem_vc.begin()), SmemLayoutVC{}); + + auto make_copy_for = [](auto sT) { + auto rT_a = sT.layout()(_, _, _, _0{}); + auto rT = make_ordered_layout(shape(rT_a), stride(rT_a)); + auto threads = Int{}; + auto values = Int{}; + return make_cotiled_copy( + Copy_Atom, Element>{}, + make_ordered_layout( + make_shape(threads, values), + make_stride(_1{}, _0{})), + rT); + }; + + // like cute::copy, but makes sure we do all page table lookups first + auto copy_split = [](auto atom, auto src, auto dst) { + auto src_v = group_modes<1, rank_v>(src); + auto dst_v = group_modes<1, rank_v>(dst); + + auto src_v_ptrs = make_tensor(size<1>(src_v)); + for (int i = 0; i < size<1>(src_v); i++) { + src_v_ptrs(i) = &src_v(_0{}, i); + } + + + for (int i = 0; i < size<1>(src_v); i++) { + auto src_v_i = make_tensor( + make_gmem_ptr(src_v_ptrs(i)), + make_shape(shape<0>(src_v)), + make_stride(make_stride(_1{}, _0{})) + ); + atom.call(src_v_i, dst_v(_, i)); + } + }; + + auto tiled_copy_q = make_copy_for(sQ); + auto tiled_copy_kc = make_copy_for(sKC); + auto tiled_copy_vc = make_copy_for(sVC); + + auto thr_copy_q = tiled_copy_q.get_thread_slice(threadIdx.x % (kNumLoadWarps * cutlass::NumThreadsPerWarp)); + auto thr_copy_kc = tiled_copy_kc.get_thread_slice(threadIdx.x % (kNumLoadWarps * cutlass::NumThreadsPerWarp)); + auto thr_copy_vc = tiled_copy_vc.get_thread_slice(threadIdx.x % (kNumLoadWarps * cutlass::NumThreadsPerWarp)); + + auto tQsQ = thr_copy_q.partition_D(sQ); + auto tQgQL = thr_copy_q.partition_S(tSgQL); + auto tQgQR = thr_copy_q.partition_S(tSgQR); + + auto tKCsKC = thr_copy_kc.partition_D(sKC); + auto tVCsVC = thr_copy_vc.partition_D(sVC); + + auto pipeline_pt_release_state = pipeline_pt_consumer_state; + + int page_table_stage = -1; + Pow2 pages_per_tile{TileShapeS{} / paged_K}; + const int * __restrict__ smem_page_table = shared_tensors.smem_page_table.begin(); + Gather gather{page_table_stage, pages_per_tile, smem_page_table}; + + auto mCL = make_tensor( + make_gmem_ptr(mainloop_args.ptr_c_latent), + ComposedLayout{ + make_layout( + make_shape(make_shape(paged_K, paged_B), _1{}), + make_stride(make_stride(get<0>(mainloop_args.stride_c_latent), example::CustomStride(gather, get<2>(mainloop_args.stride_c_latent))), get<1>(mainloop_args.stride_c_latent))), + make_coord(_0{}, _0{}), + make_identity_layout(make_shape(paged_K * paged_B, D_latent))}); + + auto mKR = make_tensor( + make_gmem_ptr(mainloop_args.ptr_k_rope), + ComposedLayout{ + make_layout( + make_shape(make_shape(paged_K, paged_B), _1{}), + make_stride(make_stride(get<0>(mainloop_args.stride_k_rope), example::CustomStride(gather, get<2>(mainloop_args.stride_k_rope))), get<1>(mainloop_args.stride_k_rope))), + make_coord(_0{}, _0{}), + make_identity_layout(make_shape(paged_K * paged_B, D_latent))}); + + auto mCLT = 
make_tensor( + make_gmem_ptr(mainloop_args.ptr_c_latent), + ComposedLayout{ + make_layout( + make_shape(_1{}, make_shape(paged_K, paged_B)), + make_stride(get<1>(mainloop_args.stride_c_latent), make_stride(get<0>(mainloop_args.stride_c_latent), example::CustomStride(gather, get<2>(mainloop_args.stride_c_latent))))), + make_coord(_0{}, _0{}), + make_identity_layout(make_shape(D_latent, paged_K * paged_B))}); + + auto gCL = local_tile(mCL, TileShapeQK{}, make_coord(_,_,_), Step{}); + auto gKR = local_tile(mKR, TileShapeQK{}, make_coord(_,_,_), Step{}); + auto gCLT = local_tile(mCLT, TileShapePV{}, make_coord(_,_,_), Step{}); + + auto tSgCL = cta_mma_qk.partition_B(gCL); + auto tSgKR = cta_mma_qk.partition_B(gKR); + auto tOgCLT = cta_mma_pv.partition_B(gCLT); + + auto tKCgCL = thr_copy_kc.partition_S(tSgCL); + auto tKCgKR = thr_copy_kc.partition_S(tSgKR); + auto tVCgCLT = thr_copy_vc.partition_S(tOgCLT); + + // latent is first in memory, so let's load it first always + // startup: alternate Q and K, set tx count appropriately, for k_idx = 0 + auto& pipeline_acquire_state = pipeline_load_producer_state; + auto pipeline_commit_state = pipeline_acquire_state; + int pipeline_offset = 0; + + for (int i = 0; i < StagesPV; i++) { + cutlass::arch::cp_async_fence(); + } + + auto load_stage = [&](auto fn) { + pipeline_load.producer_acquire(pipeline_acquire_state); + fn(pipeline_acquire_state.index()); + cutlass::arch::cp_async_fence(); + + ++pipeline_acquire_state; + ++pipeline_offset; + + if (pipeline_offset == StagesPV - 1) { + cutlass::arch::cp_async_wait(); + pipeline_load.producer_commit(pipeline_commit_state); + ++pipeline_commit_state; + --pipeline_offset; + } + }; + + pipeline_page_table.consumer_wait(pipeline_pt_consumer_state); + page_table_stage = pipeline_pt_consumer_state.index(); + ++pipeline_pt_consumer_state; + + // each Q/K tile consists of rope and latent + for (int i = 0; i < IterationsQKLatent; i++) { + load_stage([&](int index) { + cute::copy(tiled_copy_q, tQgQL(_, _, _, _, _0{}, i, batch_coord), tQsQ(_, _, _, _, i)); + copy_split(tiled_copy_kc, tKCgCL(_, _, _, _, k_index, i), tKCsKC(_, _, _, _, index)); + }); + } + + for (int i = 0; i < IterationsQKRope; i++) { + load_stage([&](int index) { + cute::copy(tiled_copy_q, tQgQR(_, _, _, _, _0{}, i, batch_coord), tQsQ(_, _, _, _, IterationsQKLatent + i)); + copy_split(tiled_copy_kc, tKCgKR(_, _, _, _, k_index, i), tKCsKC(_, _, _, _, index)); + }); + } + + k_index += 1; + k_tile_count -= 1; + + // assume k_tile_count >= 1 + // perform K+Q load here + CUTLASS_PRAGMA_NO_UNROLL + while (k_tile_count > 0) { + + pipeline_page_table.consumer_wait(pipeline_pt_consumer_state); + page_table_stage = pipeline_pt_consumer_state.index(); + ++pipeline_pt_consumer_state; + + for (int i = 0; i < IterationsQKLatent; i++) { + load_stage([&](int index) { + copy_split(tiled_copy_kc, tKCgCL(_, _, _, _, k_index, i), tKCsKC(_, _, _, _, index)); + }); + } + + for (int i = 0; i < IterationsQKRope; i++) { + load_stage([&](int index) { + copy_split(tiled_copy_kc, tKCgKR(_, _, _, _, k_index, i), tKCsKC(_, _, _, _, index)); + }); + } + + page_table_stage = pipeline_pt_release_state.index(); + + for (int i = 0; i < IterationsPV_K; i++) { + for (int j = 0; j < IterationsPV_N; j++) { + load_stage([&](int index) { + copy_split(tiled_copy_vc, tVCgCLT(_, _, _, _, j, IterationsPV_K * (k_index - 1) + i), tVCsVC(_, _, _, _, index)); + }); + } + } + + pipeline_page_table.consumer_release(pipeline_pt_release_state); + ++pipeline_pt_release_state; + + k_index += 1; + 
k_tile_count -= 1; + } + + page_table_stage = pipeline_pt_release_state.index(); + + for (int i = 0; i < IterationsPV_K; i++) { + for (int j = 0; j < IterationsPV_N; j++) { + load_stage([&](int index) { + copy_split(tiled_copy_vc, tVCgCLT(_, _, _, _, j, IterationsPV_K * (k_index - 1) + i), tVCsVC(_, _, _, _, index)); + }); + } + } + + pipeline_page_table.consumer_release(pipeline_pt_release_state); + ++pipeline_pt_release_state; + + while (pipeline_offset > 0) { + cutlass::arch::cp_async_fence(); + + cutlass::arch::cp_async_wait(); + pipeline_load.producer_commit(pipeline_commit_state); + ++pipeline_commit_state; + --pipeline_offset; + } + + cutlass::arch::cp_async_wait<0>(); + + } + + + template + CUTLASS_DEVICE void load_tma( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + MainloopParams const& mainloop_params, + TensorStorage& shared_tensors, + PipelineLoadQK& pipeline_load_qk, + typename PipelineLoadQK::PipelineState& pipeline_load_qk_producer_state, + PipelineLoadPV& pipeline_load_pv, + typename PipelineLoadPV::PipelineState& pipeline_load_pv_producer_state, + int const& split_kv) { + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(blk_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + return; + } + + using X = Underscore; + + // partition all tensors + auto mQL = mainloop_params.tma_load_q_latent.get_tma_tensor(make_shape(H, D_latent, B)); + auto mQR = mainloop_params.tma_load_q_rope.get_tma_tensor(make_shape(H, D_rope, B)); + + int paged_B = B; + int paged_K = K; + if constexpr (kIsPaged) { + paged_B = mainloop_args.page_count; + paged_K = mainloop_args.page_size; + } + auto mPT_l = make_tensor(make_gmem_ptr(mainloop_args.ptr_page_table), make_shape(paged_B, B), mainloop_args.stride_page_table); + + auto mCL = mainloop_params.tma_load_c_latent.get_tma_tensor(make_shape(paged_K, D_latent, paged_B)); + auto mKR = mainloop_params.tma_load_k_rope.get_tma_tensor(make_shape(paged_K, D_rope, paged_B)); + + auto mCLT = mainloop_params.tma_load_c_latent_transpose.get_tma_tensor(make_shape(D_latent, paged_K, paged_B)); + + auto gQL = local_tile(mQL, TileShapeQK{}, make_coord(_,_,_), Step<_1, X, _1>{}); + auto gQR = local_tile(mQR, TileShapeQK{}, make_coord(_,_,_), Step<_1, X, _1>{}); + + auto gCL = local_tile(mCL, TileShapeQK{}, make_coord(_,_,_), Step{}); + auto gKR = local_tile(mKR, TileShapeQK{}, make_coord(_,_,_), Step{}); + auto gCLT = local_tile(mCLT, TileShapePV{}, make_coord(_,_,_), Step{}); + + ThrMMA cta_mma_qk = TiledMmaQK{}.get_slice(get<0>(blk_coord) % size(AtomThrShapeMNK{})); + ThrMMA cta_mma_pv = TiledMmaPV{}.get_slice(get<0>(blk_coord) % size(AtomThrShapeMNK{})); + + auto tSgQL = cta_mma_qk.partition_A(gQL); + auto tSgQR = cta_mma_qk.partition_A(gQR); + + auto tSgCL = cta_mma_qk.partition_B(gCL); + auto tSgKR = cta_mma_qk.partition_B(gKR); + + auto tOgCLT = cta_mma_pv.partition_B(gCLT); + + Tensor sQ = make_tensor(make_smem_ptr(shared_tensors.smem_q.begin()), SmemLayoutQ{}); + Tensor sKC = make_tensor(make_smem_ptr(shared_tensors.smem_kc.begin()), SmemLayoutKC{}); + Tensor sVC = make_tensor(make_smem_ptr(shared_tensors.smem_vc.begin()), SmemLayoutVC{}); + + auto [tQLgQL_mkl, tQsQ] = tma_partition( + mainloop_params.tma_load_q_latent, _0{}, make_layout(_1{}), + 
group_modes<0,3>(sQ), group_modes<0,3>(tSgQL)); + + auto [tQRgQR_mkl, tQsQ_ignore] = tma_partition( + mainloop_params.tma_load_q_rope, _0{}, make_layout(_1{}), + group_modes<0,3>(sQ), group_modes<0,3>(tSgQR)); + + auto [tCLgCL_nkl, tKCsKC] = tma_partition( + mainloop_params.tma_load_c_latent, _0{}, make_layout(_1{}), + group_modes<0,3>(sKC), group_modes<0,3>(tSgCL)); + + auto [tKRgKR_nkl, tKCsKC_ignore] = tma_partition( + mainloop_params.tma_load_k_rope, _0{}, make_layout(_1{}), + group_modes<0,3>(sKC), group_modes<0,3>(tSgKR)); + + auto [tCLTgCLT_nkl, tVCsVC] = tma_partition( + mainloop_params.tma_load_c_latent_transpose, _0{}, make_layout(_1{}), + group_modes<0,3>(sVC), group_modes<0,3>(tOgCLT)); + + uint16_t mcast_mask = 0; + + int batch_coord = get<2>(blk_coord); + Tensor tQLgQL = tQLgQL_mkl(_, _, _, batch_coord); + Tensor tQRgQR = tQRgQR_mkl(_, _, _, batch_coord); + + auto mPT = mPT_l(_, batch_coord); + + Tensor tCLgCL = tCLgCL_nkl(_, _, _, _); + Tensor tKRgKR = tKRgKR_nkl(_, _, _, _); + + // careful: stage and k are swapped here! + Tensor tCLTgCLT = tCLTgCLT_nkl(_, _, _, _); + + // latent is first in memory, so let's load it first always + // startup: alternate Q and K, set tx count appropriately, for k_idx = 0 + + // each Q/K tile consists of rope and latent + for (int i = 0; i < IterationsQKLatent; i++) { + pipeline_load_qk.producer_expect_transaction(pipeline_load_qk_producer_state, kTransactionsBytesLoadExtraQ); + pipeline_load_qk.producer_acquire(pipeline_load_qk_producer_state); + auto tma_barrier = pipeline_load_qk.producer_get_barrier(pipeline_load_qk_producer_state); + + if (cute::elect_one_sync()) { + // expect the extra bytes + // load_qk ql + cute::copy(mainloop_params.tma_load_q_latent.with(*tma_barrier, mcast_mask), tQLgQL(_, _0{}, i), tQsQ(_, i)); + // load_qk cl + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_c_latent.with(*tma_barrier, mcast_mask), + tCLgCL(_, _0{}, i, mPT(k_index)), + tKCsKC(_, pipeline_load_qk_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_c_latent.with(*tma_barrier, mcast_mask), + tCLgCL(_, k_index, i, batch_coord), + tKCsKC(_, pipeline_load_qk_producer_state.index())); + } + } + ++pipeline_load_qk_producer_state; + } + + for (int i = 0; i < IterationsQKRope; i++) { + pipeline_load_qk.producer_expect_transaction(pipeline_load_qk_producer_state, kTransactionsBytesLoadExtraQ); + pipeline_load_qk.producer_acquire(pipeline_load_qk_producer_state); + auto tma_barrier = pipeline_load_qk.producer_get_barrier(pipeline_load_qk_producer_state); + + if (cute::elect_one_sync()) { + // expect the extra bytes + // load_qk ql + cute::copy(mainloop_params.tma_load_q_rope.with(*tma_barrier, mcast_mask), tQRgQR(_, _0{}, i), tQsQ(_, i + IterationsQKLatent)); + // load_qk cl + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_k_rope.with(*tma_barrier, mcast_mask), + tKRgKR(_, _0{}, i, mPT(k_index)), + tKCsKC(_, pipeline_load_qk_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_k_rope.with(*tma_barrier, mcast_mask), + tKRgKR(_, k_index, i, batch_coord), + tKCsKC(_, pipeline_load_qk_producer_state.index())); + } + } + ++pipeline_load_qk_producer_state; + } + + k_index += 1; + k_tile_count -= 1; + + // assume k_tile_count >= 1 + // perform K+Q load here + CUTLASS_PRAGMA_NO_UNROLL + while (k_tile_count > 0) { + + // perform K load + for (int i = 0; i < IterationsQKLatent; i++) { + pipeline_load_qk.producer_acquire(pipeline_load_qk_producer_state); + auto tma_barrier 
= pipeline_load_qk.producer_get_barrier(pipeline_load_qk_producer_state); + + if (cute::elect_one_sync()) { + // load_qk cl + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_c_latent.with(*tma_barrier, mcast_mask), + tCLgCL(_, _0{}, i, mPT(k_index)), + tKCsKC(_, pipeline_load_qk_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_c_latent.with(*tma_barrier, mcast_mask), + tCLgCL(_, k_index, i, batch_coord), + tKCsKC(_, pipeline_load_qk_producer_state.index())); + } + } + ++pipeline_load_qk_producer_state; + } + + for (int i = 0; i < IterationsQKRope; i++) { + pipeline_load_qk.producer_acquire(pipeline_load_qk_producer_state); + auto tma_barrier = pipeline_load_qk.producer_get_barrier(pipeline_load_qk_producer_state); + + if (cute::elect_one_sync()) { + // load_qk cl + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_k_rope.with(*tma_barrier, mcast_mask), + tKRgKR(_, _0{}, i, mPT(k_index)), + tKCsKC(_, pipeline_load_qk_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_k_rope.with(*tma_barrier, mcast_mask), + tKRgKR(_, k_index, i, batch_coord), + tKCsKC(_, pipeline_load_qk_producer_state.index())); + } + } + ++pipeline_load_qk_producer_state; + } + + // prefetch next K load to keep busy while we transpose-load from cache + const int kPrefetchDistance = 1; + for (int i = 0; i < IterationsQKLatent; i++) { + if (cute::elect_one_sync()) { + if constexpr (kIsPaged) { + if (k_tile_count > kPrefetchDistance) { + cute::prefetch( + mainloop_params.tma_load_c_latent, + tCLgCL(_, _0{}, i, mPT(k_index + kPrefetchDistance)) + ); + } + } + else { + cute::prefetch( + mainloop_params.tma_load_c_latent, + tCLgCL(_, k_index + kPrefetchDistance, i, batch_coord) + ); + } + } + } + + for (int i = 0; i < IterationsQKRope; i++) { + if (cute::elect_one_sync()) { + if constexpr (kIsPaged) { + if (k_tile_count > kPrefetchDistance) { + cute::prefetch( + mainloop_params.tma_load_k_rope, + tKRgKR(_, _0{}, i, mPT(k_index + kPrefetchDistance)) + ); + } + } + else { + cute::prefetch( + mainloop_params.tma_load_k_rope, + tKRgKR(_, k_index + kPrefetchDistance, i, batch_coord) + ); + } + } + } + + // perform V load (k_idx - 1) + + for (int i = 0; i < IterationsPV_K; i++) { + for (int j = 0; j < IterationsPV_N; j++) { + pipeline_load_pv.producer_acquire(pipeline_load_pv_producer_state); + auto tma_barrier = pipeline_load_pv.producer_get_barrier(pipeline_load_pv_producer_state); + + if (cute::elect_one_sync()) { + // load_pv cl + // note the transpose in indices! 
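+            // (V here is c_latent read back through the transposed TMA descriptor)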
+ // note we are off-by-one on k_index + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_c_latent_transpose.with(*tma_barrier, mcast_mask, cute::TMA::CacheHintSm100::EVICT_FIRST), + tCLTgCLT(_, j, i, mPT(k_index - 1)), + tVCsVC(_, pipeline_load_pv_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_c_latent_transpose.with(*tma_barrier, mcast_mask, cute::TMA::CacheHintSm100::EVICT_FIRST), + tCLTgCLT(_, j, IterationsPV_K * (k_index - 1) + i, batch_coord), + tVCsVC(_, pipeline_load_pv_producer_state.index()) + ); + } + } + ++pipeline_load_pv_producer_state; + } + } + + k_index += 1; + k_tile_count -= 1; + } + + for (int i = 0; i < IterationsPV_K; i++) { + for (int j = 0; j < IterationsPV_N; j++) { + pipeline_load_pv.producer_acquire(pipeline_load_pv_producer_state); + auto tma_barrier = pipeline_load_pv.producer_get_barrier(pipeline_load_pv_producer_state); + + if (cute::elect_one_sync()) { + // load_pv cl + // note the transpose in indices + // note we are off-by-one on k_index + + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_c_latent_transpose.with(*tma_barrier, mcast_mask, cute::TMA::CacheHintSm100::EVICT_FIRST), + tCLTgCLT(_, j, i, mPT(k_index - 1)), + tVCsVC(_, pipeline_load_pv_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_c_latent_transpose.with(*tma_barrier, mcast_mask, cute::TMA::CacheHintSm100::EVICT_FIRST), + tCLTgCLT(_, j, IterationsPV_K * (k_index - 1) + i, batch_coord), + tVCsVC(_, pipeline_load_pv_producer_state.index()) + ); + } + } + ++pipeline_load_pv_producer_state; + } + } + } + + template + CUTLASS_DEVICE void mma( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + TensorStorage& shared_tensors, + PipelineLoadQK& pipeline_load_qk, + typename PipelineLoadQK::PipelineState& pipeline_load_qk_consumer_state, + PipelineLoadPV& pipeline_load_pv, + typename PipelineLoadPV::PipelineState& pipeline_load_pv_consumer_state, + PipelineS& pipeline_mma_s, + typename PipelineS::PipelineState& pipeline_mma_s_producer_state, + PipelineP& pipeline_p_mma, + typename PipelineP::PipelineState& pipeline_p_mma_consumer_state, + PipelineO& pipeline_mma_o, + typename PipelineO::PipelineState& pipeline_mma_o_producer_state, + int const& split_kv) { + + auto [H, K, D, B] = problem_shape; + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(blk_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + return; + } + + // mma init + Tensor sQ = make_tensor(make_smem_ptr(shared_tensors.smem_q.begin()), SmemLayoutQ{}); + Tensor sKC = make_tensor(make_smem_ptr(shared_tensors.smem_kc.begin()), SmemLayoutKC{}); + Tensor sVC = make_tensor(make_smem_ptr(shared_tensors.smem_vc.begin()), SmemLayoutVC{}); + Tensor sP = make_tensor(make_smem_ptr((Element*) shared_tensors.smem_p.begin()), SmemLayoutP{}); + + Tensor tSrQ = TiledMmaQK::make_fragment_A(sQ); + Tensor tSrKC = TiledMmaQK::make_fragment_B(sKC); + Tensor tOrP = TiledMmaPV::make_fragment_A(sP); + Tensor tOrVC = TiledMmaPV::make_fragment_B(sVC); + + TiledMmaQK tiled_mma_qk; + TiledMmaPV tiled_mma_pv; + + Tensor tStS = partition_fragment_C(tiled_mma_qk, select<0,1>(TileShapeQK{})); + Tensor tItI = partition_fragment_C(tiled_mma_pv, select<0,1>(TileShapePV{})); + + tiled_mma_pv.accumulate_ = UMMA::ScaleOut::Zero; + + 
pipeline_mma_s.producer_acquire(pipeline_mma_s_producer_state); + + // Mma S0 S1 O0 S2 O1 ... Sn On-1 On + // S0 ownership -- ----- -- -- + // S1 ownership -- ----- ---- + // O ownership -- -- ---- -- + + tiled_mma_qk.accumulate_ = UMMA::ScaleOut::Zero; + for (int i = 0; i < IterationsQK; i++) { + pipeline_load_qk.consumer_wait(pipeline_load_qk_consumer_state); + int read_stage = pipeline_load_qk_consumer_state.index(); + + tStS.data() = uint32_t(pipeline_mma_s_producer_state.index() == 0 ? TmemAllocation::kS0 : TmemAllocation::kS1); + + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tSrQ); ++k_block) { + cute::gemm(tiled_mma_qk, + tSrQ(_,_,k_block,i), + tSrKC(_,_,k_block,read_stage), + tStS); + tiled_mma_qk.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_qk.consumer_release(pipeline_load_qk_consumer_state); + ++pipeline_load_qk_consumer_state; + } + + pipeline_mma_s.producer_commit(pipeline_mma_s_producer_state); + ++pipeline_mma_s_producer_state; + + k_tile_count -= 1; + + CUTLASS_PRAGMA_NO_UNROLL + while (k_tile_count > 0) { + + pipeline_mma_s.producer_acquire(pipeline_mma_s_producer_state); + tiled_mma_qk.accumulate_ = UMMA::ScaleOut::Zero; + for (int i = 0; i < IterationsQK; i++) { + pipeline_load_qk.consumer_wait(pipeline_load_qk_consumer_state); + int read_stage = pipeline_load_qk_consumer_state.index(); + + tStS.data() = uint32_t(pipeline_mma_s_producer_state.index() == 0 ? TmemAllocation::kS0 : TmemAllocation::kS1); + + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tSrQ); ++k_block) { + cute::gemm(tiled_mma_qk, + tSrQ(_,_,k_block,i), + tSrKC(_,_,k_block,read_stage), + tStS); + tiled_mma_qk.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_qk.consumer_release(pipeline_load_qk_consumer_state); + ++pipeline_load_qk_consumer_state; + } + + pipeline_mma_s.producer_commit(pipeline_mma_s_producer_state); + ++pipeline_mma_s_producer_state; + + pipeline_mma_o.producer_acquire(pipeline_mma_o_producer_state); + pipeline_p_mma.consumer_wait(pipeline_p_mma_consumer_state); + + for (int i = 0; i < IterationsPV_K; i++) { + auto acc_flag = tiled_mma_pv.accumulate_; + for (int j = 0; j < IterationsPV_N; j++) { + pipeline_load_pv.consumer_wait(pipeline_load_pv_consumer_state); + + int read_stage = pipeline_load_pv_consumer_state.index(); + + tItI.data() = uint32_t(TmemAllocation::kO0) + j * uint32_t(TmemAllocation::kSizeAccO); + tiled_mma_pv.accumulate_ = acc_flag; + + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tOrP); ++k_block) { + cute::gemm(tiled_mma_pv, + tOrP(_,_,k_block, make_coord(i, pipeline_p_mma_consumer_state.index())), + tOrVC(_,_,k_block,read_stage), + tItI); + tiled_mma_pv.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_pv.consumer_release(pipeline_load_pv_consumer_state); + ++pipeline_load_pv_consumer_state; + } + } + + pipeline_p_mma.consumer_release(pipeline_p_mma_consumer_state); + ++pipeline_p_mma_consumer_state; + pipeline_mma_o.producer_commit(pipeline_mma_o_producer_state); + ++pipeline_mma_o_producer_state; + + --k_tile_count; + } + + pipeline_mma_o.producer_acquire(pipeline_mma_o_producer_state); + pipeline_p_mma.consumer_wait(pipeline_p_mma_consumer_state); + + for (int i = 0; i < IterationsPV_K; i++) { + auto acc_flag = tiled_mma_pv.accumulate_; + for (int j = 0; j < IterationsPV_N; j++) { + pipeline_load_pv.consumer_wait(pipeline_load_pv_consumer_state); + + int read_stage = pipeline_load_pv_consumer_state.index(); + + tItI.data() = uint32_t(TmemAllocation::kO0) + j * 
uint32_t(TmemAllocation::kSizeAccO); + tiled_mma_pv.accumulate_ = acc_flag; + + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tOrP); ++k_block) { + cute::gemm(tiled_mma_pv, + tOrP(_,_,k_block, make_coord(i, pipeline_p_mma_consumer_state.index())), + tOrVC(_,_,k_block,read_stage), + tItI); + tiled_mma_pv.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_pv.consumer_release(pipeline_load_pv_consumer_state); + ++pipeline_load_pv_consumer_state; + } + } + + pipeline_p_mma.consumer_release(pipeline_p_mma_consumer_state); + ++pipeline_p_mma_consumer_state; + pipeline_mma_o.producer_commit(pipeline_mma_o_producer_state); + ++pipeline_mma_o_producer_state; + } + + + template + CUTLASS_DEVICE void softmax( + IsLastTile const& is_last_tile, + ElementAcc& row_max, + ElementAcc& row_sum, + ElementAcc& correction_factor, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + TensorStorage& shared_tensors, + int k_index, + uint32_t tmem_s, + int smem_p_index) { + + auto load_op = cute::SM100_TMEM_LOAD_32dp32b32x{}; + + TiledMmaQK tiled_mma_qk; + + Tensor tStS = partition_fragment_C(tiled_mma_qk, select<0,1>(TileShapeQK{})); + tStS.data() = tmem_s; + + CUTE_STATIC_ASSERT_V(shape<1>(tStS) == _1{}); + CUTE_STATIC_ASSERT_V(shape<2>(tStS) == _1{}); + Tensor tAcc = tStS(make_coord(_,_),_0{},_0{}); + + Tensor cS = make_identity_tensor(take<0,2>(CtaShapeQK{})); + + auto tiled_t2r = make_tmem_copy(load_op, tAcc); + auto thread_idx = threadIdx.x % size(tiled_t2r); + + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + Tensor tTR_cS = thread_t2r.partition_D(cS); + Tensor tTR_rAcc = make_tensor(shape(tTR_cS)); + + Tensor tTR_rS_frag = make_tensor(shape(tTR_rAcc)); + const int AlignmentS = 4; + Tensor tTR_tAcc = thread_t2r.partition_S(tAcc); + Tensor tTR_rAcc_vec = recast>(tTR_rAcc); + Tensor tTR_rS_vec = recast>(tTR_rS_frag); + + // load s + copy(tiled_t2r, tTR_tAcc, tTR_rAcc); + + if (is_last_tile) { + for (int i = 0; i < size(tTR_rAcc); i++) { + if (get<1>(tTR_cS(i)) + TileShapeS{} * k_index >= get<1>(problem_shape)) { + tTR_rAcc(i) = -std::numeric_limits::infinity(); + } + } + } + + // max + ElementAcc row_max_new = row_max; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i += 1) { + row_max_new = ::fmax(row_max_new, tTR_rAcc(i)); + } + + // for 2x2 dp, reduce here + if constexpr (kWarpsInN > 1) { + shared_tensors.smem_exchange[threadIdx.x] = row_max_new; + cutlass::arch::NamedBarrier(kNumComputeWarps*NumThreadsPerWarp, kNamedBarrierExchange).sync(); + // (64, 2) shape + int peer_index = (threadIdx.x + 64) % 128; + row_max_new = cutlass::max(row_max_new, shared_tensors.smem_exchange[peer_index]); + } + +#ifndef B2B + // find correction factor + ElementAcc softmax_scale_log2 = mainloop_args.softmax_scale * static_cast(M_LOG2E); + correction_factor = ::exp2f(softmax_scale_log2 * (row_max - row_max_new)); + row_max = row_max_new; + + // softmax + ElementAcc row_max_scale_log2 = row_max * softmax_scale_log2; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i++) { + tTR_rAcc(i) = ::exp2f(softmax_scale_log2 * tTR_rAcc(i) - row_max_scale_log2); + } +#endif + + // quantize + cutlass::NumericArrayConverter epilogue_op; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc_vec); i++) { + tTR_rS_vec(i) = epilogue_op(tTR_rAcc_vec(i)); + } + + Tensor sP = make_tensor(make_smem_ptr((Element*) shared_tensors.smem_p.begin()), SmemLayoutP{})(_, _, _, make_coord(_, smem_p_index)); + + Tensor tOcP = TiledMmaPV{}.get_slice(_0{}).partition_A(cS); 
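+    // Store the quantized P fragment to shared memory in the layout the
+    // P@V MMA expects to read it back with.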
+ + // have a mapping for each thread to coord + // find identical mapping to coords for the MMA + auto l = make_ordered_layout(make_shape(make_shape(_64{}, _2{}), make_shape(_16{}, TileShapeS{} / _32{})), make_stride(make_stride(_0{}, _3{}), make_stride(_1{}, _2{}))); + auto sP_ = as_position_independent_swizzle_tensor(sP); + copy_aligned(tTR_rS_frag, sP_.compose(l)(threadIdx.x, _)); + + // sum + row_sum *= correction_factor; + + static_assert(cute::is_same_v); + auto tTR_rAcc_float2 = recast(tTR_rAcc); + auto sums = make_tensor(_4{}); + static_assert(size(tTR_rAcc_float2) % size(sums) == 0); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(sums); i++) { + sums(i) = tTR_rAcc_float2(i); + } + CUTLASS_PRAGMA_UNROLL + for (int i = size(sums); i < size(tTR_rAcc_float2); i += size(sums)) { + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < size(sums); j++) { + cute::add(sums(j), sums(j), tTR_rAcc_float2(i + j)); + } + } + CUTLASS_PRAGMA_UNROLL + for (int i = 1; i < size(sums); i *= 2) { + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < size(sums); j += 2*i) { + cute::add(sums(j), sums(j), sums(j+i)); + } + } + row_sum += sums(0).x + sums(0).y; + } + + + CUTLASS_DEVICE void rescale( + ElementAcc correction_factor, + uint32_t tmem_o) { + + // for b2b gemm, do nothing +#ifndef B2B + auto load_op = cute::SM100_TMEM_LOAD_32dp32b32x{}; + auto store_op = TMEM::tmem_load_to_store(load_op); + + TiledMmaPV tiled_mma_pv; + + Tensor tItI = partition_fragment_C(tiled_mma_pv, select<0,1>(TileShapePV{})); + tItI.data() = tmem_o; + + CUTE_STATIC_ASSERT_V(shape<1>(tItI) == _1{}); + CUTE_STATIC_ASSERT_V(shape<2>(tItI) == _1{}); + Tensor tAcc = tItI(make_coord(_,_),_0{},_0{}); + + auto cta_tiler_pv = take<0,2>(typename CollectiveMmaPV::CtaShape_MNK{}); + Tensor gO = make_tensor(make_gmem_ptr((ElementAcc*) nullptr), cta_tiler_pv, make_stride(0, 0)); + + auto tiled_t2r = make_tmem_copy(load_op, tAcc); + auto tiled_r2t = make_tmem_copy(store_op, tAcc); + auto thread_idx = threadIdx.x % size(tiled_t2r); + + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + auto thread_r2t = tiled_r2t.get_slice(thread_idx); + Tensor tTR_gO = thread_t2r.partition_D(gO); + Tensor tTR_rAcc = make_tensor(shape(tTR_gO)); + + Tensor tTR_tAcc = thread_t2r.partition_S(tAcc); + + // load o + copy(tiled_t2r, tTR_tAcc, tTR_rAcc); + + // multiply by correction factor + float2 correction_factor_vec = make_float2(correction_factor, correction_factor); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i += 2) { + float2 in = make_float2(tTR_rAcc(i + 0), tTR_rAcc(i + 1)); + float2 out; + cute::mul(out, in, correction_factor_vec); + tTR_rAcc(i + 0) = out.x; + tTR_rAcc(i + 1) = out.y; + } + + // store o + copy(tiled_r2t, tTR_rAcc, tTR_tAcc); +#endif + } + + + template + CUTLASS_DEVICE void epilogue( + ElementAcc& row_max, + ElementAcc& row_sum, + BlkCoord const& cta_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + EpilogueParams const& epilogue_args, + TensorStorage& shared_tensors, + uint32_t tmem_o, + int const& split_kv) { + + auto load_op = cute::SM100_TMEM_LOAD_32dp32b32x{}; + + TiledMmaPV tiled_mma_pv; + + Tensor tItI = TiledMmaPV::make_fragment_C(partition_shape_C(TiledMmaPV{}, take<0, 2>(TileShapePV{}))); + tItI.data() = tmem_o; + + CUTE_STATIC_ASSERT_V(shape<1>(tItI) == _1{}); + CUTE_STATIC_ASSERT_V(shape<2>(tItI) == _1{}); + Tensor tAcc = tItI(make_coord(_,_),_0{},_0{}); + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + if (epilogue_args.ptr_o_acc != nullptr) { + using 
ElementOutAcc = ElementAcc; + constexpr auto AlignmentOutAcc = 128 / cute::sizeof_bits_v; + Tensor mO = make_tensor(make_gmem_ptr(epilogue_args.ptr_o_acc + get<3>(cta_coord) * D_latent), make_shape(H, D_latent, B), epilogue_args.stride_o_acc); + auto cta_tiler_pv = take<0,2>(typename CollectiveMmaPV::CtaShape_MNK{}); + Tensor gO = local_tile(mO, cta_tiler_pv, take<0,3>(cta_coord)); + + auto tiled_t2r = make_tmem_copy(load_op, tAcc); + auto thread_idx = threadIdx.x % size(tiled_t2r); + + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + Tensor tTR_gO = thread_t2r.partition_D(gO); + Tensor tTR_rAcc = make_tensor(shape(tTR_gO)); + + Tensor tTR_rO_frag = make_tensor(shape(tTR_rAcc)); + Tensor tTR_rO_src = recast>(coalesce(tTR_rO_frag)); + Tensor tR2G_rO_dst = recast>(coalesce(tTR_gO)); + Tensor tTR_tAcc = thread_t2r.partition_S(tAcc); + + copy(tiled_t2r, tTR_tAcc, tTR_rAcc); + + cutlass::epilogue::thread::LinearCombination epilogue_op({epilogue_args.output_scale / row_sum}); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i++) { + tTR_rO_frag(i) = epilogue_op(tTR_rAcc(i)); + } + + copy(tTR_rO_src, tR2G_rO_dst); + +#ifndef B2B + + // compute LSE + ElementAcc lse = cutlass::fast_log(row_sum) + mainloop_args.softmax_scale * row_max; + + // store LSE + Tensor mLSE = make_tensor(make_gmem_ptr(epilogue_args.ptr_lse_acc + H * get<3>(cta_coord)), make_shape(H, B), epilogue_args.stride_lse_acc); + Tensor gLSE = local_tile(mLSE, append<3>(cta_tiler_pv, _1{}), take<0,3>(cta_coord), Step<_1, Underscore, _1>{}); + // for 2x2 dp, this must be conditional and the index is wrong + if (! kIs2Sm || (threadIdx.x < 64)) + { + gLSE(threadIdx.x) = lse; + } + #endif + } + else { + Tensor mO = make_tensor(make_gmem_ptr(epilogue_args.ptr_o), make_shape(H, D_latent, B), epilogue_args.stride_o); + auto cta_tiler_pv = take<0,2>(typename CollectiveMmaPV::CtaShape_MNK{}); + Tensor gO = local_tile(mO, cta_tiler_pv, take<0,3>(cta_coord)); + + auto tiled_t2r = make_tmem_copy(load_op, tAcc); + auto thread_idx = threadIdx.x % size(tiled_t2r); + + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + Tensor tTR_gO = thread_t2r.partition_D(gO); + Tensor tTR_rAcc = make_tensor(shape(tTR_gO)); + + Tensor tTR_rO_frag = make_tensor(shape(tTR_rAcc)); + Tensor tTR_rO_src = recast>(coalesce(tTR_rO_frag)); + Tensor tR2G_rO_dst = recast>(coalesce(tTR_gO)); + Tensor tTR_tAcc = thread_t2r.partition_S(tAcc); + + copy(tiled_t2r, tTR_tAcc, tTR_rAcc); + + cutlass::epilogue::thread::LinearCombination epilogue_op({epilogue_args.output_scale / row_sum}); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i++) { + tTR_rO_frag(i) = epilogue_op(tTR_rAcc(i)); + } + + copy(tTR_rO_src, tR2G_rO_dst); + +#ifndef B2B + if (epilogue_args.ptr_lse != nullptr) { + // compute LSE + ElementAcc lse = cutlass::fast_log(row_sum) + mainloop_args.softmax_scale * row_max; + + // store LSE + Tensor mLSE = make_tensor(make_gmem_ptr(epilogue_args.ptr_lse), make_shape(H, B), epilogue_args.stride_lse); + Tensor gLSE = local_tile(mLSE, append<3>(cta_tiler_pv, _1{}), take<0,3>(cta_coord), Step<_1, Underscore, _1>{}); + + // for 2x2 dp, this must be conditional and the index is wrong + if (! 
kIs2Sm || (threadIdx.x < 64)) + { + gLSE(threadIdx.x) = lse; + } + } +#endif + } + } + + + template + CUTLASS_DEVICE void compute( + CtaCoord const& cta_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + EpilogueParams const& epilogue_args, + TensorStorage& shared_tensors, + PipelineS& pipeline_mma_s, + typename PipelineS::PipelineState& pipeline_mma_s_consumer_state, + PipelineP& pipeline_p_mma, + typename PipelineP::PipelineState& pipeline_p_mma_producer_state, + PipelineO& pipeline_mma_o, + typename PipelineO::PipelineState& pipeline_mma_o_consumer_state, + int const& split_kv) { + + auto [H, K, D, B] = problem_shape; + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(cta_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + + // if we return early, we have to make sure we release the load warp + cutlass::arch::NamedBarrier( + (kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, + kNamedBarrierEpilogue + ).arrive(); + + return; + } + int k_index_final = k_tile_total - 1; + + ElementAcc row_max = -std::numeric_limits::infinity(); + ElementAcc row_sum = 0; + ElementAcc correction_factor = 1; + + pipeline_p_mma.producer_acquire(pipeline_p_mma_producer_state); + pipeline_mma_s.consumer_wait(pipeline_mma_s_consumer_state); + + auto dispatch_bool = [](bool b, auto fn) { + if (b) { + fn(cute::true_type{}); + } + else { + fn(cute::false_type{}); + } + }; + + // softmax s0 -> p0 + dispatch_bool(k_index == k_index_final, [&](auto is_last_tile) { + softmax( + is_last_tile, + row_max, row_sum, correction_factor, + problem_shape, mainloop_args, shared_tensors, k_index, + uint32_t(pipeline_mma_s_consumer_state.index() == 0 ? TmemAllocation::kS0 : TmemAllocation::kS1), + pipeline_p_mma_producer_state.index() + ); + }); + + k_index += 1; + + cutlass::arch::fence_view_async_tmem_load(); + cutlass::arch::fence_view_async_shared(); + pipeline_mma_s.consumer_release(pipeline_mma_s_consumer_state); + ++pipeline_mma_s_consumer_state; + pipeline_p_mma.producer_commit(pipeline_p_mma_producer_state); + ++pipeline_p_mma_producer_state; + + k_tile_count -= 1; + + CUTLASS_PRAGMA_NO_UNROLL + while (k_tile_count > 0) { + pipeline_p_mma.producer_acquire(pipeline_p_mma_producer_state); + pipeline_mma_s.consumer_wait(pipeline_mma_s_consumer_state); + + // softmax s1 -> p1 + dispatch_bool(k_index == k_index_final, [&](auto is_last_tile) { + softmax( + is_last_tile, + row_max, row_sum, correction_factor, + problem_shape, mainloop_args, shared_tensors, k_index, + uint32_t(pipeline_mma_s_consumer_state.index() == 0 ? 
TmemAllocation::kS0 : TmemAllocation::kS1), + pipeline_p_mma_producer_state.index() + ); + }); + + cutlass::arch::fence_view_async_tmem_load(); + cutlass::arch::fence_view_async_shared(); + pipeline_mma_s.consumer_release(pipeline_mma_s_consumer_state); + ++pipeline_mma_s_consumer_state; + pipeline_p_mma.producer_commit(pipeline_p_mma_producer_state); + ++pipeline_p_mma_producer_state; + + pipeline_mma_o.consumer_wait(pipeline_mma_o_consumer_state); + + // rescale + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < IterationsPV_N; j++) { + rescale(correction_factor, uint32_t(TmemAllocation::kO0) + j * uint32_t(TmemAllocation::kSizeAccO)); + } + + cutlass::arch::fence_view_async_tmem_store(); + pipeline_mma_o.consumer_release(pipeline_mma_o_consumer_state); + ++pipeline_mma_o_consumer_state; + + --k_tile_count; + k_index += 1; + } + + pipeline_mma_o.consumer_wait(pipeline_mma_o_consumer_state); + +#ifdef B2B + row_sum = 1; +#else + if constexpr (kWarpsInN > 1) { + // reduce row_sum if needed (for 2x2 dp) + shared_tensors.smem_exchange[threadIdx.x] = row_sum; + cutlass::arch::NamedBarrier(kNumComputeWarps*NumThreadsPerWarp, kNamedBarrierExchange).sync(); + // (64, 2) shape + int peer_index = (threadIdx.x + 64) % 128; + row_sum += shared_tensors.smem_exchange[peer_index]; + } +#endif + + cutlass::arch::NamedBarrier((kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, kNamedBarrierEpilogue).arrive(); + + // epilogue + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < IterationsPV_N; j++) { + epilogue( + row_max, row_sum, + replace<1>(cta_coord, j), problem_shape, + mainloop_args, epilogue_args, shared_tensors, + uint32_t(TmemAllocation::kO0) + j * uint32_t(TmemAllocation::kSizeAccO), split_kv + ); + } + + cutlass::arch::fence_view_async_tmem_load(); + pipeline_mma_o.consumer_release(pipeline_mma_o_consumer_state); + ++pipeline_mma_o_consumer_state; + } + +}; + +/////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::fmha::kernel diff --git a/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_mla_tile_scheduler.hpp b/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_mla_tile_scheduler.hpp new file mode 100644 index 00000000000..c990ee2d856 --- /dev/null +++ b/csrc/attention/mla/cutlass_sm100_mla/kernel/sm100_mla_tile_scheduler.hpp @@ -0,0 +1,165 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights + *reserved. SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + *this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + *ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE + *LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + *CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + *SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + *INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + *CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + *ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + *POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/* + * Taken from SGLANG PR https://github.com/sgl-project/sglang/pull/6929 + * by Alcanderian JieXin Liang + */ + +// clang-format off +#pragma once + +#include "cutlass/cutlass.h" +#include "cutlass/fast_math.h" +#include "cutlass/kernel_hardware_info.h" + +namespace cutlass::fmha::kernel { + +//////////////////////////////////////////////////////////////////////////////// + +struct Sm100MlaIndividualTileScheduler { + + struct Params { + dim3 grid; + }; + + bool valid_ = true; + + CUTLASS_DEVICE + Sm100MlaIndividualTileScheduler(Params const&) {} + + template + static Params to_underlying_arguments( + ProblemShape const& problem_shape, KernelHardwareInfo hw_info, + ClusterShape const& cluster_shape, int const& split_kv) { + using namespace cute; + dim3 grid(get<0>(cluster_shape), get<3>(problem_shape) /* Batch */, split_kv /*Maximum Split KV*/); + return Params{ grid }; + } + + static dim3 get_grid_shape(Params const& params) { + return params.grid; + } + + CUTLASS_DEVICE + bool is_valid() { + return valid_; + } + + CUTLASS_DEVICE + auto get_block_coord() { + using namespace cute; + return make_coord(blockIdx.x, _0{}, blockIdx.y, blockIdx.z); + } + + CUTLASS_DEVICE + Sm100MlaIndividualTileScheduler& operator++() { + valid_ = false; + return *this; + } +}; + +//////////////////////////////////////////////////////////////////////////////// + +struct Sm100MlaPersistentTileScheduler { + + struct Params { + int num_blocks; + FastDivmod divmod_m_block; + FastDivmod divmod_b; + FastDivmod divmod_split_kv; + KernelHardwareInfo hw_info; + }; + + int block_idx = 0; + Params params; + + CUTLASS_DEVICE + Sm100MlaPersistentTileScheduler(Params const& params) : block_idx(blockIdx.x), params(params) {} + + template + static Params to_underlying_arguments( + ProblemShape const& problem_shape, KernelHardwareInfo hw_info, + ClusterShape const& cluster_shape, int const& split_kv) { + using namespace cute; + // Get SM count if needed, otherwise use user supplied SM count + int sm_count = hw_info.sm_count; + if (sm_count <= 1 || sm_count % size<0>(cluster_shape) != 0) { + CUTLASS_TRACE_HOST(" WARNING: Arguments do not include a valid SM count.\n" + " For optimal performance, populate the arguments KernelHardwareInfo struct with the SM count."); + sm_count = KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + } + + CUTLASS_TRACE_HOST("to_underlying_arguments(): Setting persistent grid SM count to " << sm_count); + hw_info.sm_count = sm_count; + + int num_m_blocks = size<0>(cluster_shape); + int num_blocks = num_m_blocks * get<3>(problem_shape) /* Batch */; + num_blocks *= split_kv; /* Maximum Split KV*/ + + return Params { + num_blocks, + { num_m_blocks}, { get<3>(problem_shape) }, {split_kv}, + hw_info + }; + } + + static dim3 get_grid_shape(Params const& params) { + dim3 grid(std::min(params.num_blocks, params.hw_info.sm_count), 1, 1); + return 
grid; + } + + CUTLASS_DEVICE + bool is_valid() { + return block_idx < params.num_blocks; + } + + CUTLASS_DEVICE + auto get_block_coord() { + using namespace cute; + int block_decode = block_idx; + int m_block, bidb, n_split_kv; + params.divmod_m_block(block_decode, m_block, block_decode); + params.divmod_b(block_decode, bidb, block_decode); + params.divmod_split_kv(block_decode, n_split_kv, block_decode); + return make_coord(m_block, _0{}, bidb, n_split_kv); + } + + CUTLASS_DEVICE + Sm100MlaPersistentTileScheduler& operator++() { + block_idx += gridDim.x; + return *this; + } +}; + +//////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::fmha::kernel diff --git a/csrc/attention/mla/sm100_cutlass_mla_kernel.cu b/csrc/attention/mla/sm100_cutlass_mla_kernel.cu new file mode 100644 index 00000000000..0d57ff4cc7c --- /dev/null +++ b/csrc/attention/mla/sm100_cutlass_mla_kernel.cu @@ -0,0 +1,273 @@ +/* +Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +Copyright 2025 SGLang Team. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +==============================================================================*/ +/* + * Taken from SGLANG PR https://github.com/sgl-project/sglang/pull/6929 + * by Alcanderian JieXin Liang + */ + +#include +#include +#include +#include +#include + +#include +#include + +#include "cutlass_sm100_mla/device/sm100_mla.hpp" +#include "cutlass_sm100_mla/kernel/sm100_mla_tile_scheduler.hpp" + +// clang-format off +#if !defined(CUDA_VERSION) || CUDA_VERSION < 12040 +void sm100_cutlass_mla_decode( + torch::Tensor const& out, + torch::Tensor const& q_nope, + torch::Tensor const& q_pe, + torch::Tensor const& kv_c_and_k_pe_cache, + torch::Tensor const& seq_lens, + torch::Tensor const& page_table, + torch::Tensor const& workspace, + int64_t num_kv_splits) { + TORCH_CHECK(false, "CUDA version must be >= 12.4 for cutlass_mla_decode"); +} +int64_t sm100_cutlass_mla_get_workspace_size(int64_t max_seq_len, int64_t num_batches, int64_t sm_count, int64_t num_kv_splits) { + TORCH_CHECK(false, "CUDA version must be >= 12.4 for cutlass_mla_get_workspace_size"); +} +#else + +#define CUTLASS_CHECK(status) \ + { \ + cutlass::Status error = status; \ + TORCH_CHECK(error == cutlass::Status::kSuccess, cutlassGetStatusString(error)); \ + } + +using namespace cute; +using namespace cutlass::fmha::kernel; + +template +struct IsPersistent { + static const bool value = v; +}; + +template > +struct MlaSm100 { + using Element = T; + using ElementAcc = float; + using ElementOut = T; + + using TileShape = Shape<_128, _128, Shape<_512, _64>>; + using TileShapeH = cute::tuple_element_t<0, TileShape>; + using TileShapeD = cute::tuple_element_t<2, TileShape>; + + // H K (D_latent D_rope) B + using ProblemShape = cute::tuple; + + using StrideQ = cute::tuple; // H D B + using StrideK = cute::tuple; // K D B + using StrideO = StrideK; // H D B + using StrideLSE = cute::tuple<_1, int>; // H B + + using TileScheduler = + std::conditional_t; + + using 
FmhaKernel = cutlass::fmha::kernel::Sm100FmhaMlaKernelTmaWarpspecialized< + TileShape, + Element, + ElementAcc, + ElementOut, + ElementAcc, + TileScheduler, + /*kIsCpAsync=*/!IsPaged128>; + using Fmha = cutlass::fmha::device::MLA; +}; + +template +typename T::Fmha::Arguments args_from_options( + at::Tensor const& out, + at::Tensor const& q_nope, + at::Tensor const& q_pe, + at::Tensor const& kv_c_and_k_pe_cache, + at::Tensor const& seq_lens, + at::Tensor const& page_table, + double sm_scale, + int64_t num_kv_splits) { + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = q_nope.device().index(); + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + + int batches = q_nope.sizes()[0]; + int page_count_per_seq = page_table.sizes()[1]; + int page_count_total = kv_c_and_k_pe_cache.sizes()[0]; + int page_size = kv_c_and_k_pe_cache.sizes()[1]; + int max_seq_len = page_size * page_count_per_seq; + using TileShapeH = typename T::TileShapeH; + using TileShapeD = typename T::TileShapeD; + auto problem_shape = cute::make_tuple(TileShapeH{}, max_seq_len, TileShapeD{}, batches); + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + float scale = float(sm_scale); + + using StrideQ = typename T::StrideQ; + using StrideK = typename T::StrideK; + using StrideO = typename T::StrideO; + using StrideLSE = typename T::StrideLSE; + + StrideQ stride_Q_nope = cute::make_tuple( + static_cast(q_nope.stride(1)), _1{}, static_cast(q_nope.stride(0))); + StrideQ stride_Q_pe = cute::make_tuple( + static_cast(q_pe.stride(1)), _1{}, static_cast(q_pe.stride(0))); + + StrideK stride_C = cute::make_tuple( + static_cast(0 + D_latent + D_rope), _1{}, static_cast(page_size * (D_latent + D_rope))); + StrideLSE stride_PT = cute::make_stride(_1{}, page_count_per_seq); + StrideLSE stride_LSE = cute::make_tuple(_1{}, 0 + H); + StrideO stride_O = cute::make_tuple(static_cast(0 + D_latent), _1{}, static_cast(0 + H * D_latent)); + + using Element = typename T::Element; + using ElementOut = typename T::ElementOut; + using ElementAcc = typename T::ElementAcc; + auto Q_nope_ptr = static_cast(q_nope.data_ptr()); + auto Q_pe_ptr = static_cast(q_pe.data_ptr()); + auto C_ptr = static_cast(kv_c_and_k_pe_cache.data_ptr()); + typename T::Fmha::Arguments arguments{ + problem_shape, + {scale, + Q_nope_ptr, + stride_Q_nope, + Q_pe_ptr, + stride_Q_pe, + C_ptr, + stride_C, + C_ptr + D_latent, + stride_C, + static_cast(seq_lens.data_ptr()), + static_cast(page_table.data_ptr()), + stride_PT, + page_count_total, + page_size}, + {static_cast(out.data_ptr()), stride_O, static_cast(nullptr), stride_LSE}, + hw_info, + // TODO(trevor-m): Change split_kv back to -1 when + // https://github.com/NVIDIA/cutlass/issues/2274 is fixed. Split_kv=1 will + // perform worse with larger context length and smaller batch sizes. + num_kv_splits, // split_kv + nullptr, // is_var_split_kv + }; + // TODO(kaixih@nvidia): When split_kv=-1 and is_var_split_kv=false, we compute + // split_kv automatically based on batch size and sequence length to balance + // workload across available SMs. Consider using var_split_kv for manual + // control if needed. 
+ T::Fmha::set_split_kv(arguments); + return arguments; +} + +template +void runMla( + at::Tensor const& out, + at::Tensor const& q_nope, + at::Tensor const& q_pe, + at::Tensor const& kv_c_and_k_pe_cache, + at::Tensor const& seq_lens, + at::Tensor const& page_table, + at::Tensor const& workspace, + double sm_scale, + int64_t num_kv_splits, + cudaStream_t stream) { + using MlaSm100Type = MlaSm100; + typename MlaSm100Type::Fmha fmha; + auto arguments = args_from_options(out, q_nope, q_pe, kv_c_and_k_pe_cache, seq_lens, page_table, sm_scale, num_kv_splits); + + CUTLASS_CHECK(fmha.can_implement(arguments)); + + CUTLASS_CHECK(fmha.initialize(arguments, workspace.data_ptr(), stream)); + + CUTLASS_CHECK(fmha.run(arguments, workspace.data_ptr(), stream)); +} + +#define DISPATCH_BOOL(expr, const_expr, ...) \ + [&]() -> bool { \ + if (expr) { \ + constexpr bool const_expr = true; \ + return __VA_ARGS__(); \ + } else { \ + constexpr bool const_expr = false; \ + return __VA_ARGS__(); \ + } \ + }() + +void sm100_cutlass_mla_decode( + torch::Tensor const& out, + torch::Tensor const& q_nope, + torch::Tensor const& q_pe, + torch::Tensor const& kv_c_and_k_pe_cache, + torch::Tensor const& seq_lens, + torch::Tensor const& page_table, + torch::Tensor const& workspace, + double sm_scale, + int64_t num_kv_splits) { + auto in_dtype = q_nope.dtype(); + at::cuda::CUDAGuard device_guard{(char)q_nope.get_device()}; + const cudaStream_t stream = at::cuda::getCurrentCUDAStream(q_nope.get_device()); + const int page_size = kv_c_and_k_pe_cache.sizes()[1]; + + // NOTE(alcanderian): IsPersistent has bug with manual split_kv. + // Kernel will hang if batch is too large with large num_kv_splits. (for example bs=8, num_kv_splits=8) + // Maybe per batch split kv will fix this. + DISPATCH_BOOL(page_size == 128, IsPaged128, [&] { + DISPATCH_BOOL(num_kv_splits <= 1, NotManualSplitKV, [&] { + if (in_dtype == at::ScalarType::Half) { + runMla>( + out, q_nope, q_pe, kv_c_and_k_pe_cache, seq_lens, page_table, workspace, sm_scale, num_kv_splits, stream); + } else if (in_dtype == at::ScalarType::BFloat16) { + runMla>( + out, q_nope, q_pe, kv_c_and_k_pe_cache, seq_lens, page_table, workspace, sm_scale, num_kv_splits, stream); + } else if (in_dtype == at::ScalarType::Float8_e4m3fn) { + runMla>( + out, q_nope, q_pe, kv_c_and_k_pe_cache, seq_lens, page_table, workspace, sm_scale, num_kv_splits, stream); + } else { + TORCH_CHECK(false, "Unsupported input data type of MLA"); + } + return true; + }); + return true; + }); +} + +int64_t sm100_cutlass_mla_get_workspace_size(int64_t max_seq_len, int64_t num_batches, int64_t sm_count, int64_t num_kv_splits) { + // Workspace size depends on ElementAcc and ElementLSE (same as ElementAcc) + // which are float, so Element type here doesn't matter. + using MlaSm100Type = MlaSm100; + + // Get split kv. Requires problem shape and sm_count only. + typename MlaSm100Type::Fmha::Arguments arguments; + using TileShapeH = typename MlaSm100Type::TileShapeH; + using TileShapeD = typename MlaSm100Type::TileShapeD; + arguments.problem_shape = + cute::make_tuple(TileShapeH{}, static_cast(max_seq_len), TileShapeD{}, static_cast(num_batches)); + // Assumes device 0 when getting sm_count. + arguments.hw_info.sm_count = + sm_count <= 0 ? 
cutlass::KernelHardwareInfo::query_device_multiprocessor_count(/*device_id=*/0) : sm_count; + arguments.split_kv = num_kv_splits; + MlaSm100Type::Fmha::set_split_kv(arguments); + + return MlaSm100Type::Fmha::get_workspace_size(arguments); +} + +#endif +// clang-format on diff --git a/csrc/ops.h b/csrc/ops.h index 7f3e6b6923a..20ad163dc0d 100644 --- a/csrc/ops.h +++ b/csrc/ops.h @@ -167,6 +167,19 @@ void cutlass_mla_decode(torch::Tensor const& out, torch::Tensor const& q_nope, torch::Tensor const& seq_lens, torch::Tensor const& page_table, double scale); +void sm100_cutlass_mla_decode( + torch::Tensor const& out, torch::Tensor const& q_nope, + torch::Tensor const& q_pe, torch::Tensor const& kv_c_and_k_pe_cache, + torch::Tensor const& seq_lens, torch::Tensor const& page_table, + torch::Tensor const& workspace, double sm_scale, + int64_t num_kv_splits = + 1 /* Set to 1 to avoid cuda_graph issue by default. */); + +int64_t sm100_cutlass_mla_get_workspace_size( + int64_t max_seq_len, int64_t num_batches, int64_t sm_count = 0, + int64_t num_kv_splits = + 1 /* Set to 1 to avoid cuda_graph issue by default. */); + torch::Tensor get_cuda_view_from_cpu_tensor(torch::Tensor& cpu_tensor); #ifndef USE_ROCM diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp index 1920bec4223..370edc20149 100644 --- a/csrc/torch_bindings.cpp +++ b/csrc/torch_bindings.cpp @@ -514,6 +514,23 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { " Tensor page_table, float scale) -> ()"); ops.impl("cutlass_mla_decode", torch::kCUDA, &cutlass_mla_decode); + // SM100 CUTLASS MLA decode + ops.def( + "sm100_cutlass_mla_decode(Tensor! out, Tensor q_nope, Tensor q_pe," + " Tensor kv_c_and_k_pe_cache, Tensor seq_lens," + " Tensor page_table, Tensor workspace, float " + "scale," + " int num_kv_splits) -> ()"); + ops.impl("sm100_cutlass_mla_decode", torch::kCUDA, &sm100_cutlass_mla_decode); + + // SM100 CUTLASS MLA workspace + ops.def( + "sm100_cutlass_mla_get_workspace_size(int max_seq_len, int num_batches," + " int sm_count, int num_kv_splits) " + "-> int"); + ops.impl("sm100_cutlass_mla_get_workspace_size", + &sm100_cutlass_mla_get_workspace_size); + // Compute NVFP4 block quantized tensor. ops.def( "scaled_fp4_quant(Tensor! 
output, Tensor input," diff --git a/vllm/_custom_ops.py b/vllm/_custom_ops.py index deedeef46b0..f25db40a4ef 100644 --- a/vllm/_custom_ops.py +++ b/vllm/_custom_ops.py @@ -1843,6 +1843,26 @@ def cutlass_mla_decode(out: torch.Tensor, q_nope: torch.Tensor, return out +def sm100_cutlass_mla_decode(out: torch.Tensor, q_nope: torch.Tensor, + q_pe: torch.Tensor, + kv_c_and_k_pe_cache: torch.Tensor, + seq_lens: torch.Tensor, page_table: torch.Tensor, + workspace: torch.Tensor, scale: float, + num_kv_splits: int) -> torch.Tensor: + torch.ops._C.sm100_cutlass_mla_decode(out, q_nope, q_pe, + kv_c_and_k_pe_cache, seq_lens, + page_table, workspace, scale, + num_kv_splits) + return out + + +def sm100_cutlass_mla_get_workspace_size(max_seq_len: int, num_batches: int, + sm_count: int, + num_kv_splits: int) -> int: + return torch.ops._C.sm100_cutlass_mla_get_workspace_size( + max_seq_len, num_batches, sm_count, num_kv_splits) + + if hasattr(torch.ops._C, "weight_packed_linear"): @register_fake("_C::weight_packed_linear") diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index 75b10643c2b..03f0c15270b 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -166,6 +166,13 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None: logger.info( "Forcing kv cache block size to 64 for FlashMLA backend.") + use_cutlass_mla = (envs.VLLM_ATTENTION_BACKEND is not None \ + and envs.VLLM_ATTENTION_BACKEND == "CUTLASS_MLA_VLLM_V1") + if use_cutlass_mla and cache_config.block_size != 128: + cache_config.block_size = 128 + logger.info("Forcing kv cache block size to 128 for " + "CUTLASS_MLA_VLLM_V1 backend.") + compilation_config = vllm_config.compilation_config if (envs.VLLM_ALL2ALL_BACKEND == "deepep_high_throughput" and parallel_config.data_parallel_size > 1 diff --git a/vllm/v1/attention/backends/mla/common.py b/vllm/v1/attention/backends/mla/common.py index 1232f73430f..904b6081d92 100644 --- a/vllm/v1/attention/backends/mla/common.py +++ b/vllm/v1/attention/backends/mla/common.py @@ -333,6 +333,9 @@ class MLACommonMetadata(Generic[D]): # |-------------------- seq_len ---------------------| # |-- query_len ---| + num_reqs: int + max_query_len: int + num_actual_tokens: int # Number of tokens excluding padding. query_start_loc: torch.Tensor slot_mapping: torch.Tensor @@ -716,6 +719,8 @@ def build(self, common_prefix_len: int, ) attn_metadata = self.metadata_cls( + num_reqs=common_attn_metadata.num_reqs, + max_query_len=common_attn_metadata.max_query_len, num_actual_tokens=num_actual_tokens, query_start_loc=query_start_loc, slot_mapping=slot_mapping, diff --git a/vllm/v1/attention/backends/mla/cutlass_mla.py b/vllm/v1/attention/backends/mla/cutlass_mla.py index b2116bf1143..a0f7c39c004 100644 --- a/vllm/v1/attention/backends/mla/cutlass_mla.py +++ b/vllm/v1/attention/backends/mla/cutlass_mla.py @@ -1,6 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import os from typing import Any, Optional import torch @@ -27,6 +28,41 @@ def get_impl_cls() -> type["CutlassMLAImpl"]: return CutlassMLAImpl +class SM100Workspace: + + def __init__(self, initial_workspace_size): + self._workspace_buf = torch.empty(initial_workspace_size, + device="cuda", + dtype=torch.uint8) + + self._block_size = 128 # Forced to 128 + + # Pre-compute sm_count to avoid recomputing it. 
Use device 0 as a proxy + # (assumes all devices are similar) + properties = torch.cuda.get_device_properties(torch.device("cuda:0")) + self._sm_count = properties.multi_processor_count + + def get_buf(self): + return self._workspace_buf + + def ensure_size(self, attn_metadata: MLACommonMetadata, + num_kv_splits: int): + batch_size = attn_metadata.num_reqs + max_seq_len = attn_metadata.max_query_len + + workspace_size = ops.sm100_cutlass_mla_get_workspace_size( + max_seq_len * self._block_size, + batch_size, + self._sm_count, + num_kv_splits=num_kv_splits) + + if self._workspace_buf.shape[0] < workspace_size: + self._workspace_buf.resize_(workspace_size) + + +g_sm100_workspace = SM100Workspace(128 * 1024 * 1024) # 128MB + + class CutlassMLAImpl(MLACommonImpl[MLACommonMetadata]): def __init__( @@ -68,7 +104,137 @@ def __init__( raise NotImplementedError( "CutlassMLA V1 with FP8 KV cache not yet supported") - def _forward_decode( + self._use_old_cutlass_mla = False + force_old_cutlass = os.environ.get("FORCE_OLD_CUTLASS_MLA", None) + if force_old_cutlass: + logger.warning("Forcing old cutlass mla kernel") + self._use_old_cutlass_mla = True + + # TODO: Currently, num_kv_splits is limited to 16 to avoid hanging + # issues. In case the code hangs, use: + # FORCE_NUM_KV_SPLITS=1 + force_num_kv_splits = os.environ.get("FORCE_NUM_KV_SPLITS", None) + if force_num_kv_splits: + logger.warning("Forcing num_kv_splits to %d", + int(force_num_kv_splits)) + self._num_kv_splits = int(force_num_kv_splits) + else: + self._num_kv_splits = -1 # => Auto-detect + + # Share workspace buffer across all executions + self._workspace = g_sm100_workspace + + def _sm100_cutlass_mla_decode( + self, + q_nope: torch.Tensor, + q_pe: torch.Tensor, + kv_c_and_k_pe_cache: torch.Tensor, + seq_lens: torch.Tensor, + page_table: torch.Tensor, + workspace: torch.Tensor, + sm_scale: float, + num_kv_splits: int, + ) -> torch.Tensor: + assert (q_nope.ndim == 3 + ), f"q_nope must be a 3D tensor, but got {q_nope.ndim}" + assert ( + q_pe.ndim == 3), f"q_pe must be a 3D tensor, but got {q_pe.ndim}" + assert ( + kv_c_and_k_pe_cache.ndim == 3 + ), "kv_c_and_k_pe_cache must be a 3D tensor, but got {}".format( + kv_c_and_k_pe_cache.ndim) + + B_q, H, D_q_nope = q_nope.shape + B_q_2, H_2, D_q_pe = q_pe.shape + assert (B_q == B_q_2) and (H == H_2) + + _, PAGE_SIZE, D_ckv = kv_c_and_k_pe_cache.shape + + D_latent = 512 + D_rope = 64 + assert D_q_nope == D_latent + assert D_q_pe == D_rope + assert D_ckv == D_latent + D_rope + + MAX_HEADS = 128 + assert H <= MAX_HEADS, f"H must be <= {MAX_HEADS}, but got {H}" + if H < MAX_HEADS: + q_nope_padded = q_nope.new_empty((B_q, MAX_HEADS, D_q_nope)) + q_nope_padded[:, :H] = q_nope + q_nope = q_nope_padded + + q_pe_padded = q_pe.new_empty((B_q, MAX_HEADS, D_q_pe)) + q_pe_padded[:, :H] = q_pe + q_pe = q_pe_padded + + assert len(page_table.shape) == 2 + B_block_table, block_num = page_table.shape + assert B_block_table == B_q + assert (block_num + > 0), f"block num must be greater than 0, got {block_num}" + assert block_num % (128 / PAGE_SIZE) == 0 + + # TODO(kaixih@nvidia): support fp8 + assert q_nope.dtype in ( + torch.float16, + torch.bfloat16, + ), f"q_nope.dtype needs to be fp16 or bf16 but got {q_nope.dtype}." + assert q_nope.dtype == q_pe.dtype == kv_c_and_k_pe_cache.dtype + assert ( + seq_lens.dtype == torch.int32 + ), f"seq_lens.dtype needs to be int32 but got {seq_lens.dtype}." + assert ( + page_table.dtype == torch.int32 + ), f"page_table.dtype needs to be int32 but got {page_table.dtype}." 
+ + out = q_nope.new_empty((B_q, MAX_HEADS, D_latent)) + + ops.sm100_cutlass_mla_decode( + out, + q_nope, + q_pe, + kv_c_and_k_pe_cache, + seq_lens, + page_table, + workspace, + sm_scale, + num_kv_splits, + ) + return out[:, :H].contiguous() + + def _sm100_forward_decode( + self, + q_nope: torch.Tensor, + q_pe: torch.Tensor, + kv_c_and_k_pe_cache: torch.Tensor, + attn_metadata: MLACommonMetadata, + ) -> torch.Tensor: + assert kv_c_and_k_pe_cache.numel() > 0 + assert attn_metadata.decode is not None + + if self.kv_cache_dtype.startswith("fp8"): + raise NotImplementedError("FP8 Cutlass MLA not yet supported") + + # Adjust workspace size (if necessary) + self._workspace.ensure_size(attn_metadata, self._num_kv_splits) + + # Run MLA + # Clone q_nope and q_pe to make sure strides computation is correct. + # TODO: Check if we really need it + q_nope = q_nope.clone() + q_pe = q_pe.clone() + + o = self._sm100_cutlass_mla_decode(q_nope, q_pe, kv_c_and_k_pe_cache, + attn_metadata.decode.seq_lens, + attn_metadata.decode.block_table, + self._workspace.get_buf(), + self.scale, self._num_kv_splits) + + return self._v_up_proj(o) + + # TODO: Currently we leave it here only for backup in case something is + # wrong with the new SM100 CUTLASS MLA kernel + def _old_forward_decode( self, q_nope: torch.Tensor, q_pe: torch.Tensor, @@ -97,3 +263,19 @@ def _forward_decode( attn_metadata.decode.block_table, self.scale) return self._v_up_proj(o) + + def _forward_decode( + self, + q_nope: torch.Tensor, + q_pe: torch.Tensor, + kv_c_and_k_pe_cache: torch.Tensor, + attn_metadata: MLACommonMetadata, + ) -> torch.Tensor: + if self._use_old_cutlass_mla: + # TODO: Remove the old cutlass MLA kernel after more extensive + # testing + return self._old_forward_decode(q_nope, q_pe, kv_c_and_k_pe_cache, + attn_metadata) + + return self._sm100_forward_decode(q_nope, q_pe, kv_c_and_k_pe_cache, + attn_metadata) From dac768ef375986ffcb1a5a53c2b0417fab9e6349 Mon Sep 17 00:00:00 2001 From: Richard Zou Date: Mon, 14 Jul 2025 21:26:18 -0400 Subject: [PATCH 079/552] [BugFix] VLLM_DISABLE_COMPILE_CACHE=1 should disable all reads and writes from the cache (#20942) Signed-off-by: Richard Zou Signed-off-by: x22x22 --- tests/compile/test_config.py | 24 ++++++++++++++++++++++++ vllm/compilation/backends.py | 3 ++- vllm/compilation/compiler_interface.py | 4 +++- vllm/compilation/counter.py | 4 ++++ 4 files changed, 33 insertions(+), 2 deletions(-) diff --git a/tests/compile/test_config.py b/tests/compile/test_config.py index 8679d5c3019..0ba59f4b5a0 100644 --- a/tests/compile/test_config.py +++ b/tests/compile/test_config.py @@ -26,6 +26,30 @@ def test_use_cudagraphs_dynamic(monkeypatch): assert not vllm_config.compilation_config.use_cudagraph +# NB: We don't test VLLM_DISABLE_COMPILE_CACHE=0 because that depends +# on the state of the cache directory on the current machine, which +# may be influenced by other tests. +@pytest.mark.parametrize("val", ["1"]) +def test_VLLM_DISABLE_COMPILE_CACHE(vllm_runner, monkeypatch, val): + assert vllm.envs.VLLM_USE_V1 + + # spawn means that the counters are in the same process. 
+ monkeypatch.setenv('VLLM_WORKER_MULTIPROC_METHOD', "spawn") + monkeypatch.setenv('VLLM_DISABLE_COMPILE_CACHE', val) + + compilation_config = { + "use_cudagraph": False, # speed things up a bit + } + with ( + compilation_counter.expect(num_cache_entries_updated=0, + num_compiled_artifacts_saved=0), + # loading the model causes compilation (if enabled) to happen + vllm_runner('facebook/opt-125m', + compilation_config=compilation_config, + gpu_memory_utilization=0.4) as _): + pass + + @pytest.mark.parametrize("enabled", [True, False]) def test_use_cudagraphs(vllm_runner, monkeypatch, enabled): assert vllm.envs.VLLM_USE_V1 diff --git a/vllm/compilation/backends.py b/vllm/compilation/backends.py index 5148c289d86..673fb586623 100644 --- a/vllm/compilation/backends.py +++ b/vllm/compilation/backends.py @@ -183,9 +183,10 @@ def compile(self, assert compiled_graph is not None, "Failed to compile the graph" # store the artifact in the cache - if handle is not None: + if not envs.VLLM_DISABLE_COMPILE_CACHE and handle is not None: self.cache[(runtime_shape, graph_index, self.compiler.name)] = handle + compilation_counter.num_cache_entries_updated += 1 self.is_cache_updated = True if graph_index == 0: # adds some info logging for the first graph diff --git a/vllm/compilation/compiler_interface.py b/vllm/compilation/compiler_interface.py index fd39a6127d0..b529f84b798 100644 --- a/vllm/compilation/compiler_interface.py +++ b/vllm/compilation/compiler_interface.py @@ -213,7 +213,9 @@ def compile( # Save the compiled artifact to disk in the specified path assert key is not None path = os.path.join(self.cache_dir, key) - compiled_graph.save(path=path, format="unpacked") + if not envs.VLLM_DISABLE_COMPILE_CACHE: + compiled_graph.save(path=path, format="unpacked") + compilation_counter.num_compiled_artifacts_saved += 1 return compiled_graph, (key, path) def load(self, diff --git a/vllm/compilation/counter.py b/vllm/compilation/counter.py index 9d7a25689b5..6acb8abb3de 100644 --- a/vllm/compilation/counter.py +++ b/vllm/compilation/counter.py @@ -23,6 +23,10 @@ class CompilationCounter: num_inductor_compiles: int = 0 # EagerAdapter.compile calls num_eager_compiles: int = 0 + # The number of time vLLM's compiler cache entry was updated + num_cache_entries_updated: int = 0 + # The number of standalone_compile compiled artifacts saved + num_compiled_artifacts_saved: int = 0 def clone(self) -> "CompilationCounter": return copy.deepcopy(self) From 14982a05e1faed682fa0853468e68a5d6dab1e4e Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Tue, 15 Jul 2025 10:42:17 +0900 Subject: [PATCH 080/552] [Bugfix] Fix incorrect dispatch for CutlassBlockScaledGroupedGemm and DeepGEMM (#20933) Signed-off-by: mgoin Signed-off-by: x22x22 --- vllm/model_executor/layers/quantization/fp8.py | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/vllm/model_executor/layers/quantization/fp8.py b/vllm/model_executor/layers/quantization/fp8.py index 59db3e6c444..824dfe15ae2 100644 --- a/vllm/model_executor/layers/quantization/fp8.py +++ b/vllm/model_executor/layers/quantization/fp8.py @@ -488,11 +488,16 @@ def __init__(self, quant_config: Fp8Config): logger.warning_once("Failed to import DeepGemm kernels.") elif not self.block_quant: logger.warning_once("Model is not block quantized. 
Not using " - " DeepGemm kernels") + "DeepGemm kernels") elif (current_platform.is_cuda() - and current_platform.has_device_capability(90)): + and current_platform.is_device_capability(90)): logger.info_once("Using DeepGemm kernels for Fp8MoEMethod.") self.allow_deep_gemm = True + elif (current_platform.is_cuda() + and is_blackwell_deep_gemm_used()): + logger.info_once("Using DeepGemm SM100 kernels for " + "Fp8MoEMethod.") + self.allow_deep_gemm = True else: logger.warning_once( "DeepGemm not supported on the current platform.") @@ -500,10 +505,10 @@ def __init__(self, quant_config: Fp8Config): # Check for CutlassBlockScaledGroupedGemm support. self.allow_cutlass_block_scaled_grouped_gemm = False if not self.block_quant: - logger.warning_once("Model is not block quantized. Not using " - "CutlassBlockScaledGroupedGemm kernels") + logger.debug_once("Model is not block quantized. Not using " + "CutlassBlockScaledGroupedGemm kernels") elif (current_platform.is_cuda() - and current_platform.has_device_capability(100)): + and current_platform.is_device_capability(100)): logger.info_once( "Using CutlassBlockScaledGroupedGemm kernels for Fp8MoEMethod." ) From 385727dbc3c5dbd28b40aa46c0d0dbe8c111315e Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Tue, 15 Jul 2025 11:44:18 +0900 Subject: [PATCH 081/552] [CI/Build] Split Entrypoints Test into LLM and API Server (#20945) Signed-off-by: mgoin Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index 4440187c36e..dd723cb620a 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -117,7 +117,7 @@ steps: commands: - pytest -v -s core -- label: Entrypoints Test # 40min +- label: Entrypoints Test (LLM) # 40min mirror_hardwares: [amdexperimental] working_dir: "/vllm-workspace/tests" fast_check: true @@ -125,8 +125,6 @@ steps: source_file_dependencies: - vllm/ - tests/entrypoints/llm - - tests/entrypoints/openai - - tests/entrypoints/test_chat_utils - tests/entrypoints/offline_mode commands: - export VLLM_WORKER_MULTIPROC_METHOD=spawn @@ -135,9 +133,21 @@ steps: - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process - pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process - VLLM_USE_V1=0 pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process + - VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests + +- label: Entrypoints Test (API Server) # 40min + mirror_hardwares: [amdexperimental] + working_dir: "/vllm-workspace/tests" + fast_check: true + torch_nightly: true + source_file_dependencies: + - vllm/ + - tests/entrypoints/openai + - tests/entrypoints/test_chat_utils + commands: + - export VLLM_WORKER_MULTIPROC_METHOD=spawn - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ - pytest -v -s entrypoints/test_chat_utils.py - - VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests - label: Distributed Tests (4 GPUs) # 10min mirror_hardwares: [amdexperimental] From 7889a536f8e761a9ece915d9958bbf94bd9170ce Mon Sep 17 00:00:00 2001 From: XiongfeiWei Date: Mon, 14 Jul 2025 20:06:33 -0700 Subject: [PATCH 082/552] Use w8a8 quantized 
matmul Pallas kernel (#19170) Signed-off-by: Xiongfei Wei Signed-off-by: x22x22 --- requirements/tpu.txt | 10 +++--- tests/tpu/test_quantization_accuracy.py | 8 ++--- tests/v1/tpu/test_basic.py | 32 +++++++++++++++++++ .../quantization/kernels/scaled_mm/xla.py | 19 ++++++----- 4 files changed, 50 insertions(+), 19 deletions(-) diff --git a/requirements/tpu.txt b/requirements/tpu.txt index a4aee21d2bd..db58b37c2b1 100644 --- a/requirements/tpu.txt +++ b/requirements/tpu.txt @@ -18,9 +18,9 @@ setuptools==78.1.0 --find-links https://storage.googleapis.com/libtpu-releases/index.html --find-links https://storage.googleapis.com/jax-releases/jax_nightly_releases.html --find-links https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html -torch==2.9.0.dev20250703 -torchvision==0.24.0.dev20250703 -torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.8.0.dev20250703-cp39-cp39-linux_x86_64.whl ; python_version == "3.9" -torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.8.0.dev20250703-cp310-cp310-linux_x86_64.whl ; python_version == "3.10" -torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.8.0.dev20250703-cp311-cp311-linux_x86_64.whl ; python_version == "3.11" +torch==2.9.0.dev20250711 +torchvision==0.24.0.dev20250711 +torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250711-cp39-cp39-linux_x86_64.whl ; python_version == "3.9" +torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250711-cp310-cp310-linux_x86_64.whl ; python_version == "3.10" +torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250711-cp311-cp311-linux_x86_64.whl ; python_version == "3.11" diff --git a/tests/tpu/test_quantization_accuracy.py b/tests/tpu/test_quantization_accuracy.py index a13cf7064d5..6cefbae4bdd 100644 --- a/tests/tpu/test_quantization_accuracy.py +++ b/tests/tpu/test_quantization_accuracy.py @@ -14,7 +14,7 @@ @dataclass class GSM8KAccuracyTestConfig: model_name: str - excepted_value: float + expected_value: float def get_model_args(self) -> str: return (f"pretrained={self.model_name}," @@ -25,13 +25,13 @@ def get_model_args(self) -> str: ACCURACY_CONFIGS = [ GSM8KAccuracyTestConfig( model_name="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", - excepted_value=0.76), # no bias + expected_value=0.76), # no bias # NOTE(rob): We cannot re-initialize vLLM in the same process for TPU, # so only one of these tests can run in a single call to pytest. As # a follow up, move this into the LM-EVAL section of the CI. 
# GSM8KAccuracyTestConfig( # model_name="neuralmagic/Qwen2-7B-Instruct-quantized.w8a8", - # excepted_value=0.66), # bias in QKV layers + # expected_value=0.66), # bias in QKV layers ] @@ -45,7 +45,7 @@ def test_gsm8k_correctness(config: GSM8KAccuracyTestConfig): batch_size="auto", ) - EXPECTED_VALUE = config.excepted_value + EXPECTED_VALUE = config.expected_value measured_value = results["results"][TASK][FILTER] assert (measured_value - RTOL < EXPECTED_VALUE and measured_value + RTOL > EXPECTED_VALUE diff --git a/tests/v1/tpu/test_basic.py b/tests/v1/tpu/test_basic.py index c0d2192ad81..c8cd099a98c 100644 --- a/tests/v1/tpu/test_basic.py +++ b/tests/v1/tpu/test_basic.py @@ -145,3 +145,35 @@ def test_gemma3_27b_with_text_input_and_tp( for output, answer in zip(vllm_outputs, answers): generated_text = output[1] assert answer in generated_text + + +@pytest.mark.skipif(not current_platform.is_tpu(), + reason="This is a basic test for TPU only") +def test_w8a8_quantization( + vllm_runner: type[VllmRunner], + monkeypatch: pytest.MonkeyPatch, +) -> None: + model = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8" + max_tokens = 5 + tensor_parallel_size = 1 + max_num_seqs = 4 + + prompt = "The next numbers of the sequence " + ", ".join( + str(i) for i in range(1024)) + " are:" + example_prompts = [prompt] + + with monkeypatch.context() as m: + m.setenv("VLLM_USE_V1", "1") + + with vllm_runner( + model, + max_num_batched_tokens=64, + max_model_len=4096, + gpu_memory_utilization=0.7, + max_num_seqs=max_num_seqs, + tensor_parallel_size=tensor_parallel_size) as vllm_model: + vllm_outputs = vllm_model.generate_greedy(example_prompts, + max_tokens) + output = vllm_outputs[0][1] + + assert "1024" in output or "0, 1" in output diff --git a/vllm/model_executor/layers/quantization/kernels/scaled_mm/xla.py b/vllm/model_executor/layers/quantization/kernels/scaled_mm/xla.py index 3de28af40aa..0b931b2d8b8 100644 --- a/vllm/model_executor/layers/quantization/kernels/scaled_mm/xla.py +++ b/vllm/model_executor/layers/quantization/kernels/scaled_mm/xla.py @@ -90,16 +90,15 @@ def apply_weights(self, bias: Optional[torch.Tensor] = None) -> torch.Tensor: w_q, w_s, _, _, _ = self._get_weight_params(layer) - import torch_xla.experimental.xla_quantized_matmul # noqa: F401 - out = torch.ops.xla.quantized_matmul(x, - w_q, - w_s, - zero_point=None, - block_size=-1, - int4_weight=False, - quantize_activation=True) - # `quantized_matmul` output is fp32, cast it down to bf16 for perf - out = out.to(x.dtype) + # Required to register custom ops. + import torch_xla.experimental.custom_kernel # noqa: F401 + out = torch.ops.xla.quantized_matmul_int8( + x, + w_q, + w_s, + quantize_activation=True, + ) + # Explicitly capture control flow to make dynamo happy. 
# https://pytorch.org/docs/main/generated/exportdb/index.html#cond-branch-class-method # noqa: E501 return cond(bias is None, self.no_add_bias, self.add_bias, [out, bias]) From ab2fb1dc6f97563d13ebaf2b442c4b4d7939d22a Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Mon, 14 Jul 2025 23:13:55 -0400 Subject: [PATCH 083/552] [Docs] Add Kuberay to deployment integrations (#20592) Signed-off-by: Ricardo Decal Signed-off-by: x22x22 --- docs/deployment/integrations/kuberay.md | 20 ++++++++++++++++++++ docs/deployment/k8s.md | 1 + 2 files changed, 21 insertions(+) create mode 100644 docs/deployment/integrations/kuberay.md diff --git a/docs/deployment/integrations/kuberay.md b/docs/deployment/integrations/kuberay.md new file mode 100644 index 00000000000..1dcc98024e8 --- /dev/null +++ b/docs/deployment/integrations/kuberay.md @@ -0,0 +1,20 @@ +# KubeRay + +[KubeRay](https://github.com/ray-project/kuberay) provides a Kubernetes-native way to run vLLM workloads on Ray clusters. +A Ray cluster can be declared in YAML, and the operator then handles pod scheduling, networking configuration, restarts, and blue-green deployments — all while preserving the familiar Kubernetes experience. + +## Why KubeRay instead of manual scripts? + +| Feature | Manual scripts | KubeRay | +|---------|-----------------------------------------------------------|---------| +| Cluster bootstrap | Manually SSH into every node and run a script | One command to create or update the whole cluster: `kubectl apply -f cluster.yaml` | +| Autoscaling | Manual | Automatically patches CRDs for adjusting cluster size | +| Upgrades | Tear down & re-create manually | Blue/green deployment updates supported | +| Declarative config | Bash flags & environment variables | Git-ops-friendly YAML CRDs (RayCluster/RayService) | + +Using KubeRay reduces the operational burden and simplifies integration of Ray + vLLM with existing Kubernetes workflows (CI/CD, secrets, storage classes, etc.). + +## Learn more + +* ["Serve a Large Language Model using Ray Serve LLM on Kubernetes"](https://docs.ray.io/en/master/cluster/kubernetes/examples/rayserve-llm-example.html) - An end-to-end example of how to serve a model using vLLM, KubeRay, and Ray Serve. 
+* [KubeRay documentation](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md index 8eb2270ab7c..f244b0858eb 100644 --- a/docs/deployment/k8s.md +++ b/docs/deployment/k8s.md @@ -13,6 +13,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: - [Helm](frameworks/helm.md) - [InftyAI/llmaz](integrations/llmaz.md) - [KServe](integrations/kserve.md) +- [KubeRay](integrations/kuberay.md) - [kubernetes-sigs/lws](frameworks/lws.md) - [meta-llama/llama-stack](integrations/llamastack.md) - [substratusai/kubeai](integrations/kubeai.md) From 921c0fb521d6dade94e20ff52fe18a7e9762fa25 Mon Sep 17 00:00:00 2001 From: Reid <61492567+reidliu41@users.noreply.github.com> Date: Tue, 15 Jul 2025 11:14:23 +0800 Subject: [PATCH 084/552] feat: add image zoom to improve image viewing experience (#20763) Signed-off-by: reidliu41 Signed-off-by: x22x22 --- mkdocs.yaml | 1 + requirements/docs.txt | 1 + 2 files changed, 2 insertions(+) diff --git a/mkdocs.yaml b/mkdocs.yaml index f97aff49073..b392fb160c2 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -61,6 +61,7 @@ plugins: - search - autorefs - awesome-nav + - glightbox # For API reference generation - api-autonav: modules: ["vllm"] diff --git a/requirements/docs.txt b/requirements/docs.txt index ec988d79471..7ea768b9909 100644 --- a/requirements/docs.txt +++ b/requirements/docs.txt @@ -4,6 +4,7 @@ mkdocs-material mkdocstrings-python mkdocs-gen-files mkdocs-awesome-nav +mkdocs-glightbox python-markdown-math regex ruff From e5c12c6545cd99ce095ce443c172ec7aa5466f20 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nicol=C3=B2=20Lucchesi?= Date: Tue, 15 Jul 2025 05:15:15 +0200 Subject: [PATCH 085/552] [CI] Fix flaky `test_streaming_response` test (#20913) Signed-off-by: NickLucche Signed-off-by: x22x22 --- tests/entrypoints/openai/test_transcription_validation.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tests/entrypoints/openai/test_transcription_validation.py b/tests/entrypoints/openai/test_transcription_validation.py index e1d175d9c6e..b46409b0f89 100644 --- a/tests/entrypoints/openai/test_transcription_validation.py +++ b/tests/entrypoints/openai/test_transcription_validation.py @@ -154,7 +154,8 @@ async def post_with_stream(*args, **kwargs): file=winning_call, language="en", temperature=0.0, - extra_body=dict(stream=True)) + extra_body=dict(stream=True), + timeout=30) # Reconstruct from chunks and validate async for chunk in res: # just a chunk @@ -184,7 +185,8 @@ async def post_with_stream(*args, **kwargs): temperature=0.0, extra_body=dict(stream=True, stream_include_usage=True, - stream_continuous_usage_stats=True)) + stream_continuous_usage_stats=True), + timeout=30) final = False continuous = True async for chunk in res: From 80d61826554527d399af693dc63457e905868fa2 Mon Sep 17 00:00:00 2001 From: Ruheena Suhani Shaik Date: Tue, 15 Jul 2025 08:56:08 +0530 Subject: [PATCH 086/552] Enabled BnB NF4 inference on Gaudi (#20172) Signed-off-by: Ruheena Suhani Shaik Signed-off-by: x22x22 --- .../layers/quantization/bitsandbytes.py | 12 ++++++------ .../model_loader/bitsandbytes_loader.py | 14 ++++++++++++-- 2 files changed, 18 insertions(+), 8 deletions(-) diff --git a/vllm/model_executor/layers/quantization/bitsandbytes.py b/vllm/model_executor/layers/quantization/bitsandbytes.py index 92a46ad65cb..a96f3ee5c30 100644 --- a/vllm/model_executor/layers/quantization/bitsandbytes.py +++ b/vllm/model_executor/layers/quantization/bitsandbytes.py @@ -13,6 +13,7 @@ 
from vllm.model_executor.layers.quantization import QuantizationMethods from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) +from vllm.platforms import current_platform from vllm.utils import direct_register_custom_op @@ -390,12 +391,11 @@ def _apply_bnb_4bit_fake( try: - direct_register_custom_op( - op_name="apply_bnb_4bit", - op_func=_apply_bnb_4bit, - mutates_args=["out"], - fake_impl=_apply_bnb_4bit_fake, - ) + direct_register_custom_op(op_name="apply_bnb_4bit", + op_func=_apply_bnb_4bit, + mutates_args=["out"], + fake_impl=_apply_bnb_4bit_fake, + dispatch_key=current_platform.dispatch_key) apply_bnb_4bit = torch.ops.vllm.apply_bnb_4bit except AttributeError as error: diff --git a/vllm/model_executor/model_loader/bitsandbytes_loader.py b/vllm/model_executor/model_loader/bitsandbytes_loader.py index d22b1e7b67d..907bc3c1361 100644 --- a/vllm/model_executor/model_loader/bitsandbytes_loader.py +++ b/vllm/model_executor/model_loader/bitsandbytes_loader.py @@ -199,6 +199,10 @@ def _get_quantized_weights_iterator( if self.pre_quant: if self.load_8bit: + if current_platform.is_hpu(): + raise ValueError( + "currently hpu supports 4bit quantization only") + return self._quantized_8bit_generator( hf_weights_files, use_safetensors, quant_state_dict), quant_state_dict @@ -302,6 +306,10 @@ def _parse_quant_state(param_name: str, in temp_state_dict): quant_state = _parse_quant_state(mapped_weight_name, temp_state_dict) + if current_platform.is_hpu(): + assert quant_state.quant_type == "nf4", ( + "currently hpu supports nf4 quant_type only") + quant_state_dict[mapped_weight_name] = quant_state yield org_weight_name, weight_tensor else: @@ -372,10 +380,12 @@ def _unquantized_generator(self, hf_weights_files, use_safetensors, ...] 
# bitsandbytes requires data in GPU - if weight_sub_tensor.is_cuda: + if (weight_sub_tensor.is_cuda + or weight_sub_tensor.device.type == "hpu"): loaded_weight = weight_sub_tensor else: - loaded_weight = weight_sub_tensor.cuda() + loaded_weight = weight_sub_tensor.to( + device=current_platform.device_type) # remove the following after the issue is fixed: # https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1342 From 40900127ca92578f7cabdca0180f2e658164a3cf Mon Sep 17 00:00:00 2001 From: Pavani Majety Date: Mon, 14 Jul 2025 20:27:50 -0700 Subject: [PATCH 087/552] [Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer (#20934) Signed-off-by: Pavani Majety Signed-off-by: x22x22 --- vllm/engine/arg_utils.py | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index f47499309d8..e2c86158758 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -1418,14 +1418,15 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: and not envs.is_set("VLLM_ATTENTION_BACKEND") ) or envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" supported = False - if current_platform.is_rocm(): + if current_platform.is_rocm() or ( + current_platform.is_cuda() + and current_platform.is_device_capability(100)): supported = True elif fp8_attention and will_use_fa: from vllm.attention.utils.fa_utils import ( flash_attn_supports_fp8) supported = flash_attn_supports_fp8() - elif envs.VLLM_USE_TRTLLM_DECODE_ATTENTION: - supported = True + if not supported: _raise_or_fallback(feature_name="--kv-cache-dtype", recommend_to_remove=False) From 4860cafe85bb0dae9b8392f4da4179dfc4c64c0a Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Tue, 15 Jul 2025 12:56:53 +0800 Subject: [PATCH 088/552] [Doc] Clearer mistral3 and pixtral model support description (#20926) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: x22x22 --- docs/models/supported_models.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 444a65314e6..cbb2236eed5 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -584,14 +584,14 @@ Specified using `--task generate`. | `KeyeForConditionalGeneration` | Keye-VL-8B-Preview | T + IE+ + VE+ | `Kwai-Keye/Keye-VL-8B-Preview` | | | ✅︎ | | `KimiVLForConditionalGeneration` | Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking | T + I+ | `moonshotai/Kimi-VL-A3B-Instruct`, `moonshotai/Kimi-VL-A3B-Thinking` | | | ✅︎ | | `Llama4ForConditionalGeneration` | Llama 4 | T + I+ | `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc. | | ✅︎ | ✅︎ | -| `LlavaForConditionalGeneration` | LLaVA-1.5 | T + IE+ | `llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc. | | ✅︎ | ✅︎ | +| `LlavaForConditionalGeneration` | LLaVA-1.5, Pixtral (HF Transformers) | T + IE+ | `llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), `mistral-community/pixtral-12b`, etc. | | ✅︎ | ✅︎ | | `LlavaNextForConditionalGeneration` | LLaVA-NeXT | T + IE+ | `llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc. | | ✅︎ | ✅︎ | | `LlavaNextVideoForConditionalGeneration` | LLaVA-NeXT-Video | T + V | `llava-hf/LLaVA-NeXT-Video-7B-hf`, etc. 
| | ✅︎ | ✅︎ | | `LlavaOnevisionForConditionalGeneration` | LLaVA-Onevision | T + I+ + V+ | `llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc. | | ✅︎ | ✅︎ | | `MiniCPMO` | MiniCPM-O | T + IE+ + VE+ + AE+ | `openbmb/MiniCPM-o-2_6`, etc. | ✅︎ | ✅︎ | ✅︎ | | `MiniCPMV` | MiniCPM-V | T + IE+ + VE+ | `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc. | ✅︎ | | ✅︎ | | `MiniMaxVL01ForConditionalGeneration` | MiniMax-VL | T + IE+ | `MiniMaxAI/MiniMax-VL-01`, etc. | | ✅︎ | ✅︎ | -| `Mistral3ForConditionalGeneration` | Mistral3 | T + I+ | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `Mistral3ForConditionalGeneration` | Mistral3 (HF Transformers) | T + I+ | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc. | ✅︎ | ✅︎ | ✅︎ | | `MllamaForConditionalGeneration` | Llama 3.2 | T + I+ | `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc. | | | | | `MolmoForCausalLM` | Molmo | T + I+ | `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc. | ✅︎ | ✅︎ | ✅︎ | | `NVLM_D_Model` | NVLM-D 1.0 | T + I+ | `nvidia/NVLM-D-72B`, etc. | | ✅︎ | ✅︎ | @@ -599,7 +599,7 @@ Specified using `--task generate`. | `PaliGemmaForConditionalGeneration` | PaliGemma, PaliGemma 2 | T + IE | `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc. | | ✅︎ | ⚠️ | | `Phi3VForCausalLM` | Phi-3-Vision, Phi-3.5-Vision | T + IE+ | `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc. | | ✅︎ | ✅︎ | | `Phi4MMForCausalLM` | Phi-4-multimodal | T + I+ / T + A+ / I+ + A+ | `microsoft/Phi-4-multimodal-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | -| `PixtralForConditionalGeneration` | Pixtral | T + I+ | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, `mistral-community/pixtral-12b`, etc. | | ✅︎ | ✅︎ | +| `PixtralForConditionalGeneration` | Mistral 3 (Mistral format), Pixtral (Mistral format) | T + I+ | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, `mistralai/Pixtral-12B-2409`, etc. | | ✅︎ | ✅︎ | | `QwenVLForConditionalGeneration`^ | Qwen-VL | T + IE+ | `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen2AudioForConditionalGeneration` | Qwen2-Audio | T + A+ | `Qwen/Qwen2-Audio-7B-Instruct` | | ✅︎ | ✅︎ | | `Qwen2VLForConditionalGeneration` | QVQ, Qwen2-VL | T + IE+ + VE+ | `Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc. 
| ✅︎ | ✅︎ | ✅︎ | From 82c041373e03096f9b1b934029ada5d053bd2dce Mon Sep 17 00:00:00 2001 From: Boyuan Feng Date: Mon, 14 Jul 2025 22:02:17 -0700 Subject: [PATCH 089/552] [cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir (#20940) Signed-off-by: Boyuan Feng Signed-off-by: x22x22 --- vllm/compilation/wrapper.py | 22 +++++++--------------- vllm/envs.py | 6 ------ 2 files changed, 7 insertions(+), 21 deletions(-) diff --git a/vllm/compilation/wrapper.py b/vllm/compilation/wrapper.py index 4fd00f0c75b..8d5df1061ed 100644 --- a/vllm/compilation/wrapper.py +++ b/vllm/compilation/wrapper.py @@ -93,27 +93,19 @@ def bytecode_hook(self, old_code: CodeType, new_code: CodeType): return self.compiled_codes.append(new_code) - local_cache_dir = self.vllm_config.compilation_config.local_cache_dir - if isinstance(local_cache_dir, str): - decompiled_file_name = ("transformed_code.py" - if envs.VLLM_COMPILE_DEPYF else - "transformed_code_README.txt") - - decompiled_file = os.path.join(local_cache_dir, - decompiled_file_name) + debug_dump_dir = self.vllm_config.compilation_config.debug_dump_path + if isinstance(debug_dump_dir, str) and debug_dump_dir != "": + rank = self.vllm_config.parallel_config.rank + decompiled_file = os.path.join(debug_dump_dir, f"rank_{rank}", + "transformed_code.py") if not os.path.exists(decompiled_file): try: # usually the decompilation will succeed for most models, # as we guarantee a full-graph compilation in Dynamo. # but there's no 100% guarantee, since decompliation is # not a reversible process. - if envs.VLLM_COMPILE_DEPYF: - import depyf - src = depyf.decompile(new_code) - else: - src = ( - "To get a transformed_code.py file, re-run with " - "VLLM_COMPILE_DEPYF=1") + import depyf + src = depyf.decompile(new_code) with open(decompiled_file, "w") as f: f.write(src) diff --git a/vllm/envs.py b/vllm/envs.py index 7fd5abed700..7bff6ade815 100644 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -97,7 +97,6 @@ VLLM_ENABLE_V1_MULTIPROCESSING: bool = True VLLM_LOG_BATCHSIZE_INTERVAL: float = -1 VLLM_DISABLE_COMPILE_CACHE: bool = False - VLLM_COMPILE_DEPYF: bool = False Q_SCALE_CONSTANT: int = 200 K_SCALE_CONSTANT: int = 200 V_SCALE_CONSTANT: int = 100 @@ -742,11 +741,6 @@ def get_vllm_port() -> Optional[int]: "VLLM_DISABLE_COMPILE_CACHE": lambda: bool(int(os.getenv("VLLM_DISABLE_COMPILE_CACHE", "0"))), - # If set, vllm will decompile the torch compiled code and dump to - # transformed_code.py. This is useful for debugging. - "VLLM_COMPILE_DEPYF": - lambda: bool(int(os.getenv("VLLM_COMPILE_DEPYF", "0"))), - # If set, vllm will run in development mode, which will enable # some additional endpoints for developing and debugging, # e.g. 
`/reset_prefix_cache` From b098f9ffd71deee5ff3e7c9124a2e0fea7ec0143 Mon Sep 17 00:00:00 2001 From: Jennifer He Date: Tue, 15 Jul 2025 01:34:24 -0400 Subject: [PATCH 090/552] [Model] Add AutoWeightsLoader support for BERT, RoBERTa (#20534) Signed-off-by: Jennifer He Signed-off-by: Signed-off-by: Jen H Signed-off-by: x22x22 --- vllm/model_executor/models/bert.py | 85 ++++++++++++--------------- vllm/model_executor/models/roberta.py | 74 +++++++---------------- 2 files changed, 59 insertions(+), 100 deletions(-) diff --git a/vllm/model_executor/models/bert.py b/vllm/model_executor/models/bert.py index 6e955e1c512..a43803ed433 100644 --- a/vllm/model_executor/models/bert.py +++ b/vllm/model_executor/models/bert.py @@ -22,12 +22,11 @@ from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) -from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.sequence import IntermediateTensors, PoolerOutput from .interfaces import SupportsCrossEncoding, SupportsQuant, SupportsV0Only -from .utils import WeightsMapper, maybe_prefix +from .utils import AutoWeightsLoader, WeightsMapper, maybe_prefix class BertEmbedding(nn.Module): @@ -44,9 +43,11 @@ def __init__(self, config: BertConfig): config.type_vocab_size, config.hidden_size) self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) - self.position_ids = nn.Parameter( - torch.empty((1, config.max_position_embeddings)), ) + self.register_buffer( + "position_ids", + torch.arange(config.max_position_embeddings).unsqueeze(0), + ) self.position_embedding_type = config.position_embedding_type if self.position_embedding_type != "absolute": raise ValueError("Only 'absolute' position_embedding_type" + @@ -358,45 +359,45 @@ def load_weights(self, weights: Iterable[tuple[str, ("qkv_proj", "value", "v"), ] + loaded_stacked_params = [] + other_weights = [] params_dict = dict(self.named_parameters()) - loaded_params: set[str] = set() for name, loaded_weight in weights: - if self.pooler is None and "pooler" in name: - continue for (param_name, weight_name, shard_id) in stacked_params_mapping: if weight_name not in name: continue + name = name.replace(weight_name, param_name) - # Skip loading extra bias for GPTQ models. - if name.endswith(".bias") and name not in params_dict: + if name not in params_dict: continue param = params_dict[name] weight_loader = param.weight_loader weight_loader(param, loaded_weight, shard_id) + loaded_stacked_params.append(name) break else: - # Skip loading extra bias for GPTQ models. - if name.endswith(".bias") and name not in params_dict: - continue - param = params_dict[name] - weight_loader = getattr(param, "weight_loader", - default_weight_loader) - weight_loader(param, loaded_weight) - loaded_params.add(name) + if name in params_dict: + other_weights.append((name, loaded_weight)) + + loader = AutoWeightsLoader( + self, + skip_prefixes=(["pooler."] if self.pooler is None else []), + ) + loaded_params = loader.load_weights(other_weights) + loaded_params.update(loaded_stacked_params) return loaded_params class BertEmbeddingModel(nn.Module, SupportsV0Only, SupportsQuant): """A model that uses Bert to provide embedding functionalities. - This class encapsulates the BertModel and provides an interface for - embedding operations and customized pooling functions. 
+ This class encapsulates the BertModel and provides an interface for + embedding operations and customized pooling functions. - Attributes: - model: An instance of BertModel used for forward operations. - _pooler: An instance of Pooler used for pooling operations. - """ - hf_to_vllm_mapper = WeightsMapper(orig_to_new_prefix={"model.": ""}) + Attributes: + model: An instance of BertModel used for forward operations. + _pooler: An instance of Pooler used for pooling operations. + """ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() @@ -425,10 +426,15 @@ def pooler( return self._pooler(hidden_states, pooling_metadata) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): - weights = self.hf_to_vllm_mapper.apply(weights) - weights = ((name, data) for name, data in weights - if not name.startswith("lm_head.")) - self.model.load_weights(weights) + weights_list = list(weights) + + has_model_prefix = any( + name.startswith("model.") for name, _ in weights_list) + if not has_model_prefix: + mapper = WeightsMapper(orig_to_new_prefix={"": "model."}) + + loader = AutoWeightsLoader(self, skip_prefixes=["lm_head."]) + return loader.load_weights(weights_list, mapper=mapper) def _build_model(self, vllm_config: VllmConfig, @@ -470,26 +476,9 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.classifier, self.bert.pooler) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): - - self_weights = [] - - def weight_filter(): - for name, weight in weights: - if name.startswith("bert."): - yield (name[len("bert."):], weight) - else: - self_weights.append((name, weight)) - - self.bert.load_weights(weight_filter()) - - params_dict = dict(self.named_parameters()) - - for name, loaded_weight in self_weights: - if name.startswith("classifier"): - param = params_dict[name] - weight_loader = getattr(param, "weight_loader", - default_weight_loader) - weight_loader(param, loaded_weight) + loader = AutoWeightsLoader(self) + loaded_params = loader.load_weights(weights) + return loaded_params def pooler( self, diff --git a/vllm/model_executor/models/roberta.py b/vllm/model_executor/models/roberta.py index 048fa827fb2..1d3a23a5e54 100644 --- a/vllm/model_executor/models/roberta.py +++ b/vllm/model_executor/models/roberta.py @@ -1,7 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import itertools from collections.abc import Iterable from typing import Optional, Union @@ -13,9 +12,9 @@ from vllm.model_executor.layers.pooler import ClassifierPooler from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) -from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.models.bert import BertEmbeddingModel, BertModel -from vllm.model_executor.models.utils import WeightsMapper, maybe_prefix +from vllm.model_executor.models.utils import (AutoWeightsLoader, WeightsMapper, + maybe_prefix) from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.sequence import IntermediateTensors, PoolerOutput @@ -39,8 +38,10 @@ def __init__(self, config: RobertaConfig): config.hidden_size) self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) - self.position_ids = nn.Parameter( - torch.empty((1, config.max_position_embeddings)), ) + self.register_buffer( + "position_ids", + torch.arange(config.max_position_embeddings).unsqueeze(0), + ) self.position_embedding_type = 
config.position_embedding_type if self.position_embedding_type != "absolute": @@ -136,16 +137,20 @@ def _build_model(self, embedding_class=RobertaEmbedding) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): - weights = self.hf_to_vllm_mapper.apply(weights) - # Separate weights in "roberta"-prefixed and all else (not in memory). - # For use with models like FacebookAI/roberta-base. - bert_weights, task_weights = roberta_task_weights_filter(weights) - loaded = self.model.load_weights(bert_weights) - if not len(loaded): - # Fix for models like `sentence-transformers/stsb-roberta-base-v2` - # which use the same architecture, but have no "roberta" prefix. - loaded = self.model.load_weights(task_weights) - assert len(loaded), "Unable to load RobertaEmbeddingModel" + weights_list = list(weights) + has_roberta_prefix = any( + name.startswith("roberta.") for name, _ in weights_list) + if has_roberta_prefix: + # For models with the `roberta.` prefix e.g. + # `FacebookAI/roberta-base` + mapper = WeightsMapper(orig_to_new_prefix={"roberta.": "model."}) + else: + # For models without the `roberta.` prefix e.g. + # `sentence-transformers/stsb-roberta-base-v2` + mapper = WeightsMapper(orig_to_new_prefix={"": "model."}) + + loader = AutoWeightsLoader(self, skip_prefixes=["lm_head."]) + return loader.load_weights(weights_list, mapper=mapper) class RobertaForSequenceClassification(nn.Module, SupportsCrossEncoding, @@ -187,19 +192,8 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.classifier) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): - bert_weights, task_weights = roberta_task_weights_filter(weights) - bert_weights = self.jina_to_vllm_mapper.apply(bert_weights) - - self.roberta.load_weights(bert_weights) - - params_dict = dict(self.named_parameters()) - - for name, loaded_weight in task_weights: - if name.startswith("classifier"): - param = params_dict[name] - weight_loader = getattr(param, "weight_loader", - default_weight_loader) - weight_loader(param, loaded_weight) + loader = AutoWeightsLoader(self) + return loader.load_weights(weights, mapper=self.jina_to_vllm_mapper) def pooler( self, @@ -245,27 +239,3 @@ def create_position_ids_from_input_ids(input_ids, past_key_values_length) * mask return incremental_indices.long() + padding_idx - - -def roberta_task_weights_filter( - all_weights: Iterable[tuple[str, torch.Tensor]] -) -> tuple[Iterable[tuple[str, torch.Tensor]], Iterable[tuple[str, - torch.Tensor]]]: - """ - Separate task-specific weights that are applied on top - of the encoder-decoder bert base. - To do so, return two generators over the original iterator. - Also, remove the "roberta." prefix to make it loadable - from vanilla BertModel. - """ - # Copy of a lazy iterator without in-memory overhead so both - # iterators can be iterated upon independently. 
- all_weights1, all_weights2 = itertools.tee(all_weights) - - def encoder_decoder_weights(): - for name, weight in all_weights1: - if name.startswith("roberta."): - yield (name[len("roberta."):], weight) - - return encoder_decoder_weights(), ((n, w) for n, w in all_weights2 - if not n.startswith("roberta.")) From 1ec5ce2b82adb28151da3d8a36dd285b6df0b2b4 Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Mon, 14 Jul 2025 23:01:46 -0700 Subject: [PATCH 091/552] Implement Async Scheduling (#19970) Signed-off-by: Woosuk Kwon Signed-off-by: x22x22 --- tests/v1/core/__init__.py | 0 tests/v1/core/test_async_scheduler.py | 228 +++++++++++++++++++ tests/v1/core/test_scheduler.py | 128 +---------- tests/v1/core/utils.py | 152 +++++++++++++ vllm/config.py | 11 + vllm/engine/arg_utils.py | 25 ++ vllm/v1/core/sched/async_scheduler.py | 47 ++++ vllm/v1/core/sched/scheduler.py | 60 +++-- vllm/v1/executor/multiproc_executor.py | 2 + vllm/v1/executor/ray_distributed_executor.py | 2 + vllm/v1/request.py | 1 + 11 files changed, 508 insertions(+), 148 deletions(-) create mode 100644 tests/v1/core/__init__.py create mode 100644 tests/v1/core/test_async_scheduler.py create mode 100644 tests/v1/core/utils.py create mode 100644 vllm/v1/core/sched/async_scheduler.py diff --git a/tests/v1/core/__init__.py b/tests/v1/core/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/v1/core/test_async_scheduler.py b/tests/v1/core/test_async_scheduler.py new file mode 100644 index 00000000000..3ccefbd81ca --- /dev/null +++ b/tests/v1/core/test_async_scheduler.py @@ -0,0 +1,228 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from collections import deque + +import pytest + +from vllm.v1.core.sched.output import SchedulerOutput +from vllm.v1.outputs import ModelRunnerOutput +from vllm.v1.request import RequestStatus + +from .utils import create_requests, create_scheduler + + +def _make_model_runner_output( + scheduler_output: SchedulerOutput, ) -> ModelRunnerOutput: + req_ids = list(scheduler_output.num_scheduled_tokens.keys()) + return ModelRunnerOutput( + req_ids=req_ids, + req_id_to_index={ + req_id: i + for i, req_id in enumerate(req_ids) + }, + sampled_token_ids=[[i] for i in range(len(req_ids))], + spec_token_ids=None, + logprobs=None, + prompt_logprobs_dict={}, + pooler_output=[], + ) + + +@pytest.mark.parametrize("max_tokens", [1, 2, 3, 5]) +def test_stop_by_max_tokens(max_tokens: int): + scheduler = create_scheduler(async_scheduling=True) + requests = create_requests(num_requests=2, max_tokens=max_tokens) + req0, req1 = requests + + sched_outputs: deque[SchedulerOutput] = deque() + scheduler.add_request(req0) + sched_outputs.append(scheduler.schedule()) + + scheduler.add_request(req1) + sched_outputs.append(scheduler.schedule()) + + while sched_outputs: + sched_output = sched_outputs.popleft() + model_runner_output = _make_model_runner_output(sched_output) + scheduler.update_from_output(sched_output, model_runner_output) + + sched_output = scheduler.schedule() + if sched_output.num_scheduled_tokens: + sched_outputs.append(sched_output) + + assert scheduler.get_num_unfinished_requests() == 0 + assert req0.num_output_tokens == max_tokens + assert req1.num_output_tokens == max_tokens + + +def test_abort(): + scheduler = create_scheduler(async_scheduling=True) + requests = create_requests(num_requests=10, max_tokens=20) + + for req in requests: + scheduler.add_request(req) + + sched_outputs: deque[SchedulerOutput] = deque() + 
sched_outputs.append(scheduler.schedule()) + sched_outputs.append(scheduler.schedule()) + + abort_order = [0, 8, 3, 1, 6, 4, 2, 5, 7, 9] + abort_order_copy = abort_order.copy() + + def abort_request(): + if not abort_order: + return + req = requests[abort_order.pop(0)] + scheduler.finish_requests(req.request_id, + RequestStatus.FINISHED_ABORTED) + + while sched_outputs: + # Abort a scheduled request. + abort_request() + sched_output = sched_outputs.popleft() + model_runner_output = _make_model_runner_output(sched_output) + scheduler.update_from_output(sched_output, model_runner_output) + + sched_output = scheduler.schedule() + if sched_output.num_scheduled_tokens: + sched_outputs.append(sched_output) + + for i, req in enumerate(requests): + assert req.status == RequestStatus.FINISHED_ABORTED + assert req.num_output_tokens == abort_order_copy.index(i) + + +def test_preempt(): + scheduler = create_scheduler(async_scheduling=True) + requests = create_requests(num_requests=10, max_tokens=20) + + for req in requests: + scheduler.add_request(req) + + sched_outputs: deque[SchedulerOutput] = deque() + sched_outputs.append(scheduler.schedule()) + sched_outputs.append(scheduler.schedule()) + + abort_order = [0, 8, 3, 1, 6, 4, 2, 5, 7, 9] + abort_order_copy = abort_order.copy() + + def abort_request(): + if not abort_order: + return + req = requests[abort_order.pop(0)] + scheduler.finish_requests(req.request_id, + RequestStatus.FINISHED_ABORTED) + + while sched_outputs: + # Abort a scheduled request. + abort_request() + sched_output = sched_outputs.popleft() + model_runner_output = _make_model_runner_output(sched_output) + scheduler.update_from_output(sched_output, model_runner_output) + + sched_output = scheduler.schedule() + if sched_output.num_scheduled_tokens: + sched_outputs.append(sched_output) + + for i, req in enumerate(requests): + assert req.status == RequestStatus.FINISHED_ABORTED + assert req.num_output_tokens == abort_order_copy.index(i) + + +def test_prefix_caching_for_prefill_dedup(): + CHUNK_SIZE = 1000 + BLOCK_SIZE = 16 + num_prompt_tokens = 100 + scheduler = create_scheduler(async_scheduling=True, + max_num_batched_tokens=CHUNK_SIZE, + enable_prefix_caching=True, + block_size=BLOCK_SIZE) + requests = create_requests(num_requests=5, + num_tokens=num_prompt_tokens, + max_tokens=3, + same_prompt=True) + requests_copy = requests.copy() + + # Two requests with the same prompt. + req0 = requests.pop(0) + req1 = requests.pop(0) + scheduler.add_request(req0) + scheduler.add_request(req1) + + sched_outputs: deque[SchedulerOutput] = deque() + sched_output = scheduler.schedule() + sched_outputs.append(sched_output) + # Make sure prefix caching de-duplicates the prompts in the same step, + # so all the blocks except the last are shared between the two requests. + assert len(sched_output.num_scheduled_tokens) == 2 + num_blocks = num_prompt_tokens // BLOCK_SIZE + assert req0.num_cached_tokens == 0 + assert req1.num_cached_tokens >= num_blocks * BLOCK_SIZE + + sched_outputs.append(scheduler.schedule()) + while sched_outputs: + if requests: + scheduler.add_request(requests.pop(0)) + sched_output = sched_outputs.popleft() + model_runner_output = _make_model_runner_output(sched_output) + scheduler.update_from_output(sched_output, model_runner_output) + sched_output = scheduler.schedule() + if sched_output.num_scheduled_tokens: + sched_outputs.append(sched_output) + + # Other requests scheduled after the two requests should also get + # prefix cache hit. 
+ assert scheduler.get_num_unfinished_requests() == 0 + for req in requests_copy[1:]: + assert req.num_cached_tokens >= num_blocks * BLOCK_SIZE + + +def test_prefix_caching_for_multi_turn(): + CHUNK_SIZE = 1000 + BLOCK_SIZE = 16 + num_prompt_tokens = 100 + num_output_tokens = 200 + scheduler = create_scheduler(async_scheduling=True, + max_num_batched_tokens=CHUNK_SIZE, + enable_prefix_caching=True, + block_size=BLOCK_SIZE) + requests = create_requests(num_requests=5, + num_tokens=num_prompt_tokens, + max_tokens=num_output_tokens) + + for req in requests: + scheduler.add_request(req) + sched_outputs: deque[SchedulerOutput] = deque() + sched_outputs.append(scheduler.schedule()) + sched_outputs.append(scheduler.schedule()) + + # Process the requests. + while sched_outputs: + sched_output = sched_outputs.popleft() + model_runner_output = _make_model_runner_output(sched_output) + scheduler.update_from_output(sched_output, model_runner_output) + sched_output = scheduler.schedule() + if sched_output.num_scheduled_tokens: + sched_outputs.append(sched_output) + assert scheduler.get_num_unfinished_requests() == 0 + + # Create next-turn requests whose prompts are the full output of the + # previous turn. + next_turn_requests = create_requests( + num_requests=5, + num_tokens=num_prompt_tokens + num_output_tokens, + max_tokens=num_output_tokens, + ) + for i, req in enumerate(next_turn_requests): + req.prompt_token_ids = (requests[i].prompt_token_ids + + list(requests[i].output_token_ids)) + # Schedule the next-turn requests. + for req in next_turn_requests: + scheduler.add_request(req) + sched_outputs.append(scheduler.schedule()) + + # Make sure the next-turn requests get prefix cache hit by the previous + # requests. + for req in next_turn_requests: + assert (req.num_cached_tokens == req.num_prompt_tokens // BLOCK_SIZE * + BLOCK_SIZE) diff --git a/tests/v1/core/test_scheduler.py b/tests/v1/core/test_scheduler.py index 2d3657b334b..a858a4d8c82 100644 --- a/tests/v1/core/test_scheduler.py +++ b/tests/v1/core/test_scheduler.py @@ -19,133 +19,7 @@ from vllm.v1.structured_output import StructuredOutputManager from vllm.v1.structured_output.request import StructuredOutputRequest -EOS_TOKEN_ID = 50256 - - -def create_scheduler( - model: str = "facebook/opt-125m", - max_num_seqs: int = 16, - max_num_batched_tokens: int = 8192, - enable_prefix_caching: Optional[bool] = None, - long_prefill_token_threshold: int = 0, - disable_chunked_mm_input: bool = False, - use_kv_connector: bool = False, - num_blocks: int = 10000, - block_size: int = 16, - max_model_len: Optional[int] = None, - num_speculative_tokens: Optional[int] = None, - skip_tokenizer_init: bool = False, -) -> Scheduler: - '''Create scheduler under test. 
- - Args: - model: model under test - max_num_seqs: max sequences to schedule - max_num_batch_tokens: max num tokens to batch - enable_prefix_caching: optionally force APC config - (True/False) or use default - (None) - - Returns: - {class}`Scheduler` instance - ''' - if max_model_len is None: - max_model_len = max_num_batched_tokens - scheduler_config = SchedulerConfig( - max_num_seqs=max_num_seqs, - max_num_batched_tokens=max_num_batched_tokens, - max_model_len=max_model_len, - long_prefill_token_threshold=long_prefill_token_threshold, - disable_chunked_mm_input=disable_chunked_mm_input, - enable_chunked_prefill=True, - ) - model_config = ModelConfig( - model=model, - task="auto", - tokenizer=model, - tokenizer_mode="auto", - trust_remote_code=True, - dtype="float16", - seed=42, - skip_tokenizer_init=skip_tokenizer_init, - ) - # Cache config, optionally force APC - kwargs_cache = ({} if enable_prefix_caching is None else { - 'enable_prefix_caching': enable_prefix_caching - }) - cache_config = CacheConfig( - block_size=block_size, - gpu_memory_utilization=0.9, - swap_space=0, - cache_dtype="auto", - **kwargs_cache, - ) - kv_transfer_config = KVTransferConfig( - kv_connector="SharedStorageConnector", - kv_role="kv_both", - kv_connector_extra_config={"shared_storage_path": "local_storage"}, - ) if use_kv_connector else None - - speculative_config: Optional[SpeculativeConfig] = None - if num_speculative_tokens is not None: - speculative_config = SpeculativeConfig( - model="ngram", num_speculative_tokens=num_speculative_tokens) - - vllm_config = VllmConfig( - scheduler_config=scheduler_config, - model_config=model_config, - cache_config=cache_config, - kv_transfer_config=kv_transfer_config, - speculative_config=speculative_config, - ) - kv_cache_config = KVCacheConfig( - num_blocks=num_blocks, # A large number of blocks to hold all requests - kv_cache_tensors=[], - kv_cache_groups=[ - KVCacheGroupSpec(['layer'], - FullAttentionSpec(block_size, 1, 1, torch.float32, - False)) - ], - ) - cache_config.num_gpu_blocks = num_blocks - return Scheduler( - vllm_config=vllm_config, - kv_cache_config=kv_cache_config, - log_stats=True, - structured_output_manager=StructuredOutputManager(vllm_config), - ) - - -def create_requests(num_requests: int, - num_tokens: int = 10, - mm_positions: Optional[list[PlaceholderRange]] = None, - max_tokens: int = 16, - stop_token_ids: Optional[list[int]] = None, - prompt_logprobs: Optional[int] = None): - sampling_params = SamplingParams(ignore_eos=False, - max_tokens=max_tokens, - stop_token_ids=stop_token_ids, - prompt_logprobs=prompt_logprobs) - requests = [] - for i in range(num_requests): - if mm_positions is not None: - mm_position = mm_positions[i] - mm_inputs = [MultiModalKwargs({})] * len(mm_position) - else: - mm_position = None - mm_inputs = None - request = Request( - request_id=f"{i}", - prompt_token_ids=[i] * num_tokens, - sampling_params=sampling_params, - pooling_params=None, - multi_modal_inputs=mm_inputs, - multi_modal_placeholders=mm_position, - multi_modal_hashes=None, - eos_token_id=EOS_TOKEN_ID, - ) - requests.append(request) - return requests +from .utils import EOS_TOKEN_ID, create_requests, create_scheduler def test_add_requests(): diff --git a/tests/v1/core/utils.py b/tests/v1/core/utils.py new file mode 100644 index 00000000000..0b7d8251b64 --- /dev/null +++ b/tests/v1/core/utils.py @@ -0,0 +1,152 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from typing import Optional, Union + 
+import torch + +from vllm.config import (CacheConfig, KVTransferConfig, ModelConfig, + SchedulerConfig, SpeculativeConfig, VllmConfig) +from vllm.multimodal.inputs import MultiModalKwargs, PlaceholderRange +from vllm.sampling_params import SamplingParams +from vllm.v1.core.sched.async_scheduler import AsyncScheduler +from vllm.v1.core.sched.scheduler import Scheduler +from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheConfig, + KVCacheGroupSpec) +from vllm.v1.request import Request +from vllm.v1.structured_output import StructuredOutputManager + +EOS_TOKEN_ID = 50256 + + +def create_scheduler( + model: str = "facebook/opt-125m", + max_num_seqs: int = 16, + max_num_batched_tokens: int = 8192, + enable_prefix_caching: Optional[bool] = None, + long_prefill_token_threshold: int = 0, + disable_chunked_mm_input: bool = False, + use_kv_connector: bool = False, + num_blocks: int = 10000, + block_size: int = 16, + max_model_len: Optional[int] = None, + num_speculative_tokens: Optional[int] = None, + skip_tokenizer_init: bool = False, + async_scheduling: bool = False, +) -> Union[Scheduler, AsyncScheduler]: + '''Create scheduler under test. + + Args: + model: model under test + max_num_seqs: max sequences to schedule + max_num_batch_tokens: max num tokens to batch + enable_prefix_caching: optionally force APC config + (True/False) or use default + (None) + + Returns: + {class}`Scheduler` instance + ''' + if max_model_len is None: + max_model_len = max_num_batched_tokens + scheduler_config = SchedulerConfig( + max_num_seqs=max_num_seqs, + max_num_batched_tokens=max_num_batched_tokens, + max_model_len=max_model_len, + long_prefill_token_threshold=long_prefill_token_threshold, + disable_chunked_mm_input=disable_chunked_mm_input, + enable_chunked_prefill=True, + async_scheduling=async_scheduling, + ) + model_config = ModelConfig( + model=model, + task="auto", + tokenizer=model, + tokenizer_mode="auto", + trust_remote_code=True, + dtype="float16", + seed=42, + skip_tokenizer_init=skip_tokenizer_init, + ) + # Cache config, optionally force APC + kwargs_cache = ({} if enable_prefix_caching is None else { + 'enable_prefix_caching': enable_prefix_caching + }) + cache_config = CacheConfig( + block_size=block_size, + gpu_memory_utilization=0.9, + swap_space=0, + cache_dtype="auto", + **kwargs_cache, + ) + kv_transfer_config = KVTransferConfig( + kv_connector="SharedStorageConnector", + kv_role="kv_both", + kv_connector_extra_config={"shared_storage_path": "local_storage"}, + ) if use_kv_connector else None + + speculative_config: Optional[SpeculativeConfig] = None + if num_speculative_tokens is not None: + speculative_config = SpeculativeConfig( + model="ngram", num_speculative_tokens=num_speculative_tokens) + + vllm_config = VllmConfig( + scheduler_config=scheduler_config, + model_config=model_config, + cache_config=cache_config, + kv_transfer_config=kv_transfer_config, + speculative_config=speculative_config, + ) + kv_cache_config = KVCacheConfig( + num_blocks=num_blocks, # A large number of blocks to hold all requests + kv_cache_tensors=[], + kv_cache_groups=[ + KVCacheGroupSpec(['layer'], + FullAttentionSpec(block_size, 1, 1, torch.float32, + False)) + ], + ) + cache_config.num_gpu_blocks = num_blocks + scheduler_cls = AsyncScheduler if async_scheduling else Scheduler + return scheduler_cls( + vllm_config=vllm_config, + kv_cache_config=kv_cache_config, + log_stats=True, + structured_output_manager=StructuredOutputManager(vllm_config), + ) + + +def create_requests( + num_requests: int, + 
num_tokens: int = 10, + mm_positions: Optional[list[PlaceholderRange]] = None, + max_tokens: int = 16, + stop_token_ids: Optional[list[int]] = None, + prompt_logprobs: Optional[int] = None, + same_prompt: bool = False, +) -> list[Request]: + sampling_params = SamplingParams(ignore_eos=False, + max_tokens=max_tokens, + stop_token_ids=stop_token_ids, + prompt_logprobs=prompt_logprobs) + requests = [] + for i in range(num_requests): + if mm_positions is not None: + mm_position = mm_positions[i] + mm_inputs = [MultiModalKwargs({})] * len(mm_position) + else: + mm_position = None + mm_inputs = None + prompt_token_ids = ([0] * num_tokens if same_prompt else [i] * + num_tokens) + request = Request( + request_id=f"{i}", + prompt_token_ids=prompt_token_ids, + sampling_params=sampling_params, + pooling_params=None, + multi_modal_inputs=mm_inputs, + multi_modal_placeholders=mm_position, + multi_modal_hashes=None, + eos_token_id=EOS_TOKEN_ID, + ) + requests.append(request) + return requests diff --git a/vllm/config.py b/vllm/config.py index ce81fea2d64..70b023a5d23 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -2308,6 +2308,13 @@ class SchedulerConfig: like full attention and sliding window attention. """ + async_scheduling: bool = False + """EXPERIMENTAL: If set to True, perform async scheduling. This may help + reduce the CPU overheads, leading to better latency and throughput. However, + async scheduling is currently not supported with some features such as + structured outputs, speculative decoding, and pipeline parallelism. + """ + def compute_hash(self) -> str: """ WARNING: Whenever a new field is added to this config, @@ -2401,6 +2408,10 @@ def __post_init__(self) -> None: if not self.cuda_graph_sizes: self.cuda_graph_sizes = [min(self.max_num_seqs * 2, 512)] + if self.async_scheduling: + self.scheduler_cls = ( + "vllm.v1.core.sched.async_scheduler.AsyncScheduler") + @model_validator(mode='after') def _verify_args(self) -> Self: if (self.max_num_batched_tokens < self.max_model_len diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index e2c86158758..269477c4848 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -484,6 +484,8 @@ class EngineArgs: enable_multimodal_encoder_data_parallel: bool = \ ParallelConfig.enable_multimodal_encoder_data_parallel + async_scheduling: bool = SchedulerConfig.async_scheduling + def __post_init__(self): # support `EngineArgs(compilation_config={...})` # without having to manually construct a @@ -921,6 +923,8 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: scheduler_group.add_argument( "--disable-hybrid-kv-cache-manager", **scheduler_kwargs["disable_hybrid_kv_cache_manager"]) + scheduler_group.add_argument("--async-scheduling", + **scheduler_kwargs["async_scheduling"]) # vLLM arguments vllm_kwargs = get_kwargs(VllmConfig) @@ -1206,6 +1210,26 @@ def create_engine_config( self.data_parallel_rpc_port is not None) else ParallelConfig.data_parallel_rpc_port + if self.async_scheduling: + # Async scheduling does not work with the uniprocess backend. 
+ if self.distributed_executor_backend is None: + self.distributed_executor_backend = "mp" + logger.info("Using mp-based distributed executor backend " + "for async scheduling.") + if self.distributed_executor_backend == "uni": + raise ValueError("Async scheduling is not supported with " + "uni-process backend.") + if self.pipeline_parallel_size > 1: + raise ValueError("Async scheduling is not supported with " + "pipeline-parallel-size > 1.") + + # Currently, async scheduling does not support speculative decoding. + # TODO(woosuk): Support it. + if self.speculative_config is not None: + raise ValueError( + "Currently, speculative decoding is not supported with " + "async scheduling.") + parallel_config = ParallelConfig( pipeline_parallel_size=self.pipeline_parallel_size, tensor_parallel_size=self.tensor_parallel_size, @@ -1286,6 +1310,7 @@ def create_engine_config( long_prefill_token_threshold=self.long_prefill_token_threshold, disable_hybrid_kv_cache_manager=self. disable_hybrid_kv_cache_manager, + async_scheduling=self.async_scheduling, ) if not model_config.is_multimodal_model and self.default_mm_loras: diff --git a/vllm/v1/core/sched/async_scheduler.py b/vllm/v1/core/sched/async_scheduler.py new file mode 100644 index 00000000000..74ff6261732 --- /dev/null +++ b/vllm/v1/core/sched/async_scheduler.py @@ -0,0 +1,47 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +from __future__ import annotations + +from vllm.logger import init_logger +from vllm.v1.core.sched.output import SchedulerOutput +from vllm.v1.core.sched.scheduler import Scheduler +from vllm.v1.request import Request, RequestStatus + +logger = init_logger(__name__) + + +class AsyncScheduler(Scheduler): + + def _update_after_schedule( + self, + scheduler_output: SchedulerOutput, + ) -> None: + super()._update_after_schedule(scheduler_output) + for req_id in scheduler_output.num_scheduled_tokens: + request = self.requests[req_id] + if (request.num_computed_tokens == request.num_tokens + + request.num_output_placeholders): + # The request will generate a new token in this scheduling step. + # TODO(woosuk): Support speculative decoding. + request.num_output_placeholders += 1 + + def _update_request_with_output( + self, + request: Request, + new_token_ids: list[int], + ) -> tuple[list[int], bool]: + status_before_update = request.status + new_token_ids, stopped = super()._update_request_with_output( + request, new_token_ids) + + # Update the number of output placeholders. + request.num_output_placeholders -= len(new_token_ids) + assert request.num_output_placeholders >= 0 + + # Cache the new tokens. Preempted requests should be skipped. 
+ if status_before_update == RequestStatus.RUNNING: + self.kv_cache_manager.cache_blocks( + request, + request.num_computed_tokens - request.num_output_placeholders) + return new_token_ids, stopped diff --git a/vllm/v1/core/sched/scheduler.py b/vllm/v1/core/sched/scheduler.py index f81bb9fc13a..446f98034cb 100644 --- a/vllm/v1/core/sched/scheduler.py +++ b/vllm/v1/core/sched/scheduler.py @@ -204,7 +204,8 @@ def schedule(self) -> SchedulerOutput: while req_index < len(self.running) and token_budget > 0: request = self.running[req_index] - num_new_tokens = (request.num_tokens_with_spec - + num_new_tokens = (request.num_tokens_with_spec + + request.num_output_placeholders - request.num_computed_tokens) if (0 < self.scheduler_config.long_prefill_token_threshold < num_new_tokens): @@ -230,9 +231,11 @@ def schedule(self) -> SchedulerOutput: if num_new_tokens == 0: # The request cannot be scheduled because one of the following # reasons: - # 1. No new tokens to schedule. This may happen when PP>1 and - # we have already scheduled all prompt tokens but they are - # not finished yet. + # 1. No new tokens to schedule. This may happen when + # (1) PP>1 and we have already scheduled all prompt tokens + # but they are not finished yet. + # (2) Async scheduling and the request has reached to either + # its max_total_tokens or max_model_len. # 2. The encoder budget is exhausted. # 3. The encoder cache is exhausted. # NOTE(woosuk): Here, by doing `continue` instead of `break`, @@ -598,6 +601,14 @@ def _update_after_schedule( request = self.requests[req_id] request.num_computed_tokens += num_scheduled_token + # NOTE: _free_encoder_inputs relies on num_computed_tokens, which + # may be updated again in _update_from_output for speculative + # decoding. However, it is safe to call the method here because + # encoder inputs are always part of the prompt, not the output, + # and thus are unaffected by speculative decoding. + if request.has_encoder_inputs: + self._free_encoder_inputs(request) + # Clear the finished request IDs. # NOTE: We shouldn't do self.finished_req_ids.clear() here because # it will also affect the scheduler output. @@ -785,29 +796,16 @@ def update_from_output( num_draft_tokens=len(scheduled_spec_token_ids), num_accepted_tokens=len(generated_token_ids) - 1) - # NOTE(woosuk): This has to be executed after updating - # `request.num_computed_tokens`. - if request.has_encoder_inputs: - self._free_encoder_inputs(request) - stopped = False new_logprobs = None new_token_ids = generated_token_ids kv_transfer_params = None status_before_stop = request.status - # Append generated tokens and check for stop. Note that if - # a request is still being prefilled, we expect the model runner - # to return empty token ids for the request. - for num_new, output_token_id in enumerate(new_token_ids, 1): - request.append_output_token_ids(output_token_id) - - # Check for stop and update request state. - # This must be called before we make the EngineCoreOutput. - stopped = check_stop(request, self.max_model_len) - if stopped: - del new_token_ids[num_new:] # Trim new tokens if needed. - break + # Check for stop and update request status. + if new_token_ids: + new_token_ids, stopped = self._update_request_with_output( + request, new_token_ids) # Stop checking for pooler models. 
pooler_output = None @@ -915,6 +913,26 @@ def update_from_output( return engine_core_outputs + def _update_request_with_output( + self, + request: Request, + new_token_ids: list[int], + ) -> tuple[list[int], bool]: + # Append generated tokens and check for stop. Note that if + # a request is still being prefilled, we expect the model runner + # to return empty token ids for the request. + stopped = False + for num_new, output_token_id in enumerate(new_token_ids, 1): + request.append_output_token_ids(output_token_id) + + # Check for stop and update request state. + # This must be called before we make the EngineCoreOutput. + stopped = check_stop(request, self.max_model_len) + if stopped: + del new_token_ids[num_new:] # Trim new tokens if needed. + break + return new_token_ids, stopped + def _free_encoder_inputs(self, request: Request) -> None: cached_encoder_input_ids = ( self.encoder_cache_manager.get_cached_input_ids(request)) diff --git a/vllm/v1/executor/multiproc_executor.py b/vllm/v1/executor/multiproc_executor.py index 95ba45147fd..d29da55ce88 100644 --- a/vllm/v1/executor/multiproc_executor.py +++ b/vllm/v1/executor/multiproc_executor.py @@ -367,6 +367,8 @@ def check_health(self) -> None: @property def max_concurrent_batches(self) -> int: + if self.scheduler_config.async_scheduling: + return 2 return self.parallel_config.pipeline_parallel_size def _get_output_rank(self) -> int: diff --git a/vllm/v1/executor/ray_distributed_executor.py b/vllm/v1/executor/ray_distributed_executor.py index 257564793cf..daca7c0faf6 100644 --- a/vllm/v1/executor/ray_distributed_executor.py +++ b/vllm/v1/executor/ray_distributed_executor.py @@ -33,6 +33,8 @@ def max_concurrent_batches(self) -> int: """Ray distributed executor supports pipeline parallelism, meaning that it allows PP size batches to be executed concurrently. """ + if self.scheduler_config.async_scheduling: + return 2 return self.parallel_config.pipeline_parallel_size def execute_model( diff --git a/vllm/v1/request.py b/vllm/v1/request.py index 9b96f4599f9..85f5dcb92eb 100644 --- a/vllm/v1/request.py +++ b/vllm/v1/request.py @@ -77,6 +77,7 @@ def __init__( self.num_prompt_tokens = len(self.prompt_token_ids) self._output_token_ids: list[int] = [] self._all_token_ids: list[int] = self.prompt_token_ids.copy() + self.num_output_placeholders = 0 # Used in async scheduling. self.spec_token_ids: list[int] = [] self.num_computed_tokens = 0 self.cache_salt: Optional[str] = cache_salt From cfeafd5b92e3c9af29d732637b40db86760247b4 Mon Sep 17 00:00:00 2001 From: Ilya Markov Date: Tue, 15 Jul 2025 08:57:40 +0200 Subject: [PATCH 092/552] [Misc] Refactor AllReduceFusionPass. Remove parameter (#20918) Signed-off-by: ilmarkov Co-authored-by: ilmarkov Signed-off-by: x22x22 --- tests/compile/test_fusion_all_reduce.py | 4 +--- vllm/compilation/collective_fusion.py | 8 +++++--- vllm/compilation/pass_manager.py | 5 +---- 3 files changed, 7 insertions(+), 10 deletions(-) diff --git a/tests/compile/test_fusion_all_reduce.py b/tests/compile/test_fusion_all_reduce.py index 7101857210a..492e90f2a75 100644 --- a/tests/compile/test_fusion_all_reduce.py +++ b/tests/compile/test_fusion_all_reduce.py @@ -132,9 +132,7 @@ def all_reduce_fusion_pass_on_test_model(local_rank: int, world_size: int, dtype=dtype, seed=42) - all_reduce_fusion_pass = AllReduceFusionPass( - vllm_config, vllm_config.compilation_config.pass_config. 
- fi_allreduce_fusion_max_token_num) + all_reduce_fusion_pass = AllReduceFusionPass(vllm_config) backend = TestBackend(all_reduce_fusion_pass) model = test_model_cls(hidden_size) diff --git a/vllm/compilation/collective_fusion.py b/vllm/compilation/collective_fusion.py index 97cb2995cb3..a8b00aaf084 100644 --- a/vllm/compilation/collective_fusion.py +++ b/vllm/compilation/collective_fusion.py @@ -397,7 +397,7 @@ def replacement(residual: torch.Tensor, input: torch.Tensor, class AllReduceFusionPass(VllmInductorPass): - def __init__(self, config: VllmConfig, max_token_num: int): + def __init__(self, config: VllmConfig): super().__init__(config) self.disabled = True self.tp_size = get_tensor_model_parallel_world_size() @@ -429,7 +429,8 @@ def __init__(self, config: VllmConfig, max_token_num: int): flashinfer_comm.trtllm_create_ipc_workspace_for_all_reduce_fusion( tp_rank=rank, tp_size=self.tp_size, - max_token_num=max_token_num, + max_token_num=config.compilation_config.pass_config. + fi_allreduce_fusion_max_token_num, hidden_dim=self.hidden_dim, group=self.group, use_fp32_lamport=use_fp32_lamport, @@ -441,7 +442,8 @@ def __init__(self, config: VllmConfig, max_token_num: int): rank=rank, world_size=self.tp_size, use_fp32_lamport=use_fp32_lamport, - max_token_num=max_token_num, + max_token_num=config.compilation_config.pass_config. + fi_allreduce_fusion_max_token_num, ) for epsilon in [1e-5, 1e-6]: diff --git a/vllm/compilation/pass_manager.py b/vllm/compilation/pass_manager.py index 078188854f0..58216a1f0ed 100644 --- a/vllm/compilation/pass_manager.py +++ b/vllm/compilation/pass_manager.py @@ -63,10 +63,7 @@ def configure(self, config: VllmConfig): if self.pass_config.enable_attn_fusion: self.passes += [AttnFusionPass(config)] if self.pass_config.enable_fi_allreduce_fusion: - self.passes += [ - AllReduceFusionPass( - config, self.pass_config.fi_allreduce_fusion_max_token_num) - ] + self.passes += [AllReduceFusionPass(config)] self.fix_functionalization = FixFunctionalizationPass(config) def add(self, pass_: InductorPass): From bbd052bc22ec634572ed5b0dae55198a0dc9baef Mon Sep 17 00:00:00 2001 From: Reid <61492567+reidliu41@users.noreply.github.com> Date: Tue, 15 Jul 2025 15:42:00 +0800 Subject: [PATCH 093/552] [frontend] Add --help=page option for paginated help output (#20961) Signed-off-by: reidliu41 Signed-off-by: x22x22 --- docs/cli/README.md | 3 +++ vllm/entrypoints/utils.py | 44 ++++++++++++++++++++++++++++++++------- 2 files changed, 39 insertions(+), 8 deletions(-) diff --git a/docs/cli/README.md b/docs/cli/README.md index 3541437659c..1d951747a7a 100644 --- a/docs/cli/README.md +++ b/docs/cli/README.md @@ -37,6 +37,9 @@ Start the vLLM OpenAI Compatible API server. 
# To search by keyword vllm serve --help=max + + # To view full help with pager (less/more) + vllm serve --help=page ``` ## chat diff --git a/vllm/entrypoints/utils.py b/vllm/entrypoints/utils.py index 6c37ce818e6..87334f458fe 100644 --- a/vllm/entrypoints/utils.py +++ b/vllm/entrypoints/utils.py @@ -5,6 +5,7 @@ import asyncio import functools import os +import subprocess import sys from typing import Any, Optional, Union @@ -25,7 +26,8 @@ " - To view a argument group: --help=ModelConfig\n" " - To view a single argument: --help=max-num-seqs\n" " - To search by keyword: --help=max\n" - " - To list all groups: --help=listgroup") + " - To list all groups: --help=listgroup\n" + " - To view help with pager: --help=page") async def listen_for_disconnect(request: Request) -> None: @@ -190,6 +192,24 @@ def _validate_truncation_size( return truncate_prompt_tokens +def _output_with_pager(text: str): + """Output text using scrolling view if available and appropriate.""" + + pagers = ['less -R', 'more'] + for pager_cmd in pagers: + try: + proc = subprocess.Popen(pager_cmd.split(), + stdin=subprocess.PIPE, + text=True) + proc.communicate(input=text) + return + except (subprocess.SubprocessError, OSError, FileNotFoundError): + continue + + # No pager worked, fall back to normal print + print(text) + + def show_filtered_argument_or_group_from_help(parser: argparse.ArgumentParser, subcommand_name: list[str]): @@ -208,16 +228,24 @@ def show_filtered_argument_or_group_from_help(parser: argparse.ArgumentParser, if arg.startswith('--help='): search_keyword = arg.split('=', 1)[1] + # Enable paged view for full help + if search_keyword == 'page': + help_text = parser.format_help() + _output_with_pager(help_text) + sys.exit(0) + # List available groups if search_keyword == 'listgroup': - print("\nAvailable argument groups:") + output_lines = ["\nAvailable argument groups:"] for group in parser._action_groups: if group.title and not group.title.startswith( "positional arguments"): - print(f" - {group.title}") + output_lines.append(f" - {group.title}") if group.description: - print(" " + group.description.strip()) - print() + output_lines.append(" " + + group.description.strip()) + output_lines.append("") + _output_with_pager("\n".join(output_lines)) sys.exit(0) # For group search @@ -229,7 +257,7 @@ def show_filtered_argument_or_group_from_help(parser: argparse.ArgumentParser, formatter.add_text(group.description) formatter.add_arguments(group._group_actions) formatter.end_section() - print(formatter.format_help()) + _output_with_pager(formatter.format_help()) sys.exit(0) # For single arg @@ -243,10 +271,10 @@ def show_filtered_argument_or_group_from_help(parser: argparse.ArgumentParser, matched_actions.append(action) if matched_actions: - print(f"\nParameters matching '{search_keyword}':\n") + header = f"\nParameters matching '{search_keyword}':\n" formatter = parser._get_formatter() formatter.add_arguments(matched_actions) - print(formatter.format_help()) + _output_with_pager(header + formatter.format_help()) sys.exit(0) print(f"\nNo group or parameter matching '{search_keyword}'") From a9b18dfccfd39ed47660fb6f99bf47f2a69837f0 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Tue, 15 Jul 2025 04:54:10 -0400 Subject: [PATCH 094/552] [Docs] Improve documentation for RLHF example (#20598) Signed-off-by: Ricardo Decal Signed-off-by: x22x22 --- examples/offline_inference/rlhf.py | 85 +++++++++++++++++------------- 1 file changed, 49 insertions(+), 36 deletions(-) diff --git a/examples/offline_inference/rlhf.py 
b/examples/offline_inference/rlhf.py index c6e63531a99..752117a4e36 100644 --- a/examples/offline_inference/rlhf.py +++ b/examples/offline_inference/rlhf.py @@ -1,17 +1,31 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """ -a simple demonstration of RLHF with vLLM, inspired by -the OpenRLHF framework https://github.com/OpenRLHF/OpenRLHF . -It follows the design that, training processes and inference processes -are different, and they live on different GPUs. -Training processes send prompts to inference processes to generate data, -and also synchronize the weights of the model by broadcasting the weights -from the training process to the inference process. -Note that this is a simple demonstration of one training instance and one -inference instance. In practice, there could be multiple training instances -and multiple inference instances. For the full implementation, please refer -to the OpenRLHF framework. +Demonstrates reinforcement learning from human feedback (RLHF) using vLLM and Ray. + +The script separates training and inference workloads onto distinct GPUs +so that Ray can manage process placement and inter-process communication. +A Hugging Face Transformer model occupies GPU 0 for training, whereas a +tensor-parallel vLLM inference engine occupies GPU 1–2. + +The example performs the following steps: + +* Load the training model on GPU 0. +* Split the inference model across GPUs 1–2 using vLLM's tensor parallelism + and Ray placement groups. +* Generate text from a list of prompts using the inference engine. +* Update the weights of the training model and broadcast the updated weights + to the inference engine by using a Ray collective RPC group. Note that + for demonstration purposes we simply zero out the weights. + +For a production-ready implementation that supports multiple training and +inference replicas, see the OpenRLHF framework: +https://github.com/OpenRLHF/OpenRLHF + +This example assumes a single-node cluster with three GPUs, but Ray +supports multi-node clusters. vLLM expects the GPUs are only used for vLLM +workloads. Residual GPU activity interferes with vLLM memory profiling and +causes unexpected behavior. """ import os @@ -28,29 +42,27 @@ class MyLLM(LLM): + """Configure the vLLM worker for Ray placement group execution.""" + def __init__(self, *args, **kwargs): - # a hack to make the script work. - # stop ray from manipulating CUDA_VISIBLE_DEVICES - # at the top-level + # Remove the top-level CUDA_VISIBLE_DEVICES variable set by Ray + # so that vLLM can manage its own device placement within the worker. os.environ.pop("CUDA_VISIBLE_DEVICES", None) super().__init__(*args, **kwargs) -""" -Start the training process, here we use huggingface transformers -as an example to hold a model on GPU 0. -""" - +# Load the OPT-125M model onto GPU 0 for the training workload. train_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m") train_model.to("cuda:0") -""" -Start the inference process, here we use vLLM to hold a model on GPU 1 and -GPU 2. For the details on how to use ray, please refer to the ray -documentation https://docs.ray.io/en/latest/ . -""" + +# Initialize Ray and set the visible devices. The vLLM engine will +# be placed on GPUs 1 and 2. os.environ["CUDA_VISIBLE_DEVICES"] = "1,2" ray.init() +# Create a placement group that reserves GPU 1–2 for the vLLM inference engine. 
+# Learn more about Ray placement groups: +# https://docs.ray.io/en/latest/placement-groups.html pg_inference = placement_group([{"GPU": 1, "CPU": 0}] * 2) ray.get(pg_inference.ready()) scheduling_inference = PlacementGroupSchedulingStrategy( @@ -58,10 +70,9 @@ def __init__(self, *args, **kwargs): placement_group_capture_child_tasks=True, placement_group_bundle_index=0, ) -""" -launch the vLLM inference engine. -here we use `enforce_eager` to reduce the start time. -""" + +# Launch the vLLM inference engine. The `enforce_eager` flag reduces +# start-up latency. llm = ray.remote( num_cpus=0, num_gpus=0, @@ -74,7 +85,7 @@ def __init__(self, *args, **kwargs): distributed_executor_backend="ray", ) -# Generate texts from the prompts. +# Generate text from the prompts. prompts = [ "Hello, my name is", "The president of the United States is", @@ -93,8 +104,8 @@ def __init__(self, *args, **kwargs): print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}") print("-" * 50) -# set up the communication between the training process -# and the inference engine. +# Set up the communication channel between the training process and the +# inference engine. master_address = get_ip() master_port = get_open_port() @@ -107,21 +118,23 @@ def __init__(self, *args, **kwargs): ) ray.get(handle) -# simulate training, modify the weights of the model. +# Simulate a training step by zeroing out all model weights. +# In a real RLHF training loop the weights would be updated using the gradient +# from an RL objective such as PPO on a reward model. for name, p in train_model.named_parameters(): p.data.zero_() -# sync weight from the training process to the inference engine. +# Synchronize the updated weights to the inference engine. for name, p in train_model.named_parameters(): handle = llm.collective_rpc.remote("update_weight", args=(name, p.dtype, p.shape)) model_update_group.broadcast(p, src=0, stream=torch.cuda.current_stream()) ray.get(handle) -# check if the weights are updated. +# Verify that the inference weights have been updated. assert all(ray.get(llm.collective_rpc.remote("check_weights_changed"))) -# use the updated model to generate texts, they will be nonsense -# because the weights are all zeros. +# Generate text with the updated model. The output is expected to be nonsense +# because the weights are zero. 
outputs_updated = ray.get(llm.generate.remote(prompts, sampling_params)) print("-" * 50) for output in outputs_updated: From a70cf720ac62c5e103a14b3278ffb304b3ee9449 Mon Sep 17 00:00:00 2001 From: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Date: Tue, 15 Jul 2025 02:23:42 -0700 Subject: [PATCH 095/552] [frontend] Refactor CLI Args for a better modular integration (#20206) Signed-off-by: Kourosh Hakhamaneshi Signed-off-by: x22x22 --- .pre-commit-config.yaml | 2 +- vllm/entrypoints/openai/cli_args.py | 377 ++++++++++++---------------- 2 files changed, 167 insertions(+), 212 deletions(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 720c06acf14..24399677c08 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -166,7 +166,7 @@ repos: language: python types: [python] pass_filenames: true - files: vllm/config.py|tests/test_config.py + files: vllm/config.py|tests/test_config.py|vllm/entrypoints/openai/cli_args.py # Keep `suggestion` last - id: suggestion name: Suggestion diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index 4f8aaab772f..9a7f04cd9b2 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -10,9 +10,13 @@ import json import ssl from collections.abc import Sequence -from typing import Optional, Union, get_args +from dataclasses import field +from typing import Literal, Optional, Union + +from pydantic.dataclasses import dataclass import vllm.envs as envs +from vllm.config import config from vllm.engine.arg_utils import AsyncEngineArgs, optional_type from vllm.entrypoints.chat_utils import (ChatTemplateContentFormatOption, validate_chat_template) @@ -82,220 +86,171 @@ def __call__( setattr(namespace, self.dest, adapter_list) +@config +@dataclass +class FrontendArgs: + """Arguments for the OpenAI-compatible frontend server.""" + host: Optional[str] = None + """Host name.""" + port: int = 8000 + """Port number.""" + uvicorn_log_level: Literal["debug", "info", "warning", "error", "critical", + "trace"] = "info" + """Log level for uvicorn.""" + disable_uvicorn_access_log: bool = False + """Disable uvicorn access log.""" + allow_credentials: bool = False + """Allow credentials.""" + allowed_origins: list[str] = field(default_factory=lambda: ["*"]) + """Allowed origins.""" + allowed_methods: list[str] = field(default_factory=lambda: ["*"]) + """Allowed methods.""" + allowed_headers: list[str] = field(default_factory=lambda: ["*"]) + """Allowed headers.""" + api_key: Optional[str] = None + """If provided, the server will require this key to be presented in the + header.""" + lora_modules: Optional[list[LoRAModulePath]] = None + """LoRA modules configurations in either 'name=path' format or JSON format + or JSON list format. Example (old format): `'name=path'` Example (new + format): `{\"name\": \"name\", \"path\": \"lora_path\", + \"base_model_name\": \"id\"}`""" + prompt_adapters: Optional[list[PromptAdapterPath]] = None + """Prompt adapter configurations in the format name=path. Multiple adapters + can be specified.""" + chat_template: Optional[str] = None + """The file path to the chat template, or the template in single-line form + for the specified model.""" + chat_template_content_format: ChatTemplateContentFormatOption = "auto" + """The format to render message content within a chat template. + +* "string" will render the content as a string. 
Example: `"Hello World"` +* "openai" will render the content as a list of dictionaries, similar to OpenAI +schema. Example: `[{"type": "text", "text": "Hello world!"}]`""" + response_role: str = "assistant" + """The role name to return if `request.add_generation_prompt=true`.""" + ssl_keyfile: Optional[str] = None + """The file path to the SSL key file.""" + ssl_certfile: Optional[str] = None + """The file path to the SSL cert file.""" + ssl_ca_certs: Optional[str] = None + """The CA certificates file.""" + enable_ssl_refresh: bool = False + """Refresh SSL Context when SSL certificate files change""" + ssl_cert_reqs: int = int(ssl.CERT_NONE) + """Whether client certificate is required (see stdlib ssl module's).""" + root_path: Optional[str] = None + """FastAPI root_path when app is behind a path based routing proxy.""" + middleware: list[str] = field(default_factory=lambda: []) + """Additional ASGI middleware to apply to the app. We accept multiple + --middleware arguments. The value should be an import path. If a function + is provided, vLLM will add it to the server using + `@app.middleware('http')`. If a class is provided, vLLM will + add it to the server using `app.add_middleware()`.""" + return_tokens_as_token_ids: bool = False + """When `--max-logprobs` is specified, represents single tokens as + strings of the form 'token_id:{token_id}' so that tokens that are not + JSON-encodable can be identified.""" + disable_frontend_multiprocessing: bool = False + """If specified, will run the OpenAI frontend server in the same process as + the model serving engine.""" + enable_request_id_headers: bool = False + """If specified, API server will add X-Request-Id header to responses. + Caution: this hurts performance at high QPS.""" + enable_auto_tool_choice: bool = False + """Enable auto tool choice for supported models. Use `--tool-call-parser` + to specify which parser to use.""" + tool_call_parser: Optional[str] = None + """Select the tool call parser depending on the model that you're using. + This is used to parse the model-generated tool call into OpenAI API format. + Required for `--enable-auto-tool-choice`. You can choose any option from + the built-in parsers or register a plugin via `--tool-parser-plugin`.""" + tool_parser_plugin: str = "" + """Special the tool parser plugin write to parse the model-generated tool + into OpenAI API format, the name register in this plugin can be used in + `--tool-call-parser`.""" + log_config_file: Optional[str] = envs.VLLM_LOGGING_CONFIG_PATH + """Path to logging config JSON file for both vllm and uvicorn""" + max_log_len: Optional[int] = None + """Max number of prompt characters or prompt ID numbers being printed in + log. The default of None means unlimited.""" + disable_fastapi_docs: bool = False + """Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoint.""" + enable_prompt_tokens_details: bool = False + """If set to True, enable prompt_tokens_details in usage.""" + enable_server_load_tracking: bool = False + """If set to True, enable tracking server_load_metrics in the app state.""" + enable_force_include_usage: bool = False + """If set to True, including usage on every request.""" + expand_tools_even_if_tool_choice_none: bool = False + """Include tool definitions in prompts even when `tool_choice='none'`. + + This is a transitional option that will be removed in v0.10.0. In + v0.10.0, tool definitions will always be included regardless of + `tool_choice` setting. 
Use this flag to test the upcoming behavior + before the breaking change.""" + + @staticmethod + def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: + from vllm.engine.arg_utils import get_kwargs + + frontend_kwargs = get_kwargs(FrontendArgs) + + # Special case: allowed_origins, allowed_methods, allowed_headers all + # need json.loads type + # Should also remove nargs + print(frontend_kwargs["allowed_origins"]) + frontend_kwargs["allowed_origins"]["type"] = json.loads + frontend_kwargs["allowed_methods"]["type"] = json.loads + frontend_kwargs["allowed_headers"]["type"] = json.loads + del frontend_kwargs["allowed_origins"]["nargs"] + del frontend_kwargs["allowed_methods"]["nargs"] + del frontend_kwargs["allowed_headers"]["nargs"] + + # Special case: LoRA modules need custom parser action and + # optional_type(str) + frontend_kwargs["lora_modules"]["type"] = optional_type(str) + frontend_kwargs["lora_modules"]["action"] = LoRAParserAction + + # Special case: Prompt adapters need custom parser action and + # optional_type(str) + frontend_kwargs["prompt_adapters"]["type"] = optional_type(str) + frontend_kwargs["prompt_adapters"][ + "action"] = PromptAdapterParserAction + + # Special case: Middleware needs append action + frontend_kwargs["middleware"]["action"] = "append" + + # Special case: Tool call parser shows built-in options. + valid_tool_parsers = list(ToolParserManager.tool_parsers.keys()) + frontend_kwargs["tool_call_parser"]["choices"] = valid_tool_parsers + + # Special case for expand-tools-even-if-tool-choice-none because of + # the deprecation field + frontend_kwargs["expand_tools_even_if_tool_choice_none"]\ + ["deprecated"] = True + + frontend_group = parser.add_argument_group( + title="Frontend", + description=FrontendArgs.__doc__, + ) + + for key, value in frontend_kwargs.items(): + frontend_group.add_argument(f"--{key.replace('_', '-')}", **value) + + return parser + + def make_arg_parser(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: - parser.add_argument("--host", - type=optional_type(str), - default=None, - help="Host name.") - parser.add_argument("--port", type=int, default=8000, help="Port number.") - parser.add_argument( - "--uvicorn-log-level", - type=str, - default="info", - choices=['debug', 'info', 'warning', 'error', 'critical', 'trace'], - help="Log level for uvicorn.") - parser.add_argument("--disable-uvicorn-access-log", - action="store_true", - help="Disable uvicorn access log.") - parser.add_argument("--allow-credentials", - action="store_true", - help="Allow credentials.") - parser.add_argument("--allowed-origins", - type=json.loads, - default=["*"], - help="Allowed origins.") - parser.add_argument("--allowed-methods", - type=json.loads, - default=["*"], - help="Allowed methods.") - parser.add_argument("--allowed-headers", - type=json.loads, - default=["*"], - help="Allowed headers.") - parser.add_argument("--api-key", - type=optional_type(str), - default=None, - help="If provided, the server will require this key " - "to be presented in the header.") - parser.add_argument( - "--lora-modules", - type=optional_type(str), - default=None, - nargs='+', - action=LoRAParserAction, - help="LoRA module configurations in either 'name=path' format" - "or JSON format. 
" - "Example (old format): ``'name=path'`` " - "Example (new format): " - "``{\"name\": \"name\", \"path\": \"lora_path\", " - "\"base_model_name\": \"id\"}``") - parser.add_argument( - "--prompt-adapters", - type=optional_type(str), - default=None, - nargs='+', - action=PromptAdapterParserAction, - help="Prompt adapter configurations in the format name=path. " - "Multiple adapters can be specified.") - parser.add_argument("--chat-template", - type=optional_type(str), - default=None, - help="The file path to the chat template, " - "or the template in single-line form " - "for the specified model.") - parser.add_argument( - '--chat-template-content-format', - type=str, - default="auto", - choices=get_args(ChatTemplateContentFormatOption), - help='The format to render message content within a chat template.' - '\n\n' - '* "string" will render the content as a string. ' - 'Example: ``"Hello World"``\n' - '* "openai" will render the content as a list of dictionaries, ' - 'similar to OpenAI schema. ' - 'Example: ``[{"type": "text", "text": "Hello world!"}]``') - parser.add_argument("--response-role", - type=optional_type(str), - default="assistant", - help="The role name to return if " - "``request.add_generation_prompt=true``.") - parser.add_argument("--ssl-keyfile", - type=optional_type(str), - default=None, - help="The file path to the SSL key file.") - parser.add_argument("--ssl-certfile", - type=optional_type(str), - default=None, - help="The file path to the SSL cert file.") - parser.add_argument("--ssl-ca-certs", - type=optional_type(str), - default=None, - help="The CA certificates file.") - parser.add_argument( - "--enable-ssl-refresh", - action="store_true", - default=False, - help="Refresh SSL Context when SSL certificate files change") - parser.add_argument( - "--ssl-cert-reqs", - type=int, - default=int(ssl.CERT_NONE), - help="Whether client certificate is required (see stdlib ssl module's)." - ) - parser.add_argument( - "--root-path", - type=optional_type(str), - default=None, - help="FastAPI root_path when app is behind a path based routing proxy." - ) - parser.add_argument( - "--middleware", - type=optional_type(str), - action="append", - default=[], - help="Additional ASGI middleware to apply to the app. " - "We accept multiple --middleware arguments. " - "The value should be an import path. " - "If a function is provided, vLLM will add it to the server " - "using ``@app.middleware('http')``. " - "If a class is provided, vLLM will add it to the server " - "using ``app.add_middleware()``. ") - parser.add_argument( - "--return-tokens-as-token-ids", - action="store_true", - help="When ``--max-logprobs`` is specified, represents single tokens " - " as strings of the form 'token_id:{token_id}' so that tokens " - "that are not JSON-encodable can be identified.") - parser.add_argument( - "--disable-frontend-multiprocessing", - action="store_true", - help="If specified, will run the OpenAI frontend server in the same " - "process as the model serving engine.") - parser.add_argument( - "--enable-request-id-headers", - action="store_true", - help="If specified, API server will add X-Request-Id header to " - "responses.") - parser.add_argument( - "--enable-auto-tool-choice", - action="store_true", - default=False, - help="Enable auto tool choice for supported models. 
Use " - "``--tool-call-parser`` to specify which parser to use.") - parser.add_argument( - "--expand-tools-even-if-tool-choice-none", - action="store_true", - default=False, - deprecated=True, - help="Include tool definitions in prompts " - "even when tool_choice='none'. " - "This is a transitional option that will be removed in v0.10.0. " - "In v0.10.0, tool definitions will always be included regardless of " - "tool_choice setting. Use this flag now to test the new behavior " - "before the breaking change.") - - valid_tool_parsers = ToolParserManager.tool_parsers.keys() - parser.add_argument( - "--tool-call-parser", - type=str, - metavar="{" + ",".join(valid_tool_parsers) + "} or name registered in " - "--tool-parser-plugin", - default=None, - help= - "Select the tool call parser depending on the model that you're using." - " This is used to parse the model-generated tool call into OpenAI API " - "format. Required for ``--enable-auto-tool-choice``.") - - parser.add_argument( - "--tool-parser-plugin", - type=str, - default="", - help= - "Special the tool parser plugin write to parse the model-generated tool" - " into OpenAI API format, the name register in this plugin can be used " - "in ``--tool-call-parser``.") - - parser.add_argument( - "--log-config-file", - type=str, - default=envs.VLLM_LOGGING_CONFIG_PATH, - help="Path to logging config JSON file for both vllm and uvicorn", - ) + """Create the CLI argument parser used by the OpenAI API server. + We rely on the helper methods of `FrontendArgs` and `AsyncEngineArgs` to + register all arguments instead of manually enumerating them here. This + avoids code duplication and keeps the argument definitions in one place. + """ + parser = FrontendArgs.add_cli_args(parser) parser = AsyncEngineArgs.add_cli_args(parser) - parser.add_argument('--max-log-len', - type=int, - default=None, - help='Max number of prompt characters or prompt ' - 'ID numbers being printed in log.' - ' The default of None means unlimited.') - - parser.add_argument( - "--disable-fastapi-docs", - action='store_true', - default=False, - help="Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoint." - ) - parser.add_argument( - "--enable-prompt-tokens-details", - action='store_true', - default=False, - help="If set to True, enable prompt_tokens_details in usage.") - parser.add_argument( - "--enable-force-include-usage", - action='store_true', - default=False, - help="If set to True, including usage on every request.") - parser.add_argument( - "--enable-server-load-tracking", - action='store_true', - default=False, - help= - "If set to True, enable tracking server_load_metrics in the app state." - ) - return parser From 47028f6bdfe5ecc84888db25f7821abefc222092 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Tue, 15 Jul 2025 06:55:45 -0400 Subject: [PATCH 096/552] [Docs] Improve documentation for ray cluster launcher helper script (#20602) Signed-off-by: Ricardo Decal Signed-off-by: x22x22 --- examples/online_serving/run_cluster.sh | 74 +++++++++++++++++++++----- 1 file changed, 62 insertions(+), 12 deletions(-) diff --git a/examples/online_serving/run_cluster.sh b/examples/online_serving/run_cluster.sh index 7b4b40b4b7e..522b9566212 100644 --- a/examples/online_serving/run_cluster.sh +++ b/examples/online_serving/run_cluster.sh @@ -1,35 +1,81 @@ #!/bin/bash +# +# Launch a Ray cluster inside Docker for vLLM inference. 
+# +# This script can start either a head node or a worker node, depending on the +# --head or --worker flag provided as the third positional argument. +# +# Usage: +# 1. Designate one machine as the head node and execute: +# bash run_cluster.sh \ +# vllm/vllm-openai \ +# \ +# --head \ +# /abs/path/to/huggingface/cache \ +# -e VLLM_HOST_IP= +# +# 2. On every worker machine, execute: +# bash run_cluster.sh \ +# vllm/vllm-openai \ +# \ +# --worker \ +# /abs/path/to/huggingface/cache \ +# -e VLLM_HOST_IP= +# +# Each worker requires a unique VLLM_HOST_IP value. +# Keep each terminal session open. Closing a session stops the associated Ray +# node and thereby shuts down the entire cluster. +# Every machine must be reachable at the supplied IP address. +# +# The container is named "node-". To open a shell inside +# a container after launch, use: +# docker exec -it node- /bin/bash +# +# Then, you can execute vLLM commands on the Ray cluster as if it were a +# single machine, e.g. vllm serve ... +# +# To stop the container, use: +# docker stop node- -# Check for minimum number of required arguments +# Check for minimum number of required arguments. if [ $# -lt 4 ]; then - echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]" + echo "Usage: $0 docker_image head_node_ip --head|--worker path_to_hf_home [additional_args...]" exit 1 fi -# Assign the first three arguments and shift them away +# Extract the mandatory positional arguments and remove them from $@. DOCKER_IMAGE="$1" HEAD_NODE_ADDRESS="$2" -NODE_TYPE="$3" # Should be --head or --worker +NODE_TYPE="$3" # Should be --head or --worker. PATH_TO_HF_HOME="$4" shift 4 -# Additional arguments are passed directly to the Docker command +# Preserve any extra arguments so they can be forwarded to Docker. ADDITIONAL_ARGS=("$@") -# Validate node type +# Validate the NODE_TYPE argument. if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then echo "Error: Node type must be --head or --worker" exit 1 fi -# Define a function to cleanup on EXIT signal +# Generate a unique container name with random suffix. +# Docker container names must be unique on each host. +# The random suffix allows multiple Ray containers to run simultaneously on the same machine, +# for example, on a multi-GPU machine. +CONTAINER_NAME="node-${RANDOM}" + +# Define a cleanup routine that removes the container when the script exits. +# This prevents orphaned containers from accumulating if the script is interrupted. cleanup() { - docker stop node - docker rm node + docker stop "${CONTAINER_NAME}" + docker rm "${CONTAINER_NAME}" } trap cleanup EXIT -# Command setup for head or worker node +# Build the Ray start command based on the node role. +# The head node manages the cluster and accepts connections on port 6379, +# while workers connect to the head's address. RAY_START_CMD="ray start --block" if [ "${NODE_TYPE}" == "--head" ]; then RAY_START_CMD+=" --head --port=6379" @@ -37,11 +83,15 @@ else RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379" fi -# Run the docker command with the user specified parameters and additional arguments +# Launch the container with the assembled parameters. 
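Once the head and worker containers are running, it can be worth confirming that every machine actually joined the cluster before launching vLLM. A small optional check in Python, assuming it is executed inside one of the containers (where Ray is already installed):

```python
# Optional sanity check; run inside any node-<suffix> container after startup.
import ray

ray.init(address="auto")            # attach to the cluster started by this script
print(len(ray.nodes()), "nodes joined")
print(ray.cluster_resources())      # aggregate GPUs/CPUs that vLLM will see
```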
+# --network host: Allows Ray nodes to communicate directly via host networking +# --shm-size 10.24g: Increases shared memory +# --gpus all: Gives container access to all GPUs on the host +# -v HF_HOME: Mounts HuggingFace cache to avoid re-downloading models docker run \ --entrypoint /bin/bash \ --network host \ - --name node \ + --name "${CONTAINER_NAME}" \ --shm-size 10.24g \ --gpus all \ -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \ From 41a1a96e486f059fd56ce3d432f38b7666e1b3a3 Mon Sep 17 00:00:00 2001 From: Yifei Teng Date: Tue, 15 Jul 2025 03:56:43 -0700 Subject: [PATCH 097/552] [TPU] Optimize kv cache update kernel (#20415) Signed-off-by: Yifei Teng Signed-off-by: x22x22 --- vllm/utils/__init__.py | 7 +++ vllm/v1/attention/backends/pallas.py | 6 +++ vllm/v1/worker/tpu_model_runner.py | 66 +++++++++++++++++++++------- 3 files changed, 63 insertions(+), 16 deletions(-) diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index 0bc2341b7b4..0fed490a1fc 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -947,6 +947,13 @@ def next_power_of_2(n) -> int: return 1 << (n - 1).bit_length() +def prev_power_of_2(n: int) -> int: + """The previous power of 2 (inclusive)""" + if n <= 0: + return 0 + return 1 << (n.bit_length() - 1) + + def round_up(x: int, y: int) -> int: return ((x + y - 1) // y) * y diff --git a/vllm/v1/attention/backends/pallas.py b/vllm/v1/attention/backends/pallas.py index 2921e8ed55a..32ef5dc2e36 100644 --- a/vllm/v1/attention/backends/pallas.py +++ b/vllm/v1/attention/backends/pallas.py @@ -324,3 +324,9 @@ def kv_cache_update_op_non_xla(kv: torch.Tensor, slot_mapping: torch.Tensor, page_size: int, num_slices_per_block: int) -> torch.Tensor: return kv_cache + + +def get_page_size_bytes(block_size: int, num_kv_heads: int, head_size: int, + kv_cache_dtype: torch.dtype) -> int: + """Returns the size in bytes of one page of the KV cache.""" + return block_size * num_kv_heads * head_size * kv_cache_dtype.itemsize diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 82a203caf2b..83a80bd865b 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -31,9 +31,10 @@ from vllm.multimodal.utils import group_mm_inputs_by_modality from vllm.sequence import IntermediateTensors from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, LayerBlockType, cdiv, - is_pin_memory_available) + is_pin_memory_available, prev_power_of_2) from vllm.v1.attention.backends.pallas import (PallasAttentionBackend, - PallasMetadata) + PallasMetadata, + get_page_size_bytes) from vllm.v1.core.encoder_cache_manager import compute_encoder_budget from vllm.v1.kv_cache_interface import (AttentionSpec, FullAttentionSpec, KVCacheConfig, KVCacheSpec, @@ -56,8 +57,6 @@ INVALID_TOKEN_ID = -1 # Smallest output size MIN_NUM_SEQS = 8 -# Block size used for kv cache updating kernel -NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK = 8 ######################################################### @@ -139,7 +138,11 @@ def __init__( self.pin_memory = is_pin_memory_available() self.dtype = self.model_config.dtype if cache_config.cache_dtype == "auto": - self.kv_cache_dtype = self.dtype + model_dtype = self.dtype + if isinstance(model_dtype, str): + self.kv_cache_dtype = STR_DTYPE_TO_TORCH_DTYPE[model_dtype] + else: + self.kv_cache_dtype = model_dtype else: self.kv_cache_dtype = STR_DTYPE_TO_TORCH_DTYPE[ cache_config.cache_dtype] @@ -192,6 +195,14 @@ def __init__( self.max_num_encoder_input_tokens = encoder_compute_budget self.encoder_cache_size = 
encoder_cache_size + self._num_slices_per_kv_cache_update_block = \ + _get_num_slices_per_kv_cache_update_block(get_page_size_bytes( + block_size=self.block_size, + num_kv_heads=self.num_kv_heads, + head_size=self.head_size, + kv_cache_dtype=self.kv_cache_dtype, + )) + # Lazy initialization self.model: nn.Module # Set after load_model self.kv_caches: list[torch.Tensor] = [] @@ -719,7 +730,7 @@ def _prepare_inputs(self, scheduler_output: "SchedulerOutput", num_kv_update_slices = slot_mapping_metadata.shape[0] padded_num_slices = _get_padded_num_kv_cache_update_slices( padded_total_num_scheduled_tokens, self.max_num_reqs, - self.block_size) + self.block_size, self._num_slices_per_kv_cache_update_block) slot_mapping_metadata = np.pad( slot_mapping_metadata, [[0, padded_num_slices - len(slot_mapping_metadata)], [0, 0]], @@ -750,8 +761,8 @@ def _prepare_inputs(self, scheduler_output: "SchedulerOutput", num_kv_update_slices=torch.tensor([num_kv_update_slices], dtype=torch.int32, device=self.device), - num_slices_per_kv_cache_update_block= - NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK, + num_slices_per_kv_cache_update_block=self. + _num_slices_per_kv_cache_update_block, ) # NOTE(woosuk): Due to chunked prefills, there can be at most 1 partial # request in the batch. While we should not sample any token from this @@ -1197,7 +1208,8 @@ def _dummy_run(self, num_tokens: int, num_reqs: int, position_ids = torch.zeros(num_tokens, dtype=torch.int32).to(self.device) padded_num_slices = _get_padded_num_kv_cache_update_slices( - num_tokens, self.max_num_reqs, self.block_size) + num_tokens, self.max_num_reqs, self.block_size, + self._num_slices_per_kv_cache_update_block) num_kv_update_slices = torch.tensor([padded_num_slices], dtype=torch.int32).to(self.device) slot_mapping = torch.zeros((3, padded_num_slices), @@ -1220,8 +1232,8 @@ def _dummy_run(self, num_tokens: int, num_reqs: int, query_start_loc=query_start_loc, num_seqs=num_seqs, num_kv_update_slices=num_kv_update_slices, - num_slices_per_kv_cache_update_block= - NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK, + num_slices_per_kv_cache_update_block=self. + _num_slices_per_kv_cache_update_block, ) if self.is_multimodal_model: @@ -1826,19 +1838,41 @@ def _get_padded_token_len(paddings: list[int], x: int) -> int: return paddings[index] -def _get_padded_num_kv_cache_update_slices(num_tokens: int, max_num_reqs: int, - page_size: int) -> int: +def _get_padded_num_kv_cache_update_slices( + num_tokens: int, max_num_reqs: int, page_size: int, + num_slices_per_kv_cache_update_block: int) -> int: """Calculates the padded number of KV cache update slices to avoid recompilation.""" padded_num_slices = 2 * max_num_reqs + num_tokens // page_size padded_num_slices = min(padded_num_slices, num_tokens) padded_num_slices = ( - padded_num_slices + NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK - 1 - ) // NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK * \ - NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK + padded_num_slices + num_slices_per_kv_cache_update_block - 1 + ) // num_slices_per_kv_cache_update_block * \ + num_slices_per_kv_cache_update_block return padded_num_slices +def _get_num_slices_per_kv_cache_update_block(page_size_bytes: int) -> int: + """Find the optimum number of slices to copy per Pallas program instance. + + Increasing the number of slices copied in one instance of the kernel program + will increase HBM bandwidth utilization via more in-flight DMAs. 
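For intuition about how `_get_num_slices_per_kv_cache_update_block` interacts with `get_page_size_bytes`, here is a small worked example. The cache dimensions are made up for illustration and do not correspond to a specific TPU configuration:

```python
# Hypothetical bfloat16 cache: 8 KV heads, head_size 128, 32-token pages.
page_size_bytes = 32 * 8 * 128 * 2        # get_page_size_bytes(...) == 64 KiB

vmem_limit = 32 * 1024 * 1024             # 32 MiB budget used by the function
num_slices = vmem_limit // page_size_bytes           # 512
num_slices = 1 << (num_slices.bit_length() - 1)      # prev_power_of_2 -> 512
num_slices = min(num_slices, 64)                     # cap explained just below
print(num_slices)                                    # 64
```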
+ + However, it will also use more VMEM, and experimentally, we observed + performance regression at 128 slices on v6e, likely due to running + out of scalar registers. Thus this function will limit the number of + slices to 64. + """ + # Conservative VMEM usage limit: 32 MiB + vmem_limit = 32 * 1024 * 1024 + num_slices_per_block = vmem_limit // page_size_bytes + assert num_slices_per_block > 0, "Number of slices should be positive" + num_slices_per_block = prev_power_of_2(num_slices_per_block) + if num_slices_per_block > 64: + num_slices_per_block = 64 + return num_slices_per_block + + def replace_set_lora(model): def _tpu_set_lora( From 2485f57938b19b192c24f79ca8b0f2fcc652fee9 Mon Sep 17 00:00:00 2001 From: Thomas Parnell Date: Tue, 15 Jul 2025 13:04:35 +0200 Subject: [PATCH 098/552] [V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli (#20840) Signed-off-by: Thomas Parnell Signed-off-by: x22x22 --- docs/usage/v1_guide.md | 3 +- .../models/language/generation/test_hybrid.py | 16 +--- vllm/config.py | 9 +- .../layers/mamba/mamba_mixer2.py | 48 ++-------- .../layers/mamba/mamba_utils.py | 55 +++++++++++ vllm/model_executor/models/bamba.py | 81 ++++++++-------- vllm/model_executor/models/config.py | 90 ++++++++++++++++++ vllm/model_executor/models/falcon_h1.py | 80 ++++++++-------- .../model_executor/models/granitemoehybrid.py | 80 ++++++++-------- vllm/model_executor/models/interfaces.py | 20 ++++ vllm/model_executor/models/mamba2.py | 80 ++++++++-------- vllm/model_executor/models/nemotron_h.py | 82 +++++++++-------- vllm/model_executor/models/zamba2.py | 92 +++++++++---------- vllm/v1/worker/gpu_model_runner.py | 58 +----------- 14 files changed, 441 insertions(+), 353 deletions(-) create mode 100644 vllm/model_executor/layers/mamba/mamba_utils.py diff --git a/docs/usage/v1_guide.md b/docs/usage/v1_guide.md index 459ea2d676c..d7634223542 100644 --- a/docs/usage/v1_guide.md +++ b/docs/usage/v1_guide.md @@ -112,8 +112,7 @@ enforcing eager mode and disabling prefix caching in V1. Models that combine Mamba-2 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`, `Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`). Please note that these models currently require enforcing eager mode, disabling prefix caching, and using the FlashInfer attention -backend in V1. It is also necessary to pass a non-standard block size for attention layers (this is not possible -using the `vllm serve` CLI yet). +backend in V1. 
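To make the constraints above concrete, here is a minimal offline sketch that mirrors the settings used by the updated test in this patch (V1 engine, eager mode, prefix caching off, FlashInfer attention). The model name is simply one of the hybrid checkpoints referenced in that test, and selecting the backend through an environment variable is an assumption rather than the only possible approach:

```python
import os

# Select V1 and the FlashInfer attention backend before importing vLLM.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-ai-platform/Bamba-9B-v1",   # any of the hybrid models above
    enforce_eager=True,                    # currently required for these models
    enable_prefix_caching=False,           # prefix caching not yet supported
)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```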
#### Encoder-Decoder Models diff --git a/tests/models/language/generation/test_hybrid.py b/tests/models/language/generation/test_hybrid.py index ecaae3ec1fc..eba14e64553 100644 --- a/tests/models/language/generation/test_hybrid.py +++ b/tests/models/language/generation/test_hybrid.py @@ -61,14 +61,6 @@ "tiiuae/Falcon-H1-0.5B-Base", ] -ATTN_BLOCK_SIZES = { - "ibm-ai-platform/Bamba-9B-v1": 528, - "Zyphra/Zamba2-1.2B-instruct": 80, - "nvidia/Nemotron-H-8B-Base-8K": 528, - "ibm-granite/granite-4.0-tiny-preview": 400, - "tiiuae/Falcon-H1-0.5B-Base": 800, -} - # Avoid OOM MAX_NUM_SEQS = 4 @@ -105,11 +97,6 @@ def test_models( example_prompts, max_tokens, num_logprobs) if model in V1_SUPPORTED_MODELS: - if model in HYBRID_MODELS and model in ATTN_BLOCK_SIZES: - block_size = ATTN_BLOCK_SIZES[model] - else: - block_size = 16 - with monkeypatch.context() as m: m.setenv("VLLM_USE_V1", "1") if model in HYBRID_MODELS: @@ -118,8 +105,7 @@ def test_models( with vllm_runner(model, max_num_seqs=MAX_NUM_SEQS, enforce_eager=True, - enable_prefix_caching=False, - block_size=block_size) as vllm_model: + enable_prefix_caching=False) as vllm_model: vllm_v1_outputs = vllm_model.generate_greedy_logprobs( example_prompts, max_tokens, num_logprobs) else: diff --git a/vllm/config.py b/vllm/config.py index 70b023a5d23..f16287c2be5 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -1630,6 +1630,9 @@ class CacheConfig: checkpoint if available. Otherwise, the scales will default to 1.0.""" cpu_kvcache_space_bytes: Optional[int] = None """(CPU backend only) CPU key-value cache space.""" + mamba_page_size_padded: Optional[int] = None + """ Optional override for mamba page size; used by hybrid mamba/attention + models to ensure exact alignment with attention page size.""" # Will be set after profiling. num_gpu_blocks: Optional[int] = field(default=None, init=False) @@ -4911,11 +4914,15 @@ def try_verify_and_update_config(self): if architecture is None: return - from vllm.model_executor.models.config import MODELS_CONFIG_MAP + from vllm.model_executor.models.config import ( + MODELS_CONFIG_MAP, HybridAttentionMambaModelConfig) cls = MODELS_CONFIG_MAP.get(architecture, None) if cls is not None: cls.verify_and_update_config(self) + if self.model_config.is_hybrid: + HybridAttentionMambaModelConfig.verify_and_update_config(self) + if self.model_config.task == "classify": # Maybe convert ForCausalLM into ForSequenceClassification model. 
from vllm.model_executor.models.adapters import ( diff --git a/vllm/model_executor/layers/mamba/mamba_mixer2.py b/vllm/model_executor/layers/mamba/mamba_mixer2.py index 4ca8e6b97fc..a88bd55e236 100644 --- a/vllm/model_executor/layers/mamba/mamba_mixer2.py +++ b/vllm/model_executor/layers/mamba/mamba_mixer2.py @@ -20,6 +20,8 @@ from vllm.model_executor.layers.mamba.abstract import MambaBase from vllm.model_executor.layers.mamba.mamba2_metadata import (Mamba2Metadata, update_metadata) +from vllm.model_executor.layers.mamba.mamba_utils import ( + extra_groups_for_head_shards, get_mamba_state_shape) from vllm.model_executor.layers.mamba.ops.causal_conv1d import ( causal_conv1d_fn, causal_conv1d_update) from vllm.model_executor.layers.mamba.ops.mamba_ssm import ( @@ -146,18 +148,6 @@ def forward_cuda( return out -def extra_groups_for_head_shards(ngroups: int, tp_size: int): - """Compute the increase in group numbers to account for - replication in order to accompany the head shards.""" - - # in the case ngoups % tp_size == 0, this will be zero - if ngroups % tp_size == 0: - return 0 - - # for n_groups == 1, this is exactly tp_size - n_groups - return tp_size - ngroups - - def mamba_v2_sharded_weight_loader( shard_spec: list[tuple[int, int, float]], tp_size: int, @@ -707,30 +697,12 @@ def forward_cuda( return out def get_state_shape(self) -> tuple[tuple[int, ...], tuple[int, ...]]: - world_size = get_tensor_model_parallel_world_size() - - conv_state_shape, temporal_state_shape = None, None - - # if n_groups is not divisible by world_size, need to extend the shards - # to ensure all groups needed by a head is sharded along with it - n_groups = (self.n_groups + - extra_groups_for_head_shards(self.n_groups, world_size)) - - # - heads and n_groups are TP-ed - conv_dim = (self.intermediate_size + - 2 * n_groups * self.ssm_state_size) - # contiguous along 'dim' axis - conv_state_shape = ( - self.conv_kernel_size - 1, - divide(conv_dim, world_size), - ) - - # These are not TP-ed as they depend on A, dt_bias, D - # - they are typically small - # e.g., (h_heads, d_head, d_state) = (128, 64, 128) - temporal_state_shape = ( - divide(self.num_heads, world_size), - self.head_dim, - self.ssm_state_size, + return get_mamba_state_shape( + intermediate_size=self.intermediate_size, + tp_world_size=get_tensor_model_parallel_world_size(), + n_groups=self.n_groups, + num_heads=self.num_heads, + head_dim=self.head_dim, + state_size=self.ssm_state_size, + conv_kernel=self.conv_kernel_size, ) - return conv_state_shape, temporal_state_shape diff --git a/vllm/model_executor/layers/mamba/mamba_utils.py b/vllm/model_executor/layers/mamba/mamba_utils.py new file mode 100644 index 00000000000..99a582066c0 --- /dev/null +++ b/vllm/model_executor/layers/mamba/mamba_utils.py @@ -0,0 +1,55 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from vllm.distributed import divide + + +def extra_groups_for_head_shards(ngroups: int, tp_size: int): + """Compute the increase in group numbers to account for + replication in order to accompany the head shards.""" + + # in the case ngoups % tp_size == 0, this will be zero + if ngroups % tp_size == 0: + return 0 + + # for n_groups == 1, this is exactly tp_size - n_groups + return tp_size - ngroups + + +def get_mamba_state_shape( + intermediate_size: int, + tp_world_size: int, + n_groups: int, + num_heads: int, + head_dim: int, + state_size: int, + conv_kernel: int, + use_v1: bool = True, +) -> tuple[tuple[int, int], tuple[int, 
int, int]]: + """ Get the shape of mamba state.""" + + # if n_groups is not divisible by world_size, need to extend the shards + # to ensure all groups needed by a head is sharded along with it + n_groups = (n_groups + + extra_groups_for_head_shards(n_groups, tp_world_size)) + + # - heads and n_groups are TP-ed + conv_dim = (intermediate_size + 2 * n_groups * state_size) + # contiguous along 'dim' axis + conv_state_shape = ( + conv_kernel - 1, + divide(conv_dim, tp_world_size), + ) + + if not use_v1: + conv_state_shape = (conv_state_shape[1], conv_state_shape[0]) + + # These are not TP-ed as they depend on A, dt_bias, D + # - they are typically small + # e.g., (h_heads, head_dim, state_size) = (128, 64, 128) + temporal_state_shape = ( + divide(num_heads, tp_world_size), + head_dim, + state_size, + ) + + return conv_state_shape, temporal_state_shape diff --git a/vllm/model_executor/models/bamba.py b/vllm/model_executor/models/bamba.py index dfc55b0c341..e93d4294a62 100644 --- a/vllm/model_executor/models/bamba.py +++ b/vllm/model_executor/models/bamba.py @@ -12,7 +12,7 @@ from vllm import envs from vllm.attention.layer import Attention from vllm.config import CacheConfig, VllmConfig -from vllm.distributed import divide, get_tensor_model_parallel_world_size +from vllm.distributed import get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group from vllm.forward_context import get_forward_context from vllm.model_executor.layers.activation import SiluAndMul @@ -23,8 +23,8 @@ from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.mamba.mamba2_metadata import ( Mamba2Metadata, prepare_mamba2_metadata) -from vllm.model_executor.layers.mamba.mamba_mixer2 import ( - MambaMixer2, extra_groups_for_head_shards) +from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2 +from vllm.model_executor.layers.mamba.mamba_utils import get_mamba_state_shape from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.rotary_embedding import get_rope from vllm.model_executor.layers.vocab_parallel_embedding import ( @@ -435,6 +435,38 @@ class BambaForCausalLM(nn.Module, HasInnerState, SupportsLoRA, SupportsPP, } embedding_padding_modules = ["lm_head"] + @classmethod + def get_mamba_state_shape_from_config( + cls, + vllm_config: "VllmConfig", + use_v1: bool = True, + ) -> tuple[tuple[int, int], tuple[int, int, int]]: + """Calculate shapes for Mamba's convolutional and state caches. 
+ + Args: + vllm_config: vLLM config + use_v1: Get shapes for V1 (or V0) + + Returns: + Tuple containing: + - conv_state_shape: Shape for convolutional state cache + - temporal_state_shape: Shape for state space model cache + """ + parallel_config = vllm_config.parallel_config + hf_config = vllm_config.model_config.hf_config + intermediate_size = hf_config.mamba_expand * hf_config.hidden_size + + return get_mamba_state_shape( + intermediate_size=intermediate_size, + tp_world_size=parallel_config.tensor_parallel_size, + n_groups=hf_config.mamba_n_groups, + num_heads=hf_config.mamba_n_heads, + head_dim=hf_config.mamba_d_head, + state_size=hf_config.mamba_d_state, + conv_kernel=hf_config.mamba_d_conv, + use_v1=use_v1, + ) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): config = vllm_config.model_config.hf_config self.vllm_config = vllm_config @@ -491,10 +523,13 @@ def forward(self, self.vllm_config.parallel_config, LayerBlockType.mamba ) - - self.mamba_cache = MambaCacheManager( - self.vllm_config, self.lm_head.weight.dtype, - num_mamba_layers, *self._get_mamba_cache_shape()) + mamba_state_shape = \ + self.get_mamba_state_shape_from_config( + self.vllm_config, use_v1=False) + self.mamba_cache = MambaCacheManager(self.vllm_config, + self.lm_head.weight.dtype, + num_mamba_layers, + *mamba_state_shape) mamba_cache_params = self.mamba_cache.current_run_tensors(**kwargs) @@ -510,38 +545,6 @@ def copy_inputs_before_cuda_graphs(self, input_buffers, **kwargs): def get_seqlen_agnostic_capture_inputs(self, batch_size: int): return self.mamba_cache.get_seqlen_agnostic_capture_inputs(batch_size) - def _get_mamba_cache_shape( - self) -> tuple[tuple[int, int], tuple[int, int]]: - world_size = get_tensor_model_parallel_world_size() - hidden_size = self.config.hidden_size - - conv_state_shape, temporal_state_shape = None, None - - intermediate_size = self.config.mamba_expand * hidden_size - - # if n_groups is not divisible by world_size, need to extend the shards - # to ensure all groups needed by a head is sharded along with it - n_groups = (self.config.mamba_n_groups + extra_groups_for_head_shards( - self.config.mamba_n_groups, world_size)) - - # - heads and n_groups are TP-ed - conv_dim = (intermediate_size + - 2 * n_groups * self.config.mamba_d_state) - conv_state_shape = ( - divide(conv_dim, world_size), - self.config.mamba_d_conv - 1, - ) - - # These are not TP-ed as they depend on A, dt_bias, D - # - they are typically small - # e.g., (h_heads, d_head, d_state) = (128, 64, 128) - temporal_state_shape = ( - divide(self.config.mamba_n_heads, world_size), - self.config.mamba_d_head, - self.config.mamba_d_state, - ) - return conv_state_shape, temporal_state_shape - def compute_logits( self, hidden_states: torch.Tensor, diff --git a/vllm/model_executor/models/config.py b/vllm/model_executor/models/config.py index 6d0ffad1a81..6c6f8e7268b 100644 --- a/vllm/model_executor/models/config.py +++ b/vllm/model_executor/models/config.py @@ -3,9 +3,14 @@ from copy import deepcopy from typing import TYPE_CHECKING +import vllm.envs as envs from vllm.logger import init_logger +from vllm.model_executor.models import ModelRegistry +from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, cdiv +from vllm.v1.kv_cache_interface import FullAttentionSpec, MambaSpec if TYPE_CHECKING: + from vllm.config import VllmConfig logger = init_logger(__name__) @@ -200,6 +205,91 @@ def verify_and_update_config(vllm_config: "VllmConfig") -> None: } +class HybridAttentionMambaModelConfig(VerifyAndUpdateConfig): + + 
@classmethod + def verify_and_update_config(cls, vllm_config: "VllmConfig") -> None: + """ + Ensure that page size of attention layers is greater than or + equal to the mamba layers. If not, automatically set the attention + block size to ensure that it is. If the attention page size is + strictly greater than the mamba page size, we pad the mamba page size + to make them equal. + + Args: + vllm_config: vLLM Config + """ + + if not envs.VLLM_USE_V1: + return + + cache_config = vllm_config.cache_config + model_config = vllm_config.model_config + parallel_config = vllm_config.parallel_config + + if cache_config.cache_dtype == "auto": + kv_cache_dtype = model_config.dtype + else: + kv_cache_dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype] + + # get attention page size (for 1 token) + attn_page_size_1_token = FullAttentionSpec( + block_size=1, + num_kv_heads=model_config.get_num_kv_heads(parallel_config), + head_size=model_config.get_head_size(), + dtype=kv_cache_dtype, + use_mla=model_config.use_mla).page_size_bytes + + model_cls = ModelRegistry.resolve_model_cls( + model_config._model_info.architecture)[0] + + # get mamba page size + mamba_page_size = MambaSpec( + shapes=model_cls.get_mamba_state_shape_from_config(vllm_config), + dtype=kv_cache_dtype, + block_size=model_config.max_model_len, + ).page_size_bytes + + # some attention backends (e.g. FA) only support setting + # block size to multiple of 16, so let's suggest a value + # that would work (note: FA is currently not compatible + # with mamba layers, use FlashInfer instead). + attn_block_size = 16 * cdiv(mamba_page_size, + 16 * attn_page_size_1_token) + + # override attention block size if either (a) the + # user has not set it or (b) the user has set it + # too small. + if (cache_config.block_size is None + or cache_config.block_size < attn_block_size): + cache_config.block_size = attn_block_size + logger.info( + "Setting attention block size to %d tokens " + "to ensure that attention page size is >= mamba page size.", + attn_block_size) + + # compute new attention page size + attn_page_size = \ + cache_config.block_size * attn_page_size_1_token + + assert attn_page_size >= mamba_page_size + + if attn_page_size == mamba_page_size: + # don't need to pad mamba page size + return + + # pad mamba page size to exactly match attention + if (cache_config.mamba_page_size_padded is None + or cache_config.mamba_page_size_padded != attn_page_size): + cache_config.mamba_page_size_padded = (attn_page_size) + mamba_padding_pct = 100 * (attn_page_size - + mamba_page_size) / mamba_page_size + logger.info( + "Padding mamba page size by %.2f%% to ensure " + "that mamba page size and attention page size are " + "exactly equal.", mamba_padding_pct) + + MODELS_CONFIG_MAP: dict[str, type[VerifyAndUpdateConfig]] = { "GteModel": SnowflakeGteNewModelConfig, "GteNewModel": GteNewModelConfig, diff --git a/vllm/model_executor/models/falcon_h1.py b/vllm/model_executor/models/falcon_h1.py index ad3f39793b6..7761de224c9 100644 --- a/vllm/model_executor/models/falcon_h1.py +++ b/vllm/model_executor/models/falcon_h1.py @@ -11,7 +11,7 @@ from vllm import envs from vllm.attention.layer import Attention from vllm.config import CacheConfig, VllmConfig -from vllm.distributed import divide, get_tensor_model_parallel_world_size +from vllm.distributed import get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group from vllm.forward_context import get_forward_context from vllm.model_executor.layers.activation import SiluAndMul 
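The arithmetic in `HybridAttentionMambaModelConfig.verify_and_update_config` above is easier to follow with concrete numbers. The sketch below plugs in illustrative values roughly in the spirit of Bamba-9B with bfloat16 caches and TP=1; the real values come from the checkpoint's HF config, so treat every constant here as an assumption:

```python
def cdiv(a: int, b: int) -> int:
    return -(-a // b)

dtype_bytes = 2                              # bfloat16
# Mamba state per layer (shapes from get_mamba_state_shape, TP=1)
intermediate_size = 2 * 4096                 # mamba_expand * hidden_size
n_groups, d_state, d_conv = 1, 128, 4
num_heads, head_dim = 128, 64
conv_dim = intermediate_size + 2 * n_groups * d_state
mamba_elems = (d_conv - 1) * conv_dim + num_heads * head_dim * d_state
mamba_page_size = mamba_elems * dtype_bytes  # roughly 2 MiB per sequence

# Full-attention page size for a single token (key + value)
num_kv_heads, head_size = 8, 128
attn_page_size_1_token = 2 * num_kv_heads * head_size * dtype_bytes

# Attention block size chosen so one attention page >= one mamba page,
# rounded up to a multiple of 16.
attn_block_size = 16 * cdiv(mamba_page_size, 16 * attn_page_size_1_token)
print(attn_block_size)   # 528, the figure previously hard-coded for Bamba
                         # in test_hybrid.py and removed earlier in this patch
```

With these constants the chosen attention page (528 * 4096 bytes) comes out about 0.7% larger than the mamba page, which is the kind of padding percentage reported by the log message in the code above.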
@@ -22,8 +22,8 @@ from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.mamba.mamba2_metadata import ( Mamba2Metadata, prepare_mamba2_metadata) -from vllm.model_executor.layers.mamba.mamba_mixer2 import ( - MambaMixer2, extra_groups_for_head_shards) +from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2 +from vllm.model_executor.layers.mamba.mamba_utils import get_mamba_state_shape from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.rotary_embedding import get_rope from vllm.model_executor.layers.vocab_parallel_embedding import ( @@ -514,6 +514,42 @@ class FalconH1ForCausalLM(nn.Module, HasInnerState, SupportsLoRA, SupportsPP, } embedding_padding_modules = ["lm_head"] + @classmethod + def get_mamba_state_shape_from_config( + cls, + vllm_config: "VllmConfig", + use_v1: bool = True, + ) -> tuple[tuple[int, int], tuple[int, int, int]]: + """Calculate shapes for Mamba's convolutional and state caches. + + Args: + vllm_config: vLLM config + use_v1: Get shapes for V1 (or V0) + + Returns: + Tuple containing: + - conv_state_shape: Shape for convolutional state cache + - temporal_state_shape: Shape for state space model cache + """ + parallel_config = vllm_config.parallel_config + hf_config = vllm_config.model_config.hf_config + + intermediate_size = (int(hf_config.mamba_expand * + hf_config.hidden_size) + if hf_config.mamba_d_ssm is None else + hf_config.mamba_d_ssm) + + return get_mamba_state_shape( + intermediate_size=intermediate_size, + tp_world_size=parallel_config.tensor_parallel_size, + n_groups=hf_config.mamba_n_groups, + num_heads=hf_config.mamba_n_heads, + head_dim=hf_config.mamba_d_head, + state_size=hf_config.mamba_d_state, + conv_kernel=hf_config.mamba_d_conv, + use_v1=use_v1, + ) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): config = vllm_config.model_config.hf_config self.vllm_config = vllm_config @@ -580,12 +616,15 @@ def forward( mamba_cache_params = None if not envs.VLLM_USE_V1: if self.mamba_cache is None: + mamba_state_shape = \ + self.get_mamba_state_shape_from_config( + self.vllm_config, use_v1=False) self.mamba_cache = MambaCacheManager( self.vllm_config, self.lm_head.weight.dtype if hasattr( self.lm_head, 'weight') else torch.bfloat16, self.config.num_hidden_layers, - *self._get_mamba_cache_shape(), + *mamba_state_shape, ) mamba_cache_params = self.mamba_cache.current_run_tensors(**kwargs) @@ -606,39 +645,6 @@ def copy_inputs_before_cuda_graphs(self, input_buffers, **kwargs): def get_seqlen_agnostic_capture_inputs(self, batch_size: int): return self.mamba_cache.get_seqlen_agnostic_capture_inputs(batch_size) - def _get_mamba_cache_shape( - self) -> tuple[tuple[int, int], tuple[int, int]]: - world_size = get_tensor_model_parallel_world_size() - hidden_size = self.config.hidden_size - - conv_state_shape, temporal_state_shape = None, None - - intermediate_size = (int(self.config.mamba_expand * - hidden_size) if self.config.mamba_d_ssm - is None else self.config.mamba_d_ssm) - - # if n_groups is not divisible by world_size, need to extend the shards - # to ensure all groups needed by a head is sharded along with it - n_groups = self.config.mamba_n_groups + extra_groups_for_head_shards( - self.config.mamba_n_groups, world_size) - - # - heads and n_groups are TP-ed - conv_dim = intermediate_size + 2 * n_groups * self.config.mamba_d_state - conv_state_shape = ( - divide(conv_dim, world_size), - self.config.mamba_d_conv - 1, - ) - - # These are 
not TP-ed as they depend on A, dt_bias, D - # - they are typically small - # e.g., (h_heads, d_head, d_state) = (128, 64, 128) - temporal_state_shape = ( - divide(self.config.mamba_n_heads, world_size), - self.config.mamba_d_head, - self.config.mamba_d_state, - ) - return conv_state_shape, temporal_state_shape - def compute_logits( self, hidden_states: torch.Tensor, diff --git a/vllm/model_executor/models/granitemoehybrid.py b/vllm/model_executor/models/granitemoehybrid.py index 1055fa0372b..1c93e90737a 100644 --- a/vllm/model_executor/models/granitemoehybrid.py +++ b/vllm/model_executor/models/granitemoehybrid.py @@ -12,7 +12,7 @@ from vllm import envs from vllm.attention.layer import Attention from vllm.config import CacheConfig, VllmConfig -from vllm.distributed import divide, get_tensor_model_parallel_world_size +from vllm.distributed import get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group from vllm.forward_context import get_forward_context from vllm.model_executor.layers.layernorm import RMSNorm @@ -21,8 +21,8 @@ from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.mamba.mamba2_metadata import ( Mamba2Metadata, prepare_mamba2_metadata) -from vllm.model_executor.layers.mamba.mamba_mixer2 import ( - MambaMixer2, extra_groups_for_head_shards) +from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2 +from vllm.model_executor.layers.mamba.mamba_utils import get_mamba_state_shape from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.rotary_embedding import get_rope from vllm.model_executor.layers.vocab_parallel_embedding import ( @@ -524,6 +524,38 @@ class GraniteMoeHybridForCausalLM(nn.Module, HasInnerState, SupportsLoRA, } embedding_padding_modules = ["lm_head"] + @classmethod + def get_mamba_state_shape_from_config( + cls, + vllm_config: "VllmConfig", + use_v1: bool = True, + ) -> tuple[tuple[int, int], tuple[int, int, int]]: + """Calculate shapes for Mamba's convolutional and state caches. 
+ + Args: + vllm_config: vLLM config + use_v1: Get shapes for V1 (or V0) + + Returns: + Tuple containing: + - conv_state_shape: Shape for convolutional state cache + - temporal_state_shape: Shape for state space model cache + """ + parallel_config = vllm_config.parallel_config + hf_config = vllm_config.model_config.hf_config + intermediate_size = hf_config.mamba_expand * hf_config.hidden_size + + return get_mamba_state_shape( + intermediate_size=intermediate_size, + tp_world_size=parallel_config.tensor_parallel_size, + n_groups=hf_config.mamba_n_groups, + num_heads=hf_config.mamba_n_heads, + head_dim=hf_config.mamba_d_head, + state_size=hf_config.mamba_d_state, + conv_kernel=hf_config.mamba_d_conv, + use_v1=use_v1, + ) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() @@ -587,9 +619,13 @@ def forward(self, self.model_config.get_num_layers_by_block_type( self.vllm_config.parallel_config, LayerBlockType.mamba)) - self.mamba_cache = MambaCacheManager( - self.vllm_config, self.model_config.dtype, - num_mamba_layers, *self._get_mamba_cache_shape()) + mamba_state_shape = \ + self.get_mamba_state_shape_from_config( + self.vllm_config, use_v1=False) + self.mamba_cache = MambaCacheManager(self.vllm_config, + self.model_config.dtype, + num_mamba_layers, + *mamba_state_shape) mamba_cache_params = self.mamba_cache.current_run_tensors(**kwargs) @@ -605,38 +641,6 @@ def copy_inputs_before_cuda_graphs(self, input_buffers, **kwargs): def get_seqlen_agnostic_capture_inputs(self, batch_size: int): return self.mamba_cache.get_seqlen_agnostic_capture_inputs(batch_size) - def _get_mamba_cache_shape( - self) -> tuple[tuple[int, int], tuple[int, int]]: - world_size = get_tensor_model_parallel_world_size() - hidden_size = self.config.hidden_size - - conv_state_shape, temporal_state_shape = None, None - - intermediate_size = self.config.mamba_expand * hidden_size - - # if n_groups is not divisible by world_size, need to extend the shards - # to ensure all groups needed by a head is sharded along with it - n_groups = (self.config.mamba_n_groups + extra_groups_for_head_shards( - self.config.mamba_n_groups, world_size)) - - # - heads and n_groups are TP-ed - conv_dim = (intermediate_size + - 2 * n_groups * self.config.mamba_d_state) - conv_state_shape = ( - divide(conv_dim, world_size), - self.config.mamba_d_conv - 1, - ) - - # These are not TP-ed as they depend on A, dt_bias, D - # - they are typically small - # e.g., (h_heads, d_head, d_state) = (128, 64, 128) - temporal_state_shape = ( - divide(self.config.mamba_n_heads, world_size), - self.config.mamba_d_head, - self.config.mamba_d_state, - ) - return conv_state_shape, temporal_state_shape - def compute_logits( self, hidden_states: torch.Tensor, diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 3a97641aa2f..95970474d55 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -22,6 +22,7 @@ if TYPE_CHECKING: from vllm.attention import AttentionMetadata + from vllm.config import VllmConfig from vllm.model_executor.models.utils import WeightsMapper from vllm.sequence import IntermediateTensors @@ -481,6 +482,25 @@ class IsHybrid(Protocol): , also indicates that the model's hf_config has 'layers_block_type' """ + @classmethod + def get_mamba_state_shape_from_config( + cls, + vllm_config: "VllmConfig", + use_v1: bool = True, + ) -> tuple[tuple[int, int], tuple[int, int, int]]: + """Calculate shapes for Mamba's convolutional and state 
caches. + + Args: + vllm_config: vLLM config + use_v1: Get shapes for V1 (or V0) + + Returns: + Tuple containing: + - conv_state_shape: Shape for convolutional state cache + - temporal_state_shape: Shape for state space model cache + """ + ... + @runtime_checkable class _IsHybridType(Protocol): diff --git a/vllm/model_executor/models/mamba2.py b/vllm/model_executor/models/mamba2.py index b9fa5707393..d812d8cc0a3 100644 --- a/vllm/model_executor/models/mamba2.py +++ b/vllm/model_executor/models/mamba2.py @@ -11,15 +11,14 @@ from vllm import envs from vllm.attention.backends.abstract import AttentionMetadata from vllm.config import VllmConfig -from vllm.distributed import divide, get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group from vllm.forward_context import get_forward_context from vllm.model_executor.layers.layernorm import RMSNorm from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.mamba.mamba2_metadata import ( Mamba2Metadata, prepare_mamba2_metadata) -from vllm.model_executor.layers.mamba.mamba_mixer2 import ( - MambaMixer2, extra_groups_for_head_shards) +from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2 +from vllm.model_executor.layers.mamba.mamba_utils import get_mamba_state_shape from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) from vllm.model_executor.layers.vocab_parallel_embedding import ( @@ -198,6 +197,38 @@ def load_weights(self, weights: Iterable[tuple[str, class Mamba2ForCausalLM(nn.Module, HasInnerState, IsAttentionFree): + @classmethod + def get_mamba_state_shape_from_config( + cls, + vllm_config: "VllmConfig", + use_v1: bool = True, + ) -> tuple[tuple[int, int], tuple[int, int, int]]: + """Calculate shapes for Mamba's convolutional and state caches. 
+ + Args: + vllm_config: vLLM config + use_v1: Get shapes for V1 (or V0) + + Returns: + Tuple containing: + - conv_state_shape: Shape for convolutional state cache + - temporal_state_shape: Shape for state space model cache + """ + parallel_config = vllm_config.parallel_config + hf_config = vllm_config.model_config.hf_config + intermediate_size = hf_config.expand * hf_config.hidden_size + + return get_mamba_state_shape( + intermediate_size=intermediate_size, + tp_world_size=parallel_config.tensor_parallel_size, + n_groups=hf_config.n_groups, + num_heads=hf_config.num_heads, + head_dim=hf_config.head_dim, + state_size=hf_config.state_size, + conv_kernel=hf_config.conv_kernel, + use_v1=use_v1, + ) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): config = vllm_config.model_config.hf_config cache_config = vllm_config.cache_config @@ -253,9 +284,13 @@ def forward(self, self.model_config.get_num_layers_by_block_type( self.vllm_config.parallel_config, LayerBlockType.mamba)) - self.mamba_cache = MambaCacheManager( - self.vllm_config, self.lm_head.weight.dtype, - num_mamba_layers, *self._get_mamba_cache_shape()) + mamba_state_shape = \ + self.get_mamba_state_shape_from_config( + self.vllm_config, use_v1=False) + self.mamba_cache = MambaCacheManager(self.vllm_config, + self.lm_head.weight.dtype, + num_mamba_layers, + *mamba_state_shape) mamba_cache_params = self.mamba_cache.current_run_tensors(**kwargs) else: @@ -274,39 +309,6 @@ def copy_inputs_before_cuda_graphs(self, input_buffers, **kwargs): def get_seqlen_agnostic_capture_inputs(self, batch_size: int): return self.mamba_cache.get_seqlen_agnostic_capture_inputs(batch_size) - def _get_mamba_cache_shape( - self) -> tuple[tuple[int, int], tuple[int, int]]: - world_size = get_tensor_model_parallel_world_size() - - conv_state_shape, temporal_state_shape = None, None - - intermediate_size = getattr( - self.config, "intermediate_size", - self.config.expand * self.config.hidden_size) - - # if n_groups is not divisible by world_size, need to extend the shards - # to ensure all groups needed by a head is sharded along with it - n_groups = ( - self.config.n_groups + - extra_groups_for_head_shards(self.config.n_groups, world_size)) - - # - heads and n_groups are TP-ed - conv_dim = (intermediate_size + 2 * n_groups * self.config.state_size) - conv_state_shape = ( - divide(conv_dim, world_size), - self.config.conv_kernel - 1, - ) - - # These are not TP-ed as they depend on A, dt_bias, D - # - they are typically small - # e.g., (h_heads, d_head, d_state) = (128, 64, 128) - temporal_state_shape = ( - divide(self.config.num_heads, world_size), - self.config.head_dim, - self.config.state_size, - ) - return conv_state_shape, temporal_state_shape - def compute_logits(self, hidden_states: torch.Tensor, sampling_metadata: SamplingMetadata) -> torch.Tensor: logits = self.logits_processor(self.lm_head, hidden_states, diff --git a/vllm/model_executor/models/nemotron_h.py b/vllm/model_executor/models/nemotron_h.py index 60fb7254725..cf7b39db1fe 100644 --- a/vllm/model_executor/models/nemotron_h.py +++ b/vllm/model_executor/models/nemotron_h.py @@ -26,7 +26,7 @@ from vllm import envs from vllm.attention.layer import Attention from vllm.config import CacheConfig, VllmConfig -from vllm.distributed import divide, get_tensor_model_parallel_world_size +from vllm.distributed import get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group from vllm.forward_context import get_forward_context from 
vllm.model_executor.layers.activation import ReLUSquaredActivation @@ -37,8 +37,8 @@ from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.mamba.mamba2_metadata import ( Mamba2Metadata, prepare_mamba2_metadata) -from vllm.model_executor.layers.mamba.mamba_mixer2 import ( - MambaMixer2, extra_groups_for_head_shards) +from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2 +from vllm.model_executor.layers.mamba.mamba_utils import get_mamba_state_shape from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.vocab_parallel_embedding import ( DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) @@ -459,6 +459,38 @@ class NemotronHForCausalLM(nn.Module, HasInnerState, SupportsLoRA, SupportsPP, } embedding_padding_modules = ["lm_head"] + @classmethod + def get_mamba_state_shape_from_config( + cls, + vllm_config: "VllmConfig", + use_v1: bool = True, + ) -> tuple[tuple[int, int], tuple[int, int, int]]: + """Calculate shapes for Mamba's convolutional and state caches. + + Args: + vllm_config: vLLM config + use_v1: Get shapes for V1 (or V0) + + Returns: + Tuple containing: + - conv_state_shape: Shape for convolutional state cache + - temporal_state_shape: Shape for state space model cache + """ + parallel_config = vllm_config.parallel_config + hf_config = vllm_config.model_config.hf_config + intermediate_size = hf_config.expand * hf_config.hidden_size + + return get_mamba_state_shape( + intermediate_size=intermediate_size, + tp_world_size=parallel_config.tensor_parallel_size, + n_groups=hf_config.n_groups, + num_heads=hf_config.mamba_num_heads, + head_dim=hf_config.mamba_head_dim, + state_size=hf_config.ssm_state_size, + conv_kernel=hf_config.conv_kernel, + use_v1=use_v1, + ) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): config = vllm_config.model_config.hf_config self.vllm_config = vllm_config @@ -515,10 +547,13 @@ def forward(self, self.vllm_config.parallel_config, LayerBlockType.mamba ) - - self.mamba_cache = MambaCacheManager( - self.vllm_config, self.lm_head.weight.dtype, - num_mamba_layers, *self._get_mamba_cache_shape()) + mamba_state_shape = \ + self.get_mamba_state_shape_from_config( + self.vllm_config, use_v1=False) + self.mamba_cache = MambaCacheManager(self.vllm_config, + self.lm_head.weight.dtype, + num_mamba_layers, + *mamba_state_shape) mamba_cache_params = self.mamba_cache.current_run_tensors(**kwargs) @@ -534,39 +569,6 @@ def copy_inputs_before_cuda_graphs(self, input_buffers, **kwargs): def get_seqlen_agnostic_capture_inputs(self, batch_size: int): return self.mamba_cache.get_seqlen_agnostic_capture_inputs(batch_size) - def _get_mamba_cache_shape( - self) -> tuple[tuple[int, int], tuple[int, int]]: - world_size = get_tensor_model_parallel_world_size() - hidden_size = self.config.hidden_size - - conv_state_shape, temporal_state_shape = None, None - - intermediate_size = self.config.expand * hidden_size - - # if n_groups is not divisible by world_size, need to extend the shards - # to ensure all groups needed by a head is sharded along with it - n_groups = ( - self.config.n_groups + - extra_groups_for_head_shards(self.config.n_groups, world_size)) - - # - heads and n_groups are TP-ed - conv_dim = (intermediate_size + - 2 * n_groups * self.config.ssm_state_size) - conv_state_shape = ( - divide(conv_dim, world_size), - self.config.conv_kernel - 1, - ) - - # These are not TP-ed as they depend on A, dt_bias, D - # - they are typically small - 
# e.g., (h_heads, d_head, d_state) = (128, 64, 128) - temporal_state_shape = ( - divide(self.config.mamba_num_heads, world_size), - self.config.mamba_head_dim, - self.config.ssm_state_size, - ) - return conv_state_shape, temporal_state_shape - def compute_logits( self, hidden_states: torch.Tensor, diff --git a/vllm/model_executor/models/zamba2.py b/vllm/model_executor/models/zamba2.py index 4935fd9e6df..ebf8dd497f6 100644 --- a/vllm/model_executor/models/zamba2.py +++ b/vllm/model_executor/models/zamba2.py @@ -18,7 +18,7 @@ from vllm import envs from vllm.attention.layer import Attention from vllm.config import CacheConfig, VllmConfig -from vllm.distributed import divide, get_tensor_model_parallel_world_size +from vllm.distributed import get_tensor_model_parallel_world_size from vllm.forward_context import get_forward_context from vllm.model_executor.layers.activation import GeluAndMul from vllm.model_executor.layers.layernorm import RMSNorm @@ -30,8 +30,8 @@ from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.mamba.mamba2_metadata import ( Mamba2Metadata, prepare_mamba2_metadata) -from vllm.model_executor.layers.mamba.mamba_mixer2 import ( - MambaMixer2, extra_groups_for_head_shards) +from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2 +from vllm.model_executor.layers.mamba.mamba_utils import get_mamba_state_shape from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.rotary_embedding import get_rope from vllm.model_executor.layers.vocab_parallel_embedding import ( @@ -843,6 +843,39 @@ class Zamba2ForCausalLM(nn.Module, HasInnerState, IsHybrid): "1.weight": "B.weight", }) + @classmethod + def get_mamba_state_shape_from_config( + cls, + vllm_config: "VllmConfig", + use_v1: bool = True, + ) -> tuple[tuple[int, int], tuple[int, int, int]]: + """Calculate shapes for Mamba's convolutional and state caches. + + Args: + vllm_config: vLLM config + use_v1: Get shapes for V1 (or V0) + + Returns: + Tuple containing: + - conv_state_shape: Shape for convolutional state cache + - temporal_state_shape: Shape for state space model cache + """ + + parallel_config = vllm_config.parallel_config + hf_config = vllm_config.model_config.hf_config + intermediate_size = hf_config.mamba_expand * hf_config.hidden_size + + return get_mamba_state_shape( + intermediate_size=intermediate_size, + tp_world_size=parallel_config.tensor_parallel_size, + n_groups=hf_config.mamba_ngroups, + num_heads=hf_config.n_mamba_heads, + head_dim=hf_config.mamba_headdim, + state_size=hf_config.mamba_d_state, + conv_kernel=hf_config.mamba_d_conv, + use_v1=use_v1, + ) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None: """Initialize the Zamba2 model for causal language modeling. 
@@ -925,9 +958,13 @@ def forward(self, if not envs.VLLM_USE_V1: if self.mamba_cache is None: num_mamba_layers = self.config.num_hidden_layers - self.mamba_cache = MambaCacheManager( - self.vllm_config, self.lm_head.weight.dtype, - num_mamba_layers, *self._get_mamba_cache_shape()) + mamba_state_shape = \ + self.get_mamba_state_shape_from_config( + self.vllm_config, use_v1=False) + self.mamba_cache = MambaCacheManager(self.vllm_config, + self.lm_head.weight.dtype, + num_mamba_layers, + *mamba_state_shape) # Get cache parameters for current run mamba_cache_params = self.mamba_cache.current_run_tensors(**kwargs) @@ -968,49 +1005,6 @@ def get_seqlen_agnostic_capture_inputs( """ return self.mamba_cache.get_seqlen_agnostic_capture_inputs(batch_size) - def _get_mamba_cache_shape( - self) -> tuple[tuple[int, int], tuple[int, int]]: - """Calculate shapes for Mamba's convolutional and state caches. - - Returns: - Tuple containing: - - conv_state_shape: Shape for convolutional state cache - - temporal_state_shape: Shape for state space model cache - """ - world_size = get_tensor_model_parallel_world_size() - - intermediate_size = self.config.mamba_expand * self.config.hidden_size - - # Extend groups if needed to ensure all groups needed by a head - # are sharded together - - # if n_groups is not divisible by world_size, need to extend the shards - # to ensure all groups needed by a head is sharded along with it - n_groups = (self.config.mamba_ngroups + extra_groups_for_head_shards( - self.config.mamba_ngroups, world_size)) - - # Calculate conv state shape (includes groups) - # - heads and n_groups are TP-ed - conv_dim = (intermediate_size + - 2 * n_groups * self.config.mamba_d_state) - conv_state_shape = ( - divide(conv_dim, world_size), - self.config.mamba_d_conv - 1, - ) - - # Calculate temporal state shape (per-head states) - # These are not TP-ed as they depend on A, dt_bias, D - # - they are typically small - # e.g., (h_heads, d_head, d_state) = (128, 64, 128) - temporal_state_shape = ( - divide(divide(intermediate_size, self.config.mamba_headdim), - world_size), - self.config.mamba_headdim, - self.config.mamba_d_state, - ) - - return conv_state_shape, temporal_state_shape - def compute_logits( self, hidden_states: torch.Tensor, diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 734df82589a..af216539c90 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -42,7 +42,7 @@ from vllm.sampling_params import SamplingType from vllm.sequence import IntermediateTensors from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, DeviceMemoryProfiler, - GiB_bytes, LazyLoader, async_tensor_h2d, cdiv, + GiB_bytes, LazyLoader, async_tensor_h2d, check_use_alibi, get_dtype_size, is_pin_memory_available, round_up) from vllm.v1.attention.backends.mamba_attn import Mamba2AttentionBackend @@ -2648,9 +2648,8 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: "Prefix caching is not supported for Mamba yet.") max_model_len = self.vllm_config.model_config.max_model_len - page_size_padded = self._maybe_pad_mamba_page_size( - attn_layers, mamba_layers, kv_cache_spec, max_model_len, - block_size) + page_size_padded = ( + self.vllm_config.cache_config.mamba_page_size_padded) # Set block_size to max_model_len, so that mamba model will always # have only one block in the KV cache. 
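The two diffs above replace each model's private `_get_mamba_cache_shape` with the shared `get_mamba_state_shape` helper. The sketch below re-derives the shape math from the removed methods; it is illustrative only, and the real helper in `vllm/model_executor/layers/mamba/mamba_utils.py` may differ in signature and in how it pads the group count.

```python
# Illustrative sketch only: mirrors the removed _get_mamba_cache_shape logic,
# not the actual helper in vllm/model_executor/layers/mamba/mamba_utils.py.
def sketch_mamba_state_shapes(intermediate_size: int, n_groups: int,
                              state_size: int, conv_kernel: int,
                              num_heads: int, head_dim: int,
                              tp_world_size: int):
    # Pad n_groups so every head's groups land on the same TP shard
    # (assumption: a simple round-up to a multiple of tp_world_size).
    if n_groups % tp_world_size:
        n_groups += tp_world_size - (n_groups % tp_world_size)
    # Conv cache: the last (conv_kernel - 1) inputs of the TP-sharded conv dim.
    conv_dim = intermediate_size + 2 * n_groups * state_size
    conv_state_shape = (conv_dim // tp_world_size, conv_kernel - 1)
    # Temporal (SSM) cache: per-head states; only the head count is TP-sharded.
    temporal_state_shape = (num_heads // tp_world_size, head_dim, state_size)
    return conv_state_shape, temporal_state_shape

# e.g. num_heads=128, head_dim=64, state_size=128, tp_world_size=1
# gives temporal_state_shape == (128, 64, 128), as in the removed comment above.
```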
@@ -2662,54 +2661,3 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: page_size_padded=page_size_padded) return kv_cache_spec - - def _maybe_pad_mamba_page_size( - self, - attn_layers: dict[str, Attention], - mamba_layers: dict[str, MambaBase], - kv_cache_spec: dict[str, KVCacheSpec], - max_model_len: int, - block_size: int, - ) -> Optional[int]: - """ - Ensure that page size of attention KV cache groups is greater than or - equal to the mamba KV cache groups. If not, we suggest to the user - how to set the attention block size to ensure that it is. - - If the attention page size is strictly greater than the mamba page size, - we pad the mamba page size to make them equal. - - Args: - attn_layers: Attention layers - mamba_layers: Mamba layers - kv_cache_spec: KV cache spec (populated with attention layers) - - Returns: - Optional[int]: Mamba page size with padding (None if no padding). - """ - - if len(attn_layers) == 0: - return None - - attn_layer_name = next(iter(attn_layers)) - attn_page_size = kv_cache_spec[attn_layer_name].page_size_bytes - mamba_layer_name = next(iter(mamba_layers)) - mamba_page_size = MambaSpec( - shapes=mamba_layers[mamba_layer_name].get_state_shape(), - dtype=self.kv_cache_dtype, - block_size=max_model_len).page_size_bytes - if attn_page_size < mamba_page_size: - # attention page size (for 16 tokens) - attn_page_size_16 = 16 * attn_page_size // block_size - # some attention backends (e.g. FA) only support setting - # block size to multiple of 16, so let's suggest a value - # that would work (note: FA is currently not compatible - # with mamba layers, use FlashInfer instead). - suggest_attn_block_size = 16 * cdiv(mamba_page_size, - attn_page_size_16) - raise ValueError( - "Attention block size should be increased to at least " - f"{suggest_attn_block_size} in order to match " - "the mamba page size") - - return attn_page_size From 67615ee92ec3fb9ea112c964f08236fd52f4aa2f Mon Sep 17 00:00:00 2001 From: Li Wang Date: Tue, 15 Jul 2025 20:16:33 +0800 Subject: [PATCH 099/552] [MISC] Add init files for python package (#20908) Signed-off-by: wangli Signed-off-by: x22x22 --- vllm/attention/utils/__init__.py | 0 vllm/ray/__init__.py | 0 2 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 vllm/attention/utils/__init__.py create mode 100644 vllm/ray/__init__.py diff --git a/vllm/attention/utils/__init__.py b/vllm/attention/utils/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/vllm/ray/__init__.py b/vllm/ray/__init__.py new file mode 100644 index 00000000000..e69de29bb2d From 97f7b1df18b3895d5ec27a5cdf60e62b67dca48e Mon Sep 17 00:00:00 2001 From: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Date: Tue, 15 Jul 2025 05:37:12 -0700 Subject: [PATCH 100/552] [doc] Add more details for Ray-based DP (#20948) Signed-off-by: Rui Qiao Signed-off-by: x22x22 --- docs/serving/data_parallel_deployment.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/docs/serving/data_parallel_deployment.md b/docs/serving/data_parallel_deployment.md index 484443fdc5a..9ff9f59c54e 100644 --- a/docs/serving/data_parallel_deployment.md +++ b/docs/serving/data_parallel_deployment.md @@ -57,12 +57,20 @@ vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 ``` -This DP mode can also be used with Ray, in which case only a single launch command is needed irrespective of the number of nodes: +This DP mode can also be used 
with Ray by specifying `--data-parallel-backend=ray`: ```bash -vllm serve $MODEL --data-parallel-size 16 --tensor-parallel-size 2 --data-parallel-backend=ray +vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \ + --data-parallel-backend=ray ``` +There are several notable differences when using Ray: + +- A single launch command (on any node) is needed to start all local and remote DP ranks, therefore it is more convenient compared to launching on each node +- There is no need to specify `--data-parallel-address`, and the node where the command is run is used as `--data-parallel-address` +- There is no need to specify `--data-parallel-rpc-port` +- Remote DP ranks will be allocated based on node resources of the Ray cluster + Currently, the internal DP load balancing is done within the API server process(es) and is based on the running and waiting queues in each of the engines. This could be made more sophisticated in future by incorporating KV cache aware logic. When deploying large DP sizes using this method, the API server process can become a bottleneck. In this case, the orthogonal `--api-server-count` command line option can be used to scale this out (for example `--api-server-count=4`). This is transparent to users - a single HTTP endpoint / port is still exposed. Note that this API server scale-out is "internal" and still confined to the "head" node. From 5d2bf2c0fa2242f06a98c460ea98a6d28c7a9220 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Tue, 15 Jul 2025 15:00:50 +0100 Subject: [PATCH 101/552] [Deprecation] Remove `TokenizerPoolConfig` (#20968) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/api/README.md | 1 - tests/async_engine/test_api_server.py | 8 ++----- vllm/config.py | 33 --------------------------- vllm/engine/arg_utils.py | 24 ++----------------- 4 files changed, 4 insertions(+), 62 deletions(-) diff --git a/docs/api/README.md b/docs/api/README.md index 2b5142e0bcd..245c925f7f5 100644 --- a/docs/api/README.md +++ b/docs/api/README.md @@ -8,7 +8,6 @@ API documentation for vLLM's configuration classes. 
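For the Ray-backed DP mode documented above, a Ray cluster must already be running before `vllm serve` is invoked. The commands below sketch a possible two-node setup; the `ray start` invocations are standard Ray CLI usage and the address is a placeholder, neither is taken from this patch.

```bash
# On the head node (which will also host the vLLM API server):
ray start --head --port=6379

# On every additional node, join the cluster first:
ray start --address='<head-node-ip>:6379'

# Then a single command from any node starts all local and remote DP ranks:
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
    --data-parallel-backend=ray
```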
- [vllm.config.ModelConfig][] - [vllm.config.CacheConfig][] -- [vllm.config.TokenizerPoolConfig][] - [vllm.config.LoadConfig][] - [vllm.config.ParallelConfig][] - [vllm.config.SchedulerConfig][] diff --git a/tests/async_engine/test_api_server.py b/tests/async_engine/test_api_server.py index 38ecaf2233d..76c94bdf80c 100644 --- a/tests/async_engine/test_api_server.py +++ b/tests/async_engine/test_api_server.py @@ -29,7 +29,7 @@ def _query_server_long(prompt: str) -> dict: @pytest.fixture -def api_server(tokenizer_pool_size: int, distributed_executor_backend: str): +def api_server(distributed_executor_backend: str): script_path = Path(__file__).parent.joinpath( "api_server_async_engine.py").absolute() commands = [ @@ -40,8 +40,6 @@ def api_server(tokenizer_pool_size: int, distributed_executor_backend: str): "facebook/opt-125m", "--host", "127.0.0.1", - "--tokenizer-pool-size", - str(tokenizer_pool_size), "--distributed-executor-backend", distributed_executor_backend, ] @@ -54,10 +52,8 @@ def api_server(tokenizer_pool_size: int, distributed_executor_backend: str): uvicorn_process.terminate() -@pytest.mark.parametrize("tokenizer_pool_size", [0, 2]) @pytest.mark.parametrize("distributed_executor_backend", ["mp", "ray"]) -def test_api_server(api_server, tokenizer_pool_size: int, - distributed_executor_backend: str): +def test_api_server(api_server, distributed_executor_backend: str): """ Run the API server and test it. diff --git a/vllm/config.py b/vllm/config.py index f16287c2be5..36671d7d4cc 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -1730,35 +1730,6 @@ def verify_with_parallel_config( logger.warning("Possibly too large swap space. %s", msg) -@config -@dataclass -class TokenizerPoolConfig: - """This config is deprecated and will be removed in a future release. - - Passing these parameters will have no effect. Please remove them from your - configurations. - """ - - pool_size: int = 0 - """This parameter is deprecated and will be removed in a future release. - Passing this parameter will have no effect. Please remove it from your - configurations.""" - pool_type: str = "ray" - """This parameter is deprecated and will be removed in a future release. - Passing this parameter will have no effect. Please remove it from your - configurations.""" - extra_config: dict = field(default_factory=dict) - """This parameter is deprecated and will be removed in a future release. - Passing this parameter will have no effect. Please remove it from your - configurations.""" - - def __post_init__(self) -> None: - logger.warning_once( - "TokenizerPoolConfig is deprecated and will be removed in a " - "future release. Passing this parameter will have no effect. " - "Please remove it from your configurations.") - - class LoadFormat(str, enum.Enum): AUTO = "auto" PT = "pt" @@ -1922,10 +1893,6 @@ class ParallelConfig: disable_custom_all_reduce: bool = False """Disable the custom all-reduce kernel and fall back to NCCL.""" - tokenizer_pool_config: Optional[TokenizerPoolConfig] = None - """This parameter is deprecated and will be removed in a future release. 
- Please remove it from your configs""" - ray_workers_use_nsight: bool = False """Whether to profile Ray workers with nsight, see https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html#profiling-nsight-profiler.""" diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 269477c4848..998a352497f 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -32,8 +32,8 @@ ObservabilityConfig, ParallelConfig, PoolerConfig, PrefixCachingHashAlgo, PromptAdapterConfig, SchedulerConfig, SchedulerPolicy, SpeculativeConfig, - TaskOption, TokenizerMode, TokenizerPoolConfig, - VllmConfig, get_attr_docs, get_field) + TaskOption, TokenizerMode, VllmConfig, get_attr_docs, + get_field) from vllm.logger import init_logger from vllm.platforms import CpuArchEnum, current_platform from vllm.plugins import load_general_plugins @@ -373,13 +373,6 @@ class EngineArgs: enforce_eager: bool = ModelConfig.enforce_eager max_seq_len_to_capture: int = ModelConfig.max_seq_len_to_capture disable_custom_all_reduce: bool = ParallelConfig.disable_custom_all_reduce - # The following three fields are deprecated and will be removed in a future - # release. Setting them will have no effect. Please remove them from your - # configurations. - tokenizer_pool_size: int = TokenizerPoolConfig.pool_size - tokenizer_pool_type: str = TokenizerPoolConfig.pool_type - tokenizer_pool_extra_config: dict = \ - get_field(TokenizerPoolConfig, "extra_config") limit_mm_per_prompt: dict[str, int] = \ get_field(MultiModalConfig, "limit_per_prompt") interleave_mm_strings: bool = MultiModalConfig.interleave_mm_strings @@ -751,19 +744,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: cache_group.add_argument("--calculate-kv-scales", **cache_kwargs["calculate_kv_scales"]) - # Tokenizer arguments - tokenizer_kwargs = get_kwargs(TokenizerPoolConfig) - tokenizer_group = parser.add_argument_group( - title="TokenizerPoolConfig", - description=TokenizerPoolConfig.__doc__, - ) - tokenizer_group.add_argument("--tokenizer-pool-size", - **tokenizer_kwargs["pool_size"]) - tokenizer_group.add_argument("--tokenizer-pool-type", - **tokenizer_kwargs["pool_type"]) - tokenizer_group.add_argument("--tokenizer-pool-extra-config", - **tokenizer_kwargs["extra_config"]) - # Multimodal related configs multimodal_kwargs = get_kwargs(MultiModalConfig) multimodal_group = parser.add_argument_group( From a3b10901bc50b2f31508cf6b2546a9aaa7d900b0 Mon Sep 17 00:00:00 2001 From: Christian Pinto Date: Tue, 15 Jul 2025 15:20:01 +0100 Subject: [PATCH 102/552] [v1][core] Support for attention free models (#20811) Signed-off-by: Christian Pinto Signed-off-by: x22x22 --- vllm/v1/core/kv_cache_manager.py | 7 ++++++- vllm/v1/core/kv_cache_utils.py | 21 ++++++++++++++++++++- vllm/v1/engine/core.py | 8 +++++++- 3 files changed, 33 insertions(+), 3 deletions(-) diff --git a/vllm/v1/core/kv_cache_manager.py b/vllm/v1/core/kv_cache_manager.py index cbc787e8dd5..e820a0ad6d5 100644 --- a/vllm/v1/core/kv_cache_manager.py +++ b/vllm/v1/core/kv_cache_manager.py @@ -78,7 +78,12 @@ def __init__( ) -> None: self.max_model_len = max_model_len + if len(kv_cache_config.kv_cache_groups) == 0: + # Attention free models don't have kv cache, + # thus don't need prefix caching. 
+ enable_caching = False self.enable_caching = enable_caching + self.caching_hash_fn = ( sha256_cbor_64bit if caching_hash_algo == "sha256_cbor_64bit" else sha256 if caching_hash_algo == "sha256" else hash) @@ -101,7 +106,7 @@ def __init__( kv_cache_config=kv_cache_config, max_model_len=self.max_model_len, use_eagle=self.use_eagle, - enable_caching=enable_caching, + enable_caching=self.enable_caching, caching_hash_fn=self.caching_hash_fn, enable_kv_cache_events=enable_kv_cache_events, ) diff --git a/vllm/v1/core/kv_cache_utils.py b/vllm/v1/core/kv_cache_utils.py index 544b9f59932..6067a127e97 100644 --- a/vllm/v1/core/kv_cache_utils.py +++ b/vllm/v1/core/kv_cache_utils.py @@ -563,6 +563,10 @@ def check_enough_kv_cache_memory(vllm_config: VllmConfig, ValueError: If there is not enough memory available for the KV cache. """ + # No need to check for available memory if the kv_cache_spec is empty + if not kv_cache_spec: + return + if available_memory <= 0: raise ValueError("No available memory for the cache blocks. " "Try increasing `gpu_memory_utilization` when " @@ -749,6 +753,13 @@ def is_kv_cache_page_size_uniform( return len(page_sizes) == 1 +def is_kv_cache_type_attention_free( + kv_cache_spec: dict[str, KVCacheSpec]) -> bool: + + # kv_cache_spec is an empty dict for attention free models + return not kv_cache_spec + + def _get_kv_cache_config_uniform_page_size( vllm_config: VllmConfig, kv_cache_spec: dict[str, KVCacheSpec], available_memory: int) -> KVCacheConfig: @@ -891,6 +902,10 @@ def _get_kv_cache_config_uniform_page_size( return kv_cache_config +def _get_kv_cache_config_attention_free() -> KVCacheConfig: + return KVCacheConfig(num_blocks=1, kv_cache_tensors=[], kv_cache_groups=[]) + + def unify_hybrid_kv_cache_specs(kv_cache_spec: dict[str, KVCacheSpec]): """ This function tries to convert the KV cache specs to one type if the model @@ -957,7 +972,11 @@ def get_kv_cache_config( if vllm_config.scheduler_config.disable_hybrid_kv_cache_manager: unify_hybrid_kv_cache_specs(kv_cache_spec) - if is_kv_cache_type_uniform(kv_cache_spec): + if is_kv_cache_type_attention_free(kv_cache_spec): + # This returns a kv_cache config with 0 kv_cache groups and 1 block + # to allow for the KVCache manager to handle attention free models. + return _get_kv_cache_config_attention_free() + elif is_kv_cache_type_uniform(kv_cache_spec): # KV cache of all layers are the same, which is true for # most models. Allocate the same amount of memory for # each layer. diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index e2fdf6f8a11..f5c59bef478 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -139,7 +139,13 @@ def _initialize_kv_caches( # Profiles the peak memory usage of the model to determine how much # memory can be allocated for kv cache. 
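        # For attention-free models every entry in kv_cache_specs is an empty
        # dict, so the change below skips GPU memory profiling and reports zero
        # available KV-cache memory for each spec; get_kv_cache_config then
        # falls back to the placeholder one-block, zero-group config added above.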
- available_gpu_memory = self.model_executor.determine_available_memory() + has_kv_cache = any(kv_cache_spec for kv_cache_spec in kv_cache_specs) + if has_kv_cache: + available_gpu_memory = \ + self.model_executor.determine_available_memory() + else: + # Attention free models don't need memory for kv cache + available_gpu_memory = [0] * len(kv_cache_specs) assert len(kv_cache_specs) == len(available_gpu_memory) # Get the kv cache tensor size From 3978f4fe17978e9ca7cea633bbad7f80e5e0bcb0 Mon Sep 17 00:00:00 2001 From: Patrick von Platen Date: Tue, 15 Jul 2025 16:35:30 +0200 Subject: [PATCH 103/552] Voxtral (#20970) Signed-off-by: Patrick von Platen Co-authored-by: Cyrus Leung Signed-off-by: x22x22 --- examples/offline_inference/audio_language.py | 85 ++- requirements/common.txt | 2 +- requirements/nightly_torch_test.txt | 2 +- requirements/test.in | 2 +- requirements/test.txt | 8 +- setup.py | 3 +- .../openai/test_transcription_validation.py | 28 +- tests/models/registry.py | 3 +- vllm/entrypoints/openai/speech_to_text.py | 1 + vllm/model_executor/models/interfaces.py | 3 +- vllm/model_executor/models/registry.py | 1 + vllm/model_executor/models/voxtral.py | 691 ++++++++++++++++++ vllm/model_executor/models/whisper.py | 81 +- vllm/transformers_utils/configs/mistral.py | 50 +- 14 files changed, 913 insertions(+), 47 deletions(-) create mode 100644 vllm/model_executor/models/voxtral.py diff --git a/examples/offline_inference/audio_language.py b/examples/offline_inference/audio_language.py index 8e5cac78a4b..8014cb53f16 100644 --- a/examples/offline_inference/audio_language.py +++ b/examples/offline_inference/audio_language.py @@ -10,7 +10,7 @@ import os from dataclasses import asdict -from typing import NamedTuple, Optional +from typing import Any, NamedTuple, Optional from huggingface_hub import snapshot_download from transformers import AutoTokenizer @@ -30,7 +30,9 @@ class ModelRequestData(NamedTuple): engine_args: EngineArgs - prompt: str + prompt: Optional[str] = None + prompt_token_ids: Optional[dict[str, list[int]]] = None + multi_modal_data: Optional[dict[str, Any]] = None stop_token_ids: Optional[list[int]] = None lora_requests: Optional[list[LoRARequest]] = None @@ -40,6 +42,60 @@ class ModelRequestData(NamedTuple): # Unless specified, these settings have been tested to work on a single L4. 
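+# (ModelRequestData above now optionally carries prompt_token_ids and
+# multi_modal_data, so examples such as run_voxtral below can hand the engine
+# a prompt that was already tokenized by mistral_common.)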
+# Voxtral +def run_voxtral(question: str, audio_count: int) -> ModelRequestData: + from mistral_common.audio import Audio + from mistral_common.protocol.instruct.messages import ( + AudioChunk, + RawAudio, + TextChunk, + UserMessage, + ) + from mistral_common.protocol.instruct.request import ChatCompletionRequest + from mistral_common.tokens.tokenizers.mistral import MistralTokenizer + + model_name = "mistralai/Voxtral-Mini-3B-2507" + tokenizer = MistralTokenizer.from_hf_hub(model_name) + + engine_args = EngineArgs( + model=model_name, + max_model_len=8192, + max_num_seqs=2, + limit_mm_per_prompt={"audio": audio_count}, + config_format="mistral", + load_format="mistral", + tokenizer_mode="mistral", + enforce_eager=True, + enable_chunked_prefill=False, + ) + + text_chunk = TextChunk(text=question) + audios = [ + Audio.from_file(str(audio_assets[i].get_local_path()), strict=False) + for i in range(audio_count) + ] + audio_chunks = [ + AudioChunk(input_audio=RawAudio.from_audio(audio)) for audio in audios + ] + + messages = [UserMessage(content=[*audio_chunks, text_chunk])] + + req = ChatCompletionRequest(messages=messages, model=model_name) + + tokens = tokenizer.encode_chat_completion(req) + prompt_ids, audios = tokens.tokens, tokens.audios + + audios_and_sr = [(au.audio_array, au.sampling_rate) for au in audios] + + multi_modal_data = {"audio": audios_and_sr} + + return ModelRequestData( + engine_args=engine_args, + prompt_token_ids=prompt_ids, + multi_modal_data=multi_modal_data, + ) + + # Granite Speech def run_granite_speech(question: str, audio_count: int) -> ModelRequestData: # NOTE - the setting in this example are somehat different than what is @@ -243,6 +299,7 @@ def run_whisper(question: str, audio_count: int) -> ModelRequestData: model_example_map = { + "voxtral": run_voxtral, "granite_speech": run_granite_speech, "minicpmo": run_minicpmo, "phi4_mm": run_phi4mm, @@ -311,16 +368,24 @@ def main(args): temperature=0.2, max_tokens=64, stop_token_ids=req_data.stop_token_ids ) - mm_data = {} - if audio_count > 0: - mm_data = { - "audio": [ - asset.audio_and_sample_rate for asset in audio_assets[:audio_count] - ] - } + mm_data = req_data.multi_modal_data + if not mm_data: + mm_data = {} + if audio_count > 0: + mm_data = { + "audio": [ + asset.audio_and_sample_rate for asset in audio_assets[:audio_count] + ] + } assert args.num_prompts > 0 - inputs = {"prompt": req_data.prompt, "multi_modal_data": mm_data} + inputs = {"multi_modal_data": mm_data} + + if req_data.prompt: + inputs["prompt"] = req_data.prompt + else: + inputs["prompt_token_ids"] = req_data.prompt_token_ids + if args.num_prompts > 1: # Batch inference inputs = [inputs] * args.num_prompts diff --git a/requirements/common.txt b/requirements/common.txt index c211cb5dc10..14e59f41a10 100644 --- a/requirements/common.txt +++ b/requirements/common.txt @@ -33,7 +33,7 @@ pyzmq >= 25.0.0 msgspec gguf >= 0.13.0 importlib_metadata; python_version < '3.10' -mistral_common[opencv] >= 1.6.2 +mistral_common[opencv] >= 1.8.0 opencv-python-headless >= 4.11.0 # required for video IO pyyaml six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that needs to be the latest version for python 3.12 diff --git a/requirements/nightly_torch_test.txt b/requirements/nightly_torch_test.txt index d8bd031f1d7..9c378dcf68f 100644 --- a/requirements/nightly_torch_test.txt +++ b/requirements/nightly_torch_test.txt @@ -23,7 +23,7 @@ jiwer # required for audio tests timm # required for internvl test transformers_stream_generator # required 
for qwen-vl test matplotlib # required for qwen-vl test -mistral_common[opencv] >= 1.6.2 # required for pixtral test +mistral_common[opencv] >= 1.8.0 # required for voxtral test num2words # required for smolvlm test opencv-python-headless >= 4.11.0 # required for video test datamodel_code_generator # required for minicpm3 test diff --git a/requirements/test.in b/requirements/test.in index 673120258b1..e8537d10fa7 100644 --- a/requirements/test.in +++ b/requirements/test.in @@ -28,7 +28,7 @@ torchvision==0.22.0 transformers_stream_generator # required for qwen-vl test mamba_ssm # required for plamo2 test matplotlib # required for qwen-vl test -mistral_common[opencv] >= 1.7.0 # required for pixtral test +mistral_common[opencv] >= 1.8.0 # required for voxtral test num2words # required for smolvlm test opencv-python-headless >= 4.11.0 # required for video test datamodel_code_generator # required for minicpm3 test diff --git a/requirements/test.txt b/requirements/test.txt index 3828efae381..84303b83117 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -305,7 +305,7 @@ mbstrdecoder==1.1.3 # typepy mdurl==0.1.2 # via markdown-it-py -mistral-common==1.7.0 +mistral-common==1.8.0 # via -r requirements/test.in more-itertools==10.5.0 # via lm-eval @@ -518,6 +518,8 @@ pyasn1-modules==0.4.2 # via google-auth pybind11==2.13.6 # via lm-eval +pycountry==24.6.1 + # via pydantic-extra-types pycparser==2.22 # via cffi pycryptodomex==3.22.0 @@ -528,9 +530,12 @@ pydantic==2.11.5 # datamodel-code-generator # mistral-common # mteb + # pydantic-extra-types # ray pydantic-core==2.33.2 # via pydantic +pydantic-extra-types==2.10.5 + # via mistral-common pygments==2.18.0 # via rich pyparsing==3.2.0 @@ -835,6 +840,7 @@ typing-extensions==4.12.2 # pqdm # pydantic # pydantic-core + # pydantic-extra-types # torch # typer # typing-inspection diff --git a/setup.py b/setup.py index 9200c6cef5a..795d5496455 100644 --- a/setup.py +++ b/setup.py @@ -692,7 +692,8 @@ def _read_requirements(filename: str) -> list[str]: "tensorizer": ["tensorizer==2.10.1"], "fastsafetensors": ["fastsafetensors >= 0.1.10"], "runai": ["runai-model-streamer", "runai-model-streamer-s3", "boto3"], - "audio": ["librosa", "soundfile"], # Required for audio processing + "audio": ["librosa", "soundfile", + "mistral_common[audio]"], # Required for audio processing "video": [] # Kept for backwards compatibility }, cmdclass=cmdclass, diff --git a/tests/entrypoints/openai/test_transcription_validation.py b/tests/entrypoints/openai/test_transcription_validation.py index b46409b0f89..461b8aab2e9 100644 --- a/tests/entrypoints/openai/test_transcription_validation.py +++ b/tests/entrypoints/openai/test_transcription_validation.py @@ -17,6 +17,11 @@ from ...utils import RemoteOpenAIServer +MISTRAL_FORMAT_ARGS = [ + "--tokenizer_mode", "mistral", "--config_format", "mistral", + "--load_format", "mistral" +] + @pytest.fixture def mary_had_lamb(): @@ -33,9 +38,18 @@ def winning_call(): @pytest.mark.asyncio -async def test_basic_audio(mary_had_lamb): - model_name = "openai/whisper-large-v3-turbo" +@pytest.mark.parametrize( + "model_name", + ["openai/whisper-large-v3-turbo", "mistralai/Voxtral-Mini-3B-2507"]) +async def test_basic_audio(mary_had_lamb, model_name): server_args = ["--enforce-eager"] + + if model_name.startswith("mistralai"): + server_args += MISTRAL_FORMAT_ARGS + + # TODO(PATRICK) - REMOVE AFTER RELEASE + return # skip for now + # Based on https://github.com/openai/openai-cookbook/blob/main/examples/Whisper_prompting_guide.ipynb. 
with RemoteOpenAIServer(model_name, server_args) as remote_server: client = remote_server.get_async_client() @@ -65,10 +79,13 @@ async def test_bad_requests(mary_had_lamb): @pytest.mark.asyncio -async def test_long_audio_request(mary_had_lamb): - model_name = "openai/whisper-large-v3-turbo" +@pytest.mark.parametrize("model_name", ["openai/whisper-large-v3-turbo"]) +async def test_long_audio_request(mary_had_lamb, model_name): server_args = ["--enforce-eager"] + if model_name.startswith("openai"): + return + mary_had_lamb.seek(0) audio, sr = librosa.load(mary_had_lamb) # Add small silence after each audio for repeatability in the split process @@ -87,7 +104,8 @@ async def test_long_audio_request(mary_had_lamb): response_format="text", temperature=0.0) out = json.loads(transcription)['text'] - assert out.count("Mary had a little lamb") == 10 + counts = out.count("Mary had a little lamb") + assert counts == 10, counts @pytest.mark.asyncio diff --git a/tests/models/registry.py b/tests/models/registry.py index 9d3fc8a1b1c..0bac0f8db15 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -440,6 +440,7 @@ def check_available_online( tokenizer="Isotr0py/Florence-2-tokenizer", # noqa: E501 trust_remote_code=True), # noqa: E501 "MllamaForConditionalGeneration": _HfExamplesInfo("meta-llama/Llama-3.2-11B-Vision-Instruct"), # noqa: E501 + "VoxtralForConditionalGeneration": _HfExamplesInfo("mistralai/Voxtral-Mini-3B-2507", is_available_online=False, tokenizer_mode="mistral"), # noqa: E501 "WhisperForConditionalGeneration": _HfExamplesInfo("openai/whisper-large-v3"), # noqa: E501 # [Cross-encoder] @@ -513,4 +514,4 @@ def find_hf_info(self, model_id: str) -> _HfExamplesInfo: raise ValueError(f"No example model defined for {model_id}") -HF_EXAMPLE_MODELS = HfExampleModels(_EXAMPLE_MODELS) \ No newline at end of file +HF_EXAMPLE_MODELS = HfExampleModels(_EXAMPLE_MODELS) diff --git a/vllm/entrypoints/openai/speech_to_text.py b/vllm/entrypoints/openai/speech_to_text.py index c70355b2ae4..e7589a3804c 100644 --- a/vllm/entrypoints/openai/speech_to_text.py +++ b/vllm/entrypoints/openai/speech_to_text.py @@ -112,6 +112,7 @@ async def _preprocess_speech_to_text( prompt = self.model_cls.get_generation_prompt( audio=chunk, stt_config=self.asr_config, + model_config=self.model_config, language=lang, task_type=self.task_type, request_prompt=request.prompt) diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 95970474d55..92ecb8972d5 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -722,7 +722,8 @@ class SupportsTranscription(Protocol): @classmethod def get_generation_prompt(cls, audio: np.ndarray, - stt_config: SpeechToTextConfig, language: str, + stt_config: SpeechToTextConfig, + model_config: ModelConfig, language: str, task_type: str, request_prompt: str) -> PromptType: """Get the prompt for the ASR model. 
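The `SupportsTranscription` change above threads `model_config` into `get_generation_prompt`. A minimal sketch of an implementer of the updated hook is shown below; it is purely illustrative (`MyASRModel` and the prompt format are made up), only the signature follows the interface diff above.

```python
import numpy as np

from vllm.config import ModelConfig, SpeechToTextConfig
from vllm.inputs.data import PromptType


class MyASRModel:  # hypothetical model, for signature illustration only

    @classmethod
    def get_generation_prompt(cls, audio: np.ndarray,
                              stt_config: SpeechToTextConfig,
                              model_config: ModelConfig, language: str,
                              task_type: str,
                              request_prompt: str) -> PromptType:
        # model_config is now available, e.g. to build a model-specific
        # tokenizer and return prompt token ids instead of a text prompt.
        return {
            "prompt": f"<|{language}|><|{task_type}|>{request_prompt}",
            "multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
        }
```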
diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 79190860ac9..b7f9638d322 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -231,6 +231,7 @@ "Phi4MMForCausalLM": ("phi4mm", "Phi4MMForCausalLM"), "TarsierForConditionalGeneration": ("tarsier", "TarsierForConditionalGeneration"), # noqa: E501 "Tarsier2ForConditionalGeneration": ("qwen2_vl", "Tarsier2ForConditionalGeneration"), # noqa: E501 + "VoxtralForConditionalGeneration": ("voxtral", "VoxtralForConditionalGeneration"), # noqa: E501 # [Encoder-decoder] "Florence2ForConditionalGeneration": ("florence2", "Florence2ForConditionalGeneration"), # noqa: E501 "MllamaForConditionalGeneration": ("mllama", "MllamaForConditionalGeneration"), # noqa: E501 diff --git a/vllm/model_executor/models/voxtral.py b/vllm/model_executor/models/voxtral.py new file mode 100644 index 00000000000..97cab628317 --- /dev/null +++ b/vllm/model_executor/models/voxtral.py @@ -0,0 +1,691 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import math +from collections.abc import Iterable, Mapping, Sequence +from functools import cached_property +from math import ceil +from typing import Optional, Union, cast + +import numpy as np +import regex as re +import torch +import torch.nn as nn +from mistral_common.audio import mel_filter_bank +from mistral_common.protocol.instruct.messages import (AudioChunk, RawAudio, + TextChunk, UserMessage) +from mistral_common.protocol.instruct.request import ChatCompletionRequest +from mistral_common.protocol.transcription.request import TranscriptionRequest +from mistral_common.tokens.tokenizers.audio import Audio, AudioEncoder +from transformers import TensorType, WhisperConfig +from transformers.tokenization_utils_base import TextInput + +from vllm.config import ModelConfig, SpeechToTextConfig, VllmConfig +from vllm.inputs.data import PromptType +from vllm.logger import init_logger +from vllm.model_executor.model_loader.weight_utils import default_weight_loader +from vllm.model_executor.models import SupportsPP +# yapf: disable +from vllm.model_executor.models.whisper import ( + WhisperEncoder, WhisperForConditionalGeneration) +# yapf: enable +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.multimodal import MULTIMODAL_REGISTRY +from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalFieldConfig, + MultiModalKwargs, NestedTensors) +from vllm.multimodal.parse import (AudioProcessorItems, MultiModalDataItems, + MultiModalDataParser) +from vllm.multimodal.processing import (BaseMultiModalProcessor, + BaseProcessingInfo, MultiModalHashes, + PromptReplacement, PromptUpdate) +from vllm.multimodal.profiling import BaseDummyInputsBuilder, ProcessorInputs +from vllm.sequence import IntermediateTensors +from vllm.transformers_utils.tokenizer import (MistralTokenizer, + cached_tokenizer_from_config) + +from .interfaces import (MultiModalEmbeddings, SupportsMultiModal, + SupportsTranscription) +from .utils import (flatten_bn, init_vllm_registered_model, maybe_prefix, + merge_multimodal_embeddings) + +logger = init_logger(__name__) + + +class VoxtralProcessorAdapter: + """ + Provide a HF-compatible interface for + :class:`mistral_common.tokens.tokenizers.multimodal.AudioEncoder`. 
+ """ + + def __init__(self, tokenizer: MistralTokenizer) -> None: + super().__init__() + self.tokenizer = tokenizer + + @cached_property + def _audio_processor(self) -> AudioEncoder: + audio_encoder = self.tokenizer.instruct.audio_encoder + assert isinstance(audio_encoder, AudioEncoder) + return audio_encoder + + @cached_property + def audio_token_id(self) -> int: + return self._audio_processor.special_ids.audio + + @cached_property + def begin_audio_token_id(self) -> int: + return self._audio_processor.special_ids.begin_audio + + # @cached_property + # def begin_transcript_token_id(self) -> int: + # return self._audio_processor.special_ids.begin_transcript + + # @cached_property + # def end_transcript_token_id(self) -> int: + # return self._audio_processor.special_ids.end_transcript + + @cached_property + def sampling_rate(self) -> int: + return self._audio_processor.audio_config.sampling_rate + + @cached_property + def frame_rate(self) -> float: + return self._audio_processor.audio_config.frame_rate + + def get_num_audio_tokens( + self, + audio_length: int, + ) -> int: + pad_audio_length = self._audio_processor.next_multiple_of_chunk_frames( + audio_length, self.sampling_rate) + return ceil(pad_audio_length / (self.sampling_rate // self.frame_rate)) + + def __call__( + self, + text: Optional[Union[TextInput, list[TextInput]]] = None, + audios: Optional[Union[np.ndarray, list[np.ndarray]]] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + **kwargs, + ) -> Mapping[str, NestedTensors]: + if text is None: + text = [] + if not isinstance(text, list): + text = [text] + if audios is None: + audios = [] + if not isinstance(audios, list): + audios = [audios] + + if not audios: + input_ids = self.tokenizer(text).input_ids + return {"input_ids": torch.tensor(input_ids)} + + # Allow dummy text, which is used for profiling as well as token inputs + if any(len(t) > 0 for t in text): + raise ValueError( + "You've passed text inputs instead of token inputs. " + "Make sure to process your input via `mistral_common`'s " + "tokenizer or pass a chat completion request. 
" + "For more info, see: " + "https://github.com/vllm-project/vllm/issues/8411.") + + audios_tokens = list[torch.Tensor]() + audios_processed = list[torch.Tensor]() + for audio in audios: + assert isinstance(audio, np.ndarray) + assert audio.ndim == 1 + + # pad if necessary + audio = self._audio_processor.pad(audio, self.sampling_rate) + + audio_tokens = [ + self.begin_audio_token_id + ] + [self.audio_token_id] * self.get_num_audio_tokens(len(audio)) + + audios_tokens.append(torch.tensor(audio_tokens)) + audios_processed.append(torch.tensor(audio)) + + return { + "input_ids": torch.cat(audios_tokens)[None].expand(len(text), -1), + "audio_arrays": audios_processed, + } + + +class VoxtralProcessingInfo(BaseProcessingInfo): + + def get_tokenizer(self) -> MistralTokenizer: + tokenizer = cached_tokenizer_from_config(self.ctx.model_config) + if not isinstance(tokenizer, MistralTokenizer): + raise ValueError("This model requires `--tokenizer-mode mistral`") + + return tokenizer + + def get_hf_processor(self) -> VoxtralProcessorAdapter: + return VoxtralProcessorAdapter(self.get_tokenizer()) + + def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]: + return {"audio": 5} # Performance tends to degrade after 5 + + def get_mm_max_tokens_per_item( + self, + seq_len: int, + mm_counts: Mapping[str, int], + ) -> Mapping[str, int]: + return {"audio": self.get_max_audio_tokens()} + + def get_max_audio_tokens(self) -> int: + return self.ctx.model_config.max_model_len + + def get_max_audio_array_len(self) -> int: + processor = self.get_hf_processor() + return self.get_max_audio_tokens() * int( + processor.sampling_rate // processor.frame_rate) + + +class VoxtralDummyInputsBuilder(BaseDummyInputsBuilder[VoxtralProcessingInfo]): + + def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str: + return "" + + def get_dummy_mm_data( + self, + seq_len: int, + mm_counts: Mapping[str, int], + ) -> MultiModalDataDict: + num_audios = mm_counts.get("audio", 0) + + target_length = self.info.get_max_audio_array_len() + + return { + "audio": + self._get_dummy_audios(length=target_length, num_audios=num_audios) + } + + def get_dummy_processor_inputs( + self, + seq_len: int, + mm_counts: Mapping[str, int], + ) -> ProcessorInputs: + tokenizer = self.info.get_tokenizer() + + dummy_text = self.get_dummy_text(mm_counts) + dummy_mm_data = self.get_dummy_mm_data(seq_len, mm_counts) + dummy_audios = dummy_mm_data.get("audio", []) + + audio_chunks: list[AudioChunk] = [] + format = "wav" + for audio in dummy_audios: + audio_item = Audio( + audio_array=audio, + sampling_rate=self.info.get_hf_processor().sampling_rate, + format=format, + ) + chunk = AudioChunk(input_audio=RawAudio.from_audio(audio_item)) + audio_chunks.append(chunk) + + request = ChatCompletionRequest(messages=[ + UserMessage(content=[TextChunk(text=dummy_text), *audio_chunks]), + ]) + res = tokenizer.mistral.encode_chat_completion(request) + dummy_tokens = res.tokens + # whixtral tokenizer adds padding to the audio + # so we need to update the audio arrays + dummy_mm_data["audio"] = [a.audio_array for a in res.audios] + + return ProcessorInputs(prompt=dummy_tokens, mm_data=dummy_mm_data) + + +class VoxtralMultiModalProcessor(BaseMultiModalProcessor[VoxtralProcessingInfo] + ): + + def _get_mm_fields_config( + self, + hf_inputs: Mapping[str, NestedTensors], + hf_processor_mm_kwargs: Mapping[str, object], + ) -> Mapping[str, MultiModalFieldConfig]: + return dict(audio_arrays=MultiModalFieldConfig.batched("audio")) + + def _get_prompt_updates( + 
self, + mm_items: MultiModalDataItems, + hf_processor_mm_kwargs: Mapping[str, object], + out_mm_kwargs: MultiModalKwargs, + ) -> Sequence[PromptUpdate]: + processor = self.info.get_hf_processor(**hf_processor_mm_kwargs) + + audio_id = processor.audio_token_id + + def get_replacement(item_idx: int): + audios = mm_items.get_items("audio", AudioProcessorItems) + audio_len = audios.get_audio_length(item_idx) + + nb_audio_tokens = processor.get_num_audio_tokens(audio_len) + + return [audio_id] * nb_audio_tokens + + return [ + PromptReplacement( + modality="audio", + target="", # Never match the prompt (see below note) + replacement=get_replacement, + ), + ] + + def _cached_apply_hf_processor( + self, + prompt: Union[str, list[int]], + mm_data_items: MultiModalDataItems, + hf_processor_mm_kwargs: Mapping[str, object], + tokenization_kwargs: Mapping[str, object], + *, + return_mm_hashes: bool, + ) -> tuple[list[int], MultiModalKwargs, Optional[MultiModalHashes], bool]: + prompt_ids, mm_kwargs, mm_hashes, _ = super( + )._cached_apply_hf_processor( + prompt=prompt, + mm_data_items=mm_data_items, + hf_processor_mm_kwargs=hf_processor_mm_kwargs, + tokenization_kwargs=tokenization_kwargs, + return_mm_hashes=return_mm_hashes, + ) + + # NOTE: The tokens are already inserted by the chat template + return prompt_ids, mm_kwargs, mm_hashes, True + + def _get_data_parser(self) -> MultiModalDataParser: + sampling_rate = self.info.get_hf_processor().sampling_rate + return MultiModalDataParser(target_sr=sampling_rate) + + +@MULTIMODAL_REGISTRY.register_processor(VoxtralMultiModalProcessor, + info=VoxtralProcessingInfo, + dummy_inputs=VoxtralDummyInputsBuilder) +class VoxtralForConditionalGeneration(nn.Module, SupportsMultiModal, + SupportsPP, SupportsTranscription): + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + self.tokenizer = cached_tokenizer_from_config(vllm_config.model_config) + + config = vllm_config.model_config.hf_config + self.config = config + self.downsample_factor = self.config.audio_config.downsample_factor + + self.language_model = init_vllm_registered_model( + vllm_config=vllm_config, + hf_config=config.text_config, + prefix=maybe_prefix(prefix, "language_model"), + ) + self.whisper_encoder = VoxtralEncoderModel( + vllm_config.with_hf_config(config.audio_config), + prefix=maybe_prefix(prefix, "whisper_encoder"), + ) + self.audio_language_adapter = AudioLanguageAdapter( + hidden_size=config.audio_config.d_model * self.downsample_factor, + dim=config.text_config.hidden_size, + ) + + def get_language_model(self) -> torch.nn.Module: + return self.language_model + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + **kwargs: object, + ) -> Union[torch.Tensor, IntermediateTensors]: + if intermediate_tensors is not None: + inputs_embeds = None + + # NOTE: In v1, inputs_embeds is always generated at model runner, this + # condition is for v0 compatibility. 
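+        # (v0 only) get_multimodal_embeddings() runs the Whisper encoder and
+        # the audio-language adapter on the raw audio arrays; the resulting
+        # embeddings replace the audio placeholder tokens inside
+        # get_input_embeddings(), and input_ids is then dropped in favor of
+        # inputs_embeds.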
+ elif inputs_embeds is None: + audio_embeddings = self.get_multimodal_embeddings(**kwargs) + inputs_embeds = self.get_input_embeddings(input_ids, + audio_embeddings) + input_ids = None + + hidden_states = self.language_model.model(input_ids, + positions, + intermediate_tensors, + inputs_embeds=inputs_embeds) + + return hidden_states + + def get_multimodal_embeddings( + self, **kwargs + ) -> Union[list[torch.Tensor], torch.Tensor, tuple[torch.Tensor, ...], + None]: + audio_inputs = self._parse_and_validate_audio_arrays(**kwargs) + if audio_inputs is None: + return None + + audio_embeddings = self.whisper_encoder(audio_inputs) + + for i, audio_embedding in enumerate(audio_embeddings): + seq_len, dim = audio_embedding.shape + # Pad such that seq_len is divisible by downsample_factor + target_seq_len = self.downsample_factor * math.ceil( + seq_len / self.downsample_factor) + audio_embedding = torch.nn.functional.pad( + audio_embedding, + (0, 0, 0, target_seq_len - seq_len), + ) + audio_embeddings[i] = audio_embedding.reshape( + target_seq_len // self.downsample_factor, + dim * self.downsample_factor) + + # Concat, project and resplit + audio_embeddings_packed = torch.cat(audio_embeddings, dim=0) + audio_embeddings_packed = self.audio_language_adapter( + audio_embeddings_packed) + audio_embeddings = torch.split(audio_embeddings_packed, + [a.shape[0] for a in audio_embeddings], + dim=0) + + return audio_embeddings + + def get_input_embeddings( + self, + input_ids: torch.Tensor, + multimodal_embeddings: Optional[MultiModalEmbeddings] = None, + ) -> torch.Tensor: + audio_encoder = self.tokenizer.instruct.audio_encoder + audio_tok_id = audio_encoder.audio_token + + inputs_embeds = self.language_model.get_input_embeddings(input_ids) + if multimodal_embeddings is not None: + inputs_embeds = merge_multimodal_embeddings( + input_ids, inputs_embeds, multimodal_embeddings, audio_tok_id) + return inputs_embeds + + def _parse_and_validate_audio_arrays( + self, **kwargs: object) -> Union[list[torch.Tensor], None]: + audio_arrays = kwargs.pop("audio_arrays", None) + if audio_arrays is None: + return None + + if not isinstance(audio_arrays, (torch.Tensor, list)): + raise ValueError("Incorrect type of audio_arrays. 
" + f"Got type: {type(audio_arrays)}") + + audio_arrays = flatten_bn(audio_arrays) + if isinstance(audio_arrays, torch.Tensor): + audio_arrays = list(audio_arrays.unbind(0)) + return audio_arrays + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + return self.language_model.compute_logits(hidden_states, + sampling_metadata) + + @classmethod + def get_speech_to_text_config(cls, model_config: ModelConfig, + task_type: str) -> SpeechToTextConfig: + tokenizer = cached_tokenizer_from_config(model_config) + audio_config = tokenizer.instruct.audio_encoder.audio_config + max_audio_clip_s = audio_config.chunk_length_s + sample_rate = audio_config.sampling_rate + return SpeechToTextConfig( + max_audio_clip_s=max_audio_clip_s, + sample_rate=sample_rate, + # mistral_common and whisper encoder take care of chunking + min_energy_split_window_size=None, + ) + + @classmethod + # for speech-to-text transcription + def get_generation_prompt(cls, audio: np.ndarray, + model_config: ModelConfig, + stt_config: SpeechToTextConfig, language: str, + task_type: str, + request_prompt: str) -> PromptType: + tokenizer = cached_tokenizer_from_config(model_config) + audio = Audio(audio, int(stt_config.sample_rate), + format="wav") # lossless + req = TranscriptionRequest(model=model_config.model, + audio=RawAudio.from_audio(audio), + language=language) + + tokenized = tokenizer.instruct.encode_transcription(req) + audio = (tokenized.audios[0].audio_array, stt_config.sample_rate) + prompts_dict = {"multi_modal_data": {"audio": audio}} + prompts_dict["prompt_token_ids"] = tokenized.tokens + return cast(PromptType, prompts_dict) + + @classmethod + def validate_language(cls, language: str) -> bool: + # same as whisper + return WhisperForConditionalGeneration.validate_language(language) + + @classmethod + def get_num_audio_tokens(cls, audio_duration_s: float, + stt_config: SpeechToTextConfig, + model_config: ModelConfig) -> Optional[int]: + """ + Map from audio duration to number of audio tokens produced by the ASR + model, without running a forward pass. + This is used for estimating the amount of processing for this audio. 
+ """ + tokenizer = cached_tokenizer_from_config(model_config) + adapter = VoxtralProcessorAdapter(tokenizer) + return adapter.get_num_audio_tokens( + int(audio_duration_s * stt_config.sample_rate)) + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + # fmt: off + remapping_rules = [ + (r"mm_whisper_embeddings\.(.*)", r"\1"), + (r"audio_language_projection\.(.*)", r"audio_language_adapter.\1"), + (r"audio_language_adapter\.0\.weight", r"audio_language_adapter.w_in.weight"), # noqa: E501 + (r"audio_language_adapter\.2\.weight", r"audio_language_adapter.w_out.weight"), # noqa: E501 + ] + # fmt: on + + audio_params = dict( + nn.ModuleDict({ + "audio_language_adapter": + self.audio_language_adapter, + }).named_parameters()) + + loaded_weights = set() + + def llm_weights_generator(): + nonlocal loaded_weights + for name, w in weights: + is_encoder = ( + name.startswith("mm_whisper_embeddings") and + not name.startswith("mm_whisper_embeddings.tok_embeddings") + and not name.startswith( + "mm_whisper_embeddings.audio_language_projection")) + + for pattern, repl in remapping_rules: + if re.fullmatch(pattern, name): + name = re.sub(pattern, repl, name) + + if is_encoder: + name = self.whisper_encoder.load_weight((name, w)) + loaded_weights.add(f"whisper_encoder.{name}") + continue + + if name in audio_params: + param = audio_params[name] + with torch.no_grad(): + default_weight_loader(param, w) + loaded_weights.add(name) + else: + yield (name, w) + + for name in self.language_model.load_weights(llm_weights_generator()): + loaded_weights.add(f"language_model.{name}") + + # potentially manually add position embeddings + sin_key = "whisper_encoder.whisper_encoder.embed_positions.weight" + if sin_key not in loaded_weights: + # make sure we don't hit an error here + loaded_weights.add(sin_key) + + return loaded_weights + + +class AudioLanguageAdapter(nn.Module): + + def __init__(self, hidden_size: int, dim: int) -> None: + super().__init__() + self.w_in = nn.Linear(hidden_size, dim, bias=False) + self.gelu = nn.GELU() + self.w_out = nn.Linear(dim, dim, bias=False) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return self.w_out(self.gelu(self.w_in(x))) + + +class VoxtralEncoderModel(nn.Module): + packed_modules_mapping = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]} + + # fmt: off + mistral_remapping = [ + (r"whisper_encoder\.conv_layers\.0\.(weight|bias)", r"whisper_encoder.conv1.\1"), # noqa: E501 + (r"whisper_encoder\.conv_layers\.1\.(weight|bias)", r"whisper_encoder.conv2.\1"), # noqa: E501 + (r"whisper_encoder\.transformer\.layers\.(\d+)\.attention\.w([qkv])\.(weight|bias)", r"whisper_encoder.layers.\1.self_attn.\2_proj.\3"), # noqa: E501 + (r"whisper_encoder\.transformer\.layers\.(\d+)\.attention\.wo\.(weight|bias)", r"whisper_encoder.layers.\1.self_attn.out_proj.\2"), # noqa: E501 + (r"whisper_encoder\.transformer\.layers\.(\d+)\.attention_norm\.(weight|bias)", r"whisper_encoder.layers.\1.self_attn_layer_norm.\2"), # noqa: E501 + (r"whisper_encoder\.transformer\.layers\.(\d+)\.feed_forward\.w1\.(weight|bias)", r"whisper_encoder.layers.\1.mlp.fc1.\2"), # noqa: E501 + (r"whisper_encoder\.transformer\.layers\.(\d+)\.feed_forward\.w2\.(weight|bias)", r"whisper_encoder.layers.\1.mlp.fc2.\2"), # noqa: E501 + (r"whisper_encoder\.transformer\.layers\.(\d+)\.ffn_norm\.(weight|bias)", r"whisper_encoder.layers.\1.final_layer_norm.\2"), # noqa: E501 + (r"whisper_encoder\.transformer\.norm\.(weight|bias)", r"whisper_encoder.layer_norm.\1"), # noqa: E501 + ] + # 
fmt: on + + def __init__( + self, + vllm_config: VllmConfig, + *, + prefix: str = "", + ) -> None: + super().__init__() + self.config = cast(WhisperConfig, vllm_config.model_config.hf_config) + self.dtype: torch.dtype = vllm_config.model_config.dtype + self.whisper_encoder = WhisperEncoder(vllm_config=vllm_config, + prefix=maybe_prefix( + prefix, "whisper_encoder"), + is_standalone_encoder=True, + init_in_fp32=True) + mel_filters = mel_filter_bank( + num_frequency_bins=1 + self.config.window_size // 2, + num_mel_bins=self.config.num_mel_bins, + min_frequency=0.0, + max_frequency=8000.0, + sampling_rate=self.config.sampling_rate, + ) + self.mel_filters = torch.tensor(mel_filters, dtype=torch.float32) + + def compute_whisper_melspec( + self, + audio_waveforms: torch.Tensor, + ) -> torch.Tensor: + input_dtype = audio_waveforms.dtype + window = torch.hann_window(self.config.window_size).to( + audio_waveforms.device) + stft = torch.stft( + audio_waveforms, + self.config.window_size, + self.config.hop_length, + window=window, + return_complex=True, + ) + magnitudes = stft[..., :-1].abs()**2 + mel_spec = self.mel_filters.T @ magnitudes + log_spec = torch.clamp(mel_spec, min=1e-10).log10() + log_spec = torch.maximum(log_spec, log_spec.max() - 8.0) + log_spec = (log_spec + 4.0) / 4.0 + return log_spec.to(input_dtype) + + @property + def downsample_factor(self) -> int: + return self.whisper_encoder.conv1.stride[ + 0] * self.whisper_encoder.conv2.stride[0] + + @property + def chunk_size(self) -> int: + return self.config.max_source_positions * self.downsample_factor + + def prepare_inputs_for_conv( + self, + audio_waveforms: list[torch.Tensor], + ) -> tuple[torch.Tensor, list[int]]: + assert isinstance(audio_waveforms, list) + # list[num_mel_bins, seq_len] + input_features = [ + self.compute_whisper_melspec(audio).to(self.dtype) + for audio in audio_waveforms + ] + + chunked_features: list[torch.Tensor] = [] + chunks_per_example: list[int] = [] + for feature in input_features: + chunks = feature.split(self.chunk_size, dim=-1) + chunked_features += chunks + chunks_per_example.append(len(chunks)) + + # [total_num_chunks, num_mel_bins, chunk_size] + return torch.stack(chunked_features), chunks_per_example + + def forward( + self, input_features: Union[torch.Tensor, list[torch.Tensor]] + ) -> list[torch.Tensor]: + if not isinstance(input_features, list): + input_features = [input_features] + + # Split long inputs into chunks + input_embeds, chunks_per_example = ( + self.prepare_inputs_for_conv(input_features)) + + # [total_num_chunks, ceil(chunk_size / downsample_factor), hidden_size] + out = self.whisper_encoder([input_embeds]) + + # Re-concatenate the chunks + chunk_idx = 0 + results = [] + for n_chunks in chunks_per_example: + result = out[chunk_idx:chunk_idx + n_chunks].flatten(0, 1) + results.append(result) + chunk_idx += n_chunks + + return results + + def load_weight(self, weight: tuple[str, torch.Tensor]) -> str: + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + ("qkv_proj", "q_proj", "q"), + ("qkv_proj", "k_proj", "k"), + ("qkv_proj", "v_proj", "v"), + ] + params_dict = dict(self.named_parameters()) + + name, loaded_weight = weight + for pattern, repl in self.mistral_remapping: + if re.fullmatch(pattern, name): + name = re.sub(pattern, repl, name) + + for (param_name, weight_name, shard_id) in stacked_params_mapping: + if weight_name not in name: + continue + name = name.replace(weight_name, param_name) + + param = params_dict[name] + weight_loader = param.weight_loader + 
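+            # shard_id selects the q/k/v slice of the fused qkv_proj parameter
+            # when loading the per-projection Mistral-format weights.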
weight_loader(param, loaded_weight, shard_id) + break + else: + param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + weight_loader(param, loaded_weight) + + return name diff --git a/vllm/model_executor/models/whisper.py b/vllm/model_executor/models/whisper.py index 08aed2205e0..d98dab5fac0 100644 --- a/vllm/model_executor/models/whisper.py +++ b/vllm/model_executor/models/whisper.py @@ -3,6 +3,7 @@ import math from collections.abc import Iterable, Mapping, Sequence +from contextlib import nullcontext from typing import Optional, TypedDict, Union, cast import numpy as np @@ -13,6 +14,7 @@ from transformers.models.whisper.modeling_whisper import sinusoids from vllm.attention import Attention, AttentionType +from vllm.attention.layer import MultiHeadAttention from vllm.config import (CacheConfig, ModelConfig, SpeechToTextConfig, VllmConfig) from vllm.distributed import get_tensor_model_parallel_world_size @@ -26,6 +28,7 @@ from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead +from vllm.model_executor.model_loader.utils import set_default_torch_dtype from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.multimodal import MULTIMODAL_REGISTRY, NestedTensors @@ -178,6 +181,7 @@ def __init__( cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", + standalone_encoder: bool = False, ): super().__init__() self.embed_dim = embed_dim @@ -213,16 +217,24 @@ def __init__( quant_config=quant_config, prefix=f"{prefix}.out_proj", ) - self.attn = Attention( - self.num_heads, - self.head_dim, - self.scaling, - num_kv_heads=self.num_kv_heads, - cache_config=cache_config, - quant_config=quant_config, - prefix=f"{prefix}.attn", - attn_type=self.attn_type, - ) + if standalone_encoder: + self.attn = MultiHeadAttention( + self.num_heads, + self.head_dim, + self.scaling, + num_kv_heads=self.num_kv_heads, + ) + else: + self.attn = Attention( + self.num_heads, + self.head_dim, + self.scaling, + num_kv_heads=self.num_kv_heads, + cache_config=cache_config, + quant_config=quant_config, + prefix=f"{prefix}.attn", + attn_type=self.attn_type, + ) def _init_qkv( self, @@ -357,7 +369,11 @@ def forward(self, hidden_states: torch.Tensor): class WhisperEncoderLayer(nn.Module): - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + def __init__(self, + *, + vllm_config: VllmConfig, + prefix: str = "", + is_standalone_encoder: bool = False): super().__init__() config = vllm_config.model_config.hf_config cache_config = vllm_config.cache_config @@ -371,6 +387,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): cache_config=cache_config, quant_config=quant_config, prefix=f"{prefix}.self_attn", + standalone_encoder=is_standalone_encoder, ) self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) self.mlp = WhisperMLP( @@ -462,10 +479,16 @@ def forward( class WhisperEncoder(nn.Module): - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + def __init__(self, + *, + vllm_config: VllmConfig, + prefix: str = "", + is_standalone_encoder: bool = False, + init_in_fp32: bool = False): super().__init__() config = vllm_config.model_config.hf_config embed_dim = config.d_model + self.is_standalone_encoder = is_standalone_encoder self.num_mel_bins = config.num_mel_bins 
self.max_source_positions = config.max_source_positions self.embed_scale = (math.sqrt(embed_dim) @@ -480,17 +503,25 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): kernel_size=3, stride=2, padding=1) - self.embed_positions = nn.Embedding(self.max_source_positions, - embed_dim) self.start_layer, self.end_layer, self.layers = make_layers( config.encoder_layers, lambda prefix: WhisperEncoderLayer(vllm_config=vllm_config, - prefix=f"{prefix}.layers"), + prefix=f"{prefix}.layers", + is_standalone_encoder= + is_standalone_encoder), prefix=f"{prefix}.layers", ) self.layer_norm = nn.LayerNorm(config.d_model) - with torch.no_grad(): + maybe_fp32_init_ctx = set_default_torch_dtype( + torch.float32) if init_in_fp32 else nullcontext() + + with ( + torch.no_grad(), + maybe_fp32_init_ctx, + ): + self.embed_positions = nn.Embedding(self.max_source_positions, + embed_dim) self.embed_positions.weight.copy_( sinusoids(*self.embed_positions.weight.shape)) @@ -499,8 +530,10 @@ def forward(self, input_features: Union[torch.Tensor, list[torch.Tensor]]): for features in input_features: embeds = nn.functional.gelu(self.conv1(features)) embeds = nn.functional.gelu(self.conv2(embeds)) - embeds = embeds.permute(1, 0) - embeds = embeds + self.embed_positions.weight[:embeds.size(0), :] + embeds = embeds.transpose(-1, -2) + embeds = (embeds + + self.embed_positions.weight[:embeds.size(-2), :]).to( + embeds.dtype) hidden_states.append(embeds) hidden_states = torch.cat(hidden_states) @@ -792,10 +825,14 @@ def validate_language(cls, language: str) -> bool: f"or {list(ISO639_1_OTHER_LANGS.values())}") @classmethod - def get_generation_prompt(cls, audio: np.ndarray, - stt_config: SpeechToTextConfig, language: str, - task_type: str, - request_prompt: str) -> PromptType: + def get_generation_prompt( + cls, + audio: np.ndarray, + model_config: ModelConfig, # not needed here + stt_config: SpeechToTextConfig, + language: str, + task_type: str, + request_prompt: str) -> PromptType: prompt = { "encoder_prompt": { # Whisper does not support encoder prompt. 
diff --git a/vllm/transformers_utils/configs/mistral.py b/vllm/transformers_utils/configs/mistral.py index d2059c55a30..e66f762eb80 100644 --- a/vllm/transformers_utils/configs/mistral.py +++ b/vllm/transformers_utils/configs/mistral.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from typing import Any -from transformers import PretrainedConfig +from transformers import PretrainedConfig, WhisperConfig from vllm.logger import init_logger @@ -24,9 +24,21 @@ def adapt_config_dict(config_dict: dict[str, Any], if bool(config_dict.get("yarn")): config_dict = _remap_mistral_yarn_args(config_dict) - if bool((config_dict.get("multimodal") or {}).get("vision_encoder_args") - or config_dict.get("vision_encoder")): + + is_vision = ((config_dict.get("multimodal") + or {}).get("vision_encoder_args") + or config_dict.get("vision_encoder")) + is_audio = bool( + ((config_dict.get("multimodal") or {}).get("whisper_model_args") + or {}).get("encoder_args")) + + assert not (is_vision and is_audio), \ + "Vision and audio are mutually exclusive" + + if is_vision: config_dict = _remap_mistral_vision_args(config_dict) + if is_audio: + config_dict = _remap_mistral_audio_args(config_dict) config = PretrainedConfig.from_dict(config_dict) @@ -118,3 +130,35 @@ def _remap_mistral_quantization_args(config: dict) -> dict: config["quantization_config"] = quantization_config return config + + +def _remap_mistral_audio_args(config: dict) -> dict: + whisper_args = config["multimodal"].pop("whisper_model_args") + encoder_args = whisper_args["encoder_args"] + downsample_args = whisper_args["downsample_args"] + + quant_config = config.get("quantization_config") + config = { + "model_type": + "whixtral", + "architectures": ["VoxtralForConditionalGeneration"], + "text_config": + PretrainedConfig.from_dict(config), + "audio_config": + WhisperConfig( + num_mel_bins=encoder_args["audio_encoding_args"]["num_mel_bins"], + window_size=encoder_args["audio_encoding_args"]["window_size"], + sampling_rate=encoder_args["audio_encoding_args"]["sampling_rate"], + hop_length=encoder_args["audio_encoding_args"]["hop_length"], + downsample_factor=downsample_args["downsample_factor"], + d_model=encoder_args["dim"], + encoder_layers=encoder_args["n_layers"], + encoder_ffn_dim=encoder_args["hidden_dim"], + encoder_attention_heads=encoder_args["n_heads"], + vocab_size=encoder_args["vocab_size"], + max_source_positions=encoder_args["max_source_positions"], + ) + } + if quant_config: + config["quantization_config"] = quant_config + return config From 5a7bfdc778edf7b4c32ac22dbc2a4c77ada91b87 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Tue, 15 Jul 2025 23:53:16 +0800 Subject: [PATCH 104/552] [CI/Build] Fix wrong path in Transformers Nightly Models Test (#20994) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index dd723cb620a..bbbcfb745d5 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -645,7 +645,7 @@ steps: optional: true commands: - pip install --upgrade git+https://github.com/huggingface/transformers - - pytest -v -s models/test_initialization.py + - pytest -v -s tests/models/test_initialization.py - pytest -v -s tests/models/multimodal/processing/ - pytest -v -s tests/models/multimodal/test_mapping.py - python3 examples/offline_inference/basic/chat.py From d1d54c2784a2450c9e623982c68a064f37cf3048 
Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Tue, 15 Jul 2025 16:57:53 +0100 Subject: [PATCH 105/552] [Deprecation] Remove everything scheduled for removal in v0.10.0 (#20979) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/features/tool_calling.md | 4 +-- vllm/config.py | 35 +------------------- vllm/engine/arg_utils.py | 27 --------------- vllm/entrypoints/openai/api_server.py | 4 --- vllm/entrypoints/openai/cli_args.py | 12 ------- vllm/entrypoints/openai/serving_chat.py | 17 ---------- vllm/entrypoints/openai/serving_responses.py | 1 - vllm/sampling_params.py | 22 ------------ 8 files changed, 2 insertions(+), 120 deletions(-) diff --git a/docs/features/tool_calling.md b/docs/features/tool_calling.md index 35e01861c5d..f1e5dad35f1 100644 --- a/docs/features/tool_calling.md +++ b/docs/features/tool_calling.md @@ -103,9 +103,7 @@ When tool_choice='required' is set, the model is guaranteed to generate one or m vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request. -By default, when `tool_choice='none'` is specified, vLLM excludes tool definitions from the prompt to optimize context usage. To include tool definitions even with `tool_choice='none'`, use the `--expand-tools-even-if-tool-choice-none` option. - -Note: This behavior will change in v0.10.0, where tool definitions will be included by default even with `tool_choice='none'`. +However, when `tool_choice='none'` is specified, vLLM includes tool definitions from the prompt. ## Automatic Function Calling diff --git a/vllm/config.py b/vllm/config.py index 36671d7d4cc..2965696090d 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -26,7 +26,7 @@ from pydantic.dataclasses import dataclass from safetensors.torch import _TYPES as _SAFETENSORS_TO_TORCH_DTYPE from torch.distributed import ProcessGroup, ReduceOp -from typing_extensions import Self, deprecated, runtime_checkable +from typing_extensions import Self, runtime_checkable import vllm.envs as envs from vllm import version @@ -3688,18 +3688,6 @@ def get_served_model_name(model: str, class DecodingConfig: """Dataclass which contains the decoding strategy of the engine.""" - @property - @deprecated( - "`guided_decoding_backend` is deprecated and has been renamed to " - "`backend`. This will be removed in v0.10.0. Please use the " - "`backend` argument instead.") - def guided_decoding_backend(self) -> GuidedDecodingBackend: - return self.backend - - @guided_decoding_backend.setter - def guided_decoding_backend(self, value: GuidedDecodingBackend): - self.backend = value - backend: GuidedDecodingBackend = "auto" if envs.VLLM_USE_V1 else "xgrammar" """Which engine will be used for guided decoding (JSON schema / regex etc) by default. With "auto", we will make opinionated choices based on request @@ -3742,9 +3730,6 @@ def compute_hash(self) -> str: return hash_str def __post_init__(self): - if ":" in self.backend: - self._extract_backend_options() - if envs.VLLM_USE_V1: valid_guided_backends = get_args(GuidedDecodingBackendV1) else: @@ -3760,24 +3745,6 @@ def __post_init__(self): raise ValueError("disable_additional_properties is only supported " "for the guidance backend.") - @deprecated( - "Passing guided decoding backend options inside backend in the format " - "'backend:...' is deprecated. 
This will be removed in v0.10.0. Please " - "use the dedicated arguments '--disable-fallback', " - "'--disable-any-whitespace' and '--disable-additional-properties' " - "instead.") - def _extract_backend_options(self): - """Extract backend options from the backend string.""" - backend, options = self.backend.split(":") - self.backend = cast(GuidedDecodingBackend, backend) - options_set = set(options.strip().split(",")) - if "no-fallback" in options_set: - self.disable_fallback = True - if "disable-any-whitespace" in options_set: - self.disable_any_whitespace = True - if "no-additional-properties" in options_set: - self.disable_additional_properties = True - DetailedTraceModules = Literal["model", "worker", "all"] diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 998a352497f..500b333926c 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -9,7 +9,6 @@ import json import sys import threading -import warnings from dataclasses import MISSING, dataclass, fields, is_dataclass from itertools import permutations from typing import (TYPE_CHECKING, Annotated, Any, Callable, Dict, List, @@ -434,7 +433,6 @@ class EngineArgs: speculative_config: Optional[Dict[str, Any]] = None - qlora_adapter_name_or_path: Optional[str] = None show_hidden_metrics_for_version: Optional[str] = \ ObservabilityConfig.show_hidden_metrics_for_version otlp_traces_endpoint: Optional[str] = \ @@ -468,7 +466,6 @@ class EngineArgs: additional_config: dict[str, Any] = \ get_field(VllmConfig, "additional_config") - enable_reasoning: Optional[bool] = None # DEPRECATED reasoning_parser: str = DecodingConfig.reasoning_backend use_tqdm_on_load: bool = LoadConfig.use_tqdm_on_load @@ -486,13 +483,6 @@ def __post_init__(self): if isinstance(self.compilation_config, (int, dict)): self.compilation_config = CompilationConfig.from_cli( str(self.compilation_config)) - if self.qlora_adapter_name_or_path is not None: - warnings.warn( - "The `qlora_adapter_name_or_path` is deprecated " - "and will be removed in v0.10.0. ", - DeprecationWarning, - stacklevel=2, - ) # Setup plugins from vllm.plugins import load_general_plugins load_general_plugins() @@ -605,14 +595,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: **load_kwargs["ignore_patterns"]) load_group.add_argument("--use-tqdm-on-load", **load_kwargs["use_tqdm_on_load"]) - load_group.add_argument( - "--qlora-adapter-name-or-path", - type=str, - default=None, - help="The `--qlora-adapter-name-or-path` has no effect, do not set" - " it, and it will be removed in v0.10.0.", - deprecated=True, - ) load_group.add_argument('--pt-load-map-location', **load_kwargs["pt_load_map_location"]) @@ -633,15 +615,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: guided_decoding_group.add_argument( "--guided-decoding-disable-additional-properties", **guided_decoding_kwargs["disable_additional_properties"]) - guided_decoding_group.add_argument( - "--enable-reasoning", - action=argparse.BooleanOptionalAction, - deprecated=True, - help="[DEPRECATED] The `--enable-reasoning` flag is deprecated as " - "of v0.9.0. Use `--reasoning-parser` to specify the reasoning " - "parser backend instead. This flag (`--enable-reasoning`) will be " - "removed in v0.10.0. 
When `--reasoning-parser` is specified, " - "reasoning mode is automatically enabled.") guided_decoding_group.add_argument( "--reasoning-parser", # This choices is a special case because it's not static diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 049a90fea15..65ceeff8eb4 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1514,8 +1514,6 @@ async def init_app_state( chat_template_content_format=args.chat_template_content_format, return_tokens_as_token_ids=args.return_tokens_as_token_ids, enable_auto_tools=args.enable_auto_tool_choice, - expand_tools_even_if_tool_choice_none=args. - expand_tools_even_if_tool_choice_none, tool_parser=args.tool_call_parser, reasoning_parser=args.reasoning_parser, enable_prompt_tokens_details=args.enable_prompt_tokens_details, @@ -1531,8 +1529,6 @@ async def init_app_state( chat_template_content_format=args.chat_template_content_format, return_tokens_as_token_ids=args.return_tokens_as_token_ids, enable_auto_tools=args.enable_auto_tool_choice, - expand_tools_even_if_tool_choice_none=args. - expand_tools_even_if_tool_choice_none, tool_parser=args.tool_call_parser, reasoning_parser=args.reasoning_parser, enable_prompt_tokens_details=args.enable_prompt_tokens_details, diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index 9a7f04cd9b2..c8288b73a45 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -182,13 +182,6 @@ class FrontendArgs: """If set to True, enable tracking server_load_metrics in the app state.""" enable_force_include_usage: bool = False """If set to True, including usage on every request.""" - expand_tools_even_if_tool_choice_none: bool = False - """Include tool definitions in prompts even when `tool_choice='none'`. - - This is a transitional option that will be removed in v0.10.0. In - v0.10.0, tool definitions will always be included regardless of - `tool_choice` setting. 
Use this flag to test the upcoming behavior - before the breaking change.""" @staticmethod def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: @@ -225,11 +218,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: valid_tool_parsers = list(ToolParserManager.tool_parsers.keys()) frontend_kwargs["tool_call_parser"]["choices"] = valid_tool_parsers - # Special case for expand-tools-even-if-tool-choice-none because of - # the deprecation field - frontend_kwargs["expand_tools_even_if_tool_choice_none"]\ - ["deprecated"] = True - frontend_group = parser.add_argument_group( title="Frontend", description=FrontendArgs.__doc__, diff --git a/vllm/entrypoints/openai/serving_chat.py b/vllm/entrypoints/openai/serving_chat.py index 53509e8f65a..b902166a25b 100644 --- a/vllm/entrypoints/openai/serving_chat.py +++ b/vllm/entrypoints/openai/serving_chat.py @@ -63,7 +63,6 @@ def __init__( return_tokens_as_token_ids: bool = False, reasoning_parser: str = "", enable_auto_tools: bool = False, - expand_tools_even_if_tool_choice_none: bool = False, tool_parser: Optional[str] = None, enable_prompt_tokens_details: bool = False, enable_force_include_usage: bool = False, @@ -112,8 +111,6 @@ def __init__( raise TypeError("Error: --enable-auto-tool-choice requires " f"tool_parser:'{tool_parser}' which has not " "been registered") from e - self.expand_tools_even_if_tool_choice_none = ( - expand_tools_even_if_tool_choice_none) self.enable_prompt_tokens_details = enable_prompt_tokens_details self.enable_force_include_usage = enable_force_include_usage @@ -182,20 +179,6 @@ async def create_chat_completion( if request.tools is None: tool_dicts = None - elif (request.tool_choice == "none" - and not self.expand_tools_even_if_tool_choice_none): - if len(request.tools) > 0: - logger.warning_once( - "Tools are specified but tool_choice is set to 'none' " - "and --expand-tools-even-if-tool-choice-none is not " - "enabled. Tool definitions will be excluded from the " - "prompt. This behavior will change in vLLM v0.10 where " - "tool definitions will be included by default even " - "with tool_choice='none'. To adopt the new behavior " - "now, use --expand-tools-even-if-tool-choice-none. 
" - "To suppress this warning, either remove tools from " - "the request or set tool_choice to a different value.") - tool_dicts = None else: tool_dicts = [tool.model_dump() for tool in request.tools] diff --git a/vllm/entrypoints/openai/serving_responses.py b/vllm/entrypoints/openai/serving_responses.py index ac2b3dfafec..f7bde6e243b 100644 --- a/vllm/entrypoints/openai/serving_responses.py +++ b/vllm/entrypoints/openai/serving_responses.py @@ -51,7 +51,6 @@ def __init__( return_tokens_as_token_ids: bool = False, reasoning_parser: str = "", enable_auto_tools: bool = False, - expand_tools_even_if_tool_choice_none: bool = False, tool_parser: Optional[str] = None, enable_prompt_tokens_details: bool = False, enable_force_include_usage: bool = False, diff --git a/vllm/sampling_params.py b/vllm/sampling_params.py index a9a862384d1..322e53b7539 100644 --- a/vllm/sampling_params.py +++ b/vllm/sampling_params.py @@ -9,7 +9,6 @@ import msgspec from pydantic import BaseModel -from typing_extensions import deprecated from vllm.logger import init_logger from vllm.logits_process import LogitsProcessor @@ -84,27 +83,6 @@ def __post_init__(self): "You can only use one kind of guided decoding but multiple are " f"specified: {self.__dict__}") - if self.backend is not None and ":" in self.backend: - self._extract_backend_options() - - @deprecated( - "Passing guided decoding backend options inside backend in the format " - "'backend:...' is deprecated. This will be removed in v0.10.0. Please " - "use the dedicated arguments '--disable-fallback', " - "'--disable-any-whitespace' and '--disable-additional-properties' " - "instead.") - def _extract_backend_options(self): - """Extract backend options from the backend string.""" - assert isinstance(self.backend, str) - self.backend, options = self.backend.split(":") - options_set = set(options.strip().split(",")) - if "no-fallback" in options_set: - self.disable_fallback = True - if "disable-any-whitespace" in options_set: - self.disable_any_whitespace = True - if "no-additional-properties" in options_set: - self.disable_additional_properties = True - class RequestOutputKind(Enum): # Return entire output so far in every RequestOutput From c12b8117b203e4cb07f8610fd0e5b04934c22ab6 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Tue, 15 Jul 2025 17:37:05 +0100 Subject: [PATCH 106/552] Configure Gemini (#20971) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- .gemini/config.yaml | 6 ++++++ 1 file changed, 6 insertions(+) create mode 100644 .gemini/config.yaml diff --git a/.gemini/config.yaml b/.gemini/config.yaml new file mode 100644 index 00000000000..2499d3f0951 --- /dev/null +++ b/.gemini/config.yaml @@ -0,0 +1,6 @@ +# https://developers.google.com/gemini-code-assist/docs/customize-gemini-behavior-github +have_fun: false # Just review the code +code_review: + comment_severity_threshold: HIGH # Reduce quantity of comments + pull_request_opened: + summary: false # Don't summarize the PR in a separate comment From 1d15a1e30b161d1aa7280898f11f2e3e5dba7cdf Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Tue, 15 Jul 2025 18:21:50 +0100 Subject: [PATCH 107/552] [Deprecation] Remove `nullable_kvs` (#20969) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- tests/engine/test_arg_utils.py | 56 ++----------------- .../entrypoints/openai/test_openai_schema.py | 3 +- vllm/engine/arg_utils.py | 
41 +------------- 3 files changed, 7 insertions(+), 93 deletions(-) diff --git a/tests/engine/test_arg_utils.py b/tests/engine/test_arg_utils.py index 86e28c68784..5a91758414a 100644 --- a/tests/engine/test_arg_utils.py +++ b/tests/engine/test_arg_utils.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import json -from argparse import ArgumentError, ArgumentTypeError +from argparse import ArgumentError from contextlib import nullcontext from dataclasses import dataclass, field from typing import Annotated, Literal, Optional @@ -12,8 +12,8 @@ from vllm.config import CompilationConfig, config from vllm.engine.arg_utils import (EngineArgs, contains_type, get_kwargs, get_type, get_type_hints, is_not_builtin, - is_type, literal_to_kwargs, nullable_kvs, - optional_type, parse_type) + is_type, literal_to_kwargs, optional_type, + parse_type) from vllm.utils import FlexibleArgumentParser @@ -25,18 +25,10 @@ "foo": 1, "bar": 2 }), - (json.loads, "foo=1,bar=2", { - "foo": 1, - "bar": 2 - }), ]) def test_parse_type(type, value, expected): parse_type_func = parse_type(type) - context = nullcontext() - if value == "foo=1,bar=2": - context = pytest.warns(DeprecationWarning) - with context: - assert parse_type_func(value) == expected + assert parse_type_func(value) == expected def test_optional_type(): @@ -203,34 +195,6 @@ def test_get_kwargs(): assert kwargs["from_cli_config2"]["type"]('{"field": 2}').field == 4 -@pytest.mark.parametrize(("arg", "expected"), [ - (None, dict()), - ("image=16", { - "image": 16 - }), - ("image=16,video=2", { - "image": 16, - "video": 2 - }), - ("Image=16, Video=2", { - "image": 16, - "video": 2 - }), -]) -def test_limit_mm_per_prompt_parser(arg, expected): - """This functionality is deprecated and will be removed in the future. - This argument should be passed as JSON string instead. 
- - TODO: Remove with nullable_kvs.""" - parser = EngineArgs.add_cli_args(FlexibleArgumentParser()) - if arg is None: - args = parser.parse_args([]) - else: - args = parser.parse_args(["--limit-mm-per-prompt", arg]) - - assert args.limit_mm_per_prompt == expected - - @pytest.mark.parametrize( ("arg", "expected"), [ @@ -326,18 +290,6 @@ def test_prefix_cache_default(): assert not engine_args.enable_prefix_caching -@pytest.mark.parametrize( - ("arg"), - [ - "image", # Missing = - "image=4,image=5", # Conflicting values - "image=video=4" # Too many = in tokenized arg - ]) -def test_bad_nullable_kvs(arg): - with pytest.raises(ArgumentTypeError): - nullable_kvs(arg) - - # yapf: disable @pytest.mark.parametrize(("arg", "expected", "option"), [ (None, None, "mm-processor-kwargs"), diff --git a/tests/entrypoints/openai/test_openai_schema.py b/tests/entrypoints/openai/test_openai_schema.py index aa87cd22fe4..580bf34f20c 100644 --- a/tests/entrypoints/openai/test_openai_schema.py +++ b/tests/entrypoints/openai/test_openai_schema.py @@ -1,5 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import json from typing import Final import pytest @@ -29,7 +30,7 @@ def server(): "--enforce-eager", "--trust-remote-code", "--limit-mm-per-prompt", - f"image={MAXIMUM_IMAGES}", + json.dumps({"image": MAXIMUM_IMAGES}), ] with RemoteOpenAIServer(MODEL_NAME, args) as remote_server: diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 500b333926c..7b73060e349 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -18,7 +18,7 @@ import regex as re import torch from pydantic import TypeAdapter, ValidationError -from typing_extensions import TypeIs, deprecated +from typing_extensions import TypeIs import vllm.envs as envs from vllm.config import (BlockSize, CacheConfig, CacheDType, CompilationConfig, @@ -65,9 +65,6 @@ def parse_type(return_type: Callable[[str], T]) -> Callable[[str], T]: def _parse_type(val: str) -> T: try: - if return_type is json.loads and not re.match( - r"(?s)^\s*{.*}\s*$", val): - return cast(T, nullable_kvs(val)) return return_type(val) except ValueError as e: raise argparse.ArgumentTypeError( @@ -93,42 +90,6 @@ def union_dict_and_str(val: str) -> Optional[Union[str, dict[str, str]]]: return optional_type(json.loads)(val) -@deprecated( - "Passing a JSON argument as a string containing comma separated key=value " - "pairs is deprecated. This will be removed in v0.10.0. Please use a JSON " - "string instead.") -def nullable_kvs(val: str) -> dict[str, int]: - """Parses a string containing comma separate key [str] to value [int] - pairs into a dictionary. - - Args: - val: String value to be parsed. - - Returns: - Dictionary with parsed values. 
- """ - out_dict: dict[str, int] = {} - for item in val.split(","): - kv_parts = [part.lower().strip() for part in item.split("=")] - if len(kv_parts) != 2: - raise argparse.ArgumentTypeError( - "Each item should be in the form KEY=VALUE") - key, value = kv_parts - - try: - parsed_value = int(value) - except ValueError as exc: - msg = f"Failed to parse value of item {key}={value}" - raise argparse.ArgumentTypeError(msg) from exc - - if key in out_dict and out_dict[key] != parsed_value: - raise argparse.ArgumentTypeError( - f"Conflicting values specified for key: {key}") - out_dict[key] = parsed_value - - return out_dict - - def is_type(type_hint: TypeHint, type: TypeHintT) -> TypeIs[TypeHintT]: """Check if the type hint is a specific type.""" return type_hint is type or get_origin(type_hint) is type From de6c541ebcdef5633123b1110f938b35fab8d4c4 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Tue, 15 Jul 2025 18:42:30 +0100 Subject: [PATCH 108/552] Add full serve CLI reference back to docs (#20978) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/cli/README.md | 8 +++++++ docs/configuration/serve_args.md | 2 +- docs/mkdocs/hooks/generate_argparse.py | 23 ++++++++++++++++--- requirements/docs.txt | 1 + vllm/entrypoints/cli/serve.py | 31 -------------------------- vllm/entrypoints/openai/cli_args.py | 28 +++++++++++++++++++++++ 6 files changed, 58 insertions(+), 35 deletions(-) diff --git a/docs/cli/README.md b/docs/cli/README.md index 1d951747a7a..dfb6051a8c8 100644 --- a/docs/cli/README.md +++ b/docs/cli/README.md @@ -1,3 +1,7 @@ +--- +toc_depth: 4 +--- + # vLLM CLI Guide The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: @@ -42,6 +46,10 @@ Start the vLLM OpenAI Compatible API server. vllm serve --help=page ``` +### Options + +--8<-- "docs/argparse/serve.md" + ## chat Generate chat completions via the running API server. diff --git a/docs/configuration/serve_args.md b/docs/configuration/serve_args.md index 142d4b8af89..c1cc5577bc7 100644 --- a/docs/configuration/serve_args.md +++ b/docs/configuration/serve_args.md @@ -5,7 +5,7 @@ The `vllm serve` command is used to launch the OpenAI-compatible server. ## CLI Arguments The `vllm serve` command is used to launch the OpenAI-compatible server. -To see the available CLI arguments, run `vllm serve --help`! +To see the available options, take a look at the [CLI Reference](../cli/README.md#options)! 
## Configuration file diff --git a/docs/mkdocs/hooks/generate_argparse.py b/docs/mkdocs/hooks/generate_argparse.py index 64120f2d151..22cf41e6041 100644 --- a/docs/mkdocs/hooks/generate_argparse.py +++ b/docs/mkdocs/hooks/generate_argparse.py @@ -16,6 +16,7 @@ sys.modules["vllm._C"] = MagicMock() from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs # noqa: E402 +from vllm.entrypoints.openai.cli_args import make_arg_parser # noqa: E402 from vllm.utils import FlexibleArgumentParser # noqa: E402 logger = logging.getLogger("mkdocs") @@ -24,15 +25,18 @@ class MarkdownFormatter(HelpFormatter): """Custom formatter that generates markdown for argument groups.""" - def __init__(self, prog): + def __init__(self, prog, starting_heading_level=3): super().__init__(prog, max_help_position=float('inf'), width=float('inf')) + self._section_heading_prefix = "#" * starting_heading_level + self._argument_heading_prefix = "#" * (starting_heading_level + 1) self._markdown_output = [] def start_section(self, heading): if heading not in {"positional arguments", "options"}: - self._markdown_output.append(f"\n### {heading}\n\n") + heading_md = f"\n{self._section_heading_prefix} {heading}\n\n" + self._markdown_output.append(heading_md) def end_section(self): pass @@ -46,9 +50,13 @@ def add_usage(self, usage, actions, groups, prefix=None): def add_arguments(self, actions): for action in actions: + if (len(action.option_strings) == 0 + or "--help" in action.option_strings): + continue option_strings = f'`{"`, `".join(action.option_strings)}`' - self._markdown_output.append(f"#### {option_strings}\n\n") + heading_md = f"{self._argument_heading_prefix} {option_strings}\n\n" + self._markdown_output.append(heading_md) if choices := action.choices: choices = f'`{"`, `".join(str(c) for c in choices)}`' @@ -81,6 +89,14 @@ def create_parser(cls, **kwargs) -> FlexibleArgumentParser: return cls.add_cli_args(parser, **kwargs) +def create_serve_parser() -> FlexibleArgumentParser: + """Create a parser for the serve command with markdown formatting.""" + parser = FlexibleArgumentParser() + parser.formatter_class = lambda prog: MarkdownFormatter( + prog, starting_heading_level=4) + return make_arg_parser(parser) + + def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool): logger.info("Generating argparse documentation") logger.debug("Root directory: %s", ROOT_DIR.resolve()) @@ -95,6 +111,7 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool): "engine_args": create_parser(EngineArgs), "async_engine_args": create_parser(AsyncEngineArgs, async_args_only=True), + "serve": create_serve_parser(), } # Generate documentation for each parser diff --git a/requirements/docs.txt b/requirements/docs.txt index 7ea768b9909..1ddc825a9cd 100644 --- a/requirements/docs.txt +++ b/requirements/docs.txt @@ -17,6 +17,7 @@ cloudpickle fastapi msgspec openai +partial-json-parser pillow psutil pybase64 diff --git a/vllm/entrypoints/cli/serve.py b/vllm/entrypoints/cli/serve.py index d25105cbb78..1204ccc1c67 100644 --- a/vllm/entrypoints/cli/serve.py +++ b/vllm/entrypoints/cli/serve.py @@ -67,37 +67,6 @@ def subparser_init( help="Start the vLLM OpenAI Compatible API server.", description="Start the vLLM OpenAI Compatible API server.", usage="vllm serve [model_tag] [options]") - serve_parser.add_argument("model_tag", - type=str, - nargs='?', - help="The model tag to serve " - "(optional if specified in config)") - serve_parser.add_argument( - "--headless", - action='store_true', - default=False, - 
help="Run in headless mode. See multi-node data parallel " - "documentation for more details.") - serve_parser.add_argument( - '--data-parallel-start-rank', - '-dpr', - type=int, - default=0, - help="Starting data parallel rank for secondary nodes. " - "Requires --headless.") - serve_parser.add_argument('--api-server-count', - '-asc', - type=int, - default=1, - help='How many API server processes to run.') - serve_parser.add_argument( - "--config", - type=str, - default='', - required=False, - help="Read CLI options from a config file. " - "Must be a YAML with the following options: " - "https://docs.vllm.ai/en/latest/configuration/serve_args.html") serve_parser = make_arg_parser(serve_parser) show_filtered_argument_or_group_from_help(serve_parser, ["serve"]) diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index c8288b73a45..f8fdfe71bbe 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -236,6 +236,34 @@ def make_arg_parser(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: register all arguments instead of manually enumerating them here. This avoids code duplication and keeps the argument definitions in one place. """ + parser.add_argument("model_tag", + type=str, + nargs="?", + help="The model tag to serve " + "(optional if specified in config)") + parser.add_argument( + "--headless", + action="store_true", + default=False, + help="Run in headless mode. See multi-node data parallel " + "documentation for more details.") + parser.add_argument( + "--data-parallel-start-rank", + "-dpr", + type=int, + default=0, + help="Starting data parallel rank for secondary nodes. " + "Requires --headless.") + parser.add_argument("--api-server-count", + "-asc", + type=int, + default=1, + help="How many API server processes to run.") + parser.add_argument( + "--config", + help="Read CLI options from a config file. " + "Must be a YAML with the following options: " + "https://docs.vllm.ai/en/latest/configuration/serve_args.html") parser = FrontendArgs.add_cli_args(parser) parser = AsyncEngineArgs.add_cli_args(parser) From 5fa365584e39eea45a8151dbcae41791c4e46991 Mon Sep 17 00:00:00 2001 From: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Date: Tue, 15 Jul 2025 14:01:44 -0400 Subject: [PATCH 109/552] [ROCm] warpSize is being made non constexpr in ROCm 7.0 (#20330) Signed-off-by: Gregory Shtrasberg Signed-off-by: x22x22 --- csrc/attention/attention_kernels.cuh | 8 +------- csrc/attention/paged_attention_v1.cu | 8 +------- csrc/attention/paged_attention_v2.cu | 8 +------- csrc/cuda_compat.h | 6 +++--- 4 files changed, 6 insertions(+), 24 deletions(-) diff --git a/csrc/attention/attention_kernels.cuh b/csrc/attention/attention_kernels.cuh index 79a546554fa..8f24be89578 100644 --- a/csrc/attention/attention_kernels.cuh +++ b/csrc/attention/attention_kernels.cuh @@ -24,6 +24,7 @@ #include "attention_dtypes.h" #include "attention_utils.cuh" +#include "cuda_compat.h" #ifdef USE_ROCM #include @@ -33,12 +34,6 @@ typedef __hip_bfloat16 __nv_bfloat16; #include "../quantization/fp8/nvidia/quant_utils.cuh" #endif -#ifndef USE_ROCM - #define WARP_SIZE 32 -#else - #define WARP_SIZE warpSize -#endif - #define MAX(a, b) ((a) > (b) ? (a) : (b)) #define MIN(a, b) ((a) < (b) ? 
(a) : (b)) #define DIVIDE_ROUND_UP(a, b) (((a) + (b) - 1) / (b)) @@ -670,7 +665,6 @@ __global__ void paged_attention_v2_reduce_kernel( } // namespace vllm -#undef WARP_SIZE #undef MAX #undef MIN #undef DIVIDE_ROUND_UP diff --git a/csrc/attention/paged_attention_v1.cu b/csrc/attention/paged_attention_v1.cu index 46108a32d71..7a5ef10f8ef 100644 --- a/csrc/attention/paged_attention_v1.cu +++ b/csrc/attention/paged_attention_v1.cu @@ -18,12 +18,7 @@ */ #include "attention_kernels.cuh" - -#ifndef USE_ROCM - #define WARP_SIZE 32 -#else - #define WARP_SIZE warpSize -#endif +#include "cuda_compat.h" #define MAX(a, b) ((a) > (b) ? (a) : (b)) #define MIN(a, b) ((a) < (b) ? (a) : (b)) @@ -187,7 +182,6 @@ void paged_attention_v1( CALL_V1_LAUNCHER_BLOCK_SIZE) } -#undef WARP_SIZE #undef MAX #undef MIN #undef DIVIDE_ROUND_UP diff --git a/csrc/attention/paged_attention_v2.cu b/csrc/attention/paged_attention_v2.cu index 9358c0d9f6a..b45b28dad05 100644 --- a/csrc/attention/paged_attention_v2.cu +++ b/csrc/attention/paged_attention_v2.cu @@ -18,12 +18,7 @@ */ #include "attention_kernels.cuh" - -#ifndef USE_ROCM - #define WARP_SIZE 32 -#else - #define WARP_SIZE warpSize -#endif +#include "cuda_compat.h" #define MAX(a, b) ((a) > (b) ? (a) : (b)) #define MIN(a, b) ((a) < (b) ? (a) : (b)) @@ -197,7 +192,6 @@ void paged_attention_v2( CALL_V2_LAUNCHER_BLOCK_SIZE) } -#undef WARP_SIZE #undef MAX #undef MIN #undef DIVIDE_ROUND_UP diff --git a/csrc/cuda_compat.h b/csrc/cuda_compat.h index 82e55613d91..affa051c759 100644 --- a/csrc/cuda_compat.h +++ b/csrc/cuda_compat.h @@ -4,10 +4,10 @@ #include #endif -#ifndef USE_ROCM - #define WARP_SIZE 32 +#if defined(USE_ROCM) && defined(__GFX9__) + #define WARP_SIZE 64 #else - #define WARP_SIZE warpSize + #define WARP_SIZE 32 #endif #ifndef USE_ROCM From dbb9e1879a56ff68c1d3339022742aee502f3d3a Mon Sep 17 00:00:00 2001 From: "Tuan, Hoang-Trong" Date: Tue, 15 Jul 2025 16:08:26 -0400 Subject: [PATCH 110/552] [BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 (#20838) Signed-off-by: Tuan M. Hoang-Trong Co-authored-by: Tuan M. 
Hoang-Trong Signed-off-by: x22x22 --- vllm/model_executor/layers/mamba/mamba_mixer2.py | 16 +++++++++++----- .../layers/mamba/ops/causal_conv1d.py | 7 +++---- 2 files changed, 14 insertions(+), 9 deletions(-) diff --git a/vllm/model_executor/layers/mamba/mamba_mixer2.py b/vllm/model_executor/layers/mamba/mamba_mixer2.py index a88bd55e236..f3850d31c82 100644 --- a/vllm/model_executor/layers/mamba/mamba_mixer2.py +++ b/vllm/model_executor/layers/mamba/mamba_mixer2.py @@ -573,8 +573,8 @@ def forward_cuda( x = hidden_states_B_C_p.transpose( 0, 1) # this is the form that causal-conv see if mamba2_metadata.cu_seqlen is None: - mamba2_metadata = update_metadata( - x, attn_metadata.query_start_loc, mamba2_metadata) + mamba2_metadata = update_metadata(x, query_start_loc_p, + mamba2_metadata) hidden_states_B_C_p = causal_conv1d_fn( x, conv_weights, @@ -583,6 +583,7 @@ def forward_cuda( conv_states=conv_state, has_initial_state=has_initial_states_p, cache_indices=state_indices_tensor_p, + metadata=mamba2_metadata, query_start_loc=query_start_loc_p).transpose( 0, 1)[:num_prefill_tokens] @@ -593,9 +594,14 @@ def forward_cuda( initial_states = None if (has_initial_states_p is not None and prep_initial_states): # making a copy of the states - initial_states = torch.where( - has_initial_states_p[:, None, None, None], - ssm_state[state_indices_tensor_p], 0) + if envs.VLLM_USE_V1: + initial_states = torch.where( + has_initial_states_p[:, None, None, None], + ssm_state[state_indices_tensor_p], 0) + else: + initial_states = torch.where( + has_initial_states_p[:num_prefills, None, None, None], + ssm_state[state_indices_tensor_p], 0) scan_output, varlen_state = mamba_chunk_scan_combined( hidden_states_p.view(1, num_prefill_tokens, diff --git a/vllm/model_executor/layers/mamba/ops/causal_conv1d.py b/vllm/model_executor/layers/mamba/ops/causal_conv1d.py index a8bd0067bf4..b8d4bbc3710 100644 --- a/vllm/model_executor/layers/mamba/ops/causal_conv1d.py +++ b/vllm/model_executor/layers/mamba/ops/causal_conv1d.py @@ -55,7 +55,6 @@ def _causal_conv1d_fwd_kernel( # continuous batching IS_CONTINUOUS_BATCHING: tl.constexpr, USE_PAD_SLOT: tl.constexpr, NP2_STATELEN: tl.constexpr, - DECODE_SEQLEN: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, ): @@ -416,7 +415,7 @@ def causal_conv1d_fn( activation = "silu" args = None - out = torch.zeros_like(x) + out = torch.empty_like(x) if metadata is not None: cu_seqlen = metadata.cu_seqlen nums_dict = metadata.nums_dict @@ -607,7 +606,6 @@ def grid(META): IS_CONTINUOUS_BATCHING=cache_indices is not None, USE_PAD_SLOT=pad_slot_id is not None, NP2_STATELEN=np2_statelen, - DECODE_SEQLEN=1, #launch_cooperative_grid=True BLOCK_M=8, BLOCK_N=256, @@ -665,7 +663,8 @@ def _causal_conv1d_update_kernel( if IS_CONTINUOUS_BATCHING: # mask = idx_seq < batch - conv_state_batch_coord = tl.load(conv_state_indices_ptr + idx_seq) + conv_state_batch_coord = tl.load(conv_state_indices_ptr + idx_seq).to( + tl.int64) else: conv_state_batch_coord = idx_seq if USE_PAD_SLOT: # noqa From 905da93827d425655c86d505458d4703c320227a Mon Sep 17 00:00:00 2001 From: Marko Rosenmueller <5467316+dr75@users.noreply.github.com> Date: Tue, 15 Jul 2025 23:01:04 +0200 Subject: [PATCH 111/552] [Frontend] Support cache_salt in /v1/completions and /v1/responses (#20981) Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/entrypoints/openai/api_server.py | 1 + vllm/entrypoints/openai/protocol.py | 52 +++++++++++++++++-- vllm/entrypoints/openai/serving_completion.py 
| 17 ++++++ vllm/entrypoints/openai/serving_engine.py | 11 +++- 4 files changed, 77 insertions(+), 4 deletions(-) diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 65ceeff8eb4..19d0110ff37 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1540,6 +1540,7 @@ async def init_app_state( state.openai_serving_models, request_logger=request_logger, return_tokens_as_token_ids=args.return_tokens_as_token_ids, + enable_prompt_tokens_details=args.enable_prompt_tokens_details, enable_force_include_usage=args.enable_force_include_usage, ) if "generate" in model_config.supported_tasks else None state.openai_serving_pooling = OpenAIServingPooling( diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py index fdac6ccd19e..f17faa23d01 100644 --- a/vllm/entrypoints/openai/protocol.py +++ b/vllm/entrypoints/openai/protocol.py @@ -290,6 +290,15 @@ class ResponsesRequest(OpenAIBaseModel): "default: 0). Any priority other than 0 will raise an error " "if the served model does not use priority scheduling."), ) + cache_salt: Optional[str] = Field( + default=None, + description=( + "If specified, the prefix cache will be salted with the provided " + "string to prevent an attacker to guess prompts in multi-user " + "environments. The salt should be random, protected from " + "access by 3rd parties, and long enough to be " + "unpredictable (e.g., 43 characters base64-encoded, corresponding " + "to 256 bit). Not supported by vLLM engine V0.")) # --8<-- [end:responses-extra-params] _DEFAULT_SAMPLING_PARAMS = { @@ -351,6 +360,19 @@ def validate_prompt(cls, data): raise ValueError("prompt template is not supported") return data + @model_validator(mode="before") + def check_cache_salt_support(cls, data): + if data.get("cache_salt") is not None: + if not envs.VLLM_USE_V1: + raise ValueError( + "Parameter 'cache_salt' is not supported with " + "this instance of vLLM, which uses engine V0.") + if not isinstance(data["cache_salt"], + str) or not data["cache_salt"]: + raise ValueError("Parameter 'cache_salt' must be a " + "non-empty string if provided.") + return data + class ChatCompletionRequest(OpenAIBaseModel): # Ordered by official OpenAI API documentation @@ -1004,6 +1026,16 @@ class CompletionRequest(OpenAIBaseModel): " as strings of the form 'token_id:{token_id}' so that tokens " "that are not JSON-encodable can be identified.")) + cache_salt: Optional[str] = Field( + default=None, + description=( + "If specified, the prefix cache will be salted with the provided " + "string to prevent an attacker to guess prompts in multi-user " + "environments. The salt should be random, protected from " + "access by 3rd parties, and long enough to be " + "unpredictable (e.g., 43 characters base64-encoded, corresponding " + "to 256 bit). 
Not supported by vLLM engine V0.")) + kv_transfer_params: Optional[dict[str, Any]] = Field( default=None, description="KVTransfer parameters used for disaggregated serving.") @@ -1180,6 +1212,20 @@ def validate_prompt_and_prompt_embeds(cls, data): "At least one of `prompt` or `prompt_embeds` must be set.") return data + @model_validator(mode="before") + @classmethod + def check_cache_salt_support(cls, data): + if data.get("cache_salt") is not None: + if not envs.VLLM_USE_V1: + raise ValueError( + "Parameter 'cache_salt' is not supported with " + "this instance of vLLM, which uses engine V0.") + if not isinstance(data["cache_salt"], + str) or not data["cache_salt"]: + raise ValueError("Parameter 'cache_salt' must be a " + "non-empty string if provided.") + return data + class EmbeddingCompletionRequest(OpenAIBaseModel): # Ordered by official OpenAI API documentation @@ -1971,7 +2017,7 @@ class TranscriptionRequest(OpenAIBaseModel): """ stream: Optional[bool] = False - """When set, it will enable output to be streamed in a similar fashion + """When set, it will enable output to be streamed in a similar fashion as the Chat Completion endpoint. """ # --8<-- [start:transcription-extra-params] @@ -2233,9 +2279,9 @@ class TranslationRequest(OpenAIBaseModel): """ stream: Optional[bool] = False - """Custom field not present in the original OpenAI definition. When set, + """Custom field not present in the original OpenAI definition. When set, it will enable output to be streamed in a similar fashion as the Chat - Completion endpoint. + Completion endpoint. """ # Flattened stream option to simplify form data. stream_include_usage: Optional[bool] = False diff --git a/vllm/entrypoints/openai/serving_completion.py b/vllm/entrypoints/openai/serving_completion.py index 6c9c29b7144..eb9a35a7a37 100644 --- a/vllm/entrypoints/openai/serving_completion.py +++ b/vllm/entrypoints/openai/serving_completion.py @@ -23,6 +23,7 @@ CompletionResponseStreamChoice, CompletionStreamResponse, ErrorResponse, + PromptTokenUsageInfo, RequestResponseMetadata, UsageInfo) from vllm.entrypoints.openai.serving_engine import ( @@ -56,6 +57,7 @@ def __init__( *, request_logger: Optional[RequestLogger], return_tokens_as_token_ids: bool = False, + enable_prompt_tokens_details: bool = False, enable_force_include_usage: bool = False, ): super().__init__(engine_client=engine_client, @@ -64,6 +66,7 @@ def __init__( request_logger=request_logger, return_tokens_as_token_ids=return_tokens_as_token_ids, enable_force_include_usage=enable_force_include_usage) + self.enable_prompt_tokens_details = enable_prompt_tokens_details self.default_sampling_params = ( self.model_config.get_diff_sampling_param()) if self.default_sampling_params: @@ -313,6 +316,8 @@ async def completion_stream_generator( previous_num_tokens = [0] * num_choices * num_prompts has_echoed = [False] * num_choices * num_prompts num_prompt_tokens = [0] * num_prompts + num_cached_tokens = None + first_iteration = True stream_options = request.stream_options if stream_options: @@ -328,6 +333,10 @@ async def completion_stream_generator( prompt_token_ids = res.prompt_token_ids prompt_logprobs = res.prompt_logprobs + if first_iteration: + num_cached_tokens = res.num_cached_tokens + first_iteration = False + if res.prompt is not None: prompt_text = res.prompt else: @@ -431,6 +440,10 @@ async def completion_stream_generator( completion_tokens=total_completion_tokens, total_tokens=total_prompt_tokens + total_completion_tokens) + if self.enable_prompt_tokens_details and 
num_cached_tokens: + final_usage_info.prompt_tokens_details = PromptTokenUsageInfo( + cached_tokens=num_cached_tokens) + if include_usage: final_usage_chunk = CompletionStreamResponse( id=request_id, @@ -535,6 +548,10 @@ def request_output_to_completion_response( total_tokens=num_prompt_tokens + num_generated_tokens, ) + if self.enable_prompt_tokens_details and final_res.num_cached_tokens: + usage.prompt_tokens_details = PromptTokenUsageInfo( + cached_tokens=final_res.num_cached_tokens) + request_metadata.final_usage_info = usage return CompletionResponse( diff --git a/vllm/entrypoints/openai/serving_engine.py b/vllm/entrypoints/openai/serving_engine.py index dab5ac03253..462317a0878 100644 --- a/vllm/entrypoints/openai/serving_engine.py +++ b/vllm/entrypoints/openai/serving_engine.py @@ -226,7 +226,7 @@ def __init__( def _get_async_tokenizer(self, tokenizer) -> AsyncMicrobatchTokenizer: """ - Return (and cache) an `AsyncMicrobatchTokenizer` bound to the + Return (and cache) an `AsyncMicrobatchTokenizer` bound to the given tokenizer. """ async_tokenizer = self._async_tokenizer_pool.get(tokenizer) @@ -811,6 +811,12 @@ async def _preprocess_completion( prompt_token_ids=request_prompt_text["prompt_token_ids"]) for request_prompt_text in request_prompts_text ] + cache_salt = request.cache_salt if ( + hasattr(request, "cache_salt") + and request.cache_salt is not None) else None + if cache_salt: + for prompt_text in engine_prompts_text: + prompt_text["cache_salt"] = cache_salt # This check is equivalent to simply checking if # `request_prompts_embeds` is empty, but it's difficult to propagate @@ -828,6 +834,9 @@ async def _preprocess_completion( prompt_embeds=request_prompt_embeds["prompt_embeds"]) for request_prompt_embeds in request_prompts_embeds ] + if cache_salt: + for prompt_embed in engine_prompts_embeds: + prompt_embed["cache_salt"] = cache_salt request_prompts = request_prompts_embeds + request_prompts_text engine_prompts = engine_prompts_embeds + engine_prompts_text From 91b3e1339e81f725c6e31d24fc5dae3a42b72ea4 Mon Sep 17 00:00:00 2001 From: Chen LI Date: Tue, 15 Jul 2025 14:23:52 -0700 Subject: [PATCH 112/552] =?UTF-8?q?[Bug=20Fix]=20get=5Fdistributed=5Finit?= =?UTF-8?q?=5Fmethod=20should=20get=20the=20ip=20from=20get=5Fip=20i?= =?UTF-8?q?=E2=80=A6=20(#20889)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Chen Li Co-authored-by: Russell Bryant Signed-off-by: Russell Bryant Signed-off-by: x22x22 --- vllm/envs.py | 5 +++++ vllm/utils/__init__.py | 27 ++++++++++++++++++++++++++ vllm/v1/executor/multiproc_executor.py | 8 ++++---- 3 files changed, 36 insertions(+), 4 deletions(-) diff --git a/vllm/envs.py b/vllm/envs.py index 7bff6ade815..37dd8146c06 100644 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -139,6 +139,7 @@ VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16: bool = True VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB: Optional[int] = None VLLM_NIXL_ABORT_REQUEST_TIMEOUT: int = 120 + VLLM_LOOPBACK_IP: str = "" def get_default_cache_root(): @@ -964,6 +965,10 @@ def get_vllm_port() -> Optional[int]: # If set to 1, use the TRTLLM Decode Attention backend in flashinfer. 
"VLLM_USE_TRTLLM_DECODE_ATTENTION": lambda: os.getenv("VLLM_USE_TRTLLM_DECODE_ATTENTION", None), + + # Used to force set up loopback IP + "VLLM_LOOPBACK_IP": + lambda: os.getenv("VLLM_LOOPBACK_IP", ""), } # --8<-- [end:env-vars-definition] diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index 0fed490a1fc..c18f1d12ba9 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -813,6 +813,33 @@ def get_ip() -> str: return "0.0.0.0" +def test_loopback_bind(address, family): + try: + s = socket.socket(family, socket.SOCK_DGRAM) + s.bind((address, 0)) # Port 0 = auto assign + s.close() + return True + except OSError: + return False + + +def get_loopback_ip() -> str: + loopback_ip = envs.VLLM_LOOPBACK_IP + if loopback_ip: + return loopback_ip + + # VLLM_LOOPBACK_IP is not set, try to get it based on network interface + + if test_loopback_bind("127.0.0.1", socket.AF_INET): + return "127.0.0.1" + elif test_loopback_bind("::1", socket.AF_INET6): + return "::1" + else: + raise RuntimeError( + "Neither 127.0.0.1 nor ::1 are bound to a local interface. " + "Set the VLLM_LOOPBACK_IP environment variable explicitly.") + + def is_valid_ipv6_address(address: str) -> bool: try: ipaddress.IPv6Address(address) diff --git a/vllm/v1/executor/multiproc_executor.py b/vllm/v1/executor/multiproc_executor.py index d29da55ce88..5960dd766c8 100644 --- a/vllm/v1/executor/multiproc_executor.py +++ b/vllm/v1/executor/multiproc_executor.py @@ -30,8 +30,8 @@ from vllm.executor.multiproc_worker_utils import ( _add_prefix, set_multiprocessing_worker_envs) from vllm.logger import init_logger -from vllm.utils import (get_distributed_init_method, get_mp_context, - get_open_port) +from vllm.utils import (get_distributed_init_method, get_loopback_ip, + get_mp_context, get_open_port) from vllm.v1.executor.abstract import Executor, FailureCallback from vllm.v1.outputs import ModelRunnerOutput from vllm.worker.worker_base import WorkerWrapperBase @@ -63,9 +63,9 @@ def _init_executor(self) -> None: # Multiprocessing-based executor does not support multi-node setting. # Since it only works for single node, we can use the loopback address - # 127.0.0.1 for communication. + # get_loopback_ip() for communication. 
distributed_init_method = get_distributed_init_method( - "127.0.0.1", get_open_port()) + get_loopback_ip(), get_open_port()) # Initialize worker and set up message queues for SchedulerOutputs # and ModelRunnerOutputs From d1462bcb20c7b4b5899210b43e493ae57774aef4 Mon Sep 17 00:00:00 2001 From: Elfie Guo <164945471+elfiegg@users.noreply.github.com> Date: Tue, 15 Jul 2025 17:56:45 -0700 Subject: [PATCH 113/552] [Nvidia] Integrate SM100 cudnn prefill API to MLA prefill (#20411) Signed-off-by: Elfie Guo Co-authored-by: Elfie Guo Signed-off-by: x22x22 --- vllm/envs.py | 5 + vllm/v1/attention/backends/mla/common.py | 113 ++++++++++++++++++++++- 2 files changed, 113 insertions(+), 5 deletions(-) mode change 100644 => 100755 vllm/envs.py mode change 100644 => 100755 vllm/v1/attention/backends/mla/common.py diff --git a/vllm/envs.py b/vllm/envs.py old mode 100644 new mode 100755 index 37dd8146c06..502978c7685 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -139,6 +139,7 @@ VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16: bool = True VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB: Optional[int] = None VLLM_NIXL_ABORT_REQUEST_TIMEOUT: int = 120 + VLLM_USE_CUDNN_PREFILL: bool = False VLLM_LOOPBACK_IP: str = "" @@ -962,6 +963,10 @@ def get_vllm_port() -> Optional[int]: "VLLM_NIXL_ABORT_REQUEST_TIMEOUT": lambda: int(os.getenv("VLLM_NIXL_ABORT_REQUEST_TIMEOUT", "120")), + # Controls whether or not to use cudnn prefill + "VLLM_USE_CUDNN_PREFILL": + lambda: bool(int(os.getenv("VLLM_USE_CUDNN_PREFILL", "0"))), + # If set to 1, use the TRTLLM Decode Attention backend in flashinfer. "VLLM_USE_TRTLLM_DECODE_ATTENTION": lambda: os.getenv("VLLM_USE_TRTLLM_DECODE_ATTENTION", None), diff --git a/vllm/v1/attention/backends/mla/common.py b/vllm/v1/attention/backends/mla/common.py old mode 100644 new mode 100755 index 904b6081d92..381a92a8309 --- a/vllm/v1/attention/backends/mla/common.py +++ b/vllm/v1/attention/backends/mla/common.py @@ -194,6 +194,7 @@ import torch +import vllm.envs as envs from vllm import _custom_ops as ops from vllm.attention.backends.abstract import (AttentionBackend, AttentionLayer, AttentionMetadata, @@ -225,6 +226,8 @@ try: from flashinfer import BatchPrefillWithRaggedKVCacheWrapper + from flashinfer.prefill import ( # noqa: F401 + cudnn_batch_prefill_with_kv_cache) flashinfer_available = True except ImportError: flashinfer_available = False @@ -236,6 +239,8 @@ logger = init_logger(__name__) +CUDNN_WORKSPACE_SIZE = 12800 + class MLACommonBackend(AttentionBackend): @@ -294,6 +299,7 @@ class ChunkedContextMetadata: starts: torch.Tensor seq_tot: list[int] max_seq_lens: list[int] + seq_lens: torch.Tensor workspace: torch.Tensor block_table: torch.Tensor @@ -309,6 +315,17 @@ class FlashInferPrefillMetadata(MLACommonPrefillMetadata): default_factory=list) +@dataclass +class CudnnPrefillMetadata(MLACommonPrefillMetadata): + + class ChunkedContextMetadata( + MLACommonPrefillMetadata.ChunkedContextMetadata): + seq_lens: torch.Tensor + + query_seq_lens: Optional[torch.Tensor] = None + cudnn_workspace: Optional[torch.Tensor] = None + + @dataclass class MLACommonDecodeMetadata: block_table: torch.Tensor @@ -351,7 +368,8 @@ class MLACommonMetadata(Generic[D]): decode: Optional[D] = None prefill: Optional[Union[MLACommonPrefillMetadata, - FlashInferPrefillMetadata]] = None + FlashInferPrefillMetadata, + CudnnPrefillMetadata]] = None def __post_init__(self): if self.head_dim is not None: @@ -362,13 +380,19 @@ def __post_init__(self): def use_flashinfer_prefill() -> bool: - if flashinfer_available: + if flashinfer_available and 
not envs.VLLM_USE_CUDNN_PREFILL: # For blackwell default to flashinfer prefill if its available since # its faster than FA2. return current_platform.has_device_capability(100) return False +def use_cudnn_prefill() -> bool: + if flashinfer_available and envs.VLLM_USE_CUDNN_PREFILL: + return current_platform.has_device_capability(100) + return False + + # Currently 394MB, this can be tuned based on GEMM sizes used. # Choosen to be the same as sglang: # https://github.com/sgl-project/sglang/blob/766392c6bda2558b61ce6d1c1bfd8081a549e1f1/python/sglang/global_config.py#L37 @@ -427,11 +451,15 @@ def __init__(self, dtype=model_config.dtype, device=runner.device, ) + self.block_table = block_table + self._use_cudnn_prefill = use_cudnn_prefill() self._use_fi_prefill = use_flashinfer_prefill() - self.prefill_metadata_cls = FlashInferPrefillMetadata \ - if self._use_fi_prefill else MLACommonPrefillMetadata + self.prefill_metadata_cls = ( + FlashInferPrefillMetadata + if self._use_fi_prefill else CudnnPrefillMetadata + if self._use_cudnn_prefill else MLACommonPrefillMetadata) if self._use_fi_prefill: self._workspace_buffer = torch.empty( @@ -447,6 +475,13 @@ def __init__(self, self._global_hyperparameters = infer_global_hyperparameters( get_per_layer_parameters(runner.vllm_config, MLACommonImpl)) + if self._use_cudnn_prefill: + self.cudnn_workspace = torch.empty( + CUDNN_WORKSPACE_SIZE * scheduler_config.max_num_seqs, + dtype=torch.int8, + device=runner.device, + ) + def _build_fi_prefill_wrappers(self, prefill: FlashInferPrefillMetadata): qo_indptr = prefill.query_start_loc @@ -692,15 +727,24 @@ def build(self, common_prefix_len: int, out=cu_seq_lens_cpu[:, 1:], dtype=torch.int32) + chunked_context_metadata_cls = \ + CudnnPrefillMetadata.ChunkedContextMetadata \ + if self._use_cudnn_prefill else \ + MLACommonPrefillMetadata.ChunkedContextMetadata + chunked_context_metadata = \ - MLACommonPrefillMetadata.ChunkedContextMetadata( + chunked_context_metadata_cls( cu_seq_lens=cu_seq_lens_cpu.to(device, non_blocking=True), starts=chunk_starts.to(device, non_blocking=True), seq_tot=chunk_seq_lens.sum(dim=1).tolist(), max_seq_lens=chunk_seq_lens.max(dim=1).values.tolist(), + seq_lens=chunk_seq_lens, workspace=self.chunked_prefill_workspace, ) + if self._use_cudnn_prefill: + chunked_context_metadata.seq_lens = chunk_seq_lens + assert max(chunked_context_metadata.max_seq_lens) <= \ self.chunked_prefill_workspace_size @@ -711,6 +755,12 @@ def build(self, common_prefix_len: int, chunked_context=chunked_context_metadata, ) + if self._use_cudnn_prefill: + assert isinstance(prefill_metadata, CudnnPrefillMetadata) + prefill_metadata.query_seq_lens = prefill_query_start_loc[1:] \ + - prefill_query_start_loc[:-1] + prefill_metadata.cudnn_workspace = self.cudnn_workspace + decode_metadata = None if self._num_decodes > 0: decode_metadata = self._build_decode( @@ -794,6 +844,12 @@ def __init__( self._run_prefill_context_chunk = self._run_prefill_context_chunk_fi self._run_prefill_new_tokens = self._run_prefill_new_tokens_fi self._pad_v = False + elif use_cudnn_prefill(): + logger.debug_once("Using CUDNN prefill for MLA") + self._run_prefill_context_chunk = \ + self._run_prefill_context_chunk_cudnn + self._run_prefill_new_tokens = self._run_prefill_new_tokens_cudnn + self._pad_v = False else: # Use FlashAttention logger.debug_once("Using FlashAttention prefill for MLA") self._run_prefill_context_chunk = self._run_prefill_context_chunk_fa @@ -882,6 +938,29 @@ def _run_prefill_new_tokens_fi(self, prefill: 
MLACommonPrefillMetadata, q, return_lse=return_softmax_lse, ) + def _run_prefill_new_tokens_cudnn(self, prefill: MLACommonPrefillMetadata, + q, k, v, return_softmax_lse): + assert isinstance(prefill, CudnnPrefillMetadata) + assert prefill.query_seq_lens is not None + output, lse = cudnn_batch_prefill_with_kv_cache( + q=q, + k_cache=k, + v_cache=v, + scale=self.scale, + workspace_buffer=prefill.cudnn_workspace, + max_token_per_sequence=prefill.max_query_len, + max_sequence_kv=prefill.max_query_len, + actual_seq_lens_q=prefill.query_seq_lens.view(-1, 1, 1, 1), + actual_seq_lens_kv=prefill.query_seq_lens.view(-1, 1, 1, 1), + causal=True, + return_lse=True, # do not support False for now + is_cuda_graph_compatible= + True, #Indicates actual_seq_lens are on GPU or CPU. + ) + if return_softmax_lse: + return output, lse + return output + def _run_prefill_context_chunk_fa(self, prefill: MLACommonPrefillMetadata, chunk_idx: int, q, k, v): assert prefill.chunked_context is not None @@ -908,6 +987,30 @@ def _run_prefill_context_chunk_fi(self, prefill: MLACommonPrefillMetadata, return_lse=True, ) + def _run_prefill_context_chunk_cudnn(self, + prefill: MLACommonPrefillMetadata, + chunk_idx: int, q, k, v): + assert isinstance(prefill, CudnnPrefillMetadata) + assert prefill.chunked_context is not None + assert prefill.chunked_context.seq_lens[chunk_idx] is not None + assert prefill.query_seq_lens is not None + return cudnn_batch_prefill_with_kv_cache( + q=q, + k_cache=k, + v_cache=v, + scale=self.scale, + workspace_buffer=prefill.cudnn_workspace, + max_token_per_sequence=prefill.max_query_len, + max_sequence_kv=prefill.chunked_context.max_seq_lens[chunk_idx], + actual_seq_lens_q=prefill.query_seq_lens.view(-1, 1, 1, 1), + actual_seq_lens_kv=prefill.chunked_context.seq_lens[chunk_idx]. + view(-1, 1, 1, 1), + causal=False, + return_lse=True, + is_cuda_graph_compatible= + True, #Indicates actual_seq_lens are on GPU or CPU. 
+ ) + def _v_up_proj(self, x): # Convert from (B, N, L) to (N, B, L) x = x.view(-1, self.num_heads, self.kv_lora_rank).transpose(0, 1) From a9b07e44489870ba99d525321461c5c31d1eae22 Mon Sep 17 00:00:00 2001 From: Chauncey Date: Wed, 16 Jul 2025 08:59:36 +0800 Subject: [PATCH 114/552] [Frontend] OpenAI Responses API supports input image (#20975) Signed-off-by: chaunceyjiang Signed-off-by: x22x22 --- .../openai/responses/test_image.py | 166 ++++++++++++++++++ vllm/entrypoints/chat_utils.py | 9 +- 2 files changed, 172 insertions(+), 3 deletions(-) create mode 100644 tests/v1/entrypoints/openai/responses/test_image.py diff --git a/tests/v1/entrypoints/openai/responses/test_image.py b/tests/v1/entrypoints/openai/responses/test_image.py new file mode 100644 index 00000000000..f3bce91e97c --- /dev/null +++ b/tests/v1/entrypoints/openai/responses/test_image.py @@ -0,0 +1,166 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import json + +import openai +import pytest +import pytest_asyncio + +from tests.utils import RemoteOpenAIServer +from vllm.multimodal.utils import encode_image_base64, fetch_image + +# Use a small vision model for testing +MODEL_NAME = "Qwen/Qwen2.5-VL-3B-Instruct" +MAXIMUM_IMAGES = 2 +# Test different image extensions (JPG/PNG) and formats (gray/RGB/RGBA) +TEST_IMAGE_URLS = [ + "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg", + "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png", + "https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Venn_diagram_rgb.svg/1280px-Venn_diagram_rgb.svg.png", + "https://upload.wikimedia.org/wikipedia/commons/0/0b/RGBA_comp.png", +] + + +@pytest.fixture(scope="module") +def default_image_server_args(): + return [ + "--enforce-eager", + "--max-model-len", + "6000", + "--max-num-seqs", + "128", + "--limit-mm-per-prompt", + json.dumps({"image": MAXIMUM_IMAGES}), + ] + + +@pytest.fixture(scope="module") +def image_server(default_image_server_args): + with RemoteOpenAIServer(MODEL_NAME, + default_image_server_args) as remote_server: + yield remote_server + + +@pytest_asyncio.fixture +async def client(image_server): + async with image_server.get_async_client() as async_client: + yield async_client + + +@pytest.fixture(scope="session") +def base64_encoded_image() -> dict[str, str]: + return { + image_url: encode_image_base64(fetch_image(image_url)) + for image_url in TEST_IMAGE_URLS + } + + +@pytest.mark.asyncio +@pytest.mark.parametrize("model_name", [MODEL_NAME]) +@pytest.mark.parametrize("image_url", TEST_IMAGE_URLS) +async def test_single_chat_session_image(client: openai.AsyncOpenAI, + model_name: str, image_url: str): + content_text = "What's in this image?" 
+ messages = [{ + "role": + "user", + "content": [ + { + "type": "input_image", + "image_url": image_url, + "detail": "auto", + }, + { + "type": "input_text", + "text": content_text + }, + ], + }] + + # test image url + response = await client.responses.create( + model=model_name, + input=messages, + ) + assert len(response.output_text) > 0 + + +@pytest.mark.asyncio +@pytest.mark.parametrize("model_name", [MODEL_NAME]) +@pytest.mark.parametrize("image_url", TEST_IMAGE_URLS) +async def test_single_chat_session_image_base64encoded( + client: openai.AsyncOpenAI, + model_name: str, + image_url: str, + base64_encoded_image: dict[str, str], +): + content_text = "What's in this image?" + messages = [{ + "role": + "user", + "content": [ + { + "type": "input_image", + "image_url": + f"data:image/jpeg;base64,{base64_encoded_image[image_url]}", + "detail": "auto", + }, + { + "type": "input_text", + "text": content_text + }, + ], + }] + # test image base64 + response = await client.responses.create( + model=model_name, + input=messages, + ) + assert len(response.output_text) > 0 + + +@pytest.mark.asyncio +@pytest.mark.parametrize("model_name", [MODEL_NAME]) +@pytest.mark.parametrize( + "image_urls", + [TEST_IMAGE_URLS[:i] for i in range(2, len(TEST_IMAGE_URLS))]) +async def test_multi_image_input(client: openai.AsyncOpenAI, model_name: str, + image_urls: list[str]): + messages = [{ + "role": + "user", + "content": [ + *({ + "type": "input_image", + "image_url": image_url, + "detail": "auto", + } for image_url in image_urls), + { + "type": "input_text", + "text": "What's in this image?" + }, + ], + }] + + if len(image_urls) > MAXIMUM_IMAGES: + with pytest.raises(openai.BadRequestError): # test multi-image input + await client.responses.create( + model=model_name, + input=messages, + ) + # the server should still work afterwards + response = await client.responses.create( + model=model_name, + input=[{ + "role": "user", + "content": "What's the weather like in Paris today?", + }], + ) + assert len(response.output_text) > 0 + else: + response = await client.responses.create( + model=model_name, + input=messages, + ) + assert len(response.output_text) > 0 diff --git a/vllm/entrypoints/chat_utils.py b/vllm/entrypoints/chat_utils.py index f5b7239cb30..496caef4256 100644 --- a/vllm/entrypoints/chat_utils.py +++ b/vllm/entrypoints/chat_utils.py @@ -28,6 +28,7 @@ ChatCompletionToolMessageParam) from openai.types.chat.chat_completion_content_part_input_audio_param import ( InputAudio) +from openai.types.responses import ResponseInputImageParam from PIL import Image from pydantic import BaseModel, ConfigDict, TypeAdapter # yapf: enable @@ -942,6 +943,8 @@ def _get_full_multimodal_text_prompt(placeholder_storage: dict[str, list], _AudioParser = TypeAdapter(ChatCompletionContentPartAudioParam).validate_python _VideoParser = TypeAdapter(ChatCompletionContentPartVideoParam).validate_python +_ResponsesInputImageParser = TypeAdapter( + ResponseInputImageParam).validate_python _ContentPart: TypeAlias = Union[str, dict[str, str], InputAudio, PILImage] # Define a mapping from part types to their corresponding parsing functions. 
@@ -953,6 +956,8 @@ def _get_full_multimodal_text_prompt(placeholder_storage: dict[str, list], lambda part: _TextParser(part).get("text", None), "input_text": lambda part: _TextParser(part).get("text", None), + "input_image": + lambda part: _ResponsesInputImageParser(part).get("image_url", None), "image_url": lambda part: _ImageParser(part).get("image_url", {}).get("url", None), "image_embeds": @@ -1085,10 +1090,8 @@ def _parse_chat_message_content_part( """ if isinstance(part, str): # Handle plain text parts return part - # Handle structured dictionary parts part_type, content = _parse_chat_message_content_mm_part(part) - # if part_type is text/refusal/image_url/audio_url/video_url/input_audio but # content is None, log a warning and skip if part_type in VALID_MESSAGE_CONTENT_MM_PART_TYPES and content is None: @@ -1109,7 +1112,7 @@ def _parse_chat_message_content_part( image_content = cast(Image.Image, content) mm_parser.parse_image_pil(image_content) modality = "image" - elif part_type == "image_url": + elif part_type in ("image_url", "input_image"): str_content = cast(str, content) mm_parser.parse_image(str_content) modality = "image" From fbc7b1494d6b666f7115ebf71fe27a7718d362a9 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Tue, 15 Jul 2025 22:18:41 -0400 Subject: [PATCH 115/552] [Frontend] Remove print left in FrontendArgs.add_cli_args (#21004) Signed-off-by: mgoin Signed-off-by: x22x22 --- vllm/entrypoints/openai/cli_args.py | 1 - 1 file changed, 1 deletion(-) diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index f8fdfe71bbe..bccce73b79f 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -192,7 +192,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: # Special case: allowed_origins, allowed_methods, allowed_headers all # need json.loads type # Should also remove nargs - print(frontend_kwargs["allowed_origins"]) frontend_kwargs["allowed_origins"]["type"] = json.loads frontend_kwargs["allowed_methods"]["type"] = json.loads frontend_kwargs["allowed_headers"]["type"] = json.loads From 8f38f928bd74d1adbf72ccc201c5b9c997af3def Mon Sep 17 00:00:00 2001 From: Thomas Parnell Date: Wed, 16 Jul 2025 04:19:10 +0200 Subject: [PATCH 116/552] [Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture (#20923) Signed-off-by: Thomas Parnell Signed-off-by: x22x22 --- vllm/model_executor/models/config.py | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/vllm/model_executor/models/config.py b/vllm/model_executor/models/config.py index 6c6f8e7268b..cb07fe7d9e1 100644 --- a/vllm/model_executor/models/config.py +++ b/vllm/model_executor/models/config.py @@ -205,6 +205,19 @@ def verify_and_update_config(vllm_config: "VllmConfig") -> None: } +class GraniteMoeHybridModelConfig(VerifyAndUpdateConfig): + + @staticmethod + def verify_and_update_config(vllm_config: "VllmConfig") -> None: + config = vllm_config.model_config + config.max_seq_len_to_capture = config.max_model_len + logger.info( + "Setting max_seq_len_to_capture to %d " + "to ensure that CUDA graph capture " + "covers sequences of length up to max_model_len.", + config.max_model_len) + + class HybridAttentionMambaModelConfig(VerifyAndUpdateConfig): @classmethod @@ -297,4 +310,5 @@ def verify_and_update_config(cls, vllm_config: "VllmConfig") -> None: "Qwen3ForSequenceClassification": Qwen3ForSequenceClassificationConfig, "XLMRobertaModel": JinaRobertaModelConfig, "JinaVLForRanking": 
JinaVLForSequenceClassificationConfig, + "GraniteMoeHybridForCausalLM": GraniteMoeHybridModelConfig, } From 8b25987d8efe15f23efcaf5ea488b2bd80873fdc Mon Sep 17 00:00:00 2001 From: Chauncey Date: Wed, 16 Jul 2025 10:42:16 +0800 Subject: [PATCH 117/552] [Misc] bump xgrammar version to v0.1.21 (#20992) Signed-off-by: chaunceyjiang Signed-off-by: x22x22 --- requirements/common.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/common.txt b/requirements/common.txt index 14e59f41a10..1876a7e9af0 100644 --- a/requirements/common.txt +++ b/requirements/common.txt @@ -25,7 +25,7 @@ outlines_core == 0.2.10 # required for outlines backend disk cache diskcache == 5.6.3 lark == 1.2.2 -xgrammar == 0.1.19; platform_machine == "x86_64" or platform_machine == "aarch64" or platform_machine == "arm64" +xgrammar == 0.1.21; platform_machine == "x86_64" or platform_machine == "aarch64" or platform_machine == "arm64" typing_extensions >= 4.10 filelock >= 3.16.1 # need to contain https://github.com/tox-dev/filelock/pull/317 partial-json-parser # used for parsing partial JSON outputs From d3f6221ad73564826e7de4f365b020c49bea8a5d Mon Sep 17 00:00:00 2001 From: Brayden Zhong Date: Tue, 15 Jul 2025 22:42:40 -0400 Subject: [PATCH 118/552] [Chore] Remove outdated transformers check (#20989) Signed-off-by: Brayden Zhong Signed-off-by: x22x22 --- vllm/model_executor/models/idefics3.py | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff --git a/vllm/model_executor/models/idefics3.py b/vllm/model_executor/models/idefics3.py index 4643468af4c..de216a81e93 100644 --- a/vllm/model_executor/models/idefics3.py +++ b/vllm/model_executor/models/idefics3.py @@ -22,8 +22,8 @@ import torch from torch import nn -from transformers import (AddedToken, BatchFeature, Idefics3Config, - Idefics3ImageProcessor, Idefics3Processor) +from transformers import (BatchFeature, Idefics3Config, Idefics3ImageProcessor, + Idefics3Processor) from vllm.config import VllmConfig from vllm.model_executor.layers.linear import ReplicatedLinear @@ -199,21 +199,14 @@ def get_num_patches( return grid_w * grid_h + 1 - # TODO: Remove after requiring transformers>=4.52 - def _get_content(self, token: Union[AddedToken, str]) -> str: - if isinstance(token, str): - return token - - return token.content - def _get_image_token( self, processor: Optional[Idefics3Processor]) -> tuple[str, str, str]: if processor is None: processor = self.get_hf_processor() - image_token = self._get_content(processor.image_token) - fake_image_token = self._get_content(processor.fake_image_token) + image_token = processor.image_token + fake_image_token = processor.fake_image_token global_image_token = processor.global_image_tag return image_token, fake_image_token, global_image_token From 60f5394098fc6a1de00fb479810fd0dbd0ead267 Mon Sep 17 00:00:00 2001 From: Reid <61492567+reidliu41@users.noreply.github.com> Date: Wed, 16 Jul 2025 10:43:19 +0800 Subject: [PATCH 119/552] [Misc] Refactor: Improve argument handling for `conda` command (#20481) Signed-off-by: reidliu41 Signed-off-by: x22x22 --- vllm/collect_env.py | 45 +++++++++++++++++++++++++-------------------- 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/vllm/collect_env.py b/vllm/collect_env.py index 64172a9bf91..ee43ad12e8a 100644 --- a/vllm/collect_env.py +++ b/vllm/collect_env.py @@ -96,25 +96,30 @@ def run(command): """Return (return-code, stdout, stderr).""" shell = True if type(command) is str else False - p = subprocess.Popen(command, - 
stdout=subprocess.PIPE, - stderr=subprocess.PIPE, - shell=shell) - raw_output, raw_err = p.communicate() - rc = p.returncode - if get_platform() == 'win32': - enc = 'oem' - else: - enc = locale.getpreferredencoding() - output = raw_output.decode(enc) - if command == 'nvidia-smi topo -m': - # don't remove the leading whitespace of `nvidia-smi topo -m` - # because they are meaningful - output = output.rstrip() - else: - output = output.strip() - err = raw_err.decode(enc) - return rc, output, err.strip() + try: + p = subprocess.Popen(command, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + shell=shell) + raw_output, raw_err = p.communicate() + rc = p.returncode + if get_platform() == 'win32': + enc = 'oem' + else: + enc = locale.getpreferredencoding() + output = raw_output.decode(enc) + if command == 'nvidia-smi topo -m': + # don't remove the leading whitespace of `nvidia-smi topo -m` + # because they are meaningful + output = output.rstrip() + else: + output = output.strip() + err = raw_err.decode(enc) + return rc, output, err.strip() + + except FileNotFoundError: + cmd_str = command if isinstance(command, str) else command[0] + return 127, '', f"Command not found: {cmd_str}" def run_and_read_all(run_lambda, command): @@ -148,7 +153,7 @@ def get_conda_packages(run_lambda, patterns=None): if patterns is None: patterns = DEFAULT_CONDA_PATTERNS conda = os.environ.get('CONDA_EXE', 'conda') - out = run_and_read_all(run_lambda, "{} list".format(conda)) + out = run_and_read_all(run_lambda, [conda, 'list']) if out is None: return out From 32ec0add1d7bc7c7fdc0df4e7175c869c34590b9 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Tue, 15 Jul 2025 22:46:56 -0400 Subject: [PATCH 120/552] [Docs] Enhance Anyscale documentation, add quickstart links for vLLM (#21018) Signed-off-by: Ricardo Decal Signed-off-by: x22x22 --- docs/deployment/frameworks/anyscale.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/docs/deployment/frameworks/anyscale.md b/docs/deployment/frameworks/anyscale.md index 5604f7f9615..9957c5b1413 100644 --- a/docs/deployment/frameworks/anyscale.md +++ b/docs/deployment/frameworks/anyscale.md @@ -3,6 +3,15 @@ [](){ #deployment-anyscale } [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray. -It hosts Ray clusters inside your own AWS, GCP, or Azure account, delivering the flexibility of open-source Ray -without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, or managing observability stacks. + +Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray +without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like . + When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm). 
+ +## Production-ready vLLM on Anyscale quickstarts + +- [Offline batch inference](https://console.anyscale.com/template-preview/llm_batch_inference?utm_source=vllm_docs) +- [Deploy vLLM services](https://console.anyscale.com/template-preview/llm_serving?utm_source=vllm_docs) +- [Curate a dataset](https://console.anyscale.com/template-preview/audio-dataset-curation-llm-judge?utm_source=vllm_docs) +- [Finetune an LLM](https://console.anyscale.com/template-preview/entity-recognition-with-llms?utm_source=vllm_docs) From 10ca140440cbf8495112e87af5a6fe4b817e3982 Mon Sep 17 00:00:00 2001 From: Ming Yang Date: Tue, 15 Jul 2025 19:53:42 -0700 Subject: [PATCH 121/552] =?UTF-8?q?[Bugfix]=20Correct=20per=5Fact=5Ftoken?= =?UTF-8?q?=20in=20CompressedTensorsW8A8Fp8MoECutlassM=E2=80=A6=20(#20937)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Ming Yang Signed-off-by: x22x22 --- .../compressed_tensors/compressed_tensors_moe.py | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py index baf4fec3cc6..c636e7e79bf 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py @@ -929,10 +929,8 @@ def apply( scoring_func=scoring_func, e_score_correction_bias=e_score_correction_bias) - a1_scale = layer.w13_input_scale - a2_scale = layer.w2_input_scale - per_act_token = a1_scale.numel() != 1 if a1_scale is not None else ( - a2_scale.numel() != 1 if a2_scale is not None else False) + per_act_token = ( + self.input_quant.strategy == QuantizationStrategy.TOKEN) if self.fused_experts is None: # If no modular kernel is provided, use cutlass_moe_fp8 @@ -950,8 +948,8 @@ def apply( expert_map=None if self.disable_expert_map else expert_map, w1_scale=layer.w13_weight_scale, w2_scale=layer.w2_weight_scale, - a1_scale=a1_scale, - a2_scale=a2_scale, + a1_scale=layer.w13_input_scale, + a2_scale=layer.w2_input_scale, ) else: return self.fused_experts( From 7ea906c4f03ea52f7fe94fc23980152c2c8597b6 Mon Sep 17 00:00:00 2001 From: Doug Smith Date: Tue, 15 Jul 2025 22:53:57 -0400 Subject: [PATCH 122/552] Add Dockerfile argument for VLLM_USE_PRECOMPILED environment (#20943) Signed-off-by: dougbtv Signed-off-by: x22x22 --- docker/Dockerfile | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/docker/Dockerfile b/docker/Dockerfile index 6ae4f789f05..78b548df32c 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -207,6 +207,19 @@ ARG SCCACHE_ENDPOINT ARG SCCACHE_BUCKET_NAME=vllm-build-sccache ARG SCCACHE_REGION_NAME=us-west-2 ARG SCCACHE_S3_NO_CREDENTIALS=0 + +# Flag to control whether to use pre-built vLLM wheels +ARG VLLM_USE_PRECOMPILED +# TODO: in setup.py VLLM_USE_PRECOMPILED is sensitive to truthiness, it will take =0 as "true", this should be fixed +ENV VLLM_USE_PRECOMPILED="" +RUN if [ "${VLLM_USE_PRECOMPILED}" = "1" ]; then \ + export VLLM_USE_PRECOMPILED=1 && \ + echo "Using precompiled wheels"; \ + else \ + unset VLLM_USE_PRECOMPILED && \ + echo "Leaving VLLM_USE_PRECOMPILED unset to build wheels from source"; \ + fi + # if USE_SCCACHE is set, use sccache to speed up compilation RUN --mount=type=cache,target=/root/.cache/uv \ --mount=type=bind,source=.git,target=.git \ From 7580345e5d5d19e7201ef5e5a3a7d837df2ffbb8 Mon Sep 17 
00:00:00 2001 From: "Chendi.Xue" Date: Tue, 15 Jul 2025 22:07:05 -0500 Subject: [PATCH 123/552] [CI][HPU] update for v0 deprecate by switching to VLLM_TARGET_DEVICE=empty (#21006) Signed-off-by: Chendi.Xue Signed-off-by: x22x22 --- .buildkite/scripts/hardware_ci/run-hpu-test.sh | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/.buildkite/scripts/hardware_ci/run-hpu-test.sh b/.buildkite/scripts/hardware_ci/run-hpu-test.sh index ae5b35a9ac6..dc9f2d39ba7 100644 --- a/.buildkite/scripts/hardware_ci/run-hpu-test.sh +++ b/.buildkite/scripts/hardware_ci/run-hpu-test.sh @@ -6,19 +6,17 @@ set -exuo pipefail # Try building the docker image cat < Date: Tue, 15 Jul 2025 23:08:41 -0400 Subject: [PATCH 124/552] [Bugfix] Fix Mistral3 support on SM100/SM120 (#20998) Signed-off-by: mgoin Signed-off-by: x22x22 --- vllm/model_executor/models/pixtral.py | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/vllm/model_executor/models/pixtral.py b/vllm/model_executor/models/pixtral.py index 475d65a58b2..325a264a2f4 100644 --- a/vllm/model_executor/models/pixtral.py +++ b/vllm/model_executor/models/pixtral.py @@ -43,6 +43,7 @@ PromptReplacement, PromptUpdate, PromptUpdateDetails) from vllm.multimodal.profiling import BaseDummyInputsBuilder, ProcessorInputs +from vllm.platforms import current_platform from vllm.sequence import IntermediateTensors from vllm.transformers_utils.tokenizer import (MistralTokenizer, cached_tokenizer_from_config) @@ -54,7 +55,12 @@ try: from xformers import ops as xops - USE_XFORMERS_OPS = True + if (current_platform.is_cuda() + and current_platform.has_device_capability(100)): + # Xformers FA is not compatible with B200 + USE_XFORMERS_OPS = False + else: + USE_XFORMERS_OPS = True except ImportError: USE_XFORMERS_OPS = False @@ -1082,7 +1088,6 @@ def forward( # Transpose q and k back for attention q = q.transpose(1, 2).contiguous() k = k.transpose(1, 2).contiguous() - out = xops.memory_efficient_attention(q, k, v, From 638be3a2b21957c686fc006302750705a46c5c1d Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Tue, 15 Jul 2025 23:09:13 -0400 Subject: [PATCH 125/552] [Doc] Remove duplicate docstring (#21012) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- vllm/model_executor/layers/quantization/utils/fp8_utils.py | 2 -- 1 file changed, 2 deletions(-) diff --git a/vllm/model_executor/layers/quantization/utils/fp8_utils.py b/vllm/model_executor/layers/quantization/utils/fp8_utils.py index c093a9bfc4a..20e7b444856 100644 --- a/vllm/model_executor/layers/quantization/utils/fp8_utils.py +++ b/vllm/model_executor/layers/quantization/utils/fp8_utils.py @@ -378,8 +378,6 @@ def per_token_group_quant_fp8( is supported for now. column_major_scales: Outputs scales in column major. out_q: Optional output tensor. If not provided, function will create. - tuple[torch.Tensor, torch.Tensor]: The quantized tensor and the - scaling factor for quantization. Returns: tuple[torch.Tensor, torch.Tensor]: The quantized tensor and the scaling factor. 
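For context, the docstring trimmed in the hunk above belongs to `per_token_group_quant_fp8` in `vllm/model_executor/layers/quantization/utils/fp8_utils.py`, which quantizes each token's activations in fixed-size groups and returns the quantized tensor together with per-group scaling factors. A minimal usage sketch follows; the positional `(x, group_size)` calling convention, the tensor shapes, the CUDA device, and the group size of 128 are illustrative assumptions rather than details taken from this patch, while the module path, the `column_major_scales` parameter, and the (quantized tensor, scale) return value come from the diff itself.

```python
# Illustrative sketch only: assumes a CUDA device and that the function
# accepts (x, group_size) positionally, as suggested by its docstring.
import torch

from vllm.model_executor.layers.quantization.utils.fp8_utils import (
    per_token_group_quant_fp8)

# 4 tokens whose hidden size is a multiple of the assumed quantization group.
x = torch.randn(4, 512, dtype=torch.bfloat16, device="cuda")

# Quantize each contiguous group of 128 elements per token to FP8.
x_q, x_scale = per_token_group_quant_fp8(x, 128, column_major_scales=False)

# x_q holds the FP8-quantized values; x_scale holds one scaling factor per
# 128-element group of each token (exact shapes and dtypes depend on the
# kernel configuration).
print(x_q.shape, x_q.dtype)
print(x_scale.shape)
```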
From af37b09b211011e6c9c3fb41e051e8de20eec33e Mon Sep 17 00:00:00 2001 From: Patrick von Platen Date: Wed, 16 Jul 2025 06:11:49 +0200 Subject: [PATCH 126/552] [Voxtral] Add more tests (#21010) Signed-off-by: Patrick von Platen Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 --- tests/conftest.py | 13 +- .../openai/test_transcription_validation.py | 3 - .../multimodal/generation/test_voxtral.py | 115 ++++++++++++++++++ tests/models/registry.py | 2 +- 4 files changed, 125 insertions(+), 8 deletions(-) create mode 100644 tests/models/multimodal/generation/test_voxtral.py diff --git a/tests/conftest.py b/tests/conftest.py index c5d7156905b..f3524d1fe2a 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -804,7 +804,7 @@ def __init__( def get_inputs( self, - prompts: Union[list[str], list[torch.Tensor]], + prompts: Union[list[str], list[torch.Tensor], list[int]], images: Optional[PromptImageInput] = None, videos: Optional[PromptVideoInput] = None, audios: Optional[PromptAudioInput] = None, @@ -826,11 +826,16 @@ def get_inputs( if audios is not None and (audio := audios[i]) is not None: multi_modal_data["audio"] = audio - text_prompt_kwargs = { - ("prompt" if isinstance(prompt, str) else "prompt_embeds"): - prompt, + text_prompt_kwargs: dict[str, Any] = { "multi_modal_data": multi_modal_data or None } + if isinstance(prompt, str): + text_prompt_kwargs["prompt"] = prompt + elif isinstance(prompt, list): + text_prompt_kwargs["prompt_token_ids"] = prompt + else: + text_prompt_kwargs["prompt_embeds"] = prompt + inputs.append(TextPrompt(**text_prompt_kwargs)) return inputs diff --git a/tests/entrypoints/openai/test_transcription_validation.py b/tests/entrypoints/openai/test_transcription_validation.py index 461b8aab2e9..a8e2eb40b15 100644 --- a/tests/entrypoints/openai/test_transcription_validation.py +++ b/tests/entrypoints/openai/test_transcription_validation.py @@ -47,9 +47,6 @@ async def test_basic_audio(mary_had_lamb, model_name): if model_name.startswith("mistralai"): server_args += MISTRAL_FORMAT_ARGS - # TODO(PATRICK) - REMOVE AFTER RELEASE - return # skip for now - # Based on https://github.com/openai/openai-cookbook/blob/main/examples/Whisper_prompting_guide.ipynb. 
with RemoteOpenAIServer(model_name, server_args) as remote_server: client = remote_server.get_async_client() diff --git a/tests/models/multimodal/generation/test_voxtral.py b/tests/models/multimodal/generation/test_voxtral.py new file mode 100644 index 00000000000..b4439dfe020 --- /dev/null +++ b/tests/models/multimodal/generation/test_voxtral.py @@ -0,0 +1,115 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import json + +import pytest +import pytest_asyncio +from mistral_common.audio import Audio +from mistral_common.protocol.instruct.messages import (AudioChunk, RawAudio, + TextChunk, UserMessage) + +from vllm.transformers_utils.tokenizer import MistralTokenizer + +from ....conftest import AudioTestAssets +from ....utils import RemoteOpenAIServer +from .test_ultravox import MULTI_AUDIO_PROMPT, run_multi_audio_test + +MODEL_NAME = "mistralai/Voxtral-Mini-3B-2507" +MISTRAL_FORMAT_ARGS = [ + "--tokenizer_mode", "mistral", "--config_format", "mistral", + "--load_format", "mistral" +] + + +@pytest.fixture() +def server(request, audio_assets: AudioTestAssets): + args = [ + "--enforce-eager", + "--limit-mm-per-prompt", + json.dumps({"audio": len(audio_assets)}), + ] + MISTRAL_FORMAT_ARGS + + with RemoteOpenAIServer(MODEL_NAME, + args, + env_dict={"VLLM_AUDIO_FETCH_TIMEOUT": + "30"}) as remote_server: + yield remote_server + + +@pytest_asyncio.fixture +async def client(server): + async with server.get_async_client() as async_client: + yield async_client + + +def _get_prompt(audio_assets, question): + tokenizer = MistralTokenizer.from_pretrained(MODEL_NAME) + + audios = [ + Audio.from_file(str(audio_assets[i].get_local_path()), strict=False) + for i in range(len(audio_assets)) + ] + audio_chunks = [ + AudioChunk(input_audio=RawAudio.from_audio(audio)) for audio in audios + ] + + text_chunk = TextChunk(text=question) + messages = [UserMessage(content=[*audio_chunks, text_chunk]).to_openai()] + + return tokenizer.apply_chat_template(messages=messages) + + +@pytest.mark.core_model +@pytest.mark.parametrize("dtype", ["half"]) +@pytest.mark.parametrize("max_tokens", [128]) +@pytest.mark.parametrize("num_logprobs", [5]) +def test_models_with_multiple_audios(vllm_runner, + audio_assets: AudioTestAssets, dtype: str, + max_tokens: int, + num_logprobs: int) -> None: + vllm_prompt = _get_prompt(audio_assets, MULTI_AUDIO_PROMPT) + run_multi_audio_test( + vllm_runner, + [(vllm_prompt, [audio.audio_and_sample_rate + for audio in audio_assets])], + MODEL_NAME, + dtype=dtype, + max_tokens=max_tokens, + num_logprobs=num_logprobs, + tokenizer_mode="mistral", + ) + + +@pytest.mark.asyncio +async def test_online_serving(client, audio_assets: AudioTestAssets): + """Exercises online serving with/without chunked prefill enabled.""" + + def asset_to_chunk(asset): + audio = Audio.from_file(str(asset.get_local_path()), strict=False) + audio.format = "wav" + audio_dict = AudioChunk.from_audio(audio).to_openai() + return audio_dict + + audio_chunks = [asset_to_chunk(asset) for asset in audio_assets] + messages = [{ + "role": + "user", + "content": [ + *audio_chunks, + { + "type": + "text", + "text": + f"What's happening in these {len(audio_assets)} audio clips?" 
+ }, + ], + }] + + chat_completion = await client.chat.completions.create(model=MODEL_NAME, + messages=messages, + max_tokens=10) + + assert len(chat_completion.choices) == 1 + choice = chat_completion.choices[0] + assert choice.finish_reason == "length" diff --git a/tests/models/registry.py b/tests/models/registry.py index 0bac0f8db15..d3b764780f7 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -440,7 +440,7 @@ def check_available_online( tokenizer="Isotr0py/Florence-2-tokenizer", # noqa: E501 trust_remote_code=True), # noqa: E501 "MllamaForConditionalGeneration": _HfExamplesInfo("meta-llama/Llama-3.2-11B-Vision-Instruct"), # noqa: E501 - "VoxtralForConditionalGeneration": _HfExamplesInfo("mistralai/Voxtral-Mini-3B-2507", is_available_online=False, tokenizer_mode="mistral"), # noqa: E501 + "VoxtralForConditionalGeneration": _HfExamplesInfo("mistralai/Voxtral-Mini-3B-2507", tokenizer_mode="mistral"), # noqa: E501 "WhisperForConditionalGeneration": _HfExamplesInfo("openai/whisper-large-v3"), # noqa: E501 # [Cross-encoder] From d1c5240cb878524b8bae78965515a52f309397c9 Mon Sep 17 00:00:00 2001 From: Maximilien de Bayser Date: Wed, 16 Jul 2025 01:12:14 -0300 Subject: [PATCH 127/552] Avoid direct comparison of floating point numbers (#21002) Signed-off-by: Max de Bayser Signed-off-by: x22x22 --- tests/entrypoints/openai/test_classification.py | 6 +++++- tests/entrypoints/openai/test_embedding.py | 17 +++++++++++++++-- tests/entrypoints/openai/test_pooling.py | 16 ++++++++++++++-- tests/entrypoints/openai/test_rerank.py | 6 +++++- tests/entrypoints/openai/test_score.py | 6 +++++- 5 files changed, 44 insertions(+), 7 deletions(-) diff --git a/tests/entrypoints/openai/test_classification.py b/tests/entrypoints/openai/test_classification.py index 330c7ff5c92..b2472658ca8 100644 --- a/tests/entrypoints/openai/test_classification.py +++ b/tests/entrypoints/openai/test_classification.py @@ -176,4 +176,8 @@ async def test_invocations(server: RemoteOpenAIServer): invocation_output = invocation_response.json() assert classification_output.keys() == invocation_output.keys() - assert classification_output["data"] == invocation_output["data"] + for classification_data, invocation_data in zip( + classification_output["data"], invocation_output["data"]): + assert classification_data.keys() == invocation_data.keys() + assert classification_data["probs"] == pytest.approx( + invocation_data["probs"], rel=0.01) diff --git a/tests/entrypoints/openai/test_embedding.py b/tests/entrypoints/openai/test_embedding.py index 143999edeaf..f03c96b1217 100644 --- a/tests/entrypoints/openai/test_embedding.py +++ b/tests/entrypoints/openai/test_embedding.py @@ -14,6 +14,7 @@ from ...models.language.pooling.embed_utils import ( run_embedding_correctness_test) +from ...models.utils import check_embeddings_close from ...utils import RemoteOpenAIServer MODEL_NAME = "intfloat/multilingual-e5-small" @@ -321,7 +322,13 @@ async def test_invocations(server: RemoteOpenAIServer, invocation_output = invocation_response.json() assert completion_output.keys() == invocation_output.keys() - assert completion_output["data"] == invocation_output["data"] + for completion_data, invocation_data in zip(completion_output["data"], + invocation_output["data"]): + assert completion_data.keys() == invocation_data.keys() + check_embeddings_close(embeddings_0_lst=[completion_data["embedding"]], + embeddings_1_lst=[invocation_data["embedding"]], + name_0="completion", + name_1="invocation") @pytest.mark.asyncio @@ -355,4 +362,10 @@ 
async def test_invocations_conversation(server: RemoteOpenAIServer): invocation_output = invocation_response.json() assert chat_output.keys() == invocation_output.keys() - assert chat_output["data"] == invocation_output["data"] + for chat_data, invocation_data in zip(chat_output["data"], + invocation_output["data"]): + assert chat_data.keys() == invocation_data.keys() + check_embeddings_close(embeddings_0_lst=[chat_data["embedding"]], + embeddings_1_lst=[invocation_data["embedding"]], + name_0="chat", + name_1="invocation") diff --git a/tests/entrypoints/openai/test_pooling.py b/tests/entrypoints/openai/test_pooling.py index 8752b128d54..02165ee6d58 100644 --- a/tests/entrypoints/openai/test_pooling.py +++ b/tests/entrypoints/openai/test_pooling.py @@ -281,7 +281,13 @@ async def test_invocations(server: RemoteOpenAIServer): invocation_output = invocation_response.json() assert completion_output.keys() == invocation_output.keys() - assert completion_output["data"] == invocation_output["data"] + for completion_data, invocation_data in zip(completion_output["data"], + invocation_output["data"]): + assert completion_data.keys() == invocation_data.keys() + check_embeddings_close(embeddings_0_lst=completion_data["data"], + embeddings_1_lst=invocation_data["data"], + name_0="completion", + name_1="invocation") @pytest.mark.asyncio @@ -314,4 +320,10 @@ async def test_invocations_conversation(server: RemoteOpenAIServer): invocation_output = invocation_response.json() assert chat_output.keys() == invocation_output.keys() - assert chat_output["data"] == invocation_output["data"] + for chat_data, invocation_data in zip(chat_output["data"], + invocation_output["data"]): + assert chat_data.keys() == invocation_data.keys() + check_embeddings_close(embeddings_0_lst=chat_data["data"], + embeddings_1_lst=invocation_data["data"], + name_0="chat", + name_1="invocation") diff --git a/tests/entrypoints/openai/test_rerank.py b/tests/entrypoints/openai/test_rerank.py index 16a947bc3fe..4da97fe1369 100644 --- a/tests/entrypoints/openai/test_rerank.py +++ b/tests/entrypoints/openai/test_rerank.py @@ -120,4 +120,8 @@ def test_invocations(server: RemoteOpenAIServer): invocation_output = invocation_response.json() assert rerank_output.keys() == invocation_output.keys() - assert rerank_output["results"] == invocation_output["results"] + for rerank_result, invocations_result in zip(rerank_output["results"], + invocation_output["results"]): + assert rerank_result.keys() == invocations_result.keys() + assert rerank_result["relevance_score"] == pytest.approx( + invocations_result["relevance_score"], rel=0.01) diff --git a/tests/entrypoints/openai/test_score.py b/tests/entrypoints/openai/test_score.py index 4d3bbd9decc..187542b7baf 100644 --- a/tests/entrypoints/openai/test_score.py +++ b/tests/entrypoints/openai/test_score.py @@ -215,4 +215,8 @@ def test_invocations(self, server: RemoteOpenAIServer, model: dict[str, invocation_output = invocation_response.json() assert score_output.keys() == invocation_output.keys() - assert score_output["data"] == invocation_output["data"] + for score_data, invocation_data in zip(score_output["data"], + invocation_output["data"]): + assert score_data.keys() == invocation_data.keys() + assert score_data["score"] == pytest.approx( + invocation_data["score"], rel=0.01) From 9fa34760161fd806e04380076db079f270fb5d03 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Wed, 16 Jul 2025 12:12:40 +0800 Subject: [PATCH 128/552] [CI] update typos config for CI pre-commit and fix some spells (#20919) 
Signed-off-by: Peter Pan Signed-off-by: x22x22 --- .pre-commit-config.yaml | 2 +- csrc/cpu/sgl-kernels/common.h | 2 +- csrc/cpu/sgl-kernels/gemm.h | 2 +- csrc/cpu/sgl-kernels/gemm_int8.cpp | 2 +- csrc/cpu/sgl-kernels/vec.h | 2 +- docker/Dockerfile | 2 +- docs/usage/v1_guide.md | 2 +- pyproject.toml | 183 ++++++++++++++++++ .../moe/modular_kernel_tools/common.py | 2 +- tests/kernels/moe/test_deepgemm.py | 2 +- tests/models/test_initialization.py | 2 +- tests/v1/test_external_lb_dp.py | 2 +- typos.toml | 179 ----------------- .../backends/differential_flash_attn.py | 2 +- vllm/entrypoints/openai/serving_responses.py | 2 +- .../layers/fused_moe/fused_moe.py | 2 +- vllm/model_executor/models/phi4flash.py | 2 +- vllm/v1/attention/backends/mla/common.py | 2 +- vllm/v1/worker/tpu_model_runner.py | 2 +- 19 files changed, 200 insertions(+), 196 deletions(-) delete mode 100644 typos.toml diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 24399677c08..5197820fb40 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -21,7 +21,7 @@ repos: - id: ruff-format files: ^(.buildkite|benchmarks|examples)/.* - repo: https://github.com/crate-ci/typos - rev: v1.32.0 + rev: v1.34.0 hooks: - id: typos - repo: https://github.com/PyCQA/isort diff --git a/csrc/cpu/sgl-kernels/common.h b/csrc/cpu/sgl-kernels/common.h index 20261c1ef3e..b96037e82c1 100644 --- a/csrc/cpu/sgl-kernels/common.h +++ b/csrc/cpu/sgl-kernels/common.h @@ -58,7 +58,7 @@ namespace { #define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous") #define CHECK_LAST_DIM_CONTIGUOUS(x) \ - TORCH_CHECK(x.strides()[x.strides().size() - 1] == 1, #x "must be contiguous at last dimention") + TORCH_CHECK(x.strides()[x.strides().size() - 1] == 1, #x "must be contiguous at last dimension") #define CHECK_INPUT(x) \ CHECK_CPU(x); \ diff --git a/csrc/cpu/sgl-kernels/gemm.h b/csrc/cpu/sgl-kernels/gemm.h index afae19721ae..fba5673323f 100644 --- a/csrc/cpu/sgl-kernels/gemm.h +++ b/csrc/cpu/sgl-kernels/gemm.h @@ -126,7 +126,7 @@ void fused_experts_int4_w4a16_kernel_impl( int64_t topk, int64_t num_tokens_post_pad); -// shared expert implememntation for int8 w8a8 +// shared expert implementation for int8 w8a8 template void shared_expert_int8_kernel_impl( scalar_t* __restrict__ output, diff --git a/csrc/cpu/sgl-kernels/gemm_int8.cpp b/csrc/cpu/sgl-kernels/gemm_int8.cpp index 5a0f65a9200..9a5ca0642e7 100644 --- a/csrc/cpu/sgl-kernels/gemm_int8.cpp +++ b/csrc/cpu/sgl-kernels/gemm_int8.cpp @@ -41,7 +41,7 @@ struct tinygemm_kernel_nn { __m512 vd0; __m512 vd1[COLS]; - // oops! 4x4 spills but luckly we use 4x2 + // oops! 4x4 spills but luckily we use 4x2 __m512 vbias[COLS]; // [NOTE]: s8s8 igemm compensation in avx512-vnni diff --git a/csrc/cpu/sgl-kernels/vec.h b/csrc/cpu/sgl-kernels/vec.h index 87955cfb292..160845c9b1c 100644 --- a/csrc/cpu/sgl-kernels/vec.h +++ b/csrc/cpu/sgl-kernels/vec.h @@ -37,7 +37,7 @@ inline Vectorized convert_from_float_ext(const Vecto #define CVT_FP16_TO_FP32(a) \ _mm512_cvtps_ph(a, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC)) -// this doesn't hanel NaN. +// this doesn't handle NaN. 
inline __m512bh cvt_e4m3_bf16_intrinsic_no_nan(__m256i fp8_vec) { const __m512i x = _mm512_cvtepu8_epi16(fp8_vec); diff --git a/docker/Dockerfile b/docker/Dockerfile index 78b548df32c..e0e08510c10 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -63,7 +63,7 @@ ARG PYTORCH_CUDA_NIGHTLY_INDEX_BASE_URL=https://download.pytorch.org/whl/nightly ARG PIP_KEYRING_PROVIDER=disabled ARG UV_KEYRING_PROVIDER=${PIP_KEYRING_PROVIDER} -# Flag enables build-in KV-connector dependency libs into docker images +# Flag enables built-in KV-connector dependency libs into docker images ARG INSTALL_KV_CONNECTORS=false #################### BASE BUILD IMAGE #################### diff --git a/docs/usage/v1_guide.md b/docs/usage/v1_guide.md index d7634223542..12150cf2a82 100644 --- a/docs/usage/v1_guide.md +++ b/docs/usage/v1_guide.md @@ -106,7 +106,7 @@ to enable simultaneous generation and embedding using the same engine instance i Models using selective state-space mechanisms instead of standard transformer attention are partially supported. Models that use Mamba-2 layers (e.g., `Mamba2ForCausalLM`) are supported, but models that use older Mamba-1 layers -(e.g., `MambaForCausalLM`, `JambaForCausalLM`) are not yet suported. Please note that these models currently require +(e.g., `MambaForCausalLM`, `JambaForCausalLM`) are not yet supported. Please note that these models currently require enforcing eager mode and disabling prefix caching in V1. Models that combine Mamba-2 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`, diff --git a/pyproject.toml b/pyproject.toml index 340abb38565..65ba0b4d833 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -174,3 +174,186 @@ respect-ignore-files = true [tool.ty.environment] python = "./.venv" + +[tool.typos.files] +# these files may be written in non english words +extend-exclude = ["tests/models/fixtures/*", "tests/prompts/*", + "benchmarks/sonnet.txt", "tests/lora/data/*", "build/*", + "vllm/third_party/*"] +ignore-hidden = true +ignore-files = true +ignore-dot = true +ignore-vcs = true +ignore-global = true +ignore-parent = true + +[tool.typos.default] +binary = false +check-filename = false +check-file = true +unicode = true +ignore-hex = true +identifier-leading-digits = false +locale = "en" +extend-ignore-identifiers-re = ["NVML_*", ".*Unc.*", ".*_thw", + ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", + ".*[Tt]h[rR].*"] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.default.extend-identifiers] +bbc5b7ede = "bbc5b7ede" +womens_doubles = "womens_doubles" +v_2nd = "v_2nd" +# splitted_input = "splitted_input" +NOOPs = "NOOPs" +typ = "typ" +nin_shortcut = "nin_shortcut" +UperNetDecoder = "UperNetDecoder" +subtile = "subtile" +cudaDevAttrMaxSharedMemoryPerBlockOptin = "cudaDevAttrMaxSharedMemoryPerBlockOptin" +SFOuput = "SFOuput" +# huggingface transformers repo uses these words +depthwise_seperable_out_channel = "depthwise_seperable_out_channel" +DepthWiseSeperableConv1d = "DepthWiseSeperableConv1d" +depthwise_seperable_CNN = "depthwise_seperable_CNN" + +[tool.typos.default.extend-words] +iy = "iy" +tendencias = "tendencias" +# intel cpu features +tme = "tme" +dout = "dout" +Pn = "Pn" +arange = "arange" + +[tool.typos.type.py] +extend-glob = [] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.py.extend-identifiers] +arange = "arange" +NDArray = "NDArray" +EOFError = "EOFError" +fo = "fo" +ba = "ba" + +[tool.typos.type.py.extend-words] + 
+[tool.typos.type.cpp] +extend-glob = ["*.cu"] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.cpp.extend-identifiers] +countr_one = "countr_one" +k_ot = "k_ot" +ot = "ot" + +[tool.typos.type.cpp.extend-words] + +[tool.typos.type.rust] +extend-glob = [] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.rust.extend-identifiers] +flate2 = "flate2" + +[tool.typos.type.rust.extend-words] +ser = "ser" + +[tool.typos.type.lock] +extend-glob = [] +check-file = false +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.lock.extend-identifiers] + +[tool.typos.type.lock.extend-words] + +[tool.typos.type.jl] +extend-glob = [] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.jl.extend-identifiers] + +[tool.typos.type.jl.extend-words] +modul = "modul" +egals = "egals" +usig = "usig" +egal = "egal" + +[tool.typos.type.go] +extend-glob = [] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.go.extend-identifiers] +flate = "flate" + +[tool.typos.type.go.extend-words] + +[tool.typos.type.css] +extend-glob = [] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.css.extend-identifiers] +nd = "nd" + +[tool.typos.type.css.extend-words] + +[tool.typos.type.man] +extend-glob = [] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.man.extend-identifiers] +Nd = "Nd" + +[tool.typos.type.man.extend-words] + +[tool.typos.type.cert] +extend-glob = [] +check-file = false +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.cert.extend-identifiers] + +[tool.typos.type.cert.extend-words] + +[tool.typos.type.sh] +extend-glob = [] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.sh.extend-identifiers] +ot = "ot" + +[tool.typos.type.sh.extend-words] + +[tool.typos.type.vimscript] +extend-glob = [] +extend-ignore-identifiers-re = [] +extend-ignore-words-re = [] +extend-ignore-re = [] + +[tool.typos.type.vimscript.extend-identifiers] +windo = "windo" + +[tool.typos.type.vimscript.extend-words] diff --git a/tests/kernels/moe/modular_kernel_tools/common.py b/tests/kernels/moe/modular_kernel_tools/common.py index a1319ab0509..fd99e8dc5c9 100644 --- a/tests/kernels/moe/modular_kernel_tools/common.py +++ b/tests/kernels/moe/modular_kernel_tools/common.py @@ -416,7 +416,7 @@ def make_hidden_states( # We dequant and use that as hidden_states so the tests are stable. # quantizing and dequantizing yield slightly different results # depending on the hardware. Here we, quantize and dequantize - # first - so further quantize and dequantize will yeild the same + # first - so further quantize and dequantize will yield the same # values. 
if config.is_per_tensor_act_quant: a_q, a_scales = ops.scaled_fp8_quant( diff --git a/tests/kernels/moe/test_deepgemm.py b/tests/kernels/moe/test_deepgemm.py index 1460fdd3aea..f7578e22691 100644 --- a/tests/kernels/moe/test_deepgemm.py +++ b/tests/kernels/moe/test_deepgemm.py @@ -95,7 +95,7 @@ def run_single_case(m, n, k, topk, num_experts, block_size): topk_weights, topk_ids = torch.topk(router_logits, k=topk, dim=-1) topk_weights = torch.nn.functional.softmax(topk_weights, dim=-1) - # triton referrence + # triton reference out_triton = fused_experts( hidden_states=tokens_bf16, w1=w1, diff --git a/tests/models/test_initialization.py b/tests/models/test_initialization.py index ea6a2cc37cc..2d12327dc2e 100644 --- a/tests/models/test_initialization.py +++ b/tests/models/test_initialization.py @@ -43,7 +43,7 @@ def hf_overrides(hf_config: PretrainedConfig) -> PretrainedConfig: text_config = hf_config.get_text_config() # Ensure at least 2 expert per group - # Since `grouped_topk` assums top-2 + # Since `grouped_topk` assumes top-2 n_group = getattr(text_config, 'n_group', None) num_experts = n_group * 2 if n_group is not None else 2 diff --git a/tests/v1/test_external_lb_dp.py b/tests/v1/test_external_lb_dp.py index 17952dfb0d9..98fefad1ff4 100644 --- a/tests/v1/test_external_lb_dp.py +++ b/tests/v1/test_external_lb_dp.py @@ -17,7 +17,7 @@ # Number of data parallel ranks for external LB testing DP_SIZE = int(os.getenv("DP_SIZE", "2")) -# Default tensor parallell size to use +# Default tensor parallel size to use TP_SIZE = int(os.getenv("TP_SIZE", "1")) diff --git a/typos.toml b/typos.toml deleted file mode 100644 index f51ce2f3620..00000000000 --- a/typos.toml +++ /dev/null @@ -1,179 +0,0 @@ -[files] -# these files may be written in non english words -extend-exclude = ["tests/models/fixtures/*", "tests/prompts/*", - "benchmarks/sonnet.txt", "tests/lora/data/*", "build/*", - "vllm/third_party/*"] -ignore-hidden = true -ignore-files = true -ignore-dot = true -ignore-vcs = true -ignore-global = true -ignore-parent = true - -[default] -binary = false -check-filename = false -check-file = true -unicode = true -ignore-hex = true -identifier-leading-digits = false -locale = "en" -extend-ignore-identifiers-re = ["NVML_*", ".*Unc.*", ".*_thw", - ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*", - ".*ot.*", ".*[Tt]h[rR].*"] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[default.extend-identifiers] -bbc5b7ede = "bbc5b7ede" -womens_doubles = "womens_doubles" -v_2nd = "v_2nd" -splitted_input = "splitted_input" -NOOPs = "NOOPs" -typ = "typ" -nin_shortcut = "nin_shortcut" -UperNetDecoder = "UperNetDecoder" -subtile = "subtile" -cudaDevAttrMaxSharedMemoryPerBlockOptin = "cudaDevAttrMaxSharedMemoryPerBlockOptin" -SFOuput = "SFOuput" -# huggingface transformers repo uses these words -depthwise_seperable_out_channel = "depthwise_seperable_out_channel" -DepthWiseSeperableConv1d = "DepthWiseSeperableConv1d" -depthwise_seperable_CNN = "depthwise_seperable_CNN" - -[default.extend-words] -iy = "iy" -tendencias = "tendencias" -# intel cpu features -tme = "tme" -dout = "dout" -Pn = "Pn" -arange = "arange" - -[type.py] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.py.extend-identifiers] -arange = "arange" -NDArray = "NDArray" -EOFError = "EOFError" - -[type.py.extend-words] - -[type.cpp] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.cpp.extend-identifiers] 
-countr_one = "countr_one" - -[type.cpp.extend-words] - -[type.rust] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.rust.extend-identifiers] -flate2 = "flate2" - -[type.rust.extend-words] -ser = "ser" - -[type.lock] -extend-glob = [] -check-file = false -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.lock.extend-identifiers] - -[type.lock.extend-words] - -[type.jl] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.jl.extend-identifiers] - -[type.jl.extend-words] -modul = "modul" -egals = "egals" -usig = "usig" -egal = "egal" - -[type.go] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.go.extend-identifiers] -flate = "flate" - -[type.go.extend-words] - -[type.css] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.css.extend-identifiers] -nd = "nd" - -[type.css.extend-words] - -[type.man] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.man.extend-identifiers] -Nd = "Nd" - -[type.man.extend-words] - -[type.cert] -extend-glob = [] -check-file = false -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.cert.extend-identifiers] - -[type.cert.extend-words] - -[type.sh] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.sh.extend-identifiers] -stap = "stap" -ot = "ot" - -[type.sh.extend-words] - -[type.vimscript] -extend-glob = [] -extend-ignore-identifiers-re = [] -extend-ignore-words-re = [] -extend-ignore-re = [] - -[type.vimscript.extend-identifiers] -windo = "windo" - -[type.vimscript.extend-words] diff --git a/vllm/attention/backends/differential_flash_attn.py b/vllm/attention/backends/differential_flash_attn.py index 7c35e58967d..1c139952371 100644 --- a/vllm/attention/backends/differential_flash_attn.py +++ b/vllm/attention/backends/differential_flash_attn.py @@ -961,7 +961,7 @@ def forward( "... H (two D) -> ... (H two) D", two=2) - else: # re-use the kv cache, full attention + else: # reuse the kv cache, full attention q = q.view(-1, self.num_heads, self.head_size) q1, q2 = self.split_heads(q) # kv_cache shape is (2, num_blocks, block_size, num_kv_heads, head_size) # noqa: E501 diff --git a/vllm/entrypoints/openai/serving_responses.py b/vllm/entrypoints/openai/serving_responses.py index f7bde6e243b..a359371848c 100644 --- a/vllm/entrypoints/openai/serving_responses.py +++ b/vllm/entrypoints/openai/serving_responses.py @@ -372,7 +372,7 @@ def _construct_input_messages( }) # Append the new input. - # Reponses API supports simple text inputs without chat format. + # Responses API supports simple text inputs without chat format. if isinstance(request.input, str): messages.append({"role": "user", "content": request.input}) else: diff --git a/vllm/model_executor/layers/fused_moe/fused_moe.py b/vllm/model_executor/layers/fused_moe/fused_moe.py index f0bffc7dae2..079486dd438 100644 --- a/vllm/model_executor/layers/fused_moe/fused_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_moe.py @@ -1172,7 +1172,7 @@ def fused_experts( allow_cutlass_block_scaled_grouped_gemm: bool = False) -> torch.Tensor: # For now, disable DeepGemm for small N (<= 512) until better # permute/unpermute ops are available. 
- # However, on B200, we use DeepGemm for all cases becuase they only support + # However, on B200, we use DeepGemm for all cases because they only support # E8M0 scale, which means we requantize the weight and input to the specific # scale. Fallen back to cutlass or triton for some cases would cause # accuracy issue. diff --git a/vllm/model_executor/models/phi4flash.py b/vllm/model_executor/models/phi4flash.py index 10f8b6552af..c1dd9fab7fa 100644 --- a/vllm/model_executor/models/phi4flash.py +++ b/vllm/model_executor/models/phi4flash.py @@ -193,7 +193,7 @@ def forward( ], dim=-1) attn_output = self.attn(q, k, v) - else: # re-use the kv cache, full attention + else: # reuse the kv cache, full attention q = self.Wqkv(hidden_states) attn_output = self.attn(q, None, None) attn_output = attn_output.view(-1, self.num_heads * self.head_dim) diff --git a/vllm/v1/attention/backends/mla/common.py b/vllm/v1/attention/backends/mla/common.py index 381a92a8309..173c8466f6d 100755 --- a/vllm/v1/attention/backends/mla/common.py +++ b/vllm/v1/attention/backends/mla/common.py @@ -394,7 +394,7 @@ def use_cudnn_prefill() -> bool: # Currently 394MB, this can be tuned based on GEMM sizes used. -# Choosen to be the same as sglang: +# Chosen to be the same as sglang: # https://github.com/sgl-project/sglang/blob/766392c6bda2558b61ce6d1c1bfd8081a549e1f1/python/sglang/global_config.py#L37 FLASHINFER_WORKSPACE_BUFFER_SIZE = 394 * 1024 * 1024 diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 83a80bd865b..6ac06929935 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -969,7 +969,7 @@ def execute_model( else: mm_embeds = [] xm.mark_step() - # Prepare inputs, the requests might be splitted into multiple + # Prepare inputs, the requests might be split into multiple # executions, combine the result of each execution. 
start_index = 0 combined_selected_tokens: list[torch.Tensor] = [] From 2a16813330934ecde5c1b5cc49fb80d652b1ed1b Mon Sep 17 00:00:00 2001 From: zhiweiz Date: Tue, 15 Jul 2025 21:14:15 -0700 Subject: [PATCH 129/552] [Meta] Llama4 EAGLE Support (#20591) Signed-off-by: qizixi Co-authored-by: qizixi Signed-off-by: x22x22 --- examples/offline_inference/spec_decode.py | 1 + tests/models/registry.py | 5 + tests/models/test_initialization.py | 5 + tests/v1/e2e/test_spec_decode.py | 48 +++-- vllm/model_executor/models/llama4_eagle.py | 214 +++++++++++++++++++++ vllm/model_executor/models/registry.py | 1 + 6 files changed, 257 insertions(+), 17 deletions(-) create mode 100644 vllm/model_executor/models/llama4_eagle.py diff --git a/examples/offline_inference/spec_decode.py b/examples/offline_inference/spec_decode.py index 26e492fed25..ce735f3b27d 100644 --- a/examples/offline_inference/spec_decode.py +++ b/examples/offline_inference/spec_decode.py @@ -84,6 +84,7 @@ def main(): gpu_memory_utilization=0.8, speculative_config=speculative_config, disable_log_stats=False, + max_model_len=16384, ) sampling_params = SamplingParams(temperature=args.temp, max_tokens=args.output_len) diff --git a/tests/models/registry.py b/tests/models/registry.py index d3b764780f7..d2e70e291df 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -465,6 +465,11 @@ def check_available_online( trust_remote_code=True, speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", tokenizer="meta-llama/Llama-3.1-8B-Instruct"), + "EagleLlama4ForCausalLM": _HfExamplesInfo( + "morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct", + trust_remote_code=True, + speculative_model="morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct", + tokenizer="meta-llama/Llama-4-Scout-17B-16E-Instruct"), # noqa: E501 "EagleMiniCPMForCausalLM": _HfExamplesInfo("openbmb/MiniCPM-1B-sft-bf16", trust_remote_code=True, is_available_online=False, diff --git a/tests/models/test_initialization.py b/tests/models/test_initialization.py index 2d12327dc2e..52005e74ef7 100644 --- a/tests/models/test_initialization.py +++ b/tests/models/test_initialization.py @@ -36,6 +36,11 @@ def test_can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch): "KimiVLForConditionalGeneration"): pytest.skip("Avoid OOM") + if model_arch in ("Llama4ForCausalLM", "EagleLlama4ForCausalLM"): + from vllm.model_executor.models.llama4 import Llama4ForCausalLM + from vllm.model_executor.models.registry import ModelRegistry + ModelRegistry.register_model("Llama4ForCausalLM", Llama4ForCausalLM) + # Avoid OOM and reduce initialization time by only using 1 layer def hf_overrides(hf_config: PretrainedConfig) -> PretrainedConfig: hf_config.update(model_info.hf_overrides) diff --git a/tests/v1/e2e/test_spec_decode.py b/tests/v1/e2e/test_spec_decode.py index 93e7c12f3a0..2423f966acf 100644 --- a/tests/v1/e2e/test_spec_decode.py +++ b/tests/v1/e2e/test_spec_decode.py @@ -6,8 +6,10 @@ from typing import Any import pytest +import torch from vllm import LLM, SamplingParams +from vllm.distributed import cleanup_dist_env_and_memory @pytest.fixture @@ -53,14 +55,6 @@ def model_name(): return "meta-llama/Llama-3.1-8B-Instruct" -def eagle_model_name(): - return "yuhuili/EAGLE-LLaMA3.1-Instruct-8B" - - -def eagle3_model_name(): - return "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B" - - def test_ngram_correctness( monkeypatch: pytest.MonkeyPatch, test_prompts: list[list[dict[str, Any]]], @@ -77,6 +71,8 @@ def test_ngram_correctness( ref_llm = LLM(model=model_name, max_model_len=1024) ref_outputs = 
ref_llm.chat(test_prompts, sampling_config) del ref_llm + torch.cuda.empty_cache() + cleanup_dist_env_and_memory() spec_llm = LLM( model=model_name, @@ -103,34 +99,50 @@ def test_ngram_correctness( # Upon failure, inspect the outputs to check for inaccuracy. assert matches > int(0.7 * len(ref_outputs)) del spec_llm - - -@pytest.mark.parametrize("use_eagle3", [False, True], ids=["eagle", "eagle3"]) + torch.cuda.empty_cache() + cleanup_dist_env_and_memory() + + +@pytest.mark.parametrize("model_setup", [ + ("eagle", "meta-llama/Llama-3.1-8B-Instruct", + "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", 1), + ("eagle3", "meta-llama/Llama-3.1-8B-Instruct", + "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", 1), + pytest.param( + ("eagle", "meta-llama/Llama-4-Scout-17B-16E-Instruct", + "morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct", 4), + marks=pytest.mark.skip(reason="Skipping due to CI OOM issues")), +], + ids=["llama3_eagle", "llama3_eagle3", "llama4_eagle"]) def test_eagle_correctness( monkeypatch: pytest.MonkeyPatch, test_prompts: list[list[dict[str, Any]]], sampling_config: SamplingParams, - model_name: str, - use_eagle3: bool, + model_setup: tuple[str, str, str, int], ): ''' Compare the outputs of a original LLM and a speculative LLM should be the same when using eagle speculative decoding. + model_setup: (method, model_name, eagle_model_name, tp_size) ''' with monkeypatch.context() as m: m.setenv("VLLM_USE_V1", "1") + method, model_name, spec_model_name, tp_size = model_setup - ref_llm = LLM(model=model_name, max_model_len=2048) + ref_llm = LLM(model=model_name, + max_model_len=2048, + tensor_parallel_size=tp_size) ref_outputs = ref_llm.chat(test_prompts, sampling_config) del ref_llm + torch.cuda.empty_cache() + cleanup_dist_env_and_memory() - spec_model_name = eagle3_model_name( - ) if use_eagle3 else eagle_model_name() spec_llm = LLM( model=model_name, trust_remote_code=True, + tensor_parallel_size=tp_size, speculative_config={ - "method": "eagle3" if use_eagle3 else "eagle", + "method": method, "model": spec_model_name, "num_speculative_tokens": 3, "max_model_len": 2048, @@ -152,3 +164,5 @@ def test_eagle_correctness( # Upon failure, inspect the outputs to check for inaccuracy. assert matches > int(0.66 * len(ref_outputs)) del spec_llm + torch.cuda.empty_cache() + cleanup_dist_env_and_memory() diff --git a/vllm/model_executor/models/llama4_eagle.py b/vllm/model_executor/models/llama4_eagle.py new file mode 100644 index 00000000000..222ab5dfaee --- /dev/null +++ b/vllm/model_executor/models/llama4_eagle.py @@ -0,0 +1,214 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# Copyright 2025 the LLAMA4, Meta Inc., vLLM, and HuggingFace Inc. team. +# All rights reserved. +# +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from collections.abc import Iterable +from typing import Optional + +import torch +import torch.nn as nn + +from vllm.compilation.decorators import support_torch_compile +from vllm.config import VllmConfig +from vllm.distributed.parallel_state import get_pp_group +from vllm.logger import init_logger +from vllm.model_executor.layers.layernorm import RMSNorm +from vllm.model_executor.layers.logits_processor import LogitsProcessor +from vllm.model_executor.layers.quantization.base_config import ( + QuantizationConfig) +from vllm.model_executor.layers.quantization.torchao import TorchAOConfig +from vllm.model_executor.layers.vocab_parallel_embedding import ( + VocabParallelEmbedding) +from vllm.model_executor.model_loader.weight_utils import default_weight_loader +from vllm.model_executor.models.llama4 import (Llama4DecoderLayer, + Llama4ForCausalLM) +from vllm.model_executor.models.utils import extract_layer_index + +from .utils import AutoWeightsLoader, maybe_prefix + +logger = init_logger(__name__) + + +@support_torch_compile +class LlamaModel(nn.Module): + + def __init__( + self, + *, + vllm_config: VllmConfig, + prefix: str = "", + start_layer_id: int = 0, + quant_config: Optional[QuantizationConfig] = None, + ) -> None: + super().__init__() + self.config = ( + vllm_config.speculative_config.draft_model_config.hf_config) + self.validate_and_update_config(start_layer_id, quant_config) + self.vocab_size = self.config.vocab_size + self.embed_tokens = VocabParallelEmbedding( + self.config.vocab_size, + self.config.hidden_size, + prefix=maybe_prefix(prefix, "embed_tokens"), + ) + + self.layers = nn.ModuleList([ + Llama4DecoderLayer( + self.config, + quant_config=quant_config, + prefix=maybe_prefix(prefix, f"layers.{i + start_layer_id}"), + ) for i in range(self.config.num_hidden_layers) + ]) + self.fc = torch.nn.Linear(self.config.hidden_size * 2, + self.config.hidden_size, + bias=False) + self.norm = RMSNorm(self.config.hidden_size, + eps=self.config.rms_norm_eps) + + def forward( + self, + input_ids: Optional[torch.Tensor], + positions: torch.Tensor, + hidden_states: torch.Tensor, + ) -> tuple[torch.Tensor, torch.Tensor]: + input_embeds = self.embed_tokens(input_ids) + hidden_states = self.fc( + torch.cat((input_embeds, hidden_states), dim=-1)) + residual = None + for layer in self.layers: + hidden_states, residual = layer( + positions, + hidden_states, + residual, + ) + hidden_states, _ = self.norm(hidden_states, residual) + return hidden_states, hidden_states + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + (".qkv_proj", ".q_proj", "q"), + (".qkv_proj", ".k_proj", "k"), + (".qkv_proj", ".v_proj", "v"), + (".gate_up_proj", ".gate_proj", 0), + (".gate_up_proj", ".up_proj", 1), + ] + params_dict = dict(self.named_parameters()) + loaded_params: set[str] = set() + for name, loaded_weight in weights: + name = name.removeprefix("model.") + for param_name, weight_name, shard_id in stacked_params_mapping: + if weight_name not in name: + continue + name = name.replace(weight_name, param_name) + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, loaded_weight, shard_id) + break + else: + # if PP disabled then draft will share embed with target + if get_pp_group().world_size == 1 and \ + "embed_tokens." 
in name: + continue + param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + for name in params_dict: + # if PP disabled then draft will share embed with target + if get_pp_group().world_size == 1 and \ + "embed_tokens." in name: + continue + assert name in loaded_params, f"{name} is not loaded!" + return loaded_params + + def validate_and_update_config( + self, + start_layer_id: int, + quant_config: Optional[QuantizationConfig] = None) -> None: + # yoco and moe is not supported by draft model yet + assert self.config.yoco_global_kv_layer is None + assert self.config.yoco_local_kv_layer is None + assert len(self.config.moe_layers) == 0 + # draft model layer index is increased by start_layer_id, + # so we need to pad relevant configs accordingly + self.config.no_rope_layers = [ + 0 + ] * start_layer_id + self.config.no_rope_layers + # currently only TorchAO quantization is supported + if isinstance(quant_config, TorchAOConfig): + + def pad_layer_name(layer: str) -> str: + layer_index = extract_layer_index(layer) + return layer.replace(str(layer_index), + str(layer_index + start_layer_id)) + + quant_config.torchao_config.module_fqn_to_config = { + pad_layer_name(layer): quantization + for layer, quantization in + quant_config.torchao_config.module_fqn_to_config.items() + } + + +class EagleLlama4ForCausalLM(Llama4ForCausalLM): + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + nn.Module.__init__(self) + self.config = ( + vllm_config.speculative_config.draft_model_config.hf_config) + target_layer_num = vllm_config.model_config.get_num_layers( + vllm_config.parallel_config) + # draft model quantization config may differ from target model + quant_config = VllmConfig.get_quantization_config( + vllm_config.speculative_config.draft_model_config, + vllm_config.load_config) + self.model = LlamaModel(vllm_config=vllm_config, + prefix="model", + start_layer_id=target_layer_num, + quant_config=quant_config) + logit_scale = getattr(self.config, "logit_scale", 1.0) + self.logits_processor = LogitsProcessor(self.config.vocab_size, + scale=logit_scale) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + hidden_states: torch.Tensor, + ) -> tuple[torch.Tensor, torch.Tensor]: + return self.model(input_ids, positions, hidden_states) + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> None: + loader = AutoWeightsLoader( + self, + # lm_head is tied with target model (Llama4ForCausalLM) + skip_prefixes=(["lm_head."]), + ) + + model_weights = {} + weights = [ + self.permute_qk_weight_for_rotary(name, loaded_weight) + for name, loaded_weight in weights + ] + for name, loaded_weight in weights: + if "lm_head" not in name: + name = "model." 
+ name + model_weights[name] = loaded_weight + + loader.load_weights(model_weights.items()) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index b7f9638d322..bc936500bdc 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -244,6 +244,7 @@ "MiMoMTPModel": ("mimo_mtp", "MiMoMTP"), "EAGLEModel": ("eagle", "EAGLE"), "EagleLlamaForCausalLM": ("llama_eagle", "EagleLlamaForCausalLM"), + "EagleLlama4ForCausalLM": ("llama4_eagle", "EagleLlama4ForCausalLM"), "EagleMiniCPMForCausalLM": ("minicpm_eagle", "EagleMiniCPMForCausalLM"), "Eagle3LlamaForCausalLM": ("llama_eagle3", "Eagle3LlamaForCausalLM"), "DeepSeekMTPModel": ("deepseek_mtp", "DeepSeekMTP"), From 4facedc651839f450bb3f287260b2ad464bda2e2 Mon Sep 17 00:00:00 2001 From: Chengji Yao Date: Tue, 15 Jul 2025 21:39:48 -0700 Subject: [PATCH 130/552] [TPU] fix kv_cache_update kernel block size choosing logic (#21007) Signed-off-by: Chengji Yao Signed-off-by: x22x22 --- vllm/v1/attention/backends/pallas.py | 49 +++++++++++++++++++++++++++- vllm/v1/worker/tpu_model_runner.py | 5 +-- 2 files changed, 51 insertions(+), 3 deletions(-) diff --git a/vllm/v1/attention/backends/pallas.py b/vllm/v1/attention/backends/pallas.py index 32ef5dc2e36..b7fc1ffeb65 100644 --- a/vllm/v1/attention/backends/pallas.py +++ b/vllm/v1/attention/backends/pallas.py @@ -326,7 +326,54 @@ def kv_cache_update_op_non_xla(kv: torch.Tensor, slot_mapping: torch.Tensor, return kv_cache +# We can move this function to a common utils file if it's also useful for other +# hardware. +def dtype_bits(dtype: torch.dtype): + if dtype.is_floating_point: + try: + return torch.finfo(dtype).bits + except TypeError: + pass + elif dtype.is_complex: + if dtype is torch.complex32: + return 32 + elif dtype is torch.complex64: + return 64 + elif dtype is torch.complex128: + return 128 + else: + try: + return torch.iinfo(dtype).bits + # torch.iinfo cannot support int4, int2, bits8... + except TypeError: + pass + str_dtype = str(dtype) + # support torch.int4, torch.int5, torch.uint5... + if str_dtype.startswith("torch.int") or str_dtype.startswith("torch.uint"): + return int(str_dtype[-1]) + raise TypeError(f"Getting the bit width of {dtype} is not supported") + + +def get_dtype_packing(dtype): + bits = dtype_bits(dtype) + if 32 % bits != 0: + raise ValueError( + f"The bit width must be divisible by 32, but got bits={bits}, " + "dtype={dtype}") + return 32 // bits + + def get_page_size_bytes(block_size: int, num_kv_heads: int, head_size: int, kv_cache_dtype: torch.dtype) -> int: """Returns the size in bytes of one page of the KV cache.""" - return block_size * num_kv_heads * head_size * kv_cache_dtype.itemsize + padded_head_size = cdiv(head_size, + TPU_HEAD_SIZE_ALIGNMENT) * TPU_HEAD_SIZE_ALIGNMENT + num_combined_kv_heads = num_kv_heads * 2 + + # NOTE: for the implicit padding in XLA + packing = get_dtype_packing(kv_cache_dtype) + num_combined_kv_heads = cdiv(num_combined_kv_heads, packing) * packing + + kv_cache_dtype_bits = dtype_bits(kv_cache_dtype) + return (block_size * num_combined_kv_heads * padded_head_size * + kv_cache_dtype_bits // 8) diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 6ac06929935..ad62d204381 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -1863,8 +1863,9 @@ def _get_num_slices_per_kv_cache_update_block(page_size_bytes: int) -> int: out of scalar registers. 
Thus this function will limit the number of slices to 64. """ - # Conservative VMEM usage limit: 32 MiB - vmem_limit = 32 * 1024 * 1024 + # The default vmem_limit_bytes of a pallas kernel is 32MB. Here we + # calculate num_slices_per_block based on 16MB in case any register spills. + vmem_limit = 16 * 1024 * 1024 num_slices_per_block = vmem_limit // page_size_bytes assert num_slices_per_block > 0, "Number of slices should be positive" num_slices_per_block = prev_power_of_2(num_slices_per_block) From 150e33b435eba880af2325e92d40af6c56a76da6 Mon Sep 17 00:00:00 2001 From: Lucas Wilkinson Date: Wed, 16 Jul 2025 01:27:29 -0400 Subject: [PATCH 131/552] [BugFix] Fix import error on non-blackwell machines (#21020) Signed-off-by: Lucas Wilkinson Signed-off-by: x22x22 --- csrc/attention/mla/sm100_cutlass_mla_kernel.cu | 10 ++++++++++ csrc/ops.h | 13 ------------- csrc/torch_bindings.cpp | 5 ++--- 3 files changed, 12 insertions(+), 16 deletions(-) diff --git a/csrc/attention/mla/sm100_cutlass_mla_kernel.cu b/csrc/attention/mla/sm100_cutlass_mla_kernel.cu index 0d57ff4cc7c..e0e95d06290 100644 --- a/csrc/attention/mla/sm100_cutlass_mla_kernel.cu +++ b/csrc/attention/mla/sm100_cutlass_mla_kernel.cu @@ -18,6 +18,7 @@ limitations under the License. * Taken from SGLANG PR https://github.com/sgl-project/sglang/pull/6929 * by Alcanderian JieXin Liang */ +#include "core/registration.h" #include #include @@ -270,4 +271,13 @@ int64_t sm100_cutlass_mla_get_workspace_size(int64_t max_seq_len, int64_t num_ba } #endif + +TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) { + m.impl("sm100_cutlass_mla_decode", &sm100_cutlass_mla_decode); +} + +TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CatchAll, m) { + m.impl("sm100_cutlass_mla_get_workspace_size", &sm100_cutlass_mla_get_workspace_size); +} + // clang-format on diff --git a/csrc/ops.h b/csrc/ops.h index 20ad163dc0d..7f3e6b6923a 100644 --- a/csrc/ops.h +++ b/csrc/ops.h @@ -167,19 +167,6 @@ void cutlass_mla_decode(torch::Tensor const& out, torch::Tensor const& q_nope, torch::Tensor const& seq_lens, torch::Tensor const& page_table, double scale); -void sm100_cutlass_mla_decode( - torch::Tensor const& out, torch::Tensor const& q_nope, - torch::Tensor const& q_pe, torch::Tensor const& kv_c_and_k_pe_cache, - torch::Tensor const& seq_lens, torch::Tensor const& page_table, - torch::Tensor const& workspace, double sm_scale, - int64_t num_kv_splits = - 1 /* Set to 1 to avoid cuda_graph issue by default. */); - -int64_t sm100_cutlass_mla_get_workspace_size( - int64_t max_seq_len, int64_t num_batches, int64_t sm_count = 0, - int64_t num_kv_splits = - 1 /* Set to 1 to avoid cuda_graph issue by default. 
*/); - torch::Tensor get_cuda_view_from_cpu_tensor(torch::Tensor& cpu_tensor); #ifndef USE_ROCM diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp index 370edc20149..23e9212a2f1 100644 --- a/csrc/torch_bindings.cpp +++ b/csrc/torch_bindings.cpp @@ -521,15 +521,14 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { " Tensor page_table, Tensor workspace, float " "scale," " int num_kv_splits) -> ()"); - ops.impl("sm100_cutlass_mla_decode", torch::kCUDA, &sm100_cutlass_mla_decode); + // conditionally compiled so impl in source file // SM100 CUTLASS MLA workspace ops.def( "sm100_cutlass_mla_get_workspace_size(int max_seq_len, int num_batches," " int sm_count, int num_kv_splits) " "-> int"); - ops.impl("sm100_cutlass_mla_get_workspace_size", - &sm100_cutlass_mla_get_workspace_size); + // conditionally compiled so impl in source file // Compute NVFP4 block quantized tensor. ops.def( From 3c8b7cb16024e2a84d9aa30bc26ce57a800e318e Mon Sep 17 00:00:00 2001 From: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Date: Wed, 16 Jul 2025 00:14:49 -0700 Subject: [PATCH 132/552] Fix inadvertently silenced PP tests for `mp`, add DeepSeek V2/V3 model family to PP tests (#20831) Signed-off-by: Seiji Eicher Signed-off-by: x22x22 --- tests/distributed/test_pipeline_parallel.py | 24 +++++++++++++++------ 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/tests/distributed/test_pipeline_parallel.py b/tests/distributed/test_pipeline_parallel.py index 7d569fd8382..926a33c949e 100644 --- a/tests/distributed/test_pipeline_parallel.py +++ b/tests/distributed/test_pipeline_parallel.py @@ -14,8 +14,9 @@ import pytest -from vllm.config import TaskOption +from vllm.config import _FLOAT16_NOT_SUPPORTED_MODELS, TaskOption from vllm.logger import init_logger +from vllm.transformers_utils.config import get_config from ..models.registry import HF_EXAMPLE_MODELS from ..utils import compare_two_settings, create_new_process_for_each_test @@ -158,7 +159,7 @@ def iter_params(self, model_id: str): "databricks/dbrx-instruct": PPTestSettings.fast(load_format="dummy"), "Deci/DeciLM-7B-instruct": PPTestSettings.fast(), "deepseek-ai/deepseek-llm-7b-chat": PPTestSettings.fast(), - "deepseek-ai/DeepSeek-V2-Lite-Chat": PPTestSettings.fast(), + "deepseek-ai/DeepSeek-V2-Lite-Chat": PPTestSettings.fast(tp_base=2), "LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct": PPTestSettings.fast(), "tiiuae/falcon-7b": PPTestSettings.fast(), "google/gemma-1.1-2b-it": PPTestSettings.fast(), @@ -210,9 +211,11 @@ def iter_params(self, model_id: str): EMBEDDING_MODELS = { # type: ignore[var-annotated] # [Text-only] - "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(), - "BAAI/bge-multilingual-gemma2": PPTestSettings.fast(), - "Qwen/Qwen2.5-Math-RM-72B": PPTestSettings.fast(load_format="dummy"), + "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(task="embed"), + "BAAI/bge-multilingual-gemma2": PPTestSettings.fast(task="embed"), + "Qwen/Qwen2.5-Math-RM-72B": PPTestSettings.fast( + load_format="dummy", task="embed" + ), } MULTIMODAL_MODELS = { @@ -248,6 +251,7 @@ def iter_params(self, model_id: str): "meta-llama/Llama-3.2-1B-Instruct", "ArthurZ/Ilama-3.2-1B", "ibm/PowerLM-3b", + "deepseek-ai/DeepSeek-V2-Lite-Chat", # [LANGUAGE EMBEDDING] "intfloat/e5-mistral-7b-instruct", "BAAI/bge-multilingual-gemma2", @@ -287,6 +291,11 @@ def _compare_tp( trust_remote_code = model_info.trust_remote_code tokenizer_mode = model_info.tokenizer_mode hf_overrides = model_info.hf_overrides + hf_config = get_config(model_id, trust_remote_code) + + 
dtype = "float16" + if hf_config.model_type in _FLOAT16_NOT_SUPPORTED_MODELS: + dtype = "bfloat16" if load_format == "dummy": # Avoid OOM @@ -316,7 +325,7 @@ def _compare_tp( common_args = [ # use half precision for speed and memory savings in CI environment "--dtype", - "float16", + dtype, "--max-model-len", "2048", "--max-num-seqs", @@ -338,6 +347,7 @@ def _compare_tp( common_args.extend(["--hf-overrides", json.dumps(hf_overrides)]) specific_case = tp_size == 2 and pp_size == 2 and chunked_prefill + testing_ray_compiled_graph = False if distributed_backend == "ray" and (vllm_major_version == "1" or specific_case): # For V1, test Ray Compiled Graph for all the tests @@ -351,6 +361,7 @@ def _compare_tp( # Temporary. Currently when zeromq + SPMD is used, it does not properly # terminate because of a Ray Compiled Graph issue. common_args.append("--disable-frontend-multiprocessing") + testing_ray_compiled_graph = True elif distributed_backend == "mp": # Both V0/V1 of multiprocessing executor support PP pp_env = { @@ -394,7 +405,6 @@ def _compare_tp( tp_env, method=method) except Exception: - testing_ray_compiled_graph = pp_env is not None if testing_ray_compiled_graph and vllm_major_version == "0": # Ray Compiled Graph tests are flaky for V0, # so we don't want to fail the test From 692eed7a2b9d7d4ef1cd7c9b0d8e565ffa638664 Mon Sep 17 00:00:00 2001 From: Michael Yao Date: Wed, 16 Jul 2025 21:11:38 +0800 Subject: [PATCH 133/552] [Docs] Add intro and fix 1-2-3 list in frameworks/open-webui.md (#19199) Signed-off-by: windsonsea Signed-off-by: x22x22 --- docs/assets/deployment/open_webui.png | Bin 69283 -> 58608 bytes docs/deployment/frameworks/open-webui.md | 50 +++++++++++++++-------- 2 files changed, 33 insertions(+), 17 deletions(-) diff --git a/docs/assets/deployment/open_webui.png b/docs/assets/deployment/open_webui.png index fe9a7e15ea71d908c76eedc52d92e901bad9dae1..7018b4dff6bb7a6331a9b922ffb82a5a98969987 100644 GIT binary patch literal 58608 zcmeFZWmHvd*EX!Uk?!t>O-e~iZo0dbknTo6x*KVvl@t)9K~fr2N+gs{>68-QbKm!K z-OuIa{qOtpy<@y%=%Dsqd#!VwbIxNP$1zuwnu;t2DkzQY2(N4|46!sZSH{u1z?6#NHo&A)r+ z9{73}{=0mHzy5lkJpbNb-``V(zc6_{;l-Uh;&5?bDOcXE(22&AXNLKZp4O$v9Y%Csi(*1%Kcoo7LWdT3d3gkd_=s>(E=$n$~LL=KnOkcT@*QS1l+`b zf2$>zuN-z-!T8%R|N8rAFbb5Velcc$l;QVx{p$tdMG*W{76z+Fe?Q*;&)+($Z7}BZ z*R}t3uK&I^@vo}CdXWU|%gIJRur+LzX#+Sl|FVm&%h`&wnpq>?T0>}YkW z(y$~~DT_CXoWA59rodul@Lm(;BOkxpUq53wTCCGKjB|Z{ zlpWD13Gdde5jDL2=T?);kHVb}*hLBW?r!$-Hj1^VPon0O@a4%>&)|dW*3v4i6Y;We zvQ^^faSZ-%2Uq(Yub}k~a}CNbZ@+4L&M+!vh~JG?Z;}vNlP{0IQ@*J#OHn^6iH1SS zTl%EKpUUr`3jvL7^q?)|?eY<* zjFCH@ms=(9gAcgjFWyM^i(Kv027q5d1BTffD_ZwjP7;0Jo1i_k96yQ|-t+#Vt@}l9 z{6DGc-${puuJb630V`zI|7y1e_k20#fn(dLj!qX%_*0}S5Du1J!%BFvC~q4j!{!=n ze~CGid9Z%#Hn6zvxha+C-H@qgA#Q1{$WQ<~c|mBqLUZ#Q;t}Pu)ufgO(cu_m4dV)| zG{g+YkhIN6WSkpD`*c0*!p9yzCZo(MHu3^)E-?6}^Ti{M-i=gP>v zwHC!&{gH1@B8D`X&Nh3glg`Hgvn}cdJ_W%ywqZKq6W2 z-D{4>CRb-R{1wMPKkaw?G2mh?-QJQfPNhfW_3hK*+~oDC$4+ghg*+bksZITYhzV$t zAr`6*2EW}Z>obUsv4=JxcLkf?Nb!@h3zC;qD^}SqeLDv@;$T}LLK5vQ1-FhHGSxV_ zf37YB*gmw0p5}cPEjV6BVbGY#@kE!_(d)JgapVQ+)3u-LK}eL)fYh6BZlBj0)!vXl z4c6At4|_--jXt+I_UiWXYsWm;5i5OhQZ_<|J$PJFfpoUR^8Vv=aPO!!E}L(C=F!Br{jf~VtFIUa=3xNiI1Xb*b6 zh1YuReIrr77@$k&@kR{_Jed^s?MhRx&lrbNm=0??Jo@I?y4yrF8vHDI#dp7r-=y=7 zEB13v{7Z0a(Be1qLok;OcxXAebW{dDf`znDJ(Ha=Xx(if!!lEfIGeS5YnA)#K7&`f zv1dUT$dqEtFCZ=(T$1d$>~-%qa(!uKIht2r$c;-yV18^!8`m4m0rwQDpelN?3O;~} z|KUcrT%1%tr7xQ%7eSRk&6o@s7cV6#L8IC}L6lnfKyPaK&ol?M-|puCZifHm)`dil z1x`CYbDk0}>XZaC6Rse)=BKgX)(mC>ofO7v?vL+e>t5QW4pz=q%Vb@DKw5dyVSn^F 
[GIT binary patch data for docs/assets/deployment/open_webui.png (Bin 69283 -> 58608 bytes) omitted]
zxNxJ2?$X%X)aih-oQG?Z!QBQ6Qi)7(wj_!EBNus>@9UjWHW|n;Qe>7V|GNg9zQS%t zyNr3{_<*uz6K)TwIzvzx>Kk^Mf$Qu-Pm!`(z?4mvFYPlgm+V#?(GR?fN9!1J)jlj$ ze~E~or!(|D$TA+&^e96*`i1VxttPh8VXz#@#>R-ac=#I_QNca+MYU~i7Zk#}sQSmN zl6Mj@6-SAB`wKhn`*?eCjjrm(*Vs$2ZICfD_97XW!X3i0j&Ng?G~0Q@p{1@vxi0Ai z76q5~?g(=L=w2q!F5JKWbGA}hOv4g)jncX&VJc>H9(L*D6a~CAVW}n%|x}Y>urCl{`>pOow?=p=HHd?hI zT&1176y-hli=tHr_QpDUEm%2jZ0sH_CE0$q>%j#P_O8!1S>=NMr&sG*8KPc>j}GSy zvz^zt1W%-o=}x?dM~P!)B(ZeEWo7r3dS85ocFveQ1uCv#Dd|~Drz_H^WCb}FapS~= zg`beFkbe1|xZORMw=*l4r%Be6L;0&Ww3qX!U@p;M80-ZROlsqg=iE{#Hc=e347s95NKfW>f5j@y zG(Y41Rnt5H{ow{8n@3kUX)X^Pu~VTL1?YME_0z=gi27%9MwbfTk~FDZb>8l}Jk>lG zacm(W9f%*ZVb5q#a*|vsOGvh&Q7y<~$_<|jQ-Y=WST{;$i3j4uUZ8jRT9$l#zei4@ z#yrC1HuG>AkGK*7FStu$oZrRF<(&Y3#q z)*tUgz8(!tH1&ze@6r0{PNcj&MFScs{6p8B+(j_#|}uZ!e{CyXle-x2qmi@tncw&KW2!}qIQavD#t)SEo#1c&C> z@VTaMuHsKsxQ3N(m1jw^1P=`v>@ae3bhmfdmS?fk;<uWVA-G*3kWn)+fh=_mbamtA zw)BuoVT#3aCr-`P3Ohd>FsSfNCtncT5_dS{5bdD78f!0{(L}%YeC3c^aVfPgVT!K8f8_anD)KaUb=gE{E-l6htOxt!3AW_;yL}) z-1guc&-}T~EC7v$D*3{Fzg)^Y@*7$ZFIx-8ncKXOOFJu?G$`kz{p1NzqeYL*E=kQw zX$Otg+&b4QPQu$hlhy*SzuCxTl^(w9ti{;ScDqG9+>{JPp9qBy@|tulNMccORs=QMUL{Icx5yXgFfeg)Ga5PO|%>N?LtiV6~B-m-Tn z5sHw#mA{0C7%F*VI#-*4Shju4Bi+eJM)}zPpDBZrvLU4<2cr;D&o^Zg%*uZL$-VZ| z?cnaqU#inij~zZf65r1n58YV%%GmmGOEozxyxxb_+IS;dxa!Ni_}v_0)ytY+j3y(z zh{+zcU2oAK`uL=Sxs6g=@&42)s8+(ubX=Q-Z)QCZL^h{xM~LFvSWq$0<5}X1?%Tb7 z9+Hi6waI8TQ8a8wok=4K-dP&pmWq6muw8$8?3~fny<&F|s7TW`N_4S^;!sm$qG+bU z%0uwb{6U{`RoeL##d4f(z5WYd?uH3>cF(J{QEJxY6smWBtB!f-jUk-o=Ts&s%dPsU z?ug>Y^rgV7M`buSs|y~2F$#r?^_3% zwd2Gm5rONU!n!S^9Ci9kmd0{?J++s0=SzGVxt48JQG9ol*x7_PP@ZZpctn>dY1BrO zg7Q~3!YE-KCL9Mrj8tJg@KD=y=dGs>F>!6;0)rk>A(s6|$m#u+vL5Z%b2fK=Dx>iy z>y+Gyi-sK72!i$obV)|0h6&xn zx6`a#&a|5yd0>xUZJ_IDR*q1cOK;4NUt8wOzll-RCV6Iak!oIdu2|8{R$Q^I*DLRl z!tzsx9S5^HOuDc{`P26fT3)#o2K(EkzS*mmMxa!Mp)Gm&k%V>^T;jAfFd{kyn)g?P zZzd53tp{)pRIL5F<0vRfERA$A; zmoNarGd|=9L~lvbck5bLN!I#(Q4NhdI29&H$NmBR2I_W6egb=W@_a7a+v*H<^GSM`+E<@6Ca$RRImo7^5%J+YpZvO5Tq_9M<>^yLsb!Is|L@7e=%fuJQ)-%lc$)j4J-%uTXGe1e~97@g{BA|UUvd0CwLO|`NPkz zbR0K(dd=sast3{_gm?KVFAa99@0v=HFDERf$)5ICRA=r>hv2udzs)@iK9mVJa(}P) z!^7B3wr;$VRs0|9GfIOiZkzN1S?*hn!|OjJFT$s6^)I(^Jz zg4Wx#aF@@6Mq-Q$etriLmfXdIUj5`VQEl>>FKA(MjWaqYj)MG4=y!=8;ag%f9x4yi zKrE{G;ebp;>gad%Z+uI!;d1v-xw!v+i4Lw#sSr=`Ya-vIGFE2C_^lj_3`PQjAy-F? z+Ie&jKGYuz&qVH-Gu*nZ zaA9_Q=B&G6c+z21YHOhiqBs;rWx;Lgxy?5G!xS0|7l&-#G99Dnr_&6UA8)Qd4Wqyy zpX6@cTyi(-$*IDG*=Ms(y!10a8*xYsYPOzc3Ek~qdubn` zBFET7vZ`6Kx)A5-1=eX+8;F4L$3`QD{+`e;Gnl14_Px)d$T z$5x9caJ^2K?~PK;IB=K4l`w$fa#t>& zN`Gt>`Haiq#ZhlIBcq*CR!07`X-P>eZZf+Bsqe$HcAUZSX_Ct+(=aY>bFL9h8MxUs)|TC^Bf$ zbp26E5sf({z=)XfU(=ajU(=z2|6*>)<{h z)E4Xwa_ZRaxsqY&$ld+>8r}Es2r`bt1dx-IssjsZ_Ky$a44vH>vM7}l-T*dy!;PF> z%t6V5^Crk_n4z1_%;uAi7M^V!=ykG`5XA@fJT2O&%9N3HQlyyIYu`I6UcRN&tel}@ zJngm`r%1*-Xz01#NHFJX30Jl*4d(&k6=!!@atsgeP?IUYYeH8(G%(Aod<*R0WXnm(-vfmNFOrFq-D$~1b&aCs`0 z5$^P{kky<6_SwfhbdzoC@(&tKG@=uVHW-rhzrU01qF<2dIqvLqPcBS~ckcYAGT2~p zOS5?d5cQTKvnwT0?q1QnUx|a2dK_6DZmQVkOll>chg)cmeJHQ1kQhbNjv6b$Y z@K$L4QD?=qDD@lufFhaikL+U!L8{UyQ;|?Ec~0fPsG5J##6OWq>G;F^h{@n+?p^pQ zn!vnaU=ytgE6Vq})(QMb$!kTQ4<-LF!#p1vdUb~br^#d7iOz;AVJbaizOa(N@FI_&CSUS> zYqjg2C{|TO5e)<+qCeAL##@Kk2IbxBRG7M(Vs;-tDt|sj!NA7m^bg|$^yCw@nTg&g zs(BYBwp6P;O7U&D^~Htt36cC=R=xa@hVCxGT>Y2;VyZZlIv%5F6M9g@X}ls@{Jj2; z)F)E&+JAmD@a+~D#gfatte9=>VkO304%VV8rR?s8jqLI3CO86#C@Uf|0h&V088SEO z3uSg5#?PUPzQX)%dDI1(usR?bzaMgJXUz#zl%+J+$PT8|b+NUeo0$27S_WcPxh)Qi zd|6;fu&C~N?9|iM&B}W0E&C?>SL;0}@@4z%Or6o5%EB^y@{SO462T#Eho35&!JxXEw z<<+$^hDhA_|M~Q9gPX^SZNR4`x07k|znK<#9bxn{y^eypxbdh*c$$TVESF>diTVGn zQRQ98o)+dLH&@SYp%l*5kF5DdW_-rQqZJwwuSN%cQTg*a`giFl5D7-Hqk1yN=#0z! 
zex_a?VA5d3e;xhj$@}-s-Jr#GK$aIRuiA17pT~|=(Ftxo|FlPYV@jg?*A@T6tl;2@+()C-C<<3*GJ^)D10I84sy1B1=s$JgGCLlo5N{l;} z2?LHL1dxin(i_dxGZSt-q|{uRaW((@`s#{9nFL~hr~vocTBLmY@z-ZzzoWf(otHX- zpr|}kcleVSa0wbCfSti;UH(d=63uwN`oR*F2N1s30lgY}dCI4Z0r5bM%61GakHPZd ztlBf2IjHv(JQ^;LvV#G_1Jhf8MUXKj)=xQ$z$=TE^Vdl_gz9}fk2cycKo`$snhRmN zM{i{2{^b=NeCUS5n|FZIXf26m5ak;7r@PRbL71vvC~0oi_6NBFTJvZ--~*D(PXjV+ z(@2SxTWH=q8M9O~jYa~O)@LUxv>JiFK#~KH=5EkSd%7$Fvgd8MhG5l&w@gTv+xM7> z$)(p(o&a899DyR;a-e?;)WxzEYqh)gbj>v+V-Nqmto~=s{MSB9V?vkWg#x)zj{p6j z{&TY$kNCDH8iEX2p(G4cBNdA;BF{AwI0A)}5K;?)(r@8o)|~}}|BEaYKn|WKGq(sn zKYPg?k!)-kUMUTX7%F;}&0&=2&@-O)-(q1yIlHdW)m%_wa6no`iSUCT-`iOI;Sj>> z&C^OZ9=GVtmbWtiG8tQeVqJSflH=TetCH_y8+rQ!qdPyhSeR-(QKa^bPOP6L(?v~&_PeT2Z zg`=(Q-10eDs?=L}pRX1wrCrGQsst}8{meJhb3grKNyS zDJtYUZ+0M9%<^~ zXZ|PqB9^!_i1bV0(K&`!ll9do&!hL6T$L0k!sVOsROCcnoPU=%6wVV0$vY8jcH>Zd z{0RnW&)8%1?%avpO`nf+qZHAI+E_2EMv z9%(r;fB1@qQ0)V@YIJ)D4P^tr&kJfEMj4P4N)ICkSn?0auB=eB1YMr= zASer?P%L0>3~dCi492>>*~oXj3dHj}PE&Qhrg;^Lmk>f+nyC~?dPDIQ5p%+04T~vf zhCR(IID32gG??-$%d9pEp40NlbX<4xX*36_Su=HTT40HswR&PtrSLx$Bm~SL<$&zl z1c7!AHsgm&xhbu~N>{|1^(lXJKfBUqeGA0AJT=!Li1ICqtikd?4RR-RG+Rror%?ND5DY2mk! zT`$}A7bqc_T;eVdrgq8QX^+cz7l#)*e9K#4c&k9~rKK;fQBD}lUQqWY2u(MLSpAP& zpB>AWPs=;uzu;4=QF;cmS7kF5)+V}Qm1C+9)&K%et??Y7!OC@%q=9AQN(pHNCNQv1 z+RCk4(uOS}&`qxxyCkIiD(_t?JvfZCAZgG!6==(IzW_|3j?H({It$F%`rzErVlRE3 z7PZ2+NA@Z4g?ct4n7|gjD`?Bk#{}dtvD{fcmnh~@UN7$ydU<(iFEjm;zTFVuiFAGe zhE`ON!#c-Mk(VftR0&ra>`gQemOiVV2KLNz^LoFH%IcF|!EM@iW)HuLXFTXv7dh@X zy6JSjO*=Y_fYnD>`d2$gRUQR`f{TJl?TeCyT@V#Q)(-#x9xvTa#`wUxb!_G4=IG0l z^gjVAqGu1pl6N{66atC%=-Gib=7**k8_Q7My8P3W!1TXu$^Vg|__VVbTSCOz z4NawLcT8{w5>T&Wn6GHRbtb3`=M9)Bv*|GwLWhL{(F6L5QUIRIRnYBeq zn>5&-f@#tG{yN>VgYVGrwE~9c*>NKck}6U3A}xg>q)ZYm%L<2g9=6NL)^y!HEB&b2h%(|IR&B@Vx6bueU$y$ZQ(|OVcMyRV z&cI`o32#!Ks;W#r-i2Ny1VfRgPS^gq%FuaQSPL?GxLRFm1acsxx)bKu$=9q7s!p-_J|LbG;XY-vE?tcY}J@=d3yfb&aZ^JfY5Qp&Qy{GXS}hIe*+x zX%qn*&K1WGUp;2P!X&~aAmgvpsk*53+E7q3_9AqHRo|qw&m0FMJU+7;bbAn>T0+PA znQDQt(jw`a5utPKwo!}8rOa?@ZljN8Oo=EFP*JtCSKd1FN_u!My&P1ApKXf)2-kDr z8TttkhT5P=^89Rhm^hwwm6&IK2Qznr3@Q&Z99GrNsz}i{U}7PUSsE>NMs3J#GF(*_ zfg0oa>k~I7nC=$+_op_QBh>wZz*uZ%0rEz1uFbrYebh$cFmXx)HoyP4toFMq zD5AU8(xeb)E)5c#==tm{F*|+aw>U+dZAK~gA2dIKC9n6*%@4xErfx}n9sT5XMH`8e zxVw&acu_94cGo91&Mc${5*q-rXK6X7E#82?axwYfBzO<+D#=C_9H)!!L&3B+c#1$tW-(Fe3OQFvS$)>!*F)aYn2n!72n66UrZ)f#uUO-;%ICr8 zle?$Z7RxH9oW8UT%23>eX5+WzxfK_`zFk$%rbk1aEsnaJiNNPA1ouDMu$WJ}PG;`A zeQgiyq^8+OA(p!1opG1Xy!k;;r9Dz*CIoshu+gBDHIx@x3H;Mo6_DWoNa^POAP1+G zUfkXg|3(QRVWM`nJSCa`!9Z3Ml*5#R!8*Ns1SYUb)VXRtc?MuP5~#%R{85;Hf4Rd1 zb#0|TS7X`-z*IA+Md9j;Jif8)cwuT6TY z%;(2GIA>jH16KAXVD&o>8Jj1n?cNJ_@RtRN4;gbvA1uBkA{!v8UN~bm8pIH&>X+Zy zkU8aA1F-b`c+>@5k3QQMq+Yu}nYCs2zrEmL^%Op76t6^ZlykX|zj0#q0h{2L>KNiXakN z)t>tyF=!3KAPKm2&C0o$EB51WrYjh%Q`tK~;GNlMVP0p`AFN&KOh8ERNKi~F_1Ra{1CMM;OtWCk3RUcL-XJvd!wHci(EKn1 zO#U#MDlR^bYw6r~sSmdn`#6TuvPZ;=cL9_Rn3g{#jA1Xe1Y^*>eGd={WMQmpb%*y* z1#Gp6XJ$Kq7?cP!iy2Evb5-l3@7eMw#|ovOMk&dZB={?*1NHE-^AtpGT848rd9T%8 zcXWHUtOJ0!3+QeLhTQ+MJepds@*w0{w`AuJaW*TvfEQ@nQNxsjZkgbd%UPTNJF+YWaQ7_r@ z&zdz^S`3T#3gxmCo8O*wjM@>$Kh59Je4x3s0vKUV zu8kL%%tiEaaeWw5hO);F94VHRVZ&Iu?)}IaEsg| zqlhlYn$?OKz*WU|K$V}5wjOq5QV`E*m`OC9K;YSBmc=tG zw4D58DA#N4~L}*69td~kJAYApLyK)&CWo8X%dw6!hnu@*N%t` zx>f5He0s85j}Gjjxee6|%UW?`g!!3X_F!!hHvz9Zb*!pSc9>`&Zif;hluJr<#qeK> zAZVZ#n7wmWObYnCDZ)Pl;GWZH*Q#FD7Yv?7C9RAzV2#Dru-Q|{H+t$3{?lsl@BNv5 zmqLv?wv55xmD9H{(!#7SLzfQ+N?~GZE?8e=b|(2e1nhnQQk=-#P%+8 z=`K%$jZXrqSZ7Z_JuNGzyy!%pQolo*Gnn?ojsF1NSkv&%>`p8Uo0WD||6q z!K*O#oI8kIs^Dx&G9DhY6d6!DLCp{0zJ6d&Dy&V`{thbmq17?&XA;o8A7K}4{KO;3 
zB1pTIZ#n#UY4>g#D*#lVBBmg4;hTH}w7?9gZAh=1x+)jFM3u`Is^DQjV*)b|j#iL- zDZ64w5LA3^a0cS((Slp^hBzl0I27$%pwfOdj2TAm_)_LTaZBOq1vWMZ_U4=UZ9XdBgMaR;f3~B&)uSmYbfUeu;)#D}c17r0bfS^^OEP~C z1yF23kU|Ej0~X-M3zvJPJ`BVv)0vfbA5X}i#Q^*c1?Uh;o-0ij-`eA60PrP&D2`N6 z6z~0bU2LfXf!}xEo@lf>tr(=g2jR05ByEem^_iKx^m`FCP@kR!DDTOL1uLg%0$U&! z6H-gLbTZ+$K%?V2&<9$NKLD4;yNUZ9Er}X)tFC zGQIB^gQE~I5O+UT;lu>-+y}dBtmjA>4R6wYJrl$JUX&<8Xt1(Z&HazXiHISvhO(V* zyrh6P{^L?J*2=SolTR(LH-|z5WEArii4U0-KssLKxkq7wHkS7#K=4BFWF1s`N`2TC zIFQciV?7R+I;!rl6S}YI=2PWv~efW%1(4Gej^`qfFZu#5+>a0d&ZOtwyFWMF4=`T#a{SecFGPr}pl*+r zH@wJW(Cy#4*$-ILv-0roIwOR#z&J>;>T4y?G(eN@5k)W>$ZZn*pilSrCIb8BZP_R8 zOiu(G{T3klAldH-*b)rjid~_An+O60GrR7G8BnMagGfS50Cz#LMGrtdY- z%WpK2%?l8QZN&91S&jd)?)+_hUqu1VGs%pi=1dF)ZE@QOz!av5BZfdNj&II!I0UG(_C@-3v8RaZz^sqeyw?E3>0NWa83p0 z+ouE!w3lyzH=r)-Q2EE_f-Nv1cgfxU>{$Y*wcK!D8F9UjKVAk-M#&DRgbs_LUM~5X z1ne{*Q_@KQ`5)SRczB`uB#kqGharZ}34xN{JjL#xIV7HVFmvJA6IwW3&^a$|wWVF& zrcft;C6NCW4tG}PP~0GxWZiskf8O&>v>E}6ak_-l>?SyeqX}{~oFk#9Fh5lN=&iqWmJ?2wYl&+_0l@o z|9QGdxq--lVHgXVJ4D~^Y_Uf5Y0t|2g6qvE4cXTd9O=po69MO)Po;`yq$npngPAg7aypVnSW@& zA^u&CLyfFHAk}sGeI#?S$5Gnnuifgm0B0pMPFWgkx<>zfJ0z-P%ZI;)gn7t)5@_QE z2!^JM#UJ%uZk}9)%R&n?!whioy+>VjzdeR%TFu*U-VYH$jWBL@iBCqbo|H@5_2)F; zYHTykMs~6Wor}=OKi>N}TXNY$0H&Dp-H?tHC)=p#ng$P;eqa zRd1fK$I8e%(C|kyVPrcGN%IwxUdt%#O<*kvQDyZNGX?l$2IN1%a&Oaz(AhXFGF~(j z-6cQ+dH5TL0F{_xmjN7+v-dsSdG;4@m7SjiUN09wL(@vvjeI_DWX%4M8R~yu2bm)T zQjMXboPcb7tDU)3d|u{0Kt|^dMk&(4j%bP!?WXS$2a!5NY`*00>$A5RVkH3U#C<;y z15QMtvuPn^6yrg#aaa6%mDs|tmC(@f9JF56457AE3{L_e(^4GEsGA;KiaB9uVNLaB zImR!H^23y$w=r#B;V$_3W^9)9 zsq77aLy26#fCoX$vZ2{(;2*T7w z#81XkO;DY@5_)rxO~{+dE};krUT6Z9db_L=1YVLQWO<JK)PNaZ2mb({g}$4xX&s=_{fUBI z;N;tj74Ob4_uJ>}&N~EXea^%BoFH&b^!05v60m-Yv9K7un?ib&5cLw5fFho%6ArQB zbCS?kD*dx$eQ`v0Q^R*Rd*pQbLWKU6vzDmYx-=qCYL>V?^i$+WeM-{Q!ehWgs#sXY zE7OLTbwizoree5<6#ET+KKaum;9Q%3iTE3i9G}r^%-#DSW?c65kS@dw}@fe3&&WewYb(^9`A%5O9b|DBxf(te-#@q}c+c{RMOZ2ZnV6ch+tehEXWx7OviBnnR%Y6bjuV96}>(%P(exRQM9Y@v9VB~11RK) za=^Jg;*fE|fP`@YNwxuMu`0ET(|$7$FEG3%oLA6pMQpa9$wxhd65$t#}fKL zZcr7KSPu$_CSXXUWp2kkM*;(OJ^ zvxgya$4ff#*wpwg8EYUqTdckxjyYB$sRn3LF>Og_{K(@r0{1;<>PUdD5=|?6D4fbc zOZYHr3Ok430Gv*hxolGsd;uxkhh8>~X?#**5ryk|4kGAhUxHHtHr$`fg<;X##fRY( zbWEeC6{}wYpM&(nVwbnocm$r5kI;>A%SllZvEVF|Jr;xj%_=tsQILc3An~pvuqa~W z378Wl+Pq_ik7Va$)NXzxXBfED!B4%0zF;J>5oAmBquBQEtE2Z4Oe}4>L9Pf!3Y~az zA^fEch=4KL_$TZwg>K+wD?={wj@m#R2{#rQ9$dcjHmBSr$ynD;M}WJr7htCaWzYdc zI-&djLQh>`SlxHv%|7H{L2P_QqXJ$gm~M`iON>4@*h?)y-k(h1;$@g(f)VG;sykV8 zaNDN4-M{y1;c_$W4cq4~%WpokuJMgEs@DST<9pf>*tfG$Zg{wkKp_wB_JAMRF~^KS z7EmJEB90pr&jUvHbDGs#C0LM)`BAV&Yst;%)tsLHAO39Wsel0JUQ; zO-_@heGFt521q8V)QN});9gJ>mUo%{2-$O`9g|J0!vk?Yx(-??;Zv|; z*oKUK2!f0scBYOxttH3~NU$;1!I7T>*Wy)oQ6uy8vx58W&BPA_sezUGSIZCCoXtF& zgT$z6;!o4+2W(1|{ao|IIS4*>hWHF6`G!!34NzGR-6&%6L+*)8;^rRhu;YlCM6|0F zt1Gq~e>%pBEQ)93V;BxS#8$vPkRC}1m z12C((Ea(IV-r@UvUhPE&)(@Xl2QcItpU8E0F5dah)s5ABj--uZ>5_}|MywP9|8zq3 zcx%A5=3hO{*^emM7zGJQ)M?mC(QA*vM&sv}U_(rk+?9|Ktx~5>_%%6K)W#mY7~RL> zl+)A~PQSW&TjtQ^XUG_EYbR4PsYRE8KFx2|YxxyVzvv8Nhor{juE``PB*=#2be?cu z;s_^&{M|Hs-QavGkJv*Se4nA2Y3|RnVH`1G9Yq|R7MJRDN(}4SC0|iqrnvV@U{!Ss zPmel`f<}SO}*ZY>AJo{l3Bq#WXTv$I) zNLN1Xn^^oR;xgT5h-_VuWz zN1ciJVXl|-w{;Cu5-%z7#{IGemhT9JQeAIc%*<0a7r`hu(S&;T)*|PXHyNwQ(Qb}eI7h6S%v(^lvHmAR7 zQ^JZg6bP0Q-);FuPwTdqE(6bYHxDxI1z`zRNg$27iDCfiK?5XUdukPM1Z>=7kLK(cMTDvY#MHCf86cp)AMU;+GLRF+m z6A-0$i1gk&DkvabY5)}oB29V+mEIDmp$7;MI#NUDKJmQgeDCp`_uu_<@8E}HI5IZL z-fKT)tvTnKPaus&H*mlCMEQ_!xo2tsxuj7Pp>L9SmP}PU((LjCbKO~f6cG>0+k1C{ zO@=v26)f0aLP9Nz0QwOaW3&J~)<{J(lwT6#EP|=!>K4ed#zWpeq)=w+`uvCN;O}Lh zVyZ;iInw>;)(&Y0#dRcS(E-`n*H(<1C*5HQ_N^Tai{b<6%wO&He(cr+3N?O1 
zgh3A@9dEL_X+1;m1qey|Mt3QjeUIL~Ma4{jb{0JBaiyPg{{($W9!DV{ca|ymRI^** z_3Ct#Nai}+dFoJE`sE4G{M}%;%)0u*Xy{tk%MC7JDn^@abZX3_H*tc9OQLoTn8z}U za*=Gk2JiC-0wjak4RQu4gwLJ@_=ZvGPI)!p@n-r>_mtZ*1Ft!!KTm3T^pZYw_tZl& zyCX8)?=?{?4bVCY^nvjCz5V z+ozxO5d`Z=+BT3OvNnFAe;Gy58ZjjQod^}V)R<=s+~pF^F|=N@>@OE*i%NbNxXAyH zg-n8g?D^eq{#5VM%iqs32L6aR{C2qFdqiW&i?50Ni6#jN2%5wXUSh$HDJo`XbRBwj zc!xE`>DpANS?PyA67J6E6CdvdI>LyIPF^DQQf9nu&?{^ZD?;i{FKBB|Tg0VgSSxP# z^=78{aEPYE_af2cs)5>!94NmyV#zCr3+<^{e=fWW^#pU$@FXi=7jgQ7@7Dy1hKfih ztZKbKULbq+ny^F5+E@wQ#hpHOTKYVLSxbHN$2u?9BGN9umGQf3sVnnSFIhK{cuw^+ z{l|-TE^5N3WFSfwwM9J4T!jaoQ;{-8mn=%ZF$4*t_Xb;D4ecWBH+f*Yy^6A36tp2{ z9~Kkf6od5`GmI~P0p?iE-8Bt0XX^k=(74$fD;v9|nD2hG?S*BGsKA0+B}-H4+eHKI zN@N8PXEK`_#B_6Ay3$oMgH!GQl=@0#A^1U;j?72qTU*H_AA)>}-d563&?rBpv1Jwk zlE;f5uVJnyQ(mHe#V&RI%FqntcM#-x3gPXWf-r|RHZ`(I1`D&xYZryh&#kFl)B%~X zJ&n@qyEI()zKZ`4)Z5Oal#%E^i^ei&9DbN#(E9_5)8h0Wf~8r>4;LAIFaFrr;Y3Je zw$E~XTt^(FbnrwvF#v(jI^A?8MR?VU%hf!jrXdWvnpTJHPCjH| zlub8kEq*cn)t$_k%=xOG+l0p51XK45(MxIfKl-{JkHD|^6?QbtH4o%`T zKU)T9Jo#*2XZWy`Q5A6E&YfixaxWPrQx}r@+3G~~f5$2$dYFDx7E}fIFzZ{%VokJb z2O>%f#Dv7VA*yuTJ><<@zK83n#Y50v@9%jaq1{&_IY9}lne|kUGi(QAk{3lOpyrtI zhKe>mG{@o|(4=5%WmGS`9C{h#M2|>7bUIWtg*xrNoQTJR>5dQX`y^=+3G?;r9LhJw z_tcs*)E;%Sdpc}m(&5%-12|9|Jie3lL_R?BQD%evS>Oinr9l(84?-ywJYS}XX+HHb z(480lF3D@blfJU~WGZn(;q1xK3DC#+h7V_1C7qS4nj$dtJS{#fxcFgf7Ba^x^wWu# zr!+F|%JlBNzviwR7!jc1cgfsO!x69zGEYZ5-?RfQ5rdk73M}#YmHFq7?WlzD<)ElE z16jtG-m>jUU*At`_s?@aJo!4D2ZXG@pXCryL>cbRuvL7H~$P7ad0Lwr4 zJgQQiaa4CBgO|nPm1UD?!p8@PM|w=@3rZHZ5n2BzVcVI0%mzr_P+3zJZ6rxhk2Axm z7kdBU5#m43I}AjmG{% ztT>Es!7twwr6n-YA_~+plKdCN7S8uE$&zYOD-`wH&%%upl9ev7kYtJ$joQ<)%=YLv z$>pA6W>7^(4}2?V?dC<=$<3v#B_kdha|<&BxmmE5M5YLdQC$j>BmAJJ_BlM_;pHuP zX2SM|WRIUiyyN)JAS9{bqZYw}#^w(m%=0bS&9Y)dH02Kz_PB`|#ByXJzU2X1uv>@e zr4vmRZ>Z(&Wfodr)t+%ZOK{RWBZ}wh$Gfksz8UyEe_;^8%zH^(kg};J_)?cWkkd7t zy`l$gZeK9L3%p3=qGL=GN+~_yVy9KEe0k5+3bfOgEO{azcQhvxwhP33>2P4_?r@73 zBW60+zD09C5XZQJTHGy9`-h>g4l!A|ITK=qq0U~+d@{7pikOwvW4PP zhRsDI0#d}2`XH)J8k=&%MP33)&|+cyh@pF%$8rzYNW9+kAMkhL#g~dBlqXL>Z?QSW z=PEv;p)C+h;eqv}@1|{?cp#lYg7pd>krd1<=RN3vc(sdCvRT?)&YbmE9sb~UH!F(U zY@WrAqA;dAXv^@OSqa~QPY|8UEhpKU{w^_;wN~b+rb(pKr1KGQu3hNl6qZoE@vZJL z^_`}tvz6E3BeoA?6o@gfhiaj;LcH8Ff!C}JUp4hdxQPCv=`&HxKe*5C`R*-R|6dT^ z6Y`S|=x9CG&IeE`o)eh@e2odFN|$tDDQ8JU*4_dSgbN+L(mL|i*ZrB6!sGo5QoIFS zZ*X+l<(njB6dKvuY0_;@nV6k9ty2O;M3PBdbv{=vKhMf*&@Le`2Ny)i0 zjZN>6#jy2$+%-f3N0TlZQ?y4(^UQ_eAA9n9)636Q+1_c*+;Hf{mylXFPts1{MSDti zeh)4)##1TTczq;`Pfq1xDM}>v&_SXR86eV#&md%WeW&pvrWT%7})d);W z+7M>bRW(k|%yXduY&b`BN@K)wxXBx3BfCT zha@*8>M4;w-V^MxS3=Ca^ijFV6E!U2UTrVLpZ|C-=t;x$Q6`+;5TyN~CA4;*p^Yyo z_1u{H1>e8sG>Y;&~OZLD~1~dyUEWO0K{VQ=)>;zzcjssG=eU20Wk%!cs{!D#> zE#_8bk}BOQteb)9IwK_Adgs)AiSccU(TKWal-yT|`1c&4vu8gt0$cY#S%&|j7W{Yd zEyM?I6@U1%D@YOu9vY8NF!~%adN)2hNuVlnz*WA8z@HdF@rd%m6{_g?7IHL#Q;&nsj#gd$a8g44-k zY51=um*p#$w?@T0%7}8Po0Ic<{tJP@({e)ojCZJeGi?VkQ3DdIQ$Uc5nho6bbJG8} zfJDf=?~i#zfqY`s!ZIU&?upRJv7&0%d*|-55~Pa^+0|dD4`T4SwqpB|AoT2M!k5>X zd--RI^@D=}#An^jkNxpC0_m?fx3NBcV#ip>G~L8m=N=A(+i;?jV^!5bzrNSFDfAq` z8uZG6YrDfo>U%U%H|{gt5u?zG3%ehm#xa!sKRS8d8xo#}Kl!dXM=}@azxn*!lT#`>}G-(nv1*<~zBdY{AhK~4a5V!QXV$VVhB9!yf)>HN{NUl%A!nmu0-boB=PkNdqX z`5w?44lPe+C`y@6o>Sl=sEx~{bVm*k5R#q8lk$Lj4d=ER_;TyIT^&n2YZsGfQJA&J zK%6l!|NBsoAGP#*zJSN*y3bOC{?Z`86Pbp(zY6M|yGumSu}BZ3*}0ND-#${l4|m zty_Y#epU{sE>ijTxBM#|diH#{f)N)ke}WKbTNOy|3C5d!nna ztr`R{)VN>u1?ZM7xL=?5MEJt|!z*-nr)azJMg zbuWWvCuc<)uW1*&8}UVLn+ntMs!4#zyy^VuXtaV9JoW+3eY=C$m2kFvv%}3X&`~?-io*AF zsFvHCKg-cT@$e+u80LEyRQ`Q0=)qL5CnwKqIlch^B0qBD$VgNs^0huJInJKOm!_(nH34H8vX!*T&Dn&y%>K4)z}e~?$s;-pUIG#ZI+uTf#$Uei 
zPx<`DO#%n?;2DejxC7E=fu!~VK6NH`iNnc*)&?X0T?eQaC>356Si_vOb_m-hWH(Uw zMqD3|h+ciCx&mNHQhXmmIY23ye}dL-4t8KAU*C~F^}3z#CFLPMYwAOYB;Z#>Aqjdw!;S&*5* zWPS-F3c9Yf8U%}8T(8mQZyW}M3!Q*tIQa*BI~%*SZv*9FLLT8-KFVn+l+~8oA0A)( zmijqx5eSZ?!uKPEwpD?V`W;9j#ti8B{~&TXq9vqMZ?A^a6g8b61jhd4HUQcGQD%#N zkfh7{pVjy4(b@gYeBTS;3FQ@VKtdZnWo8_A{ySPR#K8 z%Pi1oEM<+oanJxfgs#Q{xA+k@AfPWSw*de&B9{fllVuZ3Rb($mvXyYJ^FfF;2jwo|h2>fP@|T9^W>ptpQ`6H1T( zz19QF~gP;NwOAd3>>YaoyUWMBp1aLU<@+62aNGev3MyJW-aw_BXe5^#jyL3kf=lr zuik-1*p~D?J$=zw!H)4!08u-biOUpw-M*?CKl<^ad-eH@jj@kt9bZA{$}@)z%!rZ8 z?UC+Ktt~@3K%=izsDslZf*rAS8fFQm1ULe=KEc50eNVq$!~Hz0Pdd7qwTP{F_Tmbg zYH&ndO6ho`=+_#_DiJ&po3)HFS_uzyczK5}tUy(qs2t^%Z=cWebnHd7Z+T$jA@;+; z&^5SY^!L|$Neicmmpz-}ODgk(_ttj4m5h%~KTUB7uG8dLz}X7Qhl%|Rzhf>ki^jVx5L*<6 zF(-g=toY5QfG*Pl-}!A3vr=~?HvU6sEi}szlr}=>-=>$Om5c(MC?oHrls=VOtqN_d@c`=`lE+gqo1ht8%1KraH0!Kxf{yL) zfRCiHOFt0bS{50I&Ca2*gG$`+yZ@N%@i{UQGJ-Qt2zdf-y&&P0`x5%dUxnh0x#iXjjz3sez^=8ktrdlFg588at6h6%sqZ-%U#b(wlB_?7 zp3ZhRNz1v@pKaE8%M^s3GXc#KCo>1V&=WR3@EXm%p<{3=e*Bu$XUDVld zd-x&5oMC^Q=pr~6V6fbKYDGam7CWTXXDVClP6SRDg zGo-1f(3aD3fy8pOFEYXSvBNuzQ|wWTjF`iSSQOgX9m!L3{`8=ajYm&XSd$Rx^bm>j ze#=s8JFECktGzFRpS*p<1@y^rr{6$t(nmEjxKBl5R(Xac)~d=!O|%g_EXa1} zr(!?-FNU_a_FA-hLK3ixM{iidQ?KR~zF$NSqPnRn1(P+$Fl(@ZnvD#uQQMlq{?C%T zKDa9Mc6s%T^RVsKw;ZnK_vY=M_Wkb!<4P3NVsPK8%kYP7>{;Jwr>|QqHLtz8f?PoE z>4&PFXy1@`ERCFVcuE)2UtyxHiBfgzC}3=UWhH6u%--T@rL(4&Jggf33A0z}j6L8L zYCVj^sD{x9ecI@4$Bt0#tyi>1s?RlCkD^mH(hu}(#5_%o5T0EyYlAx0y%#4=5)htU zBp2MTeA<%aHnup=9EaN6Wl#O}dGYlK1Ui&jd%w&r(L*`|8)78gW%kemaw$+6sts=VfSsD#}w&IM*G<#{S^=e89J zm~uHAV5BdKogQ;o&c(D)Zif!FGX}-?-esfmPK-sRkOuA3t?T6Qq|A}X2Pu{~8KkQ* zB|F@^%&+HZJ*Uulc6f?o)pT@DM=nU)NU4B{E_A0mb$`YxR0IyWP^Rq*8KPg0Zw^Ja zI`+835F>?2VqA1-b1`zw8vO}R8(|bD^*|XME|uVvGQAjP^)QP8nI2QbfoNKQ z{zjup?@mJ4l9L4dY_W~Ey8;Y|mIHX;)d;BQg63W>M5x0O`Du5G-%%drl5A~(uuSFT zrW+Naxkb0gW-_(2pap5`c@rIWFvTtPJ>C=&<8+g&l)`T^!hL<6s}9<`RXnW_@APAa z`Z6FJ_L?UjK;HHgzRetnKwz!ZTHU1^Sn+42xL3yGbJJ^ak=ER4hCPGtin!NcA_8=? 
zJRL4&Iim}DIBE{WsOy<7woDlBGL6t3<}^KU6*k3X$lSPVoYiNNwP{JDvBL|Ye^nWA z{ZT;gsY>`ctKPzU&I>unVlK3BeR!gf z8A>0Gd446KaH)FHl#8Y;q;V-XRyPfqeM+c|HKjPzEVbw;{NmVzuMk4lzVV8-)O)M6 zvZ~Rgri}C#LLAMtBe4r^>Ge*m!z<-}R7Z`s7TS3}ge?i_ab1-)W4D--RNd&*WCya~ zMgQABa)+g_QkIU2755x?*jEVoJS~`zWZ3-?QFNsct)Ly|J}Bk@%M0)l>*>(MHyes` zeima<$S;YW&gg@N?B?){3GUdALStGbn;%F9KM|_s(jK!}t6BCo&fHtSJ2tmM^))iK zh}yx-B8t;tx!ZHKHM&l9h3C8voXd`|}?Wx*V^21LOGh(Zp@rbZDUI}v0`*5Ka z%H?7&l}A{aw=)TJ2}G9#TL`-+qLlKye&)oI=Fi0yn}%mK@5=c}q!Lgu#fF79`UWu^ z9-#yHRJrs=*16Ky)=dTUKU&^(m*&f}+#yI+>a4tFLZ8aDJJ+j()huxeMT*uN^33cf zlV+J~myJ0=pfB~EIcJDHhwvwJg@q#GmgW^SwoZ$Oi%z^GUo(aUn(UjReG?uqCHu`8 zD7HTlP7q^BUYh&)DD!@TS&U(c(~~~gQK#?e(bth#D$$$e$%h>d7B5X9I^8Nsu^5kO zE;^GrV8`3@Eq+^-tW9=oNei-MF=ZESEgqe0jB84mIxZ=u_Jw4`a92b}eWW^wbn`nLG8i!tHPD9coW&geM&8IMaSY zMX5~RD-_^48caPHbh?Dv^-eYvWomER(#Xsi%`rXN;ZsLONrzDpR zzIMfw+nMeT=UFsoHoQ$Me+*%}2UB7(&%>C6l5RH-7FKhH39xNUZzn8emM&E1PYK4> zWv<*&D{>xQ+1Sk8t8s|+&VUR#RwL{^$vM!u;XsvQk4_*R(=L14p^MYg?M_JLqfrf& z|4_6Qr{xRDu{>X8rCm}pvwaUby1CtEx|T?~L2LYZI|BE@3z0SXFt^IbOw>_~W_pzk zp@@8tZQ6?*o$?&rYwG==XWL)ab5Ic`?rtluBKU$sbt;#+uX1nU-b_xFm(m-@;QHZ; ze$2sj3H+cRK1W+IN+(<+nkKx)#HC-_Be%AZDeetZyQ`nKF2$X>VL?c0>feD%nrrMn z7pr2m^8+Rn*5femv6bw_)c!ojqvONkbP2hj24T0{yDUk-190anMNuatG^b;{3eV#z1qkX{Nu&AYwG5B>y7`Lio~O24-U(Ok48iD_NKfwV6A>=454!9mxjOj{9O{EB$til2%ANEbl;CW zc|FO>u|Mw8nt8}hCt658U27{XlwaRUe^Xum-dOQ+32MFjTXbxe#%J>CW$jHqIy0)J zgvfC!jznK7)0B!lKDr?z=Du|%`OWGojK0c5tn9;PaQ8<+{l%u@^6d%dL#(R~ zSdzc2%3>YPoSW815V$qo(c{q*_}E_-pEg9laG1lDR+GumoOC;-f2pS^XJ(NhWFdEY z3VXYG4g;^thmiK3j-?c)U)Qu&*M@2!b-g$?UAJBcxpOUe=y3=Pqm|0OdDzgkXG+yp zazV$eP?sfx1qNpd_0{iJKt@0kG??$wvsN4RK&I9e<LhDLtHdOFIo+G(u5W#;09+ zuzWq%!@C6Lu{>#CWrJQq&K>eWX%ebZ6f4qCaF1qM9G(C<`pkR|UOwlIZDpGp^;Q{UchYIIxJ+O0_P%J)PMAPh729E0=NYNywq04LWt zT%b-q^Q#zV9Lh51KG9&D#TUs@EGs@|Ascw}wFz7@o}$*?LtnP?WHfZEElE%MfjQw> z5^c5Dy`N?hRj>2+os0+zDfZ~t4mvtZCOCWg^ZK3S-g`10HM1RQH3J1taD7NGCS_=Q zAu$?0o=RFV|GI4rn= zn9;~X@w{b&$~`GyVo~fl2@8ASq%bmmMj&2nLw?P3T0kl`(N@I&dfA-pqo6Sts7Cq= z(;pb^KQVuM&V%pbBhnoxcbs``=3ezNSC~`n|e*`WC zR}HlkTQ`S7Q`Dfw>{q`yphx(jNAXoTiXxn+K6Mwam{2EK+-s{m)Z21=Hz6ao7P-pS zvjh?76-!xiQyN>)P}#yfz-{Z(wNtJOLkk8O96COkI~MBqKaqFbI--twDsGu9P}%i% z9+e@!5L&(%M8%W(T_%iJi1=1>2t~)**s~WCg%ORYXMsISW#{8dy4CG_3o^q}H(_n& z2Q|Z~Z^Y5(92VQDSh(C<<;|>F98u5q#%SY~ViRyfv~_CNa?O`ZJr_5&Di)Ls*-$xI zbF_R(57uJA*5G~PerLC{{w_qeAwg8gs1lctoznx#sr`>LQNShDvlu_Qw;M9 zJ>O}a4I#PQ3{>QTfqXTe4B?ILSYfn~l46v~Pqq;_AA~Zs^9jvddaDx+N3qk1%nUX5 zu4-{B-qadx*en?bUsV6l)W=`9-AuFFw>Urx?S)T zy0rmexu(cSs=jJ;$3vwTx(HR)TZgtMAT!K&tSgso2m<}`A?%TDH%&X=H! z9k#5cr_9g;a_!HD;^E|CQn^-^&ReR|BY75sR~=)!l(d|^psBgmH=AG6l6QY7)@}ER zj3r(3(34((iXBkM#+FcATT8gk(&(5Q4!aRuMAN-CG=J@#wNYYRrL?3qraiW1xzuAB zc3>K!_Avpi#K4%;-t+MQ{U zs~&@c{_QDs%cYk^+D3s>US!?PY=(eXE!q%xwcemB+R&Jsf#sp>-ktN|d9?FQ?L_X|#N-7_<1HX5& zd$hOt=)q=XQJ&z+#=+;5l|NR-TmEm8Cn8j?_dR`^^0|3i4jdNI@;owBA+fYkC^b&h zY}c)hoYBti{FL|lG&s@s=d6sl2W+KW$yiy&4o>Z{Rg_w@_3G#w!~VAjVjSE~GdXy! 
zSWd=wSzeyiR9ZCCbUt(m&P~n0ylqy- zGEcG6na@2&ob6Wt))Eh-h0Ug*VnOMA{FIW4&}Y)-We%kp5bhoo`)1)Wd)R`wRFIQJ z200^?A!m0a1{H&NA;QYS$QO0DLcYtRDmFk~I=R1>yC*;r!72BFh4M1Op~&0xQIL~r zuk#6BQ7Nrp5;mpABKXmeO>%{(lwWbEbf;ze=GB=TuAZ3<@IQi3Qx1Pd>4kU3$Z9^zJE`i3M0scXnCZ7s2~k)ZUoGM`ui{HYNMvdcIJ_XGEWAd#e~3l9tG!)E*K6(z3`Q^$1Nby*y*p zVX3Y5srPV9n`JwHq7Il5P|}QgzS9VRv#x44zG<9B?07C87HBsM@hC7vGb`q4Ms)SC z33gZmKPT0 z9%aOWz?5^QThvEs-8%+-t3s=I{k-;Gw-ZfL(Gj+WslA1V2Nm7A%G#VH)VxAVleX4m6Xt#~H(;gw>#L)}ecJFMhThX2d=iYq-{C{qdYhjM z$l?PGgJIzwy=nb6EW^2;gJQ*~1IUVD#Z@Kva;wx_(NPXWOx?=SF$yCY+$OZ#cOsY8 zabEP^sSAaOr6Cyd*08ASDnNTc@+nQ*zl6l&U)S0`QwM(C40UqrbGqn`m|VBm$@gBeYz}J;eJ;-`2fCm?yyFV((~Oy( z+Sv&ZPOdaw?VD=%=5SrD)$%=n6^*f{L44++Hrb>fot|S$Y*3dirB7l>+PwS`PW(HU zQ|Fuz8Us(F*i+?RH~Cx29gNUfJUrSvakm0RlW2-8_Q`;O{~gCs|@422v?!E$ky z(Wj=aB3{wz16DxMQ*T{0i$z&hu9j=Je{)+Vx1@O~E`ei&Qv*)+yu$f?*K zASsU#6mmp9kk-j}%#VXF;7!!%d0muDERwB4<{D(N`kW7|7M;VCfk;tTA)lFt7_G)5*xqfDeZ2SZVTZr%H`Cu!VT43o6)FlU7zVFN~!KY zj5uJmH}g~kKAXO5Ue(Xh6w2zo+0F{SGlvR|k^31xB4ejHZTL1_gC)!phL|arI>nOt zF0NIoIkgmv-ZNDcnRmxTZyl69Phpml*2g|*bYA-8VRg{qq|fE{jkxLH1`qs8)B)S+ zHtIL9B5(vj_>r3|_St!=xrj$HPq#WDDeNJvtJ7QNNn;@SXmNC1h+T@{^ z$bxe}l7hJ#o?o?`J={H*+O^@1Zt;8^Tv>7#Y=}pxop#>><*)3 zjD1-9V4Wijftmj}w+U-RMGR*)_ju?WT`LeY$Bk{J=Jp->Uye*VI8#D0wxLn5ZI zJSRBYr%CA5I7N>G{E60bHcADVZ|pq!vlr{|rx$=2bktpG+8`0RyU%@~pt6vF-yJga zWEus$4>xcXJBWpBOe+lc)_Kn{tY~2@dc8IRNjxa6YKO3&_!z=TO^<8ji|q#!i0=hW zA-Zqy5G5-v)bQ<+DBV{#%=xV+uh zh1Wayng!b45j%h_hK5oDH{qVSx7T#4Sb~5s@5bZ3>B<9~mh%4O1yAoomv&v(J%5dr zN}wNBc*j>G_N}t~;>LmV$rbg5&J7~q1vDi1P9746s#$I{HI4CvWcB!$j+UICVdpD= z^O=}hxu*iy!hi?72$?Tg$8_E)K<;hLeM;hJ$cp>|E3<;ptO z$_@&W?t%oqQ$Ewrd`#yivFwhB1>MS@bYt2<@If8_adn_2=22Dv>UoZx*>yv$ zw^pCJUSZxIvivHZ37?){+zc7hOReKAzv4>&dNWU3W?3H>e}!wpZ=@v2sAU8dV{Y%W z*z98ks~2H;S$2ni>J>tB#>Qxr)E==rBVtw+&wnsf=1J$c^iw+I`614O z2aa(Gd{-T=>Pl~6_YB~sLwTK=X7qem`{7*U<)wxGg5^Wgg~3k8SPp1rnQ2Xu77iX; z)Pss>W&p8pp!+ixFbt-$Ij723m!zsS9}B|Z_w@tom&kzn$?DpKt4x`z>e|kRRUV;# z3_SeOtxqP~Y&Mf|*uW}Qz!C-Pi>N^IotD(Ka*`pQ-ToG|X1OA0tr&PX&@ATCQX=N? 
z^Z^{zdRLh3p~+Ces7q6y&U!-Ol!1A|(>thltxnPjYZ>ri#cW4!;)&`@dDl zDy~xZyn2#==z+l~B?jV*lV~ua`!)FgyzGy+a+bE5fRoz$=S=^d|GX#syZqZVL<$Dx z-|*+pJJZdbx*Boi65|)4lY473>>O|KNZ@bR`}YyZ6O&f76iNF@91(0yA3Dxn`TZNg z-^@zkzZB5l%Qt-NFEf65@U^y@pjr3#Z^M7T)e~?jQ`$%8CpNu<-_O$8`7hRky>EQ+ zsqEkG^Vi>wM_kd+vU}Zx6~hT7zYAJ4n|OuejCdfqW%%}AZWhGnk->0)BtbbK<6(1l z?|BVH3Hh%_@X-{mSp3_)kO@W*$mD5jO3P)QS_geK*GPdZqXUIW&c}94de~1zcGb(Y_QJ+pn2Dc{~5wRKKa7<#o3E2#(VU3in<#t_P>Rl&;3t zF1FWNl^-nGijT}n3usj>f~FLs?QtS;!F|xWxiZx3@!J#4Y66=8?Bbhb{O6<={!a1Y z8y_yxf!5?N6E?0@!*-wv)MjdPY&``>s|1-#2j>IS*?q(u9iWdc;5%l1_LtQv~zR#0K9++fL`pCw20|-uWx9uhL;nGBQ-EMXe zC~FvK!$h|R@LQvn*X&~5dSz6m?JnGwIyKsw>9ij>Ho$OyFL}iaYlru$17&= z;}>k~UKH3xk4)!87Q?dafg&t?VK@J1H-A+js$n(HwZ?LEazy!WGm)smFQEQ@iUC(R z{)O(63xXqt0%gQ^yD$kC62h>hWi?Ab!J!O3+H)$y)%HbR(74`E?_k zd9HNeh88ZkiwaZ?b~a^&^S*ilX%^_Jow2Mn)9C>GaN*LV!_!A4MtCnnE;-EXp`1y#bcw^BLPdZ&X%%~0$Ay*S$a1hbJ=qyilB z4(97PBbjW$1Zbb47u)R$4@+xtG?N&*qrcN8s(Xr&uIp&a_K1If6n!unfjDSHtT7sK*4;HHs9l9nSfvS=1x~WEeFu`y*QGy_q zW4f&-T>}OwO~_^RrY`zk^~}BegurSHn$NmNl7211Uf1^kU6ZK0q&Y0S#uZxOA*hy&m#J%h?SDu9#8_(nAq= z-V(Q%8|9HV&ea$9lLNiRp^fZtl4f)A8bx`J_FE^|+)CR#no>SA5jn*q^#)aUxT%d> z*@_FTM7}-(X_fOH$K<7Ow}$>ra@B$DAm8nv3BJ^w4xjLZJsG%cTk&I}&lz^n<}s7t zpK6EuBzdafp%30iNKNKO>vzEQHFe@)RvxtQh;UBcWM}IdbYt2<`$Jz z=7W4@g95e3)?zW<6=9NEqR@Tgpd-^ko;FR+_TUR_deiK|3Z3dph3%Z#`u|}#{n7{RZ~V!s!gRtDN0&e+mpH#*~vPId7X^3aj)@WMzFJM48h0C%?n^_hW^ z&fYy+;v@cRtNF_7??ejqNOTGVCFp`c`oPaP-<>#2@4fbvo_TA5idMy@ANbWgR80^Z zCkFZCsOUF5GQ)3d&xm4>mGQ{aEQpRJXjVR)QeJj_+60V{ik$sto#N{N!#Zpmxs&tr zlZ>c^jy^ncNUr_KwQjqDz~zaSRvW@ ze)N)6#L#j>cSP07ZP_tZcz+*xv|j?}A1JmArokh#fsL1Tq-@6RrT#y3jS^v$4hexwlSqPjE%tU_?{T&YWe}z z_Ym9PD}%WpyuAYQl7am9s>+=P;R+Yu`Lb>FLX+HPiV%8QaA@MN>f^yy43-o1D{230 z*?uqXpX^(Np1B9FXKZ}cL8GJZR^~qI?BLj+Y*Y`%$NYry(S~xfQV<0Z{qBz-(z-3n zI{vG4Ta}X?&%9PV(Vp9B;=3%TgKCUC#PAb7v}de0;R0A(*CNAdWpZ#jKrP_~sRkmA z3Z%CUKHZ8pW0_3I5Y_Hc6ZsDeLUV{B`)YzUG>Zp+EJmfJMs_#(qS{S{9kUW2} zB(x#Inzig;q3&=Yf>;>KJjix8ef#HDS)F%NRGj#A%G;^6r8*0fIG+_{Zr5LU)q+1hT(;RHg zv>fi%67A3=$yvP(7pO6B8Ke>t{x}5A@D)k#!CKpT4wu)a?q;P@uk7Mi=n1UPGPVir zixKcK-Ucbqfk*Iv%GzcEXERC2oAugwBT4B65c9JS5Y#QFlh$it<~~ZIo4pglb?d_U z)6r}(yc}Jg=)JMkU1Fj&=8lT3Z#Jm7@3Lp;y*APqMm-egHjtVzRPhn9$Qu=mTQunH zrd7yU_qSDSHofO>_meA&*jKalJk(wBK+zE2P2ijd+R>53+S`VHE{U?8%D1W#M2302HewZr*8Le9`bCOlOx;LleM@(d z?F`=3x?YYo`K}072uZ~imB59hy5JrpJYooQ6l5Xd%V_WhQ*pK3Bab|?j!(<3U|Pt0 z?u(`s8%6qiZqWti#!#m~TGtVKV-hw6ktK6wn!3B02CD_(-c}-N%xQZ*8#2C5+dy5# ztMtMSINiKG9pzP!qDNz7pL}@pR~3{^2m}Gd55Y1!YEOCV zkFhvj2vR7C#trG{2(qkX<}Z~1knP%2G2)S*fdaIL^ZAGKeyTF#T$L9)^;?0$NI-K(&DysNM_}q!s=ie0Coz|1kUZHnb&H)`>Sr~Ga z;n%J|kR>6#kM-t*_L|Pu4UcV=f<)Jfft`b8SM70Y9WWTSD#s`1SJme?`!>mW2Ni}cEozIr4&EHiBBl+Ay#S%U&plbqG)?zm^ z`Pqn{!(lxFrd6NgU8)XsvTwf%WamT8wyjh!I*X}IJ9dEvNAaubyl+jX9m*EgME;h; z3%U{VKeapCitmchd%QEJ?7RCj2P$gmHYcxNdBZtdd(@)*a5G=r<#YAVS6sBV`NY&? 
z`25|Wpjl^w`&A(9WMmcrZ#Q3>n5}(?cIfG#ju+J$A}s+9Oijg=5t!}f;3K$H2r*Z+ z9c-8JA~J3Pzg5>G4;PW72}Bi?2l%UJb zgQ~Jf=M!Or0=%LNdi`s)u&yI_+Yu9aCh?Vpd+}h`{Ipe#70veF&Ct1X2WnxKTrne+M`@3tLA#BF zfm?G8eQ~q2ePzTg>WHFY3}m-&U+@&x{;Ck-Os8tgs;4x6$uMB)73JDHX!GEdiMFyt4ZknnN`D`gbF`#kZ^}L2I;)#J{^QFk&KC2 zT`1Xe*r=DCGKDYTQ*&@AvM(HSU5#W)-m%2I%LtO`ApKJmR8aK9*EE$TJ48B?s6S{d z;C_|Hq`LmqtaL!+Xe_fzB)EHu;Me43Rh3~f3EopHEh#asNj$&LKJcFgtIB&FoUKLaFs^`*% ze(5s%dSAZJCP=-{IG9LCYX#F#ZA2oKlb-hksGV1@n0X3++t&MR$12W9aqZKkzhNl; zQvPo#X$u2%E=~Z|%~0jIqfNUe@yB>ascVGRfKWA?6XkjV_COpCWZ+W^!i4u$DIbcZ)A1xrH($+LhJJ`3SwRYZm8}rc_Hzy8WF$q4rqQ6Jb#zc zP{9+Qzq?qZfBN_<*Z}KD=Mzsp(hrVR+B!9=;?1ZB4JIA-mKG#i$alUT23aVyS^o`6 zp?yR^1H=svO=j+-ly;zCq&q>Pi-y{nqu7ohSgZm-F749Ro1$zkpi^cDe=X~(*mf8 z(TrX#I2lu1$=M#r&EFREW4M68(tbBnX8(_QhXzJ{*J1|R3d4Q zC7Ts*AQ-_RY9Okk#g$_X&R5VjMr*VLsCN?9E2zFTj=2AxkALO!lH>AE?F8?-)jT3t zBEBV9oXT?P^52H&ujP97zQK+d(7)d3Z~u|t13jCa=?@rx`~1%zeXkGPBYhlM z9sahw{@*YCz#CffWs#lz^?U#K`Jag6>@$D_p-0}p{`|tfBJsz6?g9_X9}2e#AN=TAee@iWn)1Kq3?>Ol_&iQ{UH~uKPj!T~7ito7K{rAP; zxZXd8JC4zwW3cPr%8FyG?-=SlMtYBd?tjZ1j&c5Dn&X(|`8SyQnB+R<_>L*gV}|$N zU}_1!V}kpb;65g}j|uL7%N&jg?qh=cnBe|5Bnr?v#|-cP`wXv1dBmkZy#W3z?8ms@ zG46Ma`yJzc$GG1=bphpMgD(@@w`^^kjb~KjU*VL5@|D zQR2Y`GLLg)k51g6Q2An@?W+Dc@AUBP9G|a+=O@p{g42Pto$u@H|Inqpb5@q;HQ8Lv znK`MmaNYXxHU17|1%yUKPXir$Aqhkl{NczuR)PMfGtl*Qf zm4|;geg9*VA$-KbIPo|!?|(M&e9H5CDe1hYH1)r5l)D6%Ni*jD_d<1i1IH0Kj=*sQ zjw5g!f#V1qN8mUD#}PP=z;OhQBXAsn{~ts^!q{1oQV0~@{ks15U!2KkYHFg$UsaC@ z3=EviefD2W*RRt1tJaM#4%2P@dzuACQo;GA?XlDOxkS~PX`qOR#`?l{_dF`z2muw}%{%+5Lzcf!GG^7DF{_VwjLDV&dHx2dO8S-x|yc@5D;RcP)^FD(Z;rox!Y9yZvvsWMEDCdNp!8`0_;~={2UO0k4VfVfJEU zuNKkw|L`RQcXd{;2>(-3jF&GoUTkmazVlBYAe)$ik9M-h&CYTA=fsXxkgUk9Q+F6A zxW9~P0qYtGMFO51cC$Y|xR>gE{BV<(_3wTKx!G5(jh>kSQH zCBRt3Kap^f$RJ1P8V%hQ%S-)EAEN&WfjlLD{ZcYCzd=KuLY_UhsfPMv*R=?qWUVZ*PWYYvKQZio$Pso2g_j3!2Hk{3;9Rk*WdfxP^Gj8>T{^KvKtcqc(om&44 zQ?EW_ta|Wo*I#~p;f+zHo=MlgjIw(yWX60v(#GqvyF2oNy+~C>{M3+R%8-{16aNlY4jO#belBk|xinekmS!QOCXRszy%LluH)j+ zD*gg@+x6&HSvd^4JMB!A$(4LKSp0ky#!DM1f+{12vLyUr4s_fSm+%? zCyD|pO$fcJG^Hv*LI@B6DFH%&Kp>Fv?woVay`yvQ%Ll%Yy?6H7YprLk|9aTJ{2;f{ zi{A0`fF*v8ryXOxk1s5u5ZN^d-iN(w3X!T^=4#sMBV+Zk56#!Id^l5BMv_M zSV96c16I%Dp6g??X&cb~@-{-VoE*5ZGVAOBb?dQjld0mu3)jz?6MB2okgkb6+X!Sk zEjM_8-@UvSot%ZBEy61}8r6S(hY)a+mZP=#AxA5d3bF%Mj`p7i{5RY29#XDxNWU<2 z<{NH~iQvz&H-M#FP7¥2IiU96MO^PgkhrVhY2!Jj`z=&^+hQXqA$KmdhX;ttyUC zat*Rh_Pk@%(A|8l?|^NJ3EC7$z23IoX*A`NR5ccaet08r243cfiJW@XK{B^t)e`kBoB`j+ zi(1c!8&jY!(@UKvJ~VF?lr?n{2}l@cqj3rvszYB#;(%K%2|<_j4Xtddi98FP^-Zgv zn@kz|iww|R%gV9T1Kuiu*tms)*@ri$T$?}$iR!*3{4z&~ZJ_QabNlvPQQ{%rVzPK#?c z$n>bCZV`HbS@e3dPhKhYjJ8ziGigvb*3pwvR zS{7+MTpm?38|3(%X&E6ZTFX&jMF1`SNvRa5K7>Xh9k5>`dr#iUJjWraSTToLj+TJ? 
zUXnU=ovT+G{j7$d;T2+v^9%ZaEYn9VkaaMhRPETL1l@V)iqZ}Yhz?b-nx`N-X)(y^Ho~)Ln7oO8 z9E+)oQU#f~R7M=pKU*BdNPY;@+e;mO?c{CqIps86ec`haC98*25PHu*JJr zeX=|4{I4nXrWue@y+9Y7fMBaJH6+%CesGuc|UlxuOSBD$cu zV|$zH7)eoGc<{K2`!Vbmx2v0K()od<0)KRVKyJ}+rc@?N;Cii*@ZDztpWL{7GUYV5 z77r?FK7kkDc?wh&5OJI{LbPPJTh(8_Wkq*;I8#8NpsqBz&sB-}X-y-=vcTpxr0o(W zXMRbb#+0dR?voEs|Bszz-I;VbqOywaGReWuNp8-vr$O7H`c1fqWas|XA!|DYQ&r`1 z3ArSqcN*=x5z-f3XH}a=3hAOdwk{1|m*BigN*4%W_f&)KQ!6_>Jq=;O8q^=Q0yq1O z-6`;xL|J`R(+9yYS^j4dd{386FQg1{NLnT4$(SO_MH8)uL=#=Uf*o#5YU^?fYvomz*A zL*4RMpV0RoC(7u!Ao|onXE?I`1D13z-ZgUy=FA4t5or6`M}DD!aBBcbSlsFvic1~% zl<{&bPKTC9`Aknu4ckRmsbO;^1y*o;z0m^?K%$FwoHXYK)+>07uZan-rZ$Y_sO zN!ppXKItv$rm|ij_mT@7=#O$^j$1?YbWh)2dX?0QZ=LJ&ldTw63u2DzB)8-PtGV~hb0@cr+zZactW5`PgP28ga{~j= zqw=3~6f{@fsg?R(@Pn7RBaW2}eNr{18LKWlml8wdu{!u_DT0Er9lb`5ojj0hJ%;_h zgSjlX@N-{ChZ4P#KPe_^kqM-_o zCIr%R;kVb(Uq~gQ)sx@JqcvyN`?u`=xR@zEs!hE~JEgy_$PUiP&TA)R?;Pl3b6F4i zpJHkSZG5IE&lHMbqxU8wP&2oE*^1c<=R*Ty>K}t@}bQ=ZPXqEu&yr-gM2b;Vo z12a{vPYl=prRuq9cF*$LQ62?`zfsKYZIwJB55*p1)!gwH={%4pNyKkYD_JZ<{^&V0 zL!@Lc;K&k5yvtyWUzbbd=%K}@;uWmMnJ@?Egc7O42MEc;%c2Dz-jQ!{Ci>roq#W75 zt|~jMnoIjO8j~-nf9ruMlW!B$7jNK-tZunj6SQE~S8f+7rFzE?TpUBnqr9v_TsfSd zomX)Sr%)9|hSb7RseK(g4rP8@Wzz(RU%%0n8 zl_+ow{2UXlP!kr_*l$ofZ6(g;H!*dekCAQRTukT}Hp9{7GOQkXNxfufJ zLT~5kjLP`ZjFKCm^%NY6eu$tz-RaZN{wf@v5CM)>R}J3qh;$KW6K$LvCqMQft}`K6 z5Ml`&*#6TA*Ut=n5+qGROb&XC%U|JQ3Skmg4a-rkdm@WCN%*Hs(ZR7JHmrlGa2OsR zV9@x1R2RHCXY`}rMQpj5Af>1zwjs6=(RF(-+isWqBkH7aJ3L)NhqhQ*(}$V39J3*V zW5Vhvz;T;PFK3HpX<-D@f@A@>Q}{+p<&6HrEKC{w&2E(cKkdlLr|mQOPN{;YKAf5D zHC(Uug)i0Xrj5~2-#vf6%MY&We8PT>08!VUX&{g| zZ9Uzglg+$xchhelS~l)|_~i3)f!V=7Cv?hpuYU2dFXGA$8uyXlnC>zv$sV0z5AiE56nd|E}MGP?~I=J|CkmPIYB+Eu@v zjB05Hs4&H5S_Dn$H1CS4reV%wD9VPKXSvVklK5SqDj(!HTEkUdUN*G+=c6;md$SRc z$PIKkk)I!fe;4E>taO1&p;Of@9`-~g#&zK2$d|2qIZSIac9zgoi|PC2?SMlI=Zrqx zp9>oqbxDLwsSA9^y0r;yFJRniYS+KgX*>O78m~gf{?*3w!2?;+8LuY;b7riEzBH4c zu8r-Y?&eAG^E@c%m%g`j+KFAfRA1ZWe8G?2;0Lf+b6UzLe2A_xQ$9K#+A zYPc2{_&1p8h|N#Pz6qdrge@)OvsLgZD_7Y> zsNf$7#a*h(YHBs>WY=ndm*&gX^uQTC*0*-v}+^#!?1!g z{dHHZoFiV&!aksv`mCIl7VS+=&-Q0t?{hd_iM389}6qbcRnElOG!QHdtYKyCSdXn2M6( zrA)@Zr4MJXnCfzcL`eKnkwi*<$p$rR#Ly@cNfLtg(vg(DOcT(Bpd&!~8<6?(^|uJ4 zuM;kRwp|wtz8PZ#=N^>d?iYQ^-K!Q7jhAnn#J;E)d*&j}KLrMh%NIv)Z~2gVUhP=G zL731x z=^IR379WkgsTzRDv|%Vy#N5NlPWVJB$bB~)N*;PL-#$s7EA}Cm7Wg%H?tfy)HGhvN zOO7K-f~zR@e`^}CCcyrj(>I!o)lVlQZvFjp|FQ;%J3V0KjotDx>nbH(?5Fem-CD6z z$HfJ`f*t~lu>butU!DNFczD&%{6l*C&lMY(ud9zQ{gC}Xj@NsB3J{;<0|fukiT|gz z_u@024siXIhW7){WFXM0Q$Kgj3@1H#>%E;hp+iD2(tp|H|GL`r+T~*1^T66BVAX8RC_H+0`Up z+0#1z(*3JvmE3MPVb;+IB*06^*yudmP=+4tDEm&;*L!>|accZgN8|DE@bhy*V^dzk ze~aDz15hg~+douEWK`X4IiR~4rJQ{RlXq-|O&4MQap@iNduvrq=KUfuQJQm)Imez~ zX^`tfIQJ$300*j+Bfu2E0dKq~ICkT!y}Z0kh|8m3JQ9%4zIE=I`2w|OgN8=ZdwW1~ z{X%z?0eO*IJZMC+*C20#|H-SkJHV5$cg^6dZN7_m;wJ(W+v zlzCpLx$3orjCb*^CFZ4qz$F(~n5_O+6W^=ine=U%83XWPunQK&cbg8ATT4b>JKP}x zM|pR$Ui-V_{zWJrHQ%2hD6J;94hO&1@=cQg6qbK$rT@#3&n1s8 z7eoBI#03)h0KwB&%KO&yz+u&z5QcBr@g8r)Mjvv4AD#RLTXq5@_ejcCKa#qK?ULVY zHu~B9-WeV@1fCIT%Dbi=ueH%EHVoJ80?)^H@T|5ZTf1K|6!>}zq?)-=$pR0P`|@Ra zy1j1AMo68M!uYifD=@1h{HpHl+_*>C-@+v;ja+x>Oa81-sN#J~bFC6FMjX|} zA`)9Qv|t+fs=oGW?o|BdP@WR9s%bVKKpz!Tx{b3gltM}9awCLTnR!vZ?x+1G`?hG* z`1nr#15?JvW@hrzvQR^-u-e#&jc>+G~t|Z2t6vJ7LJ_PKH zx;bNG0eZ%G+51i#mRj9S#F84DN<{#enP--nDsD3z!}@YqX;d45H=1Ql*YovuIBj_; zpV2zR9x$&1C?n6Nna~RGlaAM(9lH~QHIZ{qc(+l&)6@N?S+NA%?_iDe>gxtFW8{Lk zzG~&n{2CzGUB)*rr0%{p{{@ivT=^)*2QlpJjP|Oiaw6UV>oCHc5pAti^VMw|CWrVY zaIsH(GlQi;fr^XFLx(c=0V-1#X(kN2^GPf%*a2M6?<;dm;nF!JK_}h7>ya}(FFZ$w zpT6?#S1n~RJOhBeCjMH))#jNZi$)t?3AoD|#}t0Oym{1~A<)MjVUNcFMJ>c9l75JV 
z)9W@TUDCca%-A*J0K6R-!1V4CBVEvMtD7FG{HDaq* zm3XEwKn1lJlc{OX1jzMiCCZaDN`&y>@ zQ<}VY@Ay1nhg9msy2-uUH)c44U*?0T3kUYF<9d0)L+-Z=qYrQuj-34r#@&(<4+G*N z2R1$rySwUDrVkjPA4?o~|HlsIbR4_>*|GjJXY>4YUGKL3*iRWtU`jPVG~0bAV{cAH=tHf+`KtT{8&i(23z9&Fb6+0tIA&IfMeKL;Z}e zg!xLXUe{1S_@JLY@{Yf7mLyjshn4s_b5Im@=LHpI3T}1|ZKcdgfLf+o7;B8>QNb*iZjf|v710uJy1>_%H5WuXh3gPyj zC69!ab?nV*`L^Q9aA8b zZ3yB*`*o=&7#;_oj6(*eXRG$-G1J;>J)iw)+k~#cn$CmOs8Fc5aI4H#7!3fm9iquDohSo zKHgqn6S&s@J{V(7x!3U{({w5t(8Bj-k6^C1zPSG_xgk77l8Xj_6g5P8XbjL{@St4N zt5E=T_PI>O49FjO zJJUla@5d>-j}&s*s5Su$i#D09UlbS6$M62**kUc$Sf+%jJ=63;cs2teNdEJ1x+Mq+ zJoBtjEOakQzsjx-HyT4DzRl9yS94>&R=CCO+rL{Y@3!MWxvBy@2Flyqk=B5@k5Qhb zldrtn@QtRmIR>&U+>huP%JH7BG76VG(Zpt*J2}U& zdHH0;cL2~gl*OW$LIl8mH=?yquCi*bJfDbU1xTS$&uOo+%m?5<(q4T)b&?u_zsE9# z<^@PyY$kLbZc9Jkf@&wQCp^s!*Qy22nUWbY8j%pv+v;Ra(o#$&{y0ECg%W)-h3ny9I4AFCDv|aUq4-C zS1bB0k`AD<)V3d63qSr@TL9LQtxFCRJU;^2cCj}9eIv%Nz*%XzGkg9>+sJA-(18;5 zxAeeO5~``G95M9JX8|kKczaBc)9CS4SU`Lp%kB+Tt|s=7kvr`lZ^Rw#VYCREZ1F0m zF*(j|-HQE!NZOY_p6EYt+v(BgZLxl`F(s7b~ci7uvX)5rruNe*QA)JiTLj&TsCShjaTF(iaVuD zWZ#TWPZv514J;1k@QN(g7|1FrEdanu6WSfX*Cdq_AM5*Ma({l>C;>CMIIx355+Cc! z2HwmbROr&SP~F*y_5$g;L3-A?pZt`52PD}bB0Uc~x(>e8w4UDhqR(o^;9^?6g&TJ(*xz|zR!-zu_)jHS=I;O8e)5~c?KVkb4MT`UFI6#_fMH4 z1+_wbbWG2e1@P>yx2{!Jg>V&I=x5p*H(0M>82{ujK!V%bM!m5q?{uq)oA@J=yabR{ z2WpbR$6prhGwQ}p{@|^o`|h?)DG>nJ@RNniMh9L6)B!Ip#x}|ke(Kqt3m)D2a+l($ z{qV=9sJxj*#Bt|8bzrk8E`jE0TX+S;hBNN0s^k_1e5!xLwQ0r_^5Yq(d4r{fmN zldso)4l0)bM8?b8$7P%^>XV*zbw-l5ZB6Z!PkXLDe5zk4V}vuj-x3f2>y7%IA#GGS{)~+PN*D{w4As z3v?7r1D%<=Ao=kC2*$x2yF@6Db4$tffF5h}TE7QiliD66P(9?$QkK-pf#T_Mno6cr z?bpyDulUiIRr`zYDmWnKJ;H2OqO{yawhOgaLJ!Ags9dI?4uJGy(FJfl5Aifi|03k4 zo?%5KwxYWAVoWZhfP|UClyb)QGNHIL-8wbxt4pq$2q%x2*psg$gSD67T0>}xW?b? z{Px@F1mkO0ZjS04z5=?I#jbVgJa5;hJtICfzq{q%46P;HJbS49+lICNhU_tiNv&=0 zTw8Wi+YY?1{~)K(OcAy-8v9c-qStbKKDKfGMaSe#@|dZ@FnuJMdO{8TAd7OpbH898 zSlmJ#VK)=wc6IhYuvW7__Gp{9Z0VxX&}XwqJX^;i0|9OC9wg|GN%Z@V?^1zXFx*M7 zS1%wU&kO~(Q|kBoM%^E&qYu#N?JjK-F@k@%7M3WW8bMv)^1z1*wA z;p;+>pD3vYDAX5)pgUi2gJ9Z29lQ!GGha6*I45HU&IuB63au+8H*3&Gy5m?!`;X20V^RmyI=50N1X^Riv*7nN zFg+5Jww=ev;B0qJxeMWUEeqLVMhW#ua}`1hU(7-y^n2}O|GU?t-uR=?{JG861pPLD z=|PceOe&Rki*L#1PZl&xWi8;WPFC$g74i`%hkTloydFI1n zaenS2Z@{f-@0XYZ@za7SBs!Uxm(p?RPbu&vArpp7wukldA2R@2F zcq)dcz`p+vwW4PpN>kOPJE85(uvS!WkP`+1vYZS;5M{7j7^{8`=`OD-~t*=lfQ9wnrbnh8)8@|wI?Y#JS5~4_Y9-l`odiJ_zf@8POF2!0TR!2gx zQ3!L>g$te4j<%m*N$DPYz1P)>zCMwWvgTiKt9|>L{X^8KjgMdvXanO5ZD>OK+L{CE z!27f8@scN|+?P<-&z*IwIL<&~a)Eytk^=h)VLwu!hl$$=I8_$3{YeZ*gwvz8tQ?(0 zfJ_k#kkv}F%r)P19~hoz#mq=FH053Mlzbz`Iv1HiZR9tXqzQ(rVka&{VTk%V?S*nU<%}m9T<6#=hWj#{Zvyf{ zcwiW{mx(x=ZjY!7cvL!3q80SwS6Pdjtr<`wtWW=Fk{i_~-oW^;k#sVDV-gCsWF+*N z)%DjU82>Jig*@HQ5Bsh{50SC5jspMGLe8<=|08 zbrJ9}TXP-q4=&?Wl7KC$f|>)4)-VtNN%v-=U^iD2>`}|1?VxZJs$RIL;T(q2(w6Bv z^v*lkU6;za2MweB_>*Dmyd_#G&7`spbr8`4IM7x;@ z(K8Xyp8pR7+Ya#jG!i>_MHMXx93h4?pgShg&hfSU2!U0M03+}xhe|!yU;CXLs^dQu zQRacptbcgFu`$@^Jv)DV?&tIlFl``N+2f@#@O3udN=CP}ba^0P+g1M1ia{g;;}{kZ zwK3h|C>+Mnsi5p0o}R{LW(^yUkU6lsroYTZ*`(s$13g~>fHVy&4H^YC`3=hS%BJL^ zzaSsz$M-acB!89K{jV7o#9^k#>@0HD;S8R({o&DABHE%Hb{L&MwryuFT`DY81)dli#dJ*ciHwiXcQkYDo!hY~^lYx<12Z>Zss&FvT?O z(r$<=(u2cpwgwuqJlgI(9k)1$9N_DTQF-%ab9HT?QvbHReY!>D>rxSKFGP_6rT-=1 zsLy?lW=z&sTB9Qkv>D38-z3uadcR+*ySF<22KVul!Vu5L&$oKt)b_Lln%xx$*=g*_ zO^zFZRG`QqEW^XY=r1vZL40}gTIiipyDM*eFTdFD3{ROLo|`H{>AcM&=9GU1Ltjw) zI}V}&k;c)<-MCafLhx?V@wz+a-d|_34{vEsozrx4=w)qQykmXq*YW@Fl}Se#OA6=j zHf1+})!xj6^TLgPiJQ`Hqs2wV;qwW>doy9>IAnRbC$g**kRWyw-tZ;iq~`90>cc&3 z;GAbfv^*YY?r!cL@))Gtj#=Cb8VmEcvs^$s!W(BJ*A`9lfqAtf^C^AhXbar=J^FDgMtYtNcT$=Q9`Qb#=2#~!qfdb2ocL+>Z%F;N{$ 
zQamc0Txk??#0A;V^9m=AEHs3JYeL6?v3%^t@x;<+6umX`fnPUN|EWhf;k@2!ZPKw< zu{fY3Dag}Ud4nM!$@%W$3O&T8-ri2=x$5(K9nk43`_Hy>DV?4F`*}sT&|_oHx^WxN z1^*d!SI%X!D4yi}*S_8zjXAdaj?sk7oete$Jp6Zr{mTsx#?A}9XD3?oKUh-UdHeWO zZp$SW{hU7l`N|}gFGyPWU|sfob)TPRTith41UW>^swL3a@^|68@;F0&x8gp2QQ>>S!FKR{%5TXxAnmFuDvPn`+hy8RF^*rt z=ehmST8-djXz$oTZJb*eYj|i_yM0)fTZu0)Y^lcf{a!QM?GP%EKN3mIRSq5h?lfcT z!zKh#wKIHkJ4#*@*mW(7n|w(unhuNI$oxG}JDSfB+j9tXy?Y&2DnEq_!$=e7V<&6Q zbNt?$$T1_q*{aof=Ibu*)*)aT_7uLc^i=p|Q7BNddzourE18Q^8#U#%0NbHhN)E+; zI3M+H`SWr2`>=p8=D7IR>XHYNZd%vI?)N5g9}}tM`Kha7q%Urc_&Y3sTyck`m&2q% zoD(QkuAZg}dnMmVIjeG!(euRwsGc3Vy|I$Ue+Exj+U1OUg)$S&Cr$fYp9=M{nY#Dm zZ5}IvN7f>3_-_72RHMXyz1w+?v`Q zSU2_H$$VC3HmM@uj`MMu#p0(EgT%HjJ@}Nay~U>c`vQ}@piJk)P?5*8UtRCgyQ7No z)=|CT#>pwyN?w$fmKr5I74>3{yIOLHGEaqMTT9<~QKS*ueZ+_0@ZS{XaZnN>~5@ diff --git a/docs/deployment/frameworks/open-webui.md b/docs/deployment/frameworks/open-webui.md index 8f27a2b9bb6..eaa51bb6132 100644 --- a/docs/deployment/frameworks/open-webui.md +++ b/docs/deployment/frameworks/open-webui.md @@ -1,26 +1,42 @@ # Open WebUI -1. Install the [Docker](https://docs.docker.com/engine/install/) +[Open WebUI](https://github.com/open-webui/open-webui) is an extensible, feature-rich, +and user-friendly self-hosted AI platform designed to operate entirely offline. +It supports various LLM runners like Ollama and OpenAI-compatible APIs, +with built-in RAG capabilities, making it a powerful AI deployment solution. -2. Start the vLLM server with the supported chat completion model, e.g. +To get started with Open WebUI using vLLM, follow these steps: -```bash -vllm serve qwen/Qwen1.5-0.5B-Chat -``` +1. Install the [Docker](https://docs.docker.com/engine/install/). -1. Start the [Open WebUI](https://github.com/open-webui/open-webui) docker container (replace the vllm serve host and vllm serve port): +2. Start the vLLM server with a supported chat completion model: -```bash -docker run -d -p 3000:8080 \ ---name open-webui \ --v open-webui:/app/backend/data \ --e OPENAI_API_BASE_URL=http://:/v1 \ ---restart always \ -ghcr.io/open-webui/open-webui:main -``` + ```console + vllm serve Qwen/Qwen3-0.6B-Chat + ``` -1. Open it in the browser: + !!! note + When starting the vLLM server, be sure to specify the host and port using the `--host` and `--port` flags. + For example: -On the top of the web page, you can see the model `qwen/Qwen1.5-0.5B-Chat`. + ```console + python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 + ``` -![](../../assets/deployment/open_webui.png) +3. Start the Open WebUI Docker container: + + ```console + docker run -d \ + --name open-webui \ + -p 3000:8080 \ + -v open-webui:/app/backend/data \ + -e OPENAI_API_BASE_URL=http://0.0.0.0:8000/v1 \ + --restart always \ + ghcr.io/open-webui/open-webui:main + ``` + +4. Open it in the browser: + + At the top of the page, you should see the model `Qwen/Qwen3-0.6B-Chat`. 
+ + ![Web portal of model Qwen/Qwen3-0.6B-Chat](../../assets/deployment/open_webui.png) From f6f0feb4a086defb06361bdaeeb30091acb8b1fe Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 16 Jul 2025 21:39:13 +0800 Subject: [PATCH 134/552] [Model] Consolidate pooler implementations (#20927) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/model_executor/layers/pooler.py | 681 +++++++++++++++-------- vllm/model_executor/models/adapters.py | 99 ++-- vllm/model_executor/models/bert.py | 25 +- vllm/model_executor/models/gritlm.py | 4 +- vllm/model_executor/models/interfaces.py | 2 +- vllm/model_executor/models/jamba.py | 39 +- vllm/model_executor/models/modernbert.py | 33 +- vllm/model_executor/models/roberta.py | 13 +- vllm/transformers_utils/config.py | 24 - 9 files changed, 553 insertions(+), 367 deletions(-) diff --git a/vllm/model_executor/layers/pooler.py b/vllm/model_executor/layers/pooler.py index d864a915a07..b378a3db032 100644 --- a/vllm/model_executor/layers/pooler.py +++ b/vllm/model_executor/layers/pooler.py @@ -1,22 +1,21 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project - +from abc import ABC, abstractmethod +from dataclasses import dataclass from enum import IntEnum -from typing import Optional, Union +from typing import Callable, Optional, TypeVar, Union import torch import torch.nn as nn import torch.nn.functional as F -from typing_extensions import assert_never +from transformers import PretrainedConfig from vllm.config import ModelConfig, PoolerConfig from vllm.model_executor.pooling_metadata import ( # noqa: E501 PoolingMetadata as V0PoolingMetadata) from vllm.model_executor.pooling_metadata import PoolingTensors from vllm.sequence import PoolerOutput, PoolingSequenceGroupOutput -from vllm.transformers_utils.config import ( - get_classification_activation_function, - get_cross_encoder_activation_function) +from vllm.utils import resolve_obj_by_qualname from vllm.v1.pool.metadata import PoolingMetadata as V1PoolingMetadata PoolingMetadata = Union[V0PoolingMetadata, V1PoolingMetadata] @@ -31,140 +30,202 @@ class PoolingType(IntEnum): MEAN = 4 -class SimplePooler(nn.Module): - """A layer that pools specific information from hidden states. +@dataclass(frozen=True) +class ResolvedPoolingConfig: + pooling_type: PoolingType - This layer does the following: - 1. Extracts specific tokens or aggregates data based on pooling method. - 2. Normalizes output if specified. - 3. Returns structured results as `PoolerOutput`. - - Attributes: - pooling_type: The type of pooling to use. - normalize: Whether to normalize the pooled data. 
- """ + normalize: bool + softmax: bool + step_tag_id: Optional[int] + returned_token_ids: Optional[list[int]] - @staticmethod - def from_pooling_type( + @classmethod + def from_config_with_defaults( + cls, + pooler_config: PoolerConfig, pooling_type: PoolingType, - *, normalize: bool, softmax: bool, step_tag_id: Optional[int] = None, returned_token_ids: Optional[list[int]] = None, - ) -> "SimplePooler": - if pooling_type == PoolingType.LAST: - assert step_tag_id is None and returned_token_ids is None - return LastPool(normalize=normalize, softmax=softmax) - if pooling_type == PoolingType.ALL: - assert step_tag_id is None and returned_token_ids is None - return AllPool(normalize=normalize, softmax=softmax) - if pooling_type == PoolingType.CLS: - assert step_tag_id is None and returned_token_ids is None - return CLSPool(normalize=normalize, softmax=softmax) - if pooling_type == PoolingType.MEAN: - assert step_tag_id is None and returned_token_ids is None - return MeanPool(normalize=normalize, softmax=softmax) - if pooling_type == PoolingType.STEP: - return StepPool(normalize=normalize, - softmax=softmax, - step_tag_id=step_tag_id, - returned_token_ids=returned_token_ids) + ) -> "ResolvedPoolingConfig": + return cls( + pooling_type=PoolingType[pooler_config.pooling_type] + if pooler_config.pooling_type is not None else pooling_type, + normalize=pooler_config.normalize + if pooler_config.normalize is not None else normalize, + softmax=pooler_config.softmax + if pooler_config.softmax is not None else softmax, + step_tag_id=pooler_config.step_tag_id + if pooler_config.step_tag_id is not None else step_tag_id, + returned_token_ids=pooler_config.returned_token_ids + if pooler_config.returned_token_ids is not None else + returned_token_ids, + ) - assert_never(pooling_type) - def __init__(self, *, normalize: bool, softmax: bool) -> None: - super().__init__() +def get_prompt_lens( + hidden_states: Union[torch.Tensor, list[torch.Tensor]], + pooling_metadata: PoolingMetadata, +) -> torch.Tensor: + if isinstance(pooling_metadata, V1PoolingMetadata): + return pooling_metadata.prompt_lens + + assert isinstance(hidden_states, torch.Tensor) + return PoolingTensors.from_pooling_metadata( + pooling_metadata, hidden_states.device).prompt_lens + + +def get_classification_activation_function(config: PretrainedConfig): + return PoolerClassify() + + +def get_cross_encoder_activation_function(config: PretrainedConfig): + function_name: Optional[str] = None + if (hasattr(config, "sentence_transformers") + and "activation_fn" in config.sentence_transformers): + function_name = config.sentence_transformers["activation_fn"] + elif (hasattr(config, "sbert_ce_default_activation_function") + and config.sbert_ce_default_activation_function is not None): + function_name = config.sbert_ce_default_activation_function + + if function_name is not None: + assert function_name.startswith("torch.nn.modules."), ( + "Loading of activation functions is restricted to " + "torch.nn.modules for security reasons") + fn = resolve_obj_by_qualname(function_name)() + return PoolerActivation.wraps(fn) - self.head = PoolerHead(normalize=normalize, softmax=softmax) + return PoolerScore() - def get_prompt_lens( + +def build_output(all_data: torch.Tensor) -> PoolerOutput: + all_outputs = [PoolingSequenceGroupOutput(data) for data in all_data] + return PoolerOutput(outputs=all_outputs) + + +class BasePooler(nn.Module): + + @abstractmethod + def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: 
PoolingMetadata, + ) -> PoolerOutput: + raise NotImplementedError + + +class PoolingMethod(nn.Module, ABC): + + @staticmethod + def from_pooling_type(pooling_type: PoolingType) -> "PoolingMethod": + if pooling_type == PoolingType.LAST: + return LastPool() + if pooling_type == PoolingType.ALL: + return AllPool() + if pooling_type == PoolingType.CLS: + return CLSPool() + if pooling_type == PoolingType.MEAN: + return MeanPool() + + raise NotImplementedError(f"Unsupported method: {pooling_type}") + + @abstractmethod + def forward_one( + self, + hidden_states: torch.Tensor, + prompt_len: Optional[torch.Tensor] = None, ) -> torch.Tensor: - if isinstance(pooling_metadata, V1PoolingMetadata): - return pooling_metadata.prompt_lens - assert isinstance(hidden_states, torch.Tensor) - return PoolingTensors.from_pooling_metadata( - pooling_metadata, hidden_states.device).prompt_lens + """ + Note: + `prompt_len=None` means `prompt_len=len(hidden_states)`. + """ + raise NotImplementedError - def extract_states( + @abstractmethod + def forward_all( self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, + hidden_states: torch.Tensor, + prompt_lens: torch.Tensor, ) -> Union[list[torch.Tensor], torch.Tensor]: raise NotImplementedError - def build_output(self, data: torch.Tensor) -> PoolingSequenceGroupOutput: - return PoolingSequenceGroupOutput(data) - def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: PoolingMetadata, - ) -> PoolerOutput: - pooled_data = self.extract_states(hidden_states, pooling_metadata) - pooled_data = self.head(pooled_data, pooling_metadata) - pooled_outputs = [self.build_output(data) for data in pooled_data] - return PoolerOutput(outputs=pooled_outputs) + ) -> Union[list[torch.Tensor], torch.Tensor]: + prompt_lens = get_prompt_lens(hidden_states, pooling_metadata) + + if isinstance(hidden_states, list): + return [ + self.forward_one(h, prompt_len) + for h, prompt_len in zip(hidden_states, prompt_lens) + ] + return self.forward_all(hidden_states, prompt_lens) -class CLSPool(SimplePooler): - def extract_states( +class CLSPool(PoolingMethod): + + def forward_one( self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, - ) -> Union[list[torch.Tensor], torch.Tensor]: - prompt_lens = self.get_prompt_lens(hidden_states, pooling_metadata) + hidden_states: torch.Tensor, + prompt_len: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + assert prompt_len is None or prompt_len == hidden_states.shape[0], \ + "partial prefill not supported with CLS pooling" - if isinstance(hidden_states, list): - result = [] - for req_state, prompt_len in zip(hidden_states, prompt_lens): - assert prompt_len == req_state.shape[0], \ - "partial prefill not supported with CLS pooling" - result.append(req_state[0]) - return result + return hidden_states[0] + def forward_all( + self, + hidden_states: torch.Tensor, + prompt_lens: torch.Tensor, + ) -> Union[list[torch.Tensor], torch.Tensor]: first_token_flat_indices = torch.zeros_like(prompt_lens) first_token_flat_indices[1:] += torch.cumsum(prompt_lens, dim=0)[:-1] return hidden_states[first_token_flat_indices] -class LastPool(SimplePooler): +class LastPool(PoolingMethod): - def extract_states( + def forward_one( self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, - ) -> Union[list[torch.Tensor], torch.Tensor]: - if isinstance(hidden_states, list): - return [h[-1] for h in hidden_states] - - 
prompt_lens = self.get_prompt_lens(hidden_states, pooling_metadata) + hidden_states: torch.Tensor, + prompt_len: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + return hidden_states[-1] + def forward_all( + self, + hidden_states: torch.Tensor, + prompt_lens: torch.Tensor, + ) -> Union[list[torch.Tensor], torch.Tensor]: last_token_flat_indices = torch.cumsum(prompt_lens, dim=0) - 1 return hidden_states[last_token_flat_indices] -class AllPool(SimplePooler): +class AllPool(PoolingMethod): - def extract_states( + def forward_one( self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, - ) -> Union[list[torch.Tensor], torch.Tensor]: - prompt_lens = self.get_prompt_lens(hidden_states, pooling_metadata) + hidden_states: torch.Tensor, + prompt_len: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + assert prompt_len is None or prompt_len == hidden_states.shape[0], \ + "partial prefill not supported with ALL pooling" - if isinstance(hidden_states, list): - for req_state, prompt_len in zip(hidden_states, prompt_lens): - assert prompt_len == req_state.shape[0], \ - "partial prefill not supported with ALL pooling" - return hidden_states + return hidden_states + def forward_all( + self, + hidden_states: torch.Tensor, + prompt_lens: torch.Tensor, + ) -> Union[list[torch.Tensor], torch.Tensor]: offset = 0 pooled_data = list[torch.Tensor]() + for prompt_len in prompt_lens: pooled_data.append(hidden_states[offset:offset + prompt_len]) offset += prompt_len @@ -172,24 +233,23 @@ def extract_states( return pooled_data -class MeanPool(SimplePooler): +class MeanPool(PoolingMethod): - def extract_states( + def forward_one( self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, - ) -> Union[list[torch.Tensor], torch.Tensor]: - prompt_lens = self.get_prompt_lens(hidden_states, pooling_metadata) + hidden_states: torch.Tensor, + prompt_len: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + assert prompt_len is None or prompt_len == hidden_states.shape[0], \ + "partial prefill not supported with MEAN pooling" - if isinstance(hidden_states, list): - result = [] - for req_state, prompt_len in zip(hidden_states, prompt_lens): - assert prompt_len == req_state.shape[0], \ - "partial prefill not supported with mean pooling" - result.append(torch.mean(req_state, dim=0, - dtype=torch.float32)) - return result + return hidden_states.mean(dim=0, dtype=torch.float32) + def forward_all( + self, + hidden_states: torch.Tensor, + prompt_lens: torch.Tensor, + ) -> Union[list[torch.Tensor], torch.Tensor]: # Use float32 for torch.cumsum in MeanPool, # otherwise precision will be lost significantly. 
cumsum = torch.cumsum(hidden_states, dim=0, dtype=torch.float32) @@ -203,78 +263,127 @@ def extract_states( hidden_states[start_indices]) / prompt_lens.unsqueeze(1) -class StepPool(SimplePooler): +_T = TypeVar("_T", torch.Tensor, list[torch.Tensor]) - def __init__( - self, - *, - normalize: bool, - softmax: bool, - step_tag_id: Optional[int] = None, - returned_token_ids: Optional[list[int]] = None, - ): - super().__init__(normalize=normalize, softmax=softmax) - self.step_tag_id = step_tag_id - self.returned_token_ids = returned_token_ids +class BasePoolerActivation(nn.Module, ABC): - def get_prompt_token_ids( - self, - pooling_metadata: PoolingMetadata, - ) -> list[torch.Tensor]: - if isinstance(pooling_metadata, V1PoolingMetadata): - return [ - pooling_metadata.prompt_token_ids[i, :num] - for i, num in enumerate(pooling_metadata.prompt_lens) - ] - return [ - torch.tensor(seq_data_i.prompt_token_ids) - for seq_data_i in pooling_metadata.seq_data.values() - ] + @abstractmethod + def forward(self, pooled_data: _T) -> _T: + # shape: + # classify (& score) -> (batch_size, num_classes) + # embed -> (batch_size, embedding_dim) or list(embedding_dim) + # (batch_size, dimensions) or list(dimensions) if using MRL + raise NotImplementedError - def extract_states( - self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, - ) -> Union[list[torch.Tensor], torch.Tensor]: - prompt_lens = self.get_prompt_lens(hidden_states, pooling_metadata) - prompt_token_ids = self.get_prompt_token_ids(pooling_metadata) - pooled_data_lst = list[torch.Tensor]() - if isinstance(hidden_states, list): - for req_state, prompt_len in zip(hidden_states, prompt_lens): - assert prompt_len == req_state.shape[0], \ - "partial prefill not supported with step pooling" - pooled_data_lst = hidden_states - else: - offset = 0 - for prompt_len in prompt_lens: - pooled_data_i = hidden_states[offset:offset + prompt_len] - offset += prompt_len - pooled_data_lst.append(pooled_data_i) +class PoolerActivation(BasePoolerActivation): - pooled_data = list[torch.Tensor]() - returned_token_ids = self.returned_token_ids - step_tag_id = self.step_tag_id + @staticmethod + def wraps(module: nn.Module): + if isinstance(module, nn.Identity): + return PoolerIdentity() + if isinstance(module, (nn.Sigmoid, nn.Softmax)): + return PoolerClassify() - for data, token_id in zip(pooled_data_lst, prompt_token_ids): - if returned_token_ids is not None and len(returned_token_ids) > 0: - data = data[:, returned_token_ids] + return LambdaPoolerActivation(module) + + @abstractmethod + def forward_chunk(self, pooled_data: torch.Tensor) -> torch.Tensor: + raise NotImplementedError + + def forward(self, pooled_data: _T) -> _T: + if isinstance(pooled_data, list): + return [self.forward_chunk(data) for data in pooled_data] + + return self.forward_chunk(pooled_data) - if step_tag_id is not None: - data = data[token_id == step_tag_id] - pooled_data.append(data) +class PoolerIdentity(PoolerActivation): + + def forward_chunk(self, pooled_data: torch.Tensor) -> torch.Tensor: return pooled_data +class PoolerNormalize(PoolerActivation): + + def forward_chunk(self, pooled_data: torch.Tensor) -> torch.Tensor: + x = F.normalize(pooled_data.float(), p=2, dim=-1) + return x.to(pooled_data.dtype) + + +class PoolerClassify(PoolerActivation): + + def forward_chunk(self, pooled_data: torch.Tensor) -> torch.Tensor: + num_labels = pooled_data.shape[-1] + if num_labels < 2: + return F.sigmoid(pooled_data.float()).to(pooled_data.dtype) + + return 
F.softmax(pooled_data.float(), dim=-1).to(pooled_data.dtype) + + +class PoolerScore(PoolerActivation): + + def forward_chunk(self, pooled_data: torch.Tensor) -> torch.Tensor: + num_labels = pooled_data.shape[-1] + if num_labels < 2: + return F.sigmoid(pooled_data.float()).to(pooled_data.dtype) + + return pooled_data + + +class LambdaPoolerActivation(PoolerActivation): + + def __init__(self, fn: Callable[[torch.Tensor], torch.Tensor]): + super().__init__() + + self.fn = fn + + def forward_chunk(self, pooled_data: torch.Tensor) -> torch.Tensor: + return self.fn(pooled_data) + + class PoolerHead(nn.Module): - def __init__(self, *, normalize: bool, softmax: bool) -> None: + @classmethod + def from_config_with_defaults( + cls, + pooler_config: PoolerConfig, + pooling_type: PoolingType, + normalize: bool, + softmax: bool, + ) -> "PoolerHead": + resolved_config = ResolvedPoolingConfig.from_config_with_defaults( + pooler_config=pooler_config, + pooling_type=pooling_type, + normalize=normalize, + softmax=softmax, + step_tag_id=None, + returned_token_ids=None, + ) + + return cls.from_config(resolved_config) + + @classmethod + def from_config(cls, pooler_config: ResolvedPoolingConfig) -> "PoolerHead": + if pooler_config.normalize and pooler_config.softmax: + raise ValueError("`normalize=True` and `softmax=True` should not " + "be set together") + + activation: PoolerActivation + if pooler_config.normalize: + activation = PoolerNormalize() + elif pooler_config.softmax: + activation = PoolerClassify() + else: + activation = PoolerIdentity() + + return cls(activation) + + def __init__(self, activation: PoolerActivation) -> None: super().__init__() - self.normalize = normalize - self.softmax = softmax + self.activation = activation def forward(self, pooled_data: Union[list[torch.Tensor], torch.Tensor], pooling_metadata: PoolingMetadata): @@ -312,35 +421,21 @@ def forward(self, pooled_data: Union[list[torch.Tensor], torch.Tensor], for vecs, d in zip(pooled_data, dimensions_list) ] - if self.normalize: - if isinstance(pooled_data, list): - pooled_data = [ - F.normalize(data, p=2, dim=-1) for data in pooled_data - ] - else: - pooled_data = F.normalize(pooled_data, p=2, dim=-1) + return self.activation(pooled_data) - if self.softmax: - if isinstance(pooled_data, list): - pooled_data = [ - F.softmax(data, dim=-1) - if data.shape[-1] >= 2 else F.sigmoid(data) - for data in pooled_data - ] - else: - if pooled_data.shape[-1] >= 2: - pooled_data = F.softmax(pooled_data, dim=-1) - else: - pooled_data = F.sigmoid(pooled_data) - # shape: - # classify (& score) -> (batch_size, num_classes) - # embed -> (batch_size, embedding_dim) or list(embedding_dim) - # (batch_size, dimensions) or list(dimensions) if using MRL - return pooled_data +class SimplePooler(BasePooler): + """A layer that pools specific information from hidden states. + This layer does the following: + 1. Extracts specific tokens or aggregates data based on pooling method. + 2. Normalizes output if specified. + 3. Returns structured results as `PoolerOutput`. -class Pooler(nn.Module): + Attributes: + pooling_type: The type of pooling to use. + normalize: Whether to normalize the pooled data. 
+ """ @classmethod def from_config_with_defaults( @@ -349,23 +444,146 @@ def from_config_with_defaults( pooling_type: PoolingType, normalize: bool, softmax: bool, + ) -> "SimplePooler": + resolved_config = ResolvedPoolingConfig.from_config_with_defaults( + pooler_config=pooler_config, + pooling_type=pooling_type, + normalize=normalize, + softmax=softmax, + ) + assert resolved_config.pooling_type != PoolingType.STEP + + return cls.from_config(resolved_config) + + @classmethod + def from_config( + cls, + pooler_config: ResolvedPoolingConfig, + ) -> "SimplePooler": + pooling = PoolingMethod.from_pooling_type(pooler_config.pooling_type) + head = PoolerHead.from_config(pooler_config) + + return cls(pooling, head) + + def __init__(self, pooling: PoolingMethod, head: PoolerHead) -> None: + super().__init__() + + self.pooling = pooling + self.head = head + + def forward( + self, + hidden_states: Union[torch.Tensor, list[torch.Tensor]], + pooling_metadata: PoolingMetadata, + ) -> PoolerOutput: + pooled_data = self.pooling(hidden_states, pooling_metadata) + pooled_data = self.head(pooled_data, pooling_metadata) + return build_output(pooled_data) + + +class StepPooler(BasePooler): + + @classmethod + def from_config(cls, pooler_config: ResolvedPoolingConfig) -> "StepPooler": + assert pooler_config.pooling_type == PoolingType.STEP + + return cls( + PoolerHead.from_config(pooler_config), + step_tag_id=pooler_config.step_tag_id, + returned_token_ids=pooler_config.returned_token_ids, + ) + + def __init__( + self, + head: PoolerHead, + *, step_tag_id: Optional[int] = None, returned_token_ids: Optional[list[int]] = None, - ) -> SimplePooler: - return SimplePooler.from_pooling_type( - pooling_type=PoolingType[pooler_config.pooling_type] - if pooler_config.pooling_type is not None else pooling_type, - normalize=pooler_config.normalize - if pooler_config.normalize is not None else normalize, - softmax=pooler_config.softmax - if pooler_config.softmax is not None else softmax, - step_tag_id=pooler_config.step_tag_id - if pooler_config.step_tag_id is not None else step_tag_id, - returned_token_ids=pooler_config.returned_token_ids - if pooler_config.returned_token_ids is not None else - returned_token_ids, + ) -> None: + super().__init__() + + self.pooling = AllPool() + self.head = head + self.step_tag_id = step_tag_id + self.returned_token_ids = returned_token_ids + + def get_prompt_token_ids( + self, + pooling_metadata: PoolingMetadata, + ) -> list[torch.Tensor]: + if isinstance(pooling_metadata, V1PoolingMetadata): + return [ + pooling_metadata.prompt_token_ids[i, :num] + for i, num in enumerate(pooling_metadata.prompt_lens) + ] + return [ + torch.tensor(seq_data_i.prompt_token_ids) + for seq_data_i in pooling_metadata.seq_data.values() + ] + + def extract_states( + self, + hidden_states: Union[torch.Tensor, list[torch.Tensor]], + pooling_metadata: PoolingMetadata, + ) -> Union[list[torch.Tensor], torch.Tensor]: + pooled_data_lst = self.pooling(hidden_states, pooling_metadata) + prompt_token_ids = self.get_prompt_token_ids(pooling_metadata) + + pooled_data = list[torch.Tensor]() + returned_token_ids = self.returned_token_ids + step_tag_id = self.step_tag_id + + for data, token_id in zip(pooled_data_lst, prompt_token_ids): + if returned_token_ids is not None and len(returned_token_ids) > 0: + data = data[:, returned_token_ids] + + if step_tag_id is not None: + data = data[token_id == step_tag_id] + pooled_data.append(data) + + return pooled_data + + def forward( + self, + hidden_states: Union[torch.Tensor, 
list[torch.Tensor]], + pooling_metadata: PoolingMetadata, + ) -> PoolerOutput: + pooled_data = self.extract_states(hidden_states, pooling_metadata) + pooled_data = self.head(pooled_data, pooling_metadata) + return build_output(pooled_data) + + +class Pooler(nn.Module): + + @staticmethod + def from_config_with_defaults( + pooler_config: PoolerConfig, + pooling_type: PoolingType, + normalize: bool, + softmax: bool, + step_tag_id: Optional[int] = None, + returned_token_ids: Optional[list[int]] = None, + ) -> BasePooler: + resolved_config = ResolvedPoolingConfig.from_config_with_defaults( + pooler_config=pooler_config, + pooling_type=pooling_type, + normalize=normalize, + softmax=softmax, + step_tag_id=step_tag_id, + returned_token_ids=returned_token_ids, ) + if pooling_type == PoolingType.STEP: + return StepPooler.from_config(resolved_config) + + return SimplePooler.from_config(resolved_config) + + +PoolingFn = Callable[ + [Union[torch.Tensor, list[torch.Tensor]], PoolingMetadata], + Union[torch.Tensor, list[torch.Tensor]]] +ClassifierFn = Callable[[torch.Tensor], torch.Tensor] + class ClassifierPooler(nn.Module): """A pooling layer for classification tasks. @@ -382,69 +600,39 @@ class ClassifierPooler(nn.Module): def __init__( self, config: ModelConfig, - classifier: nn.Module, - pooler: Optional[nn.Module] = None, - ): + pooling: PoolingFn, + classifier: ClassifierFn, + act_fn: Optional[PoolerActivation] = None, + ) -> None: super().__init__() + + self.pooling = pooling self.classifier = classifier - self.pooler = pooler self.classification_act_fn = get_classification_activation_function( - config.hf_config) + config.hf_config) if act_fn is None else act_fn self.cross_encoder_act_fn = get_cross_encoder_activation_function( - config.hf_config) + config.hf_config) if act_fn is None else act_fn def _get_act_fn(self, use_cross_encoder: bool): return (self.cross_encoder_act_fn if use_cross_encoder else self.classification_act_fn) - def get_prompt_lens( - self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, - ) -> torch.Tensor: - if isinstance(pooling_metadata, V1PoolingMetadata): - return pooling_metadata.prompt_lens - assert isinstance(hidden_states, torch.Tensor) - return PoolingTensors.from_pooling_metadata( - pooling_metadata, hidden_states.device).prompt_lens - def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: PoolingMetadata, ) -> PoolerOutput: """Pools sentence pair scores from the hidden_states.""" - prompt_lens = self.get_prompt_lens(hidden_states, pooling_metadata) + pooled_data = self.pooling(hidden_states, pooling_metadata) - pooled_data = list[torch.Tensor]() - if isinstance(hidden_states, list): - for req_state, prompt_len in zip(hidden_states, prompt_lens): - assert prompt_len == req_state.shape[0], \ - "partial prefill not supported with classifier" - pooled_data = hidden_states + # apply classifier once on the full batch if possible + if isinstance(pooled_data, torch.Tensor): + pooled_output = self.classifier(pooled_data) + elif len({data.shape for data in pooled_data}) <= 1: + pooled_output = self.classifier(torch.stack(pooled_data)) else: - offset = 0 - for prompt_len in prompt_lens: - pooled_data_i = hidden_states[offset:offset + prompt_len] - offset += prompt_len - pooled_data.append(pooled_data_i) - - pooled_data_lst = [] - for pooled_data_i in pooled_data: - - if self.pooler is not None: - final_shape_tensor = self.pooler(pooled_data_i) - else: - final_shape_tensor = 
self.classifier(pooled_data_i) - - pooled_data_lst.append(final_shape_tensor) - - pooled_output = torch.stack(pooled_data_lst) - - if self.pooler is not None: - # apply classifier once on the full batch if possible - pooled_output = self.classifier(pooled_output) + pooled_output = [self.classifier(data) for data in pooled_data] if isinstance(pooling_metadata, V0PoolingMetadata): use_cross_encoder_list = [ @@ -469,5 +657,4 @@ def forward( pooled_output) ]) - pooled_outputs = [PoolingSequenceGroupOutput(data) for data in scores] - return PoolerOutput(outputs=pooled_outputs) + return build_output(scores) diff --git a/vllm/model_executor/models/adapters.py b/vllm/model_executor/models/adapters.py index dcdf69f773a..5c09ac30605 100644 --- a/vllm/model_executor/models/adapters.py +++ b/vllm/model_executor/models/adapters.py @@ -58,22 +58,27 @@ def __init__( ) -> None: super().__init__(vllm_config=vllm_config, prefix=prefix, **kwargs) + self.vllm_config = vllm_config + # These are not used in pooling models for attr in ("lm_head", "logits_processor"): if hasattr(self, attr): delattr(self, attr) + # If the model already defines a pooler instance, don't overwrite it + if not getattr(self, "_pooler", None): + self._init_pooler(vllm_config, prefix=prefix) + + def _init_pooler(self, vllm_config: "VllmConfig", prefix: str = ""): pooler_config = vllm_config.model_config.pooler_config assert pooler_config is not None - # If the model already defines a pooler instance, don't overwrite it - if not getattr(self, "_pooler", None): - self._pooler = Pooler.from_config_with_defaults( - pooler_config, - pooling_type=default_pooling_type, - normalize=default_normalize, - softmax=default_softmax, - ) + self._pooler = Pooler.from_config_with_defaults( + pooler_config, + pooling_type=default_pooling_type, + normalize=default_normalize, + softmax=default_softmax, + ) def pooler( self, @@ -165,7 +170,9 @@ def as_seq_cls_model(cls: _T) -> _T: # Lazy import from vllm.model_executor.layers.linear import RowParallelLinear - from vllm.model_executor.layers.pooler import PoolerOutput, PoolingType + from vllm.model_executor.layers.pooler import (ClassifierPooler, + PoolerOutput, PoolingType, + SimplePooler) from vllm.model_executor.models.interfaces import SupportsCrossEncoding from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.sequence import IntermediateTensors @@ -182,30 +189,40 @@ def as_seq_cls_model(cls: _T) -> _T: class ModelForSequenceClassification(ModelForPooling, SupportsCrossEncoding): - def __init__( - self, - *, - vllm_config: "VllmConfig", - prefix: str = "", - **kwargs: Any, - ) -> None: - super().__init__(vllm_config=vllm_config, prefix=prefix, **kwargs) - + def _init_pooler(self, vllm_config: "VllmConfig", prefix: str = ""): config = vllm_config.model_config.hf_config quant_config = vllm_config.quant_config - self.vllm_config = vllm_config - self.task = vllm_config.model_config.task - self.pooling_type = ( - vllm_config.model_config.pooler_config.pooling_type) - - self.score = RowParallelLinear(config.hidden_size, - config.num_labels, - quant_config=quant_config, - input_is_parallel=False, - bias=False, - prefix=maybe_prefix( - prefix, "score")) + self.score = RowParallelLinear( + config.hidden_size, + config.num_labels, + input_is_parallel=False, + bias=False, + params_dtype=torch.float32, + quant_config=quant_config, + prefix=maybe_prefix(prefix, "score"), + ) + + pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + + pooler = 
SimplePooler.from_config_with_defaults( + pooler_config, + pooling_type=PoolingType.LAST, + normalize=False, + softmax=True, + ) + + self._pooler = ClassifierPooler( + vllm_config.model_config, + pooling=pooler.pooling, + classifier=self._classifier, + act_fn=pooler.head.activation, + ) + + def _classifier(self, x: torch.Tensor): + x, _ = self.score(x.float()) + return x def forward( self, @@ -222,27 +239,7 @@ def pooler( hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: PoolingMetadata, ) -> PoolerOutput: - - def get_logits(hidden_states): - if isinstance(hidden_states, list): - logits = [self.score(state)[0] for state in hidden_states] - else: - logits, _ = self.score(hidden_states) - return logits - - if self.pooling_type == PoolingType.ALL: - logits = get_logits(hidden_states) - return self._pooler(logits, pooling_metadata) - else: - hidden_states = self._pooler.extract_states( - hidden_states, pooling_metadata) - logits = get_logits(hidden_states) - pooled_data = self._pooler.head(logits, pooling_metadata) - - pooled_outputs = [ - self._pooler.build_output(data) for data in pooled_data - ] - return PoolerOutput(outputs=pooled_outputs) + return self._pooler(hidden_states, pooling_metadata) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): tokens = getattr(self.config, "classifier_from_token", None) diff --git a/vllm/model_executor/models/bert.py b/vllm/model_executor/models/bert.py index a43803ed433..65e6428f491 100644 --- a/vllm/model_executor/models/bert.py +++ b/vllm/model_executor/models/bert.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from collections.abc import Iterable -from typing import Optional +from typing import Optional, Union import torch from torch import nn @@ -18,7 +18,7 @@ QKVParallelLinear, RowParallelLinear) from vllm.model_executor.layers.pooler import (ClassifierPooler, Pooler, - PoolingType) + PoolingMethod, PoolingType) from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) @@ -84,14 +84,18 @@ class BertPooler(nn.Module): def __init__(self, config: BertConfig): super().__init__() + + self.pooling = PoolingMethod.from_pooling_type(PoolingType.CLS) self.dense = nn.Linear(config.hidden_size, config.hidden_size) self.activation = nn.Tanh() - def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: - # We "pool" the model by simply taking the hidden state corresponding - # to the first token. 
- first_token_tensor = hidden_states[0, :] - pooled_output = self.dense(first_token_tensor) + def forward( + self, + hidden_states: Union[torch.Tensor, list[torch.Tensor]], + pooling_metadata: PoolingMetadata, + ) -> Union[torch.Tensor, list[torch.Tensor]]: + pooled_output = self.pooling(hidden_states, pooling_metadata) + pooled_output = self.dense(pooled_output) pooled_output = self.activation(pooled_output) return pooled_output @@ -472,8 +476,11 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): embedding_class=BertEmbedding, add_pooling_layer=True) self.classifier = nn.Linear(config.hidden_size, config.num_labels) - self._pooler = ClassifierPooler(vllm_config.model_config, - self.classifier, self.bert.pooler) + self._pooler = ClassifierPooler( + vllm_config.model_config, + pooling=self.bert.pooler, + classifier=self.classifier, + ) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): loader = AutoWeightsLoader(self) diff --git a/vllm/model_executor/models/gritlm.py b/vllm/model_executor/models/gritlm.py index 4273afbf469..dfec8a51c4c 100644 --- a/vllm/model_executor/models/gritlm.py +++ b/vllm/model_executor/models/gritlm.py @@ -9,7 +9,7 @@ from vllm.config import ModelConfig, VllmConfig from vllm.logger import init_logger -from vllm.model_executor.layers.pooler import PoolerHead +from vllm.model_executor.layers.pooler import PoolerHead, PoolerNormalize from vllm.model_executor.models.llama import LlamaForCausalLM from vllm.model_executor.pooling_metadata import (PoolingMetadata, PoolingTensors) @@ -49,7 +49,7 @@ def tokens_to_ids(tokens: list[str]) -> array: self.embed_pattern_ids = tokens_to_ids( ["▁<", "|", "embed", "|", ">", "<0x0A>"]) - self.head = PoolerHead(normalize=True, softmax=False) + self.head = PoolerHead(PoolerNormalize()) def _find_array(self, arr: array, target: array, start_idx: int) -> int: """ diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 92ecb8972d5..9655bdf6f3e 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -659,7 +659,7 @@ def supports_cross_encoding( def has_step_pooler(model: Union[type[object], object]) -> bool: """Check if the model uses step pooler.""" return is_pooling_model(model) and any( - type(module).__name__ == "StepPool" for module in model.modules()) + type(module).__name__ == "StepPooler" for module in model.modules()) class SupportsQuant: diff --git a/vllm/model_executor/models/jamba.py b/vllm/model_executor/models/jamba.py index 8294f846bbd..233c222963b 100644 --- a/vllm/model_executor/models/jamba.py +++ b/vllm/model_executor/models/jamba.py @@ -19,7 +19,8 @@ RowParallelLinear) from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.mamba.mamba_mixer import MambaMixer -from vllm.model_executor.layers.pooler import Pooler, PoolingType +from vllm.model_executor.layers.pooler import (ClassifierPooler, PoolingType, + SimplePooler) from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.vocab_parallel_embedding import ( DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) @@ -564,29 +565,41 @@ class JambaForSequenceClassification(JambaForCausalLM): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__(vllm_config=vllm_config, prefix=prefix) + config = vllm_config.model_config.hf_config num_labels: int = config.num_labels score_bias: bool = getattr(config, 'score_bias', 
False) - self.score = nn.Linear(config.hidden_size, num_labels, bias=score_bias) + + # TODO: The original reward weights have float32 accuracy data, we + # would like to load them in fp32 to get that extra precision. + # Currently weight_loader passes the weight which is already in bf16 + self.score = nn.Linear( + config.hidden_size, + num_labels, + bias=score_bias, + dtype=torch.float32, + ) pooler_config = vllm_config.model_config.pooler_config - self._pooler = Pooler.from_config_with_defaults( + assert pooler_config is not None + + pooler = SimplePooler.from_config_with_defaults( pooler_config, pooling_type=PoolingType.LAST, normalize=False, - softmax=False) + softmax=False, + ) + + self._pooler = ClassifierPooler( + vllm_config.model_config, + pooling=pooler.pooling, + classifier=self.score, + act_fn=pooler.head.activation, + ) def pooler( self, hidden_states: torch.Tensor, pooling_metadata: PoolingMetadata, ) -> Optional[PoolerOutput]: - hidden_states = hidden_states.float() - logits = self.score(hidden_states) - return self._pooler(logits, pooling_metadata) - - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): - # TODO: The reward weights themselves have float32 accuracy data, we - # would like to load them in fp32 to get that extra precision. - super().load_weights(weights) - self.score = self.score.float() + return self._pooler(hidden_states, pooling_metadata) diff --git a/vllm/model_executor/models/modernbert.py b/vllm/model_executor/models/modernbert.py index 9d619b38d38..e094ff16357 100644 --- a/vllm/model_executor/models/modernbert.py +++ b/vllm/model_executor/models/modernbert.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from collections.abc import Iterable -from typing import Optional +from typing import Optional, Union import torch from torch import nn @@ -13,7 +13,8 @@ from vllm.distributed import get_tensor_model_parallel_world_size from vllm.model_executor.layers.linear import (QKVParallelLinear, RowParallelLinear) -from vllm.model_executor.layers.pooler import ClassifierPooler +from vllm.model_executor.layers.pooler import (BasePooler, ClassifierPooler, + PoolingMethod, PoolingType) from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) @@ -252,10 +253,13 @@ def forward( return norm_outputs -class ModernBertPooler(nn.Module): +class ModernBertPooler(BasePooler): def __init__(self, config: ModernBertConfig): super().__init__() + + pooling_type = PoolingType[config.classifier_pooling.upper()] + self.pooling = PoolingMethod.from_pooling_type(pooling_type) self.dense = nn.Linear(config.hidden_size, config.hidden_size, config.classifier_bias) self.pooling_type = config.classifier_pooling @@ -264,15 +268,12 @@ def __init__(self, config: ModernBertConfig): eps=config.norm_eps, bias=config.norm_bias) - def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: - pooled_output = hidden_states - if self.pooling_type == "mean": - pooled_output = pooled_output.mean(dim=0, keepdim=False) - elif self.pooling_type == "cls": - pooled_output = pooled_output[0, :] - else: - raise ValueError("Pooling type should be either `cls` or `mean`, " - f"but got {self.pooling_type}") + def forward( + self, + hidden_states: Union[torch.Tensor, list[torch.Tensor]], + pooling_metadata: PoolingMetadata, + ) -> Union[torch.Tensor, list[torch.Tensor]]: + pooled_output = self.pooling(hidden_states, 
pooling_metadata) pooled_output = self.norm(self.act(self.dense(pooled_output))) return pooled_output @@ -287,9 +288,11 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.model = ModernBertModel(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "modernbert")) self.classifier = nn.Linear(config.hidden_size, config.num_labels) - self._pooler = ClassifierPooler(vllm_config.model_config, - self.classifier, - ModernBertPooler(config)) + self._pooler = ClassifierPooler( + vllm_config.model_config, + pooling=ModernBertPooler(config), + classifier=self.classifier, + ) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): diff --git a/vllm/model_executor/models/roberta.py b/vllm/model_executor/models/roberta.py index 1d3a23a5e54..55ebb6e9e2a 100644 --- a/vllm/model_executor/models/roberta.py +++ b/vllm/model_executor/models/roberta.py @@ -9,7 +9,7 @@ from transformers import RobertaConfig from vllm.config import VllmConfig -from vllm.model_executor.layers.pooler import ClassifierPooler +from vllm.model_executor.layers.pooler import ClassifierPooler, CLSPool from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) from vllm.model_executor.models.bert import BertEmbeddingModel, BertModel @@ -106,8 +106,8 @@ def __init__(self, config: RobertaConfig): self.dense = nn.Linear(config.hidden_size, config.hidden_size) self.out_proj = nn.Linear(config.hidden_size, config.num_labels) - def forward(self, features, **kwargs): - x = features[0, :] # take token (equiv. to [CLS]) + def forward(self, x: torch.Tensor) -> torch.Tensor: + # CLSPool has already been applied in `pooling` x = self.dense(x) x = torch.tanh(x) x = self.out_proj(x) @@ -188,8 +188,11 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): add_pooling_layer=False) self.classifier = RobertaClassificationHead(config) - self._pooler = ClassifierPooler(vllm_config.model_config, - self.classifier) + self._pooler = ClassifierPooler( + vllm_config.model_config, + pooling=CLSPool(), + classifier=self.classifier, + ) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): loader = AutoWeightsLoader(self) diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py index cf3f519b027..db8f675bcc5 100644 --- a/vllm/transformers_utils/config.py +++ b/vllm/transformers_utils/config.py @@ -17,7 +17,6 @@ HFValidationError, LocalEntryNotFoundError, RepositoryNotFoundError, RevisionNotFoundError) -from torch import nn from transformers import GenerationConfig, PretrainedConfig from transformers.models.auto.image_processing_auto import ( get_image_processor_config) @@ -44,7 +43,6 @@ # yapf: enable from vllm.transformers_utils.configs.mistral import adapt_config_dict from vllm.transformers_utils.utils import check_gguf_file -from vllm.utils import resolve_obj_by_qualname if envs.VLLM_USE_MODELSCOPE: from modelscope import AutoConfig @@ -775,28 +773,6 @@ def try_get_generation_config( return None -def get_classification_activation_function(config: PretrainedConfig): - return nn.Sigmoid() if config.num_labels == 1 else nn.Softmax() - - -def get_cross_encoder_activation_function(config: PretrainedConfig): - function_name: Optional[str] = None - if (hasattr(config, "sentence_transformers") - and "activation_fn" in config.sentence_transformers): - function_name = config.sentence_transformers["activation_fn"] - elif (hasattr(config, "sbert_ce_default_activation_function") - and config.sbert_ce_default_activation_function is not None): - 
function_name = config.sbert_ce_default_activation_function - - if function_name is not None: - assert function_name.startswith("torch.nn.modules."), ( - "Loading of activation functions is restricted to " - "torch.nn.modules for security reasons") - return resolve_obj_by_qualname(function_name)() - - return nn.Sigmoid() if config.num_labels == 1 else nn.Identity() - - def try_get_safetensors_metadata( model: str, *, From 20073e6f3b46531ab1e717c90fb3732d97297882 Mon Sep 17 00:00:00 2001 From: Mac Misiura <82826099+m-misiura@users.noreply.github.com> Date: Wed, 16 Jul 2025 14:52:14 +0100 Subject: [PATCH 135/552] feat - add a new endpoint `get_tokenizer_info` to provide tokenizer/chat-template information (#20575) Signed-off-by: m-misiura Signed-off-by: x22x22 --- tests/entrypoints/openai/test_tokenization.py | 104 ++++++++++++++++++ vllm/entrypoints/openai/api_server.py | 14 +++ vllm/entrypoints/openai/cli_args.py | 3 + vllm/entrypoints/openai/protocol.py | 10 ++ .../openai/serving_tokenization.py | 54 ++++++++- 5 files changed, 182 insertions(+), 3 deletions(-) diff --git a/tests/entrypoints/openai/test_tokenization.py b/tests/entrypoints/openai/test_tokenization.py index 57dd25fe1b1..0dbbdfbfd24 100644 --- a/tests/entrypoints/openai/test_tokenization.py +++ b/tests/entrypoints/openai/test_tokenization.py @@ -32,6 +32,7 @@ def server(zephyr_lora_added_tokens_files: str): # noqa: F811 f"zephyr-lora2={zephyr_lora_added_tokens_files}", "--max-lora-rank", "64", + "--enable-tokenizer-info-endpoint", ] with RemoteOpenAIServer(MODEL_NAME, args) as remote_server: @@ -283,3 +284,106 @@ async def test_detokenize( response.raise_for_status() assert response.json() == {"prompt": prompt} + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "model_name,tokenizer_name", + [(MODEL_NAME, MODEL_NAME), ("zephyr-lora2", "zephyr-lora2")], + indirect=["tokenizer_name"], +) +async def test_tokenizer_info_basic( + server: RemoteOpenAIServer, + model_name: str, + tokenizer_name: str, +): + """Test basic tokenizer info endpoint functionality.""" + response = requests.get(server.url_for("tokenizer_info")) + response.raise_for_status() + result = response.json() + assert "tokenizer_class" in result + assert isinstance(result["tokenizer_class"], str) + assert result["tokenizer_class"] + + +@pytest.mark.asyncio +async def test_tokenizer_info_schema(server: RemoteOpenAIServer): + """Test that the response matches expected schema types.""" + response = requests.get(server.url_for("tokenizer_info")) + response.raise_for_status() + result = response.json() + field_types = { + "add_bos_token": bool, + "add_prefix_space": bool, + "clean_up_tokenization_spaces": bool, + "split_special_tokens": bool, + "bos_token": str, + "eos_token": str, + "pad_token": str, + "unk_token": str, + "chat_template": str, + "errors": str, + "model_max_length": int, + "additional_special_tokens": list, + "added_tokens_decoder": dict, + } + for field, expected_type in field_types.items(): + if field in result and result[field] is not None: + assert isinstance( + result[field], + expected_type), (f"{field} should be {expected_type.__name__}") + + +@pytest.mark.asyncio +async def test_tokenizer_info_added_tokens_structure( + server: RemoteOpenAIServer, ): + """Test added_tokens_decoder structure if present.""" + response = requests.get(server.url_for("tokenizer_info")) + response.raise_for_status() + result = response.json() + added_tokens = result.get("added_tokens_decoder") + if added_tokens: + for token_id, token_info in added_tokens.items(): + 
assert isinstance(token_id, str), "Token IDs should be strings" + assert isinstance(token_info, dict), "Token info should be a dict" + assert "content" in token_info, "Token info should have content" + assert "special" in token_info, ( + "Token info should have special flag") + assert isinstance(token_info["special"], + bool), ("Special flag should be boolean") + + +@pytest.mark.asyncio +async def test_tokenizer_info_consistency_with_tokenize( + server: RemoteOpenAIServer, ): + """Test that tokenizer info is consistent with tokenization endpoint.""" + info_response = requests.get(server.url_for("tokenizer_info")) + info_response.raise_for_status() + info = info_response.json() + tokenize_response = requests.post( + server.url_for("tokenize"), + json={ + "model": MODEL_NAME, + "prompt": "Hello world!" + }, + ) + tokenize_response.raise_for_status() + tokenize_result = tokenize_response.json() + info_max_len = info.get("model_max_length") + tokenize_max_len = tokenize_result.get("max_model_len") + if info_max_len and tokenize_max_len: + assert info_max_len >= tokenize_max_len, ( + "Info max length should be >= tokenize max length") + + +@pytest.mark.asyncio +async def test_tokenizer_info_chat_template(server: RemoteOpenAIServer): + """Test chat template is properly included.""" + response = requests.get(server.url_for("tokenizer_info")) + response.raise_for_status() + result = response.json() + chat_template = result.get("chat_template") + if chat_template: + assert isinstance(chat_template, + str), ("Chat template should be a string") + assert chat_template.strip(), "Chat template should not be empty" \ No newline at end of file diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 19d0110ff37..c2185acbf0c 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -522,6 +522,19 @@ async def detokenize(request: DetokenizeRequest, raw_request: Request): assert_never(generator) +def maybe_register_tokenizer_info_endpoint(args): + """Conditionally register the tokenizer info endpoint if enabled.""" + if getattr(args, 'enable_tokenizer_info_endpoint', False): + + @router.get("/tokenizer_info") + async def get_tokenizer_info(raw_request: Request): + """Get comprehensive tokenizer information.""" + result = await tokenization(raw_request).get_tokenizer_info() + return JSONResponse(content=result.model_dump(), + status_code=result.code if isinstance( + result, ErrorResponse) else 200) + + @router.get("/v1/models") async def show_available_models(raw_request: Request): handler = models(raw_request) @@ -1692,6 +1705,7 @@ async def run_server_worker(listen_address, uvicorn_kwargs['log_config'] = log_config async with build_async_engine_client(args, client_config) as engine_client: + maybe_register_tokenizer_info_endpoint(args) app = build_app(args) vllm_config = await engine_client.get_vllm_config() diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index bccce73b79f..6456d009b95 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -182,6 +182,9 @@ class FrontendArgs: """If set to True, enable tracking server_load_metrics in the app state.""" enable_force_include_usage: bool = False """If set to True, including usage on every request.""" + enable_tokenizer_info_endpoint: bool = False + """Enable the /get_tokenizer_info endpoint. 
May expose chat + templates and other tokenizer configuration.""" @staticmethod def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py index f17faa23d01..16cb5b75032 100644 --- a/vllm/entrypoints/openai/protocol.py +++ b/vllm/entrypoints/openai/protocol.py @@ -1953,6 +1953,16 @@ class DetokenizeResponse(OpenAIBaseModel): prompt: str +class TokenizerInfoResponse(OpenAIBaseModel): + """ + Response containing tokenizer configuration + equivalent to tokenizer_config.json + """ + + model_config = ConfigDict(extra="allow") + tokenizer_class: str + + class LoadLoRAAdapterRequest(BaseModel): lora_name: str lora_path: str diff --git a/vllm/entrypoints/openai/serving_tokenization.py b/vllm/entrypoints/openai/serving_tokenization.py index 3db0a71fadd..8181b36ed0b 100644 --- a/vllm/entrypoints/openai/serving_tokenization.py +++ b/vllm/entrypoints/openai/serving_tokenization.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from typing import Final, Optional, Union +from dataclasses import dataclass +from typing import Any, Final, Optional, Union import jinja2 from fastapi import Request @@ -17,11 +17,13 @@ ErrorResponse, TokenizeChatRequest, TokenizeRequest, - TokenizeResponse) + TokenizeResponse, + TokenizerInfoResponse) # yapf: enable from vllm.entrypoints.openai.serving_engine import OpenAIServing from vllm.entrypoints.openai.serving_models import OpenAIServingModels from vllm.logger import init_logger +from vllm.transformers_utils.tokenizer import AnyTokenizer logger = init_logger(__name__) @@ -155,3 +157,49 @@ async def create_detokenize( input_text = prompt_input["prompt"] return DetokenizeResponse(prompt=input_text) + + async def get_tokenizer_info( + self, ) -> Union[TokenizerInfoResponse, ErrorResponse]: + """Get comprehensive tokenizer information.""" + try: + tokenizer = await self.engine_client.get_tokenizer() + info = TokenizerInfo(tokenizer, self.chat_template).to_dict() + return TokenizerInfoResponse(**info) + except Exception as e: + return self.create_error_response( + f"Failed to get tokenizer info: {str(e)}") + + +@dataclass +class TokenizerInfo: + tokenizer: AnyTokenizer + chat_template: Optional[str] + + def to_dict(self) -> dict[str, Any]: + """Return the tokenizer configuration.""" + return self._get_tokenizer_config() + + def _get_tokenizer_config(self) -> dict[str, Any]: + """Get tokenizer configuration directly from the tokenizer object.""" + config = dict(getattr(self.tokenizer, "init_kwargs", None) or {}) + + # Remove file path fields + config.pop("vocab_file", None) + config.pop("merges_file", None) + + config = self._make_json_serializable(config) + config["tokenizer_class"] = type(self.tokenizer).__name__ + if self.chat_template: + config["chat_template"] = self.chat_template + return config + + def _make_json_serializable(self, obj): + """Convert any non-JSON-serializable objects to serializable format.""" + if hasattr(obj, "content"): + return obj.content + elif isinstance(obj, dict): + return {k: self._make_json_serializable(v) for k, v in obj.items()} + elif isinstance(obj, list): + return [self._make_json_serializable(item) for item in obj] + else: + return obj From 3c95f1102db10b2bdd5111b5e1d23cee382a9494 Mon Sep 17 00:00:00 2001 From: Avshalom Manevich Date: Wed, 16 Jul 2025 17:17:20 +0200 Subject: [PATCH 136/552] [fix] fix qwen image_embeds input (#21049) Signed-off-by: h-avsha 
Signed-off-by: x22x22 --- vllm/model_executor/models/qwen2_5_vl.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vllm/model_executor/models/qwen2_5_vl.py b/vllm/model_executor/models/qwen2_5_vl.py index 42a87c4a796..8ae096536fd 100644 --- a/vllm/model_executor/models/qwen2_5_vl.py +++ b/vllm/model_executor/models/qwen2_5_vl.py @@ -974,7 +974,7 @@ def _process_image_input( grid_thw_list = grid_thw.tolist() if image_input["type"] == "image_embeds": - image_embeds = image_input["image_embeds"] + image_embeds = image_input["image_embeds"].type(self.visual.dtype) else: pixel_values = image_input["pixel_values"] image_embeds = self.visual(pixel_values, grid_thw=grid_thw_list) @@ -994,7 +994,7 @@ def _process_video_input( grid_thw_list = grid_thw.tolist() if video_input["type"] == "video_embeds": - video_embeds = video_input["video_embeds"] + video_embeds = video_input["video_embeds"].type(self.visual.dtype) else: pixel_values_videos = video_input["pixel_values_videos"] video_embeds = self.visual(pixel_values_videos, From 8e5c3495c4bfde5f52c1b464ba18c809835dab6b Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Wed, 16 Jul 2025 17:25:23 +0100 Subject: [PATCH 137/552] Remove Qwen Omni workaround that's no longer necessary (#21057) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/transformers_utils/config.py | 7 ------- 1 file changed, 7 deletions(-) diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py index db8f675bcc5..dc35d212766 100644 --- a/vllm/transformers_utils/config.py +++ b/vllm/transformers_utils/config.py @@ -733,13 +733,6 @@ def get_hf_text_config(config: PretrainedConfig): """Get the "sub" config relevant to llm for multi modal models. No op for pure text models. """ - # This block should be unnecessary after https://github.com/huggingface/transformers/pull/37517 - if hasattr(config, "thinker_config"): - # TODO(suyang.fy): Refactor code. - # For Qwen2.5-Omni, change hf_text_config to - # thinker_config.text_config. 
- return config.thinker_config.text_config - text_config = config.get_text_config() if text_config is not config: From 6a65eb5a7b6a629a86aae2003eb6ee64002a13ce Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Thu, 17 Jul 2025 03:03:37 +0800 Subject: [PATCH 138/552] [Model] Remove model sampler (#21059) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/model_executor/models/bailing_moe.py | 10 ---------- vllm/model_executor/models/granite_speech.py | 2 -- vllm/model_executor/models/hunyuan_v1_moe.py | 10 ---------- vllm/model_executor/models/mimo.py | 2 -- vllm/model_executor/models/mimo_mtp.py | 11 ----------- vllm/model_executor/models/phi4flash.py | 10 ---------- 6 files changed, 45 deletions(-) diff --git a/vllm/model_executor/models/bailing_moe.py b/vllm/model_executor/models/bailing_moe.py index 325ba7bbad8..ccfc3997e45 100644 --- a/vllm/model_executor/models/bailing_moe.py +++ b/vllm/model_executor/models/bailing_moe.py @@ -47,7 +47,6 @@ from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) from vllm.model_executor.layers.rotary_embedding import get_rope -from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler from vllm.model_executor.layers.vocab_parallel_embedding import ( ParallelLMHead, VocabParallelEmbedding) from vllm.model_executor.model_loader.weight_utils import default_weight_loader @@ -485,7 +484,6 @@ def __init__( else: self.lm_head = PPMissingLayer() - self.sampler = get_sampler() self.make_empty_intermediate_tensors = ( self.model.make_empty_intermediate_tensors) @@ -512,14 +510,6 @@ def compute_logits( sampling_metadata) return logits - def sample( - self, - logits: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[SamplerOutput]: - next_tokens = self.sampler(logits, sampling_metadata) - return next_tokens - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: loader = AutoWeightsLoader( diff --git a/vllm/model_executor/models/granite_speech.py b/vllm/model_executor/models/granite_speech.py index 6c7c9f5cc93..6a4dee9ae48 100644 --- a/vllm/model_executor/models/granite_speech.py +++ b/vllm/model_executor/models/granite_speech.py @@ -36,7 +36,6 @@ from vllm.model_executor.layers.linear import (ColumnParallelLinear, RowParallelLinear) from vllm.model_executor.layers.quantization import QuantizationConfig -from vllm.model_executor.layers.sampler import get_sampler from vllm.model_executor.models.module_mapping import MultiModelKeys from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.multimodal import MULTIMODAL_REGISTRY @@ -549,7 +548,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str): self.config = config self.quant_config = quant_config self.cache_config = cache_config - self.sampler = get_sampler() # The language model is typically a Granite LLM self.language_model = init_vllm_registered_model( diff --git a/vllm/model_executor/models/hunyuan_v1_moe.py b/vllm/model_executor/models/hunyuan_v1_moe.py index 89ca3e8a607..43ffba00721 100644 --- a/vllm/model_executor/models/hunyuan_v1_moe.py +++ b/vllm/model_executor/models/hunyuan_v1_moe.py @@ -49,7 +49,6 @@ from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) from vllm.model_executor.layers.rotary_embedding import get_rope -from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler from vllm.model_executor.layers.vocab_parallel_embedding import ( DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) from 
vllm.model_executor.model_loader.weight_utils import ( @@ -661,7 +660,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, config.vocab_size, logit_scale) - self.sampler = get_sampler() else: self.lm_head = PPMissingLayer() @@ -685,14 +683,6 @@ def compute_logits( sampling_metadata) return logits - def sample( - self, - logits: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[SamplerOutput]: - next_tokens = self.sampler(logits, sampling_metadata) - return next_tokens - def make_empty_intermediate_tensors( self, batch_size: int, dtype: torch.dtype, device: torch.device) -> IntermediateTensors: diff --git a/vllm/model_executor/models/mimo.py b/vllm/model_executor/models/mimo.py index 9b83f848ef4..5b497dd9d89 100644 --- a/vllm/model_executor/models/mimo.py +++ b/vllm/model_executor/models/mimo.py @@ -36,7 +36,6 @@ from vllm.distributed import get_pp_group from vllm.logger import init_logger from vllm.model_executor.layers.logits_processor import LogitsProcessor -from vllm.model_executor.layers.sampler import get_sampler from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead from vllm.model_executor.model_loader.weight_utils import ( default_weight_loader, maybe_remap_kv_scale_name) @@ -176,7 +175,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.lm_head = PPMissingLayer() self.logits_processor = LogitsProcessor(config.vocab_size) - self.sampler = get_sampler() self.make_empty_intermediate_tensors = ( self.model.make_empty_intermediate_tensors) diff --git a/vllm/model_executor/models/mimo_mtp.py b/vllm/model_executor/models/mimo_mtp.py index 6066ec76c5f..19afc5be3fb 100644 --- a/vllm/model_executor/models/mimo_mtp.py +++ b/vllm/model_executor/models/mimo_mtp.py @@ -30,7 +30,6 @@ from vllm.model_executor.layers.layernorm import RMSNorm from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.quantization import QuantizationConfig -from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler from vllm.model_executor.layers.vocab_parallel_embedding import ( ParallelLMHead, VocabParallelEmbedding) from vllm.model_executor.model_loader.weight_utils import default_weight_loader @@ -161,8 +160,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.lm_head = ParallelLMHead(self.config.vocab_size, self.config.hidden_size) - self.sampler = get_sampler() - def forward( self, input_ids: torch.Tensor, @@ -187,14 +184,6 @@ def compute_logits( return self.model.compute_logits(hidden_states, self.lm_head, sampling_metadata, spec_step_idx) - def sample( - self, - logits: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[SamplerOutput]: - next_tokens = self.sampler(logits, sampling_metadata) - return next_tokens - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: stacked_params_mapping = [ diff --git a/vllm/model_executor/models/phi4flash.py b/vllm/model_executor/models/phi4flash.py index c1dd9fab7fa..a4ded2b7a30 100644 --- a/vllm/model_executor/models/phi4flash.py +++ b/vllm/model_executor/models/phi4flash.py @@ -23,7 +23,6 @@ causal_conv1d_fn, causal_conv1d_update) from vllm.model_executor.layers.mamba.ops.mamba_ssm import ( selective_scan_fn, selective_state_update) -from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler from vllm.model_executor.layers.vocab_parallel_embedding import ( 
DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) from vllm.model_executor.models.interfaces import (HasInnerState, IsHybrid, @@ -641,7 +640,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, config.vocab_size, logits_as_input=False) - self.sampler = get_sampler() def forward( self, @@ -709,14 +707,6 @@ def compute_logits( prune_hidden_states=prune_hidden_states) return processed_logits - def sample( - self, - logits: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[SamplerOutput]: - next_tokens = self.sampler(logits, sampling_metadata) - return next_tokens - def load_weights( self, weights: Iterable[tuple[str, torch.Tensor]], From 7a738999b7b1cbe297904af827541ff399b89a45 Mon Sep 17 00:00:00 2001 From: Nir David Date: Wed, 16 Jul 2025 22:33:41 +0300 Subject: [PATCH 139/552] Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) (#12010) Signed-off-by: Nir David Signed-off-by: Uri Livne Co-authored-by: Uri Livne Signed-off-by: x22x22 --- docs/features/quantization/README.md | 1 + docs/features/quantization/inc.md | 56 +++++++++++++++++ .../quantization/supported_hardware.md | 25 ++++---- .../installation/intel_gaudi.md | 5 +- vllm/config.py | 13 ++-- vllm/engine/arg_utils.py | 10 ++- .../layers/quantization/__init__.py | 7 ++- .../model_executor/layers/quantization/inc.py | 61 +++++++++++++++++++ .../model_loader/base_loader.py | 10 ++- .../model_loader/weight_utils.py | 4 +- vllm/utils/__init__.py | 1 + 11 files changed, 168 insertions(+), 25 deletions(-) create mode 100644 docs/features/quantization/inc.md create mode 100644 vllm/model_executor/layers/quantization/inc.py diff --git a/docs/features/quantization/README.md b/docs/features/quantization/README.md index c30abdab5d6..e8c3b112307 100644 --- a/docs/features/quantization/README.md +++ b/docs/features/quantization/README.md @@ -10,6 +10,7 @@ Contents: - [BitBLAS](bitblas.md) - [GGUF](gguf.md) - [GPTQModel](gptqmodel.md) +- [INC](inc.md) - [INT4 W4A16](int4.md) - [INT8 W8A8](int8.md) - [FP8 W8A8](fp8.md) diff --git a/docs/features/quantization/inc.md b/docs/features/quantization/inc.md new file mode 100644 index 00000000000..d97a462f543 --- /dev/null +++ b/docs/features/quantization/inc.md @@ -0,0 +1,56 @@ +--- +title: FP8 INC +--- +[](){ #inc } + +vLLM supports FP8 (8-bit floating point) weight and activation quantization using Intel® Neural Compressor (INC) on Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators. +Currently, quantization is validated only in Llama models. + +Intel Gaudi supports quantization of various modules and functions, including, but not limited to `Linear`, `KVCache`, `Matmul` and `Softmax`. For more information, please refer to: +[Supported Modules\\Supported Functions\\Custom Patched Modules](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-modules). + +!!! note + Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in the [vllm-hpu-extention](https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration/README.md) package. + +!!! note + `QUANT_CONFIG` is an environment variable that points to the measurement or quantization [JSON config file](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-json-config-file-options). 
+    The measurement configuration file is used during the calibration procedure to collect measurements for a given model. The quantization configuration is used during inference.
+
+## Run Online Inference Using FP8
+
+Once you've completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command:
+
+```bash
+export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
+vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8
+```
+
+!!! tip
+    If you are just prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments as it causes a significant performance drop.
+
+!!! tip
+    When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use the environment variables below:
+    `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
+    `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.
+
+## Run Offline Inference Using FP8
+
+To run offline inference (after completing the model calibration process):
+
+* Set the "QUANT_CONFIG" environment variable to point to a JSON configuration file with QUANTIZE mode.
+* Pass `quantization=inc` and `kv_cache_dtype=fp8_inc` as parameters to the `LLM` object.
+* Call the `shutdown` method of the `model_executor` at the end of the run.
+
+```python
+from vllm import LLM
+llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc")
+...
+# Call llm.generate on the required prompts and sampling params.
+...
+llm.llm_engine.model_executor.shutdown()
+```
+
+## Device for the Model's Weights Uploading
+
+The unquantized weights are first loaded onto the CPU, then quantized and transferred to the target device (HPU) for model execution.
diff --git a/docs/features/quantization/supported_hardware.md b/docs/features/quantization/supported_hardware.md index bb4fe5b54b5..70a6a499562 100644 --- a/docs/features/quantization/supported_hardware.md +++ b/docs/features/quantization/supported_hardware.md @@ -2,18 +2,19 @@ The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM: -| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Neuron | Google TPU | -|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------| -| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ | -| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ | -| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | -| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | -| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ❌ | -| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | -| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | -| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | -| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | -| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | +| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU | AWS Neuron | Google TPU | +|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|--------------|--------------| +| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ | +| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ | +| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | +| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ❌ | +| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | +| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ | ❌ | ❌ | - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0. - ✅︎ indicates that the quantization method is supported on the specified hardware. 
diff --git a/docs/getting_started/installation/intel_gaudi.md b/docs/getting_started/installation/intel_gaudi.md index 09cffb29cb3..0be0d02d067 100644 --- a/docs/getting_started/installation/intel_gaudi.md +++ b/docs/getting_started/installation/intel_gaudi.md @@ -28,7 +28,7 @@ To verify that the Intel Gaudi software was correctly installed, run: hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed -pip list | grep neural # verify that neural_compressor is installed +pip list | grep neural # verify that neural_compressor_pt is installed ``` Refer to [Intel Gaudi Software Stack Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade) @@ -120,12 +120,13 @@ docker run \ - Inference with [HPU Graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html) for accelerating low-batch latency and throughput - Attention with Linear Biases (ALiBi) +- INC quantization ### Unsupported features - Beam search - LoRA adapters -- Quantization +- AWQ quantization - Prefill chunking (mixed-batch inferencing) ### Supported configurations diff --git a/vllm/config.py b/vllm/config.py index 2965696090d..c3f0cebc6b3 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -963,7 +963,7 @@ def _verify_quantization(self) -> None: optimized_quantization_methods = [ "fp8", "marlin", "modelopt", "gptq_marlin_24", "gptq_marlin", "awq_marlin", "fbgemm_fp8", "compressed-tensors", "experts_int8", - "quark", "modelopt_fp4", "bitblas", "gptq_bitblas" + "quark", "modelopt_fp4", "bitblas", "gptq_bitblas", "inc" ] if self.quantization is not None: self.quantization = cast(me_quant.QuantizationMethods, @@ -1563,7 +1563,7 @@ def get_and_verify_max_len(self, max_model_len: int): BlockSize = Literal[1, 8, 16, 32, 64, 128] -CacheDType = Literal["auto", "fp8", "fp8_e4m3", "fp8_e5m2"] +CacheDType = Literal["auto", "fp8", "fp8_e4m3", "fp8_e5m2", "fp8_inc"] PrefixCachingHashAlgo = Literal["builtin", "sha256", "sha256_cbor_64bit"] @@ -1593,7 +1593,7 @@ class CacheConfig: cache_dtype: CacheDType = "auto" """Data type for kv cache storage. If "auto", will use model data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. ROCm (AMD GPU) supports - fp8 (=fp8_e4m3).""" + fp8 (=fp8_e4m3). Intel Gaudi (HPU) supports fp8 (using fp8_inc).""" is_attention_free: bool = False """Whether the model is attention-free. This is primarily set in `ModelConfig` and that value should be manually duplicated here.""" @@ -1691,7 +1691,7 @@ def _verify_cache_dtype(self) -> None: "Using fp8 data type to store kv cache. It reduces the GPU " "memory footprint and boosts the performance. " "Meanwhile, it may cause accuracy drop without a proper " - "scaling factor") + "scaling factor.") else: raise ValueError(f"Unknown kv cache dtype: {self.cache_dtype}") @@ -1781,6 +1781,9 @@ class LoadConfig: default_factory=dict) """Extra config for model loader. 
This will be passed to the model loader corresponding to the chosen load_format.""" + device: Optional[str] = None + """Device to which model weights will be loaded, default to + device_config.device""" ignore_patterns: Optional[Union[list[str], str]] = None """The list of patterns to ignore when loading the model. Default to "original/**/*" to avoid repeated loading of llama's checkpoints.""" @@ -1907,7 +1910,7 @@ class ParallelConfig: or equal to the number of GPUs available, "mp" will be used to keep processing on a single host. Otherwise, this will default to "ray" if Ray is installed and fail otherwise. Note that tpu - and hpu only support Ray for distributed inference.""" + only support Ray for distributed inference.""" worker_cls: str = "auto" """The full name of the worker class to use. If "auto", the worker class diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 7b73060e349..ae5eb46fa96 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -139,6 +139,10 @@ def get_type_hints(type_hint: TypeHint) -> set[TypeHint]: return type_hints +def is_online_quantization(quantization: Any) -> bool: + return quantization in ["inc"] + + @functools.lru_cache(maxsize=30) def _compute_kwargs(cls: ConfigType) -> dict[str, Any]: cls_docs = get_attr_docs(cls) @@ -960,6 +964,8 @@ def create_load_config(self) -> LoadConfig: return LoadConfig( load_format=self.load_format, download_dir=self.download_dir, + device="cpu" + if is_online_quantization(self.quantization) else None, model_loader_extra_config=self.model_loader_extra_config, ignore_patterns=self.ignore_patterns, use_tqdm_on_load=self.use_tqdm_on_load, @@ -1359,7 +1365,9 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: supported = False if current_platform.is_rocm() or ( current_platform.is_cuda() - and current_platform.is_device_capability(100)): + and current_platform.is_device_capability(100)) or ( + current_platform.device_name + == "hpu"): # handle hpu also for OOT platform supported = True elif fp8_attention and will_use_fa: from vllm.attention.utils.fa_utils import ( diff --git a/vllm/model_executor/layers/quantization/__init__.py b/vllm/model_executor/layers/quantization/__init__.py index 60217ee86ad..95aea912a15 100644 --- a/vllm/model_executor/layers/quantization/__init__.py +++ b/vllm/model_executor/layers/quantization/__init__.py @@ -36,6 +36,7 @@ "torchao", "auto-round", "rtn", + "inc", ] QUANTIZATION_METHODS: list[str] = list(get_args(QuantizationMethods)) @@ -104,6 +105,7 @@ def get_quantization_config(quantization: str) -> type[QuantizationConfig]: from .gptq_marlin import GPTQMarlinConfig from .gptq_marlin_24 import GPTQMarlin24Config from .hqq_marlin import HQQMarlinConfig + from .inc import INCConfig from .ipex_quant import IPEXConfig from .marlin import MarlinConfig from .modelopt import ModelOptFp8Config, ModelOptNvFp4Config @@ -144,7 +146,8 @@ def get_quantization_config(quantization: str) -> type[QuantizationConfig]: "moe_wna16": MoeWNA16Config, "torchao": TorchAOConfig, "auto-round": AutoRoundConfig, - "rtn": RTNConfig + "rtn": RTNConfig, + "inc": INCConfig, } # Update the `method_to_config` with customized quantization methods. 
method_to_config.update(_CUSTOMIZED_METHOD_TO_QUANT_CONFIG) @@ -157,4 +160,4 @@ def get_quantization_config(quantization: str) -> type[QuantizationConfig]: "QuantizationMethods", "get_quantization_config", "QUANTIZATION_METHODS", -] \ No newline at end of file +] diff --git a/vllm/model_executor/layers/quantization/inc.py b/vllm/model_executor/layers/quantization/inc.py new file mode 100644 index 00000000000..8aa1f1a14bf --- /dev/null +++ b/vllm/model_executor/layers/quantization/inc.py @@ -0,0 +1,61 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# +# Intel Gaudi supports quantization of various modules and functions, +# including, but not limited to `Linear`, `KVCache`, `Matmul` and `Softmax`. +# During model loading, +# INC will patch layers with quantization/dequantization operators. +# Meanwhile, INC will convert original weight to target datatype +# and loading to target device. +# static scaling should be provided through Quant_CONFIG: +# `QUANT_CONFIG` is an environment variable, +# that points to the measurement or quantization JSON config file. +# The measurement configuration file is used during the calibration procedure, +# to collect measurements for a given model. +# The quantization configuration is used during inference. +# For more information, please refer to: +# https://docs.habana.ai/en/v1.21.1/PyTorch/vLLM_Inference/vLLM_FP8_Inference.html + +from typing import Any, Optional + +import torch + +from vllm.model_executor.layers.fused_moe.layer import ( + FusedMoE, UnquantizedFusedMoEMethod) +from vllm.model_executor.layers.linear import (LinearBase, + UnquantizedLinearMethod) +from vllm.model_executor.layers.quantization import QuantizationMethods +from vllm.model_executor.layers.quantization.base_config import ( + QuantizationConfig, QuantizeMethodBase) + + +class INCConfig(QuantizationConfig): + """Config class for FP8 using Intel Neural Compressor.""" + + @classmethod + def get_name(cls) -> QuantizationMethods: + return "inc" + + @classmethod + def get_supported_act_dtypes(cls) -> list[torch.dtype]: + return [torch.bfloat16] + + @classmethod + def from_config(cls, config: dict[str, Any]) -> "INCConfig": + raise AssertionError + + def get_quant_method(self, layer: torch.nn.Module, + prefix: str) -> Optional["QuantizeMethodBase"]: + if isinstance(layer, LinearBase): + return UnquantizedLinearMethod() + elif isinstance(layer, FusedMoE): + return UnquantizedFusedMoEMethod(layer.moe_config) + return None + + @classmethod + def get_min_capability(cls) -> int: + raise AssertionError + + @staticmethod + def get_config_filenames() -> list[str]: + return [] diff --git a/vllm/model_executor/model_loader/base_loader.py b/vllm/model_executor/model_loader/base_loader.py index 5018c7d9a36..4cf6c798896 100644 --- a/vllm/model_executor/model_loader/base_loader.py +++ b/vllm/model_executor/model_loader/base_loader.py @@ -6,9 +6,12 @@ import torch.nn as nn from vllm.config import LoadConfig, ModelConfig, VllmConfig +from vllm.logger import init_logger from vllm.model_executor.model_loader.utils import ( initialize_model, process_weights_after_loading, set_default_torch_dtype) +logger = init_logger(__name__) + class BaseModelLoader(ABC): """Base class for model loaders.""" @@ -32,11 +35,16 @@ def load_model(self, vllm_config: VllmConfig, model_config: ModelConfig) -> nn.Module: """Load a model with the given configurations.""" device_config = vllm_config.device_config - target_device = torch.device(device_config.device) + 
load_config = vllm_config.load_config + load_device = device_config.device if load_config.device is None else \ + load_config.device + target_device = torch.device(load_device) with set_default_torch_dtype(model_config.dtype): with target_device: model = initialize_model(vllm_config=vllm_config, model_config=model_config) + + logger.debug("Loading weights on %s ...", load_device) # Quantization does not happen in `load_weights` but after it self.load_weights(model, model_config) process_weights_after_loading(model, model_config, target_device) diff --git a/vllm/model_executor/model_loader/weight_utils.py b/vllm/model_executor/model_loader/weight_utils.py index 178b37d7d70..64a2089921e 100644 --- a/vllm/model_executor/model_loader/weight_utils.py +++ b/vllm/model_executor/model_loader/weight_utils.py @@ -152,8 +152,8 @@ def get_quant_config(model_config: ModelConfig, quant_cls = get_quantization_config(model_config.quantization) # GGUF doesn't have config file - if model_config.quantization == "gguf": - return quant_cls.from_config({}) + if model_config.quantization in ("gguf", "inc"): + return quant_cls() # Read the quantization config from the HF model config, if available. hf_quant_config = getattr(model_config.hf_config, "quantization_config", diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index c18f1d12ba9..bbcc2a523dc 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -179,6 +179,7 @@ "fp8_e4m3": torch.uint8, "fp8_e5m2": torch.uint8, "int8": torch.int8, + "fp8_inc": torch.float8_e4m3fn, } TORCH_DTYPE_TO_NUMPY_DTYPE = { From d5ec147c9e604121e539e6187cc8f848e5ad39f9 Mon Sep 17 00:00:00 2001 From: QiliangCui Date: Wed, 16 Jul 2025 17:25:26 -0700 Subject: [PATCH 140/552] Remove torch_xla.tpu.version() from pallas.py. (#21065) Signed-off-by: Qiliang Cui Signed-off-by: x22x22 --- vllm/v1/attention/backends/pallas.py | 4 ---- 1 file changed, 4 deletions(-) diff --git a/vllm/v1/attention/backends/pallas.py b/vllm/v1/attention/backends/pallas.py index b7fc1ffeb65..52e12a1a506 100644 --- a/vllm/v1/attention/backends/pallas.py +++ b/vllm/v1/attention/backends/pallas.py @@ -167,10 +167,6 @@ def __init__( "are not implemented for " "PallasAttentionBackendImpl") - tpu_version = torch_xla.tpu.version() - if tpu_version < 4: - raise NotImplementedError("TPU version must be 4 or higher.") - def forward( self, layer: AttentionLayer, From e499ff49aee6579af91cb790d965f4e1819ecb65 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 16 Jul 2025 22:30:44 -0400 Subject: [PATCH 141/552] Update PyTorch to `torch==2.7.1` for CUDA (#21011) Signed-off-by: mgoin Signed-off-by: x22x22 --- CMakeLists.txt | 2 +- pyproject.toml | 2 +- requirements/build.txt | 2 +- requirements/cuda.txt | 10 +++++----- requirements/test.in | 6 +++--- requirements/test.txt | 8 ++++---- tests/entrypoints/openai/test_vision.py | 4 ++-- 7 files changed, 17 insertions(+), 17 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 513f4a87f8f..edc64f87730 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -45,7 +45,7 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1 # requirements.txt files and should be kept consistent. 
The ROCm torch # versions are derived from docker/Dockerfile.rocm # -set(TORCH_SUPPORTED_VERSION_CUDA "2.7.0") +set(TORCH_SUPPORTED_VERSION_CUDA "2.7.1") set(TORCH_SUPPORTED_VERSION_ROCM "2.7.0") # diff --git a/pyproject.toml b/pyproject.toml index 65ba0b4d833..85a112ff51c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,7 +6,7 @@ requires = [ "packaging>=24.2", "setuptools>=77.0.3,<80.0.0", "setuptools-scm>=8.0", - "torch == 2.7.0", + "torch == 2.7.1", "wheel", "jinja2", ] diff --git a/requirements/build.txt b/requirements/build.txt index 528cd3b538e..dd644d621ef 100644 --- a/requirements/build.txt +++ b/requirements/build.txt @@ -4,7 +4,7 @@ ninja packaging>=24.2 setuptools>=77.0.3,<80.0.0 setuptools-scm>=8 -torch==2.7.0 +torch==2.7.1 wheel jinja2>=3.1.6 regex diff --git a/requirements/cuda.txt b/requirements/cuda.txt index a71d9728f38..c1273b224ea 100644 --- a/requirements/cuda.txt +++ b/requirements/cuda.txt @@ -6,9 +6,9 @@ numba == 0.61.2; python_version > '3.9' # Dependencies for NVIDIA GPUs ray[cgraph]>=2.43.0, !=2.44.* # Ray Compiled Graph, required for pipeline parallelism in V1. -torch==2.7.0 -torchaudio==2.7.0 +torch==2.7.1 +torchaudio==2.7.1 # These must be updated alongside torch -torchvision==0.22.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version -# https://github.com/facebookresearch/xformers/releases/tag/v0.0.30 -xformers==0.0.30; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch >= 2.7 +torchvision==0.22.1 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version +# https://github.com/facebookresearch/xformers/releases/tag/v0.0.31 +xformers==0.0.31; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch >= 2.7 diff --git a/requirements/test.in b/requirements/test.in index e8537d10fa7..e8715afaf4f 100644 --- a/requirements/test.in +++ b/requirements/test.in @@ -22,9 +22,9 @@ sentence-transformers # required for embedding tests soundfile # required for audio tests jiwer # required for audio tests timm # required for internvl test -torch==2.7.0 -torchaudio==2.7.0 -torchvision==0.22.0 +torch==2.7.1 +torchaudio==2.7.1 +torchvision==0.22.1 transformers_stream_generator # required for qwen-vl test mamba_ssm # required for plamo2 test matplotlib # required for qwen-vl test diff --git a/requirements/test.txt b/requirements/test.txt index 84303b83117..90d8f8ff0bc 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -762,7 +762,7 @@ tomli==2.2.1 # via schemathesis tomli-w==1.2.0 # via schemathesis -torch==2.7.0+cu128 +torch==2.7.1+cu128 # via # -r requirements/test.in # accelerate @@ -781,12 +781,12 @@ torch==2.7.0+cu128 # torchvision # vector-quantize-pytorch # vocos -torchaudio==2.7.0+cu128 +torchaudio==2.7.1+cu128 # via # -r requirements/test.in # encodec # vocos -torchvision==0.22.0+cu128 +torchvision==0.22.1+cu128 # via # -r requirements/test.in # timm @@ -816,7 +816,7 @@ transformers==4.53.2 # transformers-stream-generator transformers-stream-generator==0.0.5 # via -r requirements/test.in -triton==3.3.0 +triton==3.3.1 # via torch tritonclient==2.51.0 # via diff --git a/tests/entrypoints/openai/test_vision.py b/tests/entrypoints/openai/test_vision.py index fd613842f98..b6f1d64803e 100644 --- a/tests/entrypoints/openai/test_vision.py +++ b/tests/entrypoints/openai/test_vision.py @@ -36,11 +36,11 @@ ], [ "The image shows a Venn 
diagram with three over", - "This image shows a Venn diagram with three over", + "The image shows a Venn diagram with three intersect", ], [ "This image displays a gradient of colors ranging from", - "This image displays a gradient of colors transitioning from", + "The image displays a gradient of colors ranging from", ], ] From cdda63f9f499eae0e05c4fd72e1fa6c665bf5e7e Mon Sep 17 00:00:00 2001 From: Kevin_Xiong Date: Thu, 17 Jul 2025 10:36:36 +0800 Subject: [PATCH 142/552] [Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group (#21024) Signed-off-by: KevinXiong-C Signed-off-by: x22x22 --- vllm/model_executor/layers/linear.py | 53 +++++++++++++--------------- 1 file changed, 25 insertions(+), 28 deletions(-) diff --git a/vllm/model_executor/layers/linear.py b/vllm/model_executor/layers/linear.py index a05ae0edbd7..366dfd97d81 100644 --- a/vllm/model_executor/layers/linear.py +++ b/vllm/model_executor/layers/linear.py @@ -452,8 +452,10 @@ def __init__( else: self.register_parameter("bias", None) + self.tp_rank = get_tensor_model_parallel_rank() + def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor): - tp_rank = get_tensor_model_parallel_rank() + output_dim = getattr(param, "output_dim", None) is_sharded_weight = getattr(param, "is_sharded_weight", False) @@ -472,15 +474,15 @@ def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor): if is_gguf_weight and isinstance(param, UninitializedParameter): final_shape = list(loaded_weight.shape) if output_dim is not None: - tp_size = get_tensor_model_parallel_world_size() - assert final_shape[output_dim] % tp_size == 0 - final_shape[output_dim] = final_shape[output_dim] // tp_size + assert final_shape[output_dim] % self.tp_size == 0 + final_shape[output_dim] = (final_shape[output_dim] // + self.tp_size) param.materialize(final_shape, dtype=loaded_weight.dtype) param_data = param.data if output_dim is not None and not is_sharded_weight: shard_size = param_data.shape[output_dim] - start_idx = tp_rank * shard_size + start_idx = self.tp_rank * shard_size loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size) @@ -565,8 +567,11 @@ def __init__( return_bias: bool = True, ): self.output_sizes = output_sizes - tp_size = get_tensor_model_parallel_world_size() - assert all(output_size % tp_size == 0 for output_size in output_sizes) + self.tp_size = get_tensor_model_parallel_world_size() + self.tp_rank = get_tensor_model_parallel_rank() + + assert all(output_size % self.tp_size == 0 + for output_size in output_sizes) super().__init__(input_size=input_size, output_size=sum(output_sizes), bias=bias, @@ -598,12 +603,10 @@ def weight_loader(self, return if is_gguf_weight: - tp_size = get_tensor_model_parallel_world_size() - tp_rank = get_tensor_model_parallel_rank() output_dim = getattr(param, "output_dim", None) - shard_size = loaded_weight.size(output_dim) // tp_size - start_idx = tp_rank * shard_size + shard_size = loaded_weight.size(output_dim) // self.tp_size + start_idx = self.tp_rank * shard_size if loaded_shard_id is not None: loaded_weight = loaded_weight.narrow(output_dim, start_idx, @@ -669,11 +672,10 @@ def weight_loader(self, return assert loaded_shard_id < len(self.output_sizes) - tp_rank = get_tensor_model_parallel_rank() - tp_size = get_tensor_model_parallel_world_size() if output_dim is not None: - shard_offset = sum(self.output_sizes[:loaded_shard_id]) // tp_size - shard_size = self.output_sizes[loaded_shard_id] // tp_size + shard_offset = 
(sum(self.output_sizes[:loaded_shard_id]) // + self.tp_size) + shard_size = self.output_sizes[loaded_shard_id] // self.tp_size # Special case for quantization. # If quantized, we need to adjust the offset and size to account # for the packing. @@ -701,7 +703,7 @@ def weight_loader(self, param_data = param_data.narrow(output_dim, shard_offset, shard_size) - start_idx = tp_rank * shard_size + start_idx = self.tp_rank * shard_size if not is_sharded_weight: loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size) @@ -991,12 +993,9 @@ def weight_loader(self, return if is_gguf_weight: - tp_size = get_tensor_model_parallel_world_size() - tp_rank = get_tensor_model_parallel_rank() - output_dim = getattr(param, "output_dim", None) - shard_size = loaded_weight.size(output_dim) // tp_size - start_idx = tp_rank * shard_size + shard_size = loaded_weight.size(output_dim) // self.tp_size + start_idx = self.tp_rank * shard_size if loaded_shard_id is not None: loaded_weight = loaded_weight.narrow(output_dim, start_idx, @@ -1071,7 +1070,6 @@ def weight_loader(self, self.weight_loader(param, loaded_weight_shard, shard_id) return - tp_rank = get_tensor_model_parallel_rank() assert loaded_shard_id in ["q", "k", "v"] # If output dim is defined, use the default loading process. @@ -1123,9 +1121,9 @@ def weight_loader(self, param_data = param_data.narrow(output_dim, shard_offset, shard_size) if loaded_shard_id == "q": - shard_id = tp_rank + shard_id = self.tp_rank else: - shard_id = tp_rank // self.num_kv_head_replicas + shard_id = self.tp_rank // self.num_kv_head_replicas start_idx = shard_id * shard_size if not is_sharded_weight: @@ -1245,8 +1243,6 @@ def __init__( self.register_parameter("bias", None) def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor): - tp_rank = get_tensor_model_parallel_rank() - tp_size = get_tensor_model_parallel_world_size() input_dim = getattr(param, "input_dim", None) use_bitsandbytes_4bit = getattr(param, "use_bitsandbytes_4bit", False) is_sharded_weight = getattr(param, "is_sharded_weight", False) @@ -1264,13 +1260,14 @@ def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor): if is_gguf_weight and isinstance(param, UninitializedParameter): weight_shape = list(loaded_weight.shape) if input_dim: - weight_shape[input_dim] = weight_shape[input_dim] // tp_size + weight_shape[input_dim] = (weight_shape[input_dim] // + self.tp_size) param.materialize(tuple(weight_shape), dtype=loaded_weight.dtype) param_data = param.data if input_dim is not None and not is_sharded_weight: shard_size = param_data.shape[input_dim] - start_idx = tp_rank * shard_size + start_idx = self.tp_rank * shard_size loaded_weight = loaded_weight.narrow(input_dim, start_idx, shard_size) From c3f13bcabdcb793e32200dd391f9467a04a6363c Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 16 Jul 2025 22:37:13 -0400 Subject: [PATCH 143/552] [Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile (#21013) Signed-off-by: mgoin Signed-off-by: x22x22 --- docker/Dockerfile | 68 +++++++++++++++++++---------------------------- 1 file changed, 27 insertions(+), 41 deletions(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index e0e08510c10..b06c4d33626 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -388,48 +388,33 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist # -rw-rw-r-- 1 mgoin mgoin 205M Jun 9 18:03 flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl # $ # upload the wheel to a public location, e.g. 
https://wheels.vllm.ai/flashinfer/v0.2.6.post1/flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl -# Allow specifying a version, Git revision or local .whl file -ARG FLASHINFER_CUDA128_INDEX_URL="https://download.pytorch.org/whl/cu128/flashinfer" -ARG FLASHINFER_CUDA128_WHEEL="flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl" +# Install FlashInfer from source ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git" ARG FLASHINFER_GIT_REF="v0.2.8rc1" -# Flag to control whether to use pre-built FlashInfer wheels (set to false to force build from source) -# TODO: Currently disabled because the pre-built wheels are not available for FLASHINFER_GIT_REF -ARG USE_FLASHINFER_PREBUILT_WHEEL=false RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH' . /etc/environment - if [ "$TARGETPLATFORM" != "linux/arm64" ]; then - # FlashInfer already has a wheel for PyTorch 2.7.0 and CUDA 12.8. This is enough for CI use - if [[ "$CUDA_VERSION" == 12.8* ]] && [[ "$USE_FLASHINFER_PREBUILT_WHEEL" == "true" ]]; then - uv pip install --system ${FLASHINFER_CUDA128_INDEX_URL}/${FLASHINFER_CUDA128_WHEEL} - else - # Exclude CUDA arches for older versions (11.x and 12.0-12.7) - # TODO: Update this to allow setting TORCH_CUDA_ARCH_LIST as a build arg. - if [[ "${CUDA_VERSION}" == 11.* ]]; then - FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9" - elif [[ "${CUDA_VERSION}" == 12.[0-7]* ]]; then - FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a" - else - # CUDA 12.8+ supports 10.0a and 12.0 - FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 12.0" - fi - echo "🏗️ Building FlashInfer for arches: ${FI_TORCH_CUDA_ARCH_LIST}" - - git clone --depth 1 --recursive --shallow-submodules \ - --branch ${FLASHINFER_GIT_REF} \ - ${FLASHINFER_GIT_REPO} flashinfer - - # Needed to build AOT kernels - pushd flashinfer - TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \ - python3 -m flashinfer.aot - TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \ - uv pip install --system --no-build-isolation . - popd - - rm -rf flashinfer - fi \ - fi + git clone --depth 1 --recursive --shallow-submodules \ + --branch ${FLASHINFER_GIT_REF} \ + ${FLASHINFER_GIT_REPO} flashinfer + # Exclude CUDA arches for older versions (11.x and 12.0-12.7) + # TODO: Update this to allow setting TORCH_CUDA_ARCH_LIST as a build arg. + if [[ "${CUDA_VERSION}" == 11.* ]]; then + FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9" + elif [[ "${CUDA_VERSION}" == 12.[0-7]* ]]; then + FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a" + else + # CUDA 12.8+ supports 10.0a and 12.0 + FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 12.0" + fi + echo "🏗️ Building FlashInfer for arches: ${FI_TORCH_CUDA_ARCH_LIST}" + # Needed to build AOT kernels + pushd flashinfer + TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \ + python3 -m flashinfer.aot + TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \ + uv pip install --system --no-build-isolation . 
+ popd + rm -rf flashinfer BASH COPY examples examples COPY benchmarks benchmarks @@ -521,10 +506,11 @@ RUN --mount=type=cache,target=/root/.cache/uv \ uv pip install --system -r requirements/kv_connectors.txt; \ fi; \ if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \ - uv pip install --system accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.42.0' 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]; \ + BITSANDBYTES_VERSION="0.42.0"; \ else \ - uv pip install --system accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.46.1' 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]; \ - fi + BITSANDBYTES_VERSION="0.46.1"; \ + fi; \ + uv pip install --system accelerate hf_transfer 'modelscope!=1.15.0' "bitsandbytes>=${BITSANDBYTES_VERSION}" 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3] ENV VLLM_USAGE_SOURCE production-docker-image From 2e8285b5678eacf4df1ab77b0ddb9d3bfbb39ce5 Mon Sep 17 00:00:00 2001 From: XiongfeiWei Date: Wed, 16 Jul 2025 19:37:44 -0700 Subject: [PATCH 144/552] [TPU] Start using python 3.12 (#21000) Signed-off-by: Xiongfei Wei Signed-off-by: x22x22 --- .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh | 2 +- docker/Dockerfile.tpu | 4 ++-- docs/getting_started/installation/google_tpu.md | 4 ++-- requirements/tpu.txt | 9 ++++----- 4 files changed, 9 insertions(+), 10 deletions(-) diff --git a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh index 90cad506ab1..60f0d174bd6 100755 --- a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh +++ b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh @@ -70,7 +70,7 @@ export VLLM_XLA_CACHE_PATH= echo "Using VLLM V1" echo "--- Hardware Information ---" -tpu-info +# tpu-info echo "--- Starting Tests ---" set +e overall_script_exit_code=0 diff --git a/docker/Dockerfile.tpu b/docker/Dockerfile.tpu index 295270d29f7..3474ff50de7 100644 --- a/docker/Dockerfile.tpu +++ b/docker/Dockerfile.tpu @@ -1,5 +1,5 @@ -ARG NIGHTLY_DATE="20250124" -ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE" +ARG NIGHTLY_DATE="20250714" +ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.12_tpuvm_$NIGHTLY_DATE" FROM $BASE_IMAGE WORKDIR /workspace/vllm diff --git a/docs/getting_started/installation/google_tpu.md b/docs/getting_started/installation/google_tpu.md index 5dc2a7c93f4..55d69d11fa4 100644 --- a/docs/getting_started/installation/google_tpu.md +++ b/docs/getting_started/installation/google_tpu.md @@ -37,7 +37,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp - Google Cloud TPU VM - TPU versions: v6e, v5e, v5p, v4 -- Python: 3.10 or newer +- Python: 3.11 or newer ### Provision Cloud TPUs @@ -117,7 +117,7 @@ source ~/.bashrc Create and activate a Conda environment for vLLM: ```bash -conda create -n vllm python=3.10 -y +conda create -n vllm python=3.12 -y conda activate vllm ``` diff --git a/requirements/tpu.txt b/requirements/tpu.txt index db58b37c2b1..354771482ee 100644 --- a/requirements/tpu.txt +++ b/requirements/tpu.txt @@ -18,9 +18,8 @@ setuptools==78.1.0 --find-links https://storage.googleapis.com/libtpu-releases/index.html --find-links https://storage.googleapis.com/jax-releases/jax_nightly_releases.html --find-links https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html -torch==2.9.0.dev20250711 -torchvision==0.24.0.dev20250711 -torch_xla[tpu, pallas] @ 
https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250711-cp39-cp39-linux_x86_64.whl ; python_version == "3.9" -torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250711-cp310-cp310-linux_x86_64.whl ; python_version == "3.10" -torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250711-cp311-cp311-linux_x86_64.whl ; python_version == "3.11" +torch==2.9.0.dev20250716 +torchvision==0.24.0.dev20250716 +torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250716-cp311-cp311-linux_x86_64.whl ; python_version == "3.11" +torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250716-cp312-cp312-linux_x86_64.whl ; python_version == "3.12" From f24a353ed6b4f01400ff67bfed7d896b12470ff5 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 16 Jul 2025 22:54:45 -0400 Subject: [PATCH 145/552] [Bugfix] Fix Machete zero point issue for GPTQ models on SM90 (#21066) Signed-off-by: mgoin Signed-off-by: x22x22 --- .../layers/quantization/kernels/mixed_precision/machete.py | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py b/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py index ed81b02bc4a..da951ddab2e 100644 --- a/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py +++ b/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py @@ -126,6 +126,11 @@ def apply_weights(self, if c.has_g_idx: x_2d = self.act_perm(x_2d) + if c.zero_points: + assert w_zp is not None + else: + w_zp = None + output = ops.machete_mm(a=x_2d, b_q=w_q, b_type=c.weight_type, From a12c63c5777d2e92a4f06027e7a2c5c861333b5f Mon Sep 17 00:00:00 2001 From: Lucas Wilkinson Date: Thu, 17 Jul 2025 00:44:25 -0400 Subject: [PATCH 146/552] [Attention] Refactor attention metadata builder interface (#20466) Signed-off-by: Lucas Wilkinson Signed-off-by: x22x22 --- tests/v1/attention/test_attention_backends.py | 466 ++++++++++++++++++ tests/v1/attention/utils.py | 229 +++++++++ tests/v1/spec_decode/test_eagle.py | 68 ++- vllm/v1/attention/backends/cpu_attn.py | 65 +-- vllm/v1/attention/backends/flash_attn.py | 101 ++-- vllm/v1/attention/backends/flashinfer.py | 157 ++---- vllm/v1/attention/backends/flex_attention.py | 59 +-- vllm/v1/attention/backends/mamba_attn.py | 130 ++--- vllm/v1/attention/backends/mla/common.py | 183 +++---- vllm/v1/attention/backends/mla/flashmla.py | 15 +- .../attention/backends/mla/rocm_aiter_mla.py | 35 +- vllm/v1/attention/backends/rocm_aiter_fa.py | 89 ++-- vllm/v1/attention/backends/triton_attn.py | 73 ++- vllm/v1/attention/backends/utils.py | 140 +++++- vllm/v1/spec_decode/eagle.py | 198 ++++---- vllm/v1/spec_decode/utils.py | 27 - vllm/v1/worker/block_table.py | 41 +- vllm/v1/worker/gpu_model_runner.py | 149 +++--- 18 files changed, 1447 insertions(+), 778 deletions(-) create mode 100644 tests/v1/attention/test_attention_backends.py create mode 100644 tests/v1/attention/utils.py diff --git a/tests/v1/attention/test_attention_backends.py b/tests/v1/attention/test_attention_backends.py new file mode 100644 index 00000000000..b4e0101a0d4 --- /dev/null +++ b/tests/v1/attention/test_attention_backends.py @@ -0,0 +1,466 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM 
project +"""Tests for v1 attention backends without GPUModelRunner dependency.""" + +import pytest +import torch + +from tests.v1.attention.utils import (BatchSpec, _Backend, + create_common_attn_metadata, + create_standard_kv_cache_spec, + create_vllm_config, + get_attention_backend) +from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, cdiv +from vllm.v1.attention.backends.utils import CommonAttentionMetadata +from vllm.v1.kv_cache_interface import FullAttentionSpec + +BACKENDS_TO_TEST = [ + _Backend.FLASH_ATTN_VLLM_V1, _Backend.FLASHINFER_VLLM_V1, + _Backend.FLEX_ATTENTION, _Backend.TRITON_ATTN_VLLM_V1 +] + +# Remove flashinfer from the list if it's not available +try: + import flashinfer # noqa: F401 +except ImportError: + BACKENDS_TO_TEST.remove(_Backend.FLASHINFER_VLLM_V1) + + +def _convert_dtype_to_torch(dtype): + """Convert ModelDType to torch.dtype.""" + if isinstance(dtype, str): + if dtype == "auto": + return torch.float16 # Default dtype for testing + elif dtype in STR_DTYPE_TO_TORCH_DTYPE: + return STR_DTYPE_TO_TORCH_DTYPE[dtype] + else: + raise ValueError(f"Unknown dtype: {dtype}") + elif isinstance(dtype, torch.dtype): + return dtype + else: + raise ValueError(f"Unknown dtype: {dtype}") + + +# Define common batch configurations +BATCH_SPECS = { + "small_decode": + BatchSpec(seq_lens=[32, 40], query_lens=[1, 1]), + "small_prefill": + BatchSpec(seq_lens=[32, 40], query_lens=[8, 8]), + "mixed_small": + BatchSpec(seq_lens=[32, 40, 48, 56], query_lens=[1, 1, 5, 5]), + "medium_decode": + BatchSpec(seq_lens=[128, 256, 512, 1024, 128, 256, 512, 1024], + query_lens=[1, 1, 1, 1, 1, 1, 1, 1]), + "medium_prefill": + BatchSpec(seq_lens=[256, 512, 1024, 2048], query_lens=[16, 16, 16, 16]), + "mixed_medium": + BatchSpec(seq_lens=[512, 1024, 2048, 512, 1024, 2048], + query_lens=[1, 1, 1, 7, 7, 7]), + "large_decode": + BatchSpec(seq_lens=[2048] * 32, query_lens=[1] * 32), + "large_prefill": + BatchSpec(seq_lens=[4096] * 8, query_lens=[32] * 8), + "single_decode": + BatchSpec(seq_lens=[1024], query_lens=[1]), + "single_prefill": + BatchSpec(seq_lens=[1024], query_lens=[64]), +} + + +def create_dummy_kv_cache(kv_cache_spec: FullAttentionSpec, + device: torch.device, + num_blocks: int = 100) -> torch.Tensor: + """Create a dummy KV cache tensor for testing.""" + kv_cache = torch.randn( + 2, # K and V + num_blocks, + kv_cache_spec.block_size, + kv_cache_spec.num_kv_heads, + kv_cache_spec.head_size, + dtype=_convert_dtype_to_torch(kv_cache_spec.dtype), + device=device, + ) + return kv_cache + + +def create_and_prepopulate_kv_cache( + k_contexts: list[torch.Tensor], + v_contexts: list[torch.Tensor], + block_size: int, + num_kv_heads: int, + head_size: int, + dtype: torch.dtype, + device: torch.device, + num_blocks: int, + common_attn_metadata: CommonAttentionMetadata, + randomize_blocks: bool = True) -> torch.Tensor: + """Create and prepopulate a KV cache with context data. 
+ + Args: + k_contexts: List of key context tensors for each sequence + v_contexts: List of value context tensors for each sequence + seq_lens: List of sequence lengths + block_size: Size of each block + num_kv_heads: Number of KV heads + head_size: Size of each head + dtype: Data type for the cache + device: Device to create the cache on + num_blocks: Total number of blocks in the cache + block_table: Block table tensor to populate + randomize_blocks: Whether to randomly permute blocks + or use sequential order + + Returns: + Tuple of (kv_cache, updated_block_table) + """ + batch_size = len(k_contexts) + seq_lens = common_attn_metadata.seq_lens_cpu + query_lens = common_attn_metadata.query_start_loc_cpu[ + 1:] - common_attn_metadata.query_start_loc_cpu[:-1] + context_lens = common_attn_metadata.num_computed_tokens_cpu + block_table = common_attn_metadata.block_table_tensor + slot_mapping = common_attn_metadata.slot_mapping + + # Create KV cache + kv_cache = torch.empty(2, + num_blocks, + block_size, + num_kv_heads, + head_size, + dtype=dtype, + device=device) + kv_cache_flat = kv_cache.view(2, -1, num_kv_heads, head_size) + + # Populate the cache with the context tokens + # Start from block_id=1 since block_id=0 is considered the null block + start_block_idx = 1 + for i in range(batch_size): + k_context, v_context = k_contexts[i], v_contexts[i] + start = start_block_idx * block_size + end = start + k_context.shape[0] + kv_cache_flat[0, start:end, ...] = k_context + kv_cache_flat[1, start:end, ...] = v_context + + # Stay block aligned and allocate enough blocks for the new tokens + start_block_idx += cdiv(int(seq_lens[i]), block_size) + + blocks_end = start_block_idx + + # Permute the context blocks (excluding block 0 which is null) + if randomize_blocks: + perm = torch.randperm( + blocks_end - 1) + 1 # Random permutation starting from block 1 + else: + perm = torch.arange( + 1, blocks_end) # Sequential order starting from block 1 + + inv_perm = torch.zeros(blocks_end, dtype=torch.long, device=device) + inv_perm[1:] = torch.argsort( + perm) + 1 # Add 1 to account for starting from block 1 + kv_cache[:, 1:blocks_end, ...] = kv_cache[:, perm, ...] 
+ + # Construct the right block table + # Start from block_id=1 since block_id=0 is considered the null block + start_block_idx = 1 + for i in range(batch_size): + num_blocks_for_seq = cdiv(int(seq_lens[i]), block_size) + start = start_block_idx + end = start + num_blocks_for_seq + block_table[i, :num_blocks_for_seq] = inv_perm[start:end] + start_block_idx += num_blocks_for_seq + + # Create a realistic slot mapping that corresponds to the block table + for i in range(batch_size): + token_offsets = torch.arange(int(query_lens[i])) + int(context_lens[i]) + block_indices = token_offsets // block_size + token_inter_block_offsets = token_offsets % block_size + start = common_attn_metadata.query_start_loc_cpu[i] + end = common_attn_metadata.query_start_loc_cpu[i + 1] + slot_mapping[start:end] = block_table[ + i, + block_indices] * block_size + token_inter_block_offsets.to(device) + + return kv_cache + + +class MockAttentionLayer: + """A mock attention layer for testing.""" + + def __init__(self, device: torch.device): + self._q_scale = torch.tensor(1.0, device=device) + self._k_scale = torch.tensor(1.0, device=device) + self._v_scale = torch.tensor(1.0, device=device) + # Add float versions for flashinfer + self._k_scale_float = 1.0 + self._v_scale_float = 1.0 + + +def run_attention_backend(backend: _Backend, kv_cache_spec: FullAttentionSpec, + vllm_config, device: torch.device, + common_attn_metadata: CommonAttentionMetadata, + query: torch.Tensor, key: torch.Tensor, + value: torch.Tensor, + kv_cache: torch.Tensor) -> torch.Tensor: + """Run attention computation using the specified backend's AttentionImpl.""" + + builder_cls, impl_cls = get_attention_backend(backend) + + # Mock flashinfer's get_per_layer_parameters if needed + if backend == _Backend.FLASHINFER_VLLM_V1: + import unittest.mock + + from vllm.v1.attention.backends.flashinfer import PerLayerParameters + + def mock_get_per_layer_parameters(vllm_config): + # Return mock parameters for a single layer + head_size = vllm_config.model_config.get_head_size() + return { + "mock_layer": + PerLayerParameters( + window_left=-1, # No sliding window + logits_soft_cap=0.0, # No soft cap + sm_scale=1.0 / (head_size**0.5) # Standard scale + ) + } + + with unittest.mock.patch( + 'vllm.v1.attention.backends.flashinfer.get_per_layer_parameters', + mock_get_per_layer_parameters): + builder = builder_cls(kv_cache_spec, vllm_config, device) + attn_metadata = builder.build( + common_prefix_len=0, + common_attn_metadata=common_attn_metadata, + ) + else: + # Build metadata + builder = builder_cls(kv_cache_spec, vllm_config, device) + attn_metadata = builder.build( + common_prefix_len=0, + common_attn_metadata=common_attn_metadata, + ) + + # Instantiate implementation + num_heads = vllm_config.model_config.get_num_attention_heads( + vllm_config.parallel_config) + num_kv_heads = vllm_config.model_config.get_num_kv_heads( + vllm_config.parallel_config) + head_size = vllm_config.model_config.get_head_size() + scale = 1.0 / (head_size**0.5) + impl = impl_cls( + num_heads=num_heads, + head_size=head_size, + scale=scale, + num_kv_heads=num_kv_heads, + alibi_slopes=None, + sliding_window=None, + kv_cache_dtype="auto", + ) + + # Create mock layer and output buffer + mock_layer = MockAttentionLayer(device) + output = torch.empty_like(query) + + # Run forward pass + # NOTE: The query, key, and value are already shaped correctly + # in the calling test function. 
+ output = impl.forward(mock_layer, + query, + key, + value, + kv_cache, + attn_metadata, + output=output) + + return output + + +@pytest.mark.parametrize("batch_spec_name", [ + "small_decode", "small_prefill", "mixed_small", "medium_decode", + "medium_prefill", "mixed_medium" +]) +@pytest.mark.parametrize("model", ["meta-llama/Meta-Llama-3-8B"]) +def test_backend_correctness(batch_spec_name: str, model: str): + """ + Test that all backends produce similar outputs to a reference implementation + using torch.nn.functional.scaled_dot_product_attention. + + This test works by: + 1. Generating a batch of sequences with specified context and query lengths. + 2. Computing a ground-truth attention output using torch.sdpa on + contiguous Q, K, and V tensors. + 3. Simulating vLLM's paged KV cache: It takes the context portion of the + K/V tensors and manually places them into a paged buffer according to + the test's (randomly generated) block table. + 4. Running each vLLM attention backend with the new queries and the + simulated paged KV cache. + 5. Comparing the vLLM backend's output to the ground-truth SDPA output. + """ + batch_spec = BATCH_SPECS[batch_spec_name] + vllm_config = create_vllm_config(model_name=model) + device = torch.device("cuda:0") + + kv_cache_spec = create_standard_kv_cache_spec(vllm_config) + + # 1. Setup + batch_size = batch_spec.batch_size + seq_lens = batch_spec.seq_lens + query_lens = batch_spec.query_lens + num_q_heads = vllm_config.model_config.get_num_attention_heads( + vllm_config.parallel_config) + num_kv_heads = vllm_config.model_config.get_num_kv_heads( + vllm_config.parallel_config) + head_size = vllm_config.model_config.get_head_size() + dtype = _convert_dtype_to_torch(vllm_config.model_config.dtype) + block_size = vllm_config.cache_config.block_size + scale = 1.0 / (head_size**0.5) + + # 2. 
Generate data and compute SDPA reference output + all_q_vllm, all_k_vllm, all_v_vllm = [], [], [] + all_sdpa_outputs = [] + k_contexts, v_contexts = [], [] + + for i in range(batch_size): + s_len = seq_lens[i] + q_len = query_lens[i] + context_len = s_len - q_len + + # Generate Q, K, V for the whole sequence to be used in SDPA + q = torch.randn(q_len, + num_q_heads, + head_size, + dtype=dtype, + device=device) + k_full = torch.randn(s_len, + num_kv_heads, + head_size, + dtype=dtype, + device=device) + v_full = torch.randn(s_len, + num_kv_heads, + head_size, + dtype=dtype, + device=device) + + # SDPA expects (N, H, L, D), so unsqueeze batch and permute + q_sdpa_in = q.unsqueeze(0).transpose(1, 2) + k_sdpa_in = k_full.unsqueeze(0).transpose(1, 2) + v_sdpa_in = v_full.unsqueeze(0).transpose(1, 2) + + if num_q_heads != num_kv_heads: + assert num_q_heads % num_kv_heads == 0, ( + f"num_q_heads ({num_q_heads}) must be divisible by " + f"num_kv_heads ({num_kv_heads})") + repeats = num_q_heads // num_kv_heads + k_sdpa_in = k_sdpa_in.repeat_interleave(repeats, dim=1) + v_sdpa_in = v_sdpa_in.repeat_interleave(repeats, dim=1) + + # Create causal mask: query token i attends to positions 0 to + # (context_len + i) + kv_len = s_len + offset = context_len + attn_mask = torch.full((q_len, kv_len), + float('-inf'), + device=device, + dtype=dtype) + for i in range(q_len): + attn_mask[i, :offset + i + 1] = 0.0 + + sdpa_out_i = torch.nn.functional.scaled_dot_product_attention( + q_sdpa_in, + k_sdpa_in, + v_sdpa_in, + attn_mask=attn_mask, + scale=scale, + enable_gqa=True) + # Convert back to (L, H, D) + all_sdpa_outputs.append(sdpa_out_i.transpose(1, 2).squeeze(0)) + + # Inputs for vLLM backends are just the new tokens + all_q_vllm.append(q) + all_k_vllm.append(k_full[context_len:]) + all_v_vllm.append(v_full[context_len:]) + + # Contextual K/V data used to populate the paged cache + k_contexts.append(k_full[:context_len]) + v_contexts.append(v_full[:context_len]) + + query_vllm = torch.cat(all_q_vllm, dim=0) + key_vllm = torch.cat(all_k_vllm, dim=0) + value_vllm = torch.cat(all_v_vllm, dim=0) + sdpa_output = torch.cat(all_sdpa_outputs, dim=0) + + common_attn_metadata = create_common_attn_metadata( + batch_spec, vllm_config.cache_config.block_size, device) + + # 3. Simulate Paged KV Cache and a realistic slot_mapping + kv_cache = create_and_prepopulate_kv_cache( + k_contexts=k_contexts, + v_contexts=v_contexts, + block_size=block_size, + num_kv_heads=num_kv_heads, + head_size=head_size, + dtype=dtype, + device=device, + num_blocks=vllm_config.cache_config.num_gpu_blocks or 1000, + common_attn_metadata=common_attn_metadata, + randomize_blocks=True) + + # 4. 
Run vLLM backends and compare + # Note: flex_attention has known Triton kernel compatibility issues + # with test infrastructures + for backend_name in BACKENDS_TO_TEST: + # FlashAttentionm + FlexAttention: + # [2, num_blocks, block_size, num_kv_heads, head_size] + # FlashInfer: + # [num_blocks, 2, block_size, num_kv_heads, head_size] + # Select the appropriate KV cache format for each backend + kv_cache_for_backend = kv_cache + if backend_name == _Backend.FLASHINFER_VLLM_V1: + kv_cache_for_backend = kv_cache.transpose(0, 1) + + backend_output = run_attention_backend(backend_name, kv_cache_spec, + vllm_config, device, + common_attn_metadata, + query_vllm, key_vllm, + value_vllm, + kv_cache_for_backend) + + # Check shape and dtype consistency + assert backend_output.shape == sdpa_output.shape, ( + f"[{backend_name}] shape {backend_output.shape} != " + f"SDPA shape {sdpa_output.shape}") + assert backend_output.dtype == sdpa_output.dtype, ( + f"[{backend_name}] dtype {backend_output.dtype} != " + f"SDPA dtype {sdpa_output.dtype}") + + assert torch.isfinite(backend_output).all(), ( + f"[{backend_name}] produced non-finite values") + + # Check numerical similarity + rtol = 1e-2 + atol = 5e-3 + + if backend_name == _Backend.FLEX_ATTENTION: + atol = 5e-1 # TODO: figure out why flex_attention has such large + # numerical differences for medium_decode, medium_prefill, + # mixed_medium + + max_diff = torch.max(torch.abs(backend_output - sdpa_output)).item() + max_rel_diff = torch.max( + torch.abs(backend_output - sdpa_output) / + torch.abs(sdpa_output)).item() + all_close = torch.allclose(backend_output, + sdpa_output, + rtol=rtol, + atol=atol) + + if not all_close: + print(f"[{backend_name}] output differs from SDPA baseline. " + f"Max diff: {max_diff:.6f} (rel: {max_rel_diff:.6f})") + print(f"[{backend_name}] output: {backend_output}") + print(f"[{backend_name}] SDPA baseline: {sdpa_output}") + + assert all_close, ( + f"[{backend_name}] output differs from SDPA baseline. 
" + f"Max diff: {max_diff:.6f} (rel: {max_rel_diff:.6f})") diff --git a/tests/v1/attention/utils.py b/tests/v1/attention/utils.py new file mode 100644 index 00000000000..30cfbdda5d8 --- /dev/null +++ b/tests/v1/attention/utils.py @@ -0,0 +1,229 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Utility functions for attention-related v1 tests.""" + +from dataclasses import dataclass +from typing import Union + +import pytest +import torch + +from vllm.config import (CacheConfig, CompilationConfig, DeviceConfig, + LoadConfig, ModelConfig, ModelDType, ParallelConfig, + SchedulerConfig, VllmConfig) +from vllm.platforms import _Backend +from vllm.utils import resolve_obj_by_qualname +from vllm.v1.attention.backends.utils import CommonAttentionMetadata +from vllm.v1.kv_cache_interface import FullAttentionSpec + + +@dataclass +class BatchSpec: + """Specification for a batch configuration (workload shape only).""" + seq_lens: list[int] + query_lens: list[int] + + name: str = "unnamed" + + @property + def batch_size(self): + return len(self.seq_lens) + + def __post_init__(self): + assert len(self.seq_lens) == len(self.query_lens) + + def compute_num_tokens(self): + return sum(self.query_lens) + + +def create_common_attn_metadata( + batch_spec: BatchSpec, + block_size: int, + device: torch.device, + max_block_idx: int = 1000) -> CommonAttentionMetadata: + """Create CommonAttentionMetadata from a BatchSpec and ModelParams.""" + # Create query start locations + query_start_loc = torch.zeros(batch_spec.batch_size + 1, + dtype=torch.int32, + device=device) + query_start_loc[1:] = torch.tensor(batch_spec.query_lens, + dtype=torch.int32, + device=device).cumsum(0) + query_start_loc_cpu = query_start_loc.cpu() + num_tokens = batch_spec.compute_num_tokens() + + # Create sequence lengths + seq_lens = torch.tensor(batch_spec.seq_lens, + dtype=torch.int32, + device=device) + seq_lens_cpu = seq_lens.cpu() + + # Create computed tokens (context length for each sequence) + context_lens = [ + batch_spec.seq_lens[i] - batch_spec.query_lens[i] + for i in range(batch_spec.batch_size) + ] + num_computed_tokens_cpu = torch.tensor(context_lens, dtype=torch.int32) + + # Create block table (random for testing) + max_blocks = max(batch_spec.seq_lens) // block_size + 1 + block_table_tensor = torch.randint(0, + max_block_idx, + (batch_spec.batch_size, max_blocks), + dtype=torch.int32, + device=device) + + # Create slot mapping + slot_mapping = torch.randint(0, + max_block_idx, (num_tokens, ), + dtype=torch.int64, + device=device) + + # Calculate max query length + max_query_len = max(batch_spec.query_lens) + + return CommonAttentionMetadata( + query_start_loc=query_start_loc, + query_start_loc_cpu=query_start_loc_cpu, + seq_lens=seq_lens, + seq_lens_cpu=seq_lens_cpu, + num_computed_tokens_cpu=num_computed_tokens_cpu, + num_reqs=batch_spec.batch_size, + num_actual_tokens=num_tokens, + max_query_len=max_query_len, + block_table_tensor=block_table_tensor, + slot_mapping=slot_mapping, + ) + + +def get_attention_backend(backend_name: _Backend): + """Set up attention backend classes for testing. + + Args: + backend_name: Name of the backend ("flash_attn", "flashinfer", etc.) 
+ vllm_config: VllmConfig instance + + Returns: + Tuple of (backend_builder_class, backend_impl_class) + """ + backend_map = { + _Backend.FLASH_ATTN_VLLM_V1: + "vllm.v1.attention.backends.flash_attn.FlashAttentionBackend", + _Backend.FLASHINFER_VLLM_V1: + "vllm.v1.attention.backends.flashinfer.FlashInferBackend", + _Backend.FLEX_ATTENTION: + "vllm.v1.attention.backends.flex_attention.FlexAttentionBackend", + _Backend.TRITON_ATTN_VLLM_V1: + "vllm.v1.attention.backends.triton_attn.TritonAttentionBackend", + } + + if backend_name not in backend_map: + raise ValueError(f"Unknown backend: {backend_name}") + + backend_class_name = backend_map[backend_name] + + try: + backend_class = resolve_obj_by_qualname(backend_class_name) + return backend_class.get_builder_cls(), backend_class.get_impl_cls() + except ImportError as e: + pytest.skip(f"{backend_name} not available: {e}") + + +def create_standard_kv_cache_spec( + vllm_config: VllmConfig) -> FullAttentionSpec: + """Create a FullAttentionSpec from ModelParams only.""" + return FullAttentionSpec( + block_size=vllm_config.cache_config.block_size, + num_kv_heads=vllm_config.model_config.get_num_kv_heads( + vllm_config.parallel_config), + head_size=vllm_config.model_config.get_head_size(), + dtype=vllm_config.model_config.dtype, + use_mla=vllm_config.model_config.use_mla, + sliding_window=vllm_config.model_config.get_sliding_window(), + ) + + +def create_vllm_config(model_name: str = "meta-llama/Meta-Llama-3-8B", + tensor_parallel_size: int = 1, + max_model_len: int = 1024, + dtype: Union[ModelDType, torch.dtype] = "auto", + block_size: int = 16, + max_num_seqs: int = 256, + max_num_batched_tokens: int = 8192, + add_mock_model_methods: bool = True) -> VllmConfig: + """Create a VllmConfig for testing with reasonable defaults.""" + + model_config = ModelConfig( + model=model_name, + tokenizer=model_name, + trust_remote_code=False, + dtype=dtype, + seed=0, + max_model_len=max_model_len, + ) + + cache_config = CacheConfig( + block_size=block_size, + cache_dtype="auto", + swap_space=0, + ) + # Set cache blocks for testing + # (these may be set during initialization normally) + cache_config.num_gpu_blocks = 1000 + cache_config.num_cpu_blocks = 0 + + parallel_config = ParallelConfig( + tensor_parallel_size=tensor_parallel_size, ) + + scheduler_config = SchedulerConfig( + max_num_seqs=max_num_seqs, + max_num_batched_tokens=max_num_batched_tokens, + ) + + device_config = DeviceConfig() + load_config = LoadConfig() + compilation_config = CompilationConfig() + + if add_mock_model_methods: + # Add mock methods to satisfy backends that need them + # This is a workaround because tests don't build full, real models, + # but some backends expect to query the model for layer-specific + # parameters + import types + model_config.get_num_layers = types.MethodType(lambda self: 1, + model_config) + model_config.get_sliding_window_for_layer = types.MethodType( + lambda self, i: None, model_config) + model_config.get_logits_soft_cap_for_layer = types.MethodType( + lambda self, i: 0.0, model_config) + model_config.get_sm_scale_for_layer = types.MethodType( + lambda self, i: 1.0 / model_config.get_head_size()**0.5, + model_config) + + return VllmConfig( + model_config=model_config, + cache_config=cache_config, + parallel_config=parallel_config, + scheduler_config=scheduler_config, + device_config=device_config, + load_config=load_config, + compilation_config=compilation_config, + ) + + +def create_dummy_kv_cache(block_size: int, + num_kv_heads: int, + head_size: int, + 
dtype: torch.dtype, + device: torch.device, + num_blocks: int = 100) -> torch.Tensor: + """Create a dummy KV cache tensor for testing.""" + kv_cache = torch.randn( + num_blocks, + 2, # K and V + block_size, + num_kv_heads, + head_size, + dtype=dtype, + device=device) + return kv_cache diff --git a/tests/v1/spec_decode/test_eagle.py b/tests/v1/spec_decode/test_eagle.py index 5efab2c1440..5c74a286c4a 100644 --- a/tests/v1/spec_decode/test_eagle.py +++ b/tests/v1/spec_decode/test_eagle.py @@ -6,6 +6,10 @@ import pytest import torch +from tests.v1.attention.utils import (BatchSpec, _Backend, + create_common_attn_metadata, + create_standard_kv_cache_spec, + get_attention_backend) from vllm.config import (CacheConfig, DeviceConfig, LoadConfig, ModelConfig, ParallelConfig, SchedulerConfig, SpeculativeConfig, VllmConfig) @@ -64,13 +68,19 @@ def test_prepare_inputs(): """ device = torch.device(current_platform.device_type) - # a = 4, b = 7, c = 5 + # q1 = 4, q2 = 7, q3 = 5 # n1 = 1, n2 = 3, n3 = 2 - # Cumulative lengths: [0, 4, 11, 16] - cu_target_query_lens = torch.tensor([0, 4, 11, 16], - dtype=torch.int32, - device=device) + batch_spec = BatchSpec( + seq_lens=[4, 7, 5], + query_lens=[4, 7, 5], + ) + + common_attn_metadata = create_common_attn_metadata( + batch_spec, + block_size=16, + device=device, + ) # Rejected tokens per request: [1, 3, 2] num_rejected_tokens = torch.tensor([1, 3, 2], @@ -104,15 +114,13 @@ def test_prepare_inputs(): ], dtype=torch.int32, device=device) + proposer = _create_proposer("eagle", 1) - # n1 + n2 + n3 - a - b -c - num_tokens = cu_target_query_lens[-1].item() - num_rejected_tokens.sum( - ).item() + updated_metadata, token_indices = proposer.prepare_inputs( + common_attn_metadata, num_rejected_tokens.cpu()) - cu_num_tokens, token_indices = EagleProposer.prepare_inputs( - cu_target_query_lens, num_rejected_tokens, num_tokens) - - assert torch.equal(cu_num_tokens, expected_cu_num_tokens) + assert torch.equal(updated_metadata.query_start_loc, + expected_cu_num_tokens) assert token_indices.shape[0] == expected_cu_num_tokens[-1].item() assert torch.equal(token_indices, expected_token_indices) @@ -209,6 +217,7 @@ def test_propose(num_speculative_tokens): seq_len_2 = 3 total_tokens = seq_len_1 + seq_len_2 vocab_size = 100 + seq_lens = [seq_len_1, seq_len_2] # Create proposer first so we can use its actual hidden_size proposer = _create_proposer("eagle", num_speculative_tokens) @@ -270,9 +279,16 @@ def create_deterministic_logits(token_ids): proposer.attn_layer_names = ["layer.0"] # Create input tensors - cu_num_tokens = torch.tensor([0, seq_len_1, total_tokens], - dtype=torch.int32, - device=device) + batch_spec = BatchSpec( + seq_lens=seq_lens, + query_lens=seq_lens, + ) + + common_attn_metadata = create_common_attn_metadata( + batch_spec, + block_size=16, + device=device, + ) target_token_ids = torch.randint(0, vocab_size, (total_tokens, ), @@ -284,25 +300,29 @@ def create_deterministic_logits(token_ids): target_hidden_states = torch.randn(total_tokens, hidden_size, device=device) - target_slot_mapping = torch.randint(0, - 100, (total_tokens, ), - device=device) next_token_ids = torch.randint(0, vocab_size, (batch_size, ), dtype=torch.int32, device=device) - block_table = torch.randint(0, 10, (batch_size, 10), device=device) - sampling_metadata = mock.MagicMock() - # Call the method under test + attn_metadata_builder_cls, _ = get_attention_backend( + _Backend.FLASH_ATTN_VLLM_V1) + attn_metadata_builder = attn_metadata_builder_cls( + 
kv_cache_spec=create_standard_kv_cache_spec(proposer.vllm_config), + vllm_config=proposer.vllm_config, + device=device, + ) + + # Mock runner for attention metadata building + proposer.runner = mock.MagicMock() + proposer.runner.attn_metadata_builders = [attn_metadata_builder] + result = proposer.propose(target_token_ids=target_token_ids, target_positions=target_positions, target_hidden_states=target_hidden_states, - target_slot_mapping=target_slot_mapping, next_token_ids=next_token_ids, - cu_num_tokens=cu_num_tokens, - block_table=block_table, + common_attn_metadata=common_attn_metadata, sampling_metadata=sampling_metadata) assert result.shape == (batch_size, num_speculative_tokens) diff --git a/vllm/v1/attention/backends/cpu_attn.py b/vllm/v1/attention/backends/cpu_attn.py index f1c6bdfc1c9..d63b82012a5 100644 --- a/vllm/v1/attention/backends/cpu_attn.py +++ b/vllm/v1/attention/backends/cpu_attn.py @@ -12,13 +12,12 @@ AttentionMetadata, AttentionType, is_quantized_kv_cache) from vllm.attention.backends.utils import CommonAttentionState +from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, CommonAttentionMetadata) from vllm.v1.core.sched.output import SchedulerOutput from vllm.v1.kv_cache_interface import AttentionSpec -from vllm.v1.worker.block_table import BlockTable -from vllm.v1.worker.cpu_model_runner import CPUModelRunner from vllm.v1.worker.gpu_input_batch import InputBatch try: @@ -316,19 +315,21 @@ def get_seq_len_block_table_args( class TorchSDPAMetadataBuilderV1(AttentionMetadataBuilder[TorchSDPAMetadata]): - def __init__(self, runner: CPUModelRunner, kv_cache_spec: AttentionSpec, - block_table: BlockTable) -> None: - self.runner = runner - self.block_table = block_table + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device) -> None: + self.kv_cache_spec = kv_cache_spec + self.vllm_config = vllm_config + self.scheduler_config = vllm_config.scheduler_config + # For reorder - self.reorder_prompt_req_index_list = np.empty(self.runner.max_num_reqs, - dtype=np.int64) - self.reorder_decode_req_index_list = np.empty(self.runner.max_num_reqs, - dtype=np.int64) + self.reorder_prompt_req_index_list = np.empty( + vllm_config.scheduler_config.max_num_seqs, dtype=np.int64) + self.reorder_decode_req_index_list = np.empty( + vllm_config.scheduler_config.max_num_seqs, dtype=np.int64) self.num_prompt_req: int = 0 self.seq_start_loc_cpu = torch.zeros( - runner.max_num_reqs + 1, + vllm_config.scheduler_config.max_num_seqs + 1, dtype=torch.int32, device="cpu", ) @@ -378,15 +379,15 @@ def reorder_batch(self, input_batch: InputBatch, return True - def build(self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata): + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> TorchSDPAMetadata: num_reqs = common_attn_metadata.num_reqs - num_actual_tokens = common_attn_metadata.num_actual_tokens max_query_len = common_attn_metadata.max_query_len - runner = self.runner - block_table = self.block_table - seq_lens_np = runner.seq_lens_np[:num_reqs] + seq_lens_cpu = common_attn_metadata.seq_lens_cpu + seq_lens_np = seq_lens_cpu.numpy() num_prompt_req = self.num_prompt_req max_prefill_seq_len = seq_lens_np[:num_prompt_req].max().item( ) if num_prompt_req > 0 else 0 @@ -394,34 +395,36 @@ def build(self, common_prefix_len: int, ) if num_prompt_req < num_reqs else 0 self.seq_start_loc_np[0] = 0 
np.cumsum(seq_lens_np, out=self.seq_start_loc_np[1:num_reqs + 1]) - num_prefill_tokens = runner.query_start_loc_np[num_prompt_req].item() - num_decode_tokens = runner.query_start_loc_np[num_reqs].item( - ) - num_prefill_tokens - slot_mapping = block_table.slot_mapping_cpu[:num_actual_tokens].long() - block_table_tensor = block_table.get_device_tensor() + + query_start_loc_cpu = common_attn_metadata.query_start_loc_cpu + num_prefill_tokens = int(query_start_loc_cpu[num_prompt_req].item()) + num_decode_tokens = int(query_start_loc_cpu[num_reqs].item() - + num_prefill_tokens) + + slot_mapping = common_attn_metadata.slot_mapping.long() + block_table_tensor = common_attn_metadata.block_table_tensor + attn_metadata = TorchSDPAMetadata( num_prefills=num_prompt_req, num_prefill_tokens=num_prefill_tokens, num_decode_tokens=num_decode_tokens, slot_mapping=slot_mapping, # to ensure inference when chunked_prefill is disabled - seq_lens=runner.seq_lens_cpu[:num_reqs].tolist(), - seq_lens_tensor=runner. - seq_lens_cpu[num_prompt_req:num_reqs], # decode + seq_lens=seq_lens_cpu.tolist(), + seq_lens_tensor=seq_lens_cpu[num_prompt_req:num_reqs], # decode max_decode_seq_len=max_decode_seq_len, # decode block_tables=block_table_tensor[num_prompt_req:num_reqs], # decode - chunked_prefill=self.runner.scheduler_config. - chunked_prefill_enabled, + chunked_prefill=self.scheduler_config.chunked_prefill_enabled, max_query_len=max_query_len, max_kv_len=max_prefill_seq_len, - prefill_query_start_loc=runner. - query_start_loc_cpu[:num_prompt_req + 1], # prefill + prefill_query_start_loc=query_start_loc_cpu[:num_prompt_req + + 1], # prefill kv_start_loc=self.seq_start_loc_cpu[:num_prompt_req + 1], # prefill prefill_block_tables=block_table_tensor[: num_prompt_req], # prefill - query_start_loc=runner.query_start_loc_cpu[:num_reqs + - 1], # for logits index + query_start_loc=query_start_loc_cpu[:num_reqs + + 1], # for logits index multi_modal_placeholder_index_maps=None, enable_kv_scales_calculation=False, ) diff --git a/vllm/v1/attention/backends/flash_attn.py b/vllm/v1/attention/backends/flash_attn.py index 552c2caf2fa..4224d807c2b 100755 --- a/vllm/v1/attention/backends/flash_attn.py +++ b/vllm/v1/attention/backends/flash_attn.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer with FlashAttention.""" from dataclasses import dataclass -from typing import TYPE_CHECKING, Any, ClassVar, Optional +from typing import Any, ClassVar, Optional import numpy as np import torch @@ -29,10 +29,6 @@ AttentionMetadataBuilder, CommonAttentionMetadata, get_kv_cache_layout, make_local_attention_virtual_batches) from vllm.v1.kv_cache_interface import AttentionSpec -from vllm.v1.worker.block_table import BlockTable - -if TYPE_CHECKING: - from vllm.v1.worker.gpu_model_runner import GPUModelRunner logger = init_logger(__name__) @@ -162,29 +158,30 @@ class FlashAttentionMetadataBuilder( AttentionMetadataBuilder[FlashAttentionMetadata]): full_cudagraph_supported: ClassVar[bool] = get_flash_attn_version() == 3 - def __init__(self, runner: "GPUModelRunner", kv_cache_spec: AttentionSpec, - block_table: BlockTable): - model_config = runner.model_config - compilation_config = runner.vllm_config.compilation_config - - self.runner = runner - self.num_heads_q = model_config.get_num_attention_heads( - runner.parallel_config) - self.num_heads_kv = model_config.get_num_kv_heads( - runner.parallel_config) - self.headdim = model_config.get_head_size() + def __init__(self, kv_cache_spec: 
AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + self.vllm_config = vllm_config + self.model_config = vllm_config.model_config + self.parallel_config = vllm_config.parallel_config + self.cache_config = vllm_config.cache_config + self.compilation_config = vllm_config.compilation_config + self.device = device + + self.num_heads_q = self.model_config.get_num_attention_heads( + self.parallel_config) + self.num_heads_kv = self.model_config.get_num_kv_heads( + self.parallel_config) + self.headdim = self.model_config.get_head_size() self.block_size = kv_cache_spec.block_size - self.kv_cache_spec = kv_cache_spec - self.block_table = block_table self.max_num_splits = 0 # No upper bound on the number of splits. self.aot_schedule = (get_flash_attn_version() == 3) - self.use_full_cuda_graph = compilation_config.full_cuda_graph + self.use_full_cuda_graph = self.compilation_config.full_cuda_graph if self.use_full_cuda_graph: if not self.aot_schedule: raise ValueError( "AoT scheduling is required for full cuda graph.") - capture_sizes = compilation_config.cudagraph_capture_sizes + capture_sizes = self.compilation_config.cudagraph_capture_sizes if not capture_sizes: raise ValueError( "cudagraph_capture_sizes should not be None when " @@ -198,9 +195,9 @@ def __init__(self, runner: "GPUModelRunner", kv_cache_spec: AttentionSpec, "full cuda graph.") self.scheduler_metadata = torch.zeros( - self.runner.max_num_reqs + 1, + vllm_config.scheduler_config.max_num_seqs + 1, dtype=torch.int32, - device=self.runner.device, + device=self.device, ) # When using cuda graph, we need to set the upper bound of the # number of splits so that large enough intermediate buffers are @@ -211,28 +208,27 @@ def __init__(self, runner: "GPUModelRunner", kv_cache_spec: AttentionSpec, # populated on first build() call. self.aot_sliding_window: Optional[tuple[int, int]] = None - def build( - self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata - ) -> FlashAttentionMetadata: + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> FlashAttentionMetadata: + """ + fast_build disables AOT scheduling, used when there will be few + iterations i.e. spec-decode + """ num_reqs = common_attn_metadata.num_reqs num_actual_tokens = common_attn_metadata.num_actual_tokens max_query_len = common_attn_metadata.max_query_len - - max_seq_len = int(self.runner.seq_lens_np[:num_reqs].max()) + max_seq_len = int(common_attn_metadata.seq_lens_cpu.max()) query_start_loc = common_attn_metadata.query_start_loc + query_start_loc_cpu = common_attn_metadata.query_start_loc_cpu seq_lens = common_attn_metadata.seq_lens - block_table = self.block_table - block_table_tensor = block_table.get_device_tensor()[:num_reqs] - - block_table.slot_mapping[:num_actual_tokens].copy_( - block_table.slot_mapping_cpu[:num_actual_tokens], - non_blocking=True) - # Fill unused with -1. Needed for reshape_and_cache in full cuda graph - # mode. - block_table.slot_mapping[num_actual_tokens:].fill_(-1) + seq_lens_cpu = common_attn_metadata.seq_lens_cpu + block_table_tensor = common_attn_metadata.block_table_tensor + slot_mapping = common_attn_metadata.slot_mapping - slot_mapping = block_table.slot_mapping[:num_actual_tokens] + # the overhead of the aot schedule is not worth it for spec-decode + aot_schedule = self.aot_schedule and not fast_build if self.aot_sliding_window is None: self.aot_sliding_window = (-1, -1) @@ -240,19 +236,20 @@ def build( # constant for all layers to. 
We have to populate this on the first # build() call so the layers are constructed (cannot populate) # in __init__. - if self.aot_schedule: + if aot_schedule: sliding_window_configs = _get_sliding_window_configs( - self.runner.vllm_config) + self.vllm_config) if len(sliding_window_configs) == 1: sliding_window_config = sliding_window_configs.pop() if sliding_window_config is not None: self.aot_sliding_window = sliding_window_config elif len(sliding_window_configs) > 1: self.aot_schedule = False + aot_schedule = False def schedule(batch_size, cu_query_lens, max_query_len, seqlens, max_seq_len, causal): - if self.aot_schedule: + if aot_schedule: return get_scheduler_metadata( batch_size=batch_size, max_seqlen_q=max_query_len, @@ -271,19 +268,19 @@ def schedule(batch_size, cu_query_lens, max_query_len, seqlens, # for local attention local_attn_metadata = None - if self.runner.attention_chunk_size is not None: + if self.model_config.attention_chunk_size is not None: seqlens_q_local_np, virt_q_cu_seqlens_np, virt_k_seqlens_np, \ virt_block_table_tensor = make_local_attention_virtual_batches( - self.runner.attention_chunk_size, - self.runner.query_start_loc_np[:num_reqs + 1], - self.runner.seq_lens_np[:num_reqs], + self.model_config.attention_chunk_size, + query_start_loc_cpu.numpy(), + seq_lens_cpu.numpy(), block_table_tensor, self.block_size, ) local_query_start_loc = torch.from_numpy(virt_q_cu_seqlens_np).to( - self.runner.device, non_blocking=True) + self.device, non_blocking=True) local_seqused_k = torch.from_numpy(virt_k_seqlens_np).to( - self.runner.device, non_blocking=True) + self.device, non_blocking=True) local_max_query_len = seqlens_q_local_np.max() local_max_seq_len = virt_k_seqlens_np.max() local_scheduler_metadata = schedule( @@ -308,14 +305,12 @@ def schedule(batch_size, cu_query_lens, max_query_len, seqlens, if use_cascade: cu_prefix_query_lens = torch.tensor([0, num_actual_tokens], dtype=torch.int32, - device=self.runner.device) + device=self.device) prefix_kv_lens = torch.tensor([common_prefix_len], dtype=torch.int32, - device=self.runner.device) - suffix_kv_lens = (self.runner.seq_lens_np[:num_reqs] - - common_prefix_len) - suffix_kv_lens = torch.from_numpy(suffix_kv_lens).to( - self.runner.device) + device=self.device) + suffix_kv_lens = (seq_lens_cpu[:num_reqs] - common_prefix_len).to( + self.device, non_blocking=True) prefix_scheduler_metadata = schedule( batch_size=1, cu_query_lens=cu_prefix_query_lens, diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index f922e6e4c9e..1eb27d57acf 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -15,22 +15,20 @@ import vllm.envs as envs from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl, AttentionType) +from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.platforms import current_platform from vllm.v1.attention.backends.flash_attn import use_cascade_attention -from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, - CommonAttentionMetadata, - PerLayerParameters, - get_kv_cache_layout, - get_per_layer_parameters, - infer_global_hyperparameters) +from vllm.v1.attention.backends.utils import ( + AttentionMetadataBuilder, CommonAttentionMetadata, PerLayerParameters, + get_kv_cache_layout, get_per_layer_parameters, + infer_global_hyperparameters, reorder_batch_to_split_decodes_and_prefills, + split_decodes_and_prefills) from vllm.v1.kv_cache_interface import 
AttentionSpec -from vllm.v1.worker.block_table import BlockTable if TYPE_CHECKING: from vllm.v1.core.sched.output import SchedulerOutput from vllm.v1.worker.gpu_input_batch import InputBatch - from vllm.v1.worker.gpu_model_runner import GPUModelRunner FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 * 1024 * 1024 @@ -226,9 +224,9 @@ def __post_init__(self): class FlashInferMetadataBuilder(AttentionMetadataBuilder[FlashInferMetadata]): - def __init__(self, runner: GPUModelRunner, kv_cache_spec: AttentionSpec, - block_table: BlockTable): - self.runner = runner + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + self.device = device self._workspace_buffer = None self._prefill_wrapper = None # Wrapper for prefill/append self._decode_wrapper = None # Wrapper for decode @@ -237,75 +235,22 @@ def __init__(self, runner: GPUModelRunner, kv_cache_spec: AttentionSpec, # Global hyperparameters shared by all attention layers self.global_hyperparameters: Optional[PerLayerParameters] = None - self.vllm_config = runner.vllm_config + self.vllm_config = vllm_config + self.cache_config = vllm_config.cache_config self.kv_cache_spec = kv_cache_spec - self.block_table = block_table def reorder_batch(self, input_batch: InputBatch, scheduler_output: SchedulerOutput) -> bool: - # We now want to reorder the batch so that the "decode" requests are and - # the front and the "prefill" requests are at the using the least amount - # swaps possible. (NOTE for now we loosely use "decode" to mean requests - # where attention is likely memory-bound and "prefill" to mean requests - # where attention is likely compute-bound, TODO(lucas): figure out a - # better naming here) - decodes = [] - prefills = [] - num_decode_tokens = 0 - num_prefill_tokens = 0 - - for i, req_id in enumerate(input_batch.req_ids): - num_tokens = scheduler_output.num_scheduled_tokens[req_id] - # for now treat 1 scheduled token as "decode" even if its not, - # we should update this to something like < 8 in the future but - # currently the decode run only supports num_tokens = 1 - if num_tokens == 1: - decodes.append(i) - num_decode_tokens += num_tokens - else: - prefills.append(i) - num_prefill_tokens += num_tokens - - # We hope that this is fairly minimal since decodes - # should be around for a number of iterations so hopefully they are - # relatively stationary (and new request are generally appended to the - # persistent batch so already should be at the back) - # To achieve this we loop over the decodes in descending order and - # the prefills in ascending order. We swap decodes from the "back" - # i.e. past where the last decode should be in the reodorered with - # prefills from the front of the batch. 
- # `decodes` and `prefills` are already in ascending order just based on - # the above loop - num_decodes = len(decodes) - num_prefills = len(prefills) - modified_batch = False - - for i in range(1, min(num_decodes, num_prefills) + 1): - # If the decode is at the "back" of the batch, i, we can swap it - # with the prefill closest to the front of the batch - decode_idx = decodes[num_decodes - i] - if decode_idx < num_decodes: - break - - input_batch.swap_states(prefills[i - 1], decode_idx) - modified_batch = True - - # Save for next `build` call - # TODO(lucas): this is a bit of a hack, we should probably have a - # better way of doing this - self._num_decodes = num_decodes - self._num_prefills = num_prefills - self._num_decode_tokens = num_decode_tokens - self._num_prefill_tokens = num_prefill_tokens - - return modified_batch + return reorder_batch_to_split_decodes_and_prefills(input_batch, + scheduler_output, + decode_threshold=1) def _get_workspace_buffer(self): if self._workspace_buffer is None: self._workspace_buffer = torch.empty( FLASHINFER_WORKSPACE_BUFFER_SIZE, dtype=torch.uint8, - device=self.runner.device) + device=self.device) return self._workspace_buffer def _get_prefill_wrapper(self): @@ -316,10 +261,11 @@ def _get_prefill_wrapper(self): def _get_decode_wrapper(self): if self._decode_wrapper is None: - num_qo_heads = (self.runner.model_config.get_num_attention_heads( - self.runner.parallel_config)) - num_kv_heads = self.runner.model_config.get_num_kv_heads( - self.runner.parallel_config) + num_qo_heads = ( + self.vllm_config.model_config.get_num_attention_heads( + self.vllm_config.parallel_config)) + num_kv_heads = self.vllm_config.model_config.get_num_kv_heads( + self.vllm_config.parallel_config) use_tensor_cores = envs.VLLM_FLASHINFER_FORCE_TENSOR_CORES or ( num_qo_heads // num_kv_heads > 4) self._decode_wrapper = BatchDecodeWithPagedKVCacheWrapper( @@ -334,7 +280,8 @@ def _get_cascade_wrapper(self): 2, self._get_workspace_buffer(), get_kv_cache_layout()) return self._cascade_wrapper - def _plan(self, attn_metadata: FlashInferMetadata): + def _plan(self, num_prefills: int, num_decodes: int, + attn_metadata: FlashInferMetadata): if self.global_hyperparameters is None: self.global_hyperparameters = infer_global_hyperparameters( get_per_layer_parameters(self.vllm_config, FlashInferImpl)) @@ -369,16 +316,16 @@ def _plan(self, attn_metadata: FlashInferMetadata): # Regular attention (common case). # Decodes are at the front and prefills are at the back, # according to reorder_batch() - if self._num_prefills > 0: + if num_prefills > 0: # Decodes are first so prefills start after the last decode - prefill_start = self._num_decodes + prefill_start = num_decodes attn_metadata.prefill_wrapper = self._get_prefill_wrapper() assert attn_metadata.qo_indptr[prefill_start:].shape[ - 0] == self._num_prefills + 1 + 0] == num_prefills + 1 assert attn_metadata.paged_kv_indptr[prefill_start:].shape[ - 0] == self._num_prefills + 1 + 0] == num_prefills + 1 assert attn_metadata.paged_kv_last_page_len[ - prefill_start:].shape[0] == self._num_prefills + prefill_start:].shape[0] == num_prefills # Since prefill_wrapper.run() will be called with # query[num_decode_tokens:] we need to adjust the qo_indptr # to be relative to the start of the prefill queries. 
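The decode-first invariant that the hunk above depends on is worth spelling out: after `reorder_batch()`, decode requests occupy indices `[0, num_decodes)` and prefills follow, so prefill metadata is taken from index `num_decodes` onward and the query-start offsets are rebased so that the first prefill token sits at offset 0 of `query[num_decode_tokens:]`. A minimal sketch of that slicing, with illustrative tensor values rather than anything taken from the patch:

```python
import torch

# Hypothetical reordered batch: 3 decodes (1 token each) followed by
# 2 prefills (4 and 5 tokens). qo_indptr holds cumulative query starts.
qo_indptr = torch.tensor([0, 1, 2, 3, 7, 12], dtype=torch.int32)
num_decodes, num_prefills = 3, 2

prefill_start = num_decodes  # prefills begin right after the decodes
prefill_qo_indptr = qo_indptr[prefill_start:] - qo_indptr[prefill_start]
# prefill_qo_indptr -> tensor([0, 4, 9]): num_prefills + 1 offsets,
# now relative to the first prefill token, as the prefill wrapper expects.
```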
@@ -402,17 +349,16 @@ def _plan(self, attn_metadata: FlashInferMetadata): kv_data_type=attn_metadata.kv_data_type, ) - if self._num_decodes > 0: + if num_decodes > 0: attn_metadata.decode_wrapper = self._get_decode_wrapper() if not FlashInferBackend.use_trtllm_decode_attention( - self._num_decodes, attn_metadata.max_seq_len, + num_decodes, attn_metadata.max_seq_len, attn_metadata.kv_data_type, attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, attn_metadata.head_dim): attn_metadata.decode_wrapper.plan( - attn_metadata.paged_kv_indptr[:self._num_decodes + 1], + attn_metadata.paged_kv_indptr[:num_decodes + 1], attn_metadata.paged_kv_indices, - attn_metadata.paged_kv_last_page_len[:self. - _num_decodes], + attn_metadata.paged_kv_last_page_len[:num_decodes], attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, attn_metadata.head_dim, @@ -427,22 +373,20 @@ def _plan(self, attn_metadata: FlashInferMetadata): kv_data_type=attn_metadata.kv_data_type, ) - def build(self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata): - num_reqs = common_attn_metadata.num_reqs + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> FlashInferMetadata: num_actual_tokens = common_attn_metadata.num_actual_tokens + num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens =\ + split_decodes_and_prefills(common_attn_metadata) - assert self._num_decodes + self._num_prefills == num_reqs - assert (self._num_decode_tokens + - self._num_prefill_tokens == num_actual_tokens) page_size = self.kv_cache_spec.block_size - device = self.runner.device + device = self.device qo_indptr = common_attn_metadata.query_start_loc - max_seq_len = int(self.runner.seq_lens_np[:num_reqs].max()) + max_seq_len = common_attn_metadata.seq_lens_cpu.max() seq_lens = common_attn_metadata.seq_lens - block_table_tensor = self.block_table.get_device_tensor()[:num_reqs] - slot_mapping = self.block_table.slot_mapping_cpu[:num_actual_tokens].to( - self.runner.device, non_blocking=True).long() + block_table_tensor = common_attn_metadata.block_table_tensor block_table_bounds = (seq_lens + page_size - 1) // page_size @@ -487,7 +431,7 @@ def build(self, common_prefix_len: int, paged_kv_last_page_len = seq_lens % page_size paged_kv_last_page_len = torch.where(paged_kv_last_page_len == 0, page_size, paged_kv_last_page_len) - cache_dtype = self.runner.cache_config.cache_dtype + cache_dtype = self.cache_config.cache_dtype if cache_dtype.startswith("fp8"): kv_cache_dtype = FlashInferBackend.get_fp8_dtype_for_flashinfer( cache_dtype) @@ -499,17 +443,18 @@ def build(self, common_prefix_len: int, paged_kv_indptr=paged_kv_indptr, paged_kv_indices=paged_kv_indices, paged_kv_last_page_len=paged_kv_last_page_len, - num_qo_heads=self.runner.num_query_heads, + num_qo_heads=self.vllm_config.model_config.get_num_attention_heads( + self.vllm_config.parallel_config), num_kv_heads=self.kv_cache_spec.num_kv_heads, head_dim=self.kv_cache_spec.head_size, page_size=page_size, kv_data_type=kv_cache_dtype, - q_data_type=self.runner.dtype, - slot_mapping=slot_mapping, - num_decodes=self._num_decodes, - num_decode_tokens=self._num_decode_tokens, - num_prefills=self._num_prefills, - num_prefill_tokens=self._num_prefill_tokens, + q_data_type=self.vllm_config.model_config.dtype, + slot_mapping=common_attn_metadata.slot_mapping, + num_decodes=num_decodes, + num_decode_tokens=num_decode_tokens, + num_prefills=num_prefills, + num_prefill_tokens=num_prefill_tokens, 
use_cascade=use_cascade, shared_qo_indptr=shared_qo_indptr, shared_kv_page_indptr=shared_kv_page_indptr, @@ -521,12 +466,12 @@ def build(self, common_prefix_len: int, workspace_buffer=self._workspace_buffer, ) - self._plan(attn_metadata) + self._plan(num_prefills, num_decodes, attn_metadata) return attn_metadata def use_cascade_attention(self, *args, **kwargs) -> bool: - if self.kv_cache_spec.dtype != self.runner.model_config.dtype: + if self.kv_cache_spec.dtype != self.vllm_config.model_config.dtype: # TODO: The cascade wrapper currently does not support setting # kv cache dtype to something different from query dtype. return False diff --git a/vllm/v1/attention/backends/flex_attention.py b/vllm/v1/attention/backends/flex_attention.py index f0f54c28831..c229ec12fd1 100644 --- a/vllm/v1/attention/backends/flex_attention.py +++ b/vllm/v1/attention/backends/flex_attention.py @@ -3,7 +3,7 @@ """Attention layer with FlashAttention.""" from collections import defaultdict from dataclasses import dataclass -from typing import TYPE_CHECKING, Any, Optional +from typing import Any, Optional import torch from torch.nn.attention.flex_attention import (BlockMask, _mask_mod_signature, @@ -14,18 +14,15 @@ from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl, AttentionMetadata, AttentionType, is_quantized_kv_cache) +from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.platforms import current_platform from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, CommonAttentionMetadata) from vllm.v1.kv_cache_interface import AttentionSpec -from vllm.v1.worker.block_table import BlockTable logger = init_logger(__name__) -if TYPE_CHECKING: - from vllm.v1.worker.gpu_model_runner import GPUModelRunner - create_block_mask_compiled = torch.compile(create_block_mask, fullgraph=True, mode="reduce-overhead") @@ -261,36 +258,34 @@ def __post_init__(self): class FlexAttentionMetadataBuilder( AttentionMetadataBuilder[FlexAttentionMetadata]): - def __init__(self, runner: "GPUModelRunner", kv_cache_spec: AttentionSpec, - block_table: BlockTable): - model_config = runner.model_config - - self.runner = runner - self.num_heads_q = model_config.get_num_attention_heads( - runner.parallel_config) - self.num_heads_kv = model_config.get_num_kv_heads( - runner.parallel_config) - self.headdim = model_config.get_head_size() + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + self.model_config = vllm_config.model_config + self.parallel_config = vllm_config.parallel_config + self.cache_config = vllm_config.cache_config + + self.num_heads_q = self.model_config.get_num_attention_heads( + vllm_config.parallel_config) + self.num_heads_kv = self.model_config.get_num_kv_heads( + vllm_config.parallel_config) + self.headdim = self.model_config.get_head_size() self.block_size = kv_cache_spec.block_size self.kv_cache_spec = kv_cache_spec - self.block_table = block_table + self.device = device - def build(self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata): + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> FlexAttentionMetadata: num_reqs = common_attn_metadata.num_reqs num_actual_tokens = common_attn_metadata.num_actual_tokens max_query_len = common_attn_metadata.max_query_len - max_seq_len = self.runner.seq_lens_np[:num_reqs].max() + max_seq_len = int(common_attn_metadata.seq_lens_cpu.max()) query_start_loc = 
common_attn_metadata.query_start_loc seq_lens = common_attn_metadata.seq_lens - - block_table = self.block_table - block_table_tensor = block_table.get_device_tensor()[:num_reqs] - block_table.slot_mapping[:num_actual_tokens].copy_( - block_table.slot_mapping_cpu[:num_actual_tokens], - non_blocking=True) - slot_mapping = block_table.slot_mapping[:num_actual_tokens] + block_table_tensor = common_attn_metadata.block_table_tensor + slot_mapping = common_attn_metadata.slot_mapping use_cascade = common_prefix_len > 0 cu_prefix_query_lens = None @@ -300,17 +295,15 @@ def build(self, common_prefix_len: int, raise NotImplementedError("Not yet my friend") block_size = self.kv_cache_spec.block_size - max_possible_seq_len = self.runner.model_config.max_model_len - total_cache_tokens = (self.runner.cache_config.num_gpu_blocks * - block_size) + max_possible_seq_len = self.model_config.max_model_len + total_cache_tokens = self.cache_config.num_gpu_blocks * block_size inverse_block_table = physical_to_logical_mapping( - block_table_tensor, self.runner.cache_config.num_gpu_blocks) + block_table_tensor, self.cache_config.num_gpu_blocks) # Get the original offset tensor - offset_tensor = torch.tensor( - self.runner.input_batch.num_computed_tokens_cpu[:num_reqs]).to( - self.runner.device, non_blocking=True) + offset_tensor = common_attn_metadata.num_computed_tokens_cpu.to( + self.device, non_blocking=True) out = FlexAttentionMetadata( num_actual_tokens=num_actual_tokens, diff --git a/vllm/v1/attention/backends/mamba_attn.py b/vllm/v1/attention/backends/mamba_attn.py index 7b4ecd7c359..dca5de46c06 100644 --- a/vllm/v1/attention/backends/mamba_attn.py +++ b/vllm/v1/attention/backends/mamba_attn.py @@ -7,15 +7,15 @@ import torch from vllm.attention.backends.abstract import AttentionBackend -from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, - CommonAttentionMetadata) -from vllm.v1.kv_cache_interface import MambaSpec -from vllm.v1.worker.block_table import BlockTable +from vllm.config import VllmConfig +from vllm.v1.attention.backends.utils import ( + AttentionMetadataBuilder, CommonAttentionMetadata, + reorder_batch_to_split_decodes_and_prefills, split_decodes_and_prefills) +from vllm.v1.kv_cache_interface import AttentionSpec, MambaSpec if TYPE_CHECKING: from vllm.v1.core.sched.output import SchedulerOutput from vllm.v1.worker.gpu_input_batch import InputBatch - from vllm.v1.worker.gpu_model_runner import GPUModelRunner def _query_start_loc_to_chunk_indices_offsets(query_start_loc: torch.Tensor, @@ -87,80 +87,24 @@ class Mamba2AttentionMetadata: class Mamba2AttentionMetadataBuilder( AttentionMetadataBuilder[Mamba2AttentionMetadata]): - def __init__(self, runner: "GPUModelRunner", kv_cache_spec: MambaSpec, - block_table: BlockTable): - self.runner = runner + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + assert isinstance(kv_cache_spec, MambaSpec) self.kv_cache_spec = kv_cache_spec - self.block_table = block_table - self.chunk_size = runner.vllm_config.model_config.get_mamba_chunk_size( - ) + self.chunk_size = vllm_config.model_config.get_mamba_chunk_size() assert self.chunk_size is not None, ( "chunk_size needs to be set in the model config for Mamba2 models") def reorder_batch(self, input_batch: "InputBatch", scheduler_output: "SchedulerOutput") -> bool: - # NOTE (Chen): Copied from MLACommonMetadataBuilder and - # FlashInferMetadataBuilder. Should be refactored later to avoid code - # duplication of these 3 functions. 
- # We now want to reorder the batch so that the "decode" requests are and - # the front and the "prefill" requests are at the using the least amount - # swaps possible. (NOTE for now we loosely use "decode" to mean requests - # where attention is likely memory-bound and "prefill" to mean requests - # where attention is likely compute-bound, TODO(lucas): figure out a - # better naming here) - decodes = [] - prefills = [] - num_decode_tokens = 0 - num_prefill_tokens = 0 - - for i, req_id in enumerate(input_batch.req_ids): - num_tokens = scheduler_output.num_scheduled_tokens[req_id] - # for now treat 1 scheduled token as "decode" even if its not, - # we should update this to something like < 8 in the future but - # currently the decode run only supports num_tokens = 1 - if num_tokens == 1: - decodes.append(i) - num_decode_tokens += num_tokens - else: - prefills.append(i) - num_prefill_tokens += num_tokens - - # We hope that this is fairly minimal since decodes - # should be around for a number of iterations so hopefully they are - # relatively stationary (and new request are generally appended to the - # persistent batch so already should be at the back) - # To achieve this we loop over the decodes in descending order and - # the prefills in ascending order. We swap decodes from the "back" - # i.e. past where the last decode should be in the reodorered with - # prefills from the front of the batch. - # `decodes` and `prefills` are already in ascending order just based on - # the above loop - num_decodes = len(decodes) - num_prefills = len(prefills) - modified_batch = False - - for i in range(1, min(num_decodes, num_prefills) + 1): - # If the decode is at the "back" of the batch, i, we can swap it - # with the prefill closest to the front of the batch - decode_idx = decodes[num_decodes - i] - if decode_idx < num_decodes: - break - - input_batch.swap_states(prefills[i - 1], decode_idx) - modified_batch = True - - # Save for next `build` call - # TODO(lucas): this is a bit of a hack, we should probably have a - # better way of doing this - self._num_decodes = num_decodes - self._num_prefills = num_prefills - self._num_decode_tokens = num_decode_tokens - self._num_prefill_tokens = num_prefill_tokens - - return modified_batch - - def build(self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata): + return reorder_batch_to_split_decodes_and_prefills(input_batch, + scheduler_output, + decode_threshold=1) + + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> Mamba2AttentionMetadata: num_reqs = common_attn_metadata.num_reqs query_start_loc = common_attn_metadata.query_start_loc seq_lens = common_attn_metadata.seq_lens @@ -172,29 +116,31 @@ def build(self, common_prefix_len: int, has_initial_states = None prep_initial_states = False - state_indices_tensor = self.block_table.block_table[:num_reqs, 0] + state_indices_tensor = common_attn_metadata.block_table_tensor[:, 0] + + num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens = ( + split_decodes_and_prefills(common_attn_metadata, + decode_threshold=1)) # Compute seq_idx, chunk_indices and chunk_offsets for prefill only - if self._num_prefills > 0: + if num_prefills > 0: #[batch,] has_initial_states_cpu = ( - self.runner.input_batch. - num_computed_tokens_cpu_tensor[num_reqs - - self._num_prefills:num_reqs] - > 0) + common_attn_metadata. 
+ num_computed_tokens_cpu[num_reqs - num_prefills:num_reqs] > 0) prep_initial_states = torch.any(has_initial_states_cpu).item() has_initial_states = has_initial_states_cpu.to( query_start_loc.device) query_start_loc_p = common_attn_metadata.query_start_loc[ - -self._num_prefills - 1:] - self._num_decode_tokens - - seq_idx = torch.repeat_interleave( - torch.arange(self._num_prefills, - dtype=torch.int32, - device=query_start_loc_p.device), - query_start_loc_p.diff(), - output_size=self._num_prefill_tokens) + -num_prefills - 1:] - num_decode_tokens + + seq_idx = torch.repeat_interleave(torch.arange( + num_prefills, + dtype=torch.int32, + device=query_start_loc_p.device), + query_start_loc_p.diff(), + output_size=num_prefill_tokens) seq_idx.unsqueeze_(0) # We compute metadata for chunked prefill once at the top level @@ -204,13 +150,13 @@ def build(self, common_prefix_len: int, chunk_indices, chunk_offsets = ( _query_start_loc_to_chunk_indices_offsets( query_start_loc_p, self.chunk_size, - self._num_prefill_tokens)) + num_prefill_tokens)) attn_metadata = Mamba2AttentionMetadata( - num_prefills=self._num_prefills, - num_prefill_tokens=self._num_prefill_tokens, - num_decodes=self._num_decodes, - num_decode_tokens=self._num_decode_tokens, + num_prefills=num_prefills, + num_prefill_tokens=num_prefill_tokens, + num_decodes=num_decodes, + num_decode_tokens=num_decode_tokens, query_start_loc=query_start_loc, seq_lens=seq_lens, has_initial_states=has_initial_states, diff --git a/vllm/v1/attention/backends/mla/common.py b/vllm/v1/attention/backends/mla/common.py index 173c8466f6d..93c8156b16a 100755 --- a/vllm/v1/attention/backends/mla/common.py +++ b/vllm/v1/attention/backends/mla/common.py @@ -202,18 +202,18 @@ from vllm.attention.backends.utils import get_mla_dims from vllm.attention.ops.merge_attn_states import merge_attn_states from vllm.attention.utils.fa_utils import get_flash_attn_version +from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.model_executor.layers.linear import (ColumnParallelLinear, LinearBase, UnquantizedLinearMethod) from vllm.platforms import current_platform from vllm.utils import cdiv, round_down -from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, - CommonAttentionMetadata, - get_per_layer_parameters, - infer_global_hyperparameters) +from vllm.v1.attention.backends.utils import ( + AttentionMetadataBuilder, CommonAttentionMetadata, + get_per_layer_parameters, infer_global_hyperparameters, + reorder_batch_to_split_decodes_and_prefills, split_decodes_and_prefills) from vllm.v1.kv_cache_interface import AttentionSpec -from vllm.v1.worker.block_table import BlockTable try: from vllm.vllm_flash_attn import flash_attn_varlen_func @@ -235,7 +235,6 @@ if TYPE_CHECKING: from vllm.v1.core.sched.output import SchedulerOutput from vllm.v1.worker.gpu_input_batch import InputBatch - from vllm.v1.worker.gpu_model_runner import GPUModelRunner logger = init_logger(__name__) @@ -406,22 +405,23 @@ class MLACommonMetadataBuilder(AttentionMetadataBuilder[M]): """ def __init__(self, - runner: "GPUModelRunner", kv_cache_spec: AttentionSpec, - block_table: BlockTable, + vllm_config: VllmConfig, + device: torch.device, metadata_cls: Optional[type[M]] = None): self.metadata_cls = metadata_cls \ if metadata_cls is not None else MLACommonMetadata - self.runner = runner - scheduler_config = runner.scheduler_config - model_config = runner.model_config - cache_config = runner.cache_config + self.kv_cache_spec = kv_cache_spec + self.device = device + 
scheduler_config = vllm_config.scheduler_config + self.model_config = vllm_config.model_config + cache_config = vllm_config.cache_config + parallel_config = vllm_config.parallel_config self.chunked_prefill_enabled = scheduler_config.chunked_prefill_enabled - self.num_heads = model_config.get_num_attention_heads( - runner.parallel_config) - self.mla_dims = get_mla_dims(model_config) + self.num_heads = self.model_config.get_num_attention_heads( + parallel_config) + self.mla_dims = get_mla_dims(self.model_config) self.aot_schedule = current_platform.is_cuda() - self.kv_cache_spec = kv_cache_spec # Dont try to access the runner on AMD if self.aot_schedule: @@ -432,7 +432,7 @@ def __init__(self, # Max sure there is enough for 8 full length request or at least # 4 pages of cache per request max( - 8 * model_config.max_model_len, 4 * + 8 * self.model_config.max_model_len, 4 * scheduler_config.max_num_seqs * cache_config.block_size), # For long-context models try not to over-allocate limiting # kv-cache space, limiting it to 64k tokens, @@ -447,13 +447,11 @@ def __init__(self, scheduler_config.max_num_seqs * cache_config.block_size self.chunked_prefill_workspace = torch.empty( (self.chunked_prefill_workspace_size, - model_config.get_head_size()), - dtype=model_config.dtype, - device=runner.device, + self.model_config.get_head_size()), + dtype=self.model_config.dtype, + device=device, ) - self.block_table = block_table - self._use_cudnn_prefill = use_cudnn_prefill() self._use_fi_prefill = use_flashinfer_prefill() self.prefill_metadata_cls = ( @@ -465,7 +463,7 @@ def __init__(self, self._workspace_buffer = torch.empty( FLASHINFER_WORKSPACE_BUFFER_SIZE, dtype=torch.uint8, - device=runner.device) + device=device) self._fi_prefill_main: Optional[ BatchPrefillWithRaggedKVCacheWrapper] = None @@ -473,13 +471,13 @@ def __init__(self, BatchPrefillWithRaggedKVCacheWrapper] = [] self._global_hyperparameters = infer_global_hyperparameters( - get_per_layer_parameters(runner.vllm_config, MLACommonImpl)) + get_per_layer_parameters(vllm_config, MLACommonImpl)) if self._use_cudnn_prefill: self.cudnn_workspace = torch.empty( CUDNN_WORKSPACE_SIZE * scheduler_config.max_num_seqs, dtype=torch.int8, - device=runner.device, + device=device, ) def _build_fi_prefill_wrappers(self, prefill: FlashInferPrefillMetadata): @@ -505,7 +503,7 @@ def _build_fi_prefill_wrappers(self, prefill: FlashInferPrefillMetadata): assert num_chunks <= len(self._fi_prefill_chunks) # In MLA, the non-latent num_qo_heads == num_kv_heads - num_qo_heads = self.runner.num_query_heads + num_qo_heads = self.num_heads num_kv_heads = num_qo_heads # Sanity: Verify that num_kv_heads == 1 since it is latent space @@ -531,7 +529,7 @@ def _build_fi_prefill_wrappers(self, prefill: FlashInferPrefillMetadata): sm_scale=self._global_hyperparameters.sm_scale, window_left=self._global_hyperparameters.window_left, logits_soft_cap=self._global_hyperparameters.logits_soft_cap, - q_data_type=self.runner.dtype, + q_data_type=self.model_config.dtype, kv_data_type=self.kv_cache_spec.dtype, ) @@ -552,7 +550,7 @@ def _build_fi_prefill_wrappers(self, prefill: FlashInferPrefillMetadata): window_left=self._global_hyperparameters.window_left, logits_soft_cap=self._global_hyperparameters. 
logits_soft_cap, - q_data_type=self.runner.dtype, + q_data_type=self.model_config.dtype, kv_data_type=self.kv_cache_spec.dtype, ) @@ -561,63 +559,9 @@ def _build_fi_prefill_wrappers(self, prefill: FlashInferPrefillMetadata): def reorder_batch(self, input_batch: "InputBatch", scheduler_output: "SchedulerOutput") -> bool: - # We now want to reorder the batch so that the "decode" requests are and - # the front and the "prefill" requests are at the using the least amount - # swaps possible. (NOTE for now we loosely use "decode" to mean requests - # where attention is likely memory-bound and "prefill" to mean requests - # where attention is likely compute-bound, TODO(lucas): figure out a - # better naming here) - decodes = [] - prefills = [] - num_decode_tokens = 0 - num_prefill_tokens = 0 - - for i, req_id in enumerate(input_batch.req_ids): - num_tokens = scheduler_output.num_scheduled_tokens[req_id] - # for now treat 1 scheduled token as "decode" even if its not, - # we should update this to something like < 8 in the future but - # currently the TritonMLA._forward_decode only supports - # num_tokens = 1 - if num_tokens == 1: - decodes.append(i) - num_decode_tokens += num_tokens - else: - prefills.append(i) - num_prefill_tokens += num_tokens - - # We hope that this is fairly minimal since decodes - # should be around for a number of iterations so hopefully they are - # relatively stationary (and new request are generally appended to the - # persistent batch so already should be at the back) - # To achieve this we loop over the decodes in descending order and - # the prefills in ascending order. We swap decodes from the "back" - # i.e. past where the last decode should be in the reodorered with - # prefills from the front of the batch. - # `decodes` and `prefills` are already in ascending order just based on - # the above loop - num_decodes = len(decodes) - num_prefills = len(prefills) - modified_batch = False - - for i in range(1, min(num_decodes, num_prefills) + 1): - # If the decode is at the "back" of the batch, i, we can swap it - # with the prefill closest to the front of the batch - decode_idx = decodes[num_decodes - i] - if decode_idx < num_decodes: - break - - input_batch.swap_states(prefills[i - 1], decode_idx) - modified_batch = True - - # Save for next `build` call - # TODO(lucas): this is a bit of a hack, we should probably have a - # better way of doing this - self._num_decodes = num_decodes - self._num_prefills = num_prefills - self._num_decode_tokens = num_decode_tokens - self._num_prefill_tokens = num_prefill_tokens - - return modified_batch + return reorder_batch_to_split_decodes_and_prefills(input_batch, + scheduler_output, + decode_threshold=1) def _build_decode(self, block_table_tensor: torch.Tensor, seq_lens: torch.Tensor): @@ -639,49 +583,50 @@ def build_for_cudagraph_capture( m.max_query_len = 1 # decode-only - # Update state usually set in reorder_batch. 
- self._num_decodes = m.num_reqs - self._num_decode_tokens = m.num_actual_tokens - self._num_prefills = 0 - self._num_prefill_tokens = 0 return self.build(0, m) - def build(self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata) -> M: + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> M: num_reqs = common_attn_metadata.num_reqs - num_actual_tokens = common_attn_metadata.num_actual_tokens + num_tokens = common_attn_metadata.num_actual_tokens max_query_len = common_attn_metadata.max_query_len - assert self._num_decodes + self._num_prefills == num_reqs - # Note(simon): be careful about the CPU <> GPU memory movement in this # function. We should avoid GPU -> CPU sync as much as possible because # it blocks on all previous kernels. - device = self.runner.device - block_table = self.block_table - block_table_tensor = block_table.get_device_tensor()[:num_reqs] - block_table.slot_mapping[:num_actual_tokens].copy_( - block_table.slot_mapping_cpu[:num_actual_tokens], - non_blocking=True) - block_table.slot_mapping[num_actual_tokens:].fill_(-1) - slot_mapping = block_table.slot_mapping[:num_actual_tokens] + device = self.device + block_table_tensor = common_attn_metadata.block_table_tensor + slot_mapping = common_attn_metadata.slot_mapping query_start_loc = common_attn_metadata.query_start_loc + query_start_loc_cpu = common_attn_metadata.query_start_loc_cpu seq_lens = common_attn_metadata.seq_lens + query_seq_lens_cpu = query_start_loc_cpu[1:] - query_start_loc_cpu[:-1] + + num_computed_tokens_cpu = (common_attn_metadata.seq_lens_cpu - + query_seq_lens_cpu) + + num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens = \ + split_decodes_and_prefills(common_attn_metadata) + + assert num_decodes + num_prefills == num_reqs + assert num_decode_tokens + num_prefill_tokens == num_tokens + prefill_metadata = None - if self._num_prefills > 0: - reqs_start = self._num_decodes # prefill_start + if num_prefills > 0: + reqs_start = num_decodes # prefill_start - context_lens_cpu = self.runner.input_batch.\ - num_computed_tokens_cpu_tensor[reqs_start:num_reqs] + context_lens_cpu = num_computed_tokens_cpu[reqs_start:num_reqs] max_context_len_cpu = context_lens_cpu.max().item() num_prefills_with_context_cpu = (context_lens_cpu > 0).sum().item() prefill_query_start_loc = query_start_loc[ reqs_start:] - query_start_loc[reqs_start] chunked_context_metadata = None - if self.chunked_prefill_enabled and self._num_prefills > 0 \ + if self.chunked_prefill_enabled and num_prefills > 0 \ and max_context_len_cpu > 0: # NOTE: it is recommend you read the `Chunked Prefill` section # in the comment at the top of the file before trying to @@ -712,14 +657,14 @@ def build(self, common_prefix_len: int, # of `to_list`. 
chunk_starts = \ torch.arange(num_chunks, dtype=torch.int32) \ - .unsqueeze(1).expand(-1, self._num_prefills) \ + .unsqueeze(1).expand(-1, num_prefills) \ * max_context_chunk chunk_ends = torch.min(context_lens_cpu.unsqueeze(0), chunk_starts + max_context_chunk) chunk_seq_lens = (chunk_ends - chunk_starts).clamp(min=0) cu_seq_lens_cpu = torch.zeros(num_chunks, - self._num_prefills + 1, + num_prefills + 1, dtype=torch.int32, pin_memory=True) torch.cumsum(chunk_seq_lens, @@ -762,28 +707,28 @@ def build(self, common_prefix_len: int, prefill_metadata.cudnn_workspace = self.cudnn_workspace decode_metadata = None - if self._num_decodes > 0: + if num_decodes > 0: decode_metadata = self._build_decode( - block_table_tensor=block_table_tensor[:self._num_decodes, ...], - seq_lens=seq_lens[:self._num_decodes], + block_table_tensor=block_table_tensor[:num_decodes, ...], + seq_lens=seq_lens[:num_decodes], ) attn_metadata = self.metadata_cls( num_reqs=common_attn_metadata.num_reqs, max_query_len=common_attn_metadata.max_query_len, - num_actual_tokens=num_actual_tokens, + num_actual_tokens=num_tokens, query_start_loc=query_start_loc, slot_mapping=slot_mapping, - head_dim=self.runner.model_config.get_head_size(), + head_dim=self.model_config.get_head_size(), # MLACommonMetadata Chunk prefill specific - num_decodes=self._num_decodes, - num_decode_tokens=self._num_decode_tokens, - num_prefills=self._num_prefills, + num_decodes=num_decodes, + num_decode_tokens=num_decode_tokens, + num_prefills=num_prefills, prefill=prefill_metadata, decode=decode_metadata, ) - if self._use_fi_prefill and self._num_prefills > 0: + if self._use_fi_prefill and num_prefills > 0: assert isinstance(attn_metadata.prefill, FlashInferPrefillMetadata) self._build_fi_prefill_wrappers(attn_metadata.prefill) diff --git a/vllm/v1/attention/backends/mla/flashmla.py b/vllm/v1/attention/backends/mla/flashmla.py index be26e0060db..935311aacc3 100644 --- a/vllm/v1/attention/backends/mla/flashmla.py +++ b/vllm/v1/attention/backends/mla/flashmla.py @@ -11,6 +11,7 @@ from vllm.attention.ops.flashmla import (flash_mla_with_kvcache, get_mla_metadata, is_flashmla_supported) +from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.v1.attention.backends.mla.common import (MLACommonBackend, MLACommonDecodeMetadata, @@ -18,7 +19,6 @@ MLACommonMetadata, MLACommonMetadataBuilder) from vllm.v1.kv_cache_interface import AttentionSpec -from vllm.v1.worker.block_table import BlockTable logger = init_logger(__name__) @@ -56,12 +56,13 @@ class FlashMLAMetadata(MLACommonMetadata[FlashMLADecodeMetadata]): class FlashMLAMetadataBuilder(MLACommonMetadataBuilder[FlashMLAMetadata]): full_cudagraph_supported: ClassVar[bool] = True # Decode-only - def __init__(self, runner, kv_cache_spec: AttentionSpec, - block_table: BlockTable): - super().__init__(runner, kv_cache_spec, block_table, FlashMLAMetadata) + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + super().__init__(kv_cache_spec, vllm_config, device, FlashMLAMetadata) - self.num_q_heads = self.runner.model_config.get_num_attention_heads( - self.runner.parallel_config) + self.compilation_config = vllm_config.compilation_config + self.num_q_heads = vllm_config.model_config.get_num_attention_heads( + vllm_config.parallel_config) self.cg_buf_tile_scheduler_metadata = None self.cg_buf_num_splits = None @@ -75,7 +76,7 @@ def _build_decode(self, block_table_tensor: torch.Tensor, 1, # MQA for the decode path ) - if self.runner.full_cuda_graph: + if 
self.compilation_config.full_cuda_graph: # First time around (CUDAGraph capture), allocate the static buffer if self.cg_buf_tile_scheduler_metadata is None: self.cg_buf_tile_scheduler_metadata = tile_scheduler_metadata diff --git a/vllm/v1/attention/backends/mla/rocm_aiter_mla.py b/vllm/v1/attention/backends/mla/rocm_aiter_mla.py index d5f9dfaea06..42a04258361 100644 --- a/vllm/v1/attention/backends/mla/rocm_aiter_mla.py +++ b/vllm/v1/attention/backends/mla/rocm_aiter_mla.py @@ -8,6 +8,8 @@ import vllm.envs as envs from vllm.attention.ops.rocm_aiter_mla import aiter_mla_decode_fwd +from vllm.config import VllmConfig +from vllm.utils import cdiv # yapf conflicts with isort for this docstring # yapf: disable from vllm.v1.attention.backends.mla.common import (MLACommonBackend, @@ -16,7 +18,6 @@ MLACommonMetadata, MLACommonMetadataBuilder) from vllm.v1.kv_cache_interface import AttentionSpec -from vllm.v1.worker.block_table import BlockTable # yapf: enable @@ -65,24 +66,26 @@ class AiterMLAMetadata(MLACommonMetadata[AiterMLADecodeMetadata]): class AiterMLAMetadataBuilder(MLACommonMetadataBuilder[AiterMLAMetadata]): full_cudagraph_supported: ClassVar[bool] = True # decode only - def __init__(self, runner, kv_cache_spec: AttentionSpec, - block_table: BlockTable): - super().__init__(runner, kv_cache_spec, block_table, AiterMLAMetadata) + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + super().__init__(kv_cache_spec, vllm_config, device, AiterMLAMetadata) assert self.kv_cache_spec.block_size == 1, "AITER MLA" \ "only supports block size 1." + self.compilation_config = vllm_config.compilation_config + max_num_pages_per_req = cdiv(vllm_config.model_config.max_model_len, + self.kv_cache_spec.block_size) + max_num_reqs = vllm_config.scheduler_config.max_num_seqs + max_num_pages = max_num_reqs * max_num_pages_per_req + # Preparing persistent buffers - if self.runner.full_cuda_graph: - device = self.runner.device - max_num_reqs = self.runner.max_num_reqs + if vllm_config.compilation_config.full_cuda_graph: self.paged_kv_indptr = torch.zeros(max_num_reqs + 1, dtype=torch.int32, device=device) - self.paged_kv_indices = torch.zeros( - block_table.get_device_tensor().numel( - ), # max num pages possible - dtype=torch.int32, - device=device) + self.paged_kv_indices = torch.zeros(max_num_pages, + dtype=torch.int32, + device=device) self.paged_kv_last_page_len = torch.zeros(max_num_reqs, dtype=torch.int32, device=device) @@ -96,7 +99,8 @@ def _build_decode(self, block_table_tensor: torch.Tensor, seq_lens: torch.Tensor) -> AiterMLADecodeMetadata: page_size = self.kv_cache_spec.block_size block_table_bounds = (seq_lens + page_size - 1) // page_size - device = self.runner.device + device = self.device + num_reqs = seq_lens.size(0) mask = (torch.arange(block_table_tensor.size(1), dtype=block_table_tensor.dtype, @@ -113,8 +117,7 @@ def _build_decode(self, block_table_tensor: torch.Tensor, block_table_bounds.cumsum(dim=0, dtype=torch.int32) ]) - if self.runner.full_cuda_graph: - num_reqs = self._num_decodes + if self.compilation_config.full_cuda_graph: num_actual_pages = paged_kv_indices.size(0) @@ -137,7 +140,7 @@ def _build_decode(self, block_table_tensor: torch.Tensor, else: qo_indptr = torch.arange(0, - self._num_decodes + 1, + num_reqs + 1, step=1, dtype=torch.int32, device=device) diff --git a/vllm/v1/attention/backends/rocm_aiter_fa.py b/vllm/v1/attention/backends/rocm_aiter_fa.py index dd86e56885e..46802bf5c2a 100644 --- 
a/vllm/v1/attention/backends/rocm_aiter_fa.py +++ b/vllm/v1/attention/backends/rocm_aiter_fa.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer with AiterFlashAttention.""" from dataclasses import dataclass -from typing import TYPE_CHECKING, Any, Optional +from typing import Any, Optional import torch @@ -10,18 +10,13 @@ from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl, AttentionMetadata, AttentionType, is_quantized_kv_cache) +from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.platforms import current_platform from vllm.v1.attention.backends.flash_attn import ( make_local_attention_virtual_batches) from vllm.v1.attention.backends.utils import CommonAttentionMetadata from vllm.v1.kv_cache_interface import AttentionSpec -from vllm.v1.worker.block_table import BlockTable - -if TYPE_CHECKING: - from vllm.v1.core.sched.output import SchedulerOutput - from vllm.v1.worker.gpu_input_batch import InputBatch - from vllm.v1.worker.gpu_model_runner import GPUModelRunner if current_platform.is_rocm(): import aiter @@ -172,54 +167,49 @@ def flash_attn_varlen_func_fake( class AiterFlashAttentionMetadataBuilder: - def __init__(self, runner: "GPUModelRunner", kv_cache_spec: AttentionSpec, - block_table: BlockTable): - model_config = runner.model_config - - self.runner = runner - self.num_heads_q = model_config.get_num_attention_heads( - runner.parallel_config) - self.num_heads_kv = model_config.get_num_kv_heads( - runner.parallel_config) - self.headdim = model_config.get_head_size() + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + self.vllm_config = vllm_config + self.model_config = vllm_config.model_config + self.parallel_config = vllm_config.parallel_config + self.cache_config = vllm_config.cache_config + self.device = device + + self.num_heads_q = self.model_config.get_num_attention_heads( + self.parallel_config) + self.num_heads_kv = self.model_config.get_num_kv_heads( + self.parallel_config) + self.headdim = self.model_config.get_head_size() self.block_size = kv_cache_spec.block_size self.kv_cache_spec = kv_cache_spec - self.block_table = block_table # Sliding window size to be used with the AOT scheduler will be # populated on first build() call. 
self.aot_sliding_window: Optional[tuple[int, int]] = None - def reorder_batch(self, input_batch: "InputBatch", - scheduler_output: "SchedulerOutput") -> bool: + def reorder_batch(self, input_batch, scheduler_output) -> bool: return False - def build(self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata): + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> 'AiterFlashAttentionMetadata': - num_reqs = common_attn_metadata.num_reqs num_actual_tokens = common_attn_metadata.num_actual_tokens max_query_len = common_attn_metadata.max_query_len - max_seq_len = int(self.runner.seq_lens_np[:num_reqs].max()) - total_tokens = int(self.runner.seq_lens_np[:num_reqs].sum()) + max_seq_len = int(common_attn_metadata.seq_lens_cpu.max()) + total_tokens = int(common_attn_metadata.seq_lens_cpu.sum()) query_start_loc = common_attn_metadata.query_start_loc + query_start_loc_cpu = common_attn_metadata.query_start_loc_cpu seq_lens = common_attn_metadata.seq_lens - block_table = self.block_table - block_table_tensor = block_table.get_device_tensor()[:num_reqs] - - block_table.slot_mapping[:num_actual_tokens].copy_( - block_table.slot_mapping_cpu[:num_actual_tokens], - non_blocking=True) - # Fill unused with -1. Needed for reshape_and_cache in full cuda graph - # mode. - block_table.slot_mapping[num_actual_tokens:].fill_(-1) - - slot_mapping = block_table.slot_mapping[:num_actual_tokens] + seq_lens_cpu = common_attn_metadata.seq_lens_cpu + block_table_tensor = common_attn_metadata.block_table_tensor + slot_mapping = common_attn_metadata.slot_mapping cu_seq_lens = torch.zeros(seq_lens.shape[0] + 1, dtype=torch.int32, - device="cuda") + device=self.device) torch.cumsum(seq_lens, dim=0, dtype=cu_seq_lens.dtype, @@ -231,21 +221,21 @@ def schedule(batch_size, cu_query_lens, max_query_len, seqlens, # for local attention local_attn_metadata = None - if self.runner.attention_chunk_size is not None: + if self.model_config.attention_chunk_size is not None: seqlens_q_local_np, virt_q_cu_seqlens_np, virt_k_seqlens_np, \ virt_block_table_tensor = make_local_attention_virtual_batches( - self.runner.attention_chunk_size, - self.runner.query_start_loc_np[:num_reqs + 1], - self.runner.seq_lens_np[:num_reqs], + self.model_config.attention_chunk_size, + query_start_loc_cpu.numpy(), + seq_lens_cpu.numpy(), block_table_tensor, self.block_size, ) local_query_start_loc = torch.from_numpy(virt_q_cu_seqlens_np).to( - self.runner.device, non_blocking=True) + self.device, non_blocking=True) local_seqused_k = torch.from_numpy(virt_k_seqlens_np).to( - self.runner.device, non_blocking=True) - local_max_query_len = int(seqlens_q_local_np.max()) - local_max_seq_len = int(virt_k_seqlens_np.max()) + self.device, non_blocking=True) + local_max_query_len = seqlens_q_local_np.max().item() + local_max_seq_len = virt_k_seqlens_np.max().item() local_scheduler_metadata = schedule( batch_size=local_query_start_loc.shape[0] - 1, cu_query_lens=local_query_start_loc, @@ -256,12 +246,11 @@ def schedule(batch_size, cu_query_lens, max_query_len, seqlens, local_cu_seq_lens = torch.zeros(virt_k_seqlens_np.shape[0] + 1, dtype=torch.int32, - device=self.runner.device) + device=self.device) local_cu_seq_lens[1:] = torch.cumsum( - torch.from_numpy(virt_k_seqlens_np).to( - device=self.runner.device, - dtype=torch.int32, - non_blocking=True), + torch.from_numpy(virt_k_seqlens_np).to(device=self.device, + dtype=torch.int32, + non_blocking=True), dim=0) diff --git 
a/vllm/v1/attention/backends/triton_attn.py b/vllm/v1/attention/backends/triton_attn.py index 7dc90a6a97e..ee95b5af6e4 100644 --- a/vllm/v1/attention/backends/triton_attn.py +++ b/vllm/v1/attention/backends/triton_attn.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer with PagedAttention and Triton prefix prefill.""" from dataclasses import dataclass -from typing import TYPE_CHECKING, Any, ClassVar, Optional +from typing import Any, ClassVar, Optional import torch @@ -14,6 +14,7 @@ chunked_prefill_paged_decode) from vllm.attention.ops.paged_attn import PagedAttention from vllm.attention.ops.triton_unified_attention import unified_attention +from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.platforms import current_platform from vllm.v1.attention.backends.flash_attn import FlashAttentionMetadata @@ -21,10 +22,6 @@ AttentionMetadataBuilder, CommonAttentionMetadata, make_local_attention_virtual_batches) from vllm.v1.kv_cache_interface import AttentionSpec -from vllm.v1.worker.block_table import BlockTable - -if TYPE_CHECKING: - from vllm.v1.worker.gpu_model_runner import GPUModelRunner logger = init_logger(__name__) @@ -75,12 +72,21 @@ class TritonAttentionMetadataBuilder( AttentionMetadataBuilder[TritonAttentionMetadata]): full_cudagraph_supported: ClassVar[bool] = True - def __init__(self, runner: "GPUModelRunner", kv_cache_spec: AttentionSpec, - block_table: BlockTable): - self.runner = runner + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + self.device = device self.block_size = kv_cache_spec.block_size self.kv_cache_spec = kv_cache_spec - self.block_table = block_table + + model_config = vllm_config.model_config + self.num_heads_q = model_config.get_num_attention_heads( + vllm_config.parallel_config) + self.num_heads_kv = model_config.get_num_kv_heads( + vllm_config.parallel_config) + self.headdim = model_config.get_head_size() + + self.attention_chunk_size = getattr(vllm_config.scheduler_config, + 'attention_chunk_size', None) def build_for_cudagraph_capture( self, common_attn_metadata: CommonAttentionMetadata @@ -92,46 +98,36 @@ def build_for_cudagraph_capture( attn_metadata.seq_lens.fill_(1) return attn_metadata - def build( - self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata - ) -> TritonAttentionMetadata: - num_reqs = common_attn_metadata.num_reqs + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> TritonAttentionMetadata: num_actual_tokens = common_attn_metadata.num_actual_tokens max_query_len = common_attn_metadata.max_query_len - max_seq_len = int(self.runner.seq_lens_np[:num_reqs].max()) + max_seq_len = int(common_attn_metadata.seq_lens_cpu.max()) query_start_loc = common_attn_metadata.query_start_loc seq_lens = common_attn_metadata.seq_lens - block_table = self.block_table - block_table_tensor = block_table.get_device_tensor()[:num_reqs] - - block_table.slot_mapping[:num_actual_tokens].copy_( - block_table.slot_mapping_cpu[:num_actual_tokens], - non_blocking=True) - # Fill unused with -1. Needed for reshape_and_cache in full cuda graph - # mode. 
- block_table.slot_mapping[num_actual_tokens:].fill_(-1) - - slot_mapping = block_table.slot_mapping[:num_actual_tokens] + block_table_tensor = common_attn_metadata.block_table_tensor + slot_mapping = common_attn_metadata.slot_mapping # for local attention local_attn_metadata = None - if self.runner.attention_chunk_size is not None: + if self.attention_chunk_size is not None: seqlens_q_local_np, virt_q_cu_seqlens_np, virt_k_seqlens_np, \ virt_block_table_tensor = make_local_attention_virtual_batches( - self.runner.attention_chunk_size, - self.runner.query_start_loc_np[:num_reqs + 1], - self.runner.seq_lens_np[:num_reqs], + self.attention_chunk_size, + common_attn_metadata.query_start_loc_cpu.numpy(), + common_attn_metadata.seq_lens_cpu.numpy(), block_table_tensor, self.block_size, ) local_query_start_loc = torch.from_numpy(virt_q_cu_seqlens_np).to( - self.runner.device, non_blocking=True) + self.device, non_blocking=True) local_seqused_k = torch.from_numpy(virt_k_seqlens_np).to( - self.runner.device, non_blocking=True) - local_max_query_len = seqlens_q_local_np.max() - local_max_seq_len = virt_k_seqlens_np.max() + self.device, non_blocking=True) + local_max_query_len = seqlens_q_local_np.max().item() + local_max_seq_len = virt_k_seqlens_np.max().item() local_attn_metadata = TritonAttentionMetadata \ .LocalAttentionMetadata( @@ -148,14 +144,13 @@ def build( if use_cascade: cu_prefix_query_lens = torch.tensor([0, num_actual_tokens], dtype=torch.int32, - device=self.runner.device) + device=self.device) prefix_kv_lens = torch.tensor([common_prefix_len], dtype=torch.int32, - device=self.runner.device) - suffix_kv_lens = (self.runner.seq_lens_np[:num_reqs] - + device=self.device) + suffix_kv_lens = (common_attn_metadata.seq_lens_cpu - common_prefix_len) - suffix_kv_lens = torch.from_numpy(suffix_kv_lens).to( - self.runner.device) + suffix_kv_lens = suffix_kv_lens.to(self.device) else: cu_prefix_query_lens = None prefix_kv_lens = None diff --git a/vllm/v1/attention/backends/utils.py b/vllm/v1/attention/backends/utils.py index 88adc32406e..db6eaa55864 100644 --- a/vllm/v1/attention/backends/utils.py +++ b/vllm/v1/attention/backends/utils.py @@ -22,6 +22,7 @@ from vllm.distributed.kv_transfer.kv_connector.utils import ( get_kv_connector_cache_layout) from vllm.logger import init_logger +from vllm.v1.kv_cache_interface import AttentionSpec logger = init_logger(__name__) _KV_CACHE_LAYOUT_OVERRIDE = None @@ -32,14 +33,22 @@ class CommonAttentionMetadata: """ Per-batch attention metadata, shared across layers and backends. AttentionMetadataBuilder instances use it to construct per-layer metadata. + + For many of the tensors we keep both GPU and CPU versions. """ query_start_loc: torch.Tensor + query_start_loc_cpu: torch.Tensor """(batch_size + 1,), the start location of each request in query Tensor""" + seq_lens: torch.Tensor + seq_lens_cpu: torch.Tensor """(batch_size,), the length of each request including both computed tokens and newly scheduled tokens""" + num_computed_tokens_cpu: torch.Tensor + """(batch_size,), the number of computed tokens for each request""" + num_reqs: int """Number of requests""" num_actual_tokens: int @@ -47,6 +56,14 @@ class CommonAttentionMetadata: max_query_len: int """Longest query in batch""" + block_table_tensor: torch.Tensor + slot_mapping: torch.Tensor + + def __post_init__(self): + # Fill unused with -1. Needed for reshape_and_cache in full cuda graph + # mode. 
+ self.slot_mapping[self.num_actual_tokens:].fill_(-1) + M = TypeVar("M") @@ -56,11 +73,25 @@ class AttentionMetadataBuilder(abc.ABC, Generic[M]): full_cudagraph_supported: ClassVar[bool] = False @abstractmethod - def build(self, common_prefix_len: int, - common_attn_metadata: CommonAttentionMetadata) -> M: + def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, + device: torch.device): + self.kv_cache_spec = kv_cache_spec + + @abstractmethod + def build(self, + common_prefix_len: int, + common_attn_metadata: CommonAttentionMetadata, + fast_build: bool = False) -> M: """ Central method that builds attention metadata. Some builders (MLA) require reorder_batch to be called prior to build. + + Args: + common_prefix_len: The length of the common prefix of the batch. + common_attn_metadata: The common attention metadata. + fast_build: The meta-data will prioritize speed of building over + then speed at execution. Can be used for spec-decode where the + result of a build call may only be used for few layers/iters. """ raise NotImplementedError @@ -351,3 +382,108 @@ def make_local_attention_virtual_batches( return seqlens_q_local, cu_seqlens_q_local, seqlens_k_local, \ block_table_local + + +def split_decodes_and_prefills( + common_attn_metadata: CommonAttentionMetadata, + decode_threshold: int = 1, +) -> tuple[int, int, int, int]: + """ + Assuming a reordered batch, finds the boundary between prefill and decode + requests. + + Args: + common_attn_metadata: CommonAttentionMetadata object containing the + batch metadata. + decode_threshold: The maximum query length to be considered a decode. + + Returns: + num_decodes: The number of decode requests. + num_prefills: The number of prefill requests. + num_decode_tokens: The number of tokens in the decode requests. + num_prefill_tokens: The number of tokens in the prefill requests. + """ + max_query_len = common_attn_metadata.max_query_len + num_reqs = common_attn_metadata.num_reqs + num_tokens = common_attn_metadata.num_actual_tokens + query_start_loc = common_attn_metadata.query_start_loc_cpu + + if max_query_len <= decode_threshold: + return num_reqs, 0, num_tokens, 0 + + query_lens = query_start_loc[1:] - query_start_loc[:-1] + is_prefill = query_lens > decode_threshold + if not torch.any(is_prefill): + return num_reqs, 0, num_tokens, 0 + + first_prefill = is_prefill.int().argmax(dim=-1).item() + assert torch.all(query_lens[first_prefill:] > decode_threshold) + assert torch.all(query_lens[:first_prefill] <= decode_threshold) + num_decodes = first_prefill + num_prefills = num_reqs - num_decodes + num_decode_tokens = query_start_loc[first_prefill].item() + num_prefill_tokens = num_tokens - num_decode_tokens + return (num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens) + + +def reorder_batch_to_split_decodes_and_prefills( + input_batch: "InputBatch", + scheduler_output: "SchedulerOutput", + decode_threshold: int = 1, +) -> bool: + """ + Reorders the batch to split into prefill and decode requests; places all + requests with <= decode_threshold tokens at the front of the batch. + + Returns: + True if the batch was modified, False otherwise. + """ + # We now want to reorder the batch so that the "decode" requests are at + # the front and the "prefill" requests are at the back using the least + # amount of swaps possible. 
(NOTE for now we loosely use "decode" to mean + # requests where attention is likely memory-bound and "prefill" to mean + # requests where attention is likely compute-bound, TODO(lucas): figure out + # a better naming here) + decodes = [] + prefills = [] + num_decode_tokens = 0 + num_prefill_tokens = 0 + + for i, req_id in enumerate(input_batch.req_ids): + num_tokens = scheduler_output.num_scheduled_tokens[req_id] + # for now treat 1 scheduled token as "decode" even if its not, + # we should update this to something like < 8 in the future but + # currently the TritonMLA._forward_decode only supports + # num_tokens = 1 + if num_tokens <= decode_threshold: + decodes.append(i) + num_decode_tokens += num_tokens + else: + prefills.append(i) + num_prefill_tokens += num_tokens + + # We hope that this is fairly minimal since decodes + # should be around for a number of iterations so hopefully they are + # relatively stationary (and new request are generally appended to the + # persistent batch so already should be at the back) + # To achieve this we loop over the decodes in descending order and + # the prefills in ascending order. We swap decodes from the "back" + # i.e. past where the last decode should be in the reodorered with + # prefills from the front of the batch. + # `decodes` and `prefills` are already in ascending order just based on + # the above loop + num_decodes = len(decodes) + num_prefills = len(prefills) + modified_batch = False + + for i in range(1, min(num_decodes, num_prefills) + 1): + # If the decode is at the "back" of the batch, i, we can swap it + # with the prefill closest to the front of the batch + decode_idx = decodes[num_decodes - i] + if decode_idx < num_decodes: + break + + input_batch.swap_states(prefills[i - 1], decode_idx) + modified_batch = True + + return modified_batch diff --git a/vllm/v1/spec_decode/eagle.py b/vllm/v1/spec_decode/eagle.py index 6661d984a77..967847c02ff 100644 --- a/vllm/v1/spec_decode/eagle.py +++ b/vllm/v1/spec_decode/eagle.py @@ -1,5 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import numpy as np import torch import torch.nn as nn @@ -12,11 +13,11 @@ from vllm.model_executor.model_loader import get_model from vllm.model_executor.models import supports_multimodal from vllm.model_executor.models.llama_eagle3 import Eagle3LlamaForCausalLM +from vllm.utils import is_pin_memory_available from vllm.v1.attention.backends.flash_attn import FlashAttentionMetadata from vllm.v1.attention.backends.utils import CommonAttentionMetadata from vllm.v1.kv_cache_interface import KVCacheConfig from vllm.v1.sample.metadata import SamplingMetadata -from vllm.v1.spec_decode.utils import prepare_eagle_input_kernel logger = init_logger(__name__) @@ -37,7 +38,6 @@ def __init__( self.method = self.speculative_config.method self.runner = runner - self.dtype = vllm_config.model_config.dtype self.max_model_len = vllm_config.model_config.max_model_len self.block_size = vllm_config.cache_config.block_size @@ -45,6 +45,7 @@ def __init__( self.speculative_config.num_speculative_tokens) self.max_num_tokens = ( vllm_config.scheduler_config.max_num_batched_tokens) + self.token_arange_np = np.arange(self.max_num_tokens) # We need to get the hidden size from the draft model config because # the draft model's hidden size can be different from the target model's # hidden size (e.g., Llama 3.3 70B). 
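        # (Illustrative note, not part of the original patch: token_arange_np
        # above is simply [0, 1, ..., max_num_tokens - 1]. prepare_inputs()
        # further down slices it to the new total token count and subtracts
        # the repeated new query_start_locs to obtain per-request token
        # offsets, so no fresh np.arange is allocated on every step.)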
@@ -83,19 +84,14 @@ def propose( target_positions: torch.Tensor, # [num_tokens, hidden_size] target_hidden_states: torch.Tensor, - # [num_tokens] - target_slot_mapping: torch.Tensor, # [batch_size] next_token_ids: torch.Tensor, - # [batch_size + 1] starting with 0 - cu_num_tokens: torch.Tensor, - # [batch_size, max_num_blocks_per_req] - block_table: torch.Tensor, + common_attn_metadata: CommonAttentionMetadata, sampling_metadata: SamplingMetadata, ) -> torch.Tensor: num_tokens = target_token_ids.shape[0] batch_size = next_token_ids.shape[0] - last_token_indices = cu_num_tokens[1:] - 1 + last_token_indices = common_attn_metadata.query_start_loc[1:] - 1 if self.method == "eagle3": assert isinstance(self.model, Eagle3LlamaForCausalLM) @@ -110,50 +106,14 @@ def propose( # E.g., [b1, b2, c1, c2, c3, c3] -> [a2, b2, b3, c2, c3, c4] self.input_ids[last_token_indices] = next_token_ids - # FA requires seq_len to have dtype int32. - seq_lens = (target_positions[last_token_indices] + 1).int() - - if self.method in ["eagle", "eagle3"]: - # FIXME(woosuk): The below two ops cause synchronization. Optimize. - max_seq_len = seq_lens.max().item() - max_num_tokens = (cu_num_tokens[1:] - - cu_num_tokens[:-1]).max().item() - attn_metadata = FlashAttentionMetadata( - num_actual_tokens=num_tokens, - max_query_len=max_num_tokens, - query_start_loc=cu_num_tokens, - max_seq_len=max_seq_len, - seq_lens=seq_lens, - block_table=block_table, - slot_mapping=target_slot_mapping, - # TODO(woosuk): Support cascade attention. - use_cascade=False, - common_prefix_len=0, - cu_prefix_query_lens=None, - prefix_kv_lens=None, - suffix_kv_lens=None, - ) - elif self.method == "deepseek_mtp": - query_lens = cu_num_tokens[1:] - cu_num_tokens[:-1] - max_query_len = query_lens.max().item() - - common_attn_metadata = CommonAttentionMetadata( - query_start_loc=cu_num_tokens, - seq_lens=seq_lens, - num_reqs=batch_size, - num_actual_tokens=num_tokens, - max_query_len=max_query_len, - ) - - assert self.runner is not None + assert self.runner is not None - # FIXME: need to consider multiple kv_cache_groups - attn_metadata = self.runner.attn_metadata_builders[0].build( - common_prefix_len=0, - common_attn_metadata=common_attn_metadata, - ) - else: - raise ValueError(f"Unsupported method: {self.method}") + # FIXME: need to consider multiple kv_cache_groups + attn_metadata = self.runner.attn_metadata_builders[0].build( + common_prefix_len=0, + common_attn_metadata=common_attn_metadata, + fast_build=True, + ) # At this moment, we assume all eagle layers belong to the same KV # cache group, thus using the same attention metadata. @@ -194,6 +154,11 @@ def propose( # one layer. Adapt this code to support multiple layers once # there's a multi-layer MTP module. + # Currently FlashAttention is the only backend that supports + # multi-token eagle spec decode. This is because the code below + # makes assumptions about attn_metadata attributes available. + assert isinstance(attn_metadata, FlashAttentionMetadata) + # Generate the remaining draft tokens. draft_token_ids_list = [draft_token_ids] @@ -238,8 +203,8 @@ def propose( # Compute the slot mapping. 
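            # (Illustrative note, not part of the original patch: with a
            # block_size of 16, a token at clamped position 35 falls in
            # logical block 35 // 16 = 2 at offset 35 % 16 = 3, so its slot
            # below becomes block_table[req, 2] * 16 + 3.)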
block_numbers = clamped_positions // self.block_size - block_ids = block_table.gather(dim=1, - index=block_numbers.view(-1, 1)) + block_ids = attn_metadata.block_table.gather( + dim=1, index=block_numbers.view(-1, 1)) block_ids = block_ids.view(-1) attn_metadata.slot_mapping = (block_ids * self.block_size + clamped_positions % self.block_size) @@ -275,46 +240,99 @@ def propose( draft_token_ids = torch.stack(draft_token_ids_list, dim=1) return draft_token_ids - @staticmethod def prepare_inputs( - # [batch_size + 1] - cu_target_query_lens: torch.Tensor, + self, + common_attn_metadata: CommonAttentionMetadata, # [batch_size] - num_rejected_tokens: torch.Tensor, - num_tokens: int, - ) -> tuple[torch.Tensor, torch.Tensor]: - # cu_target_query_lens: [0, a, a + b, a + b + c] - # num_rejected_tokens: [n1, n2, n3] - # num_tokens_per_req: [a - n1, b - n2, c - n3] - # cu_num_tokens: [0, a - n1, a + b - n1 - n2, a + b + c - n1 - n2 - n3] - # token_indices: [0, 1, ..., a - n1 - 1, - # a, a + 1, ..., a + b - n2 - 1, - # a + b, a + b + 1, ..., a + b + c - n3 - 1] - - # [0, a, a + b, a + b + c] -> [a, b, c] - query_len_per_req = (cu_target_query_lens[1:] - - cu_target_query_lens[:-1]) - # [a, b, c] -> [a - n1, b - n2, c - n3] - num_tokens_per_req = query_len_per_req - num_rejected_tokens - - # [a - n1, b - n2, c - n3] -> - # [0, a - n1, a + b - n1 - n2, a + b + c - n1 - n2 - n3] - cu_num_tokens = torch.zeros_like(cu_target_query_lens) - torch.cumsum(num_tokens_per_req, dim=0, out=cu_num_tokens[1:]) - token_indices = torch.empty( - num_tokens, + num_rejected_tokens: torch.Tensor + ) -> tuple[CommonAttentionMetadata, torch.Tensor]: + """ + This function is used to prepare the inputs for the spec decode. + It updates to the common_attn_metadata to account for the rejected + tokens (and newly sampled tokens). It also returns the token indices + of the tokens that should be fed to the speculator. + """ + # E.g. 
+ # common_attn_metadata.query_start_loc{_cpu}: + # [0, q1, q1 + q2, q1 + q2 + q3] + # common_attn_metadata.seq_lens{_cpu}: [s1, s2, s3] + # num_rejected_tokens: [n1, n2, n3] + # This function computes the intermediate values: + # num_tokens_per_req: [q1 - n1, q2 - n2, q3 - n3] + # And returns: + # common_attn_metadata.query_start_loc{_cpu}: + # [0, q1 - n1, q1 + q2 - n1 - n2, q1 + q2 + q3 - n1 - n2 - n3] + # common_attn_metadata.seq_lens{_cpu}: + # [s1 - n1 + 1, s2 - n2 + 1, s3 - n3 + 1] + # token_indices: [0, 1, ..., q1 - n1 - 1, + # q1, q1 + 1, ..., q1 + q2 - n2 - 1, + # q1 + q2, q1 + q2 + 1, ..., q1 + q2 + q3 - n3 - 1] + + device = common_attn_metadata.query_start_loc.device + query_start_loc_cpu = common_attn_metadata.query_start_loc_cpu + new_seq_lens_cpu = common_attn_metadata.seq_lens_cpu \ + - num_rejected_tokens + + # [0, q1, q1 + q2, q1 + q2 + q3] -> [q1, q2, q3] + new_query_len_per_req = (query_start_loc_cpu[1:] - + query_start_loc_cpu[:-1]) + # [q1, q2, q3] -> [q1 - n1, q2 - n2, q3 - n3] + new_num_tokens_per_req = new_query_len_per_req - num_rejected_tokens + new_num_tokens_per_req_np = new_num_tokens_per_req.numpy() + + # [q1 - n1, q2 - n2, q3 - n3] -> + # [0, q1 - n1, q1 + q2 - n1 - n2, q1 + q2 + q3 - n1 - n2 - n3] + new_query_start_loc_cpu = torch.zeros( + query_start_loc_cpu.shape, dtype=torch.int32, - device=cu_target_query_lens.device, - ) - batch_size = num_rejected_tokens.shape[0] - BLOCK_SIZE = 1024 - prepare_eagle_input_kernel[(batch_size, )]( - token_indices, - cu_target_query_lens, - cu_num_tokens, - BLOCK_SIZE=BLOCK_SIZE, + pin_memory=is_pin_memory_available()) + new_query_start_loc_np = new_query_start_loc_cpu.numpy() + np.cumsum(new_num_tokens_per_req_np, out=new_query_start_loc_np[1:]) + + total_num_tokens = new_query_start_loc_np[-1] + # Example assuming num_tokens_per_req_np = [2, 4, 3] + # this implies that `new_query_start_locs` is: + # [0, 2, 6, 9] -> + # [0, 0, 2, 2, 2, 2, 6, 6, 6] + # _r1_ ____r2____ ___r3__ + new_query_start_locs_expanded = np.repeat(new_query_start_loc_np[:-1], + new_num_tokens_per_req_np) + # [0, 1, 2, 3, 4, 5, 6, 7, 8] -> + # [0, 1, 0, 1, 2, 3, 0, 1, 2] + # _r1_ ____r2____ ___r3__ + token_offests = self.token_arange_np[:total_num_tokens] \ + - new_query_start_locs_expanded + + # Expand starting positions to match token pattern + # [0, q1, q1 + q2] -> + # [0, 0, q1, q1, q1, q1, q1 + q2, q1 + q2, q1 + q2] + # _r1_ _____r2_______ ___________r3____________ + old_query_start_locs_expanded = np.repeat( + query_start_loc_cpu[:-1].numpy(), new_num_tokens_per_req_np) + # Final token indices are: + # [0, 1, // req 1 + # q1 + 0, q1 + 1, q1 + 2, q1 + 3, // req 2 + # q1 + q2 + 0, q1 + q2 + 1, q1 + q2 + 2] // req 3 + token_indices_np = token_offests + old_query_start_locs_expanded + token_indices = torch.from_numpy(token_indices_np).to( + device, non_blocking=True) + + spec_common_attn_metadata = CommonAttentionMetadata( + query_start_loc=new_query_start_loc_cpu.to(device, + non_blocking=True), + seq_lens=new_seq_lens_cpu.to(device, non_blocking=True), + query_start_loc_cpu=new_query_start_loc_cpu, + seq_lens_cpu=new_seq_lens_cpu, + num_computed_tokens_cpu=common_attn_metadata. 
+ num_computed_tokens_cpu, + num_reqs=common_attn_metadata.num_reqs, + num_actual_tokens=total_num_tokens, + max_query_len=new_query_len_per_req.max().item(), + block_table_tensor=common_attn_metadata.block_table_tensor, + slot_mapping=common_attn_metadata.slot_mapping[token_indices], ) - return cu_num_tokens, token_indices + + return spec_common_attn_metadata, token_indices def load_model(self, target_model: nn.Module) -> None: draft_model_config = \ diff --git a/vllm/v1/spec_decode/utils.py b/vllm/v1/spec_decode/utils.py index 3a86fea146f..1116179dc5b 100644 --- a/vllm/v1/spec_decode/utils.py +++ b/vllm/v1/spec_decode/utils.py @@ -1,7 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from vllm.sampling_params import SamplingParams -from vllm.triton_utils import tl, triton _SAMPLING_EPS = 1e-5 @@ -13,29 +12,3 @@ def is_spec_decode_unsupported(sampling_params: SamplingParams) -> bool: or sampling_params.repetition_penalty != 1.0 or sampling_params.min_p > _SAMPLING_EPS or sampling_params.logprobs is not None) - - -@triton.jit -def prepare_eagle_input_kernel( - out_ptr, - cu_query_lens_ptr, - cu_num_tokens_ptr, - BLOCK_SIZE: tl.constexpr, -): - pid = tl.program_id(0) - - # [start_pos, end_pos) - start_pos = tl.load(cu_num_tokens_ptr + pid) - end_pos = tl.load(cu_num_tokens_ptr + pid + 1) - num_tokens = end_pos - start_pos - - index_start = tl.load(cu_query_lens_ptr + pid) - - num_blocks = tl.cdiv(num_tokens, BLOCK_SIZE) - for i in tl.range(num_blocks): - offset = i * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) - tl.store( - out_ptr + start_pos + offset, - index_start + offset, - mask=offset < num_tokens, - ) diff --git a/vllm/v1/worker/block_table.py b/vllm/v1/worker/block_table.py index 8f4e8d64c61..bf38e88f0c2 100644 --- a/vllm/v1/worker/block_table.py +++ b/vllm/v1/worker/block_table.py @@ -14,12 +14,14 @@ class BlockTable: def __init__( self, + block_size: int, max_num_reqs: int, max_num_blocks_per_req: int, max_num_batched_tokens: int, pin_memory: bool, device: torch.device, ): + self.block_size = block_size self.max_num_reqs = max_num_reqs self.max_num_blocks_per_req = max_num_blocks_per_req self.max_num_batched_tokens = max_num_batched_tokens @@ -79,10 +81,31 @@ def swap_row(self, src: int, tgt: int) -> None: self.block_table_np[[src, tgt]] = self.block_table_np[[tgt, src]] - def commit(self, num_reqs: int) -> None: + def compute_slot_mapping(self, req_indices: np.ndarray, + positions: np.ndarray) -> None: + # E.g., [0, 1, 0, 1, 2, 3, 4, 0, 1, 2] + # -> [0, 0, K, K, K + 1, K + 1, K + 2, 2 * K, 2 * K, 2 * K + 1] + # where K is the max_num_blocks_per_req and the block size is 2. + # NOTE(woosuk): We can't simply use `token_indices // block_size` + # here because M (max_model_len) is not necessarily divisible by + # block_size. 
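        # (Illustrative note, not part of the original patch: with
        # block_size=2, req_indices=[0, 0, 1] and positions=[4, 5, 2], the
        # flat block_table_indices are [2, 2, K + 1] and block_offsets are
        # [0, 1, 0]; each slot is then the looked-up physical block number
        # times 2 plus its offset.)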
+ block_table_indices = (req_indices * self.max_num_blocks_per_req + + positions // self.block_size) + block_table_cpu = self.get_cpu_tensor() + block_numbers = block_table_cpu.flatten()[block_table_indices].numpy() + block_offsets = positions % self.block_size + np.add(block_numbers * self.block_size, + block_offsets, + out=self.slot_mapping_np[:req_indices.shape[0]]) + + def commit_block_table(self, num_reqs: int) -> None: self.block_table[:num_reqs].copy_(self.block_table_cpu[:num_reqs], non_blocking=True) + def commit_slot_mapping(self, num_tokens: int) -> None: + self.slot_mapping[:num_tokens].copy_( + self.slot_mapping_cpu[:num_tokens], non_blocking=True) + def clear(self) -> None: self.block_table.fill_(0) self.block_table_cpu.fill_(0) @@ -107,7 +130,8 @@ def __init__(self, max_num_reqs: int, max_model_len: int, max_num_batched_tokens: int, pin_memory: bool, device: torch.device, block_sizes: list[int]) -> None: self.block_tables = [ - BlockTable(max_num_reqs, cdiv(max_model_len, block_size), + BlockTable(block_size, max_num_reqs, cdiv(max_model_len, + block_size), max_num_batched_tokens, pin_memory, device) for block_size in block_sizes ] @@ -129,9 +153,18 @@ def swap_row(self, src: int, tgt: int) -> None: for block_table in self.block_tables: block_table.swap_row(src, tgt) - def commit(self, num_reqs: int) -> None: + def compute_slot_mapping(self, req_indices: np.ndarray, + positions: np.ndarray) -> None: + for block_table in self.block_tables: + block_table.compute_slot_mapping(req_indices, positions) + + def commit_block_table(self, num_reqs: int) -> None: + for block_table in self.block_tables: + block_table.commit_block_table(num_reqs) + + def commit_slot_mapping(self, num_tokens: int) -> None: for block_table in self.block_tables: - block_table.commit(num_reqs) + block_table.commit_slot_mapping(num_tokens) def clear(self) -> None: for block_table in self.block_tables: diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index af216539c90..29f519393e4 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -3,7 +3,6 @@ import gc import time -import weakref from contextlib import contextmanager from typing import TYPE_CHECKING, Any, Optional, Union @@ -42,8 +41,7 @@ from vllm.sampling_params import SamplingType from vllm.sequence import IntermediateTensors from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, DeviceMemoryProfiler, - GiB_bytes, LazyLoader, async_tensor_h2d, - check_use_alibi, get_dtype_size, + GiB_bytes, LazyLoader, check_use_alibi, get_dtype_size, is_pin_memory_available, round_up) from vllm.v1.attention.backends.mamba_attn import Mamba2AttentionBackend from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, @@ -62,7 +60,6 @@ from vllm.v1.spec_decode.medusa import MedusaProposer from vllm.v1.spec_decode.metadata import SpecDecodeMetadata from vllm.v1.spec_decode.ngram_proposer import NgramProposer -from vllm.v1.worker.block_table import BlockTable from vllm.v1.worker.gpu_input_batch import CachedRequestState, InputBatch from vllm.v1.worker.lora_model_runner_mixin import LoRAModelRunnerMixin @@ -577,8 +574,9 @@ def _get_cumsum_and_arange( def _prepare_inputs( self, scheduler_output: "SchedulerOutput", - ) -> tuple[dict[str, Any], bool, torch.Tensor, - Optional[SpecDecodeMetadata], np.ndarray]: + ) -> tuple[dict[str, + Any], bool, torch.Tensor, Optional[SpecDecodeMetadata], + np.ndarray, Optional[CommonAttentionMetadata]]: """ :return: tuple[ attn_metadata: layer-to-attention_metadata 
mapping, @@ -593,7 +591,7 @@ def _prepare_inputs( # OPTIMIZATION: Start copying the block table first. # This way, we can overlap the copy with the following CPU operations. - self.input_batch.block_table.commit(num_reqs) + self.input_batch.block_table.commit_block_table(num_reqs) # Get the number of scheduled tokens for each request. req_ids = self.input_batch.req_ids @@ -637,29 +635,10 @@ def _prepare_inputs( torch.from_numpy(token_indices), out=self.input_ids_cpu[:total_num_scheduled_tokens]) - # Calculate the slot mapping for each KV cache group. - for kv_cache_group_id, kv_cache_group_spec in enumerate( - self.kv_cache_config.kv_cache_groups): - block_size = kv_cache_group_spec.kv_cache_spec.block_size - block_table: BlockTable = self.input_batch.block_table[ - kv_cache_group_id] - # E.g., [0, 1, 0, 1, 2, 3, 4, 0, 1, 2] - # -> [0, 0, K, K, K + 1, K + 1, K + 2, 2 * K, 2 * K, 2 * K + 1] - # where K is the max_num_blocks_per_req and the block size is 2. - # NOTE(woosuk): We can't simply use `token_indices // block_size` - # here because M (max_model_len) is not necessarily divisible by - # block_size. - block_table_indices = ( - req_indices * block_table.max_num_blocks_per_req + - positions_np // block_size) - block_table_cpu = block_table.get_cpu_tensor() - block_numbers = block_table_cpu.flatten( - )[block_table_indices].numpy() - block_offsets = positions_np % block_size - np.add( - block_numbers * block_size, - block_offsets, - out=block_table.slot_mapping_np[:total_num_scheduled_tokens]) + self.input_batch.block_table.compute_slot_mapping( + req_indices, positions_np) + self.input_batch.block_table.commit_slot_mapping( + total_num_scheduled_tokens) # Prepare the attention metadata. self.query_start_loc_np[0] = 0 @@ -696,15 +675,8 @@ def _prepare_inputs( self.query_start_loc_cpu[num_reqs].item()) query_start_loc = self.query_start_loc[:num_reqs + 1] - seq_lens = self.seq_lens[:num_reqs] - - common_attn_metadata = CommonAttentionMetadata( - query_start_loc=query_start_loc, - seq_lens=seq_lens, - num_reqs=num_reqs, - num_actual_tokens=total_num_scheduled_tokens, - max_query_len=max_num_scheduled_tokens, - ) + + spec_decode_common_attn_metadata = None attn_metadata: dict[str, Any] = {} # Prepare the attention metadata for each KV cache group and make layers @@ -712,6 +684,27 @@ def _prepare_inputs( for kv_cache_group_id, kv_cache_group_spec in enumerate( self.kv_cache_config.kv_cache_groups): + blk_table = self.input_batch.block_table[kv_cache_group_id] + blk_table_tensor = blk_table.get_device_tensor()[:num_reqs] + slot_mapping = blk_table.slot_mapping[:total_num_scheduled_tokens] + common_attn_metadata = CommonAttentionMetadata( + query_start_loc=self.query_start_loc[:num_reqs + 1], + query_start_loc_cpu=self.query_start_loc_cpu[:num_reqs + 1], + seq_lens=self.seq_lens[:num_reqs], + seq_lens_cpu=self.seq_lens_cpu[:num_reqs], + num_computed_tokens_cpu=self.input_batch. + num_computed_tokens_cpu_tensor[:num_reqs], + num_reqs=num_reqs, + num_actual_tokens=total_num_scheduled_tokens, + max_query_len=max_num_scheduled_tokens, + block_table_tensor=blk_table_tensor, + slot_mapping=slot_mapping, + ) + + if self.speculative_config and \ + spec_decode_common_attn_metadata is None: + spec_decode_common_attn_metadata = common_attn_metadata + # Prepare for cascade attention if enabled & beneficial. 
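            # (Illustrative note, not part of the original patch: the
            # CommonAttentionMetadata built above is per KV-cache group, each
            # with its own block table and slot mapping; only the first
            # group's copy is stashed in spec_decode_common_attn_metadata for
            # the drafter.)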
common_prefix_len = 0 builder = self.attn_metadata_builders[kv_cache_group_id] @@ -765,7 +758,8 @@ def _prepare_inputs( self.set_active_loras(self.input_batch, num_scheduled_tokens) return (attn_metadata, attention_cuda_graphs, logits_indices, - spec_decode_metadata, num_scheduled_tokens) + spec_decode_metadata, num_scheduled_tokens, + spec_decode_common_attn_metadata) def _compute_cascade_attn_prefix_len( self, @@ -1286,8 +1280,9 @@ def execute_model( # Prepare the decoder inputs. (attn_metadata, attention_cuda_graphs, logits_indices, - spec_decode_metadata, - num_scheduled_tokens_np) = (self._prepare_inputs(scheduler_output)) + spec_decode_metadata, num_scheduled_tokens_np, + spec_decode_common_attn_metadata) = ( + self._prepare_inputs(scheduler_output)) num_scheduled_tokens = scheduler_output.total_num_scheduled_tokens if (self.use_cuda_graph and num_scheduled_tokens <= self.cudagraph_batch_sizes[-1]): @@ -1528,6 +1523,7 @@ def execute_model( # Speculative decoding is not enabled. spec_token_ids = None else: + assert spec_decode_common_attn_metadata is not None spec_token_ids = self.propose_draft_token_ids( scheduler_output, valid_sampled_token_ids, @@ -1536,7 +1532,7 @@ def execute_model( sample_hidden_states, aux_hidden_states, spec_decode_metadata, - attn_metadata, + spec_decode_common_attn_metadata, ) self.eplb_step() @@ -1561,7 +1557,7 @@ def propose_draft_token_ids( sample_hidden_states: torch.Tensor, aux_hidden_states: Optional[torch.Tensor], spec_decode_metadata: Optional[SpecDecodeMetadata], - attn_metadata: dict[str, Any], + common_attn_metadata: CommonAttentionMetadata, ) -> list[list[int]]: num_scheduled_tokens = scheduler_output.total_num_scheduled_tokens if self.speculative_config.method == "ngram": @@ -1608,16 +1604,6 @@ def propose_draft_token_ids( next_token_ids = torch.tensor(next_token_ids, dtype=torch.int32, device=self.device) - # At this moment, we assume all eagle layers belong to the same KV - # cache group, thus using the same attention metadata. - eagle_attn_metadata = attn_metadata[ - self.drafter.attn_layer_names[0]] - - # NOTE: deepseek_mtp uses MLA which does not have `block_table` - if hasattr(eagle_attn_metadata, "block_table"): - block_table = eagle_attn_metadata.block_table - else: - block_table = None if spec_decode_metadata is None: # input_ids can be None for multimodal models. @@ -1630,8 +1616,6 @@ def propose_draft_token_ids( dim=-1) else: target_hidden_states = hidden_states[:num_scheduled_tokens] - target_slot_mapping = eagle_attn_metadata.slot_mapping - cu_num_tokens = eagle_attn_metadata.query_start_loc else: # TODO(woosuk): Refactor this. num_draft_tokens = spec_decode_metadata.num_draft_tokens @@ -1639,17 +1623,12 @@ def propose_draft_token_ids( n + 1 - len(sampled_token_ids[i]) if n > 0 else 0 for i, n in enumerate(num_draft_tokens) ] - num_rejected_tokens_tensor = async_tensor_h2d( - num_rejected_tokens, - dtype=torch.int32, - target_device=self.device, - pin_memory=True) - num_tokens = num_scheduled_tokens - sum(num_rejected_tokens) - cu_num_tokens, token_indices = self.drafter.prepare_inputs( - eagle_attn_metadata.query_start_loc, - num_rejected_tokens_tensor, - num_tokens, - ) + num_rejected_tokens_cpu = torch.tensor(num_rejected_tokens, + dtype=torch.int32) + common_attn_metadata, token_indices =\ + self.drafter.prepare_inputs( + common_attn_metadata, num_rejected_tokens_cpu) + target_token_ids = self.input_ids[token_indices] # TODO(woosuk): Support M-RoPE. 
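            # (Illustrative walkthrough, not part of the original patch:
            # prepare_inputs() above keeps, for each request, only its first
            # q_i - n_i scheduled positions. E.g. with old query lens
            # [3, 5, 4] and num_rejected_tokens [1, 1, 1], token_indices is
            # [0, 1, 3, 4, 5, 6, 8, 9, 10], so the gathers here drop exactly
            # the rejected tail of every request.)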
target_positions = self.positions[token_indices] @@ -1658,17 +1637,13 @@ def propose_draft_token_ids( [h[token_indices] for h in aux_hidden_states], dim=-1) else: target_hidden_states = hidden_states[token_indices] - target_slot_mapping = eagle_attn_metadata.slot_mapping[ - token_indices] draft_token_ids = self.drafter.propose( target_token_ids=target_token_ids, target_positions=target_positions, target_hidden_states=target_hidden_states, - target_slot_mapping=target_slot_mapping, next_token_ids=next_token_ids, - cu_num_tokens=cu_num_tokens, - block_table=block_table, sampling_metadata=sampling_metadata, + common_attn_metadata=common_attn_metadata, ) spec_token_ids = draft_token_ids.tolist() return spec_token_ids @@ -1970,24 +1945,29 @@ def _dummy_run( if capture_attn_cudagraph: attn_metadata = {} - query_start_loc = self.query_start_loc[:num_reqs + 1] # Make sure max_model_len is used at the graph capture time. self.seq_lens_np[:num_reqs] = self.max_model_len self.seq_lens_np[num_reqs:] = 0 self.seq_lens[:num_reqs].copy_(self.seq_lens_cpu[:num_reqs], non_blocking=True) - seq_lens = self.seq_lens[:num_reqs] - - common_attn_metadata = CommonAttentionMetadata( - query_start_loc=query_start_loc, - seq_lens=seq_lens, - num_reqs=num_reqs, - num_actual_tokens=num_tokens, - max_query_len=num_tokens, - ) for kv_cache_group_id, kv_cache_group_spec in enumerate( self.kv_cache_config.kv_cache_groups): + common_attn_metadata = CommonAttentionMetadata( + query_start_loc=self.query_start_loc[:num_reqs + 1], + query_start_loc_cpu=self.query_start_loc_cpu[:num_reqs + + 1], + seq_lens=self.seq_lens[:num_reqs], + seq_lens_cpu=self.seq_lens_cpu[:num_reqs], + num_computed_tokens_cpu=self.input_batch. + num_computed_tokens_cpu_tensor[:num_reqs], + num_reqs=num_reqs, + num_actual_tokens=num_tokens, + max_query_len=num_tokens, + block_table_tensor=self.input_batch.block_table[ + kv_cache_group_id].get_device_tensor()[:num_reqs], + slot_mapping=self.input_batch. + block_table[kv_cache_group_id].slot_mapping[:num_reqs]) attn_metadata_i = self.attn_metadata_builders[ kv_cache_group_id].build_for_cudagraph_capture( @@ -2339,11 +2319,10 @@ def initialize_attn_backend(self, kv_cache_config: KVCacheConfig) -> None: raise ValueError( f"Unknown KV cache spec type: {type(kv_cache_spec)}") - block_table_i = self.input_batch.block_table[i] attn_metadata_builder_i = attn_backend_i.get_builder_cls()( - weakref.proxy(self), kv_cache_spec, - block_table_i, + self.vllm_config, + self.device, ) if (self.full_cuda_graph From e9ea31d3f8fceb0584ddccb93a7f59158a6f127b Mon Sep 17 00:00:00 2001 From: Zhonghua Deng Date: Thu, 17 Jul 2025 13:13:00 +0800 Subject: [PATCH 147/552] [V1][P/D]Enhance Performance and code readability for P2pNcclConnector (#20906) Signed-off-by: Abatom Signed-off-by: x22x22 --- docs/design/v1/p2p_nccl_connector.md | 92 ++--- .../disagg_proxy_p2p_nccl_xpyd.py | 39 +- .../kv_connector/v1/p2p/p2p_nccl_connector.py | 38 +- .../kv_connector/v1/p2p/p2p_nccl_engine.py | 353 ++++++++++-------- 4 files changed, 266 insertions(+), 256 deletions(-) diff --git a/docs/design/v1/p2p_nccl_connector.md b/docs/design/v1/p2p_nccl_connector.md index b1df93cfc85..8f6a2b3b2dd 100644 --- a/docs/design/v1/p2p_nccl_connector.md +++ b/docs/design/v1/p2p_nccl_connector.md @@ -31,7 +31,7 @@ Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (cur ## KV Cache Transfer Methods -There are three methods for KVcache transfer: PUT, GET, and PUT_ASYNC. 
These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVcache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVcache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVcache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVcache from the P instance once it has allocated space for the KVcache. +There are three methods for KVCache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVCache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVCache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVCache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVCache from the P instance once it has allocated space for the KVCache. Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT. @@ -39,13 +39,13 @@ Experimental results have shown that the performance of these methods, from high As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank and world size. To support dynamic scaling (expansion and contraction) of instances with PD disaggregation. This means that adding or removing P/D instances does not require a full system restart. -Each P/D instance only needs to create a single `P2pNcclEngine` instance. This instance maintains a ZMQ Server, which runs a dedicated thread to listen on the `zmq_addr` address and receive control flow requests from other instances. These requests include requests to establish an NCCL connection and requests to send KVcache metadata (such as tensor shapes and data types). However, it does not actually transmit the KVcache data itself. +Each P/D instance only needs to create a single `P2pNcclEngine` instance. This instance maintains a ZMQ Server, which runs a dedicated thread to listen on the `zmq_addr` address and receive control flow requests from other instances. These requests include requests to establish an NCCL connection and requests to send KVCache metadata (such as tensor shapes and data types). However, it does not actually transmit the KVCache data itself. -When a P instance and a D instance transmit KVcache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVcache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. 
As long as the address of the counterpart is known, point-to-point KVcache transmission can be performed, without being restricted by rank or world size. +When a P instance and a D instance transmit KVCache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVCache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KVCache transmission can be performed, without being restricted by rank or world size. ## NCCL Group Topology -Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVcache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance. +Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance. ![image2](https://github.com/user-attachments/assets/837e61d6-365e-4cbf-8640-6dd7ab295b36) @@ -53,32 +53,18 @@ Each NCCL group occupies a certain amount of GPU memory buffer for communication ## GPU Memory Buffer and Tensor Memory Pool -The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large. Similarly, for P instances in GET mode, the memory buffer should also not be too large. The memory buffer of D instances is used to temporarily store KVcache sent by P instances. If it is too large, it will reduce the KVcache space available for normal inference by D instances, thereby decreasing the inference batch size and ultimately leading to a reduction in output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the memory size. +The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large. Similarly, for P instances in GET mode, the memory buffer should also not be too large. The memory buffer of D instances is used to temporarily store KVCache sent by P instances. 
If it is too large, it will reduce the KVCache space available for normal inference by D instances, thereby decreasing the inference batch size and ultimately leading to a reduction in output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the memory size. -If the `--max-num-seqs` parameter for P instances is set to a large value, due to the large batch size, P instances will generate a large amount of KVcache simultaneously. This may exceed the capacity of the memory buffer of D instances, resulting in KVcache loss. Once KVcache is lost, D instances need to recompute Prefill, which is equivalent to performing Prefill twice. Consequently, the time-to-first-token (TTFT) will significantly increase, leading to degraded performance. +If the `--max-num-seqs` parameter for P instances is set to a large value, due to the large batch size, P instances will generate a large amount of KVCache simultaneously. This may exceed the capacity of the memory buffer of D instances, resulting in KVCache loss. Once KVCache is lost, D instances need to recompute Prefill, which is equivalent to performing Prefill twice. Consequently, the time-to-first-token (TTFT) will significantly increase, leading to degraded performance. -To address the above issues, I have designed and developed a local Tensor memory pool for storing KVcache, inspired by the buddy system used in Linux memory modules. Since the memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or using block-based designs to reuse memory, thereby saving space. When the memory buffer is insufficient, KVcache can be directly stored in the Tensor memory pool, and D instances can subsequently retrieve KVcache from it. The read and write speed is that of PCIe, with PCIe 4.0 having a speed of approximately 21 GB/s, which is usually faster than the Prefill speed. Otherwise, solutions like Mooncake and lmcache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst-case scenario, my solution performs no worse than the normal situation with a Cache store. +To address the above issues, I have designed and developed a local Tensor memory pool for storing KVCache, inspired by the buddy system used in Linux memory modules. Since the memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or using block-based designs to reuse memory, thereby saving space. When the memory buffer is insufficient, KVCache can be directly stored in the Tensor memory pool, and D instances can subsequently retrieve KVCache from it. The read and write speed is that of PCIe, with PCIe 4.0 having a speed of approximately 21 GB/s, which is usually faster than the Prefill speed. Otherwise, solutions like Mooncake and lmcache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst-case scenario, my solution performs no worse than the normal situation with a Cache store. # Install vLLM ??? console "Commands" ```shell - # Enter the home directory or your working directory. - cd /home - - # Download the installation package, and I will update the commit-id in time. You can directly copy the command. 
- wget https://vllm-wheels.s3.us-west-2.amazonaws.com/9112b443a042d8d815880b8780633882ad32b183/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl - - # Download the code repository. - git clone -b xpyd-v1 https://github.com/Abatom/vllm.git - cd vllm - - # Set the installation package path. - export VLLM_PRECOMPILED_WHEEL_LOCATION=/home/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl - - # installation - pip install -e . -v + pip install "vllm>=0.9.2" ``` # Run xPyD @@ -90,7 +76,7 @@ To address the above issues, I have designed and developed a local Tensor memory - You may need to modify the `kv_buffer_size` and `port` in the following commands (if there is a conflict). - `PUT_ASYNC` offers the best performance and should be prioritized. - The `--port` must be consistent with the `http_port` in the `--kv-transfer-config`. -- The `disagg_prefill_proxy_xpyd.py` script will use port 10001 (for receiving client requests) and port 30001 (for receiving service discovery from P and D instances). +- The `disagg_proxy_p2p_nccl_xpyd.py` script will use port 10001 (for receiving client requests) and port 30001 (for receiving service discovery from P and D instances). - The node running the proxy must have `quart` installed. - Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`. - In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**. @@ -100,8 +86,8 @@ To address the above issues, I have designed and developed a local Tensor memory ### Proxy (e.g. 10.0.1.1) ```shell -cd {your vllm directory}/examples/online_serving/disagg_xpyd/ -python3 disagg_prefill_proxy_xpyd.py & +cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/ +python3 disagg_proxy_p2p_nccl_xpyd.py & ``` ### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1) @@ -111,7 +97,7 @@ python3 disagg_prefill_proxy_xpyd.py & ```shell VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \ --host 0.0.0.0 \ - --port 20005 \ + --port 20001 \ --tensor-parallel-size 1 \ --seed 1024 \ --served-model-name base_model \ @@ -123,7 +109,7 @@ python3 disagg_prefill_proxy_xpyd.py & --gpu-memory-utilization 0.9 \ --disable-log-request \ --kv-transfer-config \ - '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & + '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 & ``` ### Decode1 (e.g. 
10.0.1.3 or 10.0.1.1) @@ -133,7 +119,7 @@ python3 disagg_prefill_proxy_xpyd.py & ```shell VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \ --host 0.0.0.0 \ - --port 20009 \ + --port 20002 \ --tensor-parallel-size 1 \ --seed 1024 \ --served-model-name base_model \ @@ -145,7 +131,7 @@ python3 disagg_prefill_proxy_xpyd.py & --gpu-memory-utilization 0.7 \ --disable-log-request \ --kv-transfer-config \ - '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & + '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 & ``` ### Decode2 (e.g. 10.0.1.4 or 10.0.1.1) @@ -167,7 +153,7 @@ python3 disagg_prefill_proxy_xpyd.py & --gpu-memory-utilization 0.7 \ --disable-log-request \ --kv-transfer-config \ - '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & + '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 & ``` ### Decode3 (e.g. 10.0.1.5 or 10.0.1.1) @@ -177,7 +163,7 @@ python3 disagg_prefill_proxy_xpyd.py & ```shell VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \ --host 0.0.0.0 \ - --port 20008 \ + --port 20004 \ --tensor-parallel-size 1 \ --seed 1024 \ --served-model-name base_model \ @@ -189,7 +175,7 @@ python3 disagg_prefill_proxy_xpyd.py & --gpu-memory-utilization 0.7 \ --disable-log-request \ --kv-transfer-config \ - '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20008","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & + '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 & ``` ## Run 3P1D @@ -197,8 +183,8 @@ python3 disagg_prefill_proxy_xpyd.py & ### Proxy (e.g. 10.0.1.1) ```shell -cd {your vllm directory}/examples/online_serving/disagg_xpyd/ -python3 disagg_prefill_proxy_xpyd.py & +cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/ +python3 disagg_proxy_p2p_nccl_xpyd.py & ``` ### Prefill1 (e.g. 
10.0.1.2 or 10.0.1.1) @@ -208,7 +194,7 @@ python3 disagg_prefill_proxy_xpyd.py & ```shell VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \ --host 0.0.0.0 \ - --port 20005 \ + --port 20001 \ --tensor-parallel-size 1 \ --seed 1024 \ --served-model-name base_model \ @@ -220,7 +206,7 @@ python3 disagg_prefill_proxy_xpyd.py & --gpu-memory-utilization 0.9 \ --disable-log-request \ --kv-transfer-config \ - '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & + '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 & ``` ### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1) @@ -230,7 +216,7 @@ python3 disagg_prefill_proxy_xpyd.py & ```shell VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \ --host 0.0.0.0 \ - --port 20009 \ + --port 20002 \ --tensor-parallel-size 1 \ --seed 1024 \ --served-model-name base_model \ @@ -242,7 +228,7 @@ python3 disagg_prefill_proxy_xpyd.py & --gpu-memory-utilization 0.9 \ --disable-log-request \ --kv-transfer-config \ - '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & + '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 & ``` ### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1) @@ -264,7 +250,7 @@ python3 disagg_prefill_proxy_xpyd.py & --gpu-memory-utilization 0.9 \ --disable-log-request \ --kv-transfer-config \ - '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & + '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 & ``` ### Decode1 (e.g. 
10.0.1.5 or 10.0.1.1) @@ -274,7 +260,7 @@ python3 disagg_prefill_proxy_xpyd.py & ```shell VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \ --host 0.0.0.0 \ - --port 20008 \ + --port 20004 \ --tensor-parallel-size 1 \ --seed 1024 \ --served-model-name base_model \ @@ -286,7 +272,7 @@ python3 disagg_prefill_proxy_xpyd.py & --gpu-memory-utilization 0.7 \ --disable-log-request \ --kv-transfer-config \ - '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20008","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & + '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 & ``` # Single request @@ -334,24 +320,6 @@ pgrep python | xargs kill -9 && pkill -f python # Test data -## **Scenario 1**: 1K input & 1K output tokens, E2E P99 latency ~20s -- **1P5D (6×A800) vs vLLM (1×A800)**: - - Throughput ↑7.2% (1085 → 6979/6) - - ITL (P99) ↓81.3% (120ms → 22.9ms) - - TTFT (P99) ↑26.8% (175ms → 222ms) - - TPOT: No change - -- **1P6D (7×A800) vs vLLM (1×A800)**: - - Throughput ↑9.6% (1085 → 8329/7) - - ITL (P99) ↓81.0% (120ms → 22.7ms) - - TTFT (P99) ↑210% (175ms →543ms) - - TPOT: No change - -## **Scenario 2**: 1K input & 200 output tokens, E2E P99 latency ~4s -- **1P1D (2×A800) vs vLLM (1×A800)**: - - Throughput ↑37.4% (537 → 1476/2) - - ITL (P99) ↓81.8% (127ms → 23.1ms) - - TTFT (P99) ↑41.8% (160ms → 227ms) - - TPOT: No change - -![testdata](https://github.com/user-attachments/assets/f791bfc7-9f3d-4e5c-9171-a42f9f4da627) +## **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s + +![testdata](https://github.com/user-attachments/assets/cef0953b-4567-4bf9-b940-405b92a28eb1) diff --git a/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_proxy_p2p_nccl_xpyd.py b/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_proxy_p2p_nccl_xpyd.py index 4e82424d6cd..ec58a183061 100644 --- a/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_proxy_p2p_nccl_xpyd.py +++ b/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_proxy_p2p_nccl_xpyd.py @@ -4,7 +4,9 @@ import os import socket import threading +import time import uuid +from typing import Any import aiohttp import msgpack @@ -12,12 +14,25 @@ from quart import Quart, make_response, request count = 0 -prefill_instances: dict[str, str] = {} # http_address: zmq_address -decode_instances: dict[str, str] = {} # http_address: zmq_address +prefill_instances: dict[str, Any] = {} # http_address: (zmq_address, stamp) +decode_instances: dict[str, Any] = {} # http_address: (zmq_address, stamp) prefill_cv = threading.Condition() decode_cv = threading.Condition() +DEFAULT_PING_SECONDS = 5 + + +def _remove_oldest_instances(instances: dict[str, Any]) -> None: + oldest_key = next(iter(instances), None) + while oldest_key is not None: + value = instances[oldest_key] + if value[1] > time.time(): + break + print(f"🔴Remove [HTTP:{oldest_key}, ZMQ:{value[0]}, stamp:{value[1]}]") + instances.pop(oldest_key, None) + oldest_key = next(iter(instances), None) + def _listen_for_register(poller, router_socket): while True: @@ -31,12 +46,23 @@ def _listen_for_register(poller, router_socket): global prefill_instances global prefill_cv with prefill_cv: - 
prefill_instances[data["http_address"]] = data["zmq_address"] + node = prefill_instances.pop(data["http_address"], None) + prefill_instances[data["http_address"]] = ( + data["zmq_address"], + time.time() + DEFAULT_PING_SECONDS, + ) + _remove_oldest_instances(prefill_instances) + elif data["type"] == "D": global decode_instances global decode_cv with decode_cv: - decode_instances[data["http_address"]] = data["zmq_address"] + node = decode_instances.pop(data["http_address"], None) + decode_instances[data["http_address"]] = ( + data["zmq_address"], + time.time() + DEFAULT_PING_SECONDS, + ) + _remove_oldest_instances(decode_instances) else: print( "Unexpected, Received message from %s, data: %s", @@ -44,6 +70,9 @@ def _listen_for_register(poller, router_socket): data, ) + if node is None: + print(f"🔵Add [HTTP:{data['http_address']}, ZMQ:{data['zmq_address']}]") + def start_service_discovery(hostname, port): if not hostname: @@ -105,12 +134,14 @@ async def handle_request(): with prefill_cv: prefill_list = list(prefill_instances.items()) prefill_addr, prefill_zmq_addr = prefill_list[count % len(prefill_list)] + prefill_zmq_addr = prefill_zmq_addr[0] global decode_instances global decode_cv with decode_cv: decode_list = list(decode_instances.items()) decode_addr, decode_zmq_addr = decode_list[count % len(decode_list)] + decode_zmq_addr = decode_zmq_addr[0] print( f"handle_request count: {count}, [HTTP:{prefill_addr}, " diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py b/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py index 52f589a6d71..d47a75461d7 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py @@ -13,7 +13,6 @@ from vllm.distributed.kv_transfer.kv_connector.v1.p2p.p2p_nccl_engine import ( P2pNcclEngine) from vllm.distributed.parallel_state import get_world_group -from vllm.forward_context import get_forward_context from vllm.logger import init_logger from vllm.v1.attention.backends.mla.common import MLACommonMetadata from vllm.v1.core.sched.output import SchedulerOutput @@ -238,32 +237,16 @@ def save_kv_layer(self, layer_name: str, kv_layer: torch.Tensor, assert self.p2p_nccl_engine is not None - def extract_kv_from_layer( - layer: torch.Tensor, - slot_mapping: torch.Tensor, - ) -> torch.Tensor: - """Extract the KV cache from the layer. - - Assume the shape of the layer is (2, num_pages, page_size, xxx) - if MLA is not used, and (num_pages, page_size, xxx) otherwise. - """ - if isinstance(attn_metadata, MLACommonMetadata): - num_pages, page_size = layer.shape[0], layer.shape[1] - return layer.reshape(num_pages * page_size, -1)[slot_mapping, - ...] - num_pages, page_size = layer.shape[1], layer.shape[2] - return layer.reshape(2, num_pages * page_size, -1)[:, slot_mapping, - ...] 
- connector_metadata = self._get_connector_metadata() assert isinstance(connector_metadata, P2pNcclConnectorMetadata) for request in connector_metadata.requests: request_id = request.request_id ip, port = self.parse_request_id(request_id, True) remote_address = ip + ":" + str(port + self._rank) - kv_cache = extract_kv_from_layer(kv_layer, request.slot_mapping) - self.p2p_nccl_engine.send_tensor(request_id + "#" + layer_name, - kv_cache, remote_address) + self.p2p_nccl_engine.send_tensor( + request_id + "#" + layer_name, kv_layer, remote_address, + request.slot_mapping, + isinstance(attn_metadata, MLACommonMetadata)) def wait_for_save(self): if self.is_producer: @@ -286,9 +269,10 @@ def get_finished( assert self.p2p_nccl_engine is not None - forward_context: ForwardContext = get_forward_context() + no_compile_layers = ( + self._vllm_config.compilation_config.static_forward_context) return self.p2p_nccl_engine.get_finished(finished_req_ids, - forward_context) + no_compile_layers) # ============================== # Scheduler-side methods @@ -418,14 +402,6 @@ def build_connector_meta( block_ids=block_ids, block_size=self._block_size) - # Requests loaded asynchronously are not in the scheduler_output. - # for request_id in self._requests_need_load: - # request, block_ids = self._requests_need_load[request_id] - # meta.add_request(request_id=request.request_id, - # token_ids=request.prompt_token_ids, - # block_ids=block_ids, - # block_size=self._block_size) - self._requests_need_load.clear() return meta diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py b/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py index 6c9ccb2e301..b94f2296dcb 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py @@ -8,7 +8,8 @@ import typing from collections import deque from contextlib import contextmanager -from typing import TYPE_CHECKING, Any, Optional +from dataclasses import dataclass +from typing import Any, Optional import msgpack import torch @@ -21,9 +22,6 @@ TensorMemoryPool) from vllm.utils import current_stream, get_ip -if TYPE_CHECKING: - from vllm.forward_context import ForwardContext - logger = logging.getLogger(__name__) DEFAULT_MEM_POOL_SIZE_GB = 32 @@ -59,6 +57,15 @@ def set_p2p_nccl_context(num_channels: str): os.environ.pop(var, None) +@dataclass +class SendQueueItem: + tensor_id: str + remote_address: str + tensor: torch.Tensor + slot_mapping: torch.Tensor + is_mla: bool + + class P2pNcclEngine: def __init__(self, @@ -112,24 +119,26 @@ def __init__(self, self.send_stream = torch.cuda.Stream() self.recv_stream = torch.cuda.Stream() - mem_pool_size_gb = self.config.get_from_extra_config( - "mem_pool_size_gb", DEFAULT_MEM_POOL_SIZE_GB) - self.pool = TensorMemoryPool(max_block_size=int(mem_pool_size_gb) * - 1024**3) # GB + mem_pool_size_gb = float( + self.config.get_from_extra_config("mem_pool_size_gb", + DEFAULT_MEM_POOL_SIZE_GB)) + self.pool = TensorMemoryPool(max_block_size=int(mem_pool_size_gb * + 1024**3)) # GB # The sending type includes tree mutually exclusive options: # PUT, GET, PUT_ASYNC. 
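        # (Illustrative note, not part of the original patch: the value is
        # read from kv_connector_extra_config, e.g.
        #   --kv-transfer-config '{..., "kv_connector_extra_config":
        #                          {"send_type": "GET"}}'
        # and PUT_ASYNC becomes the default when the key is omitted.)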
- self.send_type = self.config.get_from_extra_config("send_type", "PUT") + self.send_type = self.config.get_from_extra_config( + "send_type", "PUT_ASYNC") if self.send_type == "GET": # tensor_id: torch.Tensor self.send_store: dict[str, torch.Tensor] = {} else: # PUT or PUT_ASYNC # tensor_id: torch.Tensor - self.send_queue: deque[list[Any]] = deque() + self.send_queue: deque[SendQueueItem] = deque() self.send_request_id_to_tensor_ids: dict[str, set[str]] = {} if self.send_type == "PUT_ASYNC": - self._send_thread = threading.Thread(target=self._send_async, + self._send_thread = threading.Thread(target=self.send_async, daemon=True) self._send_thread.start() @@ -146,13 +155,12 @@ def __init__(self, "nccl_num_channels", "8") self._listener_thread = threading.Thread( - target=self._listen_for_requests, daemon=True) + target=self.listen_for_requests, daemon=True) self._listener_thread.start() self._ping_thread = None if port_offset == 0 and self.proxy_address != "": - self._ping_thread = threading.Thread(target=self._ping, - daemon=True) + self._ping_thread = threading.Thread(target=self.ping, daemon=True) self._ping_thread.start() logger.info( @@ -162,7 +170,7 @@ def __init__(self, self.http_address, self.zmq_address, self.proxy_address, self.send_type, self.buffer_size_threshold, self.nccl_num_channels) - def _create_connect(self, remote_address: typing.Optional[str] = None): + def create_connect(self, remote_address: typing.Optional[str] = None): assert remote_address is not None if remote_address not in self.socks: sock = self.context.socket(zmq.DEALER) @@ -184,7 +192,7 @@ def _create_connect(self, remote_address: typing.Optional[str] = None): comm: ncclComm_t = self.nccl.ncclCommInitRank( 2, unique_id, rank) self.comms[remote_address] = (comm, rank) - logger.info("🤝ncclCommInitRank Success, %s👉%s, MyRank: %s", + logger.info("🤝ncclCommInitRank Success, %s👉%s, MyRank:%s", self.zmq_address, remote_address, rank) return self.socks[remote_address], self.comms[remote_address] @@ -194,44 +202,54 @@ def send_tensor( tensor_id: str, tensor: torch.Tensor, remote_address: typing.Optional[str] = None, + slot_mapping: torch.Tensor = None, + is_mla: bool = False, ) -> bool: if remote_address is None: with self.recv_store_cv: self.recv_store[tensor_id] = tensor self.recv_store_cv.notify() return True - else: - if self.send_type == "PUT": - return self._send_sync(tensor_id, tensor, remote_address) - elif self.send_type == "PUT_ASYNC": - with self.send_queue_cv: - self.send_queue.append([tensor_id, remote_address, tensor]) - self.send_queue_cv.notify() - else: # GET - with self.send_store_cv: - tensor_size = tensor.element_size() * tensor.numel() - while (self.buffer_size + tensor_size - > self.buffer_size_threshold): - oldest_tenser_id = next(iter(self.send_store)) - oldest_tenser = self.send_store.pop(oldest_tenser_id) - oldest_tenser_size = oldest_tenser.element_size( - ) * oldest_tenser.numel() - self.buffer_size -= oldest_tenser_size - logger.info( - "⛔[GET]Send to %s, tensor_id:%s, tensor_size:%d," - " buffer_size:%d, oldest_tenser_size:%d, rank:%d", - remote_address, tensor_id, tensor_size, - self.buffer_size, oldest_tenser_size, self.rank) - - self.send_store[tensor_id] = tensor - self.buffer_size += tensor_size - logger.debug( - "🔵[GET]Send to %s, tensor_id:%s, tensor_size:%d, " - "shape:%s, rank:%d, buffer_size:%d(%.2f%%)", - remote_address, tensor_id, tensor_size, tensor.shape, - self.rank, self.buffer_size, - self.buffer_size / self.buffer_size_threshold * 100) + item = 
SendQueueItem(tensor_id=tensor_id, + remote_address=remote_address, + tensor=tensor, + slot_mapping=slot_mapping, + is_mla=is_mla) + + if self.send_type == "PUT": + return self.send_sync(item) + + if self.send_type == "PUT_ASYNC": + with self.send_queue_cv: + self.send_queue.append(item) + self.send_queue_cv.notify() + return True + + # GET + with self.send_store_cv: + tensor_size = tensor.element_size() * tensor.numel() + while (self.buffer_size + tensor_size + > self.buffer_size_threshold): + oldest_tenser_id = next(iter(self.send_store)) + oldest_tenser = self.send_store.pop(oldest_tenser_id) + oldest_tenser_size = oldest_tenser.element_size( + ) * oldest_tenser.numel() + self.buffer_size -= oldest_tenser_size + logger.info( + "⛔[GET]Send to %s, tensor_id:%s, tensor_size:%d," + " buffer_size:%d, oldest_tenser_size:%d, rank:%d", + remote_address, tensor_id, tensor_size, self.buffer_size, + oldest_tenser_size, self.rank) + + self.send_store[tensor_id] = tensor + self.buffer_size += tensor_size + logger.debug( + "🔵[GET]Send to %s, tensor_id:%s, tensor_size:%d, " + "shape:%s, rank:%d, buffer_size:%d(%.2f%%)", remote_address, + tensor_id, tensor_size, tensor.shape, self.rank, + self.buffer_size, + self.buffer_size / self.buffer_size_threshold * 100) return True def recv_tensor( @@ -267,7 +285,7 @@ def recv_tensor( return None if remote_address not in self.socks: - self._create_connect(remote_address) + self.create_connect(remote_address) sock = self.socks[remote_address] comm, rank = self.comms[remote_address] @@ -282,121 +300,121 @@ def recv_tensor( remote_address, tensor_id, data["ret"]) return None - tensor = torch.empty(data["shape"], - dtype=getattr(torch, data["dtype"]), - device=self.device) + with torch.cuda.stream(self.recv_stream): + tensor = torch.empty(data["shape"], + dtype=getattr(torch, data["dtype"]), + device=self.device) - self._recv(comm, tensor, rank ^ 1, self.recv_stream) + self.recv(comm, tensor, rank ^ 1, self.recv_stream) return tensor - def _listen_for_requests(self): + def listen_for_requests(self): while True: socks = dict(self.poller.poll()) - if self.router_socket in socks: - remote_address, message = self.router_socket.recv_multipart() - data = msgpack.loads(message) - if data["cmd"] == "NEW": - unique_id = self.nccl.unique_id_from_bytes( - bytes(data["unique_id"])) - with torch.cuda.device(self.device): - rank = 1 - with set_p2p_nccl_context(self.nccl_num_channels): - comm: ncclComm_t = self.nccl.ncclCommInitRank( - 2, unique_id, rank) - self.comms[remote_address.decode()] = (comm, rank) - logger.info( - "🤝ncclCommInitRank Success, %s👈%s, MyRank:%s", - self.zmq_address, remote_address.decode(), rank) - elif data["cmd"] == "PUT": - tensor_id = data["tensor_id"] - try: - with torch.cuda.stream(self.recv_stream): - tensor = torch.empty(data["shape"], - dtype=getattr( - torch, data["dtype"]), - device=self.device) - self.router_socket.send_multipart( - [remote_address, b"0"]) - comm, rank = self.comms[remote_address.decode()] - self._recv(comm, tensor, rank ^ 1, self.recv_stream) - tensor_size = tensor.element_size() * tensor.numel() - if (self.buffer_size + tensor_size - > self.buffer_size_threshold): - # Store Tensor in memory pool - addr = self.pool.store_tensor(tensor) - tensor = (addr, tensor.dtype, tensor.shape) - logger.warning( - "🔴[PUT]Recv Tensor, Out Of Threshold, " - "%s👈%s, data:%s, addr:%d", self.zmq_address, - remote_address.decode(), data, addr) - else: - self.buffer_size += tensor_size - - except torch.cuda.OutOfMemoryError: - 
self.router_socket.send_multipart( - [remote_address, b"1"]) - tensor = None + if self.router_socket not in socks: + continue + + remote_address, message = self.router_socket.recv_multipart() + data = msgpack.loads(message) + if data["cmd"] == "NEW": + unique_id = self.nccl.unique_id_from_bytes( + bytes(data["unique_id"])) + with torch.cuda.device(self.device): + rank = 1 + with set_p2p_nccl_context(self.nccl_num_channels): + comm: ncclComm_t = self.nccl.ncclCommInitRank( + 2, unique_id, rank) + self.comms[remote_address.decode()] = (comm, rank) + logger.info("🤝ncclCommInitRank Success, %s👈%s, MyRank:%s", + self.zmq_address, remote_address.decode(), + rank) + elif data["cmd"] == "PUT": + tensor_id = data["tensor_id"] + try: + with torch.cuda.stream(self.recv_stream): + tensor = torch.empty(data["shape"], + dtype=getattr( + torch, data["dtype"]), + device=self.device) + self.router_socket.send_multipart([remote_address, b"0"]) + comm, rank = self.comms[remote_address.decode()] + self.recv(comm, tensor, rank ^ 1, self.recv_stream) + tensor_size = tensor.element_size() * tensor.numel() + if (self.buffer_size + tensor_size + > self.buffer_size_threshold): + # Store Tensor in memory pool + addr = self.pool.store_tensor(tensor) + tensor = (addr, tensor.dtype, tensor.shape) logger.warning( - "🔴[PUT]Recv Tensor, Out Of Memory, %s👈%s, " - "data:%s", self.zmq_address, - remote_address.decode(), data) - - with self.recv_store_cv: - self.recv_store[tensor_id] = tensor - self._have_received_tensor_id(tensor_id) - self.recv_store_cv.notify() - - elif data["cmd"] == "GET": - tensor_id = data["tensor_id"] - with self.send_store_cv: - tensor = self.send_store.pop(tensor_id, None) - if tensor is not None: - data = { - "ret": 0, - "shape": tensor.shape, - "dtype": - str(tensor.dtype).replace("torch.", "") - } - # LRU - self.send_store[tensor_id] = tensor - self._have_sent_tensor_id(tensor_id) - else: - data = {"ret": 1} - - self.router_socket.send_multipart( - [remote_address, msgpack.dumps(data)]) - - if data["ret"] == 0: - comm, rank = self.comms[remote_address.decode()] - self._send(comm, tensor.to(self.device), rank ^ 1, - self.send_stream) - else: + "🔴[PUT]Recv Tensor, Out Of Threshold, " + "%s👈%s, data:%s, addr:%d", self.zmq_address, + remote_address.decode(), data, addr) + else: + self.buffer_size += tensor_size + + except torch.cuda.OutOfMemoryError: + self.router_socket.send_multipart([remote_address, b"1"]) + tensor = None logger.warning( - "🚧Unexpected, Received message from %s, data:%s", - remote_address, data) + "🔴[PUT]Recv Tensor, Out Of Memory, %s👈%s, " + "data:%s", self.zmq_address, remote_address.decode(), + data) - def _have_sent_tensor_id(self, tensor_id: str): + with self.recv_store_cv: + self.recv_store[tensor_id] = tensor + self.have_received_tensor_id(tensor_id) + self.recv_store_cv.notify() + + elif data["cmd"] == "GET": + tensor_id = data["tensor_id"] + with self.send_store_cv: + tensor = self.send_store.pop(tensor_id, None) + if tensor is not None: + data = { + "ret": 0, + "shape": tensor.shape, + "dtype": str(tensor.dtype).replace("torch.", "") + } + # LRU + self.send_store[tensor_id] = tensor + self.have_sent_tensor_id(tensor_id) + else: + data = {"ret": 1} + + self.router_socket.send_multipart( + [remote_address, msgpack.dumps(data)]) + + if data["ret"] == 0: + comm, rank = self.comms[remote_address.decode()] + self.send(comm, tensor.to(self.device), rank ^ 1, + self.send_stream) + else: + logger.warning( + "🚧Unexpected, Received message from %s, data:%s", + remote_address, data) 
+ + def have_sent_tensor_id(self, tensor_id: str): request_id = tensor_id.split('#')[0] if request_id not in self.send_request_id_to_tensor_ids: self.send_request_id_to_tensor_ids[request_id] = set() self.send_request_id_to_tensor_ids[request_id].add(tensor_id) - def _have_received_tensor_id(self, tensor_id: str): + def have_received_tensor_id(self, tensor_id: str): request_id = tensor_id.split('#')[0] if request_id not in self.recv_request_id_to_tensor_ids: self.recv_request_id_to_tensor_ids[request_id] = set() self.recv_request_id_to_tensor_ids[request_id].add(tensor_id) - def _send_async(self): + def send_async(self): while True: with self.send_queue_cv: while not self.send_queue: self.send_queue_cv.wait() - tensor_id, remote_address, tensor = self.send_queue.popleft() + item = self.send_queue.popleft() if not self.send_queue: self.send_queue_cv.notify() - self._send_sync(tensor_id, tensor, remote_address) + self.send_sync(item) def wait_for_sent(self): if self.send_type == "PUT_ASYNC": @@ -409,22 +427,21 @@ def wait_for_sent(self): "🚧[PUT_ASYNC]It took %.3fms to wait for the send_queue" " to be empty, rank:%d", duration * 1000, self.rank) - def _send_sync( - self, - tensor_id: str, - tensor: torch.Tensor, - remote_address: typing.Optional[str] = None, - ) -> bool: - if remote_address is None: + def send_sync(self, item: SendQueueItem) -> bool: + if item.remote_address is None: return False - if remote_address not in self.socks: - self._create_connect(remote_address) + if item.remote_address not in self.socks: + self.create_connect(item.remote_address) - sock = self.socks[remote_address] - comm, rank = self.comms[remote_address] + with self.send_stream: + tensor = self.extract_kv_from_layer(item.is_mla, item.tensor, + item.slot_mapping) + + sock = self.socks[item.remote_address] + comm, rank = self.comms[item.remote_address] data = { "cmd": "PUT", - "tensor_id": tensor_id, + "tensor_id": item.tensor_id, "shape": tensor.shape, "dtype": str(tensor.dtype).replace("torch.", "") } @@ -435,20 +452,21 @@ def _send_sync( logger.error( "🔴Send Tensor, Peer Out Of Memory/Threshold, %s 👉 %s, " "MyRank:%s, data:%s, tensor:%s, size:%fGB, response:%s", - self.zmq_address, remote_address, rank, data, tensor.shape, + self.zmq_address, item.remote_address, rank, data, + tensor.shape, tensor.element_size() * tensor.numel() / 1024**3, response.decode()) return False - self._send(comm, tensor.to(self.device), rank ^ 1, self.send_stream) + self.send(comm, tensor.to(self.device), rank ^ 1, self.send_stream) if self.send_type == "PUT_ASYNC": - self._have_sent_tensor_id(tensor_id) + self.have_sent_tensor_id(item.tensor_id) return True def get_finished( - self, finished_req_ids: set[str], forward_context: "ForwardContext" + self, finished_req_ids: set[str], no_compile_layers ) -> tuple[Optional[set[str]], Optional[set[str]]]: """ Notifies worker-side connector ids of requests that have @@ -463,7 +481,7 @@ def get_finished( # Clear the buffer upon request completion. 
for request_id in finished_req_ids: - for layer_name in forward_context.no_compile_layers: + for layer_name in no_compile_layers: tensor_id = request_id + "#" + layer_name if tensor_id in self.recv_store: with self.recv_store_cv: @@ -472,7 +490,6 @@ def get_finished( request_id, None) self.recv_request_id_to_tensor_ids.pop( request_id, None) - addr = 0 if isinstance(tensor, tuple): addr, _, _ = tensor self.pool.free(addr) @@ -485,7 +502,7 @@ def get_finished( return finished_sending or None, finished_recving or None - def _ping(self): + def ping(self): sock = self.context.socket(zmq.DEALER) sock.setsockopt_string(zmq.IDENTITY, self.zmq_address) logger.debug("ping start, zmq_address:%s", self.zmq_address) @@ -499,7 +516,7 @@ def _ping(self): sock.send(msgpack.dumps(data)) time.sleep(3) - def _send(self, comm, tensor: torch.Tensor, dst: int, stream=None): + def send(self, comm, tensor: torch.Tensor, dst: int, stream=None): assert tensor.device == self.device, ( f"this nccl communicator is created to work on {self.device}, " f"but the input tensor is on {tensor.device}") @@ -512,7 +529,7 @@ def _send(self, comm, tensor: torch.Tensor, dst: int, stream=None): comm, cudaStream_t(stream.cuda_stream)) stream.synchronize() - def _recv(self, comm, tensor: torch.Tensor, src: int, stream=None): + def recv(self, comm, tensor: torch.Tensor, src: int, stream=None): assert tensor.device == self.device, ( f"this nccl communicator is created to work on {self.device}, " f"but the input tensor is on {tensor.device}") @@ -531,3 +548,21 @@ def close(self) -> None: self._send_thread.join() if self._ping_thread is not None: self._ping_thread.join() + + @staticmethod + def extract_kv_from_layer( + is_mla: bool, + layer: torch.Tensor, + slot_mapping: torch.Tensor, + ) -> torch.Tensor: + """Extract the KV cache from the layer. + Assume the shape of the layer is (2, num_pages, page_size, xxx) + if MLA is not used, and (num_pages, page_size, xxx) otherwise. + """ + if is_mla: + num_pages, page_size = layer.shape[0], layer.shape[1] + return layer.reshape(num_pages * page_size, -1)[slot_mapping, ...] + + num_pages, page_size = layer.shape[1], layer.shape[2] + return layer.reshape(2, num_pages * page_size, -1)[:, slot_mapping, + ...] 
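For illustration, here is a minimal standalone sketch of the indexing that the relocated `extract_kv_from_layer` helper performs once `send_sync` applies the queued `slot_mapping`. The layer layouts follow the docstring above; the concrete sizes and random tensors below are assumptions made only for this sketch and are not part of the patch.

```python
# Sketch only (not part of the patch): the slot_mapping gather done by
# P2pNcclEngine.extract_kv_from_layer, shown with small dummy shapes.
import torch

num_pages, page_size, head_dim = 4, 8, 16          # assumed sizes for illustration
slot_mapping = torch.tensor([3, 17, 30])            # flat slot = page * page_size + offset

# Non-MLA layout: K and V stacked on dim 0, gather the same slots from both.
layer = torch.randn(2, num_pages, page_size, head_dim)
kv = layer.reshape(2, num_pages * page_size, -1)[:, slot_mapping, ...]
assert kv.shape == (2, len(slot_mapping), head_dim)

# MLA layout: a single latent cache, so only one gather dimension.
mla_layer = torch.randn(num_pages, page_size, head_dim)
latent = mla_layer.reshape(num_pages * page_size, -1)[slot_mapping, ...]
assert latent.shape == (len(slot_mapping), head_dim)
```

In the patch this gather now runs inside `send_sync` under the dedicated send stream rather than in `save_kv_layer`, so the full KV layer is queued as a `SendQueueItem` and only the selected slots are materialized at send time.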
From 710f4ab01a029dca246a16324756247f613e6637 Mon Sep 17 00:00:00 2001 From: David Ben-David Date: Thu, 17 Jul 2025 08:29:45 +0300 Subject: [PATCH 148/552] [V1] [KVConnector] Fix MultiprocExecutor worker output aggregation (#21048) Signed-off-by: David Ben-David Co-authored-by: David Ben-David Signed-off-by: x22x22 --- tests/v1/executor/test_multiproc_executor.py | 127 +++++++++++++++++++ vllm/v1/executor/multiproc_executor.py | 6 +- 2 files changed, 129 insertions(+), 4 deletions(-) create mode 100644 tests/v1/executor/test_multiproc_executor.py diff --git a/tests/v1/executor/test_multiproc_executor.py b/tests/v1/executor/test_multiproc_executor.py new file mode 100644 index 00000000000..c1425d82bec --- /dev/null +++ b/tests/v1/executor/test_multiproc_executor.py @@ -0,0 +1,127 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import threading +from collections import defaultdict +from concurrent.futures import Future +from typing import Optional + +from vllm.v1.executor.multiproc_executor import MultiprocExecutor +from vllm.v1.outputs import ModelRunnerOutput + + +class DummyMultiprocExecutor(MultiprocExecutor): + + def __init__(self, output_rank, world_size): + # Manually initialize minimal required fields + self.output_rank = output_rank + self.world_size = world_size + self._send_remaining_count = defaultdict[str, + int](lambda: self.world_size) + self._recv_remaining_count = defaultdict[str, + int](lambda: self.world_size) + self.io_thread_pool = None + self.shutdown_event = threading.Event() + + +class DummyModelRunnerOutput(ModelRunnerOutput): + + def __init__(self, + finished_sending: Optional[set[str]] = None, + finished_recving: Optional[set[str]] = None): + self.finished_sending = finished_sending + self.finished_recving = finished_recving + + +def test_aggregate_workers_output(): + executor = DummyMultiprocExecutor(output_rank=0, world_size=2) + + output1 = DummyModelRunnerOutput(finished_sending={'req1'}, + finished_recving={'req2'}) + output2 = DummyModelRunnerOutput(finished_sending=None, + finished_recving=None) + + aggregated = executor._aggregate_workers_output([output1, output2]) + + assert aggregated is output1 + assert aggregated.finished_sending is None + assert aggregated.finished_recving is None + + output1 = DummyModelRunnerOutput(finished_sending=None, + finished_recving=None) + output2 = DummyModelRunnerOutput(finished_sending={'req1'}, + finished_recving=None) + + aggregated = executor._aggregate_workers_output([output1, output2]) + + assert aggregated is output1 + assert aggregated.finished_sending == {'req1'} + assert aggregated.finished_recving is None + + output1 = DummyModelRunnerOutput(finished_sending=None, + finished_recving=None) + output2 = DummyModelRunnerOutput(finished_sending={'req1'}, + finished_recving={'req2'}) + + aggregated = executor._aggregate_workers_output([output1, output2]) + + assert aggregated is output1 + assert aggregated.finished_sending is None + assert aggregated.finished_recving == {'req2'} + + +def test_async_aggregate_workers_output(): + executor = DummyMultiprocExecutor(output_rank=0, world_size=2) + + future1: Future[DummyModelRunnerOutput] = Future() + future2: Future[DummyModelRunnerOutput] = Future() + result_future = executor._async_aggregate_workers_output( + [future1, future2]) + + output1 = DummyModelRunnerOutput(finished_sending={'req1'}, + finished_recving={'req2'}) + output2 = DummyModelRunnerOutput(finished_sending=None, + finished_recving=None) + 
future1.set_result(output1) + future2.set_result(output2) + + assert result_future.done() + aggregated = result_future.result() + assert aggregated is output1 + assert aggregated.finished_sending is None + assert aggregated.finished_recving is None + + future1 = Future() + future2 = Future() + result_future = executor._async_aggregate_workers_output( + [future1, future2]) + + output1 = DummyModelRunnerOutput(finished_sending=None, + finished_recving=None) + output2 = DummyModelRunnerOutput(finished_sending={'req1'}, + finished_recving=None) + future1.set_result(output1) + future2.set_result(output2) + + assert result_future.done() + aggregated = result_future.result() + assert aggregated is output1 + assert aggregated.finished_sending == {'req1'} + assert aggregated.finished_recving is None + + future1 = Future() + future2 = Future() + result_future = executor._async_aggregate_workers_output( + [future1, future2]) + + output1 = DummyModelRunnerOutput(finished_sending=None, + finished_recving=None) + output2 = DummyModelRunnerOutput(finished_sending={'req1'}, + finished_recving={'req2'}) + future1.set_result(output1) + future2.set_result(output2) + + assert result_future.done() + aggregated = result_future.result() + assert aggregated is output1 + assert aggregated.finished_sending is None + assert aggregated.finished_recving == {'req2'} diff --git a/vllm/v1/executor/multiproc_executor.py b/vllm/v1/executor/multiproc_executor.py index 5960dd766c8..4a4144c4860 100644 --- a/vllm/v1/executor/multiproc_executor.py +++ b/vllm/v1/executor/multiproc_executor.py @@ -273,10 +273,8 @@ def update_finished_set(req_ids: Optional[set[str]], output = outputs[self.output_rank] # set the aggregated finished_sending / finished_recving - if finished_sending: - output.finished_sending = finished_sending - if finished_recving: - output.finished_recving = finished_recving + output.finished_sending = finished_sending if finished_sending else None + output.finished_recving = finished_recving if finished_recving else None return output From 41b0266571147ebb002b8861f7137429c00efd10 Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Thu, 17 Jul 2025 13:47:49 +0800 Subject: [PATCH 149/552] [Misc] Fix PhiMoE expert mapping (#21085) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- vllm/model_executor/models/phimoe.py | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/vllm/model_executor/models/phimoe.py b/vllm/model_executor/models/phimoe.py index 0fc64e88a6b..cfe0982204f 100644 --- a/vllm/model_executor/models/phimoe.py +++ b/vllm/model_executor/models/phimoe.py @@ -533,14 +533,9 @@ def load_weights(self, weights: Iterable[tuple[str, ("qkv_proj", "v_proj", "v"), ] - expert_params_mapping = FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="w1", - ckpt_down_proj_name="w2", - ckpt_up_proj_name="w3", - num_experts=self.config.num_local_experts) - params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: if (self.quant_config is not None and (scale_name := self.quant_config.get_cache_scale(name))): From 0e563adec48039165fbb9504afd3fa88876d4bee Mon Sep 17 00:00:00 2001 From: Chauncey Date: Thu, 17 Jul 2025 15:29:09 +0800 Subject: [PATCH 150/552] [Bugfix]: Fix final_res_batch list index out of range error (#21055) Signed-off-by: chaunceyjiang Signed-off-by: x22x22 --- .../v1/entrypoints/openai/test_completion.py | 18 +++- vllm/entrypoints/openai/serving_completion.py | 100 +++++++++++------- 
2 files changed, 78 insertions(+), 40 deletions(-) diff --git a/tests/v1/entrypoints/openai/test_completion.py b/tests/v1/entrypoints/openai/test_completion.py index 776fd42bbc3..2462f8f9f10 100644 --- a/tests/v1/entrypoints/openai/test_completion.py +++ b/tests/v1/entrypoints/openai/test_completion.py @@ -7,6 +7,7 @@ import pytest import pytest_asyncio import regex as re +import requests from openai import BadRequestError from tests.utils import RemoteOpenAIServer @@ -26,7 +27,8 @@ def default_server_args(): "2048", "--max-num-seqs", "128", - "--enforce-eager" + "--enforce-eager", + "--enable-prompt-tokens-details", ] @@ -679,3 +681,17 @@ async def test_invalid_grammar(client: openai.AsyncOpenAI, model_name: str): prompt=prompt, extra_body={"guided_grammar": invalid_simplified_sql_grammar}, ) + + +@pytest.mark.asyncio +async def test_completion_with_empty_prompt_embeds( + client: openai.AsyncOpenAI) -> None: + """Test completion with empty prompt embeds.""" + payload: dict[str, list] = {"prompt_embeds": []} + headers: dict[str, str] = {"Content-Type": "application/json"} + # base_url = http://localhost:8000/v1/completions + response = requests.post(f"{client.base_url}completions", + headers=headers, + json=payload) + assert response.status_code == 200, ( + f"Expected status code 200, got {response.status_code}. ") diff --git a/vllm/entrypoints/openai/serving_completion.py b/vllm/entrypoints/openai/serving_completion.py index eb9a35a7a37..1e1f655022f 100644 --- a/vllm/entrypoints/openai/serving_completion.py +++ b/vllm/entrypoints/openai/serving_completion.py @@ -60,20 +60,25 @@ def __init__( enable_prompt_tokens_details: bool = False, enable_force_include_usage: bool = False, ): - super().__init__(engine_client=engine_client, - model_config=model_config, - models=models, - request_logger=request_logger, - return_tokens_as_token_ids=return_tokens_as_token_ids, - enable_force_include_usage=enable_force_include_usage) + super().__init__( + engine_client=engine_client, + model_config=model_config, + models=models, + request_logger=request_logger, + return_tokens_as_token_ids=return_tokens_as_token_ids, + enable_force_include_usage=enable_force_include_usage, + ) self.enable_prompt_tokens_details = enable_prompt_tokens_details self.default_sampling_params = ( self.model_config.get_diff_sampling_param()) if self.default_sampling_params: source = self.model_config.generation_config source = "model" if source == "auto" else source - logger.info("Using default completion sampling params from %s: %s", - source, self.default_sampling_params) + logger.info( + "Using default completion sampling params from %s: %s", + source, + self.default_sampling_params, + ) async def create_completion( self, @@ -172,23 +177,28 @@ async def create_completion( max_model_len=self.max_model_len, request=request, input_length=input_length, - default_sampling_params=self.default_sampling_params) + default_sampling_params=self.default_sampling_params, + ) if request.use_beam_search: sampling_params = request.to_beam_search_params( max_tokens, self.default_sampling_params) else: sampling_params = request.to_sampling_params( - max_tokens, self.model_config.logits_processor_pattern, - self.default_sampling_params) + max_tokens, + self.model_config.logits_processor_pattern, + self.default_sampling_params, + ) request_id_item = f"{request_id}-{i}" - self._log_inputs(request_id_item, - request_prompts[i], - params=sampling_params, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + self._log_inputs( 
+ request_id_item, + request_prompts[i], + params=sampling_params, + lora_request=lora_request, + prompt_adapter_request=prompt_adapter_request, + ) trace_headers = (None if raw_request is None else await self._get_trace_headers(raw_request.headers)) @@ -245,7 +255,8 @@ async def create_completion( num_prompts=num_prompts, tokenizer=tokenizer, request_metadata=request_metadata, - enable_force_include_usage=self.enable_force_include_usage) + enable_force_include_usage=self.enable_force_include_usage, + ) # Non-streaming response final_res_batch: list[Optional[RequestOutput]] = [None] * num_prompts @@ -321,10 +332,10 @@ async def completion_stream_generator( stream_options = request.stream_options if stream_options: - include_usage = stream_options.include_usage or \ - enable_force_include_usage - include_continuous_usage = include_usage and \ - stream_options.continuous_usage_stats + include_usage = (stream_options.include_usage + or enable_force_include_usage) + include_continuous_usage = (include_usage and + stream_options.continuous_usage_stats) else: include_usage, include_continuous_usage = False, False @@ -370,7 +381,8 @@ async def completion_stream_generator( # echo the prompt and first token delta_text = prompt_text + output.text delta_token_ids = [ - *prompt_token_ids, *output.token_ids + *prompt_token_ids, + *output.token_ids, ] out_logprobs = [ *(prompt_logprobs or []), @@ -383,8 +395,8 @@ async def completion_stream_generator( delta_token_ids = output.token_ids out_logprobs = output.logprobs - if not delta_text and not delta_token_ids \ - and not previous_num_tokens[i]: + if (not delta_text and not delta_token_ids + and not previous_num_tokens[i]): # Chunked prefill case, don't return empty chunks continue @@ -420,7 +432,8 @@ async def completion_stream_generator( finish_reason=finish_reason, stop_reason=stop_reason, ) - ]) + ], + ) if include_continuous_usage: prompt_tokens = num_prompt_tokens[prompt_idx] completion_tokens = previous_num_tokens[i] @@ -438,7 +451,8 @@ async def completion_stream_generator( final_usage_info = UsageInfo( prompt_tokens=total_prompt_tokens, completion_tokens=total_completion_tokens, - total_tokens=total_prompt_tokens + total_completion_tokens) + total_tokens=total_prompt_tokens + total_completion_tokens, + ) if self.enable_prompt_tokens_details and num_cached_tokens: final_usage_info.prompt_tokens_details = PromptTokenUsageInfo( @@ -452,8 +466,8 @@ async def completion_stream_generator( choices=[], usage=final_usage_info, ) - final_usage_data = (final_usage_chunk.model_dump_json( - exclude_unset=False, exclude_none=True)) + final_usage_data = final_usage_chunk.model_dump_json( + exclude_unset=False, exclude_none=True) yield f"data: {final_usage_data}\n\n" # report to FastAPI middleware aggregate usage across all choices @@ -478,8 +492,10 @@ def request_output_to_completion_response( choices: list[CompletionResponseChoice] = [] num_prompt_tokens = 0 num_generated_tokens = 0 - + kv_transfer_params = None + last_final_res = None for final_res in final_res_batch: + last_final_res = final_res prompt_token_ids = final_res.prompt_token_ids assert prompt_token_ids is not None prompt_logprobs = clamp_prompt_logprobs(final_res.prompt_logprobs) @@ -548,19 +564,22 @@ def request_output_to_completion_response( total_tokens=num_prompt_tokens + num_generated_tokens, ) - if self.enable_prompt_tokens_details and final_res.num_cached_tokens: + if (self.enable_prompt_tokens_details and last_final_res + and last_final_res.num_cached_tokens): usage.prompt_tokens_details 
= PromptTokenUsageInfo( - cached_tokens=final_res.num_cached_tokens) + cached_tokens=last_final_res.num_cached_tokens) request_metadata.final_usage_info = usage - + if final_res_batch: + kv_transfer_params = final_res_batch[0].kv_transfer_params return CompletionResponse( id=request_id, created=created_time, model=model_name, choices=choices, usage=usage, - kv_transfer_params=final_res_batch[0].kv_transfer_params) + kv_transfer_params=kv_transfer_params, + ) def _create_completion_logprobs( self, @@ -579,8 +598,9 @@ def _create_completion_logprobs( last_token_len = 0 - should_return_as_token_id = return_as_token_id if \ - return_as_token_id is not None else self.return_tokens_as_token_ids + should_return_as_token_id = (return_as_token_id + if return_as_token_id is not None else + self.return_tokens_as_token_ids) for i, token_id in enumerate(token_ids): step_top_logprobs = top_logprobs[i] if step_top_logprobs is None: @@ -612,10 +632,12 @@ def _create_completion_logprobs( out_top_logprobs.append({ # Convert float("-inf") to the # JSON-serializable float that OpenAI uses - self._get_decoded_token(top_lp[1], - top_lp[0], - tokenizer, - return_as_token_id=should_return_as_token_id): + self._get_decoded_token( + top_lp[1], + top_lp[0], + tokenizer, + return_as_token_id=should_return_as_token_id, + ): max(top_lp[1].logprob, -9999.0) for i, top_lp in enumerate(step_top_logprobs.items()) if num_output_top_logprobs >= i From 7ee4141f86d55917a8817344dab51628d9fe0503 Mon Sep 17 00:00:00 2001 From: Varun Sundar Rabindranath Date: Thu, 17 Jul 2025 13:40:37 +0530 Subject: [PATCH 151/552] [Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903) Signed-off-by: Varun Sundar Rabindranath Co-authored-by: Varun Sundar Rabindranath Signed-off-by: x22x22 --- .../moe/modular_kernel_tools/cli_args.py | 1 - .../layers/fused_moe/batched_deep_gemm_moe.py | 1 + .../batched_triton_or_deep_gemm_moe.py | 7 +- .../layers/fused_moe/cutlass_moe.py | 1 + .../layers/fused_moe/deep_gemm_moe.py | 101 +++-- .../layers/fused_moe/deep_gemm_utils.py | 413 ++++++++++++++++++ .../layers/fused_moe/fused_batched_moe.py | 2 + .../layers/fused_moe/fused_moe.py | 1 + .../layers/fused_moe/modular_kernel.py | 16 +- .../layers/fused_moe/triton_deep_gemm_moe.py | 7 +- 10 files changed, 491 insertions(+), 59 deletions(-) create mode 100644 vllm/model_executor/layers/fused_moe/deep_gemm_utils.py diff --git a/tests/kernels/moe/modular_kernel_tools/cli_args.py b/tests/kernels/moe/modular_kernel_tools/cli_args.py index 261f1eb6e5c..b95d87cd04f 100644 --- a/tests/kernels/moe/modular_kernel_tools/cli_args.py +++ b/tests/kernels/moe/modular_kernel_tools/cli_args.py @@ -85,7 +85,6 @@ def to_quant_torch_dtype(s: str) -> torch.dtype: help="num topk") parser.add_argument( "--fused-moe-chunk-size", - nargs="+", type=int, help="Fused moe chunk size used for the non-batched fused experts impl." 
) diff --git a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py index 0b394329215..e61d350388e 100644 --- a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py @@ -239,6 +239,7 @@ def workspace_shapes( topk: int, global_num_experts: int, local_num_experts: int, + expert_tokens_metadata: Optional[mk.ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: assert a.dim() == 2 # FIXME (varun): We should be able to dispatch only from the leader diff --git a/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py index 12df9bb34d2..1a63b323734 100644 --- a/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py @@ -116,6 +116,7 @@ def workspace_shapes( topk: int, global_num_experts: int, local_num_experts: int, + expert_tokens_metadata: Optional[mk.ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: # Note: the deep gemm workspaces are strictly larger than the triton # workspaces so we can be pessimistic here and allocate for DeepGemm @@ -123,11 +124,13 @@ def workspace_shapes( if self.allow_deep_gemm: assert self.batched_deep_gemm_experts is not None return self.batched_deep_gemm_experts.workspace_shapes( - a, aq, M, N, K, topk, global_num_experts, local_num_experts) + a, aq, M, N, K, topk, global_num_experts, local_num_experts, + expert_tokens_metadata) else: assert self.batched_triton_experts is not None return self.batched_triton_experts.workspace_shapes( - a, aq, M, N, K, topk, global_num_experts, local_num_experts) + a, aq, M, N, K, topk, global_num_experts, local_num_experts, + expert_tokens_metadata) def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, topk_weights: torch.Tensor, diff --git a/vllm/model_executor/layers/fused_moe/cutlass_moe.py b/vllm/model_executor/layers/fused_moe/cutlass_moe.py index e479f1b4044..d09161ead46 100644 --- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py +++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py @@ -271,6 +271,7 @@ def workspace_shapes( topk: int, global_num_experts: int, local_num_experts: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: workspace1: tuple[int, ...] = () workspace2: tuple[int, ...] 
= () diff --git a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py index cc5e7cf5714..bb462938a39 100644 --- a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py @@ -8,16 +8,16 @@ import vllm.model_executor.layers.fused_moe.modular_kernel as mk from vllm.logger import init_logger from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig -from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import ( - _moe_permute) +from vllm.model_executor.layers.fused_moe.deep_gemm_utils import ( + compute_aligned_M, deepgemm_moe_permute, deepgemm_unpermute_and_reduce) from vllm.model_executor.layers.fused_moe.prepare_finalize import ( MoEPrepareAndFinalizeNoEP) from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( - TopKWeightAndReduceContiguous, TopKWeightAndReduceNoOP) + TopKWeightAndReduceNoOP) from vllm.model_executor.layers.fused_moe.utils import _resize_cache from vllm.model_executor.layers.quantization.utils.fp8_utils import ( per_token_group_quant_fp8) -from vllm.utils import has_deep_gemm, round_up +from vllm.utils import has_deep_gemm from vllm.utils.deep_gemm import m_grouped_fp8_gemm_nt_contiguous logger = init_logger(__name__) @@ -93,18 +93,25 @@ def finalize_weight_and_reduce_impl(self) -> mk.TopKWeightAndReduce: return TopKWeightAndReduceNoOP() def workspace_shapes( - self, a: torch.Tensor, aq: torch.Tensor, M: int, N: int, K: int, - topk: int, global_num_experts: int, local_num_experts: int + self, + a: torch.Tensor, + aq: torch.Tensor, + M: int, + N: int, + K: int, + topk: int, + global_num_experts: int, + local_num_experts: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: assert self.block_shape is not None - # We use global_num_experts due to how moe_align_block_size handles - # expert_maps. - num_experts = global_num_experts block_m = self.block_shape[0] - M_sum = (M * topk) + num_experts * (block_m - 1) - M_sum = round_up(M_sum, block_m) - workspace1 = (M_sum, max(N // 2, K)) - workspace2 = (M_sum, max(N, K)) + M_sum = compute_aligned_M(M, topk, local_num_experts, block_m, + expert_tokens_meta) + assert M_sum % block_m == 0 + + workspace1 = (M_sum, max(N, K)) + workspace2 = (M_sum, max(N // 2, K)) output = (M, K) return (workspace1, workspace2, output, a.dtype) @@ -131,43 +138,40 @@ def apply( apply_router_weight_on_input: bool, ): assert self.block_shape is not None + assert a1q_scale is not None a1q = hidden_states _, N, K = w1.size() - M, _ = output.size() - num_topk = topk_ids.size(1) + local_num_experts = w1.size(0) if global_num_experts == -1: - global_num_experts = w1.size(0) + global_num_experts = local_num_experts assert w2.size(1) == K - a1q, a1q_scale, _, expert_ids, inv_perm = _moe_permute( - a1q, - a1q_scale, - topk_ids, - global_num_experts, - expert_map, - self.block_shape[0], - ) - - if expert_map is not None: - # DeepGemm (Grouped Contiguous) kernel needs a valid B index - # for all rows of A. To that effect, simply compute with - # the 0th weight matrix. - # Note that this relies on the fact that corresponding topk - # weights would be 0 during weight multiplication. - expert_ids = torch.where(expert_ids == -1, 0, expert_ids) - - # Note: M_sum is different than the pre-permuted shape of a1q. 
- M_sum = a1q.size(0) - - mm1_out = _resize_cache(workspace2, (M_sum, N)) - act_out = _resize_cache(workspace13, (M_sum, N // 2)) - quant_out = _resize_cache(workspace2.view(dtype=torch.float8_e4m3fn), + M_sum = compute_aligned_M(M=topk_ids.size(0), + num_topk=topk_ids.size(1), + local_num_experts=local_num_experts, + alignment=deep_gemm_block_shape()[0], + expert_tokens_meta=expert_tokens_meta) + + a1q_perm = _resize_cache(workspace2.view(dtype=torch.float8_e4m3fn), + (M_sum, K)) + mm1_out = _resize_cache(workspace13, (M_sum, N)) + act_out = _resize_cache(workspace2, (M_sum, N // 2)) + quant_out = _resize_cache(workspace13.view(dtype=torch.float8_e4m3fn), (M_sum, N // 2)) - mm2_out = _resize_cache(workspace13, (M_sum, K)) - perm_out = _resize_cache(workspace2, (M * num_topk, K)) + mm2_out = _resize_cache(workspace2, (M_sum, K)) + + a1q, a1q_scale, expert_ids, inv_perm = deepgemm_moe_permute( + aq=a1q, + aq_scale=a1q_scale, + topk_ids=topk_ids, + local_num_experts=local_num_experts, + expert_map=expert_map, + expert_tokens_meta=expert_tokens_meta, + aq_out=a1q_perm) + assert a1q.size(0) == M_sum m_grouped_fp8_gemm_nt_contiguous((a1q, a1q_scale), (w1, w1_scale), mm1_out, expert_ids) @@ -183,14 +187,15 @@ def apply( m_grouped_fp8_gemm_nt_contiguous((a2q, a2q_scale), (w2, w2_scale), mm2_out, expert_ids) - torch.index_select(mm2_out, 0, inv_perm, out=perm_out) + if apply_router_weight_on_input: + topk_weights = torch.ones_like(topk_weights) - TopKWeightAndReduceContiguous().apply( - output=output, - fused_expert_output=perm_out, - topk_weights=topk_weights, - topk_ids=topk_ids, - apply_router_weight_on_input=apply_router_weight_on_input) + deepgemm_unpermute_and_reduce(a=mm2_out, + topk_ids=topk_ids, + topk_weights=topk_weights, + inv_perm=inv_perm, + expert_map=expert_map, + output=output) def deep_gemm_moe_fp8( diff --git a/vllm/model_executor/layers/fused_moe/deep_gemm_utils.py b/vllm/model_executor/layers/fused_moe/deep_gemm_utils.py new file mode 100644 index 00000000000..8cc5a747c67 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/deep_gemm_utils.py @@ -0,0 +1,413 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +Taken from https://github.com/ModelTC/LightLLM/blob/8ed97c74c18f11505b048b1ba00ba5c0cef8bff6/lightllm/common/fused_moe/deepep_scatter_gather.py +and updated to fit vllm needs and terminology. +""" + +import functools +from typing import Optional + +import torch + +import vllm.model_executor.layers.fused_moe.modular_kernel as mk +from vllm.model_executor.layers.fused_moe.utils import count_expert_num_tokens +from vllm.triton_utils import tl, triton +from vllm.utils import round_up + + +@functools.cache +def deep_gemm_block_shape() -> list[int]: + # Lazy import to avoid CUDA initialization problems. + import deep_gemm as dg + block = dg.get_m_alignment_for_contiguous_layout() + return [block, block] + + +def expert_num_tokens_round_up_and_sum(expert_num_tokens: torch.Tensor, + alignment: int) -> int: + # Round up each element in expert_num_tokens to the nearest multiple of + # alignment. 
+ ent = (expert_num_tokens.to(torch.int64) + + (alignment - 1)) // alignment * alignment + return torch.sum(ent).item() + + +def compute_aligned_M(M: int, num_topk: int, local_num_experts: int, + alignment: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata]): + + if ((expert_tokens_meta is not None) + and (expert_tokens_meta.expert_num_tokens_cpu is not None)): + return expert_num_tokens_round_up_and_sum( + expert_tokens_meta.expert_num_tokens_cpu, alignment=alignment) + + # expert_num_tokens information is not available on the cpu. + # compute the max required size. + M_sum = (M * num_topk) + local_num_experts * (alignment - 1) + M_sum = round_up(M_sum, alignment) + return M_sum + + +@triton.jit +def apply_expert_map(expert_id, expert_map): + if expert_id != -1: + expert_id = tl.load(expert_map + expert_id).to(tl.int64) + return expert_id + + +@triton.jit +def round_up_128(x: int) -> int: + y = 128 + return ((x + y - 1) // y) * y + + +@triton.jit +def _fwd_kernel_ep_scatter_1( + num_recv_tokens_per_expert, + expert_start_loc, + m_indices, + num_experts: tl.constexpr, + BLOCK_E: tl.constexpr, + BLOCK_EXPERT_NUM: tl.constexpr, +): + cur_expert = tl.program_id(0) + + offset_cumsum = tl.arange(0, BLOCK_EXPERT_NUM) + tokens_per_expert = tl.load(num_recv_tokens_per_expert + offset_cumsum, + mask=offset_cumsum < num_experts, + other=0) + tokens_per_expert = round_up_128(tokens_per_expert) + cumsum = tl.cumsum(tokens_per_expert) - tokens_per_expert + tl.store(expert_start_loc + offset_cumsum, + cumsum, + mask=offset_cumsum < num_experts) + + cur_expert_start = tl.load(expert_start_loc + cur_expert) + cur_expert_token_num = tl.load(num_recv_tokens_per_expert + cur_expert) + + m_indices_start_ptr = m_indices + cur_expert_start + off_expert = tl.arange(0, BLOCK_E) + + for start_m in tl.range(0, cur_expert_token_num, BLOCK_E, num_stages=4): + tl.store( + m_indices_start_ptr + start_m + off_expert, + cur_expert, + ) + + +@triton.jit +def _fwd_kernel_ep_scatter_2( + total_token_num, + expert_start_loc, + recv_x, + recv_x_stride0, + recv_x_stride1, + recv_x_scale, + recv_x_scale_stride0, + recv_x_scale_stride1, + recv_topk, + recv_topk_stride0, + recv_topk_stride1, + output_tensor, + output_tensor_stride0, + output_tensor_stride1, + output_tensor_scale, + output_tensor_scale_stride0, + output_tensor_scale_stride1, + output_index, + output_index_stride0, + output_index_stride1, + topk_num: tl.constexpr, + expert_map, + HAS_EXPERT_MAP: tl.constexpr, + HIDDEN_SIZE: tl.constexpr, + HIDDEN_SIZE_PAD: tl.constexpr, + SCALE_HIDDEN_SIZE: tl.constexpr, + SCALE_HIDDEN_SIZE_PAD: tl.constexpr, +): + start_token_id = tl.program_id(0) + grid_num = tl.num_programs(0) + + offset_in = tl.arange(0, HIDDEN_SIZE_PAD) + mask = offset_in < HIDDEN_SIZE + + offset_in_s = tl.arange(0, SCALE_HIDDEN_SIZE_PAD) + mask_s = offset_in_s < SCALE_HIDDEN_SIZE + + for token_id in range(start_token_id, total_token_num, grid_num): + to_copy = tl.load(recv_x + token_id * recv_x_stride0 + offset_in, + mask=mask) + to_copy_s = tl.load(recv_x_scale + token_id * recv_x_scale_stride0 + + offset_in_s, + mask=mask_s) + + for topk_index in tl.range(0, topk_num, 1, num_stages=4): + expert_id = tl.load(recv_topk + token_id * recv_topk_stride0 + + topk_index) + + if HAS_EXPERT_MAP: + expert_id = apply_expert_map(expert_id, expert_map) + + if expert_id >= 0: + dest_token_index = tl.atomic_add(expert_start_loc + expert_id, + 1) + tl.store( + output_index + token_id * output_index_stride0 + + topk_index, dest_token_index) + output_tensor_ptr = 
(output_tensor + + dest_token_index * output_tensor_stride0) + output_tensor_scale_ptr = ( + output_tensor_scale + + dest_token_index * output_tensor_scale_stride0) + tl.store(output_tensor_ptr + offset_in, to_copy, mask=mask) + tl.store(output_tensor_scale_ptr + offset_in_s, + to_copy_s, + mask=mask_s) + + +@torch.no_grad() +def ep_scatter( + recv_x: torch.Tensor, + recv_x_scale: torch.Tensor, + recv_topk: torch.Tensor, + num_recv_tokens_per_expert: torch.Tensor, + expert_map: Optional[torch.Tensor], + expert_start_loc: torch.Tensor, + output_tensor: torch.Tensor, + output_tensor_scale: torch.Tensor, + m_indices: torch.Tensor, + output_index: torch.Tensor, +): + BLOCK_E = 128 # token num of per expert is aligned to 128 + BLOCK_D = 128 # block size of quantization + num_warps = 8 + num_experts = num_recv_tokens_per_expert.shape[0] + hidden_size = recv_x.shape[1] + # grid = (triton.cdiv(hidden_size, BLOCK_D), num_experts) + grid = num_experts + + assert m_indices.shape[0] % BLOCK_E == 0 + + _fwd_kernel_ep_scatter_1[(grid, )]( + num_recv_tokens_per_expert, + expert_start_loc, + m_indices, + num_experts=num_experts, + num_warps=num_warps, + BLOCK_E=BLOCK_E, + BLOCK_EXPERT_NUM=triton.next_power_of_2(num_experts), + ) + + grid = min(recv_topk.shape[0], 1024 * 8) + + _fwd_kernel_ep_scatter_2[(grid, )]( + recv_topk.shape[0], + expert_start_loc, + recv_x, + recv_x.stride(0), + recv_x.stride(1), + recv_x_scale, + recv_x_scale.stride(0), + recv_x_scale.stride(1), + recv_topk, + recv_topk.stride(0), + recv_topk.stride(1), + output_tensor, + output_tensor.stride(0), + output_tensor.stride(1), + output_tensor_scale, + output_tensor_scale.stride(0), + output_tensor_scale.stride(1), + output_index, + output_index.stride(0), + output_index.stride(1), + topk_num=recv_topk.shape[1], + expert_map=expert_map, + HAS_EXPERT_MAP=expert_map is not None, + num_warps=num_warps, + HIDDEN_SIZE=hidden_size, + HIDDEN_SIZE_PAD=triton.next_power_of_2(hidden_size), + SCALE_HIDDEN_SIZE=hidden_size // BLOCK_D, + SCALE_HIDDEN_SIZE_PAD=triton.next_power_of_2(hidden_size // BLOCK_D), + ) + return + + +@triton.jit +def _fwd_kernel_ep_gather( + total_token_num, + input_tensor, + input_tensor_stride0, + input_tensor_stride1, + recv_topk_ids, + recv_topk_ids_stride0, + recv_topk_ids_stride1, + recv_topk_weight, + recv_topk_weight_stride0, + recv_topk_weight_stride1, + input_index, + input_index_stride0, + input_index_stride1, + output_tensor, + output_tensor_stride0, + output_tensor_stride1, + topk_num: tl.constexpr, + expert_map, + HAS_EXPERT_MAP: tl.constexpr, + BLOCK_D: tl.constexpr, +): + cur_block = tl.program_id(0) + start_cur_token = tl.program_id(1) + grid_num = tl.num_programs(1) + + for cur_token in range(start_cur_token, total_token_num, grid_num): + off_d = tl.arange(0, BLOCK_D) + accumulator = tl.zeros([BLOCK_D], dtype=tl.float32) + for topk_index in range(0, topk_num): + expert_id = tl.load(recv_topk_ids + + cur_token * recv_topk_ids_stride0 + topk_index) + + if HAS_EXPERT_MAP: + expert_id = apply_expert_map(expert_id, expert_map) + + if expert_id >= 0: + source_token_index = tl.load(input_index + + cur_token * input_index_stride0 + + topk_index) + acc_weight = tl.load(recv_topk_weight + + cur_token * recv_topk_weight_stride0 + + topk_index) + tmp = tl.load(input_tensor + + source_token_index * input_tensor_stride0 + + cur_block * BLOCK_D + off_d) + accumulator += tmp.to(tl.float32) * acc_weight + + tl.store( + output_tensor + cur_token * output_tensor_stride0 + + cur_block * BLOCK_D + off_d, + 
accumulator.to(output_tensor.dtype.element_ty), + ) + + +@torch.no_grad() +def ep_gather( + input_tensor: torch.Tensor, + recv_topk_ids: torch.Tensor, + recv_topk_weight: torch.Tensor, + input_index: torch.Tensor, + expert_map: Optional[torch.Tensor], + output_tensor: torch.Tensor, +): + num_warps = 2 + num_tokens = output_tensor.shape[0] + hidden_size = input_tensor.shape[1] + BLOCK_D = min(hidden_size, 1024) + assert hidden_size % BLOCK_D == 0 + grid = (triton.cdiv(hidden_size, BLOCK_D), min(num_tokens, 1024)) + + _fwd_kernel_ep_gather[grid]( + num_tokens, + input_tensor, + input_tensor.stride(0), + input_tensor.stride(1), + recv_topk_ids, + recv_topk_ids.stride(0), + recv_topk_ids.stride(1), + recv_topk_weight, + recv_topk_weight.stride(0), + recv_topk_weight.stride(1), + input_index, + input_index.stride(0), + input_index.stride(1), + output_tensor, + output_tensor.stride(0), + output_tensor.stride(1), + topk_num=recv_topk_ids.shape[1], + expert_map=expert_map, + HAS_EXPERT_MAP=expert_map is not None, + num_warps=num_warps, + BLOCK_D=BLOCK_D, + ) + return + + +def deepgemm_moe_permute(aq: torch.Tensor, + aq_scale: torch.Tensor, + topk_ids: torch.Tensor, + local_num_experts: int, + expert_map: Optional[torch.Tensor], + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + aq_out: Optional[torch.Tensor] = None): + + assert aq.ndim == 2 + assert topk_ids.dtype.is_signed, ( + "The kernel uses -1 to represent invalid topk_ids") + H = aq.size(1) + device = aq.device + + block_m = deep_gemm_block_shape()[0] + block_k = deep_gemm_block_shape()[1] + + M_sum = compute_aligned_M(M=topk_ids.size(0), + num_topk=topk_ids.size(1), + local_num_experts=local_num_experts, + alignment=block_m, + expert_tokens_meta=expert_tokens_meta) + + expert_start_loc = torch.empty((local_num_experts), + device=device, + dtype=torch.int32) + + assert aq_out is None or aq_out.shape == (M_sum, H) + if aq_out is None: + aq_out = torch.empty((M_sum, H), device=device, dtype=aq.dtype) + + aq_scale_out = torch.empty((M_sum, H // block_k), + device=device, + dtype=torch.float32) + + maybe_has_empty_blocks = ((expert_tokens_meta is None) + or (expert_tokens_meta.expert_num_tokens_cpu + is None)) + expert_ids_init = torch.zeros if maybe_has_empty_blocks else torch.empty + + expert_ids = expert_ids_init((M_sum), device=device, dtype=torch.int32) + inv_perm = torch.empty(topk_ids.shape, device=device, dtype=torch.int32) + + expert_num_tokens = None + if expert_tokens_meta is not None: + expert_num_tokens = expert_tokens_meta.expert_num_tokens + else: + expert_num_tokens = count_expert_num_tokens(topk_ids, + local_num_experts, + expert_map) + + ep_scatter(recv_x=aq, + recv_x_scale=aq_scale, + recv_topk=topk_ids, + num_recv_tokens_per_expert=expert_num_tokens, + expert_start_loc=expert_start_loc, + expert_map=expert_map, + output_tensor=aq_out, + output_tensor_scale=aq_scale_out, + m_indices=expert_ids, + output_index=inv_perm) + + return aq_out, aq_scale_out, expert_ids, inv_perm + + +def deepgemm_unpermute_and_reduce( + a: torch.Tensor, # Grouped gemm output + topk_ids: torch.Tensor, + topk_weights: torch.Tensor, + inv_perm: torch.Tensor, + expert_map: Optional[torch.Tensor], + output: torch.Tensor): + + return ep_gather(input_tensor=a, + recv_topk_ids=topk_ids, + recv_topk_weight=topk_weights, + input_index=inv_perm, + expert_map=expert_map, + output_tensor=output) diff --git a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py index b311ef1ac1c..ab8a281b390 
100644 --- a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py @@ -677,6 +677,7 @@ def workspace_shapes( topk: int, global_num_experts: int, local_num_experts: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: assert a.dim() == 2 num_dp = self.num_dispatchers @@ -889,6 +890,7 @@ def workspace_shapes( topk: int, global_num_experts: int, local_num_experts: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: assert a.dim() == 2 num_dp = self.num_dispatchers diff --git a/vllm/model_executor/layers/fused_moe/fused_moe.py b/vllm/model_executor/layers/fused_moe/fused_moe.py index 079486dd438..ddda87c441b 100644 --- a/vllm/model_executor/layers/fused_moe/fused_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_moe.py @@ -1618,6 +1618,7 @@ def workspace_shapes( topk: int, global_num_experts: int, local_num_experts: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: workspace1 = (M, topk, max(N // 2, K)) workspace2 = (M, topk, max(N, K)) diff --git a/vllm/model_executor/layers/fused_moe/modular_kernel.py b/vllm/model_executor/layers/fused_moe/modular_kernel.py index 028eee24178..bc4eb3b1932 100644 --- a/vllm/model_executor/layers/fused_moe/modular_kernel.py +++ b/vllm/model_executor/layers/fused_moe/modular_kernel.py @@ -317,6 +317,7 @@ def workspace_shapes( topk: int, global_num_experts: int, local_num_experts: int, + expert_tokens_meta: Optional[ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: """ Compute the shapes for the temporary and final outputs of the two gemms @@ -479,7 +480,8 @@ def _do_fused_experts(self, fused_out: Optional[torch.Tensor], (workspace13_shape, workspace2_shape, fused_out_shape, workspace_dtype) = self.fused_experts.workspace_shapes( - a1, a1q, M, N, K, top_k, global_num_experts, local_num_experts) + a1, a1q, M, N, K, top_k, global_num_experts, local_num_experts, + expert_tokens_meta) # We can reuse the memory between cache1 and cache3 because by the # time we need cache3, we're done with cache1. @@ -572,10 +574,9 @@ def _maybe_chunk_fused_experts( assert num_chunks > 1 # Construct the entire output that can then be processed in chunks. 
- (_, _, fused_out_shape, - _) = self.fused_experts.workspace_shapes(a1, a1q, M, N, K, top_k, - global_num_experts, - local_num_experts) + (_, _, fused_out_shape, _) = self.fused_experts.workspace_shapes( + a1, a1q, M, N, K, top_k, global_num_experts, local_num_experts, + expert_tokens_meta) fused_out = torch.empty(fused_out_shape, device=a1q.device, dtype=a1.dtype) @@ -613,8 +614,11 @@ def slice_expert_tokens_metadata( need_expert_num_tokens_cpu = ( full_expert_tokens_meta.expert_num_tokens_cpu is not None) if need_expert_num_tokens_cpu: + # This is blocking as some implementations need the count + # on the CPU to determine appropriate input/out fused-moe + # buffers c_expert_num_tokens_cpu = c_expert_num_tokens.to( - "cpu", non_blocking=True) + "cpu", non_blocking=False) return ExpertTokensMetadata( expert_num_tokens=c_expert_num_tokens, diff --git a/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py index 2f35c19b705..51b95c9aa92 100644 --- a/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py @@ -102,6 +102,7 @@ def workspace_shapes( topk: int, global_num_experts: int, local_num_experts: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: # Note: the deep gemm workspaces are strictly larger than the triton # workspaces so we can be pessimistic here and allocate for DeepGemm @@ -110,11 +111,13 @@ def workspace_shapes( or is_blackwell_deep_gemm_used()): assert self.deep_gemm_expert is not None return self.deep_gemm_expert.workspace_shapes( - a, aq, M, N, K, topk, global_num_experts, local_num_experts) + a, aq, M, N, K, topk, global_num_experts, local_num_experts, + expert_tokens_meta) else: return self.triton_expert.workspace_shapes(a, aq, M, N, K, topk, global_num_experts, - local_num_experts) + local_num_experts, + expert_tokens_meta) def apply( self, From f83e791a8493d8de305e87925a381259e58bcb61 Mon Sep 17 00:00:00 2001 From: Asher Date: Thu, 17 Jul 2025 17:10:09 +0800 Subject: [PATCH 152/552] [Model] Add ToolParser and MoE Config for Hunyuan A13B (#20820) Signed-off-by: Asher Zhang Signed-off-by: x22x22 --- benchmarks/kernels/benchmark_moe.py | 5 + docs/features/reasoning_outputs.md | 1 + docs/features/tool_calling.md | 10 + .../tool_chat_template_hunyuan_a13b.jinja | 113 ++++++ .../test_hunyuan_a13b_tool_parser.py | 153 +++++++ .../test_hunyuan_reasoning_parser.py | 11 + vllm/entrypoints/openai/serving_chat.py | 19 +- .../openai/tool_parsers/__init__.py | 3 +- .../tool_parsers/hunyuan_a13b_tool_parser.py | 372 ++++++++++++++++++ ...device_name=NVIDIA_H20,dtype=fp8_w8a8.json | 146 +++++++ ...device_name=NVIDIA_H20,dtype=fp8_w8a8.json | 146 +++++++ .../E=64,N=3072,device_name=NVIDIA_H20.json | 146 +++++++ ...device_name=NVIDIA_H20,dtype=fp8_w8a8.json | 146 +++++++ .../E=64,N=384,device_name=NVIDIA_H20.json | 146 +++++++ ...device_name=NVIDIA_H20,dtype=fp8_w8a8.json | 146 +++++++ .../E=64,N=768,device_name=NVIDIA_H20.json | 146 +++++++ .../hunyuan_a13b_reasoning_parser.py | 7 + 17 files changed, 1712 insertions(+), 4 deletions(-) create mode 100644 examples/tool_chat_template_hunyuan_a13b.jinja create mode 100644 tests/entrypoints/openai/tool_parsers/test_hunyuan_a13b_tool_parser.py create mode 100644 vllm/entrypoints/openai/tool_parsers/hunyuan_a13b_tool_parser.py create mode 100644 
vllm/model_executor/layers/fused_moe/configs/E=64,N=1536,device_name=NVIDIA_H20,dtype=fp8_w8a8.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=64,N=3072,device_name=NVIDIA_H20,dtype=fp8_w8a8.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=64,N=3072,device_name=NVIDIA_H20.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=64,N=384,device_name=NVIDIA_H20,dtype=fp8_w8a8.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=64,N=384,device_name=NVIDIA_H20.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_H20,dtype=fp8_w8a8.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_H20.json diff --git a/benchmarks/kernels/benchmark_moe.py b/benchmarks/kernels/benchmark_moe.py index 51c9f68e43a..132c325ce59 100644 --- a/benchmarks/kernels/benchmark_moe.py +++ b/benchmarks/kernels/benchmark_moe.py @@ -586,6 +586,11 @@ def main(args: argparse.Namespace): topk = config.num_experts_per_tok intermediate_size = config.moe_intermediate_size shard_intermediate_size = 2 * intermediate_size // args.tp_size + elif config.architectures[0] in ("HunYuanMoEV1ForCausalLM"): + E = config.num_experts + topk = config.moe_topk[0] + intermediate_size = config.moe_intermediate_size[0] + shard_intermediate_size = 2 * intermediate_size // args.tp_size else: # Support for llama4 config = config.get_text_config() diff --git a/docs/features/reasoning_outputs.md b/docs/features/reasoning_outputs.md index 7ab7efd5e76..6b84eca2753 100644 --- a/docs/features/reasoning_outputs.md +++ b/docs/features/reasoning_outputs.md @@ -14,6 +14,7 @@ vLLM currently supports the following reasoning models: | [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | `deepseek_r1` | `guided_json`, `guided_regex` | ✅ | | [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ | | [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ | +| [Hunyuan A13B series](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | `hunyuan_a13b` | `guided_json`, `guided_regex` | ✅ | !!! note IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`. diff --git a/docs/features/tool_calling.md b/docs/features/tool_calling.md index f1e5dad35f1..9b9d6e1360e 100644 --- a/docs/features/tool_calling.md +++ b/docs/features/tool_calling.md @@ -288,6 +288,16 @@ Supported models: Flags: `--tool-call-parser kimi_k2` +### Hunyuan Models (`hunyuan_a13b`) + +Supported models: + +* `tencent/Hunyuan-A13B-Instruct` (chat template already included huggingface model file.) + +Flags: +* For non-reasoning: `--tool-call-parser hunyuan_a13b` +* For reasoning: `--tool-call-parser hunyuan_a13b --reasoning-parser hunyuan_a13b --enable_reasoning` + ### Models with Pythonic Tool Calls (`pythonic`) A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models. 
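For reference, a minimal client-side sketch of the Hunyuan A13B tool-calling flow described in the `tool_calling.md` section above, assuming a locally running vLLM OpenAI-compatible server; the base URL, API key, and the `get_weather` tool schema are illustrative assumptions, not part of this change:

```python
# Minimal sketch (assumptions: server URL/port, API key, and tool schema).
# Assumes a server launched roughly as:
#   vllm serve tencent/Hunyuan-A13B-Instruct \
#       --enable-auto-tool-choice --tool-call-parser hunyuan_a13b
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "metric": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in San Francisco?"}],
    tools=tools,
    tool_choice="auto",
)

# With the hunyuan_a13b parser enabled, extracted calls arrive as structured
# tool_calls rather than raw text in the message content.
for tool_call in response.choices[0].message.tool_calls or []:
    print(tool_call.function.name, tool_call.function.arguments)
```

When the parser is active, the model's bracketed JSON output (e.g. `[{"name": ..., "arguments": {...}}]`) is converted into structured `tool_calls` entries instead of being returned as plain text, as exercised by the parser tests added in this patch.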
diff --git a/examples/tool_chat_template_hunyuan_a13b.jinja b/examples/tool_chat_template_hunyuan_a13b.jinja new file mode 100644 index 00000000000..a0808e44858 --- /dev/null +++ b/examples/tool_chat_template_hunyuan_a13b.jinja @@ -0,0 +1,113 @@ +{% set loop_messages = messages %} +{% if tools %} + {% set weekday_map = {'Monday': '星期一', 'Tuesday': '星期二', 'Wednesday': '星期三', 'Thursday': '星期四', 'Friday': '星期五', 'Saturday': '星期六', 'Sunday': '星期日'} %} + {% set weekday_cn = weekday_map[strftime_now('%A')] %} + {% set datetime_str = strftime_now('%Y-%m-%d %H:%M:%S') %} + {% set datetime_str = datetime_str + ' ' + weekday_cn %} + {% for message in loop_messages %} + {% if 'content' in message %} + {% set content = message['content'] %} + {% else %} + {% set content = '' %} + {% endif %} + {% if loop.index0 == 0 %} + {% set content_tmp = '你是一位函数组合专家。你会得到一个问题和一组可能的函数。根据问题,你需要进行一个或多个函数/工具调用以实现目的。 +如果没有一个函数可以使用,请直接使用自然语言回复用户,以助手:开头。 +如果给定的问题缺少函数所需的参数,请使用自然语言进行提问,向用户询问必要信息,以助手:开头。 +如果调用结果已经足够回答用户问题,请对历史结果进行总结,使用自然语言回复用户,以助手:开头。 +你应该只在工具调用部分返回函数调用。如果你决定调用任何函数,你必须将其格式化为[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...]。你不应该在回复中包含任何其他文本。以下是你可以调用的函数列表,格式为JSON。 +' %} + {% set content_tmp = content_tmp + ' +' + tools | tojson + ' +' %} + {% if message['role'] == 'system' %} + {% set content_tmp = content_tmp + ' +额外要求: +' + content + ' + +如果你决定返回函数调用,请将其格式化为[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...],不得包含其他文本。如果额外要求里有格式要求,请忽略,以此处为准。 +否则,请参考开头说的三种情况,以助手:开头进行回复。 + +如果额外要求里有时间信息,就以额外要求里的时间为准,否则,参考当前时间:' + datetime_str %} + {% set content = '<|startoftext|>' + content_tmp + '<|extra_4|>' %} + {% elif message['role'] == 'user' %} + {% set content_tmp = content_tmp + ' +如果你决定返回函数调用,请将其格式化为[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...],不得包含其他文本。 +否则,请参考开头说的三种情况,以助手:开头进行回复。 + +当前时间:' + datetime_str %} + {% set content_tmp = '<|startoftext|>' + content_tmp + '<|extra_4|>'%} + {% set content = content_tmp + '用户:' + content + '<|extra_0|>' %} + {% endif %} + {% else %} + {% if message['role'] == 'user' %} + {% set content = '用户:' + content + '<|extra_0|>' %} + {% elif message['role'] == 'assistant' %} + {% if 'tool_calls' in message %} + {% set tool_calls = message['tool_calls'] %} + {% set ns = namespace(tool_calls="[") %} + {% for tool_call in tool_calls %} + {% set function = tool_call['function'] %} + {% set name = function['name'] %} + {% set ns.tool_calls = ns.tool_calls + '{"name": "' + name + '", '%} + {% set arguments = function['arguments'] %} + {% if arguments is not string %} + {% set arguments = arguments | tojson %} + {% endif %} + {% set ns.tool_calls = ns.tool_calls + '"arguments": ' + arguments + '}' %} + {% if not loop.last %} + {% set ns.tool_calls = ns.tool_calls + ', '%} + {% endif %} + {% endfor %} + {% set ns.tool_calls = ns.tool_calls + ']' %} + {% set content = content + '' + ns.tool_calls + '' %} + {% else %} + {% set content = '助手:' + content %} + {% endif %} + {% set content = content + '<|eos|>' %} + {% elif message['role'] == 'tool' %} + {% if content is not string %} + {set content = content | tojson } + {% endif %} + {% set content = '' + content + '' %} + {% set content = content + '<|extra_0|>' %} + {% endif %} + {% endif %} + {{- content -}} + {% endfor %} +{% else %} + {% set context = {'has_head': true} %} + {% for message in loop_messages %} + {% if 'content' in message %} + {% set content = message['content'] %} + {% else %} + {% set content = '' %} + {% 
endif %} + {% if loop.index0 == 0 %} + {% if content == '' %} + {% set _ = context.update({'has_head': false}) %} + {% elif message['role'] == 'system' %} + {% set content = '<|startoftext|>' + content + '<|extra_4|>' %} + {% endif %} + {% endif %} + {% if message['role'] == 'user' %} + {% if loop.index0 == 1 and not context.has_head %} + {% set content = '<|startoftext|>' + content %} + {% endif %} + {% if loop.index0 == 1 and context.has_head %} + {% set content = content + '<|extra_0|>' %} + {% else %} + {% set content = '<|startoftext|>' + content + '<|extra_0|>' %} + {% endif %} + {% elif message['role'] == 'assistant' %} + {% set content = content + '<|eos|>' %} + {% elif message['role'] == 'tool' %} + {% set content = content + '<|extra_0|>' %} + {% endif %} + {{- content -}} + {% endfor %} +{% endif %} +{%- if enable_thinking is defined and enable_thinking is false %} + {{- '\n\n\n' }} +{%- endif %} + diff --git a/tests/entrypoints/openai/tool_parsers/test_hunyuan_a13b_tool_parser.py b/tests/entrypoints/openai/tool_parsers/test_hunyuan_a13b_tool_parser.py new file mode 100644 index 00000000000..bd8e06513e1 --- /dev/null +++ b/tests/entrypoints/openai/tool_parsers/test_hunyuan_a13b_tool_parser.py @@ -0,0 +1,153 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# ruff: noqa: E501 + +import json +from unittest.mock import MagicMock + +import pytest + +from tests.entrypoints.openai.tool_parsers.utils import ( + run_tool_extraction, run_tool_extraction_streaming) +from vllm.entrypoints.openai.protocol import FunctionCall, ToolCall +from vllm.entrypoints.openai.tool_parsers import ToolParser, ToolParserManager + + +def make_tool_call(name, arguments): + return ToolCall(type="function", + function=FunctionCall(name=name, + arguments=json.dumps(arguments))) + + +# TODO: add reason prefix and suffix. + + +@pytest.mark.parametrize( + "model_output,expected_tool_calls,expected_content", + [ + # No tool call + ("How can I help you today?", [], "How can I help you today?"), + # Single tool call, no content + ( + "[{\"name\": \"get_weather\", \"arguments\": {\"city\": \"San Francisco\", \"metric\": \"celsius\"}}]", #noqa: E501 + [ + make_tool_call("get_weather", { + "city": "San Francisco", + "metric": "celsius" + }) + ], + None), + # Multiple tool calls + ( + "[{\"name\": \"get_weather\", \"arguments\": {\"city\": \"San Francisco\", \"metric\": \"celsius\"}}, {\"name\": \"register_user\", \"arguments\": {\"name\": \"John Doe\", \"age\": 37, \"address\": {\"city\": \"San Francisco\", \"state\": \"CA\"}, \"role\": null, \"passed_test\": true, \"aliases\": [\"John\", \"Johnny\"]}}]", #noqa: E501 + [ + make_tool_call("get_weather", { + "city": "San Francisco", + "metric": "celsius" + }), + make_tool_call( + "register_user", { + "name": "John Doe", + "age": 37, + "address": { + "city": "San Francisco", + "state": "CA" + }, + "role": None, + "passed_test": True, + "aliases": ["John", "Johnny"] + }) + ], + None), + # Content before tool call + ( + "I will call the tool now. [{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Boston\"}}]", #noqa: E501 + [make_tool_call("get_weather", {"city": "Boston"})], + "I will call the tool now. 
"), + # Content after tool call (should be stripped) + ( + "[{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Seattle\"}}]\nThank you!", #noqa: E501 + [make_tool_call("get_weather", {"city": "Seattle"})], + None), + ( + "[{\"name\": \"complex_tool\", \"arguments\": {\"level1\": {\"level2\": {\"level3\": {\"value\": 123}}}}}]", + [ + make_tool_call( + "complex_tool", + {"level1": { + "level2": { + "level3": { + "value": 123 + } + } + }}) + ], + None, + ), + ]) +def test_hunyuan_a13b_tool_parser_extract(model_output, expected_tool_calls, + expected_content): + mock_tokenizer = MagicMock() + tool_parser: ToolParser = ToolParserManager.get_tool_parser( + "hunyuan_a13b")(mock_tokenizer) + content, tool_calls = run_tool_extraction(tool_parser, + model_output, + streaming=False) + + # align the random id. + for idx in range(len(tool_calls)): + tool_calls[idx].id = expected_tool_calls[idx].id + assert tool_calls == expected_tool_calls + assert content == expected_content + + +# Streaming test: simulate incremental output +@pytest.mark.parametrize("model_deltas,expected_tool_calls", [ + ([ + "[{\"name\": \"get_weather\", ", + "\"arguments\": {\"city\": \"San Francisco\", ", + "\"metric\": \"celsius\"}}]", "" + ], [ + make_tool_call("get_weather", { + "city": "San Francisco", + "metric": "celsius" + }) + ]), + ([ + "[{\"name\":", " \"get_weather\",", " \"arguments\":", + " {\"city\": \"Boston\"}", "}]", "" + ], [make_tool_call("get_weather", {"city": "Boston"})]), + ([ + "", "[{\"name\":", " \"get_weather\",", " \"arguments\":", + " {\"city\": \"Boston\"}", "}]", "", "\n" + ], [make_tool_call("get_weather", {"city": "Boston"})]), + pytest.param([ + "[{\"name\": \"complex_tool\",", " \"arguments\": ", + " {\"level1\": {\"level2\": ", "{\"level3\": {\"value\": 123}}}}}", + "]" + ], [ + make_tool_call("complex_tool", + {"level1": { + "level2": { + "level3": { + "value": 123 + } + } + }}) + ], + marks=pytest.mark.xfail( + reason="stream parsing not support nested json yet.")), +]) +def test_hunyuan_a13b_tool_parser_streaming(model_deltas, expected_tool_calls): + mock_tokenizer = MagicMock() + + tool_parser: ToolParser = ToolParserManager.get_tool_parser( + "hunyuan_a13b")(mock_tokenizer) + reconstructor = run_tool_extraction_streaming( + tool_parser, model_deltas, assert_one_tool_per_delta=False) + + # align the random id. 
+ for idx in range(len(reconstructor.tool_calls)): + reconstructor.tool_calls[idx].id = expected_tool_calls[idx].id + + assert reconstructor.tool_calls == expected_tool_calls diff --git a/tests/reasoning/test_hunyuan_reasoning_parser.py index f70cf453f0e..f9238267f02 100644 --- a/tests/reasoning/test_hunyuan_reasoning_parser.py +++ b/tests/reasoning/test_hunyuan_reasoning_parser.py @@ -30,6 +30,12 @@ "reasoning_content": "This is a reasoning section", "content": None, } + +COMPLETE_REASONING_WITH_SYMBOL = { + "output": f"{START_REASONING}This is a reasoning section!{START_RESPONSE}", + "reasoning_content": "This is a reasoning section!", + "content": None, +} NO_REASONING = { "output": "This is content", "reasoning_content": None, @@ -70,6 +76,11 @@ COMPLETE_REASONING, id="complete_reasoning", ), + pytest.param( + False, + COMPLETE_REASONING_WITH_SYMBOL, + id="complete_reasoning_with_symbol", + ), pytest.param( False, NO_REASONING, diff --git a/vllm/entrypoints/openai/serving_chat.py index b902166a25b..a5eb16a5397 100644 --- a/vllm/entrypoints/openai/serving_chat.py +++ b/vllm/entrypoints/openai/serving_chat.py @@ -613,8 +613,13 @@ async def chat_completion_stream_generator( previous_text = previous_texts[i] previous_token_ids = all_previous_token_ids[i] current_text = previous_text + delta_text - current_token_ids = previous_token_ids + list( - output.token_ids) + + # avoid the `None + list` TypeError when previous_token_ids is None. + if previous_token_ids: + current_token_ids = previous_token_ids + list( + output.token_ids) + else: + current_token_ids = list(output.token_ids) # handle streaming deltas for tools with named tool_choice if tool_choice_function_name: @@ -1077,9 +1082,17 @@ async def chat_completion_full_generator( else: # FOR NOW make it a chat message; we will have to detect # the type to make it later. + ret_content = content + + # prefer the content returned by the tool parser, + # since the parser may have modified it.
+ if (tool_call_info.content + and len(tool_call_info.content) > 0): + ret_content = tool_call_info.content + message = ChatMessage(role=role, reasoning_content=reasoning_content, - content=content) + content=ret_content) # undetermined case that is still important to handle else: diff --git a/vllm/entrypoints/openai/tool_parsers/__init__.py b/vllm/entrypoints/openai/tool_parsers/__init__.py index 218a120a5bb..137375b9707 100644 --- a/vllm/entrypoints/openai/tool_parsers/__init__.py +++ b/vllm/entrypoints/openai/tool_parsers/__init__.py @@ -6,6 +6,7 @@ from .granite_20b_fc_tool_parser import Granite20bFCToolParser from .granite_tool_parser import GraniteToolParser from .hermes_tool_parser import Hermes2ProToolParser +from .hunyuan_a13b_tool_parser import HunyuanA13BToolParser from .internlm2_tool_parser import Internlm2ToolParser from .jamba_tool_parser import JambaToolParser from .kimi_k2_tool_parser import KimiK2ToolParser @@ -23,5 +24,5 @@ "Internlm2ToolParser", "Llama3JsonToolParser", "JambaToolParser", "Llama4PythonicToolParser", "PythonicToolParser", "Phi4MiniJsonToolParser", "DeepSeekV3ToolParser", "xLAMToolParser", "MinimaxToolParser", - "KimiK2ToolParser" + "KimiK2ToolParser", "HunyuanA13BToolParser" ] diff --git a/vllm/entrypoints/openai/tool_parsers/hunyuan_a13b_tool_parser.py b/vllm/entrypoints/openai/tool_parsers/hunyuan_a13b_tool_parser.py new file mode 100644 index 00000000000..2b65f2579fb --- /dev/null +++ b/vllm/entrypoints/openai/tool_parsers/hunyuan_a13b_tool_parser.py @@ -0,0 +1,372 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# ruff: noqa: E501, SIM102 + +import json +from collections.abc import Sequence +from typing import Any, Optional, Union + +import regex as re + +from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, + DeltaFunctionCall, DeltaMessage, + DeltaToolCall, + ExtractedToolCallInformation, + FunctionCall, ToolCall) +from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import ( + ToolParser, ToolParserManager) +from vllm.entrypoints.openai.tool_parsers.utils import consume_space +from vllm.logger import init_logger +from vllm.transformers_utils.tokenizer import AnyTokenizer +from vllm.utils import random_uuid + +logger = init_logger(__name__) + + +@ToolParserManager.register_module("hunyuan_a13b") +class HunyuanA13BToolParser(ToolParser): + + def __init__(self, tokenizer: AnyTokenizer): + super().__init__(tokenizer) + + # Initialize state for streaming mode + self.prev_tool_calls: list[dict] = [] + self.current_tool_id = -1 + self.current_tool_name_sent = False + self.streamed_args: list[str] = [ + ] # Track arguments sent for each tool + + # For backward compatibility with tests + self.current_tools_sent: list[bool] = [] + + # For backward compatibility with serving code + self.prev_tool_call_arr = [] + + # Regex patterns for preprocessing + self.answer_tool_calls_pattern = re.compile( + r"([\s\S]*?)", re.DOTALL) + + self.tool_name_reg = re.compile(r'"name"\s*:\s*"([^"]+)"') + + self.tool_empty_arg_reg = re.compile( + r'"name"\s*:\s*"[^"]+"\s*,\s*"arguments"\s*:\s*\{\s*\}') + + # TODO: not support nested json object in fc arguments. 
+ self.tool_non_empty_arg_reg = re.compile( + r'"name"\s*:\s*"[^"]+"\s*,\s*"arguments"\s*:\s*(\{(?:[^{}]|(?:\{[^{}]*\}))*\})' + ) + + self.bot_string = "" + + # Define streaming state type to be initialized later + self.streaming_state: dict[str, Any] = { + "current_tool_index": -1, + "tool_ids": [], + "sent_tools": [], + } + + def preprocess_model_output( + self, model_output: str) -> tuple[Optional[str], Optional[str]]: + # find the location tool call + for match in self.answer_tool_calls_pattern.finditer(model_output): + start, end = match.span() + # check tool_calls whether in side of + think_regions = [(m.start(), m.end()) for m in re.finditer( + r"(.*?)", model_output, flags=re.DOTALL)] + in_think = any(start > t_start and end < t_end + for t_start, t_end in think_regions) + if not in_think: + content = model_output[:start] + tool_calls_content = match.group(1).strip() + try: + json.loads(tool_calls_content) + return content, tool_calls_content + except Exception: + continue + return model_output, None + + def extract_tool_calls( + self, model_output: str, + request: ChatCompletionRequest) -> ExtractedToolCallInformation: + """ + Extract tool calls from a complete model output. + """ + try: + # Preprocess the model output + content, potential_tool_calls = self.preprocess_model_output( + model_output) + + if not potential_tool_calls: + # some text should be filtered out for no function call + # this text is in a13b's chat template. + if content: + content = content.replace("助手:", "", 1) + return ExtractedToolCallInformation(tools_called=False, + tool_calls=[], + content=content) + + # Parse the potential tool calls as JSON + tool_calls_data = json.loads(potential_tool_calls) + + # Ensure it's an array + if not isinstance(tool_calls_data, list): + logger.debug("Tool calls data is not an array") + return ExtractedToolCallInformation( + tools_called=False, + tool_calls=[], + content=content or model_output, + ) + + tool_calls: list[ToolCall] = [] + + for idx, call in enumerate(tool_calls_data): + if (not isinstance(call, dict) or "name" not in call + or "arguments" not in call): + continue + + tool_call = ToolCall( + id=f"call_{random_uuid()}", + type="function", + function=FunctionCall( + name=call["name"], + arguments=(json.dumps(call["arguments"]) if isinstance( + call["arguments"], dict) else call["arguments"]), + ), + ) + tool_calls.append(tool_call) + + if not content or len(content.strip()) == 0: + # clear the whitespace content. + content = None + + return ExtractedToolCallInformation( + tools_called=len(tool_calls) > 0, + tool_calls=tool_calls, + content=content, + ) + + except Exception: + return ExtractedToolCallInformation(tools_called=False, + tool_calls=[], + content=model_output) + + def extract_tool_calls_streaming( + self, + previous_text: str, + current_text: str, + delta_text: str, + previous_token_ids: Sequence[int], + current_token_ids: Sequence[int], + delta_token_ids: Sequence[int], + request: ChatCompletionRequest, + ) -> Union[DeltaMessage, None]: + """ + Extract tool calls for streaming mode. 
+ """ + + start_idx = consume_space(0, current_text) + if current_text[start_idx:].startswith(self.bot_string): + start_idx = consume_space(start_idx + len(self.bot_string), + current_text) + if not current_text or start_idx >= len( + current_text) or current_text[start_idx] != '[': + return DeltaMessage(content=delta_text) + + self._try_parse_json_tools(current_text[start_idx:]) + + test_delta = self._handle_test_compatibility(current_text) + if test_delta: + return test_delta + + name_matches = list(self.tool_name_reg.finditer(current_text)) + tool_count = len(name_matches) + if tool_count == 0: + return None + self._ensure_state_arrays(tool_count) + current_idx = self.streaming_state["current_tool_index"] + + name_delta = self._handle_tool_name_streaming(current_idx, tool_count, + name_matches) + if name_delta: + return name_delta + + args_delta = self._handle_tool_args_streaming(current_text, + current_idx, tool_count) + if args_delta: + return args_delta + + return None + + def _try_parse_json_tools(self, current_text: str): + try: + parsed_tools = json.loads(current_text) + if isinstance(parsed_tools, list): + self.prev_tool_call_arr = parsed_tools + except json.JSONDecodeError: + pass + + def _handle_test_compatibility(self, current_text: str): + if len(self.current_tools_sent) > 0: + if (len(self.current_tools_sent) == 1 + and self.current_tools_sent[0] is False): + name_match = self.tool_name_reg.search(current_text) + if name_match: + function_name = name_match.group(1) + tool_id = f"chatcmpl-tool-{random_uuid()}" + delta = DeltaMessage(tool_calls=[ + DeltaToolCall( + index=0, + type="function", + id=tool_id, + function=DeltaFunctionCall( + name=function_name).model_dump( + exclude_none=True), + ) + ]) + self.current_tools_sent = [True] + self.current_tool_id = 0 + self.streaming_state["current_tool_index"] = 0 + if len(self.streaming_state["sent_tools"]) == 0: + self.streaming_state["sent_tools"].append({ + "sent_name": + True, + "sent_arguments_prefix": + False, + "sent_arguments": + "", + }) + else: + self.streaming_state["sent_tools"][0][ + "sent_name"] = True + self.current_tool_name_sent = True + return delta + return None + + def _ensure_state_arrays(self, tool_count: int): + while len(self.streaming_state["sent_tools"]) < tool_count: + self.streaming_state["sent_tools"].append({ + "sent_name": False, + "sent_arguments_prefix": False, + "sent_arguments": "", + }) + while len(self.streaming_state["tool_ids"]) < tool_count: + self.streaming_state["tool_ids"].append(None) + + def _handle_tool_name_streaming(self, current_idx: int, tool_count: int, + name_matches): + if current_idx == -1 or current_idx < tool_count - 1: + next_idx = current_idx + 1 + if (next_idx < tool_count + and not self.streaming_state["sent_tools"][next_idx] + ["sent_name"]): + self.streaming_state["current_tool_index"] = next_idx + self.current_tool_id = next_idx + current_idx = next_idx + tool_name = name_matches[current_idx].group(1) + tool_id = f"call_{current_idx}_{random_uuid()}" + self.streaming_state["tool_ids"][current_idx] = tool_id + delta = DeltaMessage(tool_calls=[ + DeltaToolCall( + index=current_idx, + type="function", + id=tool_id, + function=DeltaFunctionCall(name=tool_name).model_dump( + exclude_none=True), + ) + ]) + self.streaming_state["sent_tools"][current_idx][ + "sent_name"] = True + self.current_tool_name_sent = True + while len(self.streamed_args) <= current_idx: + self.streamed_args.append("") + return delta + return None + + def _handle_tool_args_streaming(self, current_text: 
str, current_idx: int, + tool_count: int): + + if current_idx >= 0 and current_idx < tool_count: + empty_args_match = self.tool_empty_arg_reg.search(current_text) + if empty_args_match and empty_args_match.start() > 0: + for i in range(tool_count): + if i == current_idx: + if not self.streaming_state["sent_tools"][current_idx][ + "sent_arguments_prefix"]: + self.streaming_state["sent_tools"][current_idx][ + "sent_arguments_prefix"] = True + self.streaming_state["sent_tools"][current_idx][ + "sent_arguments"] = "{}" + while len(self.streamed_args) <= current_idx: + self.streamed_args.append("") + self.streamed_args[current_idx] += "{}" + delta = DeltaMessage(tool_calls=[ + DeltaToolCall( + index=current_idx, + function=DeltaFunctionCall( + arguments="{}").model_dump( + exclude_none=True), + ) + ]) + if current_idx < tool_count - 1: + self.streaming_state["current_tool_index"] += 1 + self.current_tool_id = self.streaming_state[ + "current_tool_index"] + return delta + + args_matches = list( + self.tool_non_empty_arg_reg.finditer(current_text)) + if current_idx < len(args_matches): + args_text = args_matches[current_idx].group(1) + is_last_tool = current_idx == tool_count - 1 + if not is_last_tool: + next_tool_pos = current_text.find( + "},{", args_matches[current_idx].start()) + if next_tool_pos != -1: + args_end_pos = (next_tool_pos + 1) + args_text = ( + current_text[args_matches[current_idx].start( + ):args_end_pos].split('"arguments":')[1].strip()) + sent_args = self.streaming_state["sent_tools"][current_idx][ + "sent_arguments"] + if not self.streaming_state["sent_tools"][current_idx][ + "sent_arguments_prefix"] and args_text.startswith("{"): + self.streaming_state["sent_tools"][current_idx][ + "sent_arguments_prefix"] = True + self.streaming_state["sent_tools"][current_idx][ + "sent_arguments"] = "{" + while len(self.streamed_args) <= current_idx: + self.streamed_args.append("") + self.streamed_args[current_idx] += "{" + delta = DeltaMessage(tool_calls=[ + DeltaToolCall( + index=current_idx, + function=DeltaFunctionCall( + arguments="{").model_dump(exclude_none=True), + ) + ]) + return delta + + if args_text.startswith(sent_args): + args_diff = args_text[len(sent_args):] + if args_diff: + self.streaming_state["sent_tools"][current_idx][ + "sent_arguments"] = args_text + while len(self.streamed_args) <= current_idx: + self.streamed_args.append("") + self.streamed_args[current_idx] += args_diff + delta = DeltaMessage(tool_calls=[ + DeltaToolCall( + index=current_idx, + function=DeltaFunctionCall( + arguments=args_diff).model_dump( + exclude_none=True), + ) + ]) + return delta + + if args_text.endswith("}") and args_text == sent_args: + if current_idx < tool_count - 1: + self.streaming_state["current_tool_index"] += 1 + self.current_tool_id = self.streaming_state[ + "current_tool_index"] + return None diff --git a/vllm/model_executor/layers/fused_moe/configs/E=64,N=1536,device_name=NVIDIA_H20,dtype=fp8_w8a8.json b/vllm/model_executor/layers/fused_moe/configs/E=64,N=1536,device_name=NVIDIA_H20,dtype=fp8_w8a8.json new file mode 100644 index 00000000000..298a36175e6 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=64,N=1536,device_name=NVIDIA_H20,dtype=fp8_w8a8.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "4": { + 
"BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 3 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 4 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 4 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "128": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "1024": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "1536": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 5 + }, + "2048": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 5 + }, + "3072": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 4 + }, + "4096": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 3 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=64,N=3072,device_name=NVIDIA_H20,dtype=fp8_w8a8.json b/vllm/model_executor/layers/fused_moe/configs/E=64,N=3072,device_name=NVIDIA_H20,dtype=fp8_w8a8.json new file mode 100644 index 00000000000..0e210cb0f38 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=64,N=3072,device_name=NVIDIA_H20,dtype=fp8_w8a8.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 256, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 5 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 256, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 2 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 4 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + 
"BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 256, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 3 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 256, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 256, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 2 + }, + "128": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 4 + }, + "256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 4 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "1024": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "1536": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 4 + }, + "2048": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 256, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 8, + "num_stages": 4 + }, + "3072": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 5 + }, + "4096": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 4 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=64,N=3072,device_name=NVIDIA_H20.json b/vllm/model_executor/layers/fused_moe/configs/E=64,N=3072,device_name=NVIDIA_H20.json new file mode 100644 index 00000000000..e4fa1e2e6e9 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=64,N=3072,device_name=NVIDIA_H20.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 5 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 4 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 8, + "num_stages": 4 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 3 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "128": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 256, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 3 + }, + 
"256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 4 + }, + "1024": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "1536": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 4 + }, + "2048": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 4 + }, + "3072": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 5 + }, + "4096": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=64,N=384,device_name=NVIDIA_H20,dtype=fp8_w8a8.json b/vllm/model_executor/layers/fused_moe/configs/E=64,N=384,device_name=NVIDIA_H20,dtype=fp8_w8a8.json new file mode 100644 index 00000000000..082456d319d --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=64,N=384,device_name=NVIDIA_H20,dtype=fp8_w8a8.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "128": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "1024": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "1536": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "2048": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 
64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "3072": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "4096": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 4 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=64,N=384,device_name=NVIDIA_H20.json b/vllm/model_executor/layers/fused_moe/configs/E=64,N=384,device_name=NVIDIA_H20.json new file mode 100644 index 00000000000..c3b2e7fa91e --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=64,N=384,device_name=NVIDIA_H20.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 4 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 5 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 5 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 5 + }, + "128": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 4 + }, + "256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 8, + "num_stages": 3 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "1024": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "1536": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "2048": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "3072": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "4096": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_H20,dtype=fp8_w8a8.json b/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_H20,dtype=fp8_w8a8.json new file mode 100644 index 00000000000..bba1d21aa2b --- /dev/null +++ 
b/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_H20,dtype=fp8_w8a8.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 5 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "128": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "1024": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 5 + }, + "1536": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "2048": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 4 + }, + "3072": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "4096": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_H20.json b/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_H20.json new file mode 100644 index 00000000000..de1c413b6e1 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_H20.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, 
+ "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 3 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "128": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 4 + }, + "256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "1024": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "1536": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "2048": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "3072": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 4 + }, + "4096": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + } +} diff --git a/vllm/reasoning/hunyuan_a13b_reasoning_parser.py b/vllm/reasoning/hunyuan_a13b_reasoning_parser.py index fb29d51eae8..b2452b95c1c 100644 --- a/vllm/reasoning/hunyuan_a13b_reasoning_parser.py +++ b/vllm/reasoning/hunyuan_a13b_reasoning_parser.py @@ -83,6 +83,13 @@ def __init__(self, tokenizer: PreTrainedTokenizerBase): def is_reasoning_end(self, input_ids: list[int]) -> bool: return self.current_state == "response" + def extract_content_ids(self, input_ids: list[int]) -> list[int]: + # for hunyuan streaming reason parsing, the stream parse + # will call first, and the same token will be called in + # is_reasoning_end and extract_content_ids + # this id is not part of content, so just return [] here. 
+ return [] + def extract_reasoning_content( self, model_output: str, request: ChatCompletionRequest ) -> tuple[Optional[str], Optional[str]]: From f8ab8f9852129106ecfef70b715bc4be482b54cc Mon Sep 17 00:00:00 2001 From: kYLe Date: Thu, 17 Jul 2025 05:07:55 -0500 Subject: [PATCH 153/552] [VLM] Add Nemotron-Nano-VL-8B-V1 support (#20349) Signed-off-by: Kyle Huang Co-authored-by: Cyrus Leung Signed-off-by: x22x22 --- docker/Dockerfile.cpu | 2 +- docs/models/supported_models.md | 1 + examples/offline_inference/vision_language.py | 39 ++ requirements/test.in | 1 + requirements/test.txt | 16 +- .../multimodal/processing/test_common.py | 1 + .../multimodal/processing/test_nemotron_vl.py | 134 +++++ tests/models/registry.py | 2 + vllm/model_executor/models/nemotron_vl.py | 505 ++++++++++++++++++ vllm/model_executor/models/registry.py | 1 + vllm/transformers_utils/configs/nemotron.py | 2 +- 11 files changed, 701 insertions(+), 3 deletions(-) create mode 100644 tests/models/multimodal/processing/test_nemotron_vl.py create mode 100644 vllm/model_executor/models/nemotron_vl.py diff --git a/docker/Dockerfile.cpu b/docker/Dockerfile.cpu index 5da2c9467bf..982c1ddf274 100644 --- a/docker/Dockerfile.cpu +++ b/docker/Dockerfile.cpu @@ -95,7 +95,7 @@ WORKDIR /workspace/vllm RUN --mount=type=bind,src=requirements/test.in,target=requirements/test.in \ cp requirements/test.in requirements/cpu-test.in && \ sed -i '/mamba_ssm/d' requirements/cpu-test.in && \ - sed -i 's/torch==.*/torch==2.6.0/g' requirements/cpu-test.in && \ + sed -i 's/^torch==.*/torch==2.6.0/g' requirements/cpu-test.in && \ sed -i 's/torchaudio.*/torchaudio/g' requirements/cpu-test.in && \ sed -i 's/torchvision.*/torchvision/g' requirements/cpu-test.in && \ uv pip compile requirements/cpu-test.in -o requirements/cpu-test.txt --index-strategy unsafe-best-match --torch-backend cpu diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index cbb2236eed5..ad5bf43f7fd 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -584,6 +584,7 @@ Specified using `--task generate`. | `KeyeForConditionalGeneration` | Keye-VL-8B-Preview | T + IE+ + VE+ | `Kwai-Keye/Keye-VL-8B-Preview` | | | ✅︎ | | `KimiVLForConditionalGeneration` | Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking | T + I+ | `moonshotai/Kimi-VL-A3B-Instruct`, `moonshotai/Kimi-VL-A3B-Thinking` | | | ✅︎ | | `Llama4ForConditionalGeneration` | Llama 4 | T + I+ | `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc. | | ✅︎ | ✅︎ | +| `Llama_Nemotron_Nano_VL` | Llama Nemotron Nano VL | T + IE+ | `nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1` | ✅︎ | ✅︎ | ✅︎ | | `LlavaForConditionalGeneration` | LLaVA-1.5, Pixtral (HF Transformers) | T + IE+ | `llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), `mistral-community/pixtral-12b`, etc. | | ✅︎ | ✅︎ | | `LlavaNextForConditionalGeneration` | LLaVA-NeXT | T + IE+ | `llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc. | | ✅︎ | ✅︎ | | `LlavaNextVideoForConditionalGeneration` | LLaVA-NeXT-Video | T + V | `llava-hf/LLaVA-NeXT-Video-7B-hf`, etc. 
| | ✅︎ | ✅︎ | diff --git a/examples/offline_inference/vision_language.py b/examples/offline_inference/vision_language.py index 5bd75a78f2c..e4811c02337 100644 --- a/examples/offline_inference/vision_language.py +++ b/examples/offline_inference/vision_language.py @@ -429,6 +429,44 @@ def run_internvl(questions: list[str], modality: str) -> ModelRequestData: ) +# Nemontron_VL +def run_nemotron_vl(questions: list[str], modality: str) -> ModelRequestData: + model_name = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1" + + engine_args = EngineArgs( + model=model_name, + trust_remote_code=True, + max_model_len=8192, + limit_mm_per_prompt={modality: 1}, + ) + + assert modality == "image" + placeholder = "" + + tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) + messages = [ + [{"role": "user", "content": f"{placeholder}\n{question}"}] + for question in questions + ] + prompts = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + + # Stop tokens for InternVL + # models variants may have different stop tokens + # please refer to the model card for the correct "stop words": + # https://huggingface.co/OpenGVLab/InternVL2-2B/blob/main/conversation.py + stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"] + stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens] + stop_token_ids = [token_id for token_id in stop_token_ids if token_id is not None] + + return ModelRequestData( + engine_args=engine_args, + prompts=prompts, + stop_token_ids=stop_token_ids, + ) + + # Keye-VL def run_keye_vl(questions: list[str], modality: str) -> ModelRequestData: model_name = "Kwai-Keye/Keye-VL-8B-Preview" @@ -1186,6 +1224,7 @@ def run_skyworkr1v(questions: list[str], modality: str) -> ModelRequestData: "h2ovl_chat": run_h2ovl, "idefics3": run_idefics3, "internvl_chat": run_internvl, + "nemotron_vl": run_nemotron_vl, "keye_vl": run_keye_vl, "kimi_vl": run_kimi_vl, "llava": run_llava, diff --git a/requirements/test.in b/requirements/test.in index e8715afaf4f..c6c68891d6a 100644 --- a/requirements/test.in +++ b/requirements/test.in @@ -30,6 +30,7 @@ mamba_ssm # required for plamo2 test matplotlib # required for qwen-vl test mistral_common[opencv] >= 1.8.0 # required for voxtral test num2words # required for smolvlm test +open_clip_torch==2.32.0 # Required for nemotron_vl test opencv-python-headless >= 4.11.0 # required for video test datamodel_code_generator # required for minicpm3 test lm-eval[api]==0.4.8 # required for model evaluation test diff --git a/requirements/test.txt b/requirements/test.txt index 90d8f8ff0bc..aadbab03f6f 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -174,6 +174,8 @@ fsspec==2024.9.0 # fastparquet # huggingface-hub # torch +ftfy==6.3.1 + # via open-clip-torch genai-perf==0.0.8 # via -r requirements/test.in genson==1.3.0 @@ -208,6 +210,7 @@ huggingface-hub==0.33.0 # accelerate # datasets # evaluate + # open-clip-torch # peft # sentence-transformers # timm @@ -414,6 +417,8 @@ nvidia-nvjitlink-cu12==12.8.61 # torch nvidia-nvtx-cu12==12.8.55 # via torch +open-clip-torch==2.32.0 + # via -r requirements/test.in opencensus==0.11.4 # via ray opencensus-context==0.1.3 @@ -615,6 +620,7 @@ referencing==0.35.1 regex==2024.9.11 # via # nltk + # open-clip-torch # sacrebleu # tiktoken # transformers @@ -665,6 +671,7 @@ sacrebleu==2.4.3 safetensors==0.4.5 # via # accelerate + # open-clip-torch # peft # timm # transformers @@ -753,7 +760,9 @@ tiktoken==0.7.0 # lm-eval # mistral-common timm==1.0.11 - 
# via -r requirements/test.in + # via + # -r requirements/test.in + # open-clip-torch tokenizers==0.21.1 # via # -r requirements/test.in @@ -772,6 +781,7 @@ torch==2.7.1+cu128 # lm-eval # mamba-ssm # mteb + # open-clip-torch # peft # runai-model-streamer # sentence-transformers @@ -789,6 +799,7 @@ torchaudio==2.7.1+cu128 torchvision==0.22.1+cu128 # via # -r requirements/test.in + # open-clip-torch # timm tqdm==4.66.6 # via @@ -798,6 +809,7 @@ tqdm==4.66.6 # lm-eval # mteb # nltk + # open-clip-torch # peft # pqdm # sentence-transformers @@ -863,6 +875,8 @@ virtualenv==20.31.2 # via ray vocos==0.1.0 # via -r requirements/test.in +wcwidth==0.2.13 + # via ftfy webcolors==24.11.1 # via jsonschema werkzeug==3.1.3 diff --git a/tests/models/multimodal/processing/test_common.py b/tests/models/multimodal/processing/test_common.py index ab21941fae9..fd584252317 100644 --- a/tests/models/multimodal/processing/test_common.py +++ b/tests/models/multimodal/processing/test_common.py @@ -291,6 +291,7 @@ def _test_processing_correctness_one( "allenai/Molmo-7B-D-0924", "allenai/Molmo-7B-O-0924", "nvidia/NVLM-D-72B", + "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1", "AIDC-AI/Ovis1.6-Gemma2-9B", "AIDC-AI/Ovis1.6-Llama3.2-3B", "AIDC-AI/Ovis2-1B", diff --git a/tests/models/multimodal/processing/test_nemotron_vl.py b/tests/models/multimodal/processing/test_nemotron_vl.py new file mode 100644 index 00000000000..3ce88bc427f --- /dev/null +++ b/tests/models/multimodal/processing/test_nemotron_vl.py @@ -0,0 +1,134 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Tests for Nemotron-Nano-VL's multimodal preprocessing kwargs.""" +from collections.abc import Mapping +from typing import Optional + +import pytest +from PIL import Image +from transformers import PretrainedConfig + +from vllm.multimodal import MULTIMODAL_REGISTRY +from vllm.multimodal.image import rescale_image_size +from vllm.multimodal.processing import BaseMultiModalProcessor + +from ....conftest import ImageTestAssets +from ...utils import build_model_context + + +def _get_expected_num_patches( + config: PretrainedConfig, + image: Image.Image, + num_imgs: int, + min_num: int, + max_num: int, +): + from vllm.model_executor.models.internvl import ( + calculate_internvl_targets, get_internvl_target_ratios) + + width, height = image.size + + blocks, _, _ = calculate_internvl_targets( + orig_width=width, + orig_height=height, + target_ratios=get_internvl_target_ratios( + min_num, + max_num, + ), + image_size=config.force_image_size, + use_thumbnail=False, + ) + expected_num_patches = blocks + + if config.use_thumbnail and expected_num_patches > 1: + expected_num_patches += 1 + + return expected_num_patches + + +def _run_check( + processor: BaseMultiModalProcessor, + images: list[Image.Image], + min_num: int, + max_num: int, + mm_processor_kwargs: Mapping[str, object], +): + tokenizer = processor.info.get_tokenizer() + config = processor.info.get_hf_config() + image_processor = processor.info.get_image_processor() + + config.use_thumbnail = image_processor.use_thumbnail + prompt = "" * len(images) + mm_data = {"image": images} + + total_expected_num_patches = sum( + _get_expected_num_patches(config, image, len(images), min_num, max_num) + for image in images) + print(total_expected_num_patches) + processed_inputs = processor.apply(prompt, mm_data, mm_processor_kwargs) + + # Ensure we have the right number of placeholders per num_crops size + image_token_id = tokenizer.convert_tokens_to_ids("") + 
img_tok_count = processed_inputs["prompt_token_ids"].count(image_token_id) + pixel_shape = processed_inputs["mm_kwargs"]["pixel_values_flat"].shape + print("Image token count:", img_tok_count, "Pixel shape:", pixel_shape) + assert img_tok_count == 256 * total_expected_num_patches + assert pixel_shape[0] == total_expected_num_patches + + +@pytest.mark.parametrize("model_id", + ["nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"]) +@pytest.mark.parametrize( + "size_factors", + [ + # Single-scale + [1.0], + # Single-scale, batched + [1.0, 1.0, 1.0], + # Multi-scale + [0.25, 0.5, 1.0], + [4.0, 2.0, 1.0], + ], +) +@pytest.mark.parametrize( + ("min_dynamic_patch", "max_dynamic_patch"), + [(1, 1), (1, 2), (1, 4), (1, 8), (2, 4), (4, 8)], +) +@pytest.mark.parametrize("dynamic_image_size", [True, False]) +@pytest.mark.parametrize("kwargs_on_init", [True, False]) +def test_processor_override( + model_id: str, + image_assets: ImageTestAssets, + size_factors: list[int], + min_dynamic_patch: int, + max_dynamic_patch: int, + dynamic_image_size: Optional[bool], + kwargs_on_init: bool, +): + mm_processor_kwargs = { + "min_dynamic_patch": min_dynamic_patch, + "max_dynamic_patch": max_dynamic_patch, + "dynamic_image_size": dynamic_image_size, + } + + ctx = build_model_context( + model_id, + mm_processor_kwargs=mm_processor_kwargs if kwargs_on_init else None, + limit_mm_per_prompt={"image": len(size_factors)}, + ) + processor = MULTIMODAL_REGISTRY.create_processor(ctx.model_config) + hf_processor_mm_kwargs = {} if kwargs_on_init else mm_processor_kwargs + + min_num = min_dynamic_patch if dynamic_image_size else 1 + max_num = max_dynamic_patch if dynamic_image_size else 1 + + _run_check( + processor, + [ + rescale_image_size(image_assets[0].pil_image, f) + for f in size_factors + ], + min_num, + max_num, + hf_processor_mm_kwargs, + ) diff --git a/tests/models/registry.py b/tests/models/registry.py index d2e70e291df..2adfa859a1c 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -401,6 +401,8 @@ def check_available_online( trust_remote_code=True), "NVLM_D": _HfExamplesInfo("nvidia/NVLM-D-72B", trust_remote_code=True), + "Llama_Nemotron_Nano_VL" : _HfExamplesInfo("nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1", # noqa: E501 + trust_remote_code=True), "PaliGemmaForConditionalGeneration": _HfExamplesInfo("google/paligemma-3b-mix-224", # noqa: E501 extras={"v2": "google/paligemma2-3b-ft-docci-448"}), # noqa: E501 "Phi3VForCausalLM": _HfExamplesInfo("microsoft/Phi-3-vision-128k-instruct", diff --git a/vllm/model_executor/models/nemotron_vl.py b/vllm/model_executor/models/nemotron_vl.py new file mode 100644 index 00000000000..5d0513d7074 --- /dev/null +++ b/vllm/model_executor/models/nemotron_vl.py @@ -0,0 +1,505 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +# adapted from https://huggingface.co/OpenGVLab/InternVL2-4B/blob/main/modeling_internvl_chat.py +# -------------------------------------------------------- +# InternVL +# Copyright (c) 2023 OpenGVLab +# Licensed under The MIT License [see LICENSE for details] +# -------------------------------------------------------- +from abc import ABC +from collections.abc import Iterable +from typing import Optional + +import torch +import torch.nn as nn +from PIL import Image +from transformers import AutoModel, PretrainedConfig +from transformers.image_processing_utils_fast import BaseImageProcessorFast + +from vllm.config import VllmConfig +from vllm.model_executor.layers.quantization import 
QuantizationConfig +from vllm.model_executor.layers.quantization.awq import AWQConfig +from vllm.model_executor.models.internvl import ( + BaseInternVLDummyInputsBuilder, BaseInternVLMultiModalProcessor, + BaseInternVLProcessingInfo, InternVLImageEmbeddingInputs, + InternVLImageInputs, InternVLImagePixelInputs, InternVLProcessor) +from vllm.model_executor.models.module_mapping import MultiModelKeys +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.multimodal import MULTIMODAL_REGISTRY +from vllm.multimodal.inputs import NestedTensors +from vllm.multimodal.processing import PromptUpdateDetails +from vllm.sequence import IntermediateTensors +from vllm.transformers_utils.processor import ( + cached_image_processor_from_config) +from vllm.transformers_utils.tokenizer import AnyTokenizer + +from .interfaces import (MultiModalEmbeddings, SupportsLoRA, + SupportsMultiModal, SupportsPP) +from .utils import (AutoWeightsLoader, flatten_bn, init_vllm_registered_model, + maybe_prefix, merge_multimodal_embeddings) + +IMG_START = '' +IMG_END = '' +IMG_CONTEXT = '' + + +class NemotronVLProcessor(InternVLProcessor): + + def __init__( + self, + config: PretrainedConfig, + tokenizer: AnyTokenizer, + image_processor: BaseImageProcessorFast, + *, + min_dynamic_patch: Optional[int] = None, + max_dynamic_patch: Optional[int] = None, + dynamic_image_size: Optional[bool] = None, + ) -> None: + ABC.__init__(self) + self.config = config + self.tokenizer = tokenizer + self.image_processor = image_processor + image_size: int = config.force_image_size + patch_size: int = config.patch_size + + if min_dynamic_patch is None: + min_dynamic_patch = 1 + assert isinstance(min_dynamic_patch, int) + + if max_dynamic_patch is None: + max_dynamic_patch = self.image_processor.max_num_tiles + assert isinstance(max_dynamic_patch, int) + + if dynamic_image_size is None: + dynamic_image_size = True + assert isinstance(dynamic_image_size, bool) + + self.num_image_token = int( + (image_size // patch_size)**2 * (config.downsample_ratio**2)) + self.image_size = image_size + self.min_dynamic_patch = min_dynamic_patch + self.max_dynamic_patch = max_dynamic_patch + self.dynamic_image_size = dynamic_image_size + self.use_thumbnail: bool = self.image_processor.use_thumbnail + + @property + def image_token_id(self) -> int: + return self.tokenizer.get_vocab()[IMG_CONTEXT] + + def _preprocess_image( + self, + text: list[str], + images: list[Image.Image], + min_dynamic_patch: Optional[int] = None, + max_dynamic_patch: Optional[int] = None, + dynamic_image_size: Optional[bool] = None, + ) -> tuple[list[str], dict[str, torch.Tensor]]: + if len(images) == 0: + image_inputs = {} + else: + pixel_values_lst = self._images_to_pixel_values_lst( + images, + min_dynamic_patch=min_dynamic_patch, + max_dynamic_patch=max_dynamic_patch, + dynamic_image_size=dynamic_image_size, + ) + image_inputs: dict[str, NestedTensors] = { + "pixel_values_flat": + torch.cat(pixel_values_lst), + "image_num_patches": + torch.tensor([len(item) for item in pixel_values_lst]), + } + + for pixel_values in pixel_values_lst: + num_patches = pixel_values.shape[0] + feature_size = num_patches * self.num_image_token + image_repl = self.get_image_repl(feature_size, num_patches) + NVL_IMAGE_CONTEXT = image_repl.full.replace( + "", "") + text = [ + t.replace('', NVL_IMAGE_CONTEXT, 1) for t in text + ] + text = [t.replace("", IMG_CONTEXT) for t in text] + return text, image_inputs + + def get_image_repl( + self, + feature_size: int, + num_patches: 
Optional[int], + ) -> PromptUpdateDetails[str]: + repl_features = IMG_CONTEXT * feature_size + repl_full = IMG_START + repl_features + IMG_END + + return PromptUpdateDetails.select_text(repl_full, IMG_CONTEXT) + + +class NemotronVLProcessingInfo(BaseInternVLProcessingInfo): + """Processing info for Nemotron VL models.""" + + def get_hf_processor( + self, + *, + min_dynamic_patch: Optional[int] = None, + max_dynamic_patch: Optional[int] = None, + dynamic_image_size: Optional[bool] = None, + **kwargs: object, + ) -> NemotronVLProcessor: + if min_dynamic_patch is not None: + kwargs["min_dynamic_patch"] = min_dynamic_patch + if max_dynamic_patch is not None: + kwargs["max_dynamic_patch"] = max_dynamic_patch + if dynamic_image_size is not None: + kwargs["dynamic_image_size"] = dynamic_image_size + + image_processor = self.get_image_processor() + return self.ctx.init_processor( + NemotronVLProcessor, + config=self.get_hf_config(), + tokenizer=self.get_tokenizer(), + image_processor=image_processor, + **kwargs, + ) + + def get_image_processor( + self, + **kwargs: object, + ): + return cached_image_processor_from_config( + self.ctx.model_config, + **kwargs, + ) + + +@MULTIMODAL_REGISTRY.register_processor( + BaseInternVLMultiModalProcessor[NemotronVLProcessingInfo], + info=NemotronVLProcessingInfo, + dummy_inputs=BaseInternVLDummyInputsBuilder[NemotronVLProcessingInfo]) +class LlamaNemotronVLChatModel(nn.Module, SupportsMultiModal, SupportsPP, + SupportsLoRA): + + @classmethod + def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]: + if modality.startswith("image"): + return "" + + raise ValueError("Only image modality is supported") + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None: + super().__init__() + + config = vllm_config.model_config.hf_config + quant_config = vllm_config.quant_config + multimodal_config = vllm_config.model_config.multimodal_config + + self.config = config + self.multimodal_config = multimodal_config + self._patch_quant_config(config, quant_config) + + image_size = config.force_image_size or config.vision_config.image_size + patch_size = config.vision_config.patch_size + self.patch_size = patch_size + self.num_image_token = int( + (image_size // patch_size)**2 * (config.downsample_ratio**2)) + self.downsample_ratio = config.downsample_ratio + self.ps_version = config.ps_version + + self.llm_arch_name = config.text_config.architectures[0] + self.vision_model = self._init_vision_model( + config, + quant_config=quant_config, + prefix=maybe_prefix(prefix, "vision_model"), + ) + + self.language_model = init_vllm_registered_model( + vllm_config=vllm_config, + hf_config=config.text_config, + prefix=maybe_prefix(prefix, "language_model"), + ) + + self.mlp1 = self._init_mlp1(config) + + self.img_context_token_id = None + + self.visual_token_mask = None + self.make_empty_intermediate_tensors = ( + self.language_model.make_empty_intermediate_tensors) + + def _patch_quant_config(self, config: PretrainedConfig, + quant_config: QuantizationConfig): + # the awq models from OpenGVLab missing `modules_to_not_convert` + # patch the quant_config to add `modules_to_not_convert` back + if isinstance(quant_config, AWQConfig): + text_config = config.text_config + llm_quant_config = getattr(text_config, "quantization_config", + None) + if (not quant_config.modules_to_not_convert) and \ + (llm_quant_config is not None): + quant_config.modules_to_not_convert.append("vision_model") + + def _init_vision_model( + self, + config: PretrainedConfig, + 
quant_config: Optional[QuantizationConfig], + *, + prefix: str, + ): + return AutoModel.from_config(config.vision_config, + trust_remote_code=True) + + def _init_mlp1(self, config: PretrainedConfig) -> nn.Sequential: + vit_hidden_size = config.vit_hidden_size + vision_projection_hidden_size = config.projector_hidden_size + llm_hidden_size = config.text_config.hidden_size + + return nn.Sequential( + nn.LayerNorm(vit_hidden_size * int(1 / self.downsample_ratio)**2, + bias=True), + nn.Linear(vit_hidden_size * int(1 / self.downsample_ratio)**2, + vision_projection_hidden_size, + bias=True), + nn.GELU(), + nn.Linear(vision_projection_hidden_size, llm_hidden_size), + ) + + def pixel_shuffle(self, x, scale_factor=0.5): + n, w, h, c = x.size() + # N, W, H, C --> N, W, H * scale, C // scale + x = x.view(n, w, int(h * scale_factor), int(c / scale_factor)) + # N, W, H * scale, C // scale --> N, H * scale, W, C // scale + x = x.permute(0, 2, 1, 3).contiguous() + x = x.view(n, int(h * scale_factor), int(w * scale_factor), + int(c / (scale_factor * scale_factor))) + if self.ps_version == 'v1': + pass + else: + x = x.permute(0, 2, 1, 3).contiguous() + return x + + def extract_feature(self, pixel_values: torch.Tensor) -> torch.Tensor: + # https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1/blob/main/modeling.py#L177 + vit_embeds = self.vision_model(x=pixel_values).features + vit_embeds = vit_embeds.to(dtype=torch.bfloat16) + + h = w = int(vit_embeds.shape[1]**0.5) + vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1) + vit_embeds = self.pixel_shuffle(vit_embeds, + scale_factor=self.downsample_ratio) + vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, + vit_embeds.shape[-1]) + vit_embeds = self.mlp1(vit_embeds) + return vit_embeds + + def _validate_pixel_values(self, data: torch.Tensor) -> torch.Tensor: + + #use force_image_size to get image_size + h = w = self.config.force_image_size + expected_dims = (3, h, w) + + def _validate_shape(d: torch.Tensor): + actual_dims = tuple(d.shape) + + if actual_dims != expected_dims: + expected_expr = str(expected_dims) + raise ValueError( + "The expected shape of pixel values per image per batch " + f" per patch is {expected_expr}. " + f"You supplied {tuple(d.shape)}.") + + for d in data: + _validate_shape(d) + + return data + + def _parse_and_validate_image_input( + self, **kwargs: object) -> Optional[InternVLImageInputs]: + pixel_values_flat = kwargs.pop("pixel_values_flat", None) + image_num_patches = kwargs.pop("image_num_patches", None) + image_embeds = kwargs.pop("image_embeds", None) + + if pixel_values_flat is None and image_embeds is None: + return None + + if image_embeds is not None: + if not isinstance(image_embeds, (torch.Tensor, list)): + raise ValueError("Incorrect type of image embeddings. " + f"Got type: {type(image_embeds)}") + + return InternVLImageEmbeddingInputs( + type="image_embeds", + data=flatten_bn(image_embeds), + ) + + image_token_id = kwargs["image_token_id"] + assert isinstance(image_token_id, torch.Tensor) + self.img_context_token_id = image_token_id.flatten().unique().item() + + if pixel_values_flat is not None: + if not isinstance(pixel_values_flat, (torch.Tensor, list)): + raise ValueError("Incorrect type of pixel values. " + f"Got type: {type(pixel_values_flat)}") + + if not isinstance(image_num_patches, (torch.Tensor, list)): + raise ValueError("Incorrect type of image_num_patches. 
" + f"Got type: {type(image_num_patches)}") + + pixel_values_flat = flatten_bn(pixel_values_flat, concat=True) + image_num_patches = flatten_bn(image_num_patches, concat=True) + + return InternVLImagePixelInputs( + type="pixel_values", + pixel_values_flat=self._validate_pixel_values( + pixel_values_flat), + num_patches=image_num_patches, + ) + + raise AssertionError("This line should be unreachable.") + + def _process_image_input( + self, + image_input: InternVLImageInputs, + ) -> tuple[torch.Tensor, ...]: + if image_input["type"] == "image_embeds": + return image_input["data"] + + assert self.vision_model is not None + + image_embeds = self.extract_feature(image_input["pixel_values_flat"]) + + num_patches = image_input["num_patches"] + + # Only one image in the current batch + if len(num_patches) == 1: + return (image_embeds.view(-1, + self.config.text_config.hidden_size), ) + + # NOTE: Image embeddings are split into separate tensors for each image + # by the size of each embedding. + feature_size = image_embeds.shape[1] + image_embeds = image_embeds.view(-1, + self.config.text_config.hidden_size) + image_feature_sizes = [ + num_patches * feature_size for num_patches in num_patches + ] + return image_embeds.split(image_feature_sizes) + + def _parse_and_validate_multimodal_inputs(self, **kwargs: object) -> dict: + modalities = {} + + # Preserve the order of modalities if there are multiple of them + # from the order of kwargs. + for input_key in kwargs: + if input_key in ("pixel_values_flat", + "image_embeds") and "images" not in modalities: + modalities["images"] = self._parse_and_validate_image_input( + **kwargs) + + return modalities + + def _set_visual_token_mask(self, input_ids: torch.Tensor) -> None: + self.visual_token_mask = None + + def get_language_model(self) -> torch.nn.Module: + return self.language_model + + def get_multimodal_embeddings(self, + **kwargs: object) -> MultiModalEmbeddings: + + modalities = self._parse_and_validate_multimodal_inputs(**kwargs) + if not modalities: + return [] + + # The result multimodal_embeddings is tuple of tensors, with each + # tensor correspoending to a multimodal data item (image). + multimodal_embeddings: tuple[torch.Tensor, ...] = () + + # NOTE: It is important to iterate over the keys in this dictionary + # to preserve the order of the modalities. 
+ for modality in modalities: + if modality == "images": + image_input = modalities["images"] + vision_embeddings = self._process_image_input(image_input) + multimodal_embeddings += vision_embeddings + + return multimodal_embeddings + + def get_input_embeddings( + self, + input_ids: torch.Tensor, + multimodal_embeddings: Optional[MultiModalEmbeddings] = None, + ) -> torch.Tensor: + inputs_embeds = self.language_model.get_input_embeddings(input_ids) + if multimodal_embeddings is not None \ + and len(multimodal_embeddings) != 0: + context_token_ids = [self.img_context_token_id] + assert len(context_token_ids) >= 1 + self._set_visual_token_mask(input_ids) + inputs_embeds = merge_multimodal_embeddings( + input_ids, + inputs_embeds, + multimodal_embeddings, + context_token_ids, + ) + return inputs_embeds + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + **kwargs: object, + ) -> IntermediateTensors: + + if intermediate_tensors is not None: + input_ids = None + inputs_embeds = None + + # NOTE: In v1, inputs_embeds is always generated at model runner, this + # condition is for v0 compatibility. + elif inputs_embeds is None: + vision_embeddings = self.get_multimodal_embeddings(**kwargs) + inputs_embeds = self.get_input_embeddings(input_ids, + vision_embeddings) + input_ids = None + + forward_kwargs = { + "input_ids": input_ids, + "positions": positions, + "intermediate_tensors": intermediate_tensors, + "inputs_embeds": inputs_embeds, + } + + # Only required if the model is mono-architecture + if self.visual_token_mask is not None: + forward_kwargs.update( + {"visual_token_mask": self.visual_token_mask}) + self.visual_token_mask = None + + hidden_states = self.language_model.model(**forward_kwargs) + return hidden_states + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + return self.language_model.compute_logits(hidden_states, + sampling_metadata) + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + ## Ignore registered_buffers + ## see https://huggingface.co/nvidia/C-RADIOv2-H/blob/main/input_conditioner.py#L28 # noqa: E501 + skip_substrs = ["norm_mean", "norm_std"] + loader = AutoWeightsLoader(self, skip_substrs=skip_substrs) + return loader.load_weights(weights) + + def get_mm_mapping(self) -> MultiModelKeys: + """ + Get the module prefix in multimodal models + """ + return MultiModelKeys.from_string_field( + language_model="language_model", + connector="mlp1", + tower_model="vision_model") diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index bc936500bdc..52fdb910891 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -206,6 +206,7 @@ "SmolVLMForConditionalGeneration": ("smolvlm","SmolVLMForConditionalGeneration"), # noqa: E501 "KeyeForConditionalGeneration": ("keye", "KeyeForConditionalGeneration"), "KimiVLForConditionalGeneration": ("kimi_vl", "KimiVLForConditionalGeneration"), # noqa: E501 + "Llama_Nemotron_Nano_VL": ("nemotron_vl", "LlamaNemotronVLChatModel"), "LlavaForConditionalGeneration": ("llava", "LlavaForConditionalGeneration"), "LlavaNextForConditionalGeneration": ("llava_next", "LlavaNextForConditionalGeneration"), # noqa: E501 "LlavaNextVideoForConditionalGeneration": ("llava_next_video", "LlavaNextVideoForConditionalGeneration"), # 
noqa: E501 diff --git a/vllm/transformers_utils/configs/nemotron.py b/vllm/transformers_utils/configs/nemotron.py index d65b572dc7f..9a7243b1262 100644 --- a/vllm/transformers_utils/configs/nemotron.py +++ b/vllm/transformers_utils/configs/nemotron.py @@ -202,4 +202,4 @@ def _rope_scaling_validation(self): rope_scaling_factor, float) or rope_scaling_factor <= 1.0: raise ValueError( "`rope_scaling`'s factor field must be a float > 1, got " - f"{rope_scaling_factor}") + f"{rope_scaling_factor}") \ No newline at end of file From 1b7398cd3fcbe638108f030d47ac39113c22148b Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Thu, 17 Jul 2025 12:13:00 +0100 Subject: [PATCH 154/552] [Docs] Improve docstring formatting for `FusedMoEParallelConfig.make` (#21117) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- .../model_executor/layers/fused_moe/config.py | 62 ++++++++++--------- 1 file changed, 34 insertions(+), 28 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/config.py b/vllm/model_executor/layers/fused_moe/config.py index 432617ba046..def1c2b4556 100644 --- a/vllm/model_executor/layers/fused_moe/config.py +++ b/vllm/model_executor/layers/fused_moe/config.py @@ -192,68 +192,74 @@ def use_deepep_ll_kernels(self): def make(tp_size_: int, dp_size_: int, vllm_parallel_config: ParallelConfig) -> "FusedMoEParallelConfig": """ - Determine MoE parallel configuration. Based on the input tp_size_, - dp_size_, ep_size_ and vllm's parallel config, determine what + Determine MoE parallel configuration. Based on the input `tp_size_`, + `dp_size_` and vllm's parallel config, determine what level's of parallelism to use in the fused moe layer. Args: - tp_size_ (int): tp_size passed into the FusedMoE constructor. - dp_size_ (int): dp_size passed into the FusedMoE constructor. - ep_size_ (int): ep_size passed into the FusedMoE constructor. - vllm_parallel_config (ParallelConfig): vllm's parallel config - object. + tp_size_ (int): `tp_size` passed into the FusedMoE constructor. + dp_size_ (int): `dp_size` passed into the FusedMoE constructor. + vllm_parallel_config (ParallelConfig): vLLM's parallel config + object which contains the `enable_expert_parallel` flag. Examples: - When there is no parallelism requested, i.e. tp_size_ = dp_size_ = 1, - we simply return the sizes unaltered and the ranks set to 0. + When there is no parallelism requested, + i.e. `tp_size_` = `dp_size_` = 1, we simply return the sizes + unaltered and the ranks set to 0. - Expert Parallelism is considered only when either dp_size_ or tp_size_ - is non trivial. + Expert Parallelism is considered only when either `dp_size_` or + `tp_size_` is non trivial. + + When TP = 2, DP = 1 and EP = False, the configuration on different + devices: - When TP = 2, DP = 1 and EP = False, the configuration on different - devices, - device 0 : TP = {2, 0} DP = {1, 0} EP = {1, 0} // - legend : {size, rank} + legend : {size, rank} - device 1 : TP = {2, 1} DP = {1, 0} EP = {1, 0} - Comment : Tensors are sharded across 2 devices. - When TP = 1, DP = 2 and EP = False, the configuration on different - devices, + When TP = 1, DP = 2 and EP = False, the configuration on different + devices: + - device 0 : TP = {2, 0} DP = {2, 0} EP = {1, 0} - device 1 : TP = {2, 1} DP = {2, 1} EP = {1, 0} - Comment: There are 2 engine instances and the tensors are sharded - across 2 decvices. + across 2 decvices. 
+ + When TP = 2, DP = 2 and EP = False, the configuration on different + devices: - When TP = 2, DP = 2 and EP = False, the configuration on different - devices, - device 0: TP = {4, 0} DP = {2, 0} EP = {1, 0} - device 1: TP = {4, 1} DP = {2, 0} EP = {1, 0} - device 2: TP = {4, 2} DP = {2, 1} EP = {1, 0} - device 3: TP = {4, 3} DP = {2, 1} EP = {1, 0} - Comment: There are 2 engine instances and the tensors are sharded - across 4 devices. + across 4 devices. + + When, TP = 2, DP = 1 and EP = True, the configuration on different + devices: - When, TP = 2, DP = 1 and EP = True, the configuration on different - devices, - device 0: TP = {1, 0} DP = {1, 0} EP = {2, 0} - device 1: TP = {1, 0} DP = {1, 0} EP = {2, 1} - Comment: The experts are split between the 2 devices. - When, TP = 1, DP = 2 and EP = True, the configuration on different - devices, + When, TP = 1, DP = 2 and EP = True, the configuration on different + devices: + - device 0: TP = {1, 0} DP = {2, 0} EP = {2, 0} - device 1: TP = {1, 0} DP = {2, 1} EP = {2, 1} - Comment: There are 2 engine instances and the experts are split - between the 2 devices. + between the 2 devices. + + When TP = 2, DP = 2 and EP = True, the configuration on different + devices: - When TP = 2, DP = 2 and EP = True, the configuration on different - devices, - device 0: TP = {1, 0} DP = {2, 0} EP = {4, 0} - device 1: TP = {1, 0} DP = {2, 0} EP = {4, 1} - device 2: TP = {1, 0} DP = {2, 1} EP = {4, 2} - device 3: TP = {1, 0} DP = {2, 1} EP = {4, 3} - Comment: There are 2 engine instances and the experts are split - between the 4 devices. + between the 4 devices. """ def flatten_tp_across_dp(dp_rank: int): From 8c458c51dad57894f35e422b7fc352a8186b07e4 Mon Sep 17 00:00:00 2001 From: wangxiyuan Date: Thu, 17 Jul 2025 20:57:41 +0800 Subject: [PATCH 155/552] [Misc] Avoid unnecessary import (#21106) Signed-off-by: wangxiyuan Signed-off-by: x22x22 --- vllm/entrypoints/openai/speech_to_text.py | 2 +- vllm/lora/utils.py | 20 ++++++++++++-------- 2 files changed, 13 insertions(+), 9 deletions(-) diff --git a/vllm/entrypoints/openai/speech_to_text.py b/vllm/entrypoints/openai/speech_to_text.py index e7589a3804c..09b346dcef6 100644 --- a/vllm/entrypoints/openai/speech_to_text.py +++ b/vllm/entrypoints/openai/speech_to_text.py @@ -24,7 +24,6 @@ from vllm.entrypoints.openai.serving_models import OpenAIServingModels from vllm.inputs.data import PromptType from vllm.logger import init_logger -from vllm.model_executor.model_loader import get_model_cls from vllm.model_executor.models import SupportsTranscription from vllm.outputs import RequestOutput from vllm.utils import PlaceholderModule @@ -78,6 +77,7 @@ def __init__( @cached_property def model_cls(self) -> type[SupportsTranscription]: + from vllm.model_executor.model_loader import get_model_cls model_cls = get_model_cls(self.model_config) return cast(type[SupportsTranscription], model_cls) diff --git a/vllm/lora/utils.py b/vllm/lora/utils.py index ee196e3f689..6b3291e9c92 100644 --- a/vllm/lora/utils.py +++ b/vllm/lora/utils.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import os -from typing import Optional, Union +from typing import TYPE_CHECKING, Optional, Union import huggingface_hub import regex as re @@ -31,10 +31,14 @@ RowParallelLinearWithLoRA, VocabParallelEmbeddingWithLoRA) from vllm.model_executor.layers.linear import LinearBase + # yapf: enable -from vllm.model_executor.layers.logits_processor import LogitsProcessor -from 
vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead -from vllm.model_executor.models.utils import WeightsMapper + +if TYPE_CHECKING: + from vllm.model_executor.layers.logits_processor import LogitsProcessor + from vllm.model_executor.layers.vocab_parallel_embedding import ( + ParallelLMHead) + from vllm.model_executor.models.utils import WeightsMapper logger = init_logger(__name__) @@ -75,8 +79,8 @@ def from_layer(layer: nn.Module, def from_layer_logits_processor( - layer: LogitsProcessor, - lm_head: ParallelLMHead, + layer: "LogitsProcessor", + lm_head: "ParallelLMHead", max_loras: int, lora_config: LoRAConfig, model_config: Optional[PretrainedConfig] = None, @@ -98,8 +102,8 @@ def replace_submodule(model: nn.Module, module_name: str, def parse_fine_tuned_lora_name( - name: str, - weights_mapper: Optional[WeightsMapper] = None + name: str, + weights_mapper: Optional["WeightsMapper"] = None ) -> tuple[str, bool, bool]: """Parse the name of lora weights. From b32d48bc3a5f44710c21772a593636dd36820c21 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Thu, 17 Jul 2025 14:12:29 +0100 Subject: [PATCH 156/552] [Docs] Move code block out of admonition now that it's short (#21118) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/design/v1/p2p_nccl_connector.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/docs/design/v1/p2p_nccl_connector.md b/docs/design/v1/p2p_nccl_connector.md index 8f6a2b3b2dd..9f6acf3291d 100644 --- a/docs/design/v1/p2p_nccl_connector.md +++ b/docs/design/v1/p2p_nccl_connector.md @@ -61,11 +61,9 @@ To address the above issues, I have designed and developed a local Tensor memory # Install vLLM -??? 
console "Commands" - - ```shell - pip install "vllm>=0.9.2" - ``` +```shell +pip install "vllm>=0.9.2" +``` # Run xPyD From 112f9cf71948ee8f42b875ad6a4cb8d78e67f670 Mon Sep 17 00:00:00 2001 From: ElizaWszola Date: Thu, 17 Jul 2025 15:56:44 +0200 Subject: [PATCH 157/552] [Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) Signed-off-by: ElizaWszola Signed-off-by: x22x22 --- .../kernels/benchmark_grouped_gemm_cutlass.py | 35 ++++++++++- csrc/moe/moe_permute_unpermute_op.cu | 53 ++++++++++++---- tests/kernels/moe/test_cutlass_moe.py | 14 ++++- tests/kernels/moe/test_pplx_cutlass_moe.py | 22 +++++++ .../layers/fused_moe/cutlass_moe.py | 62 ++++++++++++------- .../compressed_tensors_moe.py | 26 +++++++- 6 files changed, 174 insertions(+), 38 deletions(-) diff --git a/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py b/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py index 1d4e730f99a..a6b42406b5c 100644 --- a/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py +++ b/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py @@ -80,6 +80,11 @@ def bench_run( a, score, topk, renormalize=False ) + ab_strides1 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64) + ab_strides2 = torch.full((num_experts,), n, device="cuda", dtype=torch.int64) + c_strides1 = torch.full((num_experts,), 2 * n, device="cuda", dtype=torch.int64) + c_strides2 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64) + def run_triton_moe( a: torch.Tensor, w1: torch.Tensor, @@ -111,6 +116,10 @@ def run_cutlass_moe( w2: torch.Tensor, w1_scale: torch.Tensor, w2_scale: torch.Tensor, + ab_strides1: torch.Tensor, + ab_strides2: torch.Tensor, + c_strides1: torch.Tensor, + c_strides2: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, per_act_token: bool, @@ -125,6 +134,10 @@ def run_cutlass_moe( topk_ids, w1_scale, w2_scale, + ab_strides1, + ab_strides2, + c_strides1, + c_strides2, per_act_token, a1_scale=None, ) @@ -136,6 +149,10 @@ def run_cutlass_from_graph( w2_q: torch.Tensor, w1_scale: torch.Tensor, w2_scale: torch.Tensor, + ab_strides1: torch.Tensor, + ab_strides2: torch.Tensor, + c_strides1: torch.Tensor, + c_strides2: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, ): @@ -150,6 +167,10 @@ def run_cutlass_from_graph( topk_ids, w1_scale, w2_scale, + ab_strides1, + ab_strides2, + c_strides1, + c_strides2, per_act_token, a1_scale=None, ) @@ -194,6 +215,10 @@ def replay_graph(graph, num_repeats): w2_q, w1_scale, w2_scale, + ab_strides1, + ab_strides2, + c_strides1, + c_strides2, topk_weights, topk_ids, ) @@ -231,6 +256,10 @@ def replay_graph(graph, num_repeats): "w1_scale": w1_scale, "w2_scale": w2_scale, "per_act_token": per_act_token, + "ab_strides1": ab_strides1, + "ab_strides2": ab_strides2, + "c_strides1": c_strides1, + "c_strides2": c_strides2, # cuda graph params "cutlass_graph": cutlass_graph, "triton_graph": triton_graph, @@ -289,6 +318,10 @@ def replay_graph(graph, num_repeats): w2_q, w1_scale, w2_scale, + ab_strides1, + ab_strides2, + c_strides1, + c_strides2, topk_weights, topk_ids, per_act_token, @@ -297,7 +330,7 @@ def replay_graph(graph, num_repeats): results.append( benchmark.Timer( - stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, topk_weights, topk_ids, per_act_token, num_runs)", # noqa: E501 + stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, ab_strides1, ab_strides2, c_strides1, c_strides2, topk_weights, topk_ids, per_act_token, num_runs)", # noqa: E501 globals=globals, label=label, 
sub_label=sub_label, diff --git a/csrc/moe/moe_permute_unpermute_op.cu b/csrc/moe/moe_permute_unpermute_op.cu index a77471a7f20..13aecd8007a 100644 --- a/csrc/moe/moe_permute_unpermute_op.cu +++ b/csrc/moe/moe_permute_unpermute_op.cu @@ -160,6 +160,30 @@ __global__ void shuffleInputRowsKernel(const T* input, } } +template +__global__ void shuffleInputRowsKernelSlow(const T* input, + const int32_t* dst2src_map, + T* output, int64_t num_src_rows, + int64_t num_dst_rows, + int64_t num_cols) { + int64_t dest_row_idx = blockIdx.x; + int64_t const source_row_idx = dst2src_map[dest_row_idx]; + + if (blockIdx.x < num_dst_rows) { + // Duplicate and permute rows + auto const* source_row_ptr = input + source_row_idx * num_cols; + auto* dest_row_ptr = output + dest_row_idx * num_cols; + + int64_t const start_offset = threadIdx.x; + int64_t const stride = blockDim.x; + + for (int elem_index = start_offset; elem_index < num_cols; + elem_index += stride) { + dest_row_ptr[elem_index] = source_row_ptr[elem_index]; + } + } +} + void shuffle_rows(const torch::Tensor& input_tensor, const torch::Tensor& dst2src_map, torch::Tensor& output_tensor) { @@ -173,17 +197,24 @@ void shuffle_rows(const torch::Tensor& input_tensor, int64_t const num_src_rows = input_tensor.size(0); int64_t const num_cols = input_tensor.size(1); - TORCH_CHECK(!(num_cols % (128 / sizeof(input_tensor.scalar_type()) / 8)), - "num_cols must be divisible by 128 / " - "sizeof(input_tensor.scalar_type()) / 8"); - - MOE_DISPATCH(input_tensor.scalar_type(), [&] { - shuffleInputRowsKernel<<>>( - reinterpret_cast(input_tensor.data_ptr()), - dst2src_map.data_ptr(), - reinterpret_cast(output_tensor.data_ptr()), num_src_rows, - num_dest_rows, num_cols); - }); + if (num_cols % (128 / sizeof(input_tensor.scalar_type()) / 8)) { + // use slow kernel if num_cols can't be aligned to 128 bits + MOE_DISPATCH(input_tensor.scalar_type(), [&] { + shuffleInputRowsKernelSlow<<>>( + reinterpret_cast(input_tensor.data_ptr()), + dst2src_map.data_ptr(), + reinterpret_cast(output_tensor.data_ptr()), num_src_rows, + num_dest_rows, num_cols); + }); + } else { + MOE_DISPATCH(input_tensor.scalar_type(), [&] { + shuffleInputRowsKernel<<>>( + reinterpret_cast(input_tensor.data_ptr()), + dst2src_map.data_ptr(), + reinterpret_cast(output_tensor.data_ptr()), num_src_rows, + num_dest_rows, num_cols); + }); + } } #else diff --git a/tests/kernels/moe/test_cutlass_moe.py b/tests/kernels/moe/test_cutlass_moe.py index 5fac7166bc2..5fb49c2da4f 100644 --- a/tests/kernels/moe/test_cutlass_moe.py +++ b/tests/kernels/moe/test_cutlass_moe.py @@ -206,6 +206,10 @@ def run_8_bit(moe_tensors: MOETensors8Bit, 'topk_ids': topk_ids, 'w1_scale': moe_tensors.w1_scale, 'w2_scale': moe_tensors.w2_scale, + 'ab_strides1': moe_tensors.ab_strides1, + 'ab_strides2': moe_tensors.ab_strides2, + 'c_strides1': moe_tensors.c_strides1, + 'c_strides2': moe_tensors.c_strides2, 'per_act_token': per_act_token, 'a1_scale': None #moe_tensors.a_scale } @@ -439,6 +443,11 @@ def test_run_cutlass_moe_fp8( expert_map[start:end] = list(range(num_local_experts)) expert_map = torch.tensor(expert_map, dtype=torch.int32, device="cuda") + ab_strides1 = torch.full((e, ), k, device="cuda", dtype=torch.int64) + ab_strides2 = torch.full((e, ), n, device="cuda", dtype=torch.int64) + c_strides1 = torch.full((e, ), 2 * n, device="cuda", dtype=torch.int64) + c_strides2 = torch.full((e, ), k, device="cuda", dtype=torch.int64) + activation = lambda o, i: torch.ops._C.silu_and_mul(o, i) a1q, a1q_scale = moe_kernel_quantize_input(mt.a, 
mt.a_scale, torch.float8_e4m3fn, @@ -447,8 +456,9 @@ def test_run_cutlass_moe_fp8( func = lambda output: run_cutlass_moe_fp8( output, a1q, mt.w1_q, mt.w2_q, topk_ids, activation, global_num_experts, expert_map, mt.w1_scale, mt.w2_scale, - a1q_scale, None, workspace13, workspace2, None, mt.a.dtype, - per_act_token, per_out_channel, False) + a1q_scale, None, ab_strides1, ab_strides2, c_strides1, c_strides2, + workspace13, workspace2, None, mt.a.dtype, per_act_token, + per_out_channel, False) workspace13.random_() output_random_workspace = torch.empty(output_shape, diff --git a/tests/kernels/moe/test_pplx_cutlass_moe.py b/tests/kernels/moe/test_pplx_cutlass_moe.py index e4f4a393dfd..77adc89ea9d 100644 --- a/tests/kernels/moe/test_pplx_cutlass_moe.py +++ b/tests/kernels/moe/test_pplx_cutlass_moe.py @@ -75,6 +75,7 @@ def pplx_cutlass_moe( assert torch.cuda.current_device() == pgi.local_rank num_tokens, hidden_dim = a.shape + intermediate_dim = w2.shape[2] num_experts = w1.shape[0] block_size = hidden_dim # TODO support more cases device = pgi.device @@ -123,10 +124,31 @@ def pplx_cutlass_moe( num_local_experts=num_local_experts, num_dispatchers=num_dispatchers) + ab_strides1 = torch.full((num_local_experts, ), + hidden_dim, + device="cuda", + dtype=torch.int64) + ab_strides2 = torch.full((num_local_experts, ), + intermediate_dim, + device="cuda", + dtype=torch.int64) + c_strides1 = torch.full((num_local_experts, ), + 2 * intermediate_dim, + device="cuda", + dtype=torch.int64) + c_strides2 = torch.full((num_local_experts, ), + hidden_dim, + device="cuda", + dtype=torch.int64) + experts = CutlassExpertsFp8(num_local_experts, out_dtype, per_act_token, per_out_ch, + ab_strides1, + ab_strides2, + c_strides1, + c_strides2, num_dispatchers=num_dispatchers, use_batched_format=True) diff --git a/vllm/model_executor/layers/fused_moe/cutlass_moe.py b/vllm/model_executor/layers/fused_moe/cutlass_moe.py index d09161ead46..978c5322362 100644 --- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py +++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py @@ -13,8 +13,7 @@ MoEPrepareAndFinalizeNoEP) from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( TopKWeightAndReduceDelegate) -from vllm.model_executor.layers.fused_moe.utils import (_fp8_perm, - _fp8_quantize, +from vllm.model_executor.layers.fused_moe.utils import (_fp8_quantize, _resize_cache) from vllm.scalar_type import scalar_types @@ -34,6 +33,10 @@ def run_cutlass_moe_fp8( w2_scale: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], a2_scale: Optional[torch.Tensor], + ab_strides1: torch.Tensor, + ab_strides2: torch.Tensor, + c_strides1: torch.Tensor, + c_strides2: torch.Tensor, workspace13: torch.Tensor, workspace2: torch.Tensor, expert_num_tokens: Optional[torch.Tensor], @@ -152,27 +155,11 @@ def run_cutlass_moe_fp8( problem_sizes1, problem_sizes2, a_map, c_map, global_num_experts, N, K) - a1q = _fp8_perm(a1q, a_map) - a1q_scale = a1q_scale[a_map] if per_act_token else a1q_scale + a1q = ops.shuffle_rows(a1q, a_map) + a1q_scale = (ops.shuffle_rows(a1q_scale, a_map) + if per_act_token else a1q_scale) expert_offsets = expert_offsets[:-1] - ab_strides1 = torch.full((w1.size(0), ), - K, - device=device, - dtype=torch.int64) - c_strides1 = torch.full((w1.size(0), ), - 2 * N, - device=device, - dtype=torch.int64) - ab_strides2 = torch.full((w1.size(0), ), - N, - device=device, - dtype=torch.int64) - c_strides2 = torch.full((w1.size(0), ), - K, - device=device, - dtype=torch.int64) - if use_batched_format: c1 = 
_resize_cache(workspace13, (local_E * padded_M, N * 2)) c2 = _resize_cache(workspace2, (local_E * padded_M, N)) @@ -209,7 +196,8 @@ def run_cutlass_moe_fp8( else: # We can't do this inplace because output may point to the same tensor # as c3. - output.copy_(c3[c_map].view(M * topk, K), non_blocking=True) + output.copy_(ops.shuffle_rows(c3, c_map).view(M * topk, K), + non_blocking=True) # TODO (bnell): split class batched vs. non-batched? @@ -222,6 +210,10 @@ def __init__( out_dtype: Optional[torch.dtype], per_act_token_quant: bool, per_out_ch_quant: bool, + ab_strides1: torch.Tensor, + ab_strides2: torch.Tensor, + c_strides1: torch.Tensor, + c_strides2: torch.Tensor, block_shape: Optional[list[int]] = None, num_dispatchers: Optional[int] = None, use_batched_format: bool = False, @@ -238,6 +230,10 @@ def __init__( self.max_experts_per_worker = max_experts_per_worker self.num_dispatchers = num_dispatchers self.out_dtype = out_dtype + self.ab_strides1 = ab_strides1 + self.ab_strides2 = ab_strides2 + self.c_strides1 = c_strides1 + self.c_strides2 = c_strides2 self.use_batched_format = use_batched_format @property @@ -316,7 +312,8 @@ def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, run_cutlass_moe_fp8( output, hidden_states, w1, w2, topk_ids, activation_callable, global_num_experts, expert_map, w1_scale, w2_scale, a1q_scale, - a2_scale, workspace13, workspace2, expert_num_tokens, + a2_scale, self.ab_strides1, self.ab_strides2, self.c_strides1, + self.c_strides2, workspace13, workspace2, expert_num_tokens, self.out_dtype if self.out_dtype is not None else in_dtype, self.per_act_token_quant, self.per_out_ch_quant, self.use_batched_format) @@ -330,6 +327,10 @@ def cutlass_moe_fp8( topk_ids: torch.Tensor, w1_scale: torch.Tensor, w2_scale: torch.Tensor, + ab_strides1: torch.Tensor, + ab_strides2: torch.Tensor, + c_strides1: torch.Tensor, + c_strides2: torch.Tensor, per_act_token: Optional[bool] = None, activation: str = "silu", a1_scale: Optional[torch.Tensor] = None, @@ -357,6 +358,17 @@ def cutlass_moe_fp8( Shape: [num_experts] or [num_experts, 2N] - w2_scale (torch.Tensor): The fp32 scale to dequantize w2_q. Shape: [num_experts] or [num_experts, K] + - ab_strides1 (torch.Tensor): The input/weight strides for the first gemm. + Shape: [num_experts] + - ab_strides2 (torch.Tensor): The input/weight strides for the second gemm. + Shape: [num_experts] + - c_strides1 (torch.Tensor): The output strides for the first gemm. + Shape: [num_experts] + - c_strides2 (torch.Tensor): The output strides for the second gemm. + Shape: [num_experts] + - per_act_token (Optional[bool]): Whether the scale is per-token or + per-tensor. + - activation (str): The activation function to use. - a1_scale (Optional[torch.Tensor]): The optional fp32 scale to quantize a. 
Shape: scalar or [M] - a2_scale (Optional[torch.Tensor]): The optional fp32 scale to @@ -389,6 +401,10 @@ def cutlass_moe_fp8( out_dtype=a.dtype, per_act_token_quant=per_act_token, per_out_ch_quant=per_out_ch, + ab_strides1=ab_strides1, + ab_strides2=ab_strides2, + c_strides1=c_strides1, + c_strides2=c_strides2, use_batched_format=False, ), ) diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py index c636e7e79bf..fcf8ea023f6 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py @@ -859,6 +859,21 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None: layer.w13_weight_scale = torch.nn.Parameter(max_w13_scales, requires_grad=False) + device = layer.w13_weight.device + # ab_strides1 and c_strides2 are the same + self.ab_strides1_c_strides2 = torch.full((layer.local_num_experts, ), + layer.hidden_size, + device=device, + dtype=torch.int64) + self.ab_strides2 = torch.full((layer.local_num_experts, ), + layer.intermediate_size_per_partition, + device=device, + dtype=torch.int64) + self.c_strides1 = torch.full((layer.local_num_experts, ), + 2 * layer.intermediate_size_per_partition, + device=device, + dtype=torch.int64) + def select_gemm_impl( self, prepare_finalize: FusedMoEPrepareAndFinalize, @@ -881,6 +896,10 @@ def select_gemm_impl( moe.in_dtype, self.input_quant.strategy == QuantizationStrategy.TOKEN, self.weight_quant.strategy == QuantizationStrategy.CHANNEL, + ab_strides1=self.ab_strides1_c_strides2, + ab_strides2=self.ab_strides2, + c_strides1=self.c_strides1, + c_strides2=self.ab_strides1_c_strides2, num_dispatchers=num_dispatchers, use_batched_format=use_batched_format, ) @@ -927,7 +946,8 @@ def apply( num_expert_group=num_expert_group, custom_routing_function=custom_routing_function, scoring_func=scoring_func, - e_score_correction_bias=e_score_correction_bias) + e_score_correction_bias=e_score_correction_bias, + indices_type=self.topk_indices_dtype) per_act_token = ( self.input_quant.strategy == QuantizationStrategy.TOKEN) @@ -948,6 +968,10 @@ def apply( expert_map=None if self.disable_expert_map else expert_map, w1_scale=layer.w13_weight_scale, w2_scale=layer.w2_weight_scale, + ab_strides1=self.ab_strides1_c_strides2, + ab_strides2=self.ab_strides2, + c_strides1=self.c_strides1, + c_strides2=self.ab_strides1_c_strides2, a1_scale=layer.w13_input_scale, a2_scale=layer.w2_input_scale, ) From 8ffe2f67cc938bd9f56b0d529f16050c2208ca88 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Fri, 18 Jul 2025 00:05:40 +0800 Subject: [PATCH 158/552] [Model] Update pooling model interface (#21058) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- .../my_gemma_embedding.py | 15 +- vllm/entrypoints/openai/protocol.py | 34 +--- vllm/model_executor/layers/pooler.py | 176 +++++++++++------- vllm/model_executor/models/adapters.py | 31 +-- vllm/model_executor/models/bert.py | 37 ++-- vllm/model_executor/models/gpt2.py | 14 +- vllm/model_executor/models/gritlm.py | 12 +- vllm/model_executor/models/interfaces.py | 86 ++------- vllm/model_executor/models/interfaces_base.py | 33 ++-- vllm/model_executor/models/internlm2.py | 14 +- vllm/model_executor/models/jamba.py | 14 +- vllm/model_executor/models/jina_vl.py | 15 +- vllm/model_executor/models/modernbert.py | 24 +-- .../models/prithvi_geospatial_mae.py | 20 +- 
vllm/model_executor/models/qwen2_rm.py | 23 +-- vllm/model_executor/models/roberta.py | 13 +- vllm/pooling_params.py | 31 +-- 17 files changed, 247 insertions(+), 345 deletions(-) diff --git a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py index aff3498567d..797353e4f7a 100644 --- a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py +++ b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py @@ -11,11 +11,13 @@ from vllm.model_executor.layers.pooler import Pooler, PoolingType from vllm.model_executor.models.gemma2 import Gemma2Model from vllm.model_executor.models.utils import WeightsMapper, maybe_prefix -from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.sequence import IntermediateTensors class MyGemma2Embedding(nn.Module): + + is_pooling_model = True + hf_to_vllm_mapper = WeightsMapper(orig_to_new_prefix={"model.": ""}) def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): @@ -24,7 +26,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.model = Gemma2Model(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model")) - self._pooler = Pooler.from_config_with_defaults( + self.pooler = Pooler.from_config_with_defaults( vllm_config.model_config.pooler_config, pooling_type=PoolingType.LAST, normalize=True, @@ -54,13 +56,6 @@ def forward( # Return all-zero embeddings return torch.zeros_like(hidden_states) - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): weights = self.hf_to_vllm_mapper.apply(weights) diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py index 16cb5b75032..a421ed1fc32 100644 --- a/vllm/entrypoints/openai/protocol.py +++ b/vllm/entrypoints/openai/protocol.py @@ -1237,10 +1237,6 @@ class EmbeddingCompletionRequest(OpenAIBaseModel): user: Optional[str] = None truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None - # --8<-- [start:embedding-pooling-params] - additional_data: Optional[Any] = None - # --8<-- [end:embedding-pooling-params] - # --8<-- [start:embedding-extra-params] add_special_tokens: bool = Field( default=True, @@ -1259,8 +1255,7 @@ class EmbeddingCompletionRequest(OpenAIBaseModel): # --8<-- [end:embedding-extra-params] def to_pooling_params(self): - return PoolingParams(dimensions=self.dimensions, - additional_data=self.additional_data) + return PoolingParams(dimensions=self.dimensions) class EmbeddingChatRequest(OpenAIBaseModel): @@ -1272,10 +1267,6 @@ class EmbeddingChatRequest(OpenAIBaseModel): user: Optional[str] = None truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None - # --8<-- [start:chat-embedding-pooling-params] - additional_data: Optional[Any] = None - # --8<-- [end:chat-embedding-pooling-params] - # --8<-- [start:chat-embedding-extra-params] add_special_tokens: bool = Field( default=False, @@ -1323,8 +1314,7 @@ def check_generation_prompt(cls, data): return data def to_pooling_params(self): - return PoolingParams(dimensions=self.dimensions, - additional_data=self.additional_data) + return PoolingParams(dimensions=self.dimensions) EmbeddingRequest = Union[EmbeddingCompletionRequest, EmbeddingChatRequest] @@ 
-1340,10 +1330,6 @@ class ScoreRequest(OpenAIBaseModel): text_2: Union[list[str], str, ScoreMultiModalParam] truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None - # --8<-- [start:score-pooling-params] - additional_data: Optional[Any] = None - # --8<-- [end:score-pooling-params] - # --8<-- [start:score-extra-params] mm_processor_kwargs: Optional[dict[str, Any]] = Field( @@ -1362,8 +1348,7 @@ class ScoreRequest(OpenAIBaseModel): # --8<-- [end:score-extra-params] def to_pooling_params(self, *, use_cross_encoder: bool = False): - return PoolingParams(use_cross_encoder=use_cross_encoder, - additional_data=self.additional_data) + return PoolingParams(use_cross_encoder=use_cross_encoder) class RerankRequest(OpenAIBaseModel): @@ -1373,10 +1358,6 @@ class RerankRequest(OpenAIBaseModel): top_n: int = Field(default_factory=lambda: 0) truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None - # --8<-- [start:rerank-pooling-params] - additional_data: Optional[Any] = None - # --8<-- [end:rerank-pooling-params] - # --8<-- [start:rerank-extra-params] mm_processor_kwargs: Optional[dict[str, Any]] = Field( @@ -1395,8 +1376,7 @@ class RerankRequest(OpenAIBaseModel): # --8<-- [end:rerank-extra-params] def to_pooling_params(self, *, use_cross_encoder: bool = False): - return PoolingParams(use_cross_encoder=use_cross_encoder, - additional_data=self.additional_data) + return PoolingParams(use_cross_encoder=use_cross_encoder) class RerankDocument(BaseModel): @@ -1534,10 +1514,6 @@ class ClassificationRequest(OpenAIBaseModel): truncate_prompt_tokens: Optional[int] = None user: Optional[str] = None - # --8<-- [start:classification-pooling-params] - additional_data: Optional[Any] = None - # --8<-- [end:classification-pooling-params] - # --8<-- [start:classification-extra-params] priority: int = Field( default=0, @@ -1550,7 +1526,7 @@ class ClassificationRequest(OpenAIBaseModel): # --8<-- [end:classification-extra-params] def to_pooling_params(self): - return PoolingParams(additional_data=self.additional_data) + return PoolingParams() class ClassificationData(OpenAIBaseModel): diff --git a/vllm/model_executor/layers/pooler.py b/vllm/model_executor/layers/pooler.py index b378a3db032..74916492f57 100644 --- a/vllm/model_executor/layers/pooler.py +++ b/vllm/model_executor/layers/pooler.py @@ -3,22 +3,25 @@ from abc import ABC, abstractmethod from dataclasses import dataclass from enum import IntEnum -from typing import Callable, Optional, TypeVar, Union +from typing import Callable, Literal, Optional, TypeVar, Union import torch import torch.nn as nn import torch.nn.functional as F from transformers import PretrainedConfig +from typing_extensions import assert_never from vllm.config import ModelConfig, PoolerConfig from vllm.model_executor.pooling_metadata import ( # noqa: E501 PoolingMetadata as V0PoolingMetadata) from vllm.model_executor.pooling_metadata import PoolingTensors +from vllm.pooling_params import PoolingParams from vllm.sequence import PoolerOutput, PoolingSequenceGroupOutput from vllm.utils import resolve_obj_by_qualname from vllm.v1.pool.metadata import PoolingMetadata as V1PoolingMetadata PoolingMetadata = Union[V0PoolingMetadata, V1PoolingMetadata] +PoolingTask = Literal["encode", "embed", "classify", "score"] class PoolingType(IntEnum): @@ -64,6 +67,48 @@ def from_config_with_defaults( ) +class Pooler(nn.Module, ABC): + """The interface required for all poolers used in pooling models in vLLM.""" + + @staticmethod + def from_config_with_defaults( + pooler_config: 
PoolerConfig, + pooling_type: PoolingType, + normalize: bool, + softmax: bool, + step_tag_id: Optional[int] = None, + returned_token_ids: Optional[list[int]] = None, + ) -> "Pooler": + resolved_config = ResolvedPoolingConfig.from_config_with_defaults( + pooler_config=pooler_config, + pooling_type=pooling_type, + normalize=normalize, + softmax=softmax, + step_tag_id=step_tag_id, + returned_token_ids=returned_token_ids, + ) + + if pooling_type == PoolingType.STEP: + return StepPooler.from_config(resolved_config) + + return SimplePooler.from_config(resolved_config) + + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + """ + Construct the pooling parameters to use for a task, + or `None` if the task is not supported. + """ + return None + + @abstractmethod + def forward( + self, + hidden_states: Union[list[torch.Tensor], torch.Tensor], + pooling_metadata: PoolingMetadata, + ) -> PoolerOutput: + raise NotImplementedError + + def get_prompt_lens( hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: PoolingMetadata, @@ -104,17 +149,6 @@ def build_output(all_data: torch.Tensor) -> PoolerOutput: return PoolerOutput(outputs=all_outputs) -class BasePooler(nn.Module): - - @abstractmethod - def forward( - self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, - ) -> PoolerOutput: - raise NotImplementedError - - class PoolingMethod(nn.Module, ABC): @staticmethod @@ -130,6 +164,10 @@ def from_pooling_type(pooling_type: PoolingType) -> "PoolingMethod": raise NotImplementedError(f"Unsupported method: {pooling_type}") + @abstractmethod + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + raise NotImplementedError + @abstractmethod def forward_one( self, @@ -168,6 +206,14 @@ def forward( class CLSPool(PoolingMethod): + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + # The equalities are split up to keep mypy happy + if (task == "encode" or task == "embed" or task == "classify" + or task == "score"): + return PoolingParams() + + assert_never(task) + def forward_one( self, hidden_states: torch.Tensor, @@ -190,6 +236,14 @@ def forward_all( class LastPool(PoolingMethod): + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + # The equalities are split up to keep mypy happy + if (task == "encode" or task == "embed" or task == "classify" + or task == "score"): + return PoolingParams() + + assert_never(task) + def forward_one( self, hidden_states: torch.Tensor, @@ -208,6 +262,16 @@ def forward_all( class AllPool(PoolingMethod): + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + if task == "encode": + return PoolingParams() + + # The equalities are split up to keep mypy happy + if task == "embed" or task == "classify" or task == "score": + return None + + assert_never(task) + def forward_one( self, hidden_states: torch.Tensor, @@ -235,6 +299,14 @@ def forward_all( class MeanPool(PoolingMethod): + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + # The equalities are split up to keep mypy happy + if (task == "encode" or task == "embed" or task == "classify" + or task == "score"): + return PoolingParams() + + assert_never(task) + def forward_one( self, hidden_states: torch.Tensor, @@ -345,25 +417,6 @@ def forward_chunk(self, pooled_data: torch.Tensor) -> torch.Tensor: class PoolerHead(nn.Module): - @classmethod - def from_config_with_defaults( - cls, - pooler_config: PoolerConfig, - 
pooling_type: PoolingType, - normalize: bool, - softmax: bool, - ) -> "PoolerHead": - resolved_config = ResolvedPoolingConfig.from_config_with_defaults( - pooler_config=pooler_config, - pooling_type=pooling_type, - normalize=normalize, - softmax=softmax, - step_tag_id=None, - returned_token_ids=None, - ) - - return cls.from_config(resolved_config) - @classmethod def from_config(cls, pooler_config: ResolvedPoolingConfig) -> "PoolerHead": if pooler_config.normalize and pooler_config.softmax: @@ -424,21 +477,17 @@ def forward(self, pooled_data: Union[list[torch.Tensor], torch.Tensor], return self.activation(pooled_data) -class SimplePooler(BasePooler): +class SimplePooler(Pooler): """A layer that pools specific information from hidden states. This layer does the following: 1. Extracts specific tokens or aggregates data based on pooling method. 2. Normalizes output if specified. 3. Returns structured results as `PoolerOutput`. - - Attributes: - pooling_type: The type of pooling to use. - normalize: Whether to normalize the pooled data. """ @classmethod - def from_config_with_defaults( + def from_config_with_defaults( # type: ignore[override] cls, pooler_config: PoolerConfig, pooling_type: PoolingType, @@ -471,6 +520,9 @@ def __init__(self, pooling: PoolingMethod, head: PoolerHead) -> None: self.pooling = pooling self.head = head + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + return self.pooling.get_pooling_params(task) + def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], @@ -481,7 +533,7 @@ def forward( return build_output(pooled_data) -class StepPooler(BasePooler): +class StepPooler(Pooler): @classmethod def from_config(cls, pooler_config: ResolvedPoolingConfig) -> "StepPooler": @@ -543,6 +595,16 @@ def extract_states( return pooled_data + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + if task == "encode": + return PoolingParams(logits_processing_needs_token_ids=True) + + # The equalities are split up to keep mypy happy + if task == "embed" or task == "classify" or task == "score": + return None + + assert_never(task) + def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], @@ -553,32 +615,6 @@ def forward( return build_output(pooled_data) -class Pooler(nn.Module): - - @staticmethod - def from_config_with_defaults( - pooler_config: PoolerConfig, - pooling_type: PoolingType, - normalize: bool, - softmax: bool, - step_tag_id: Optional[int] = None, - returned_token_ids: Optional[list[int]] = None, - ) -> BasePooler: - resolved_config = ResolvedPoolingConfig.from_config_with_defaults( - pooler_config=pooler_config, - pooling_type=pooling_type, - normalize=normalize, - softmax=softmax, - step_tag_id=step_tag_id, - returned_token_ids=returned_token_ids, - ) - - if pooling_type == PoolingType.STEP: - return StepPooler.from_config(resolved_config) - - return SimplePooler.from_config(resolved_config) - - PoolingFn = Callable[ [Union[torch.Tensor, list[torch.Tensor]], PoolingMetadata], Union[torch.Tensor, list[torch.Tensor]]] @@ -618,6 +654,18 @@ def _get_act_fn(self, use_cross_encoder: bool): return (self.cross_encoder_act_fn if use_cross_encoder else self.classification_act_fn) + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + if task == "encode": + return PoolingParams() + if task == "embed": + return None + if task == "classify": + return PoolingParams() + if task == "score": + return PoolingParams(use_cross_encoder=True) + + assert_never(task) + def forward( 
self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], diff --git a/vllm/model_executor/models/adapters.py b/vllm/model_executor/models/adapters.py index 5c09ac30605..f319c0c4441 100644 --- a/vllm/model_executor/models/adapters.py +++ b/vllm/model_executor/models/adapters.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from collections.abc import Iterable -from typing import TYPE_CHECKING, Any, Optional, TypeVar, Union, cast +from typing import TYPE_CHECKING, Any, Optional, TypeVar, cast import torch import torch.nn as nn @@ -42,13 +42,14 @@ def _create_pooling_model_cls( default_softmax: bool, ) -> _T: # Lazy import - from vllm.model_executor.layers.pooler import Pooler, PoolerOutput - from vllm.model_executor.pooling_metadata import PoolingMetadata + from vllm.model_executor.layers.pooler import Pooler from .utils import AutoWeightsLoader, WeightsMapper class ModelForPooling(orig_cls, VllmModelForPooling): + is_pooling_model = True + def __init__( self, *, @@ -66,27 +67,20 @@ def __init__( delattr(self, attr) # If the model already defines a pooler instance, don't overwrite it - if not getattr(self, "_pooler", None): + if not getattr(self, "pooler", None): self._init_pooler(vllm_config, prefix=prefix) def _init_pooler(self, vllm_config: "VllmConfig", prefix: str = ""): pooler_config = vllm_config.model_config.pooler_config assert pooler_config is not None - self._pooler = Pooler.from_config_with_defaults( + self.pooler = Pooler.from_config_with_defaults( pooler_config, pooling_type=default_pooling_type, normalize=default_normalize, softmax=default_softmax, ) - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> PoolerOutput: - return self._pooler(hidden_states, pooling_metadata) - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): # TODO: Support uninitialized params tracking @@ -171,10 +165,8 @@ def as_seq_cls_model(cls: _T) -> _T: # Lazy import from vllm.model_executor.layers.linear import RowParallelLinear from vllm.model_executor.layers.pooler import (ClassifierPooler, - PoolerOutput, PoolingType, - SimplePooler) + PoolingType, SimplePooler) from vllm.model_executor.models.interfaces import SupportsCrossEncoding - from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.sequence import IntermediateTensors from .utils import maybe_prefix @@ -213,7 +205,7 @@ def _init_pooler(self, vllm_config: "VllmConfig", prefix: str = ""): softmax=True, ) - self._pooler = ClassifierPooler( + self.pooler = ClassifierPooler( vllm_config.model_config, pooling=pooler.pooling, classifier=self._classifier, @@ -234,13 +226,6 @@ def forward( return super().forward(input_ids, positions, intermediate_tensors, inputs_embeds) - def pooler( - self, - hidden_states: Union[torch.Tensor, list[torch.Tensor]], - pooling_metadata: PoolingMetadata, - ) -> PoolerOutput: - return self._pooler(hidden_states, pooling_metadata) - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): tokens = getattr(self.config, "classifier_from_token", None) method = getattr(self.config, "method", None) diff --git a/vllm/model_executor/models/bert.py b/vllm/model_executor/models/bert.py index 65e6428f491..bd4445c49a0 100644 --- a/vllm/model_executor/models/bert.py +++ b/vllm/model_executor/models/bert.py @@ -18,12 +18,14 @@ QKVParallelLinear, RowParallelLinear) from vllm.model_executor.layers.pooler import (ClassifierPooler, Pooler, - PoolingMethod, PoolingType) + PoolingMethod, PoolingTask, + 
PoolingType) from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.pooling_params import PoolingParams +from vllm.sequence import IntermediateTensors from .interfaces import SupportsCrossEncoding, SupportsQuant, SupportsV0Only from .utils import AutoWeightsLoader, WeightsMapper, maybe_prefix @@ -80,7 +82,7 @@ def forward( return embeddings -class BertPooler(nn.Module): +class BertPooler(Pooler): def __init__(self, config: BertConfig): super().__init__() @@ -89,6 +91,9 @@ def __init__(self, config: BertConfig): self.dense = nn.Linear(config.hidden_size, config.hidden_size) self.activation = nn.Tanh() + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + return self.pooling.get_pooling_params(task) + def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], @@ -319,6 +324,9 @@ def forward(self, hidden_states: torch.Tensor, class BertModel(nn.Module, SupportsQuant): + + is_pooling_model = True + packed_modules_mapping = {"qkv_proj": ["query", "key", "value"]} def __init__(self, @@ -403,12 +411,15 @@ class BertEmbeddingModel(nn.Module, SupportsV0Only, SupportsQuant): _pooler: An instance of Pooler used for pooling operations. """ + is_pooling_model = True + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() + pooler_config = vllm_config.model_config.pooler_config self.model = self._build_model(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model")) - self._pooler = self._build_pooler(pooler_config) + self.pooler = self._build_pooler(pooler_config) def forward( self, @@ -422,13 +433,6 @@ def forward( inputs_embeds=inputs_embeds, intermediate_tensors=intermediate_tensors) - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): weights_list = list(weights) @@ -466,6 +470,8 @@ class BertForSequenceClassification(nn.Module, SupportsV0Only, _pooler: An instance of Pooler used for pooling operations. 
""" + is_pooling_model = True + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() config = vllm_config.model_config.hf_config @@ -476,7 +482,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): embedding_class=BertEmbedding, add_pooling_layer=True) self.classifier = nn.Linear(config.hidden_size, config.num_labels) - self._pooler = ClassifierPooler( + self.pooler = ClassifierPooler( vllm_config.model_config, pooling=self.bert.pooler, classifier=self.classifier, @@ -487,13 +493,6 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): loaded_params = loader.load_weights(weights) return loaded_params - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) - def forward( self, input_ids: Optional[torch.Tensor], diff --git a/vllm/model_executor/models/gpt2.py b/vllm/model_executor/models/gpt2.py index 27021550f99..82883bfa890 100644 --- a/vllm/model_executor/models/gpt2.py +++ b/vllm/model_executor/models/gpt2.py @@ -40,9 +40,8 @@ from vllm.model_executor.layers.vocab_parallel_embedding import ( ParallelLMHead, VocabParallelEmbedding) from vllm.model_executor.model_loader.weight_utils import default_weight_loader -from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.model_executor.sampling_metadata import SamplingMetadata -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.sequence import IntermediateTensors from ..layers.pooler import Pooler, PoolingType from .interfaces import SupportsPP @@ -332,6 +331,8 @@ class GPT2ForSequenceClassification(nn.Module): _pooler: An instance of Pooler used for pooling operations. """ + is_pooling_model = True + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() config = vllm_config.model_config.hf_config @@ -339,7 +340,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): prefix=maybe_prefix(prefix, "gpt2")) self.score = nn.Linear(config.n_embd, config.num_labels, bias=False) pooler_config = vllm_config.model_config.pooler_config - self._pooler = Pooler.from_config_with_defaults( + self.pooler = Pooler.from_config_with_defaults( pooler_config, pooling_type=PoolingType.LAST, normalize=False, @@ -349,13 +350,6 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): loader = AutoWeightsLoader(self) return loader.load_weights(weights) - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) - def forward( self, input_ids: torch.Tensor, diff --git a/vllm/model_executor/models/gritlm.py b/vllm/model_executor/models/gritlm.py index dfec8a51c4c..ba0e22892d8 100644 --- a/vllm/model_executor/models/gritlm.py +++ b/vllm/model_executor/models/gritlm.py @@ -2,7 +2,6 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from array import array -from typing import Optional import torch import torch.nn as nn @@ -195,6 +194,8 @@ class GritLM(LlamaForCausalLM, SupportsV0Only): - "<|user|>\nPROMPT\n<|assistant|>\n" """ + is_pooling_model = True + def __init__( self, vllm_config: VllmConfig, @@ -214,11 +215,4 @@ def __init__( super().__init__(vllm_config=vllm_config, prefix=prefix, **kwargs) - self._pooler = GritLMPooler(vllm_config.model_config) - - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> 
Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) + self.pooler = GritLMPooler(vllm_config.model_config) diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 9655bdf6f3e..417f9059449 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -119,13 +119,6 @@ def get_input_embeddings( ... -# We can't use runtime_checkable with ClassVar for issubclass checks -# so we need to treat the class as an instance and use isinstance instead -@runtime_checkable -class _SupportsMultiModalType(Protocol): - supports_multimodal: Literal[True] - - @overload def supports_multimodal( model: type[object]) -> TypeIs[type[SupportsMultiModal]]: @@ -140,10 +133,7 @@ def supports_multimodal(model: object) -> TypeIs[SupportsMultiModal]: def supports_multimodal( model: Union[type[object], object], ) -> Union[TypeIs[type[SupportsMultiModal]], TypeIs[SupportsMultiModal]]: - if isinstance(model, type): - return isinstance(model, _SupportsMultiModalType) - - return isinstance(model, SupportsMultiModal) + return getattr(model, "supports_multimodal", False) @runtime_checkable @@ -174,13 +164,6 @@ def post_process_tokens(cls, prompt: TokensPrompt) -> None: ... -# We can't use runtime_checkable with ClassVar for issubclass checks -# so we need to treat the class as an instance and use isinstance instead -@runtime_checkable -class _SupportsScoreTemplateType(Protocol): - supports_score_template: Literal[True] - - @overload def supports_score_template( model: type[object]) -> TypeIs[type[SupportsScoreTemplate]]: @@ -195,11 +178,7 @@ def supports_score_template(model: object) -> TypeIs[SupportsScoreTemplate]: def supports_score_template( model: Union[type[object], object], ) -> Union[TypeIs[type[SupportsScoreTemplate]], TypeIs[SupportsScoreTemplate]]: - - if isinstance(model, type): - return isinstance(model, _SupportsScoreTemplateType) - - return isinstance(model, SupportsScoreTemplate) + return getattr(model, "supports_score_template", False) @runtime_checkable @@ -409,11 +388,6 @@ class HasInnerState(Protocol): """ -@runtime_checkable -class _HasInnerStateType(Protocol): - has_inner_state: ClassVar[Literal[True]] - - @overload def has_inner_state(model: object) -> TypeIs[HasInnerState]: ... @@ -427,10 +401,7 @@ def has_inner_state(model: type[object]) -> TypeIs[type[HasInnerState]]: def has_inner_state( model: Union[type[object], object] ) -> Union[TypeIs[type[HasInnerState]], TypeIs[HasInnerState]]: - if isinstance(model, type): - return isinstance(model, _HasInnerStateType) - - return isinstance(model, HasInnerState) + return getattr(model, "has_inner_state", False) @runtime_checkable @@ -446,11 +417,6 @@ class IsAttentionFree(Protocol): """ -@runtime_checkable -class _IsAttentionFreeType(Protocol): - is_attention_free: ClassVar[Literal[True]] - - @overload def is_attention_free(model: object) -> TypeIs[IsAttentionFree]: ... @@ -464,10 +430,7 @@ def is_attention_free(model: type[object]) -> TypeIs[type[IsAttentionFree]]: def is_attention_free( model: Union[type[object], object] ) -> Union[TypeIs[type[IsAttentionFree]], TypeIs[IsAttentionFree]]: - if isinstance(model, type): - return isinstance(model, _IsAttentionFreeType) - - return isinstance(model, IsAttentionFree) + return getattr(model, "is_attention_free", False) @runtime_checkable @@ -502,11 +465,6 @@ def get_mamba_state_shape_from_config( ... 
-@runtime_checkable -class _IsHybridType(Protocol): - is_hybrid: ClassVar[Literal[True]] - - @overload def is_hybrid(model: object) -> TypeIs[IsHybrid]: ... @@ -520,10 +478,7 @@ def is_hybrid(model: type[object]) -> TypeIs[type[IsHybrid]]: def is_hybrid( model: Union[type[object], object] ) -> Union[TypeIs[type[IsHybrid]], TypeIs[IsHybrid]]: - if isinstance(model, type): - return isinstance(model, _IsHybridType) - - return isinstance(model, IsHybrid) + return getattr(model, "is_hybrid", False) @runtime_checkable @@ -598,11 +553,6 @@ class HasNoOps(Protocol): has_noops: ClassVar[Literal[True]] = True -@runtime_checkable -class _HasNoOpsType(Protocol): - has_noops: ClassVar[Literal[True]] - - @overload def has_noops(model: object) -> TypeIs[HasNoOps]: ... @@ -616,10 +566,7 @@ def has_noops(model: type[object]) -> TypeIs[type[HasNoOps]]: def has_noops( model: Union[type[object], object] ) -> Union[TypeIs[type[HasNoOps]], TypeIs[HasNoOps]]: - if isinstance(model, type): - return isinstance(model, _HasNoOpsType) - - return isinstance(model, HasNoOps) + return getattr(model, "has_noops", False) @runtime_checkable @@ -643,11 +590,7 @@ def supports_cross_encoding(model: object) -> TypeIs[SupportsCrossEncoding]: def _supports_cross_encoding( model: Union[type[object], object], ) -> Union[TypeIs[type[SupportsCrossEncoding]], TypeIs[SupportsCrossEncoding]]: - - if isinstance(model, type): - return isinstance(model, SupportsCrossEncoding) - - return isinstance(model, SupportsCrossEncoding) + return getattr(model, "supports_cross_encoding", False) def supports_cross_encoding( @@ -658,8 +601,9 @@ def supports_cross_encoding( def has_step_pooler(model: Union[type[object], object]) -> bool: """Check if the model uses step pooler.""" - return is_pooling_model(model) and any( - type(module).__name__ == "StepPooler" for module in model.modules()) + from vllm.model_executor.layers.pooler import StepPooler + + return is_pooling_model(model) and isinstance(model.pooler, StepPooler) class SupportsQuant: @@ -770,10 +714,7 @@ def supports_transcription(model: object) -> TypeIs[SupportsTranscription]: def supports_transcription( model: Union[type[object], object], ) -> Union[TypeIs[type[SupportsTranscription]], TypeIs[SupportsTranscription]]: - if isinstance(model, type): - return isinstance(model, SupportsTranscription) - - return isinstance(model, SupportsTranscription) + return getattr(model, "supports_transcription", False) @runtime_checkable @@ -796,7 +737,4 @@ def supports_v0_only(model: object) -> TypeIs[SupportsV0Only]: def supports_v0_only( model: Union[type[object], object], ) -> Union[TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]]: - if isinstance(model, type): - return isinstance(model, SupportsV0Only) - - return isinstance(model, SupportsV0Only) + return getattr(model, "supports_v0_only", False) diff --git a/vllm/model_executor/models/interfaces_base.py b/vllm/model_executor/models/interfaces_base.py index 4a1ea74a218..4d68227b2af 100644 --- a/vllm/model_executor/models/interfaces_base.py +++ b/vllm/model_executor/models/interfaces_base.py @@ -1,8 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from typing import (TYPE_CHECKING, Optional, Protocol, Union, overload, - runtime_checkable) +from typing import (TYPE_CHECKING, ClassVar, Literal, Optional, Protocol, + Union, overload, runtime_checkable) import torch import torch.nn as nn @@ -13,8 +12,7 @@ if TYPE_CHECKING: from vllm.config import VllmConfig - from 
vllm.model_executor.layers.pooler import PoolerOutput - from vllm.model_executor.pooling_metadata import PoolingMetadata + from vllm.model_executor.layers.pooler import Pooler from vllm.model_executor.sampling_metadata import SamplingMetadata logger = init_logger(__name__) @@ -130,16 +128,20 @@ def is_text_generation_model( @runtime_checkable -class VllmModelForPooling(VllmModel[T], Protocol[T]): +class VllmModelForPooling(VllmModel[T_co], Protocol[T_co]): """The interface required for all pooling models in vLLM.""" - def pooler( - self, - hidden_states: T, - pooling_metadata: "PoolingMetadata", - ) -> "PoolerOutput": - """Only called on TP rank 0.""" - ... + is_pooling_model: ClassVar[Literal[True]] = True + """ + A flag that indicates this model supports pooling. + + Note: + There is no need to redefine this flag if this class is in the + MRO of your model class. + """ + + pooler: "Pooler" + """The pooler is only called on TP rank 0.""" @overload @@ -158,7 +160,4 @@ def is_pooling_model( if not is_vllm_model(model): return False - if isinstance(model, type): - return isinstance(model, VllmModelForPooling) - - return isinstance(model, VllmModelForPooling) + return getattr(model, "is_pooling_model", False) diff --git a/vllm/model_executor/models/internlm2.py b/vllm/model_executor/models/internlm2.py index e8549b4e053..d9bbee0a246 100644 --- a/vllm/model_executor/models/internlm2.py +++ b/vllm/model_executor/models/internlm2.py @@ -28,9 +28,8 @@ from vllm.model_executor.layers.vocab_parallel_embedding import ( ParallelLMHead, VocabParallelEmbedding) from vllm.model_executor.model_loader.weight_utils import default_weight_loader -from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.model_executor.sampling_metadata import SamplingMetadata -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.sequence import IntermediateTensors from .interfaces import SupportsLoRA, SupportsPP from .utils import (is_pp_missing_parameter, @@ -404,6 +403,8 @@ def load_weights(self, weights: Iterable[tuple[str, class InternLM2ForRewardModel(InternLM2ForCausalLM): + is_pooling_model = True + def __init__( self, *, @@ -428,7 +429,7 @@ def __init__( ) pooler_config = vllm_config.model_config.pooler_config - self._pooler = Pooler.from_config_with_defaults( + self.pooler = Pooler.from_config_with_defaults( pooler_config, pooling_type=PoolingType.ALL, normalize=False, @@ -446,10 +447,3 @@ def forward( inputs_embeds) logits, _ = self.v_head(hidden_states) return logits - - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) diff --git a/vllm/model_executor/models/jamba.py b/vllm/model_executor/models/jamba.py index 233c222963b..e95f3491c6b 100644 --- a/vllm/model_executor/models/jamba.py +++ b/vllm/model_executor/models/jamba.py @@ -27,9 +27,8 @@ from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.models.mamba_cache import (MambaCacheManager, MambaCacheParams) -from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.model_executor.sampling_metadata import SamplingMetadata -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.sequence import IntermediateTensors from vllm.utils import LayerBlockType from .interfaces import (HasInnerState, IsHybrid, SupportsLoRA, SupportsPP, @@ -563,6 +562,8 @@ def _is_moe_layer(name: str): class 
JambaForSequenceClassification(JambaForCausalLM): + is_pooling_model = True + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__(vllm_config=vllm_config, prefix=prefix) @@ -590,16 +591,9 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): softmax=False, ) - self._pooler = ClassifierPooler( + self.pooler = ClassifierPooler( vllm_config.model_config, pooling=pooler.pooling, classifier=self.score, act_fn=pooler.head.activation, ) - - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) diff --git a/vllm/model_executor/models/jina_vl.py b/vllm/model_executor/models/jina_vl.py index 78e58896e0d..6b191b09b4b 100644 --- a/vllm/model_executor/models/jina_vl.py +++ b/vllm/model_executor/models/jina_vl.py @@ -13,9 +13,8 @@ from vllm.model_executor.layers.linear import (ColumnParallelLinear, RowParallelLinear) from vllm.model_executor.layers.pooler import Pooler, PoolingType -from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.multimodal import MULTIMODAL_REGISTRY -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.sequence import IntermediateTensors from .interfaces import (SupportsCrossEncoding, SupportsMultiModal, SupportsScoreTemplate) @@ -72,6 +71,8 @@ class JinaVLForSequenceClassification(Qwen2VLForConditionalGeneration, SupportsCrossEncoding, SupportsMultiModal, SupportsScoreTemplate): + + is_pooling_model = True weight_mapper = WeightsMapper( orig_to_new_prefix={ "score.0.": "score.dense.", @@ -95,7 +96,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.score = JinaVLScorer(config) - self._pooler = Pooler.from_config_with_defaults( + self.pooler = Pooler.from_config_with_defaults( pooler_config, pooling_type=PoolingType.LAST, normalize=False, @@ -137,14 +138,6 @@ def forward( logits = self.score(hidden_states) - self.LOGIT_BIAS return logits - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): - loader = AutoWeightsLoader(self) return loader.load_weights(weights, mapper=self.weight_mapper) diff --git a/vllm/model_executor/models/modernbert.py b/vllm/model_executor/models/modernbert.py index e094ff16357..94a7ddcc01c 100644 --- a/vllm/model_executor/models/modernbert.py +++ b/vllm/model_executor/models/modernbert.py @@ -13,14 +13,16 @@ from vllm.distributed import get_tensor_model_parallel_world_size from vllm.model_executor.layers.linear import (QKVParallelLinear, RowParallelLinear) -from vllm.model_executor.layers.pooler import (BasePooler, ClassifierPooler, - PoolingMethod, PoolingType) +from vllm.model_executor.layers.pooler import (ClassifierPooler, Pooler, + PoolingMethod, PoolingTask, + PoolingType) from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.pooling_params import PoolingParams +from vllm.sequence import IntermediateTensors from .interfaces import SupportsCrossEncoding, SupportsV0Only from .utils import WeightsMapper, maybe_prefix @@ 
-253,7 +255,7 @@ def forward( return norm_outputs -class ModernBertPooler(BasePooler): +class ModernBertPooler(Pooler): def __init__(self, config: ModernBertConfig): super().__init__() @@ -268,6 +270,9 @@ def __init__(self, config: ModernBertConfig): eps=config.norm_eps, bias=config.norm_bias) + def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + return self.pooling.get_pooling_params(task) + def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], @@ -281,6 +286,8 @@ def forward( class ModernBertForSequenceClassification(nn.Module, SupportsV0Only, SupportsCrossEncoding): + is_pooling_model = True + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() config = vllm_config.model_config.hf_config @@ -288,7 +295,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.model = ModernBertModel(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "modernbert")) self.classifier = nn.Linear(config.hidden_size, config.num_labels) - self._pooler = ClassifierPooler( + self.pooler = ClassifierPooler( vllm_config.model_config, pooling=ModernBertPooler(config), classifier=self.classifier, @@ -321,13 +328,6 @@ def weight_filter(): default_weight_loader) weight_loader(param, loaded_weight) - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) - def forward( self, input_ids: Optional[torch.LongTensor], diff --git a/vllm/model_executor/models/prithvi_geospatial_mae.py b/vllm/model_executor/models/prithvi_geospatial_mae.py index a36f24bc80e..d51fcec07fd 100644 --- a/vllm/model_executor/models/prithvi_geospatial_mae.py +++ b/vllm/model_executor/models/prithvi_geospatial_mae.py @@ -24,12 +24,13 @@ from transformers import BatchFeature from vllm.config import VllmConfig +from vllm.model_executor.layers.pooler import (AllPool, PoolerHead, + PoolerIdentity, SimplePooler) from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.models.interfaces import (IsAttentionFree, SupportsMultiModal, SupportsV0Only) from vllm.model_executor.models.utils import AutoWeightsLoader -from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalFieldConfig, MultiModalInputs, MultiModalKwargs) @@ -37,8 +38,7 @@ from vllm.multimodal.processing import (BaseMultiModalProcessor, BaseProcessingInfo, PromptUpdate) from vllm.multimodal.profiling import BaseDummyInputsBuilder -from vllm.sequence import (IntermediateTensors, PoolerOutput, - PoolingSequenceGroupOutput) +from vllm.sequence import IntermediateTensors class PrithviGeoSpatialMAEProcessingInfo(BaseProcessingInfo): @@ -116,7 +116,9 @@ def apply( dummy_inputs=PrithviGeoSpatialMAEInputBuilder) class PrithviGeoSpatialMAE(nn.Module, IsAttentionFree, SupportsMultiModal, SupportsV0Only): - """ Prithvi Masked Autoencoder""" + """Prithvi Masked Autoencoder""" + + is_pooling_model = True @classmethod def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]: @@ -162,6 +164,8 @@ def __init__(self, vllm_config: VllmConfig, prefix: str = ""): "Only SemanticSegmentationTask is supported for now " "by PrithviGeospatialMAE.") + self.pooler = SimplePooler(AllPool(), PoolerHead(PoolerIdentity())) + def _parse_and_validate_multimodal_data( self, **kwargs) -> tuple[torch.Tensor, Optional[torch.Tensor]]: @@ -189,7 
+193,6 @@ def forward( inputs_embeds: Optional[torch.Tensor] = None, **kwargs: object, ): - pixel_values, location_coords = ( self._parse_and_validate_multimodal_data(**kwargs)) model_output = self.model(pixel_values, @@ -197,13 +200,6 @@ def forward( return model_output.output - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return PoolerOutput([PoolingSequenceGroupOutput(hidden_states)]) - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: params_list = [] diff --git a/vllm/model_executor/models/qwen2_rm.py b/vllm/model_executor/models/qwen2_rm.py index 9a850808167..58f95d6eebf 100644 --- a/vllm/model_executor/models/qwen2_rm.py +++ b/vllm/model_executor/models/qwen2_rm.py @@ -16,8 +16,7 @@ from vllm.model_executor.layers.linear import (ColumnParallelLinear, RowParallelLinear) from vllm.model_executor.layers.pooler import Pooler, PoolingType, SimplePooler -from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.sequence import IntermediateTensors from .interfaces import SupportsLoRA, SupportsPP from .qwen2 import Qwen2Model @@ -25,6 +24,10 @@ class Qwen2RewardBaseModel(nn.Module, SupportsLoRA, SupportsPP): + + is_pooling_model = True + pooler: SimplePooler + packed_modules_mapping = { "qkv_proj": [ "q_proj", @@ -61,7 +64,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): quant_config=quant_config, return_bias=False), ) - self._pooler: SimplePooler self.make_empty_intermediate_tensors = ( self.model.make_empty_intermediate_tensors) @@ -80,13 +82,6 @@ def forward( logits = self.score(hidden_states) return logits - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: loader = AutoWeightsLoader(self, @@ -96,11 +91,11 @@ def load_weights(self, weights: Iterable[tuple[str, class Qwen2ForRewardModel(Qwen2RewardBaseModel): - def __init__(self, *, vllm_config, prefix=""): + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): vllm_config.model_config.hf_config.num_labels = 1 super().__init__(vllm_config=vllm_config, prefix=prefix) pooler_config = vllm_config.model_config.pooler_config - self._pooler = Pooler.from_config_with_defaults( + self.pooler = Pooler.from_config_with_defaults( pooler_config, pooling_type=PoolingType.ALL, normalize=False, @@ -109,11 +104,11 @@ def __init__(self, *, vllm_config, prefix=""): class Qwen2ForProcessRewardModel(Qwen2RewardBaseModel): - def __init__(self, *, vllm_config, prefix=""): + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): vllm_config.model_config.hf_config.num_labels = 2 super().__init__(vllm_config=vllm_config, prefix=prefix) pooler_config = vllm_config.model_config.pooler_config - self._pooler = Pooler.from_config_with_defaults( + self.pooler = Pooler.from_config_with_defaults( pooler_config, pooling_type=PoolingType.STEP, normalize=False, diff --git a/vllm/model_executor/models/roberta.py b/vllm/model_executor/models/roberta.py index 55ebb6e9e2a..7d3b56ced5c 100644 --- a/vllm/model_executor/models/roberta.py +++ b/vllm/model_executor/models/roberta.py @@ -15,8 +15,7 @@ from vllm.model_executor.models.bert import BertEmbeddingModel, BertModel from vllm.model_executor.models.utils import (AutoWeightsLoader, WeightsMapper, 
maybe_prefix) -from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.sequence import IntermediateTensors from .bert_with_rope import BertWithRope, JinaRobertaModel from .interfaces import SupportsCrossEncoding, SupportsV0Only @@ -165,6 +164,7 @@ class RobertaForSequenceClassification(nn.Module, SupportsCrossEncoding, _pooler: An instance of Pooler used for pooling operations. """ + is_pooling_model = True jina_to_vllm_mapper = WeightsMapper( orig_to_new_substr={ 'emb_ln': "embeddings.LayerNorm", @@ -188,7 +188,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): add_pooling_layer=False) self.classifier = RobertaClassificationHead(config) - self._pooler = ClassifierPooler( + self.pooler = ClassifierPooler( vllm_config.model_config, pooling=CLSPool(), classifier=self.classifier, @@ -198,13 +198,6 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): loader = AutoWeightsLoader(self) return loader.load_weights(weights, mapper=self.jina_to_vllm_mapper) - def pooler( - self, - hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> Optional[PoolerOutput]: - return self._pooler(hidden_states, pooling_metadata) - def forward( self, input_ids: Optional[torch.Tensor], diff --git a/vllm/pooling_params.py b/vllm/pooling_params.py index 106f3e8b22b..1a7305727e1 100644 --- a/vllm/pooling_params.py +++ b/vllm/pooling_params.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import TYPE_CHECKING, Any, Optional +from typing import TYPE_CHECKING, Optional import msgspec @@ -15,24 +15,31 @@ class PoolingParams( msgspec.Struct, omit_defaults=True, # type: ignore[call-arg] array_like=True): # type: ignore[call-arg] - """API parameters for pooling models. This is currently a placeholder. + """API parameters for pooling models. This Attributes: dimensions: Reduce the dimensions of embeddings if model support matryoshka representation. - additional_data: Any additional data needed for pooling. """ dimensions: Optional[int] = None + use_cross_encoder: bool = False - additional_data: Optional[Any] = None + """Internal use only.""" + + logits_processing_needs_token_ids: bool = False + """Internal use only.""" + output_kind: RequestOutputKind = RequestOutputKind.FINAL_ONLY def clone(self) -> "PoolingParams": """Returns a deep copy of the PoolingParams instance.""" - return PoolingParams(dimensions=self.dimensions, - use_cross_encoder=self.use_cross_encoder, - additional_data=self.additional_data) + return PoolingParams( + dimensions=self.dimensions, + use_cross_encoder=self.use_cross_encoder, + logits_processing_needs_token_ids=self. 
+ logits_processing_needs_token_ids, + ) def verify(self, model_config: "ModelConfig") -> None: if self.dimensions is not None: @@ -54,10 +61,12 @@ def verify(self, model_config: "ModelConfig") -> None: raise ValueError("Dimensions must be greater than 0") def __repr__(self) -> str: - return (f"PoolingParams(" - f"dimensions={self.dimensions}, " - f"use_cross_encoder={self.use_cross_encoder}, " - f"additional_metadata={self.additional_data})") + return ( + f"PoolingParams(" + f"dimensions={self.dimensions}, " + f"use_cross_encoder={self.use_cross_encoder}, " + f"logits_processing_needs_token_ids={self.logits_processing_needs_token_ids})" + ) def __post_init__(self) -> None: assert self.output_kind == RequestOutputKind.FINAL_ONLY,\ From 078514b5083551eaa2e5bfbf16109e362105cc39 Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Fri, 18 Jul 2025 02:32:52 +0800 Subject: [PATCH 159/552] [Misc] Qwen MoE model supports LoRA (#20932) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- docs/models/supported_models.md | 4 ++-- vllm/lora/models.py | 13 +++++++++++++ vllm/model_executor/models/qwen2_moe.py | 7 +++---- vllm/model_executor/models/qwen3_moe.py | 4 ++-- 4 files changed, 20 insertions(+), 8 deletions(-) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index ad5bf43f7fd..fc304fb6fd5 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -380,9 +380,9 @@ Specified using `--task generate`. | `Plamo2ForCausalLM` | PLaMo2 | `pfnet/plamo-2-1b`, `pfnet/plamo-2-8b`, etc. | | | | | `QWenLMHeadModel` | Qwen | `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | -| `Qwen2MoeForCausalLM` | Qwen2MoE | `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. | | ✅︎ | ✅︎ | +| `Qwen2MoeForCausalLM` | Qwen2MoE | `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen3ForCausalLM` | Qwen3 | `Qwen/Qwen3-8B`, etc. | ✅︎ | ✅︎ | ✅︎ | -| `Qwen3MoeForCausalLM` | Qwen3MoE | `Qwen/Qwen3-30B-A3B`, etc. | | ✅︎ | ✅︎ | +| `Qwen3MoeForCausalLM` | Qwen3MoE | `Qwen/Qwen3-30B-A3B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `StableLmForCausalLM` | StableLM | `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc. | | | ✅︎ | | `Starcoder2ForCausalLM` | Starcoder2 | `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc. | | ✅︎ | ✅︎ | | `SolarForCausalLM` | Solar Pro | `upstage/solar-pro-preview-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | diff --git a/vllm/lora/models.py b/vllm/lora/models.py index bff4e912578..521bb079da4 100644 --- a/vllm/lora/models.py +++ b/vllm/lora/models.py @@ -29,6 +29,7 @@ get_supported_lora_modules, is_regex_target_modules, parse_fine_tuned_lora_name, replace_submodule) +from vllm.model_executor.layers.fused_moe import FusedMoE from vllm.model_executor.model_loader.tensorizer import TensorizerConfig from vllm.model_executor.models import SupportsLoRA, supports_multimodal from vllm.model_executor.models.interfaces import is_pooling_model @@ -60,6 +61,17 @@ def get_lora_id(): return _GLOBAL_LORA_ID +def is_moe_model(model: nn.Module) -> bool: + """Checks if the model contains FusedMoE layers and warns the user.""" + if any(isinstance(module, FusedMoE) for module in model.modules()): + logger.warning_once( + "For MoE models, vLLM currently does not support fused MoE LoRA " + "inference. 
Please ensure that the loaded LoRA model does not " + "contain expert weights.") + return True + return False + + class LoRAModel(AdapterModel): """A LoRA fine-tuned model.""" @@ -375,6 +387,7 @@ def __init__( # text modules (e.g. ChatGLM) and hasattr(self.model, "get_mm_mapping")) self.is_pooling_model = is_pooling_model(self.model) + self.is_moe_model = is_moe_model(self.model) self.packed_modules: dict[str, list[str]] = {} self.modules: dict[str, BaseLayerWithLoRA] = {} # Dict instead of a set for compatibility with LRUCache. diff --git a/vllm/model_executor/models/qwen2_moe.py b/vllm/model_executor/models/qwen2_moe.py index 84bae87804c..b061e2f69a6 100644 --- a/vllm/model_executor/models/qwen2_moe.py +++ b/vllm/model_executor/models/qwen2_moe.py @@ -53,7 +53,7 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .interfaces import SupportsPP +from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, extract_layer_index, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, @@ -448,8 +448,7 @@ def load_weights(self, weights: Iterable[tuple[str, if weight_name not in name: continue name = name.replace(weight_name, param_name) - if "layers.13.mlp.experts.w2_weight" in name: - pass + # Skip layers on other devices. if is_pp_missing_parameter(name, self): continue @@ -494,7 +493,7 @@ def load_weights(self, weights: Iterable[tuple[str, return loaded_params -class Qwen2MoeForCausalLM(nn.Module, SupportsPP): +class Qwen2MoeForCausalLM(nn.Module, SupportsPP, SupportsLoRA): fall_back_to_pt_during_load = False packed_modules_mapping = { diff --git a/vllm/model_executor/models/qwen3_moe.py b/vllm/model_executor/models/qwen3_moe.py index 0f749b3e38f..12899c28016 100644 --- a/vllm/model_executor/models/qwen3_moe.py +++ b/vllm/model_executor/models/qwen3_moe.py @@ -50,7 +50,7 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .interfaces import SupportsPP +from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, extract_layer_index, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, @@ -482,7 +482,7 @@ def load_weights(self, weights: Iterable[tuple[str, return loaded_params -class Qwen3MoeForCausalLM(nn.Module, SupportsPP): +class Qwen3MoeForCausalLM(nn.Module, SupportsPP, SupportsLoRA): packed_modules_mapping = { "qkv_proj": [ "q_proj", From 549f050fc42bb3d1642a0a25fedaed16a1655556 Mon Sep 17 00:00:00 2001 From: Eric Curtin Date: Thu, 17 Jul 2025 19:52:17 +0100 Subject: [PATCH 160/552] On environments where numa cannot be detected we get 0 (#21115) Signed-off-by: Eric Curtin Signed-off-by: x22x22 --- vllm/v1/worker/cpu_worker.py | 188 +++++++++++++++++++++-------------- 1 file changed, 111 insertions(+), 77 deletions(-) diff --git a/vllm/v1/worker/cpu_worker.py b/vllm/v1/worker/cpu_worker.py index 0bd3e580ba0..d31991b5b36 100644 --- a/vllm/v1/worker/cpu_worker.py +++ b/vllm/v1/worker/cpu_worker.py @@ -13,12 +13,20 @@ from vllm.model_executor.utils import set_random_seed from vllm.platforms import CpuArchEnum, current_platform from vllm.sequence import IntermediateTensors +from vllm.utils import PlaceholderModule from vllm.v1.core.sched.output import SchedulerOutput from vllm.v1.outputs import ModelRunnerOutput from vllm.v1.worker.cpu_model_runner import CPUModelRunner from vllm.v1.worker.gpu_worker import (Worker, init_worker_distributed_environment) 
+try: + import psutil + from numa import info +except ImportError: + psutil = PlaceholderModule("psutil") # type: ignore[assignment] + numa = PlaceholderModule("numa") # type: ignore[assignment] + logger = init_logger(__name__) @@ -37,6 +45,8 @@ def __init__(self, is_driver_worker=is_driver_worker) self.parallel_config.disable_custom_all_reduce = True + self.manually_bind_threads_suggestion = ( + "To get better performance, please try to manually bind threads.") def init_device(self): # Setup OpenMP threads affinity. @@ -112,50 +122,111 @@ def execute_model( assert isinstance(output, ModelRunnerOutput) return output if self.is_driver_worker else None + def warn_inability_to_detect_numa(self) -> None: + logger.warning( + "Auto thread-binding failed due to the " + "inability to detect numa nodes. %s", + self.manually_bind_threads_suggestion) + + def warn_lack_of_numa_and_psutil(self) -> None: + logger.warning( + "Auto thread-binding failed due to " + "the lack of package numa and psutil. %s", + self.manually_bind_threads_suggestion) + + def warn_world_size_too_large(self, world_size: int, + node_to_cpus_len: int) -> None: + logger.warning( + "Auto thread-binding failed due to " + "world size: %d being larger than " + "allowed NUMA nodes number: %d. %s", world_size, node_to_cpus_len, + self.manually_bind_threads_suggestion) + + def get_cpus_allow_list_and_numa_size(self): + cpus_allow_list = psutil.Process().cpu_affinity() + numa_size = info.get_num_configured_nodes() + return cpus_allow_list, numa_size + + def auto_thread_binding_based_on_numa_nodes(self, world_size: int, + rank_to_cpus: str) -> str: + cpu_count = psutil.cpu_count(logical=False) + cpus_allow_list, numa_size = self.get_cpus_allow_list_and_numa_size() + if not numa_size: + self.warn_inability_to_detect_numa() + return rank_to_cpus + + cpu_count_per_numa = cpu_count // numa_size + num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU, + cpu_count_per_numa // 2) + + node_to_cpus = [] + for i in range(numa_size): + node_intersect = set( + info.node_to_cpus(i)).intersection(cpus_allow_list) + if bool(node_intersect): + node_to_cpus.append(list(node_intersect)) + + node_to_cpus_len = len(node_to_cpus) + if world_size > node_to_cpus_len: + self.warn_world_size_too_large(world_size, node_to_cpus_len) + else: + end = cpu_count_per_numa - num_of_reserved_cpu + rank_to_cpus_list = node_to_cpus[self.rank][:end] + rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list) + logger.info("auto thread-binding list: %s", rank_to_cpus) + return rank_to_cpus + + def libnuma_and_psutil_found(self) -> bool: + libnuma_found = util.find_spec("numa") is not None + psutil_found = util.find_spec("psutil") is not None + + return libnuma_found and psutil_found + def get_cpus_id_binding_based_on_numa_nodes(self) -> str: """Return CPUs id binding based on NUMA nodes. 
""" rank_to_cpus = self.local_omp_cpuid # Setup OpenMP thread affinity based on NUMA nodes automatically world_size = self.vllm_config.parallel_config.world_size - libnuma_found = util.find_spec("numa") is not None - psutil_found = util.find_spec("psutil") is not None - if libnuma_found and psutil_found: - import psutil - from numa import info - cpu_count = psutil.cpu_count(logical=False) - cpus_allow_list = psutil.Process().cpu_affinity() - numa_size = info.get_num_configured_nodes() - cpu_count_per_numa = cpu_count // numa_size - num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU, - cpu_count_per_numa // 2) + if self.libnuma_and_psutil_found(): + rank_to_cpus = self.auto_thread_binding_based_on_numa_nodes( + world_size, rank_to_cpus) + else: + self.warn_lack_of_numa_and_psutil() + return rank_to_cpus - # check allow node_to_cpus list - node_to_cpus = [] - for i in range(numa_size): - node_intersect = set( - info.node_to_cpus(i)).intersection(cpus_allow_list) - if bool(node_intersect): - node_to_cpus.append(list(node_intersect)) - - if world_size > len(node_to_cpus): - logger.error( - "Auto thread-binding failed due to " - "world size: %d is larger than " - "allowed NUMA nodes number: %d." - "Please try to bind threads manually.", world_size, - len(node_to_cpus)) - else: - end = cpu_count_per_numa - num_of_reserved_cpu - rank_to_cpus_list = node_to_cpus[self.rank][:end] - rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list) - logger.info("auto thread-binding list: %s", rank_to_cpus) + def select_threads_per_power_core(self, + node_cpu_ids: list[int]) -> list[int]: + return [cpu for cpu in node_cpu_ids if cpu % 8 < 4] + + def auto_thread_binding_based_on_numa_nodes_ppc64le( + self, world_size: int, rank_to_cpus: str) -> str: + cpus_allow_list, numa_size = self.get_cpus_allow_list_and_numa_size() + if not numa_size: + self.warn_inability_to_detect_numa() + return rank_to_cpus + + node_to_cpus = [] + for i in range(numa_size): + node_intersect = set( + info.node_to_cpus(i)).intersection(cpus_allow_list) + if bool(node_intersect): + node_to_cpus.append(sorted(list(node_intersect))) + + node_to_cpus_len = len(node_to_cpus) + if world_size > node_to_cpus_len: + self.warn_world_size_too_large(world_size, node_to_cpus_len) else: - logger.warning( - "Auto thread-binding is not supported due to " - "the lack of package numa and psutil," - "fallback to no thread-binding. To get better performance," - "please try to manually bind threads.") + node_cpus_this_rank = node_to_cpus[self.rank] + node_cpus_this_rank = self.select_threads_per_power_core( + node_cpus_this_rank) + cpu_count_per_numa = len(node_cpus_this_rank) + num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU, + cpu_count_per_numa // 2) + end = cpu_count_per_numa - num_of_reserved_cpu + rank_to_cpus_list = node_cpus_this_rank[:end] + rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list) + logger.info("ppc64le thread-binding list: %s", rank_to_cpus) return rank_to_cpus def get_cpus_id_binding_based_on_numa_nodes_ppc64le(self) -> str: @@ -166,48 +237,11 @@ def get_cpus_id_binding_based_on_numa_nodes_ppc64le(self) -> str: performance by avoiding oversubscription of logical CPUs on Power. 
""" - def select_threads_per_power_core(node_cpu_ids): - return [cpu for cpu in node_cpu_ids if cpu % 8 < 4] - rank_to_cpus = self.local_omp_cpuid world_size = self.vllm_config.parallel_config.world_size - libnuma_found = util.find_spec("numa") is not None - psutil_found = util.find_spec("psutil") is not None - if libnuma_found and psutil_found: - import psutil - from numa import info - cpus_allow_list = psutil.Process().cpu_affinity() - numa_size = info.get_num_configured_nodes() - - node_to_cpus = [] - for i in range(numa_size): - node_intersect = set( - info.node_to_cpus(i)).intersection(cpus_allow_list) - if bool(node_intersect): - node_to_cpus.append(sorted(list(node_intersect))) - - if world_size > len(node_to_cpus): - logger.error( - "Auto thread-binding failed due to " - "world size: %d is larger than " - "allowed NUMA nodes number: %d." - "Please try to bind threads manually.", world_size, - len(node_to_cpus)) - else: - node_cpus_this_rank = node_to_cpus[self.rank] - node_cpus_this_rank = select_threads_per_power_core( - node_cpus_this_rank) - cpu_count_per_numa = len(node_cpus_this_rank) - num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU, - cpu_count_per_numa // 2) - end = cpu_count_per_numa - num_of_reserved_cpu - rank_to_cpus_list = node_cpus_this_rank[:end] - rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list) - logger.info("ppc64le thread-binding list: %s", rank_to_cpus) + if self.libnuma_and_psutil_found(): + rank_to_cpus = self.auto_thread_binding_based_on_numa_nodes_ppc64le( + world_size, rank_to_cpus) else: - logger.warning( - "Auto thread-binding is not supported due to " - "the lack of package numa and psutil," - "fallback to no thread-binding. To get better performance," - "please try to manually bind threads.") + self.warn_lack_of_numa_and_psutil() return rank_to_cpus From 31997cac05a861f117883b5f81bd3fb2efef2e11 Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Thu, 17 Jul 2025 16:37:36 -0700 Subject: [PATCH 161/552] [V0 deprecation] Remove V0 HPU backend (#21131) Signed-off-by: Woosuk Kwon Signed-off-by: x22x22 --- docker/Dockerfile.hpu | 21 - requirements/hpu.txt | 12 - setup.py | 36 +- vllm/_custom_ops.py | 3 +- vllm/attention/backends/hpu_attn.py | 319 --- vllm/attention/ops/hpu_paged_attn.py | 88 - vllm/config.py | 2 +- vllm/core/block/cpu_gpu_block_allocator.py | 4 +- .../device_communicators/hpu_communicator.py | 46 - vllm/engine/arg_utils.py | 5 +- vllm/envs.py | 15 - vllm/lora/layers.py | 4 - vllm/lora/punica_wrapper/punica_hpu.py | 145 -- vllm/model_executor/custom_op.py | 7 - vllm/model_executor/layers/fused_moe/layer.py | 36 - vllm/model_executor/layers/layernorm.py | 20 - .../model_executor/layers/rotary_embedding.py | 58 - .../layers/vocab_parallel_embedding.py | 16 +- .../model_loader/bitsandbytes_loader.py | 11 +- .../model_loader/default_loader.py | 10 - vllm/platforms/__init__.py | 18 - vllm/platforms/hpu.py | 114 - vllm/platforms/interface.py | 5 - vllm/plugins/__init__.py | 13 - vllm/worker/hpu_model_runner.py | 2320 ----------------- vllm/worker/hpu_worker.py | 485 ---- vllm/worker/multi_step_hpu_worker.py | 123 - 27 files changed, 10 insertions(+), 3926 deletions(-) delete mode 100644 docker/Dockerfile.hpu delete mode 100644 requirements/hpu.txt delete mode 100644 vllm/attention/backends/hpu_attn.py delete mode 100644 vllm/attention/ops/hpu_paged_attn.py delete mode 100644 vllm/distributed/device_communicators/hpu_communicator.py delete mode 100644 vllm/lora/punica_wrapper/punica_hpu.py delete mode 100644 vllm/platforms/hpu.py 
delete mode 100644 vllm/worker/hpu_model_runner.py delete mode 100644 vllm/worker/hpu_worker.py delete mode 100644 vllm/worker/multi_step_hpu_worker.py diff --git a/docker/Dockerfile.hpu b/docker/Dockerfile.hpu deleted file mode 100644 index 224f142b5ff..00000000000 --- a/docker/Dockerfile.hpu +++ /dev/null @@ -1,21 +0,0 @@ -FROM vault.habana.ai/gaudi-docker/1.20.1/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest - -COPY ./ /workspace/vllm - -WORKDIR /workspace/vllm - -RUN pip install -v -r requirements/hpu.txt - -ENV no_proxy=localhost,127.0.0.1 -ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true - -RUN VLLM_TARGET_DEVICE=hpu python3 setup.py install - -# install development dependencies (for testing) -RUN python3 -m pip install -e tests/vllm_test_utils - -WORKDIR /workspace/ - -RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks - -ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"] diff --git a/requirements/hpu.txt b/requirements/hpu.txt deleted file mode 100644 index a88777268a3..00000000000 --- a/requirements/hpu.txt +++ /dev/null @@ -1,12 +0,0 @@ -# Common dependencies --r common.txt - -# Dependencies for HPU code -ray -triton==3.1.0 -pandas -numpy==1.26.4 -tabulate -setuptools>=77.0.3,<80.0.0 -setuptools-scm>=8 -vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@f1f6624 diff --git a/setup.py b/setup.py index 795d5496455..9a5ca3456a0 100644 --- a/setup.py +++ b/setup.py @@ -410,29 +410,6 @@ def run(self) -> None: package_data[package_name].append(file_name) -def _is_hpu() -> bool: - # if VLLM_TARGET_DEVICE env var was set explicitly, skip HPU autodetection - if os.getenv("VLLM_TARGET_DEVICE", None) == VLLM_TARGET_DEVICE: - return VLLM_TARGET_DEVICE == "hpu" - - # if VLLM_TARGET_DEVICE was not set explicitly, check if hl-smi succeeds, - # and if it doesn't, check if habanalabs driver is loaded - is_hpu_available = False - try: - out = subprocess.run(["hl-smi"], capture_output=True, check=True) - is_hpu_available = out.returncode == 0 - except (FileNotFoundError, PermissionError, subprocess.CalledProcessError): - if sys.platform.startswith("linux"): - try: - output = subprocess.check_output( - 'lsmod | grep habanalabs | wc -l', shell=True) - is_hpu_available = int(output) > 0 - except (ValueError, FileNotFoundError, PermissionError, - subprocess.CalledProcessError): - pass - return is_hpu_available - - def _no_device() -> bool: return VLLM_TARGET_DEVICE == "empty" @@ -440,7 +417,7 @@ def _no_device() -> bool: def _is_cuda() -> bool: has_cuda = torch.version.cuda is not None return (VLLM_TARGET_DEVICE == "cuda" and has_cuda - and not (_is_neuron() or _is_tpu() or _is_hpu())) + and not (_is_neuron() or _is_tpu())) def _is_hip() -> bool: @@ -573,12 +550,6 @@ def get_vllm_version() -> str: if neuron_version != MAIN_CUDA_VERSION: neuron_version_str = neuron_version.replace(".", "")[:3] version += f"{sep}neuron{neuron_version_str}" - elif _is_hpu(): - # Get the Intel Gaudi Software Suite version - gaudi_sw_version = str(get_gaudi_sw_version()) - if gaudi_sw_version != MAIN_CUDA_VERSION: - gaudi_sw_version = gaudi_sw_version.replace(".", "")[:3] - version += f"{sep}gaudi{gaudi_sw_version}" elif _is_tpu(): version += f"{sep}tpu" elif _is_cpu(): @@ -625,8 +596,6 @@ def _read_requirements(filename: str) -> list[str]: requirements = _read_requirements("rocm.txt") elif _is_neuron(): requirements = _read_requirements("neuron.txt") - elif _is_hpu(): - requirements = _read_requirements("hpu.txt") elif 
_is_tpu(): requirements = _read_requirements("tpu.txt") elif _is_cpu(): @@ -635,8 +604,7 @@ def _read_requirements(filename: str) -> list[str]: requirements = _read_requirements("xpu.txt") else: raise ValueError( - "Unsupported platform, please use CUDA, ROCm, Neuron, HPU, " - "or CPU.") + "Unsupported platform, please use CUDA, ROCm, Neuron, or CPU.") return requirements diff --git a/vllm/_custom_ops.py b/vllm/_custom_ops.py index f25db40a4ef..81f4f6bdada 100644 --- a/vllm/_custom_ops.py +++ b/vllm/_custom_ops.py @@ -13,8 +13,7 @@ logger = init_logger(__name__) -if not current_platform.is_tpu() and not current_platform.is_hpu()\ - and not current_platform.is_xpu(): +if not current_platform.is_tpu() and not current_platform.is_xpu(): try: import vllm._C except ImportError as e: diff --git a/vllm/attention/backends/hpu_attn.py b/vllm/attention/backends/hpu_attn.py deleted file mode 100644 index b8fdf763a04..00000000000 --- a/vllm/attention/backends/hpu_attn.py +++ /dev/null @@ -1,319 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -############################################################################### -# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company -############################################################################### - -from dataclasses import dataclass -from typing import Any, Dict, List, Optional, Tuple, Type - -import torch -import vllm_hpu_extension.kernels as kernels -import vllm_hpu_extension.ops as ops -from vllm_hpu_extension.flags import enabled_flags -from vllm_hpu_extension.utils import Matmul, Softmax, VLLMKVCache - -from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl, - AttentionLayer, - AttentionMetadata, AttentionType, - is_quantized_kv_cache) -from vllm.attention.backends.utils import CommonAttentionState -from vllm.attention.ops.hpu_paged_attn import (HPUPagedAttention, - HPUPagedAttentionMetadata) -from vllm.logger import init_logger - -logger = init_logger(__name__) - - -class HPUAttentionBackend(AttentionBackend): - - @staticmethod - def get_name() -> str: - return "HPU_ATTN" - - @staticmethod - def get_impl_cls() -> Type["HPUAttentionImpl"]: - return HPUAttentionImpl - - @staticmethod - def get_metadata_cls() -> Type["AttentionMetadata"]: - return HPUAttentionMetadata - - @staticmethod - def get_state_cls() -> Type["CommonAttentionState"]: - return CommonAttentionState - - @staticmethod - def get_kv_cache_shape( - num_blocks: int, - block_size: int, - num_kv_heads: int, - head_size: int, - ) -> Tuple[int, ...]: - return HPUPagedAttention.get_kv_cache_shape(num_blocks, block_size, - num_kv_heads, head_size) - - @staticmethod - def swap_blocks( - src_kv_cache: torch.Tensor, - dst_kv_cache: torch.Tensor, - src_to_dsts: torch.Tensor, - ) -> None: - HPUPagedAttention.swap_blocks(src_kv_cache, dst_kv_cache, src_to_dsts) - - @staticmethod - def copy_blocks( - kv_caches: List[torch.Tensor], - src_to_dsts: torch.Tensor, - ) -> None: - HPUPagedAttention.copy_blocks(kv_caches, src_to_dsts) - - -@dataclass -class HPUAttentionMetadata(HPUPagedAttentionMetadata, AttentionMetadata): - """Metadata for HPUAttentionbackend.""" - # Currently, input sequences can only contain all prompts - # or all decoding. True if all sequences are prompts. 
- is_prompt: bool - attn_bias: Optional[torch.Tensor] - seq_lens_tensor: Optional[torch.Tensor] - context_lens_tensor: Optional[torch.Tensor] - - -class HPUAttentionImpl(AttentionImpl, torch.nn.Module): - """ - If the input tensors contain prompt tokens, the layout is as follows: - |<--------------- num_prefill_tokens ----------------->| - |<--prefill_0-->|<--prefill_1-->|...|<--prefill_N-1--->| - - Otherwise, the layout is as follows: - |<----------------- num_decode_tokens ------------------>| - |<--decode_0-->|..........|<--decode_M-1-->|<--padding-->| - - Generation tokens can contain padding when cuda-graph is used. - Currently, prompt tokens don't contain any padding. - - The prompts might have different lengths, while the generation tokens - always have length 1. - """ - - def __init__( - self, - num_heads: int, - head_size: int, - scale: float, - num_kv_heads: int, - alibi_slopes: Optional[List[float]], - sliding_window: Optional[int], - kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]] = None, - max_seq_len: int = 4096, - attn_type: str = AttentionType.DECODER, - kv_sharing_target_layer_name: Optional[str] = None, - use_irope: bool = False, - ) -> None: - super(AttentionImpl, self).__init__() - if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in V0 " - "HPU_ATTN backend.") - if use_irope: - logger.warning_once( - "Using irope in HPU is not supported yet, it will fall back " - "to global attention for long context.") - self.kv_cache_dtype = kv_cache_dtype - self.num_heads = num_heads - self.head_size = head_size - self.scale = float(scale) - self.matmul_qk = Matmul() - self.softmax = Softmax() - self.matmul_av = Matmul() - self.batch2block_matmul = Matmul() - self.block2batch_matmul = Matmul() - self.k_cache = VLLMKVCache() - self.v_cache = VLLMKVCache() - self.fused_scaled_dot_product_attention = kernels.fsdpa() - - self.prefill_impl = 'naive' - if "flex_attention" in enabled_flags(): - self.prefill_impl = 'flex' - if "fsdpa" in enabled_flags(): - assert alibi_slopes is None, \ - 'Prefill with FusedSDPA not supported with alibi slopes!' - self.prefill_impl = 'fsdpa' - - self.num_kv_heads = num_heads if num_kv_heads is None else num_kv_heads - self.sliding_window = sliding_window - self.alibi_slopes = alibi_slopes - if alibi_slopes is not None: - alibi_slopes_tensor = torch.tensor(alibi_slopes, - dtype=torch.bfloat16) - self.alibi_slopes = alibi_slopes_tensor - self.num_queries_per_kv = self.num_heads // self.num_kv_heads - - if self.prefill_impl == 'fsdpa': - assert alibi_slopes is None, \ - 'Prefill with FusedSDPA not supported with alibi slopes!' - - supported_head_sizes = HPUPagedAttention.get_supported_head_sizes() - if head_size not in supported_head_sizes: - raise ValueError( - f"Head size {head_size} is not supported by PagedAttention. 
" - f"Supported head sizes are: {supported_head_sizes}.") - - self.attn_type = attn_type - if self.attn_type != AttentionType.DECODER: - raise NotImplementedError("Encoder self-attention and " - "encoder/decoder cross-attention " - "are not implemented for " - "HPUAttentionImpl") - - if is_quantized_kv_cache(self.kv_cache_dtype): - raise NotImplementedError( - "HPUAttention with FP8 KV cache not yet supported") - - def forward( - self, - layer: AttentionLayer, - query: torch.Tensor, - key: torch.Tensor, - value: torch.Tensor, - kv_cache: torch.Tensor, - attn_metadata: HPUAttentionMetadata, - output: Optional[torch.Tensor] = None, - output_scale: Optional[torch.Tensor] = None, - ) -> torch.Tensor: - """Forward pass with xFormers and PagedAttention. - - Args: - query: shape = [num_tokens, num_heads * head_size] - key: shape = [num_tokens, num_kv_heads * head_size] - value: shape = [num_tokens, num_kv_heads * head_size] - kv_cache = [2, num_blocks, block_size * num_kv_heads * head_size] - attn_metadata: Metadata for attention. - Returns: - shape = [num_tokens, num_heads * head_size] - """ - if output_scale is not None: - raise NotImplementedError( - "fused output quantization is not yet supported" - " for HPUAttentionImpl") - - batch_size, seq_len, hidden_size = query.shape - _, seq_len_kv, _ = key.shape - - key = key.view(-1, self.num_kv_heads, self.head_size) - value = value.view(-1, self.num_kv_heads, self.head_size) - block_indices = attn_metadata.block_indices - block_offsets = attn_metadata.block_offsets - key_cache = None - value_cache = None - if attn_metadata.is_prompt and self.attn_type \ - is not AttentionType.ENCODER_ONLY: - key = key.unflatten(0, (block_indices.size(0), -1)) - value = value.unflatten(0, (block_indices.size(0), -1)) - if kv_cache is not None and isinstance(kv_cache, tuple): - key_cache, value_cache = HPUPagedAttention.split_kv_cache( - kv_cache, self.num_kv_heads, self.head_size) - - # Reshape the input keys and values and store them in the cache. - # If kv_cache is not provided, the new key and value tensors are - # not cached. This happens during the initial memory profiling run. - key_cache = self.k_cache(key, key_cache, block_indices, - block_offsets) - value_cache = self.v_cache(value, value_cache, block_indices, - block_offsets) - - if attn_metadata.is_prompt: - # Prompt run. - query_shape = (batch_size, seq_len, self.num_heads, self.head_size) - kv_shape = (batch_size, seq_len_kv, self.num_kv_heads, - self.head_size) - - attn_bias = attn_metadata.attn_bias - if attn_bias is not None and self.alibi_slopes is not None: - position_bias = _make_alibi_bias(self.alibi_slopes, - self.num_kv_heads, - attn_bias.dtype, - attn_bias.shape[-1]) - attn_bias = attn_bias.tile((1, self.num_kv_heads, 1, 1)) - attn_bias.add_(position_bias) - - block_list = attn_metadata.block_list if attn_metadata \ - and attn_metadata.block_list is not None else None - - out = ops.prompt_attention( - impl=self.prefill_impl, - query=query.view(query_shape), - key=key.view(kv_shape), - value=value.view(kv_shape), - is_causal=True, - attn_bias=attn_bias, - valid_seq_lengths=attn_metadata.seq_lens_tensor, - **self.common_attention_args(block_list, key_cache, - value_cache)) - output = out.reshape(batch_size, seq_len, hidden_size) - else: - # Decoding run. 
- output = HPUPagedAttention.forward_decode( - query=query, - block_mapping=attn_metadata.block_mapping, - block_bias=attn_metadata.attn_bias, - block_groups=attn_metadata.block_groups, - **self.common_attention_args(attn_metadata.block_list, - key_cache, value_cache)) - # Reshape the output tensor. - return output.view(batch_size, seq_len, hidden_size) - - def common_attention_args(self, - block_list=None, - key_cache=None, - value_cache=None): - fsdpa_op = self.fused_scaled_dot_product_attention.apply \ - if self.fused_scaled_dot_product_attention is not None else None - return { - 'scale': self.scale, - 'matmul_qk_op': self.matmul_qk, - 'matmul_av_op': self.matmul_av, - 'batch2block_matmul_op': self.batch2block_matmul, - 'block2batch_matmul_op': self.block2batch_matmul, - 'fsdpa_op': fsdpa_op, - 'keys_fetch_func': self.k_cache.fetch_from_cache, - 'values_fetch_func': self.v_cache.fetch_from_cache, - 'softmax_op': self.softmax, - 'block_list': block_list, - 'key_cache': key_cache, - 'value_cache': value_cache, - } - - -def _make_alibi_bias( - alibi_slopes: torch.Tensor, - num_kv_heads: int, - dtype: torch.dtype, - seq_len: int, -) -> torch.Tensor: - bias = torch.arange(seq_len, dtype=dtype) - # NOTE(zhuohan): HF uses - # `bias = bias[None, :].repeat(seq_len, 1)` - # here. We find that both biases give the same results, but - # the bias below more accurately follows the original ALiBi - # paper. - # Calculate a matrix where each element represents ith element- jth - # element. - bias = bias[None, :] - bias[:, None] - - padded_len = (seq_len + 7) // 8 * 8 - num_heads = alibi_slopes.shape[0] - bias = torch.empty( - 1, # batch size - num_heads, - seq_len, - padded_len, - device=alibi_slopes.device, - dtype=dtype, - )[:, :, :, :seq_len].copy_(bias) - bias.mul_(alibi_slopes[:, None, None]) - if num_heads != num_kv_heads: - bias = bias.unflatten(1, (num_kv_heads, num_heads // num_kv_heads)) - return bias diff --git a/vllm/attention/ops/hpu_paged_attn.py b/vllm/attention/ops/hpu_paged_attn.py deleted file mode 100644 index 412dd20ec1d..00000000000 --- a/vllm/attention/ops/hpu_paged_attn.py +++ /dev/null @@ -1,88 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -############################################################################### -# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company -############################################################################### - -from dataclasses import dataclass -from typing import List, Optional, Tuple - -import torch -from vllm_hpu_extension import cache_ops, ops - -# Should be the same as PARTITION_SIZE in `paged_attention_v2_launcher`. 
-_PARTITION_SIZE = 512 - - -@dataclass -class HPUPagedAttentionMetadata: - """Metadata for PagedAttention.""" - block_list: Optional[torch.Tensor] - block_mapping: Optional[torch.Tensor] - block_usage: Optional[torch.Tensor] - block_indices: Optional[torch.Tensor] - block_offsets: Optional[torch.Tensor] - block_groups: Optional[torch.Tensor] - - -class HPUPagedAttention: - - @staticmethod - def get_supported_head_sizes() -> List[int]: - return [64, 80, 96, 112, 128, 256] - - @staticmethod - def get_kv_cache_shape( - num_blocks: int, - block_size: int, - num_kv_heads: int, - head_size: int, - ) -> Tuple[int, ...]: - return (num_blocks, block_size, num_kv_heads, head_size) - - @staticmethod - def split_kv_cache( - kv_cache: torch.Tensor, - num_kv_heads: int, - head_size: int, - ) -> Tuple[torch.Tensor, torch.Tensor]: - key_cache = kv_cache[0] - value_cache = kv_cache[1] - return key_cache, value_cache - - @staticmethod - def write_to_paged_cache(key: torch.Tensor, value: torch.Tensor, - key_cache: torch.Tensor, - value_cache: torch.Tensor, - slot_mapping: torch.Tensor, kv_cache_dtype: str, - is_prompt: bool) -> None: - cache_ops.reshape_and_cache(key, value, key_cache, value_cache, - slot_mapping, kv_cache_dtype, is_prompt) - - @staticmethod - def forward_decode(**kwargs) -> torch.Tensor: - return ops.flat_pa(**kwargs) - - @staticmethod - def swap_blocks( - src_kv_cache: Tuple[torch.Tensor, torch.Tensor], - dst_kv_cache: Tuple[torch.Tensor, torch.Tensor], - src_to_dsts: torch.Tensor, - ) -> None: - src_key_cache = src_kv_cache[0] - dst_key_cache = dst_kv_cache[0] - cache_ops.swap_blocks(src_key_cache, dst_key_cache, src_to_dsts) - - src_value_cache = src_kv_cache[1] - dst_value_cache = dst_kv_cache[1] - cache_ops.swap_blocks(src_value_cache, dst_value_cache, src_to_dsts) - - @staticmethod - def copy_blocks( - kv_caches: List[Tuple[torch.Tensor, torch.Tensor]], - src_to_dsts: torch.Tensor, - ) -> None: - key_caches = [kv_cache[0] for kv_cache in kv_caches] - value_caches = [kv_cache[1] for kv_cache in kv_caches] - cache_ops.copy_blocks(key_caches, value_caches, src_to_dsts) diff --git a/vllm/config.py b/vllm/config.py index c3f0cebc6b3..41997488fa6 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -2452,7 +2452,7 @@ def is_multi_step(self) -> bool: return self.num_scheduler_steps > 1 -Device = Literal["auto", "cuda", "neuron", "cpu", "tpu", "xpu", "hpu"] +Device = Literal["auto", "cuda", "neuron", "cpu", "tpu", "xpu"] @config diff --git a/vllm/core/block/cpu_gpu_block_allocator.py b/vllm/core/block/cpu_gpu_block_allocator.py index ea490c32791..92bc5e157e1 100644 --- a/vllm/core/block/cpu_gpu_block_allocator.py +++ b/vllm/core/block/cpu_gpu_block_allocator.py @@ -7,7 +7,6 @@ DeviceAwareBlockAllocator) from vllm.core.block.naive_block import NaiveBlock, NaiveBlockAllocator from vllm.core.block.prefix_caching_block import PrefixCachingBlockAllocator -from vllm.platforms import current_platform from vllm.utils import Device @@ -56,8 +55,7 @@ def create( - The block IDs are assigned contiguously, with GPU block IDs coming before CPU block IDs. 
""" - # For HPU, block id 0 is used only for padding - reserved_blocks = 1 if current_platform.is_hpu() else 0 + reserved_blocks = 0 block_ids = list( range(reserved_blocks, num_gpu_blocks + num_cpu_blocks)) num_gpu_blocks -= reserved_blocks diff --git a/vllm/distributed/device_communicators/hpu_communicator.py b/vllm/distributed/device_communicators/hpu_communicator.py deleted file mode 100644 index f00f6b62bf2..00000000000 --- a/vllm/distributed/device_communicators/hpu_communicator.py +++ /dev/null @@ -1,46 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import torch -import torch.distributed as dist - -from vllm.platforms import current_platform - -from .base_device_communicator import DeviceCommunicatorBase - -if current_platform.is_hpu(): - import habana_frameworks.torch as htorch # noqa: F401 - - -class HpuCommunicator(DeviceCommunicatorBase): - - def all_reduce(self, input_: torch.Tensor) -> torch.Tensor: - # FIXME(kzawora): this is a workaround for a bug in Habana PT bridge - # occurring when PT_HPU_ENABLE_LAZY_COLLECTIVES=true env var is used - # (which is required for tensor parallel HPUGraph inference) - htorch.core.mark_step() - dist.all_reduce(input_, group=self.device_group) - return input_ - - def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor: - world_size = self.world_size - if dim < 0: - # Convert negative dim to positive. - dim += input_.dim() - input_size = input_.size() - # Allocate output tensor. - output_tensor = torch.empty((world_size, ) + input_size, - dtype=input_.dtype, - device=input_.device) - # All-gather. - htorch.core.mark_step() - dist.all_gather_into_tensor(output_tensor, - input_, - group=self.device_group) - # Reshape - output_tensor = output_tensor.movedim(0, dim) - output_tensor = output_tensor.reshape(input_size[:dim] + - (world_size * - input_size[dim], ) + - input_size[dim + 1:]) - return output_tensor diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index ae5eb46fa96..b20defde73e 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -1365,9 +1365,8 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: supported = False if current_platform.is_rocm() or ( current_platform.is_cuda() - and current_platform.is_device_capability(100)) or ( - current_platform.device_name - == "hpu"): # handle hpu also for OOT platform + and current_platform.is_device_capability(100) + ): # handle hpu also for OOT platform supported = True elif fp8_attention and will_use_fa: from vllm.attention.utils.fa_utils import ( diff --git a/vllm/envs.py b/vllm/envs.py index 502978c7685..ba0c55160b7 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -106,8 +106,6 @@ VLLM_RAY_PER_WORKER_GPUS: float = 1.0 VLLM_RAY_BUNDLE_INDICES: str = "" VLLM_CUDART_SO_PATH: Optional[str] = None - VLLM_USE_HPU_CONTIGUOUS_CACHE_FETCH: bool = True - VLLM_HPU_USE_DELAYED_SAMPLING: bool = False VLLM_DP_RANK: int = 0 VLLM_DP_RANK_LOCAL: int = -1 VLLM_DP_SIZE: int = 1 @@ -780,19 +778,6 @@ def get_vllm_port() -> Optional[int]: "VLLM_CUDART_SO_PATH": lambda: os.getenv("VLLM_CUDART_SO_PATH", None), - # Contiguous cache fetching to avoid using costly gather operation on - # Gaudi3. This is only applicable to HPU contiguous cache. If set to true, - # contiguous cache fetch will be used. 
- "VLLM_USE_HPU_CONTIGUOUS_CACHE_FETCH": - lambda: os.environ.get("VLLM_CONTIGUOUS_PA", "true").lower() in - ("1", "true"), - - # Use delayed sampling for HPU to reduce host cpu overhead - # between each step. - "VLLM_HPU_USE_DELAYED_SAMPLING": - lambda: os.environ.get("VLLM_DELAYED_SAMPLING", "false").lower() in - ("1", "true"), - # Rank of the process in the data parallel setting "VLLM_DP_RANK": lambda: int(os.getenv("VLLM_DP_RANK", "0")), diff --git a/vllm/lora/layers.py b/vllm/lora/layers.py index 39b45027bd5..779f0264684 100644 --- a/vllm/lora/layers.py +++ b/vllm/lora/layers.py @@ -1164,10 +1164,6 @@ def _get_logits( posinf=pos_inf, neginf=neg_inf)) - # HPU needs special handling to prune out dummy samples. - if current_platform.is_hpu(): - lora_logits = lora_logits[:logits.shape[0], :] - logits[:, self.base_layer.org_vocab_size:self.base_layer.org_vocab_size + lora_logits.shape[1]] = lora_logits diff --git a/vllm/lora/punica_wrapper/punica_hpu.py b/vllm/lora/punica_wrapper/punica_hpu.py deleted file mode 100644 index b20c9785a74..00000000000 --- a/vllm/lora/punica_wrapper/punica_hpu.py +++ /dev/null @@ -1,145 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from typing import TYPE_CHECKING, Optional, Union, final - -import torch -from vllm_hpu_extension.ops import (dispatch_bgmv_embedding, - dispatch_bgmv_linear) - -from .punica_base import PunicaWrapperBase -from .utils import convert_mapping - -if TYPE_CHECKING: - # avoid circuit import - from vllm.lora.layers import LoRAMapping - from vllm.lora.models import LongContextLoRAContext - - -@final -class PunicaWrapperHPU(PunicaWrapperBase): - - def __init__(self, max_num_batched_tokens: int, max_batches: int, - device: Union[torch.device, str], **kwargs): - # Increasing max_num_batched_tokens by 3x to handle increase in - # tensor size due to padding. - PunicaWrapperBase.__init__(self, 3 * max_num_batched_tokens, - max_batches, device) - - def _update_base_metadata( - self, - mapping: "LoRAMapping", - lora_index_to_id: list[Optional[int]], - max_loras: int, - vocab_size: int, - extra_vocab_size: int, - long_lora_context: Optional["LongContextLoRAContext"] = None, - ): - ( - base_indices, - sampler_indices, - sampler_indices_padded, - embeddings_indices, - long_lora_offsets_tensor, - indices_len, - ) = convert_mapping(mapping, lora_index_to_id, max_loras, vocab_size, - extra_vocab_size, self.device, None) - # Updating each element in `long_lora_offsets` with `lora_offset` slows - # down perf in HPU due to a series of `strided_insert` ops during lazy - # graph accumulation. Hence HPU appends `lora_offset` to a list and - # converts it to a tensor only after it is ready. - if long_lora_context: - index_mapping_indices: list[int] = list( - mapping.index_mapping).copy() - long_lora_offsets: list[int] = [] - for i in range(len(index_mapping_indices)): - lora_offset: int = long_lora_context.offsets_by_lora_id.get( - index_mapping_indices[i], 0) - long_lora_offsets.append(lora_offset) - long_lora_offsets_tensor = torch.tensor(long_lora_offsets, - device=self.device, - dtype=torch.long) - indices_len[-1] = long_lora_offsets_tensor.shape[-1] - - self._token_lora_indices[:base_indices.shape[0]].copy_(base_indices) - self._sampler_indices[:sampler_indices.shape[0]].copy_(sampler_indices) - self._sampler_indices_padded[:sampler_indices_padded.shape[0]].copy_( - sampler_indices_padded) - self._embeddings_indices[:embeddings_indices. 
- shape[0], :embeddings_indices.shape[1]].copy_( - embeddings_indices) - if long_lora_offsets_tensor is not None: - self._long_lora_indices[:long_lora_offsets_tensor.shape[0]].copy_( - long_lora_offsets_tensor) - else: - self._long_lora_indices.zero_() - self.indices_len[:] = indices_len - - def add_lora_embedding(self, - y: torch.Tensor, - x: torch.Tensor, - lora_b_stacked: torch.Tensor, - add_inputs: bool = True, - **kwargs) -> None: - dispatch_bgmv_embedding(y, x, lora_b_stacked, 0) - - def add_lora_linear(self, - y: torch.Tensor, - x: torch.Tensor, - lora_a_stacked: tuple[torch.Tensor, ...], - lora_b_stacked: tuple[torch.Tensor, ...], - lora_bias_stacked: Optional[tuple[torch.Tensor, ...]], - scale: float, - output_slices: tuple[int, ...], - *, - buffer: Optional[tuple[torch.Tensor, ...]] = None, - **kwargs) -> None: - y_org = y - x = x.view(-1, x.shape[-1]) - y = y.view(-1, y.shape[-1]) - offset_left = 0 - - for slice_idx in range(len(output_slices)): - dispatch_bgmv_linear( - y[:, offset_left:offset_left + output_slices[slice_idx]], x, - lora_a_stacked[slice_idx], lora_b_stacked[slice_idx], 0, scale) - offset_left += output_slices[slice_idx] - y = y.view_as(y_org) - - def add_lora_logits(self, - y: torch.Tensor, - x: torch.Tensor, - lora_a_stacked: torch.Tensor, - lora_b_stacked: torch.Tensor, - scale, - *, - buffer: Optional[torch.Tensor] = None, - **kwargs) -> None: - y_org = y - y = y.view(-1, y.shape[-1]) - x = x.view(-1, x.shape[-1]) - dispatch_bgmv_linear(y, x, lora_a_stacked, lora_b_stacked, 0, scale) - y = y.view_as(y_org) - - def add_shrink( - self, - y: Union[tuple[torch.Tensor, ...], torch.Tensor], - x: torch.Tensor, - lora_a_stacked: tuple[torch.Tensor, ...], - scale: float, - **kwargs, - ) -> None: - raise NotImplementedError - - def add_expand( - self, - y: torch.Tensor, - x: Union[tuple[torch.Tensor, ...], torch.Tensor], - lora_b_stacked: tuple[torch.Tensor, ...], - lora_bias_stacked: Optional[tuple[torch.Tensor, ...]], - output_slices: tuple[int, ...], - offset_start: int = 0, - add_inputs=True, - **kwargs, - ) -> None: - raise NotImplementedError diff --git a/vllm/model_executor/custom_op.py b/vllm/model_executor/custom_op.py index 9c88721fb27..f6e79cd676f 100644 --- a/vllm/model_executor/custom_op.py +++ b/vllm/model_executor/custom_op.py @@ -73,11 +73,6 @@ def forward_tpu(self, *args, **kwargs): # NOTE(woosuk): This is a placeholder for future extensions. return self.forward_native(*args, **kwargs) - def forward_hpu(self, *args, **kwargs): - # By default, we assume that Gaudi ops are compatible with the - # PyTorch-native implementation. - return self.forward_native(*args, **kwargs) - def forward_neuron(self, *args, **kwargs): # By default, we assume that Neuron ops are compatible with the # PyTorch-native implementation. 
@@ -106,8 +101,6 @@ def dispatch_forward(self): return self.forward_hip elif current_platform.is_cpu(): return self.forward_cpu - elif current_platform.is_hpu(): - return self.forward_hpu elif current_platform.is_tpu(): return self.forward_tpu elif current_platform.is_xpu(): diff --git a/vllm/model_executor/layers/fused_moe/layer.py b/vllm/model_executor/layers/fused_moe/layer.py index da772c11155..b3cee55e8ba 100644 --- a/vllm/model_executor/layers/fused_moe/layer.py +++ b/vllm/model_executor/layers/fused_moe/layer.py @@ -475,39 +475,6 @@ def forward_cpu( activation, ) - def forward_hpu( - self, - layer: torch.nn.Module, - x: torch.Tensor, - use_grouped_topk: bool, - top_k: int, - router_logits: torch.Tensor, - renormalize: bool, - topk_group: Optional[int] = None, - num_expert_group: Optional[int] = None, - global_num_experts: int = -1, - expert_map: Optional[torch.Tensor] = None, - custom_routing_function: Optional[Callable] = None, - scoring_func: str = "softmax", - e_score_correction_bias: Optional[torch.Tensor] = None, - apply_router_weight_on_input: bool = False, - activation: str = "silu", - ) -> torch.Tensor: - assert not use_grouped_topk - assert num_expert_group is None - assert topk_group is None - assert custom_routing_function is None - assert layer is not None - assert apply_router_weight_on_input is False - if scoring_func != "softmax": - raise NotImplementedError( - "Only softmax scoring function is supported for HPU.") - if e_score_correction_bias is not None: - raise NotImplementedError( - "Expert score correction bias is not supported for HPU.") - return layer.hpu_fused_moe(x, layer.w13_weight, layer.w2_weight, - router_logits, top_k) - def forward_tpu( self, layer: torch.nn.Module, @@ -716,9 +683,6 @@ def __init__( if self.scoring_func != "softmax" and not self.use_grouped_topk: raise ValueError("Only softmax scoring function is supported for " "non-grouped topk.") - if current_platform.is_hpu(): - from vllm_hpu_extension.ops import DynamicFusedMOE - self.hpu_fused_moe = DynamicFusedMOE(self.global_num_experts) if vllm_config.model_config is not None: model_dtype = vllm_config.model_config.dtype diff --git a/vllm/model_executor/layers/layernorm.py b/vllm/model_executor/layers/layernorm.py index e8d1fd63550..a5fc1db2dc1 100644 --- a/vllm/model_executor/layers/layernorm.py +++ b/vllm/model_executor/layers/layernorm.py @@ -170,26 +170,6 @@ def forward_cuda( else: return norm_func(x, self.weight.data, self.variance_epsilon) - def forward_hpu( - self, - x: torch.Tensor, - residual: Optional[torch.Tensor] = None, - ) -> Union[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]: - from vllm_hpu_extension.kernels import rms_norm - HPUFusedRMSNorm = rms_norm() - if HPUFusedRMSNorm is None: - return self.forward_native(x, residual) - if residual is not None: - orig_shape = x.shape - residual += x.view(residual.shape) - # Note: HPUFusedRMSNorm requires 3D tensors as inputs - x = HPUFusedRMSNorm.apply(residual, self.weight, - self.variance_epsilon) - return x.view(orig_shape), residual - - x = HPUFusedRMSNorm.apply(x, self.weight, self.variance_epsilon) - return x - def forward_xpu( self, x: torch.Tensor, diff --git a/vllm/model_executor/layers/rotary_embedding.py b/vllm/model_executor/layers/rotary_embedding.py index a4615132a51..dddd4d6a711 100644 --- a/vllm/model_executor/layers/rotary_embedding.py +++ b/vllm/model_executor/layers/rotary_embedding.py @@ -229,64 +229,6 @@ def forward_xpu( self.cos_sin_cache, self.is_neox_style) return query, key - def forward_hpu( - self, - 
positions: torch.Tensor, - query: torch.Tensor, - key: Optional[torch.Tensor] = None, - offsets: Optional[torch.Tensor] = None, - ) -> tuple[torch.Tensor, Optional[torch.Tensor]]: - from habana_frameworks.torch.hpex.kernels import ( - RotaryPosEmbeddingMode, apply_rotary_pos_emb) - if offsets is not None: - offsets = offsets.view(positions.shape[0], -1) - positions = positions + offsets - positions = positions.flatten() - num_tokens = positions.shape[0] - cos_sin = self.cos_sin_cache.index_select(0, positions).view( - num_tokens, 1, -1) - cos, sin = cos_sin.chunk(2, dim=-1) - # HPU RoPE kernel requires hidden dimension for cos and sin to be equal - # to query hidden dimension, so the original tensors need to be - # expanded - # GPT-NeoX kernel requires position_ids = None, offset, mode = BLOCKWISE - # and expansion of cos/sin tensors via concatenation - # GPT-J kernel requires position_ids = None, offset = 0, mode = PAIRWISE - # and expansion of cos/sin tensors via repeat_interleave - rope_mode: RotaryPosEmbeddingMode - if self.is_neox_style: - rope_mode = RotaryPosEmbeddingMode.BLOCKWISE - cos = torch.cat((cos, cos), dim=-1) - sin = torch.cat((sin, sin), dim=-1) - else: - rope_mode = RotaryPosEmbeddingMode.PAIRWISE - sin = torch.repeat_interleave(sin, - 2, - dim=-1, - output_size=cos_sin.shape[-1]) - cos = torch.repeat_interleave(cos, - 2, - dim=-1, - output_size=cos_sin.shape[-1]) - - query_shape = query.shape - query = query.view(num_tokens, -1, self.head_size) - query_rot = query[..., :self.rotary_dim] - query_pass = query[..., self.rotary_dim:] - query_rot = apply_rotary_pos_emb(query_rot, cos, sin, None, 0, - rope_mode) - query = torch.cat((query_rot, query_pass), dim=-1).reshape(query_shape) - - if key is not None: - key_shape = key.shape - key = key.view(num_tokens, -1, self.head_size) - key_rot = key[..., :self.rotary_dim] - key_pass = key[..., self.rotary_dim:] - key_rot = apply_rotary_pos_emb(key_rot, cos, sin, None, 0, - rope_mode) - key = torch.cat((key_rot, key_pass), dim=-1).reshape(key_shape) - return query, key - def forward_neuron( self, positions: torch.Tensor, diff --git a/vllm/model_executor/layers/vocab_parallel_embedding.py b/vllm/model_executor/layers/vocab_parallel_embedding.py index f35f969781b..a5f262c832b 100644 --- a/vllm/model_executor/layers/vocab_parallel_embedding.py +++ b/vllm/model_executor/layers/vocab_parallel_embedding.py @@ -388,20 +388,8 @@ def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor): # Copy the data. Select chunk corresponding to current shard. loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size) - - if current_platform.is_hpu(): - # FIXME(kzawora): Weight copy with slicing bugs out on Gaudi here, - # so we're using a workaround. Remove this when fixed in - # HPU PT bridge. 
- padded_weight = torch.cat([ - loaded_weight, - torch.zeros(param.shape[0] - loaded_weight.shape[0], - *loaded_weight.shape[1:]) - ]) - param.data.copy_(padded_weight) - else: - param[:loaded_weight.shape[0]].data.copy_(loaded_weight) - param[loaded_weight.shape[0]:].data.fill_(0) + param[:loaded_weight.shape[0]].data.copy_(loaded_weight) + param[loaded_weight.shape[0]:].data.fill_(0) def forward(self, input_): if self.tp_size > 1: diff --git a/vllm/model_executor/model_loader/bitsandbytes_loader.py b/vllm/model_executor/model_loader/bitsandbytes_loader.py index 907bc3c1361..68fcb785691 100644 --- a/vllm/model_executor/model_loader/bitsandbytes_loader.py +++ b/vllm/model_executor/model_loader/bitsandbytes_loader.py @@ -199,10 +199,6 @@ def _get_quantized_weights_iterator( if self.pre_quant: if self.load_8bit: - if current_platform.is_hpu(): - raise ValueError( - "currently hpu supports 4bit quantization only") - return self._quantized_8bit_generator( hf_weights_files, use_safetensors, quant_state_dict), quant_state_dict @@ -306,10 +302,6 @@ def _parse_quant_state(param_name: str, in temp_state_dict): quant_state = _parse_quant_state(mapped_weight_name, temp_state_dict) - if current_platform.is_hpu(): - assert quant_state.quant_type == "nf4", ( - "currently hpu supports nf4 quant_type only") - quant_state_dict[mapped_weight_name] = quant_state yield org_weight_name, weight_tensor else: @@ -380,8 +372,7 @@ def _unquantized_generator(self, hf_weights_files, use_safetensors, ...] # bitsandbytes requires data in GPU - if (weight_sub_tensor.is_cuda - or weight_sub_tensor.device.type == "hpu"): + if weight_sub_tensor.is_cuda: loaded_weight = weight_sub_tensor else: loaded_weight = weight_sub_tensor.to( diff --git a/vllm/model_executor/model_loader/default_loader.py b/vllm/model_executor/model_loader/default_loader.py index 4624ff01ddc..2fcae7eb6e6 100644 --- a/vllm/model_executor/model_loader/default_loader.py +++ b/vllm/model_executor/model_loader/default_loader.py @@ -218,16 +218,6 @@ def _xla_weights_iterator(iterator: Generator): weights_iterator = _xla_weights_iterator(weights_iterator) - elif current_platform.is_hpu(): - import habana_frameworks.torch.core as htcore - - def _hpu_weights_iterator(iterator: Generator): - for weights in iterator: - yield weights - htcore.mark_step() - - weights_iterator = _hpu_weights_iterator(weights_iterator) - if self.counter_before_loading_weights == 0.0: self.counter_before_loading_weights = time.perf_counter() # Apply the prefix. 
diff --git a/vllm/platforms/__init__.py b/vllm/platforms/__init__.py index 7b8953fd75b..c13659f8a06 100644 --- a/vllm/platforms/__init__.py +++ b/vllm/platforms/__init__.py @@ -116,23 +116,6 @@ def rocm_platform_plugin() -> Optional[str]: return "vllm.platforms.rocm.RocmPlatform" if is_rocm else None -def hpu_platform_plugin() -> Optional[str]: - is_hpu = False - logger.debug("Checking if HPU platform is available.") - try: - from importlib import util - is_hpu = util.find_spec('habana_frameworks') is not None - if is_hpu: - logger.debug("Confirmed HPU platform is available.") - else: - logger.debug("HPU platform is not available because " - "habana_frameworks is not found.") - except Exception as e: - logger.debug("HPU platform is not available because: %s", str(e)) - - return "vllm.platforms.hpu.HpuPlatform" if is_hpu else None - - def xpu_platform_plugin() -> Optional[str]: is_xpu = False logger.debug("Checking if XPU platform is available.") @@ -208,7 +191,6 @@ def neuron_platform_plugin() -> Optional[str]: 'tpu': tpu_platform_plugin, 'cuda': cuda_platform_plugin, 'rocm': rocm_platform_plugin, - 'hpu': hpu_platform_plugin, 'xpu': xpu_platform_plugin, 'cpu': cpu_platform_plugin, 'neuron': neuron_platform_plugin, diff --git a/vllm/platforms/hpu.py b/vllm/platforms/hpu.py deleted file mode 100644 index 3faf481087e..00000000000 --- a/vllm/platforms/hpu.py +++ /dev/null @@ -1,114 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import os -from typing import TYPE_CHECKING, Optional - -import torch - -from vllm import envs -from vllm.logger import init_logger -from vllm.utils import DEFAULT_MAX_NUM_BATCHED_TOKENS - -from .interface import Platform, PlatformEnum, _Backend - -if TYPE_CHECKING: - from vllm.config import VllmConfig -else: - VllmConfig = None - -logger = init_logger(__name__) - - -class HpuPlatform(Platform): - _enum = PlatformEnum.HPU - device_name: str = "hpu" - device_type: str = "hpu" - dispatch_key: str = "HPU" - ray_device_key: str = "HPU" - dist_backend: str = "hccl" - device_control_env_var: str = "HABANA_VISIBLE_MODULES" - - @classmethod - def get_attn_backend_cls(cls, selected_backend: _Backend, head_size: int, - dtype: torch.dtype, kv_cache_dtype: Optional[str], - block_size: int, use_v1: bool, - use_mla: bool) -> str: - logger.info("Using HPUAttention backend.") - return "vllm.attention.backends.hpu_attn.HPUAttentionBackend" - - @classmethod - def is_async_output_supported(cls, enforce_eager: Optional[bool]) -> bool: - return True - - @classmethod - def inference_mode(cls): - return torch.no_grad() - - @classmethod - def set_device(cls, device: torch.device) -> None: - """ - Set the device for the current platform. 
- """ - torch.hpu.set_device(device) - - @classmethod - def check_and_update_config(cls, vllm_config: VllmConfig) -> None: - - scheduler_config = vllm_config.scheduler_config - parallel_config = vllm_config.parallel_config - if scheduler_config.is_multi_step: - parallel_config.worker_cls = \ - "vllm.worker.multi_step_hpu_worker.MultiStepHPUWorker" - - if vllm_config.speculative_config is not None: - raise NotImplementedError( - "Speculative decoding is not implemented for HPU") - - if parallel_config.worker_cls == "auto": - parallel_config.worker_cls = "vllm.worker.hpu_worker.HPUWorker" - - # NOTE(kzawora): default block size for Gaudi should be 128 - # smaller sizes still work, but very inefficiently - cache_config = vllm_config.cache_config - if cache_config and cache_config.block_size is None: - cache_config.block_size = 128 - if (parallel_config.distributed_executor_backend == 'mp' - and envs.VLLM_WORKER_MULTIPROC_METHOD == 'fork'): - if os.environ.get("VLLM_WORKER_MULTIPROC_METHOD", - None) is not None: - logger.warning("On HPU, VLLM_WORKER_MULTIPROC_METHOD=fork " - "might cause application hangs on exit. Using " - "VLLM_WORKER_MULTIPROC_METHOD=fork anyway, " - "as it was explicitly requested.") - else: - logger.warning( - "On HPU, VLLM_WORKER_MULTIPROC_METHOD=fork " - "might cause application hangs on exit. Setting " - "VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. " - "To override that behavior, please set " - "VLLM_WORKER_MULTIPROC_METHOD=fork explicitly.") - os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" - - if vllm_config.model_config and vllm_config.model_config.use_mla: - logger.info( - "MLA is enabled on a non-GPU platform; forcing chunked " - "prefill and prefix caching to be disabled.") - vllm_config.scheduler_config.enable_chunked_prefill = False - vllm_config.scheduler_config.chunked_prefill_enabled = False - vllm_config.scheduler_config.max_num_batched_tokens = max( - vllm_config.scheduler_config.max_model_len, - DEFAULT_MAX_NUM_BATCHED_TOKENS) - - @classmethod - def is_pin_memory_available(cls): - logger.warning("Pin memory is not supported on HPU.") - return False - - @classmethod - def get_punica_wrapper(cls) -> str: - return "vllm.lora.punica_wrapper.punica_hpu.PunicaWrapperHPU" - - @classmethod - def get_device_communicator_cls(cls) -> str: - return "vllm.distributed.device_communicators.hpu_communicator.HpuCommunicator" # noqa diff --git a/vllm/platforms/interface.py b/vllm/platforms/interface.py index ae675bcc8d2..b8e788de11c 100644 --- a/vllm/platforms/interface.py +++ b/vllm/platforms/interface.py @@ -54,7 +54,6 @@ class _Backend(enum.Enum): FLASHMLA_VLLM_V1 = enum.auto() FLASHMLA = enum.auto() # Supported by V1 CUTLASS_MLA_VLLM_V1 = enum.auto() - HPU_ATTN = enum.auto() PALLAS = enum.auto() PALLAS_VLLM_V1 = enum.auto() IPEX = enum.auto() @@ -69,7 +68,6 @@ class PlatformEnum(enum.Enum): CUDA = enum.auto() ROCM = enum.auto() TPU = enum.auto() - HPU = enum.auto() XPU = enum.auto() CPU = enum.auto() NEURON = enum.auto() @@ -154,9 +152,6 @@ def is_rocm(self) -> bool: def is_tpu(self) -> bool: return self._enum == PlatformEnum.TPU - def is_hpu(self) -> bool: - return self._enum == PlatformEnum.HPU - def is_xpu(self) -> bool: return self._enum == PlatformEnum.XPU diff --git a/vllm/plugins/__init__.py b/vllm/plugins/__init__.py index 2cb177b9ba7..51c78ddc1a9 100644 --- a/vllm/plugins/__init__.py +++ b/vllm/plugins/__init__.py @@ -2,7 +2,6 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import logging -import os from typing import Any, Callable 
import torch @@ -75,18 +74,6 @@ def load_general_plugins(): if current_platform.is_xpu(): # see https://github.com/pytorch/pytorch/blob/43c5f59/torch/_dynamo/config.py#L158 torch._dynamo.config.disable = True - elif current_platform.is_hpu(): - # NOTE(kzawora): PT HPU lazy backend (PT_HPU_LAZY_MODE = 1) - # does not support torch.compile - # Eager backend (PT_HPU_LAZY_MODE = 0) must be selected for - # torch.compile support - is_lazy = os.environ.get('PT_HPU_LAZY_MODE', '1') == '1' - if is_lazy: - torch._dynamo.config.disable = True - # NOTE(kzawora) multi-HPU inference with HPUGraphs (lazy-only) - # requires enabling lazy collectives - # see https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html # noqa: E501 - os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES'] = 'true' plugins = load_plugins_by_group(group=DEFAULT_PLUGINS_GROUP) # general plugins, we only need to execute the loaded functions diff --git a/vllm/worker/hpu_model_runner.py b/vllm/worker/hpu_model_runner.py deleted file mode 100644 index 58603682988..00000000000 --- a/vllm/worker/hpu_model_runner.py +++ /dev/null @@ -1,2320 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -############################################################################### -# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company -############################################################################### - -import collections -import contextlib -import dataclasses -import functools -import gc -import itertools -import math -import os -import time -from array import array -from enum import Enum, IntEnum -from typing import (TYPE_CHECKING, Any, Callable, Dict, List, NamedTuple, - Optional, Set, Tuple, Type, TypeVar, Union) - -import habana_frameworks.torch as htorch -import habana_frameworks.torch.internal.bridge_config as bc -import torch -import torch.nn as nn -import vllm_hpu_extension.environment as environment -from vllm_hpu_extension.bucketing.common import get_bucketing_context -from vllm_hpu_extension.ops import LoraMask as LoraMask -from vllm_hpu_extension.profiler import (HabanaHighLevelProfiler, - HabanaMemoryProfiler, format_bytes) - -import vllm.envs as envs -from vllm.attention import AttentionMetadata, get_attn_backend -from vllm.config import DeviceConfig, VllmConfig -from vllm.distributed import broadcast_tensor_dict -from vllm.distributed.parallel_state import get_world_group -from vllm.forward_context import set_forward_context -from vllm.logger import init_logger -from vllm.lora.layers import LoRAMapping -from vllm.lora.request import LoRARequest -from vllm.lora.worker_manager import LRUCacheWorkerLoRAManager -from vllm.model_executor import SamplingMetadata -from vllm.model_executor.layers.layernorm import RMSNorm -from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler -from vllm.model_executor.layers.vocab_parallel_embedding import ( - VocabParallelEmbedding) -from vllm.model_executor.model_loader import get_model -from vllm.model_executor.sampling_metadata import SequenceGroupToSample -from vllm.multimodal import BatchedTensorInputs, MultiModalKwargs -from vllm.sampling_params import SamplingParams -from vllm.sequence import (CompletionSequenceGroupOutput, IntermediateTensors, - Logprob, SequenceData, SequenceGroupMetadata, - SequenceOutput) -from vllm.utils import (bind_kv_cache, is_pin_memory_available, - make_tensor_with_pad) -from vllm.worker.model_runner_base import ( - ModelRunnerBase, 
ModelRunnerInputBase, - _add_attn_metadata_broadcastable_dict, - _add_sampling_metadata_broadcastable_dict, - _init_attn_metadata_from_tensor_dict, - _init_sampling_metadata_from_tensor_dict) - -if TYPE_CHECKING: - from vllm.attention.backends.abstract import AttentionBackend - -logger = init_logger(__name__) - -_TYPE_CACHE = {} -# These values are assumed to be zero in several places. -# Use caution when updating them! -_PAD_SLOT_ID = 0 -_PAD_BLOCK_ID = 0 - -LORA_WARMUP_RANK = 8 - -DUMMY_TOKEN_ID = -1 - - -class PhaseType(Enum): - PREFILL = 'prefill' - PREFIX_PREFILL = 'prefix_prefill' - DECODE = 'decode' - - -def subtuple(obj: object, - typename: str, - to_copy: List[str], - to_override: Optional[Dict[str, object]] = None): - if obj is None: - return None - if to_override is None: - to_override = {} - fields = set(to_copy) | set(to_override.keys()) - if type(obj) is dict: - values = {key: obj[key] for key in fields if key in obj} - else: - values = {f: to_override.get(f, getattr(obj, f)) for f in fields} - if typename not in _TYPE_CACHE: - _TYPE_CACHE[typename] = collections.namedtuple(typename, - ' '.join(fields)) - return _TYPE_CACHE[typename](**values) - - -def round_up(value: int, k: int): - return (value + k - 1) // k * k - - -def align_workers(value, op): - group = get_world_group().cpu_group - world_size = torch.distributed.get_world_size() - if world_size <= 1: - return value - value_t = torch.tensor(value, device='cpu') - torch.distributed.all_reduce(value_t, op=op, group=group) - return value_t.item() - - -def setup_profiler(): - schedule = torch.profiler.schedule(wait=0, warmup=2, active=1, repeat=1) - DEVICE = 'hpu' - activities = [torch.profiler.ProfilerActivity.CPU] - activities.extend([torch.profiler.ProfilerActivity.HPU] if DEVICE == - 'hpu' else []) - #from habana_frameworks.torch.activity_profiler import DebugActivity - #debug_activities=[DebugActivity.BRIDGE_FUNCTION_CALLS] - - profiler = torch.profiler.profile( - schedule=schedule, - activities=activities, - #debug_activities=debug_activities, - on_trace_ready=torch.profiler.tensorboard_trace_handler('.', - use_gzip=True), - record_shapes=False, - with_stack=True) - return profiler - - -def pad_list(input, k, v): - input_len = len(input) - target_len = round_up(input_len, k) - padding = target_len - input_len - return input + [v] * padding - - -def gather_list(input, indices, v): - return [input[i] if i is not None else v for i in indices] - - -def flatten(in_list): - return list(itertools.chain(*in_list)) - - -def precompute_indices_and_offsets(block_size, slot_mapping, is_prompt): - slot_mapping = slot_mapping.flatten() - indices = torch.div(slot_mapping, block_size, rounding_mode="floor") - if is_prompt: - indices = indices.unflatten(0, (-1, block_size))[:, 0] - offsets = None - else: - offsets = torch.fmod(slot_mapping, block_size) - return indices, offsets - - -def modify_decoder_layer(module: torch.nn.Module, suffix="DecoderLayer"): - if module.__class__.__name__.endswith(suffix): - - def forward_hook(module, args, output): - htorch.core.mark_step() - return output - - module.register_forward_hook(forward_hook) - - for child_name, child_module in module.named_children(): - modify_decoder_layer(child_module) - - -class HpuModelAdapter: - - def __init__(self, model, vllm_config): - self.model = model - self.sampler = get_sampler() - self.prefill_use_fusedsdpa = os.getenv('VLLM_PROMPT_USE_FUSEDSDPA', - '0').lower() in ['1', 'true'] - self.vllm_config = vllm_config - self.block_size = 
vllm_config.cache_config.block_size - self.dtype = vllm_config.model_config.dtype - enforce_eager = vllm_config.model_config.enforce_eager - - if not htorch.utils.internal.is_lazy() and not enforce_eager: - if os.getenv('VLLM_REGIONAL_COMPILATION', - 'true').lower() == 'true': - self.regional_compilation_layers_list = [ - RMSNorm, VocabParallelEmbedding - ] - self._regional_compilation(self.model) - else: - self.model = torch.compile(self.model, - backend='hpu_backend', - dynamic=False) - - def _regional_compilation(self, - module, - parent_module=None, - module_name=None): - if isinstance(module, torch.nn.ModuleList): - for children_name, children_module in module.named_children(): - self._compile_region(module, children_name, children_module) - elif any( - isinstance(module, layer) - for layer in self.regional_compilation_layers_list): - self._compile_region(parent_module, module_name, module) - else: - for children_name, children_module in module.named_children(): - self._regional_compilation(children_module, module, - children_name) - - def _compile_region(self, model, name, module): - module = torch.compile(module, backend='hpu_backend', dynamic=False) - setattr(model, name, module) - - def _set_attn_bias(self, attn_metadata, batch_size, seq_len, device, - dtype): - if (attn_metadata is None - or (self.prefill_use_fusedsdpa \ - and attn_metadata.block_list is None) - or not attn_metadata.is_prompt): - return attn_metadata - - prefill_metadata = attn_metadata - - seq_lens_t = prefill_metadata.seq_lens_tensor - context_lens_t = prefill_metadata.context_lens_tensor - query_lens_t = seq_lens_t - context_lens_t - - block_list = attn_metadata.block_list - max_context_len = (block_list.size(-1) // - batch_size if block_list is not None else 0) - max_context_len = max_context_len * self.block_size - past_mask = torch.arange(0, - max_context_len, - dtype=torch.int32, - device=device) - past_mask = (past_mask.view(1, -1).expand(batch_size, -1).ge( - context_lens_t.view(-1, 1)).view(batch_size, 1, -1).expand( - batch_size, seq_len, -1).view(batch_size, 1, seq_len, -1)) - - len_mask = (torch.arange(0, seq_len, device=device, - dtype=torch.int32).view(1, seq_len).ge( - query_lens_t.unsqueeze(-1)).view( - batch_size, 1, 1, seq_len)) - causal_mask = torch.triu(torch.ones((batch_size, 1, seq_len, seq_len), - device=device, - dtype=torch.bool), - diagonal=1) - mask = causal_mask.logical_or(len_mask) - mask = torch.concat((past_mask, mask), dim=-1) - attn_bias = (torch.zeros_like(mask, dtype=dtype).masked_fill_( - mask, -math.inf)) - attn_metadata = prefill_metadata._replace(attn_bias=attn_bias) - return attn_metadata - - def _set_block_mapping(self, metadata, batch_size, device, dtype): - mask = torch.arange(0, - self.block_size, - device=device, - dtype=torch.int32).unsqueeze(0) - mask = mask >= metadata.block_usage.unsqueeze(-1) - attn_bias = (torch.zeros_like(mask, dtype=dtype).masked_fill_( - mask, -math.inf)) - if os.environ.get('VLLM_USE_FAKE_HPU', - '0') == '0' and htorch.utils.internal.is_lazy(): - block_mapping = torch.nn.functional.one_hot(metadata.block_groups, - num_classes=batch_size) - else: - # Unfortunately one_hot on CPU/torch.compile mode/eager mode - # doesn't handle out of bounds classes so we need to convert - # all negative values to 0 (block_mapping) or bs (block_groups) - block_groups = metadata.block_groups.to(torch.long) - block_mapping = torch.nn.functional.relu(block_groups) - block_mapping = torch.nn.functional.one_hot(block_mapping, - num_classes=batch_size) - oob_values 
= block_groups.lt(0) - block_mapping.masked_fill_(oob_values.unsqueeze(-1), 0) - block_groups.masked_fill_(oob_values, batch_size) - metadata = metadata._replace(block_groups=block_groups) - block_mapping = block_mapping.to(dtype) - metadata = metadata._replace(block_mapping=block_mapping, - attn_bias=attn_bias) - return metadata - - def _update_metadata(self, attn_metadata, batch_size, seq_len, device, - dtype): - if attn_metadata.is_prompt: - meta = attn_metadata - attn_metadata = self._set_attn_bias(meta, batch_size, seq_len, - device, dtype) - else: - meta = attn_metadata - attn_metadata = self._set_block_mapping(meta, batch_size, device, - dtype) - return attn_metadata - - def forward(self, *args, **kwargs): - kwargs = kwargs.copy() - selected_token_indices = kwargs.pop('selected_token_indices') - if 'warmup_mode' in kwargs: - kwargs.pop('warmup_mode') - virtual_engine = 0 - if 'virtual_engine' in kwargs: - virtual_engine = kwargs.pop('virtual_engine') - input_ids = kwargs['input_ids'] - attn_metadata = self._update_metadata(kwargs.pop('attn_metadata'), - input_ids.size(0), - input_ids.size(1), - input_ids.device, self.dtype) - LoraMask.setLoraMask(kwargs.pop('lora_mask')) - with set_forward_context(attn_metadata, self.vllm_config, - virtual_engine): - hidden_states = self.model(*args, **kwargs) - hidden_states = hidden_states.view(-1, hidden_states.shape[-1]) - hidden_states = hidden_states.index_select(0, - selected_token_indices) - return hidden_states - - def compute_logits(self, *args, **kwargs): - return self.model.compute_logits(*args, **kwargs) - - def sample(self, *args, **kwargs): - return self.sampler(*args, **kwargs) - - -class PreparePromptMetadata(NamedTuple): - input_tokens: torch.Tensor - input_positions: List[List[int]] - attn_metadata: Optional[AttentionMetadata] - seq_lens: List[int] - query_lens: List[int] - lora_index_mapping: List[List[int]] - lora_prompt_mapping: List[List[int]] - lora_requests: Set[LoRARequest] - multi_modal_kwargs: Optional[Dict[str, BatchedTensorInputs]] - slot_mapping: List[List[int]] - lora_ids: List[int] - - @classmethod - def empty(cls): - return PreparePromptMetadata(input_tokens=[], - input_positions=[], - attn_metadata=None, - seq_lens=[], - query_lens=[], - lora_index_mapping=[], - lora_prompt_mapping=[], - lora_requests=set(), - multi_modal_kwargs=None, - slot_mapping=[], - lora_ids=[]) - - -class PrepareDecodeMetadata(NamedTuple): - input_tokens: torch.Tensor - input_positions: List[List[int]] - attn_metadata: Optional[AttentionMetadata] - lora_index_mapping: List[List[int]] - lora_prompt_mapping: List[List[int]] - lora_requests: Set[LoRARequest] - slot_mapping: List[List[int]] - lora_ids: List[int] - - @classmethod - def empty(cls): - return PrepareDecodeMetadata(input_tokens=[], - input_positions=[], - attn_metadata=None, - lora_index_mapping=[], - lora_prompt_mapping=[], - lora_requests=set(), - slot_mapping=[], - lora_ids=[]) - - -# How batches are constructed. -class BatchType(IntEnum): - # Every batch is prefill. - PREFILL = 0 - # Every batch is decode. - DECODE = 1 - # Batch is a mixture of prefill and decode. - MIXED = 2 - - -TModelInputForHPU = TypeVar('TModelInputForHPU', bound="ModelInputForHPU") - - -@dataclasses.dataclass(frozen=True) -class ModelInputForHPU(ModelRunnerInputBase): - """ - This base class contains metadata needed for the base model forward pass - but not metadata for possible additional steps, e.g., sampling. 
Model - runners that run additional steps should subclass this method to add - additional fields. - """ - input_tokens: Optional[torch.Tensor] = None - input_positions: Optional[torch.Tensor] = None - seq_lens: Optional[List[int]] = None - query_lens: Optional[List[int]] = None - lora_mapping: Optional["LoRAMapping"] = None - lora_requests: Optional[Set[LoRARequest]] = None - attn_metadata: Optional["AttentionMetadata"] = None - multi_modal_kwargs: Optional[Dict[str, torch.Tensor]] = None - real_batch_size: Optional[int] = None - batch_size_padded: Optional[int] = None - virtual_engine: int = 0 - lora_ids: Optional[List[int]] = None - async_callback: Optional[Callable] = None - is_first_multi_step: bool = True - is_last_step: bool = True - - def as_broadcastable_tensor_dict(self) -> Dict[str, Any]: - tensor_dict = { - "input_tokens": self.input_tokens, - "input_positions": self.input_positions, - "lora_requests": self.lora_requests, - "lora_mapping": self.lora_mapping, - "multi_modal_kwargs": self.multi_modal_kwargs, - "real_batch_size": self.real_batch_size, - "batch_size_padded": self.batch_size_padded, - "virtual_engine": self.virtual_engine, - "lora_ids": self.lora_ids, - "is_first_multi_step": self.is_first_multi_step, - "is_last_step": self.is_last_step, - } - _add_attn_metadata_broadcastable_dict(tensor_dict, self.attn_metadata) - return tensor_dict - - @classmethod - def from_broadcasted_tensor_dict( - cls: Type[TModelInputForHPU], - tensor_dict: Dict[str, Any], - attn_backend: Optional["AttentionBackend"] = None, - ) -> TModelInputForHPU: - if attn_backend is not None: - tensor_dict = _init_attn_metadata_from_tensor_dict( - attn_backend, tensor_dict) - return cls(**tensor_dict) - - -@dataclasses.dataclass(frozen=True) -class ModelInputForHPUWithSamplingMetadata(ModelInputForHPU): - """ - Used by the ModelRunner. - """ - sampling_metadata: Optional["SamplingMetadata"] = None - # Used for speculative decoding. We do not broadcast it because it is only - # used by the driver worker. - is_prompt: Optional[bool] = None - - def as_broadcastable_tensor_dict(self) -> Dict[str, Any]: - tensor_dict = { - "input_tokens": self.input_tokens, - "input_positions": self.input_positions, - "lora_requests": self.lora_requests, - "lora_mapping": self.lora_mapping, - "multi_modal_kwargs": self.multi_modal_kwargs, - "lora_ids": self.lora_ids, - } - _add_attn_metadata_broadcastable_dict(tensor_dict, self.attn_metadata) - _add_sampling_metadata_broadcastable_dict(tensor_dict, - self.sampling_metadata) - return tensor_dict - - @classmethod - def from_broadcasted_tensor_dict( - cls, - tensor_dict: Dict[str, Any], - attn_backend: Optional["AttentionBackend"] = None, - ) -> "ModelInputForHPUWithSamplingMetadata": - tensor_dict = _init_sampling_metadata_from_tensor_dict(tensor_dict) - # FIXME(kzawora): this fails for whatever reason - why? - if attn_backend is not None: - tensor_dict = _init_attn_metadata_from_tensor_dict( - attn_backend, tensor_dict) - return cls(**tensor_dict) - - -class HPUModelRunnerBase(ModelRunnerBase[TModelInputForHPU]): - """ - Helper class for shared methods between GPU model runners. 
- """ - _model_input_cls: Type[TModelInputForHPU] - - def __init__( - self, - vllm_config: VllmConfig, - is_driver_worker: bool = False, - return_hidden_states: bool = False, - ): - ModelRunnerBase.__init__(self, vllm_config=vllm_config) - environment.set_model_config(self.model_config) - self.is_driver_worker = is_driver_worker - self.return_hidden_states = return_hidden_states - - self.sliding_window = (self.model_config.get_sliding_window() - if self.model_config is not None else None) - self.device_config = (self.device_config if self.device_config - is not None else DeviceConfig()) - self.device = self.device_config.device - self.enforce_eager = self.model_config.enforce_eager - self.max_num_seqs = self.scheduler_config.max_num_seqs - # NOTE(kzawora): Change that to scheduler_config.max_num_prefill_seqs - # once padding-aware scheduling gets merged - self.max_num_prefill_seqs = 64 - self.max_model_len = self.scheduler_config.max_model_len - self.max_num_batched_tokens = \ - self.scheduler_config.max_num_batched_tokens - self.block_size = self.cache_config.block_size - - self.pin_memory = is_pin_memory_available() - self.kv_cache_dtype = self.cache_config.cache_dtype - - self.attn_backend = get_attn_backend( - self.model_config.get_head_size(), - self.model_config.dtype, - self.kv_cache_dtype, - self.block_size, - self.model_config.is_attention_free, - ) - - # Lazy initialization - self.lora_manager: LRUCacheWorkerLoRAManager = None - self.model: torch.nn.Module = None - self.inc_initialized_successfully = False - - # Profiler stats - self.profiler = HabanaHighLevelProfiler() - self.profiler_counter_helper = HabanaProfilerCounterHelper() - self.seen_configs: set = set() - self._mem_margin: Optional[int] = None - HPUBucketingContext = get_bucketing_context() - self.bucketing_ctx = HPUBucketingContext(self.max_num_seqs, - self.max_num_prefill_seqs, - self.block_size, - self.max_num_batched_tokens, - False, self.max_model_len) - self.graphed_buckets: Set[Any] = set() - self._set_gc_threshold() - if self.vllm_config.cache_config.enable_prefix_caching: - os.environ.setdefault("VLLM_CONTIGUOUS_PA", "False") - assert os.environ.get( - "VLLM_CONTIGUOUS_PA", - "").lower() != "true", "Contiguous PA doesn't support APC" - self.use_contiguous_pa = envs.VLLM_USE_HPU_CONTIGUOUS_CACHE_FETCH - - # For multi-step scheduling - self.cached_step_outputs: List[torch.Tensor] = [] - # For delayed sampling - self.cached_step_inputs: List[ - ModelInputForHPUWithSamplingMetadata] = [] - - def _set_gc_threshold(self) -> None: - # Read https://docs.python.org/3/library/gc.html#gc.set_threshold - # for comprehensive description of gc generations. - # We can either use VLLM_GC_THR_GEN[0-2] (this has higher priority) - # to set particular generation threshold or use simpler - # VLLM_GC_THR_MULTIPLIER to multiply default values. 
-        default_gc_thrs = list(gc.get_threshold())
-        requested_gc_thrs = [0] * len(default_gc_thrs)
-        for i in range(len(default_gc_thrs)):
-            requested_gc_thrs[i] = int(
-                os.environ.get(f'VLLM_GC_THR_GEN{i}', default_gc_thrs[i]))
-        if requested_gc_thrs == default_gc_thrs:
-            gc_thr_multiplier = int(os.environ.get('VLLM_GC_THR_MULTIPLIER',
-                                                   2))
-            requested_gc_thrs = [
-                t * gc_thr_multiplier for t in default_gc_thrs
-            ]
-        gc.set_threshold(*requested_gc_thrs)
-
-        self.skip_warmup = os.environ.get('VLLM_SKIP_WARMUP',
-                                          'false').lower() == 'true'
-
-    def load_model(self) -> None:
-        import habana_frameworks.torch.core as htcore
-        if self.model_config.quantization == 'inc' or \
-           self.model_config.quantization == 'fp8':
-            htcore.hpu_set_env()
-        with HabanaMemoryProfiler() as m:
-            with HabanaMemoryProfiler() as m_getmodel:
-                self.model = get_model(vllm_config=self.vllm_config)
-            msg = ("Pre-loading model weights on "
-                   f"{next(self.model.parameters()).device} "
-                   f"took {m_getmodel.get_summary_string()}")
-            logger.info(msg)
-
-            if self.lora_config:
-                assert hasattr(self.model, "embedding_modules"
-                               ), "Model does not have embedding_modules"
-                assert hasattr(
-                    self.model, "embedding_padding_modules"
-                ), "Model does not have embedding_padding_modules"
-                assert not self.lora_config.bias_enabled, \
-                    "Bias support in LoRA is not enabled in HPU yet."
-                assert not self.lora_config.fully_sharded_loras, \
-                    "Fully sharded LoRAs is not enabled in HPU yet."
-
-                # Use get_text_config() in case of multimodal models
-                text_config = self.model_config.hf_config.get_text_config()
-
-                self.lora_manager = LRUCacheWorkerLoRAManager(
-                    self.scheduler_config.max_num_seqs,
-                    self.scheduler_config.max_num_batched_tokens,
-                    self.vocab_size,
-                    self.lora_config,
-                    self.device,
-                    self.model.embedding_modules,
-                    self.model.embedding_padding_modules,
-                    max_position_embeddings=text_config.
- max_position_embeddings, - ) - self.model = self.lora_manager.create_lora_manager(self.model) - - if self.model_config.quantization == 'inc': - logger.info("Preparing model with INC..") - with HabanaMemoryProfiler() as m_inc: - from neural_compressor.torch.quantization import ( - FP8Config, convert, prepare) - config = FP8Config.from_json_file( - os.getenv("QUANT_CONFIG", "")) - if config.measure: - self.model = prepare(self.model, config) - elif config.quantize: - self.model = convert(self.model, config) - htcore.hpu_initialize(self.model, - mark_only_scales_as_const=True) - self.inc_initialized_successfully = True - logger.info("Preparing model with INC took %s", - m_inc.get_summary_string()) - else: - self.model = self.model.to("hpu") - htcore.mark_step() - modify_decoder_layer(self.model) - torch.hpu.synchronize() - - with HabanaMemoryProfiler() as m_wrap: - self.model = _maybe_wrap_in_hpu_graph( - self.model, vllm_config=self.vllm_config) - msg = f"Wrapping in HPU Graph took {m_wrap.get_summary_string()}" - logger.info(msg) - - self.model_memory_usage = m.consumed_device_memory - msg = f"Loading model weights took in total {m.get_summary_string()}" - logger.info(msg) - - def _add_dummy_seq(self, seq_group_metadata_list, is_prompt): - real_batch_size = len(seq_group_metadata_list) - batch_size_padded = self.bucketing_ctx.get_padded_batch_size( - real_batch_size, is_prompt) - batch_size_padding = batch_size_padded - real_batch_size - - seq_group_metadata_list = seq_group_metadata_list.copy() - - if batch_size_padding > 0: - dummy_seq_group_metadata = self.create_dummy_seq_group_metadata( - 0, 0, is_prompt) - seq_group_metadata_list.extend(dummy_seq_group_metadata - for _ in range(batch_size_padding)) - return seq_group_metadata_list, real_batch_size, batch_size_padded - - def _maybe_wrap_in_hpu_graph(self, *args, **kwargs): - return htorch.hpu.wrap_in_hpu_graph( - HpuModelAdapter(*args, **kwargs), disable_tensor_cache=True - ) if htorch.utils.internal.is_lazy() else HpuModelAdapter( - *args, **kwargs) - - def get_model(self) -> nn.Module: - return self.model - - def _use_graphs(self, batch_size, seq_len, is_prompt): - if self.enforce_eager: - return False - if self.skip_warmup: - return True - return (batch_size, seq_len, is_prompt) in self.graphed_buckets - - def _is_valid_bucket(self, bucket): - return bucket[0] * bucket[1] <= self.max_num_batched_tokens - - def _prepare_prompt( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - ) -> PreparePromptMetadata: - input_tokens: List[List[int]] = [] - input_positions: List[List[int]] = [] - slot_mapping: List[List[int]] = [] - lora_index_mapping: List[List[int]] = [] - lora_prompt_mapping: List[List[int]] = [] - lora_requests: Set[LoRARequest] = set() - - seq_lens: List[int] = [] - context_lens: List[int] = [] - query_lens: List[int] = [] - prefix_block_tables: List[List[int]] = [] - multi_modal_kwargs_list: List[MultiModalKwargs] = [] - - if len(seq_group_metadata_list) == 0: - return PreparePromptMetadata.empty() - - for seq_group_metadata in seq_group_metadata_list: - assert seq_group_metadata.is_prompt - seq_ids = list(seq_group_metadata.seq_data.keys()) - assert len(seq_ids) == 1 - seq_id = seq_ids[0] - - computed_block_nums = seq_group_metadata.computed_block_nums - if (self.scheduler_config is not None - and self.scheduler_config.chunked_prefill_enabled - and not (computed_block_nums is None - or computed_block_nums == [])): - raise RuntimeError( - "chunked prefill cannot be used with prefix caching " - "now.") - - 
token_chunk_size = seq_group_metadata.token_chunk_size - seq_data = seq_group_metadata.seq_data[seq_id] - context_len = seq_data.get_num_computed_tokens() - # We should use get_len here because in case of preemption - # it contains output tokens. - seq_len = min(seq_data.get_len(), context_len + token_chunk_size) - prompt_tokens = seq_data.get_token_ids()[context_len:seq_len] - seq_lens.append(seq_len) - - # NOTE: This only works for oooooooxxx style attention. - if computed_block_nums is not None and len( - computed_block_nums) > 0 and self.sliding_window is None: - # Prefix is not supported with sliding_window - context_len = len(computed_block_nums) * self.block_size - if context_len == seq_len \ - and self.vllm_config.cache_config.enable_prefix_caching: - # Fully cached prompt - compute only last token - context_len = context_len - 1 - prompt_tokens = prompt_tokens[context_len:] - prefix_block_tables.append(computed_block_nums) - elif self.scheduler_config.chunked_prefill_enabled: - if seq_group_metadata.block_tables is not None: - # Prefill has chunked before. - block_table = seq_group_metadata.block_tables[seq_id] - prefix_block_tables.append(block_table) - else: - # The first prefill. - prefix_block_tables.append([]) - else: - prefix_block_tables.append([]) - # Right now, prefill start is always 0. However, this - # assumption can be changed once chunked prefill is introduced. - assert context_len == 0 - - # actual prompt lens - context_lens.append(context_len) - query_lens.append(seq_len - context_len) - input_tokens.append(prompt_tokens) - # NOTE(woosuk): Here we assume that the first token in the prompt - # is always the first token in the sequence. - input_positions.append(list(range(context_len, seq_len))) - - mm_kwargs = seq_group_metadata.multi_modal_data - if mm_kwargs: - multi_modal_kwargs_list.append(mm_kwargs) - - if seq_group_metadata.block_tables is None: - # During memory profiling, the block tables are not initialized - # yet. In this case, we just use a dummy slot mapping. - slot_mapping.append([_PAD_SLOT_ID] * seq_len) - continue - - # Compute the slot mapping. - slot_mapping.append([]) - block_table = seq_group_metadata.block_tables[seq_id] - - # Mask the [0, start_idx) tokens of the prompt with _PAD_SLOT_ID, - # where start_idx is max(0, seq_len - sliding_window). - # For example, if the prompt len is 10, sliding window is 8, and - # block size is 4, the first two tokens are masked and the slot - # mapping will be [-1, -1, 2, 3, 4, 5, 6, 7, 0, 1]. 
- start_idx = 0 - if self.sliding_window is not None: - assert context_len == 0, ( - "Prefix caching is currently not supported with " - "sliding window attention") - start_idx = max(0, seq_len - self.sliding_window) - for i in range(context_len, seq_len): - if i < start_idx: - slot_mapping[-1].append(_PAD_SLOT_ID) - continue - - block_number = block_table[i // self.block_size] - block_offset = i % self.block_size - slot = block_number * self.block_size + block_offset - slot_mapping[-1].append(slot) - - max_query_len = max(query_lens) - sum_query_len = sum(query_lens) - real_num_seqs = len(query_lens) - assert max_query_len > 0 - - max_prompt_len = max( - self.bucketing_ctx.get_padded_prompt_seq_len(max_query_len), - self.block_size) - - lora_ids: List[int] = [] - for seq_group_metadata, context_len in zip(seq_group_metadata_list, - context_lens): - lora_id = seq_group_metadata.lora_int_id - lora_ids.append(lora_id) - - if lora_id > 0: - lora_requests.add(seq_group_metadata.lora_request) - - lora_index_mapping += [lora_id] * max_prompt_len - lora_prompt_mapping.extend( - [lora_id] * - (max_prompt_len - if seq_group_metadata.sampling_params.prompt_logprobs else 1)) - - if any(context_lens): - assert not self.scheduler_config.chunked_prefill_enabled - # prefix caching - - max_num_block = max(len(bt) for bt in prefix_block_tables) - prefix_block_list = list( - itertools.chain.from_iterable( - bt if len(bt) == max_num_block else bt + - ([_PAD_BLOCK_ID] * (max_num_block - len(bt))) - for bt in prefix_block_tables)) - - pad_len = len(prefix_block_list) - prefix_block_list = pad_list(prefix_block_list, pad_len, - _PAD_BLOCK_ID) - - prefix_block_list_tensor = torch.tensor(prefix_block_list, - dtype=torch.long, - device=self.device) - else: - prefix_block_list_tensor = None - - input_tokens = make_tensor_with_pad(input_tokens, - max_len=max_prompt_len, - pad=0, - dtype=torch.long, - device=self.device) - - input_positions = make_tensor_with_pad(input_positions, - max_len=max_prompt_len, - pad=0, - dtype=torch.long, - device=self.device) - - slot_mapping = make_tensor_with_pad(slot_mapping, - max_len=max_prompt_len, - pad=_PAD_SLOT_ID, - dtype=torch.long, - device=self.device) - - seq_lens_tensor = torch.tensor(seq_lens, - dtype=torch.long, - device=self.device) - - context_lens_tensor = torch.tensor(context_lens, - dtype=torch.long, - device=self.device) - - block_indices, block_offsets = precompute_indices_and_offsets( - self.block_size, slot_mapping, True) - attn_metadata = self.attn_backend.make_metadata( - is_prompt=True, - block_list=prefix_block_list_tensor, - block_mapping=None, - block_usage=None, - block_indices=block_indices, - block_offsets=block_offsets, - block_groups=None, - attn_bias=None, - seq_lens_tensor=seq_lens_tensor, - context_lens_tensor=context_lens_tensor, - num_prefills=real_num_seqs, - num_prefill_tokens=sum_query_len, - num_decode_tokens=0, - slot_mapping=slot_mapping, - multi_modal_placeholder_index_maps= - None, # FIXME(kzawora): multi-modality will not work here - enable_kv_scales_calculation=False, - ) - multi_modal_kwargs = MultiModalKwargs.batch(multi_modal_kwargs_list) - - return PreparePromptMetadata(input_tokens=input_tokens, - input_positions=input_positions, - attn_metadata=attn_metadata, - seq_lens=seq_lens, - query_lens=query_lens, - lora_index_mapping=lora_index_mapping, - lora_prompt_mapping=lora_prompt_mapping, - lora_requests=lora_requests, - multi_modal_kwargs=multi_modal_kwargs, - slot_mapping=slot_mapping, - lora_ids=lora_ids) - - def _prepare_decode( 
- self, - seq_group_metadata_list: List[SequenceGroupMetadata], - output=None, - ) -> PrepareDecodeMetadata: - input_tokens: List[List[int]] = [] - input_positions: List[List[int]] = [] - slot_mapping: List[List[int]] = [] - seq_lens: List[int] = [] - block_tables: List[List[int]] = [] - lora_index_mapping: List[List[int]] = [] - lora_prompt_mapping: List[List[int]] = [] - lora_requests: Set[LoRARequest] = set() - - if len(seq_group_metadata_list) == 0: - return PrepareDecodeMetadata.empty() - lora_ids: List[int] = [] - - dummy_slots = itertools.cycle( - range(_PAD_SLOT_ID, _PAD_SLOT_ID + self.block_size)) - - for seq_group_metadata in seq_group_metadata_list: - assert not seq_group_metadata.is_prompt - assert seq_group_metadata.token_chunk_size == 1 - - seq_ids = list(seq_group_metadata.seq_data.keys()) - lora_id = seq_group_metadata.lora_int_id - lora_ids.append(lora_id) - - if lora_id > 0: - lora_requests.add(seq_group_metadata.lora_request) - - for seq_id in seq_ids: - seq_data = seq_group_metadata.seq_data[seq_id] - if output is None: - generation_token = seq_data.get_last_token_id() - input_tokens.append([generation_token]) - - seq_len = seq_data.get_len() - position = seq_len - 1 - input_positions.append([position]) - - seq_len = seq_len if self.sliding_window is None else min( - seq_len, self.sliding_window) - seq_lens.append(seq_len) - - block_table = seq_group_metadata.block_tables[seq_id] - num_fully_occupied_blocks = position // self.block_size - block_table = block_table[:num_fully_occupied_blocks + 1] - - if len(block_table) == 0: - block_number = _PAD_BLOCK_ID - else: - block_number = block_table[position // self.block_size] - if block_number == _PAD_BLOCK_ID: - slot = next(dummy_slots) - else: - block_offset = position % self.block_size - slot = block_number * self.block_size + block_offset - slot_mapping.append([slot]) - lora_index_mapping.append(lora_id) - lora_prompt_mapping.append(lora_id) - - if self.sliding_window is not None: - sliding_window_blocks = (self.sliding_window // - self.block_size) - block_table = block_table[-sliding_window_blocks:] - block_tables.append(block_table) - - if output is None: - input_tokens = torch.tensor(input_tokens, - dtype=torch.long, - device=self.device) - else: - real_batch_size = len(seq_group_metadata_list) - input_tokens = output[:real_batch_size] - - input_positions = torch.tensor(input_positions, - dtype=torch.long, - device=self.device) - - num_decode_tokens = sum(seq_lens) - - last_block_usage = [ - slot[0] % self.block_size + 1 for slot in slot_mapping - ] - block_groups = [[i] * len(bt) for i, bt in enumerate(block_tables)] - block_usage = [[self.block_size] * (len(bt) - 1) + [lbu] - for bt, lbu in zip(block_tables, last_block_usage) - if bt] - - block_list = flatten(block_tables) - block_groups = flatten(block_groups) - block_usage = flatten(block_usage) - - assert len(block_list) == len(block_groups) - assert len(block_list) == len(block_usage) - - padding_fn = None - if self.use_contiguous_pa: - block_bucket_size = max(max(block_list) + 1, len(block_list)) - block_bucket_size = self.bucketing_ctx.get_padded_decode_num_blocks( - block_bucket_size) - indices: List[Any] - indices = [None] * block_bucket_size - for i, bid in enumerate(block_list): - indices[bid] = i - padding_fn = lambda tensor, pad_value: gather_list( - tensor, indices, pad_value) - else: - block_bucket_size = \ - self.bucketing_ctx.get_padded_decode_num_blocks( - len(block_list)) - padding_fn = lambda tensor, pad_value: pad_list( - tensor, 
block_bucket_size, pad_value) - - block_list = padding_fn(block_list, _PAD_BLOCK_ID) - block_groups = padding_fn(block_groups, -1) - block_usage = padding_fn(block_usage, 1) - - block_list = torch.tensor(block_list, - dtype=torch.int, - device=self.device) - block_groups = torch.tensor(block_groups, - dtype=torch.int, - device=self.device) - block_usage = torch.tensor(block_usage, - dtype=self.model_config.dtype, - device=self.device) - slot_mapping = torch.tensor(slot_mapping, - dtype=torch.long, - device=self.device) - - block_indices, block_offsets = precompute_indices_and_offsets( - self.block_size, slot_mapping, False) - - attn_metadata = self.attn_backend.make_metadata( - is_prompt=False, - block_list=block_list, - block_mapping=None, - block_usage=block_usage, - block_indices=block_indices, - block_offsets=block_offsets, - block_groups=block_groups, - attn_bias=None, - seq_lens_tensor=None, - context_lens_tensor=None, - num_prefills=0, - num_prefill_tokens=0, - num_decode_tokens=num_decode_tokens, - slot_mapping=slot_mapping, - multi_modal_placeholder_index_maps=None, - enable_kv_scales_calculation=False, - ) - return PrepareDecodeMetadata(input_tokens=input_tokens, - input_positions=input_positions, - attn_metadata=attn_metadata, - lora_index_mapping=lora_index_mapping, - lora_prompt_mapping=lora_prompt_mapping, - lora_requests=lora_requests, - slot_mapping=slot_mapping, - lora_ids=lora_ids) - - def prepare_input_tensors( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - ) -> Tuple[TModelInputForHPU, SamplingMetadata]: - if len(seq_group_metadata_list) == 0: - return self._model_input_cls(), None - - input_tokens = None - input_positions = None - lora_mapping = None - lora_requests = None - multi_modal_kwargs = None - batch_type = None - seq_lens = None - query_lens = None - real_batch_size = None - batch_size_padded = None - - self.event_start = self.profiler.get_timestamp_us() - is_prompt = seq_group_metadata_list[0].is_prompt - base_event_name = 'prompt' if is_prompt else 'decode' - self.profiler.start('internal', base_event_name) - - seq_group_metadata_list, real_batch_size, batch_size_padded = ( - self._add_dummy_seq(seq_group_metadata_list, is_prompt)) - - prefill_reqs = [] - decode_reqs = [] - for seq_group_meta in seq_group_metadata_list: - if seq_group_meta.is_prompt: - prefill_reqs.append(seq_group_meta) - else: - decode_reqs.append(seq_group_meta) - - # Prepare input tensors. - ( - input_tokens, - input_positions, - prefill_attn_metadata, - seq_lens, - query_lens, - lora_index_mapping, - lora_prompt_mapping, - lora_requests, - multi_modal_kwargs, - slot_mapping, - lora_ids, - ) = self._prepare_prompt(prefill_reqs) - ( - decode_input_tokens, - decode_input_positions, - decode_attn_metadata, - decode_lora_index_mapping, - decode_lora_prompt_mapping, - decode_lora_requests, - decode_slot_mapping, - decode_lora_ids, - ) = self._prepare_decode(decode_reqs) - sampling_metadata = SamplingMetadata.prepare(seq_group_metadata_list, - seq_lens, query_lens, - self.device, - self.pin_memory) - - if not self.scheduler_config.chunked_prefill_enabled: - assert (len(prefill_reqs) and len(decode_reqs)) == 0 - - num_prefills = len(seq_lens) - num_prefill_tokens = len(input_tokens) - num_decode_tokens = len(decode_input_tokens) - - # NOTE(kzawora): Here we diverge from GPU code - we don't - # support mixed batches, so we either use decode or prefill - # inputs, without coalescing. 
- assert (num_prefills == 0 and num_decode_tokens > 0) or ( - num_prefills > 0 - and num_decode_tokens == 0), "HPU does not support mixed batches!" - if num_decode_tokens > 0: - input_tokens = decode_input_tokens - input_positions = decode_input_positions - slot_mapping = decode_slot_mapping - lora_index_mapping = decode_lora_index_mapping - lora_prompt_mapping = decode_lora_prompt_mapping - lora_requests = decode_lora_requests - lora_ids = decode_lora_ids - - # FIXME: We need to adjust selected_token_indices to accommodate - # for padding - max_len = input_tokens.size(1) - paddings = [max_len - q for q in query_lens] - paddings = [0] + paddings[:-1] - paddings = list(itertools.accumulate(paddings)) - paddings_prompt_logprobs = [] - for i, seq_group_metadata in enumerate(seq_group_metadata_list): - if seq_group_metadata.sampling_params.prompt_logprobs is not None \ - and seq_group_metadata.is_prompt: - paddings_prompt_logprobs += ([paddings[i]] * seq_lens[i]) - paddings = torch.tensor( - paddings_prompt_logprobs if paddings_prompt_logprobs else paddings, - dtype=sampling_metadata.selected_token_indices.dtype, - device=sampling_metadata.selected_token_indices.device) - sampling_metadata.selected_token_indices.add_(paddings) - - if self.lora_config: - lora_mapping = LoRAMapping( - **dict(index_mapping=lora_index_mapping, - prompt_mapping=lora_prompt_mapping, - is_prefill=(num_prefills > 0))) - else: - lora_mapping = None - - if (prefill_attn_metadata is not None - and decode_attn_metadata is not None): - batch_type = BatchType.MIXED - raise NotImplementedError("Mixed batch is not supported on HPU") - elif prefill_attn_metadata is not None: - batch_type = BatchType.PREFILL - else: - batch_type = BatchType.DECODE - - metadata_dict = { - "input_tokens": input_tokens, - "input_positions": input_positions, - "selected_token_indices": sampling_metadata.selected_token_indices, - "lora_requests": lora_requests, - "lora_mapping": lora_mapping, - "multi_modal_kwargs": multi_modal_kwargs, - "num_prefill_tokens": num_prefill_tokens, - "num_decode_tokens": num_decode_tokens, - "slot_mapping": slot_mapping, - "num_prefills": num_prefills, - "batch_type": batch_type, - "seq_lens": seq_lens, - "query_lens": query_lens - } - if prefill_attn_metadata is not None: - metadata_dict.update(prefill_attn_metadata.asdict_zerocopy()) - else: - assert decode_attn_metadata is not None - metadata_dict.update(decode_attn_metadata.asdict_zerocopy()) - - attn_metadata = prefill_attn_metadata if \ - prefill_attn_metadata is not None else decode_attn_metadata - - return self._model_input_cls(input_tokens=input_tokens, - seq_lens=seq_lens, - query_lens=query_lens, - input_positions=input_positions, - attn_metadata=attn_metadata, - lora_requests=lora_requests, - lora_mapping=lora_mapping, - multi_modal_kwargs=multi_modal_kwargs, - real_batch_size=real_batch_size, - batch_size_padded=batch_size_padded, - lora_ids=lora_ids), \ - sampling_metadata - - def _seq_len(self, attn_metadata): - if attn_metadata.num_prefills != 0: - return attn_metadata.slot_mapping.size(1) - else: - return attn_metadata.block_list.numel() - - def trim_attn_metadata(self, metadata: AttentionMetadata) -> object: - # NOTE(kzawora): To anyone working on this in the future: - # Trimming metadata is required when using HPUGraphs. - # Attention metadata is going to be hashed by PT bridge, and - # appropriate HPUGraphs will be matched based on all inputs' hash. 
- - # Before you put more keys in here, make sure you know their - # value type and make sure you know how it's going to be hashed. - # You can find that information in input_hash function - # in habana_frameworks/torch/hpu/graphs.py. You can also hash - # it manually with torch.hpu.graphs.input_hash(attention_metadata) - - # If you use primitive types here - they will get hashed based - # on their value. You *will* get lots of excessive graph captures - # (and an OOM eventually) if you decide to put something like - # seq_len int here. - # If you absolutely need a scalar, put it in a tensor. Tensors - # get hashed using their metadata, not their values: - # input_hash(torch.tensor(123)) == input_hash(torch.tensor(321)) - # input_hash(123) != input_hash(321) - # input_hash("abc") != input_hash("cba") - attention_metadata = subtuple(metadata, 'TrimmedAttentionMetadata', [ - 'attn_bias', - 'seq_lens_tensor', - 'context_lens_tensor', - 'block_list', - 'block_mapping', - 'block_usage', - 'slot_mapping', - 'is_prompt', - 'block_indices', - 'block_offsets', - 'block_groups', - ]) - return attention_metadata - - def create_dummy_seq_group_metadata(self, - group_id, - seq_len, - is_prompt, - lora_request=None): - sampling_params = SamplingParams(temperature=0) - num_blocks = math.ceil(seq_len / self.block_size) - seq_len = max(seq_len, 1) - if is_prompt: - input_len = seq_len - output_len = 0 - block_tables = None - else: - input_len = seq_len - 1 - output_len = 1 - block_tables = {group_id: [_PAD_BLOCK_ID] * num_blocks} - prompt_token_ids = [0] * input_len - output_token_ids = [1] * output_len - prompt_token_ids_array = array('l', prompt_token_ids) # noqa: F821 - seq_data = SequenceData(prompt_token_ids_array) - seq_data.output_token_ids = output_token_ids - return SequenceGroupMetadata(request_id=str(group_id), - is_prompt=(output_len == 0), - seq_data={group_id: seq_data}, - sampling_params=sampling_params, - block_tables=block_tables, - lora_request=lora_request) - - def profile_run(self) -> None: - num_layers = self.model_config.get_num_layers(self.parallel_config) - kv_caches = [None] * num_layers - bind_kv_cache( - self.vllm_config.compilation_config.static_forward_context, - [kv_caches]) - _, max_seq_len = self.bucketing_ctx.get_max_prompt_shape() - max_batch_size = min(self.max_num_seqs, - self.max_num_batched_tokens // max_seq_len) - self.warmup_scenario(max_batch_size, max_seq_len, True, kv_caches, - False, True) - return - - def warmup_scenario(self, - batch_size, - seq_len, - is_prompt, - kv_caches, - is_pt_profiler_run=False, - is_lora_profile_run=False) -> None: - use_graphs = self._use_graphs(batch_size, seq_len, is_prompt) - scenario_name = ("warmup_" - f"{'prompt' if is_prompt else 'decode'}_" - f"bs{batch_size}_" - f"seq{seq_len}_" - f"graphs{'T' if use_graphs else 'F'}") - # This represents the maximum number of different requests - # that will have unique loras, an therefore the max amount of memory - # consumption create dummy lora request copies from the lora request - # passed in, which contains a lora from the lora warmup path. 
- dummy_lora_requests: List[LoRARequest] = [] - dummy_lora_requests_per_seq: List[LoRARequest] = [] - if self.lora_config and is_lora_profile_run: - assert self.lora_manager is not None - with self.lora_manager.dummy_lora_cache(): - for idx in range(self.lora_config.max_loras): - lora_id = idx + 1 - dummy_lora_request = LoRARequest( - lora_name=f"warmup_{lora_id}", - lora_int_id=lora_id, - lora_local_path="/not/a/real/path", - ) - self.lora_manager.add_dummy_lora(dummy_lora_request, - rank=LORA_WARMUP_RANK) - dummy_lora_requests.append(dummy_lora_request) - dummy_lora_requests_per_seq = [ - dummy_lora_requests[idx % len(dummy_lora_requests)] - for idx in range(batch_size) - ] - self.profiler.start('internal', scenario_name) - times = 3 if use_graphs or is_pt_profiler_run else 1 - if is_prompt: - seqs = [ - self.create_dummy_seq_group_metadata( - i, - seq_len, - is_prompt, - lora_request=dummy_lora_requests_per_seq[i] - if dummy_lora_requests_per_seq else None) - for i in range(batch_size) - ] - else: - # FIXME: seq_len is actually number of blocks - blocks = [seq_len // batch_size for _ in range(batch_size)] - blocks[0] += seq_len % batch_size - seqs = [ - self.create_dummy_seq_group_metadata( - i, - b * self.block_size - 1, - is_prompt, - lora_request=dummy_lora_requests_per_seq[i] - if dummy_lora_requests_per_seq else None) - for i, b in enumerate(blocks) - ] - torch.hpu.synchronize() - profiler = None - if is_pt_profiler_run and self.is_driver_worker: - profiler = setup_profiler() - profiler.start() - for _ in range(times): - inputs = self.prepare_model_input(seqs) - is_single_step = \ - self.vllm_config.scheduler_config.num_scheduler_steps == 1 - if is_prompt or is_single_step: - self.execute_model(inputs, None, warmup_mode=True) - else: # decode with multi-step - inputs = dataclasses.replace(inputs, - is_first_multi_step=True, - is_last_step=False) - self.execute_model(inputs, - None, - warmup_mode=True, - num_steps=2, - seqs=seqs) - inputs = dataclasses.replace(inputs, - is_first_multi_step=False, - is_last_step=True) - self.execute_model(inputs, - None, - warmup_mode=True, - num_steps=2, - seqs=seqs) - torch.hpu.synchronize() - if profiler: - profiler.step() - if profiler: - profiler.stop() - self.profiler.end() - gc.collect() - - def remove_all_loras(self): - if not self.lora_manager: - raise RuntimeError("LoRA is not enabled.") - self.lora_manager.remove_all_adapters() - - def set_active_loras(self, lora_requests: Set[LoRARequest], - lora_mapping: LoRAMapping) -> None: - if not self.lora_manager: - raise RuntimeError("LoRA is not enabled.") - self.lora_manager.set_active_adapters(lora_requests, lora_mapping) - - def add_lora(self, lora_request: LoRARequest) -> bool: - if not self.lora_manager: - raise RuntimeError("LoRA is not enabled.") - return self.lora_manager.add_adapter(lora_request) - - def remove_lora(self, lora_id: int) -> bool: - if not self.lora_manager: - raise RuntimeError("LoRA is not enabled.") - return self.lora_manager.remove_adapter(lora_id) - - def pin_lora(self, lora_id: int) -> bool: - if not self.lora_manager: - raise RuntimeError("LoRA is not enabled.") - return self.lora_manager.pin_adapter(lora_id) - - def list_loras(self) -> Set[int]: - if not self.lora_manager: - raise RuntimeError("LoRA is not enabled.") - return self.lora_manager.list_adapters() - - def log_warmup(self, phase, i, max_i, batch_size, seq_len): - free_mem = format_bytes( - HabanaMemoryProfiler.current_free_device_memory()) - dim = "num_blocks" - if phase == "Prompt": - dim = "seq_len" - 
msg = (f"[Warmup][{phase}][{i+1}/{max_i}] " - f"batch_size:{batch_size} " - f"{dim}:{seq_len} " - f"free_mem:{free_mem}") - logger.info(msg) - - def warmup_all_buckets(self, buckets, is_prompt, kv_caches): - for i, (batch_size, seq_len) in enumerate(reversed(buckets)): - self.log_warmup('Prompt' if is_prompt else 'Decode', i, - len(buckets), batch_size, seq_len) - self.warmup_scenario(batch_size, seq_len, is_prompt, kv_caches) - - def warmup_graphs(self, - strategy, - buckets, - is_prompt, - kv_caches, - available_mem, - starting_mem=0, - total_batch_seq=0.001): - total_mem = starting_mem - idx = 0 - phase = f'Graph/{"Prompt" if is_prompt else "Decode"}' - num_candidates = len(buckets) - ordering : Union[Callable[[Any], Tuple[Any, Any]], \ - Callable[[Any], Tuple[Any, Any, Any]]] - if strategy == 'min_tokens': - ordering = lambda b: (b[0] * b[1], b[1], b[0]) - elif strategy == 'max_bs': - ordering = lambda b: (-b[0], b[1]) - else: - raise NotImplementedError( - f'Unsupported graph allocation strategy: {strategy}') - buckets = list(sorted(buckets, key=ordering)) - captured_all = True - for idx, (batch_size, seq_len) in enumerate(buckets): - # Graph memory usage is proportional to seq dimension in a batch - batch_seq = batch_size * seq_len if is_prompt else batch_size - mem_estimate = batch_seq / total_batch_seq * total_mem - if mem_estimate >= available_mem: - captured_all = False - continue - graphed_bucket = (batch_size, seq_len, is_prompt) - if graphed_bucket in self.graphed_buckets: - continue - self.graphed_buckets.add(graphed_bucket) - self.log_warmup(phase, idx, num_candidates, batch_size, seq_len) - with HabanaMemoryProfiler() as mem_prof: - self.warmup_scenario(batch_size, seq_len, is_prompt, kv_caches) - used_mem = align_workers(mem_prof.consumed_device_memory, - torch.distributed.ReduceOp.MAX) - available_mem -= used_mem - total_mem += used_mem - total_batch_seq += batch_seq - - return total_mem, total_batch_seq, captured_all - - def log_graph_warmup_summary(self, buckets, is_prompt, total_mem): - num_candidates = len(buckets) - phase = f'Graph/{"Prompt" if is_prompt else "Decode"}' - graphed = list(c[:2] for c in self.graphed_buckets - if c[2] == is_prompt) - if num_candidates == 0: - num_candidates = 1 - msg = (f'{phase} captured:{len(graphed)} ' - f'({100 * len(graphed) / num_candidates:.1f}%) ' - f'used_mem:{format_bytes(total_mem)} ' - f'buckets:{sorted(list(graphed))}') - logger.info(msg) - - @torch.inference_mode() - def warmup_model(self, kv_caches: List[torch.Tensor]) -> None: - max_blocks = kv_caches[0][0].size(0) - self.bucketing_ctx.generate_decode_buckets(max_blocks) - if profile := os.environ.get('VLLM_PT_PROFILE', None): - phase, bs, seq_len, graph = profile.split('_') - is_prompt = phase == 'prompt' - graphs = graph == 't' - if graphs: - self.graphed_buckets.add((int(bs), int(seq_len), is_prompt)) - self.warmup_scenario(int(bs), int(seq_len), is_prompt, kv_caches, - True) - raise AssertionError("Finished profiling") - if not htorch.utils.internal.is_lazy() and not self.enforce_eager: - cache_size_limit = 1 + 3 * ( - len(self.bucketing_ctx.prompt_buckets) + - len(self.bucketing_ctx.decode_buckets)) - torch._dynamo.config.cache_size_limit = max( - cache_size_limit, torch._dynamo.config.cache_size_limit) - # Multiply by 8 to follow the original default ratio between - # the cache_size_limit and accumulated_cache_size_limit - torch._dynamo.config.accumulated_cache_size_limit = max( - cache_size_limit * 8, - torch._dynamo.config.accumulated_cache_size_limit) - if 
self.skip_warmup: - logger.info("Skipping warmup...") - return - self.profiler.start('internal', 'warmup') - start_mem = HabanaMemoryProfiler.current_device_memory_usage() - start_time = time.perf_counter() - - compile_only_mode_context = functools.partial(bc.env_setting, - "PT_COMPILE_ONLY_MODE", - True) - can_use_compile_only_mode = True - try: - with compile_only_mode_context(): - pass - logger.debug("Using PT_COMPILE_ONLY_MODE.") - except KeyError: - can_use_compile_only_mode = False - logger.warning('Cannot use PT_COMPILE_ONLY_MODE. ' - 'Warmup time will be negatively impacted. ' - 'Please update Gaudi Software Suite.') - with compile_only_mode_context( - ) if can_use_compile_only_mode else contextlib.nullcontext(): - self.warmup_all_buckets(self.bucketing_ctx.prompt_buckets, True, - kv_caches) - self.warmup_all_buckets(self.bucketing_ctx.decode_buckets, False, - kv_caches) - - if not self.enforce_eager and htorch.utils.internal.is_lazy(): - assert self.mem_margin is not None, \ - ("HabanaWorker.determine_num_available_blocks needs " - "to be called before warming up the model.") - free_mem = HabanaMemoryProfiler.current_free_device_memory() - graph_free_mem = free_mem - self.mem_margin - graph_free_mem = align_workers(graph_free_mem, - torch.distributed.ReduceOp.MIN) - prompt_graph_mem_ratio = float( - os.environ.get('VLLM_GRAPH_PROMPT_RATIO', '0.3')) - prompt_available_memory = (prompt_graph_mem_ratio * - graph_free_mem) - decode_available_memory = (graph_free_mem - - prompt_available_memory) - msg = ( - f"Using {format_bytes(graph_free_mem)}" - f"/{format_bytes(free_mem)} " - "of free device memory for HPUGraphs, " - f"{format_bytes(prompt_available_memory)} for prompt and " - f"{format_bytes(decode_available_memory)} for decode " - f"(VLLM_GRAPH_PROMPT_RATIO={prompt_graph_mem_ratio})") - logger.info(msg) - prompt_strategy = os.environ.get('VLLM_GRAPH_PROMPT_STRATEGY', - 'min_tokens') - decode_strategy = os.environ.get('VLLM_GRAPH_DECODE_STRATEGY', - 'max_bs') - mem_post_prompt, prompt_batch_seq, prompt_captured_all = \ - self.warmup_graphs( - prompt_strategy, self.bucketing_ctx.prompt_buckets, - True, kv_caches, prompt_available_memory) - mem_post_decode, decode_batch_seq, decode_captured_all = \ - self.warmup_graphs( - decode_strategy, self.bucketing_ctx.decode_buckets, - False, kv_caches, decode_available_memory) - - # Not all prompt buckets were captured, but all decode buckets - # were captured and we have some free graph-allocated space - # left. Let's try to use it for capturing more prompt buckets. - if (mem_post_decode + mem_post_prompt < graph_free_mem - and not prompt_captured_all and decode_captured_all): - mem_post_prompt, _, prompt_captured_all = ( - self.warmup_graphs( - prompt_strategy, self.bucketing_ctx.prompt_buckets, - True, kv_caches, - graph_free_mem - mem_post_prompt - mem_post_decode, - mem_post_prompt, prompt_batch_seq)) - - # Not all decode buckets were captured, but all prompt buckets - # were captured and we have some free graph-allocated space - # left. Let's try to use it for capturing more decode buckets. 
- if mem_post_decode + mem_post_prompt < graph_free_mem \ - and not decode_captured_all \ - and prompt_captured_all: - mem_post_decode, _, _ = self.warmup_graphs( - decode_strategy, self.bucketing_ctx.decode_buckets, - False, kv_caches, - graph_free_mem - mem_post_prompt - mem_post_decode, - mem_post_decode, decode_batch_seq) - - self.log_graph_warmup_summary( - self.bucketing_ctx.prompt_buckets, True, mem_post_prompt) - self.log_graph_warmup_summary( - self.bucketing_ctx.decode_buckets, False, mem_post_decode) - - end_time = time.perf_counter() - end_mem = HabanaMemoryProfiler.current_device_memory_usage() - elapsed_time = end_time - start_time - msg = ( - f"Warmup finished in {elapsed_time:.0f} secs, " - f"allocated {format_bytes(end_mem - start_mem)} of device memory") - logger.info(msg) - self.profiler.end() - - @property - def vocab_size(self) -> int: - return self.model_config.get_vocab_size() - - @property - def mem_margin(self) -> Optional[int]: - return self._mem_margin - - @mem_margin.setter - def mem_margin(self, value): - self._mem_margin = value - - -def _maybe_wrap_in_hpu_graph(*args, **kwargs): - return htorch.hpu.wrap_in_hpu_graph( - HpuModelAdapter(*args, **kwargs), disable_tensor_cache=True - ) if htorch.utils.internal.is_lazy() else HpuModelAdapter(*args, **kwargs) - - -class HabanaProfilerCounterHelper: - - def __init__(self): - self.niter = 0 - self.average_real_throughput = None - self.logged_once = False - self.real_seq_lens = [] - self.prompt_seq_lens = [] - - def capture_seq_group_metadata_stats(self, seq_group_metadata_list): - self.real_seq_lens = [ - len(seq_data.prompt_token_ids) + len(seq_data.output_token_ids) - for seq_group_metadata in seq_group_metadata_list - for seq_data in seq_group_metadata.seq_data.values() - ] - self.prompt_seq_lens = [ - len(seq_data.prompt_token_ids) - for seq_group_metadata in seq_group_metadata_list - for seq_data in seq_group_metadata.seq_data.values() - ] - - def get_counter_dict(self, cache_config, duration, seq_len, - batch_size_padded, real_batch_size, is_prompt): - throughput = batch_size_padded / (duration / 1e6) - throughput_effective = real_batch_size / (duration / 1e6) - - real_max_seq_len = max(self.real_seq_lens) - real_num_tokens = sum(self.real_seq_lens) - padded_num_tokens = batch_size_padded * seq_len - batch_token_utilization = real_num_tokens / padded_num_tokens - if self.average_real_throughput is None: - self.average_real_throughput = throughput_effective - else: # https://www.heikohoffmann.de/htmlthesis/node134.html - self.average_real_throughput = self.average_real_throughput + 1 / ( - self.niter + 1) * (throughput_effective - - self.average_real_throughput) - phase = "prompt" if is_prompt else "decode" - counters = { - f'{phase}_bucket_batch_size': batch_size_padded, - f'{phase}_batch_size': real_batch_size, - f'{phase}_bucket_seq_len': seq_len, - f'{phase}_seq_len': real_max_seq_len, - f'{phase}_bucket_gen_throughput': throughput, - f'{phase}_real_gen_throughput': throughput_effective, - f'{phase}_batch_token_utilization': batch_token_utilization, - 'average_real_throughput': self.average_real_throughput, - 'engine_iteration': self.niter, - } - self.niter += 1 - if is_prompt: - prompt_bucket_in_throughput = (seq_len * batch_size_padded) / ( - duration / 1e6) - prompt_real_in_throughput = sum( - self.prompt_seq_lens) / (duration / 1e6) - counters[ - f'{phase}_bucket_in_throughput'] = prompt_bucket_in_throughput - counters[f'{phase}_real_in_throughput'] = prompt_real_in_throughput - - # KV cache might not be 
created yet (e.g. for profiling run) - if cache_config.num_gpu_blocks is not None and \ - cache_config.num_gpu_blocks != 0: - cache_num_blocks_used = [ - math.ceil(sl / cache_config.block_size) - for sl in self.real_seq_lens - ] - cache_total_num_blocks_used = sum(cache_num_blocks_used) - num_cache_blocks = cache_config.num_gpu_blocks - cache_total_num_free_blocks = \ - num_cache_blocks - cache_total_num_blocks_used - cache_computed_utilization = \ - cache_total_num_blocks_used / num_cache_blocks - max_blocks_per_seq = math.ceil(seq_len / cache_config.block_size) - batch_block_utilization = cache_total_num_blocks_used / ( - batch_size_padded * max_blocks_per_seq) - counters['cache_num_blocks_used'] = cache_total_num_blocks_used - counters['cache_num_free_blocks'] = cache_total_num_free_blocks - counters['cache_computed_utilization'] = cache_computed_utilization - counters[ - f'{phase}_batch_block_utilization'] = batch_block_utilization - if not self.logged_once: - counters['const_cache_num_blocks'] = cache_config.num_gpu_blocks - counters[ - 'const_gpu_memory_utilization'] = \ - cache_config.gpu_memory_utilization - counters['const_block_size'] = cache_config.block_size - self.logged_once = True - return counters - - -def unwrap_model(model): - if isinstance(model, torch._dynamo.eval_frame.OptimizedModule): - return unwrap_model(model._orig_mod) - else: - model = list(vars(model)['_modules'].values())[0] - modules = list(vars(model)['_modules'].values()) - return modules - - -class HPUModelRunner(HPUModelRunnerBase[ModelInputForHPUWithSamplingMetadata]): - """ - GPU model runner with sampling step. - """ - _model_input_cls: Type[ModelInputForHPUWithSamplingMetadata] = ( - ModelInputForHPUWithSamplingMetadata) - - def make_model_input_from_broadcasted_tensor_dict( - self, - tensor_dict: Dict[str, Any], - ) -> ModelInputForHPUWithSamplingMetadata: - return ( - ModelInputForHPUWithSamplingMetadata.from_broadcasted_tensor_dict( - tensor_dict, - attn_backend=self.attn_backend, - )) - - @torch.inference_mode() - def prepare_model_input( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - virtual_engine: int = 0, - finished_requests_ids: Optional[List[str]] = None - ) -> ModelInputForHPUWithSamplingMetadata: - """Prepare the model input based on a given sequence group, including - metadata for the sampling step. - The API assumes seq_group_metadata_list is sorted by prefill -> decode. - The result tensors and data structure also batches input in prefill - -> decode order. For example, - - input_tokens[:num_prefill_tokens] contains prefill tokens. - - input_tokens[num_prefill_tokens:] contains decode tokens. - If cuda graph is required, this API automatically pads inputs. 
- """ - with self.profiler.record_event('internal', 'prepare_input_tensors'): - assert seq_group_metadata_list is not None - if self.profiler.enabled: - self.profiler_counter_helper.capture_seq_group_metadata_stats( - seq_group_metadata_list=seq_group_metadata_list) - model_input, sampling_metadata = self.prepare_input_tensors( - seq_group_metadata_list) - assert model_input.attn_metadata is not None - is_prompt = model_input.attn_metadata.is_prompt - - return dataclasses.replace(model_input, - sampling_metadata=sampling_metadata, - is_prompt=is_prompt, - virtual_engine=virtual_engine) - - def finish_measurements(self): - from neural_compressor.torch.quantization import finalize_calibration - finalize_calibration(self.model.model) - - def _num_blocks(self, attn_metadata): - if attn_metadata.block_list is None: - return 0 - return attn_metadata.block_list.numel() - - def _phase(self, attn_metadata): - phase_type: PhaseType - is_prompt = attn_metadata.is_prompt - is_prefix_prefill = is_prompt and attn_metadata.block_list is not None - if is_prompt and is_prefix_prefill: - phase_type = PhaseType.PREFIX_PREFILL - elif is_prompt and not is_prefix_prefill: - phase_type = PhaseType.PREFILL - elif not is_prompt: - phase_type = PhaseType.DECODE - else: - raise ValueError("Unrecognized pass type, likely due to malformed " - "attention metadata") - return phase_type - - def _check_config(self, batch_size, seq_len, attn_metadata, warmup_mode): - is_prefix_caching = self.vllm_config.cache_config.enable_prefix_caching - cfg: Optional[tuple] = None - assert cfg is None, "Configs changed between 2D and 3D" - if is_prefix_caching: - phase = self._phase(attn_metadata) - num_blocks = self._num_blocks(attn_metadata) - cfg = (batch_size, seq_len, num_blocks, phase) - else: - phase = 'prompt' if attn_metadata.is_prompt else 'decode' - cfg = (batch_size, seq_len, phase) - seen = cfg in self.seen_configs - self.seen_configs.add(cfg) - if not seen and not warmup_mode: - logger.warning("Configuration: %s was not warmed-up!", - (phase.value, batch_size, seq_len, - num_blocks) if is_prefix_caching else - (phase, batch_size, seq_len)) - - def create_lora_mask(self, input_tokens: torch.Tensor, lora_ids: List[int], - is_prompt: bool): - ''' - This is a helper function to create the mask for lora computations. - Lora Mask is needed to ensure we match the correct lora weights for the - for the request. 
- For Prompt phase we have - lora_mask with shape (batch_size * seq_len, max_loras * max_rank) - lora_logits_mask with shape (batch_size, max_loras * max_rank) - For Decode phase we have both - lora_mask and lora_logits_mask with shape - (batch_size, max_loras * max_rank) - ''' - lora_mask: torch.Tensor = None - lora_logits_mask: torch.Tensor = None - lora_index = 0 - - if self.lora_config: - if is_prompt: - lora_mask = torch.zeros( - input_tokens.shape[0] * input_tokens.shape[1], - (self.lora_config.max_loras) *\ - self.lora_config.max_lora_rank, - dtype=self.lora_config.lora_dtype) - lora_logits_mask = torch.zeros( - input_tokens.shape[0], (self.lora_config.max_loras) * - self.lora_config.max_lora_rank, - dtype=self.lora_config.lora_dtype) - - ones = torch.ones(input_tokens.shape[1], - self.lora_config.max_lora_rank, - dtype=self.lora_config.lora_dtype) - logit_ones = torch.ones(1, - self.lora_config.max_lora_rank, - dtype=self.lora_config.lora_dtype) - - for i in range(len(lora_ids)): - if lora_ids[i] == 0: - continue - lora_index = self.lora_manager._adapter_manager.\ - lora_index_to_id.index(lora_ids[i]) - start_row = i * input_tokens.shape[1] - end_row = start_row + input_tokens.shape[1] - start_col = lora_index * self.lora_config.max_lora_rank - end_col = start_col + self.lora_config.max_lora_rank - lora_mask[start_row:end_row, start_col:end_col] = ones - lora_logits_mask[i, start_col:end_col] = logit_ones - lora_mask = lora_mask.to('hpu') - lora_logits_mask = lora_logits_mask.to('hpu') - else: - lora_mask = torch.zeros(input_tokens.shape[0], - (self.lora_config.max_loras) * - self.lora_config.max_lora_rank, - dtype=self.lora_config.lora_dtype) - ones = torch.ones(1, - self.lora_config.max_lora_rank, - dtype=self.lora_config.lora_dtype) - for i in range(len(lora_ids)): - if lora_ids[i] == 0: - continue - lora_index = self.lora_manager._adapter_manager.\ - lora_index_to_id.index(lora_ids[i]) - start_pos = lora_index * self.lora_config.max_lora_rank - end_pos = start_pos + self.lora_config.max_lora_rank - lora_mask[i, start_pos:end_pos] = ones - lora_mask = lora_mask.to('hpu') - lora_logits_mask = lora_mask - - return lora_mask, lora_logits_mask - - def _get_seq_ids(self, model_input): - return ([ - sg.seq_ids[0] for sg in model_input.sampling_metadata.seq_groups - ]) - - def _pad_to_max_num_seqs(self, tensor, value): - padding_needed = self.max_num_seqs - tensor.size(0) - if padding_needed: - padding = torch.full((padding_needed, *tensor.shape[1:]), - value, - device=tensor.device, - dtype=tensor.dtype) - tensor = torch.cat([tensor, padding]) - return tensor - - @torch.inference_mode() - def execute_model( - self, - model_input: ModelInputForHPUWithSamplingMetadata, - kv_caches: List[torch.Tensor], - intermediate_tensors: Optional[IntermediateTensors] = None, - num_steps: int = 1, - warmup_mode=False, - seqs=None, - ) -> Optional[Union[List[SamplerOutput], IntermediateTensors]]: - VLLM_DELAYED_SAMPLING = envs.VLLM_HPU_USE_DELAYED_SAMPLING - use_delayed_sampling = VLLM_DELAYED_SAMPLING and not warmup_mode - assert not (use_delayed_sampling and num_steps != 1), \ - 'Delayed sampling is not compatible with MSS!' 
- assert model_input.input_tokens is not None - if use_delayed_sampling and not model_input.is_prompt and \ - self.is_driver_worker: - num_cached = len(self.cached_step_outputs) - assert num_cached > 0 - cur_seq_ids = self._get_seq_ids(model_input) - cur_seq_id_pos = { - sid: idx - for idx, sid in enumerate(cur_seq_ids) if sid >= 0 - } - htorch.core.mark_step() - for i in range(num_cached): - prev_seq_ids = self._get_seq_ids(self.cached_step_inputs[i]) - target_indices = [ - cur_seq_id_pos.get(psi, -1) for psi in prev_seq_ids - ] - padding = self.cached_step_outputs[i].size(0) - len( - target_indices) - target_indices.extend([-1] * padding) - target_indices = torch.tensor( - target_indices, - device=model_input.input_tokens.device, - dtype=model_input.input_tokens.dtype) - model_input.input_tokens.index_copy_( - 0, target_indices, self.cached_step_outputs[i]) - htorch.core.mark_step() - - if not model_input.is_first_multi_step: - if not model_input.is_last_step: - # not first or last multi-step - return [] - # last multi-step - output = self._decode_sampler_outputs( - model_input) if self.is_driver_worker else [] - torch.hpu.synchronize() - if model_input.is_first_multi_step: - # first multi-step - if self.lora_config: - assert model_input.lora_requests is not None - assert model_input.lora_mapping is not None - self.set_active_loras(model_input.lora_requests, - model_input.lora_mapping) - # Rank!=0 workers has is_prompt==None - if use_delayed_sampling and not model_input.is_prompt and \ - model_input.input_tokens.size(1) == 1: - if self.is_driver_worker: - model_kwargs_broadcast_data = { - "input_tokens": model_input.input_tokens - } - broadcast_tensor_dict(model_kwargs_broadcast_data, src=0) - input_tokens = model_input.input_tokens - - else: - model_kwargs_broadcast_data = broadcast_tensor_dict(src=0) - input_tokens = model_kwargs_broadcast_data["input_tokens"] - else: - input_tokens = model_input.input_tokens - input_positions = model_input.input_positions - attn_metadata = model_input.attn_metadata - sampling_metadata = model_input.sampling_metadata - real_batch_size = model_input.real_batch_size - batch_size_padded = model_input.batch_size_padded - assert input_tokens is not None - assert input_positions is not None - assert sampling_metadata is not None - assert attn_metadata is not None - is_prompt = attn_metadata.is_prompt - assert is_prompt is not None - batch_size = input_tokens.size(0) - seq_len = self._seq_len(attn_metadata) - use_graphs = self._use_graphs(batch_size, seq_len, is_prompt) - self._check_config(batch_size, seq_len, attn_metadata, warmup_mode) - - lora_mask: torch.Tensor = None - lora_logits_mask: torch.Tensor = None - if self.lora_config: - assert model_input.lora_ids is not None - lora_mask, lora_logits_mask = self.create_lora_mask( - input_tokens, model_input.lora_ids, - attn_metadata.is_prompt) - - execute_model_kwargs = { - "input_ids": input_tokens, - "positions": input_positions, - "attn_metadata": self.trim_attn_metadata(attn_metadata), - "intermediate_tensors": intermediate_tensors, - "lora_mask": lora_mask, - "virtual_engine": model_input.virtual_engine, - **(model_input.multi_modal_kwargs or {}), - } - if htorch.utils.internal.is_lazy(): - execute_model_kwargs.update( - {"bypass_hpu_graphs": not use_graphs}) - - htorch.core.mark_step() - if self.is_driver_worker: - model_event_name = ("model_" - f"{'prompt' if is_prompt else 'decode'}_" - f"bs{batch_size}_" - f"seq{seq_len}_" - f"graphs{'T' if use_graphs else 'F'}") - else: - model_event_name = 
'model_executable' - if num_steps > 1 or use_delayed_sampling: - # in case of multi-step scheduling - # we only want to pythonize in the last step - sampling_metadata.skip_sampler_cpu_output = True - self.model.sampler.include_gpu_probs_tensor = True - cache_orig_output_tokens_len: List[Dict] = [] - - def try_revert_dummy_output_tokens(): - if len(cache_orig_output_tokens_len) > 0: - # Reuse the original output token ids length - for i, seq_group_metadata in enumerate( - seq_group_metadata_list): - for j, data in seq_group_metadata.seq_data.items(): - orig_output_tokens_len = \ - cache_orig_output_tokens_len[i][j] - data.output_token_ids = \ - data.output_token_ids[:orig_output_tokens_len] - - for i in range(num_steps): - if i != 0 and not self.is_driver_worker: - broadcast_data = broadcast_tensor_dict(src=0) - if 'early_exit' in broadcast_data and broadcast_data[ - 'early_exit']: - return [output] if num_steps == 1 else [] - execute_model_kwargs.update({ - "input_ids": - broadcast_data["input_ids"], - "positions": - broadcast_data["positions"], - "attn_metadata": - self.trim_attn_metadata( - broadcast_data["attn_metadata"]) - }) - with self.profiler.record_event('internal', model_event_name): - hidden_states = self.model.forward( - **execute_model_kwargs, - selected_token_indices=sampling_metadata. - selected_token_indices) - - if self.lora_config: - LoraMask.setLoraMask( - lora_logits_mask.index_select( - 0, sampling_metadata.selected_token_indices)) - - # Compute the logits. - with self.profiler.record_event( - 'internal', - ('compute_logits_' - f'{"prompt" if is_prompt else "decode"}_bs' - f'{batch_size}_' - f'seq{seq_len}')): - if num_steps == 1: - sampling_metadata.selected_token_indices = None - logits = self.model.compute_logits(hidden_states, - sampling_metadata) - htorch.core.mark_step() - # Only perform sampling in the driver worker. 
- if not self.is_driver_worker: - continue - - if use_delayed_sampling: - fake_output = self._delayed_sampler_outputs(model_input) - - with self.profiler.record_event( - 'internal', ('sample_' - f'{"prompt" if is_prompt else "decode"}_' - f'bs{batch_size}_' - f'seq{seq_len}')): - output = self.model.sample( - logits=logits, - sampling_metadata=sampling_metadata, - ) - if num_steps > 1: - output = output.sampled_token_ids - self.cached_step_outputs.append(output) - if use_delayed_sampling and self.is_driver_worker: - self._patch_prev_output() - output = self._pad_to_max_num_seqs( - output.sampled_token_ids, DUMMY_TOKEN_ID) - self.cached_step_outputs.append(output) - self.cached_step_inputs.append(model_input) - htorch.core.mark_step() - if model_input.async_callback is not None: - model_input.async_callback() - if i < num_steps - 1: - if i == 0: - if model_input.async_callback is not None: - ctx = model_input.async_callback.keywords[ # type: ignore - "ctx"] - seq_group_metadata_list = \ - ctx.seq_group_metadata_list - elif seqs is not None: - seq_group_metadata_list = seqs - else: - raise RuntimeError( - "seq_group_metadata_list is uninitialized") - for i, seq_group_metadata in enumerate( - seq_group_metadata_list): - # Skip empty steps - seq_group_metadata.state.current_step += ( - num_steps - 2) - # Cache the original output token ids - cache_orig_output_tokens_len.append({}) - for j, data in seq_group_metadata.seq_data.items(): - cache_orig_output_tokens_len[i][j] = \ - len(data.output_token_ids) - for seq_group_metadata in seq_group_metadata_list: - for data in seq_group_metadata.seq_data.values(): - max_output_len = sampling_metadata.seq_groups[ - 0].sampling_params.max_tokens - if len(data.output_token_ids) < max_output_len - 1: - # add a place holder for prepare_decode - # arbitrary value, this could be any token - dummy_token = (540, ) - data.output_token_ids += (dummy_token) - else: - broadcast_tensor_dict({'early_exit': True}, - src=0) - if num_steps == 1: - return [output] - else: - try_revert_dummy_output_tokens() - return [] - - result = self._prepare_decode(seq_group_metadata_list, - output=output) - execute_model_kwargs.update({ - "input_ids": - result.input_tokens, - "positions": - result.input_positions, - "attn_metadata": - self.trim_attn_metadata(result.attn_metadata) - }) - model_kwargs_broadcast_data = { - "input_ids": result.input_tokens, - "positions": result.input_positions, - "attn_metadata": vars(result.attn_metadata) - } - broadcast_tensor_dict(model_kwargs_broadcast_data, src=0) - else: - try_revert_dummy_output_tokens() - - if self.is_driver_worker and self.profiler.enabled: - # Stop recording 'execute_model' event - self.profiler.end() - event_end = self.profiler.get_timestamp_us() - counters = self.profiler_counter_helper.get_counter_dict( - cache_config=self.cache_config, - duration=event_end - self.event_start, - seq_len=seq_len, - batch_size_padded=batch_size_padded, - real_batch_size=real_batch_size, - is_prompt=is_prompt) - self.profiler.record_counter(self.event_start, counters) - if num_steps == 1: - if self.return_hidden_states: - # we only need to pass hidden states of most recent token - assert model_input.sampling_metadata is not None - if model_input.is_prompt: - output.prefill_hidden_states = hidden_states - output.hidden_states = hidden_states - if use_delayed_sampling: - if self.is_driver_worker: - return [fake_output] - else: - return [] - - return [output] if self.is_driver_worker else [] - else: - return [] - return output if type(output) is 
list else [output] - - def _delayed_sampler_outputs(self, model_input): - next_token_ids = [[DUMMY_TOKEN_ID]] * len( - model_input.sampling_metadata.seq_groups) - sampler_output = self._make_decode_output( - next_token_ids, model_input.sampling_metadata.seq_groups) - return sampler_output - - def _decode_sampler_outputs(self, model_input): - use_async_out_proc = model_input.async_callback is not None - sampler_outputs = [] - num_outputs = len(self.cached_step_outputs) - for i in range(num_outputs): - next_token_ids = self.cached_step_outputs.pop(0) - next_token_ids = next_token_ids.cpu().tolist() - sampler_output = self._make_decode_output( - next_token_ids, model_input.sampling_metadata.seq_groups) - sampler_outputs.append(sampler_output) - - if i < num_outputs - 1 and use_async_out_proc: - assert model_input.async_callback is not None - ctx = model_input.async_callback.keywords[ # type: ignore - "ctx"] - ctx.append_output( - outputs=[sampler_output], - seq_group_metadata_list=ctx.seq_group_metadata_list, - scheduler_outputs=ctx.scheduler_outputs, - is_async=False, - is_last_step=False, - is_first_step_output=False) - model_input.async_callback() - - if use_async_out_proc: - return [sampler_outputs[-1]] - else: - return sampler_outputs - - def _make_decode_output( - self, - next_token_ids: List[List[int]], - seq_groups: List[SequenceGroupToSample], - ) -> SamplerOutput: - zero_logprob = Logprob(0.0) - sampler_outputs = [] - batch_idx = 0 - for seq_group in seq_groups: - seq_ids = seq_group.seq_ids - seq_outputs = [] - for seq_id in seq_ids: - next_token_id = next_token_ids[batch_idx][0] - seq_outputs.append( - SequenceOutput(seq_id, next_token_id, - {next_token_id: zero_logprob})) - batch_idx += 1 - sampler_outputs.append( - CompletionSequenceGroupOutput(seq_outputs, None)) - return SamplerOutput(sampler_outputs) - - def shutdown_inc(self): - can_finalize_inc = False - from contextlib import suppress - with suppress(AttributeError): - can_finalize_inc = (self.model_config.quantization == 'inc') and \ - (self.model.model is not None) and \ - self.inc_initialized_successfully and \ - not getattr(self, "_is_inc_finalized", False) - if can_finalize_inc: - from neural_compressor.torch.quantization import ( - finalize_calibration) - finalize_calibration(self.model.model) - self._is_inc_finalized = True - - def __del__(self): - self.shutdown_inc() - - def _patch_prev_output(self): - assert len(self.cached_step_inputs) == len(self.cached_step_outputs), \ - f'''Inputs and outputs are out of sync! - {len(self.cached_step_inputs)} vs {len(self.cached_step_outputs)}''' - if len(self.cached_step_inputs) == 0: - return - model_input = self.cached_step_inputs.pop(0) - delayed_output = self.cached_step_outputs.pop(0).cpu().squeeze( - -1).tolist() - ctx = model_input.async_callback.keywords["ctx"] # type: ignore - # If there's no output to patch with, which is usually the case when - # we're starting a new request after all requests are completed. - if len(ctx.output_queue) == 0: - return - assert len( - ctx.output_queue) == 1, 'There should be exactly 1 output waiting!' - output_data = ctx.output_queue[0] - assert len(output_data.outputs) == 1 - for fake_out, real_out in zip(output_data.outputs[0], delayed_output): - fake_out.samples[0].output_token = real_out - for sg, real_out in zip(output_data.seq_group_metadata_list, - delayed_output): - assert len(sg.seq_data) == 1 - seq_data = list(sg.seq_data.values())[0] - # This is a hack. 
Assigning output_token_ids triggers - # a cache recomputation and we only need to update the last token - seq_data.output_token_ids_array[-1] = real_out - seq_data._cached_all_token_ids[-1] = real_out diff --git a/vllm/worker/hpu_worker.py b/vllm/worker/hpu_worker.py deleted file mode 100644 index 560110df0a3..00000000000 --- a/vllm/worker/hpu_worker.py +++ /dev/null @@ -1,485 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -############################################################################### -# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company -############################################################################### - -import contextlib -import gc -import os -from typing import List, Optional, Set, Tuple, Type - -import habana_frameworks.torch as htorch # noqa:F401 -import torch -import torch.distributed -from vllm_hpu_extension.profiler import HabanaMemoryProfiler, format_bytes - -import vllm.envs as envs -from vllm.config import ParallelConfig, VllmConfig -from vllm.distributed import (ensure_model_parallel_initialized, - init_distributed_environment) -from vllm.logger import init_logger -from vllm.lora.request import LoRARequest -from vllm.model_executor import set_random_seed -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.platforms import current_platform -from vllm.prompt_adapter.request import PromptAdapterRequest -from vllm.sequence import ExecuteModelRequest -from vllm.utils import bind_kv_cache -from vllm.worker.cache_engine import CacheEngine -from vllm.worker.hpu_model_runner import HPUModelRunner -from vllm.worker.model_runner_base import ModelRunnerBase -from vllm.worker.worker_base import (LocalOrDistributedWorkerBase, WorkerBase, - WorkerInput) - -logger = init_logger(__name__) - - -class HPUWorker(LocalOrDistributedWorkerBase): - """A worker class that executes (a partition of) the model on a HPU. - - Each worker is associated with a single HPU. The worker is responsible for - maintaining the KV cache and executing the model on the HPU. In case of - distributed inference, each worker is assigned a partition of the model. - """ - - def __init__( - self, - vllm_config: VllmConfig, - local_rank: int, - rank: int, - distributed_init_method: str, - is_driver_worker: bool = False, - model_runner_cls: Optional[Type[ModelRunnerBase]] = None, - ) -> None: - WorkerBase.__init__(self, vllm_config=vllm_config) - self.parallel_config.rank = rank - self.local_rank = local_rank - self.rank = rank - self.distributed_init_method = distributed_init_method - self.is_driver_worker = is_driver_worker - if self.is_driver_worker: - assert self.rank == 0, "The driver worker must have rank 0." - - if self.model_config.trust_remote_code: - # note: lazy import to avoid importing torch before initializing - from vllm.utils import init_cached_hf_modules - init_cached_hf_modules() - - self.model_runner: HPUModelRunner = HPUModelRunner( - vllm_config=vllm_config, is_driver_worker=is_driver_worker) - # Uninitialized cache engine. Will be initialized by - # initialize_cache. - self.cache_engine: List[HPUCacheEngine] - # Initialize gpu_cache as pooling models don't initialize kv_caches - self.hpu_cache: Optional[List[List[torch.Tensor]]] = None - # Torch profiler. Enabled and configured through env vars: - # VLLM_TORCH_PROFILER_DIR=/path/to/save/trace - if envs.VLLM_TORCH_PROFILER_DIR: - torch_profiler_trace_dir = envs.VLLM_TORCH_PROFILER_DIR - logger.info("Profiling enabled. 
Traces will be saved to: %s", - torch_profiler_trace_dir) - self.profiler = torch.profiler.profile( - activities=[ - torch.profiler.ProfilerActivity.CPU, - torch.profiler.ProfilerActivity.HPU, - ], - with_stack=True, - on_trace_ready=torch.profiler.tensorboard_trace_handler( - torch_profiler_trace_dir, use_gzip=True)) - else: - self.profiler = None - - def start_profile(self): - if self.profiler is None: - raise RuntimeError("Profiler is not enabled.") - self.profiler.start() - - def stop_profile(self): - if self.profiler is None: - raise RuntimeError("Profiler is not enabled.") - self.profiler.stop() - - def _set_env_vars(self): - local_rank = self.local_rank - if self.parallel_config.world_size == 1: - local_rank = -1 - import os - os.environ["LOCAL_RANK"] = str(local_rank) - os.environ["ID"] = str(local_rank) - os.environ["WORLD_SIZE"] = str(self.parallel_config.world_size) - os.environ["RANK"] = str(self.rank) - - def init_device(self) -> None: - if self.device_config.device.type == "hpu": - self.device = torch.device("hpu") - torch.hpu.set_device(self.device) - else: - raise RuntimeError( - f"Not support device type: {self.device_config.device}") - # Initialize the distributed environment. - if self.model_config.quantization == 'inc': - self._set_env_vars() - init_worker_distributed_environment(self.parallel_config, self.rank, - self.distributed_init_method, - self.local_rank) - # Set random seed. - set_random_seed(self.model_config.seed) - - def load_model(self): - self.model_runner.load_model() - - def execute_model( - self, - execute_model_req: Optional[ExecuteModelRequest] = None, - ) -> Optional[List[SamplerOutput]]: - # VLLM_HPU_LOG_STEP_GRAPH_COMPILATION - will log graph compilations per engine step, only when there was any - highly recommended to use alongside PT_HPU_METRICS_GC_DETAILS! 
# noqa:E501 - # VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL - will log graph compilations per engine step, always, even if there were none # noqa:E501 - # VLLM_HPU_LOG_STEP_CPU_FALLBACKS - will log cpu fallbacks per engine step, only when there was any # noqa:E501 - # VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL - will log cpu fallbacks per engine step, always, even if there were none # noqa:E501 - log_graph_compilation_all = os.environ.get( - 'VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL', '0') != '0' - log_graph_compilation = os.environ.get( - 'VLLM_HPU_LOG_STEP_GRAPH_COMPILATION', - '0') != '0' or log_graph_compilation_all - log_cpu_fallbacks_all = os.environ.get( - 'VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL', '0') != '0' - log_cpu_fallbacks = os.environ.get('VLLM_HPU_LOG_STEP_CPU_FALLBACKS', - '0') != '0' or log_cpu_fallbacks_all - if (log_graph_compilation or log_cpu_fallbacks) and \ - execute_model_req is not None: - from habana_frameworks.torch.hpu.metrics import metric_localcontext - seq_group_metadata_list = execute_model_req.seq_group_metadata_list - is_prompt = any([ - seq_group_metadata.is_prompt - for seq_group_metadata in seq_group_metadata_list - ]) - max_context_len = max([ - max([ - len(v.prompt_token_ids) + len(v.output_token_ids) - for v in seq_group_metadata.seq_data.values() - ]) for seq_group_metadata in seq_group_metadata_list - ]) # whoa, that's some spicy stuff right here - max_num_blocks = ( - (max_context_len - 1) // self.cache_config.block_size) + 1 - input_stats = (f'is_prompt: {is_prompt}, ' - f'num_seqs: {len(seq_group_metadata_list)}, ' - f'max_context_len: {max_context_len}, ' - f'max_num_blocks {max_num_blocks}') - gc_ctx = metric_localcontext( - "graph_compilation" - ) if log_graph_compilation else contextlib.nullcontext() - cpu_fallback_ctx = metric_localcontext( - "cpu_fallback" - ) if log_cpu_fallbacks else contextlib.nullcontext() - with gc_ctx as gc_local_metric, \ - cpu_fallback_ctx as cpu_fallback_local_metric: - output = LocalOrDistributedWorkerBase.execute_model( - self, execute_model_req) - if (log_graph_compilation and gc_local_metric.stats()[0][1] - > 0) or log_graph_compilation_all: - msg = ("VLLM_HPU_STEP_GRAPH_COMPILATION: " - f"{gc_local_metric.stats()}, {input_stats}") - logger.warning(msg) - if (log_cpu_fallbacks and cpu_fallback_local_metric.stats()[0][1] - > 0) or log_cpu_fallbacks_all: - msg = ("VLLM_HPU_STEP_CPU_FALLBACK: " - f"{cpu_fallback_local_metric.stats()}, {input_stats}") - logger.warning(msg) - - return output - - output = LocalOrDistributedWorkerBase.execute_model( - self, execute_model_req) - return output - - @torch.inference_mode() - def determine_num_available_blocks(self) -> Tuple[int, int]: - """Profiles the peak memory usage of the model to determine how many - KV blocks may be allocated without OOMs. - - The engine will first conduct a profiling of the existing memory usage. - Then, it calculate the maximum possible number of GPU and CPU blocks - that can be allocated with the remaining free memory. - - Tip: - You may limit the usage of GPU memory - by adjusting the `gpu_memory_utilization` parameter. - """ - # Profile the memory usage of the model and get the maximum number of - # cache blocks that can be allocated with the remaining free memory. - - # Execute a forward pass with dummy inputs to profile the memory usage - # of the model. 
- with HabanaMemoryProfiler() as m: - self.model_runner.profile_run() - torch.hpu.synchronize() - msg = ("Model profiling run " - f"took {m.get_summary_string()}") - logger.info(msg) - # At this point we should've allocated the maximum workspace for all - # recipes we will use the extra memory for graphs/blocks - free_hpu_memory = torch.hpu.mem_get_info()[0] - - cache_block_size = self.get_cache_block_size_bytes() - graph_reserved_mem = (float( - os.environ.get('VLLM_GRAPH_RESERVED_MEM', '0.1')) - if not self.model_config.enforce_eager else 0) - graph_headroom = 1 - graph_reserved_mem - available_hpu_memory = free_hpu_memory * \ - self.cache_config.gpu_memory_utilization - hpu_memory_margin = free_hpu_memory * ( - 1 - self.cache_config.gpu_memory_utilization) - self.model_runner.mem_margin = hpu_memory_margin - cache_size_bytes = available_hpu_memory * graph_headroom - graph_headroom_bytes = available_hpu_memory * (1 - graph_headroom) - msg = ( - f"Free device memory: {format_bytes(free_hpu_memory)}, " - f"{format_bytes(available_hpu_memory)} usable " - f"(gpu_memory_utilization={self.cache_config.gpu_memory_utilization})," - f" {format_bytes(graph_headroom_bytes)} reserved for HPUGraphs " - f"(VLLM_GRAPH_RESERVED_MEM={graph_reserved_mem}), " - f"{format_bytes(cache_size_bytes)} reserved for KV cache") - logger.info(msg) - num_hpu_blocks = int(cache_size_bytes // cache_block_size) - num_cpu_blocks = int(self.cache_config.swap_space_bytes // - cache_block_size) - num_hpu_blocks = max(num_hpu_blocks, 0) - num_cpu_blocks = max(num_cpu_blocks, 0) - self.model_runner.bucketing_ctx.num_hpu_blocks = num_hpu_blocks - - if self.model_runner.lora_manager: - self.model_runner.remove_all_loras() - - gc.collect() - return num_hpu_blocks, num_cpu_blocks - - def initialize_cache(self, num_gpu_blocks: int, - num_cpu_blocks: int) -> None: - """Allocate GPU and CPU KV cache with the specified number of blocks. - - This also warms up the model, which may record CUDA graphs. - """ - raise_if_cache_size_invalid( - num_gpu_blocks, self.cache_config.block_size, - self.model_config.max_model_len, - self.parallel_config.pipeline_parallel_size) - - self.cache_config.num_gpu_blocks = num_gpu_blocks - self.cache_config.num_cpu_blocks = num_cpu_blocks - - with HabanaMemoryProfiler() as m: - self._init_cache_engine() - torch.hpu.synchronize() - msg = ("Initializing cache engine " - f"took {m.get_summary_string()}") - logger.info(msg) - self._warm_up_model() - - def _init_cache_engine(self): - assert self.cache_config.num_gpu_blocks is not None - self.cache_engine = [ - HPUCacheEngine(self.cache_config, self.model_config, - self.parallel_config, self.device_config) - for _ in range(self.parallel_config.pipeline_parallel_size) - ] - self.hpu_cache = [ - self.cache_engine[ve].gpu_cache - for ve in range(self.parallel_config.pipeline_parallel_size) - ] - bind_kv_cache(self.compilation_config.static_forward_context, - self.hpu_cache) - - def _warm_up_model(self) -> None: - # NOTE(kzawora): We should use virtual engine index here - # for pipeline parallelism. Using 0 for now. - assert self.hpu_cache is not None - self.model_runner.warmup_model(self.hpu_cache[0]) - # Reset the seed to ensure that the random state is not affected by - # the model initialization and profiling. 
- set_random_seed(self.model_config.seed) - - def finish_measurements(self): - self.model_runner.finish_measurements() - - @property - def do_metadata_broadcast(self) -> bool: - return self.parallel_config.tensor_parallel_size > 1 - - @property - def kv_cache(self) -> Optional[List[List[torch.Tensor]]]: - return self.hpu_cache - - @torch.inference_mode() - def prepare_worker_input( - self, execute_model_req: ExecuteModelRequest) -> WorkerInput: - virtual_engine = execute_model_req.virtual_engine - num_seq_groups = len(execute_model_req.seq_group_metadata_list) - # `blocks_to_swap_in` and `blocks_to_swap_out` are cpu tensors. - # they contain parameters to launch cudamemcpyasync. - blocks_to_swap_in = torch.tensor(execute_model_req.blocks_to_swap_in, - device="cpu", - dtype=torch.int64).view(-1, 2) - blocks_to_swap_out = torch.tensor(execute_model_req.blocks_to_swap_out, - device="cpu", - dtype=torch.int64).view(-1, 2) - # `blocks_to_copy` is a gpu tensor. The src and tgt of - # blocks to copy are in the same device, and `blocks_to_copy` - # can be used directly within cuda kernels. - blocks_to_copy = torch.tensor(execute_model_req.blocks_to_copy, - device=self.device, - dtype=torch.int64).view(-1, 2) - - return WorkerInput( - num_seq_groups=num_seq_groups, - blocks_to_swap_in=blocks_to_swap_in, - blocks_to_swap_out=blocks_to_swap_out, - blocks_to_copy=blocks_to_copy, - virtual_engine=virtual_engine, - ) - - @torch.inference_mode() - def execute_worker(self, worker_input: WorkerInput) -> None: - virtual_engine = worker_input.virtual_engine - # Issue cache operations. - if (worker_input.blocks_to_swap_in is not None - and worker_input.blocks_to_swap_in.numel() > 0): - self.cache_engine[virtual_engine].swap_in( - worker_input.blocks_to_swap_in) - if (worker_input.blocks_to_swap_out is not None - and worker_input.blocks_to_swap_out.numel() > 0): - self.cache_engine[virtual_engine].swap_out( - worker_input.blocks_to_swap_out) - if (worker_input.blocks_to_copy is not None - and worker_input.blocks_to_copy.numel() > 0): - self.cache_engine[virtual_engine].copy(worker_input.blocks_to_copy) - - def add_lora(self, lora_request: LoRARequest) -> bool: - return self.model_runner.add_lora(lora_request) - - def remove_lora(self, lora_id: int) -> bool: - return self.model_runner.remove_lora(lora_id) - - def pin_lora(self, lora_id: int) -> bool: - return self.model_runner.pin_lora(lora_id) - - def list_loras(self) -> Set[int]: - return self.model_runner.list_loras() - - def add_prompt_adapter( - self, prompt_adapter_request: PromptAdapterRequest) -> bool: - raise NotImplementedError( - "Prompt Adapter is not implemented for HPU backend.") - - def remove_prompt_adapter(self, prompt_adapter_id: int) -> bool: - raise NotImplementedError( - "Prompt Adapter is not implemented for HPU backend.") - - def pin_prompt_adapter(self, prompt_adapter_id: int) -> bool: - raise NotImplementedError( - "Prompt Adapter is not implemented for HPU backend.") - - def list_prompt_adapters(self) -> Set[int]: - raise NotImplementedError( - "Prompt Adapter is not implemented for HPU backend.") - - def shutdown_inc(self): - self.model_runner.shutdown_inc() - - @property - def max_model_len(self) -> int: - return self.model_config.max_model_len - - @property - def vocab_size(self) -> int: - return self.model_runner.vocab_size - - def get_cache_block_size_bytes(self) -> int: - """Get the size of the KV cache block size in bytes. 
- """ - return HPUCacheEngine.get_cache_block_size(self.cache_config, - self.model_config, - self.parallel_config) - - -def init_worker_distributed_environment( - parallel_config: ParallelConfig, - rank: int, - distributed_init_method: Optional[str] = None, - local_rank: int = -1, -) -> None: - """Initialize the distributed environment.""" - init_distributed_environment(parallel_config.world_size, - rank, - distributed_init_method, - local_rank, - backend=current_platform.dist_backend) - - ensure_model_parallel_initialized(parallel_config.tensor_parallel_size, - parallel_config.pipeline_parallel_size) - - if torch.distributed.is_initialized(): - torch_world_size = torch.distributed.get_world_size() - if torch_world_size != parallel_config.world_size: - raise RuntimeError( - "torch.distributed is already initialized but the torch world " - "size does not match parallel_config.world_size " - f"({torch_world_size} vs. {parallel_config.world_size}).") - elif not distributed_init_method: - raise ValueError( - "distributed_init_method must be set if torch.distributed " - "is not already initialized") - else: - torch.distributed.init_process_group( - backend="hccl", - world_size=parallel_config.world_size, - rank=rank, - init_method=distributed_init_method, - ) - - # A small all_reduce for warmup & checking conformance. - dummy_tensor_hpu = torch.ones(1).to('hpu') - torch.distributed.all_reduce(dummy_tensor_hpu) - assert dummy_tensor_hpu.item() == parallel_config.world_size - ensure_model_parallel_initialized(parallel_config.tensor_parallel_size, - parallel_config.pipeline_parallel_size) - - -def raise_if_cache_size_invalid(num_gpu_blocks, block_size, max_model_len, - pipeline_parallel_size) -> None: - if num_gpu_blocks <= 0: - raise ValueError("No available memory for the cache blocks. " - "Try increasing `gpu_memory_utilization` when " - "initializing the engine.") - max_seq_len = block_size * (num_gpu_blocks // pipeline_parallel_size) - if max_model_len > max_seq_len: - raise ValueError( - f"The model's max seq len ({max_model_len}) " - "is larger than the maximum number of tokens that can be " - f"stored in KV cache ({max_seq_len}). Try increasing " - "`gpu_memory_utilization` or decreasing `max_model_len` when " - "initializing the engine.") - - -class HPUCacheEngine(CacheEngine): - - def _allocate_kv_cache( - self, - num_blocks: int, - device: str, - ) -> List[Tuple[torch.Tensor, torch.Tensor]]: - """Allocates KV cache on the specified device.""" - kv_cache_shape = self.attn_backend.get_kv_cache_shape( - num_blocks, self.block_size, self.num_kv_heads, self.head_size) - kv_cache: List[Tuple[torch.Tensor, torch.Tensor]] = [] - for _ in range(self.num_attention_layers): - key_cache = torch.zeros(kv_cache_shape, - dtype=self.dtype, - device=device) - value_cache = torch.zeros(kv_cache_shape, - dtype=self.dtype, - device=device) - kv_layer = (key_cache, value_cache) - kv_cache.append(kv_layer) - return kv_cache diff --git a/vllm/worker/multi_step_hpu_worker.py b/vllm/worker/multi_step_hpu_worker.py deleted file mode 100644 index f0210c13c75..00000000000 --- a/vllm/worker/multi_step_hpu_worker.py +++ /dev/null @@ -1,123 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -############################################################################### -# Copyright (C) 2025 Habana Labs, Ltd. 
an Intel Company -############################################################################### - -import dataclasses -from typing import Dict, Optional, Tuple - -import torch - -from vllm.distributed import broadcast_tensor_dict -from vllm.sequence import ExecuteModelRequest -from vllm.worker.hpu_model_runner import ModelInputForHPU -from vllm.worker.hpu_worker import HPUWorker -from vllm.worker.worker_base import WorkerInput - - -class MultiStepHPUWorker(HPUWorker): - - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.cached_model_input: Optional[ModelInputForHPU] = None - - def _get_driver_input_and_broadcast( - self, execute_model_req: ExecuteModelRequest - ) -> Tuple[ModelInputForHPU, WorkerInput, Dict[str, torch.Tensor]]: - """ - Get the driver input and broadcast it to other workers. - """ - assert self.is_driver_worker - assert execute_model_req.virtual_engine == 0 - - is_first_multi_step = execute_model_req.is_first_multi_step - is_last_step = execute_model_req.is_last_step - - if is_first_multi_step: - # on first step we prepare the worker input and model input normally - worker_input: WorkerInput = self.prepare_worker_input( - execute_model_req=execute_model_req) - worker_input = dataclasses.replace( - worker_input, - num_steps=execute_model_req.num_lookahead_slots + 1) - model_input: ModelInputForHPU = ( - self.model_runner.prepare_model_input( - execute_model_req.seq_group_metadata_list, - execute_model_req.virtual_engine, - execute_model_req.finished_requests_ids)) - - if execute_model_req.async_callback: - model_input = dataclasses.replace( - model_input, - async_callback=execute_model_req.async_callback) - else: - # on subsequent steps we reuse the worker input and model input - assert self.cached_model_input is not None - model_input = self.cached_model_input - worker_input = WorkerInput() - - model_input = dataclasses.replace( - model_input, - is_first_multi_step=is_first_multi_step, - is_last_step=is_last_step) - - if self.do_metadata_broadcast: - if is_first_multi_step: - broadcast_data = worker_input.as_broadcastable_tensor_dict() - broadcast_data.update( - model_input.as_broadcastable_tensor_dict()) - broadcast_tensor_dict(broadcast_data, src=0) - else: - broadcast_data = { - "is_first_multi_step": is_first_multi_step, - "is_last_step": is_last_step, - } - broadcast_tensor_dict(broadcast_data, src=0) - - # Returning empty dict here to keep this compatible with - # `LocalOrDistributedWorkerBase._get_driver_input_and_broadcast` - return model_input, worker_input, {} - - def prepare_input( - self, - execute_model_req: Optional[ExecuteModelRequest] = None, - ) -> Optional[Tuple[ModelInputForHPU, WorkerInput, Dict[str, - torch.Tensor]]]: - if self.is_driver_worker: - if execute_model_req is None: - if self.do_metadata_broadcast: - # This signals that there's no more requests to process for - # now. All workers are running infinite loop with - # broadcast_tensor_dict, and it stops the loop when the - # driver broadcasts an empty input. Send an empty input to - # notify all other workers to stop their execution loop. 
- broadcast_tensor_dict({}, src=0) - return None - model_input, worker_input, _ = self._get_driver_input_and_broadcast( - execute_model_req) - if model_input.is_first_multi_step: - self.cached_model_input = model_input - return model_input, worker_input, {} - else: - broadcast_data = broadcast_tensor_dict(src=0) - if not broadcast_data: - return None - - if len(broadcast_data) == 2: - assert self.cached_model_input is not None - self.cached_model_input = dataclasses.replace( - self.cached_model_input, - is_first_multi_step=broadcast_data["is_first_multi_step"], - is_last_step=broadcast_data["is_last_step"]) - empty_worker_input = WorkerInput() - return self.cached_model_input, empty_worker_input, {} - - worker_input = WorkerInput.from_broadcasted_tensor_dict( - broadcast_data) - model_input = ( - self.model_runner. - make_model_input_from_broadcasted_tensor_dict(broadcast_data)) - self.cached_model_input = model_input - return model_input, worker_input, {} From 81be953d98749d019a7875b59b80eb539e55b03a Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Thu, 17 Jul 2025 20:19:46 -0400 Subject: [PATCH 162/552] [Log] Debugging Log with more Information (#20770) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- .../layers/fused_moe/cutlass_moe.py | 26 ++++++++++++------- .../layers/fused_moe/deep_gemm_moe.py | 24 ++++++++++++++--- 2 files changed, 37 insertions(+), 13 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/cutlass_moe.py b/vllm/model_executor/layers/fused_moe/cutlass_moe.py index 978c5322362..a1f87ba92a5 100644 --- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py +++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py @@ -571,34 +571,42 @@ def _valid_cutlass_block_scaled_grouped_gemm_shape(N: int, K: int): _, K, N = w2.size() if not _valid_cutlass_block_scaled_grouped_gemm_shape(N, K): - logger.debug( - "CutlassBlockScaledGroupedGemm disabled: unalinged problem size.") + logger.debug_once( + "CutlassBlockScaledGroupedGemm disabled: unaligned problem size. " + "N: %s, K: %s", + N, + K, + ) return False if (w1.dtype != torch.float8_e4m3fn or w2.dtype != torch.float8_e4m3fn): - logger.debug( - "CutlassBlockScaledGroupedGemm disabled: invalid weight dtype(s).") + logger.debug_once( + "CutlassBlockScaledGroupedGemm disabled: invalid weight dtype(s). " + "w1.dtype: %s, w2.dtype: %s", + w1.dtype, + w2.dtype, + ) return False if expert_map is not None: - logger.debug( + logger.debug_once( "CutlassBlockScaledGroupedGemm disabled: expert_parallel is" " not supported.") return False if activation != "silu": - logger.debug( + logger.debug_once( "CutlassBlockScaledGroupedGemm disabled: only activation silu is" " supported.") return False if apply_router_weight_on_input: - logger.debug("CutlassBlockScaledGroupedGemm disabled:" - " apply_router_weight_on_input is not supported.") + logger.debug_once("CutlassBlockScaledGroupedGemm disabled:" + " apply_router_weight_on_input is not supported.") return False if inplace: - logger.debug( + logger.debug_once( "CutlassBlockScaledGroupedGemm disabled: inplace is not supported." 
) return False diff --git a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py index bb462938a39..f0c4ca5e52b 100644 --- a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py @@ -50,17 +50,33 @@ def _valid_deep_gemm(hidden_states: torch.Tensor, w1: torch.Tensor, M = hidden_states.size(0) _, K, N = w2.size() if not _valid_deep_gemm_shape(M, N, K): - logger.debug("DeepGemm disabled: unaligned problem size.") + logger.debug_once( + "DeepGemm disabled: unaligned problem size. M: %s, N: %s, K: %s", + M, + N, + K, + ) return False if (w1.dtype != torch.float8_e4m3fn or w2.dtype != torch.float8_e4m3fn): - logger.debug("DeepGemm disabled: invalid weight dtype(s).") + logger.debug_once( + "DeepGemm disabled: invalid weight dtype(s). " + "w1.dtype: %s, w2.dtype: %s", + w1.dtype, + w2.dtype, + ) return False if (not hidden_states.is_contiguous() or not w1.is_contiguous() or not w2.is_contiguous()): - logger.debug( - "DeepGemm disabled: weights or activations not contiguous.") + logger.debug_once( + "DeepGemm disabled: weights or activations not contiguous. " + "hidden_states.is_contiguous(): %s, w1.is_contiguous(): %s, " + "w2.is_contiguous(): %s", + hidden_states.is_contiguous(), + w1.is_contiguous(), + w2.is_contiguous(), + ) return False return True From 002e126fcd9a2c67e2571ee77fc36f00f272c684 Mon Sep 17 00:00:00 2001 From: elvischenv <219235043+elvischenv@users.noreply.github.com> Date: Fri, 18 Jul 2025 08:35:58 +0800 Subject: [PATCH 163/552] [Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel (#21133) Signed-off-by: x22x22 --- vllm/v1/attention/backends/flashinfer.py | 34 ++++++++++++++++-------- 1 file changed, 23 insertions(+), 11 deletions(-) diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index 1eb27d57acf..2abfb457b84 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -353,8 +353,9 @@ def _plan(self, num_prefills: int, num_decodes: int, attn_metadata.decode_wrapper = self._get_decode_wrapper() if not FlashInferBackend.use_trtllm_decode_attention( num_decodes, attn_metadata.max_seq_len, - attn_metadata.kv_data_type, attn_metadata.num_qo_heads, - attn_metadata.num_kv_heads, attn_metadata.head_dim): + self.cache_config.cache_dtype, + attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, + attn_metadata.head_dim): attn_metadata.decode_wrapper.plan( attn_metadata.paged_kv_indptr[:num_decodes + 1], attn_metadata.paged_kv_indices, @@ -539,10 +540,10 @@ def forward( query: shape = [num_tokens, num_heads, head_size] key: shape = [num_tokens, num_kv_heads, head_size] value: shape = [num_tokens, num_kv_heads, head_size] - kv_cache: shape - + kv_cache: shape - # NHD: [num_blocks, 2, block_size, num_kv_heads, head_size] # HND: [num_blocks, 2, num_kv_heads, block_size, head_size] - + attn_metadata: Metadata for attention. Returns: @@ -614,6 +615,7 @@ def forward( num_prefill_tokens = attn_metadata.num_prefill_tokens stride_order = FlashInferBackend.get_kv_cache_stride_order() + kv_cache_permute = kv_cache.permute(*stride_order) # Regular attention (common case). 
# Decodes are at the front and prefills are at the back, # according to reorder_batch() @@ -628,7 +630,7 @@ def forward( assert prefill_wrapper._sm_scale == self.scale prefill_wrapper.run( prefill_query, - kv_cache.permute(*stride_order), + kv_cache_permute, k_scale=layer._k_scale_float, v_scale=layer._v_scale_float, out=output[num_decode_tokens:], @@ -647,7 +649,7 @@ def forward( assert decode_wrapper._sm_scale == self.scale decode_wrapper.run( decode_query, - kv_cache.permute(*stride_order), + kv_cache_permute, k_scale=layer._k_scale_float, v_scale=layer._v_scale_float, out=output[:num_decode_tokens], @@ -655,19 +657,29 @@ def forward( else: # This path needs to be enabled with VLLM_KV_CACHE_LAYOUT = HND if num_decode_tokens > 0: + # decode_query may be non-contiguous + decode_query = decode_query.contiguous() + block_tables_decode = attn_metadata.block_table_tensor[: + num_decode_tokens] + seq_lens_decode = attn_metadata.seq_lens[: + num_decode_tokens] + assert get_kv_cache_layout() == "HND" + assert decode_query.is_contiguous() + assert kv_cache_permute.is_contiguous() + assert block_tables_decode.is_contiguous() + assert seq_lens_decode.is_contiguous() + output[:num_decode_tokens] = ( trtllm_batch_decode_with_kv_cache( query=decode_query, - kv_cache=kv_cache.permute(*stride_order), + kv_cache=kv_cache_permute, workspace_buffer=attn_metadata.workspace_buffer, num_heads=self.num_heads, num_kv_heads=self.num_kv_heads, scale=self.scale, - block_tables=attn_metadata. - block_table_tensor[:num_decode_tokens], - seq_lens=attn_metadata. - seq_lens[:num_decode_tokens], + block_tables=block_tables_decode, + seq_lens=seq_lens_decode, block_size=attn_metadata.page_size, max_seq_len=attn_metadata.max_seq_len, kv_cache_dtype=self.kv_cache_dtype, From da4dd19cebee001b45f8cfe5ca8f15ff6df988d1 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 17 Jul 2025 20:09:19 -0700 Subject: [PATCH 164/552] [Docs] Add minimal demo of Ray Data API usage (#21080) Signed-off-by: Ricardo Decal Signed-off-by: x22x22 --- docs/serving/offline_inference.md | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/docs/serving/offline_inference.md b/docs/serving/offline_inference.md index 4ec879e0bc8..ddda4769000 100644 --- a/docs/serving/offline_inference.md +++ b/docs/serving/offline_inference.md @@ -30,8 +30,31 @@ This API adds several batteries-included capabilities that simplify large-scale, - Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance. - Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization. - Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference. - -The following example shows how to run batched inference with Ray Data and vLLM: - +- Reading and writing to most popular file formats and cloud object storage. +- Scaling up the workload without code changes. + +??? 
code + + ```python + import ray # Requires ray>=2.44.1 + from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor + + config = vLLMEngineProcessorConfig(model_source="unsloth/Llama-3.2-1B-Instruct") + processor = build_llm_processor( + config, + preprocess=lambda row: { + "messages": [ + {"role": "system", "content": "You are a bot that completes unfinished haikus."}, + {"role": "user", "content": row["item"]}, + ], + "sampling_params": {"temperature": 0.3, "max_tokens": 250}, + }, + postprocess=lambda row: {"answer": row["generated_text"]}, + ) + + ds = ray.data.from_items(["An old silent pond..."]) + ds = processor(ds) + ds.write_parquet("local:///tmp/data/") + ``` For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html). From 4f9f0419e5f73f051f2f17ca48de868e5f2b8def Mon Sep 17 00:00:00 2001 From: Lucia Fang <116399278+luccafong@users.noreply.github.com> Date: Fri, 18 Jul 2025 11:12:13 +0800 Subject: [PATCH 165/552] [Docs] Update supported models documentation with missing models (#20844) Signed-off-by: Lu Fang Signed-off-by: x22x22 --- docs/models/supported_models.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index fc304fb6fd5..e7ceca81087 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -331,6 +331,7 @@ Specified using `--task generate`. | `Ernie4_5_ForCausalLM` | Ernie4.5 | `baidu/ERNIE-4.5-0.3B-PT`, etc. | | ✅︎ | ✅︎ | | `Ernie4_5_MoeForCausalLM` | Ernie4.5MoE | `baidu/ERNIE-4.5-21B-A3B-PT`, `baidu/ERNIE-4.5-300B-A47B-PT`, etc. | | ✅︎ | ✅︎ | | `ExaoneForCausalLM` | EXAONE-3 | `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `Fairseq2LlamaForCausalLM` | Llama (fairseq2 format) | `mgleize/fairseq2-dummy-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `FalconForCausalLM` | Falcon | `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc. | | ✅︎ | ✅︎ | | `FalconMambaForCausalLM` | FalconMamba | `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc. | | ✅︎ | ✅︎ | | `FalconH1ForCausalLM` | Falcon-H1 | `tiiuae/Falcon-H1-34B-Base`, `tiiuae/Falcon-H1-34B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | @@ -359,6 +360,7 @@ Specified using `--task generate`. | `LlamaForCausalLM` | Llama 3.1, Llama 3, Llama 2, LLaMA, Yi | `meta-llama/Meta-Llama-3.1-405B-Instruct`, `meta-llama/Meta-Llama-3.1-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `01-ai/Yi-34B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `MambaForCausalLM` | Mamba | `state-spaces/mamba-130m-hf`, `state-spaces/mamba-790m-hf`, `state-spaces/mamba-2.8b-hf`, etc. | | ✅︎ | | | `Mamba2ForCausalLM` | Mamba2 | `mistralai/Mamba-Codestral-7B-v0.1`, etc. | | ✅︎ | ✅︎ | +| `MiMoForCausalLM` | MiMo | `XiaomiMiMo/MiMo-7B-RL`, etc. | ✅︎ | ✅︎ | ✅︎ | | `MiniCPMForCausalLM` | MiniCPM | `openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, `openbmb/MiniCPM-S-1B-sft`, etc. | ✅︎ | ✅︎ | ✅︎ | | `MiniCPM3ForCausalLM` | MiniCPM3 | `openbmb/MiniCPM3-4B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `MistralForCausalLM` | Mistral, Mistral-Instruct | `mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc. 
| ✅︎ | ✅︎ | ✅︎ | From 201d00a0369c0c328a9f6175e5b094f05a10f41d Mon Sep 17 00:00:00 2001 From: Lucas Wilkinson Date: Fri, 18 Jul 2025 00:10:42 -0400 Subject: [PATCH 166/552] [Attention] Make local attention backend agnostic (#21093) Signed-off-by: x22x22 --- vllm/v1/attention/backends/flash_attn.py | 84 ++--------------- vllm/v1/attention/backends/flashinfer.py | 5 +- vllm/v1/attention/backends/rocm_aiter_fa.py | 97 ++------------------ vllm/v1/attention/backends/triton_attn.py | 68 ++------------ vllm/v1/attention/backends/utils.py | 30 ++++-- vllm/v1/core/single_type_kv_cache_manager.py | 10 +- vllm/v1/kv_cache_interface.py | 15 +++ vllm/v1/worker/gpu_model_runner.py | 27 +++++- 8 files changed, 94 insertions(+), 242 deletions(-) diff --git a/vllm/v1/attention/backends/flash_attn.py b/vllm/v1/attention/backends/flash_attn.py index 4224d807c2b..d5b30ac685a 100755 --- a/vllm/v1/attention/backends/flash_attn.py +++ b/vllm/v1/attention/backends/flash_attn.py @@ -25,9 +25,9 @@ from vllm.config import VllmConfig, get_layers_from_vllm_config from vllm.logger import init_logger from vllm.utils import cdiv -from vllm.v1.attention.backends.utils import ( - AttentionMetadataBuilder, CommonAttentionMetadata, get_kv_cache_layout, - make_local_attention_virtual_batches) +from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, + CommonAttentionMetadata, + get_kv_cache_layout) from vllm.v1.kv_cache_interface import AttentionSpec logger = init_logger(__name__) @@ -130,18 +130,6 @@ class FlashAttentionMetadata: prefix_scheduler_metadata: Optional[torch.Tensor] = None max_num_splits: int = 0 - # for local attention - @dataclass - class LocalAttentionMetadata: - local_query_start_loc: torch.Tensor - local_seqused_k: torch.Tensor - local_block_table: torch.Tensor - local_max_query_len: int - local_max_seq_len: int - local_scheduler_metadata: Optional[torch.Tensor] - - local_attn_metadata: Optional[LocalAttentionMetadata] = None - def _get_sliding_window_configs( vllm_config: VllmConfig) -> set[Optional[tuple[int, int]]]: @@ -221,7 +209,6 @@ def build(self, max_query_len = common_attn_metadata.max_query_len max_seq_len = int(common_attn_metadata.seq_lens_cpu.max()) query_start_loc = common_attn_metadata.query_start_loc - query_start_loc_cpu = common_attn_metadata.query_start_loc_cpu seq_lens = common_attn_metadata.seq_lens seq_lens_cpu = common_attn_metadata.seq_lens_cpu block_table_tensor = common_attn_metadata.block_table_tensor @@ -266,40 +253,6 @@ def schedule(batch_size, cu_query_lens, max_query_len, seqlens, ) return None - # for local attention - local_attn_metadata = None - if self.model_config.attention_chunk_size is not None: - seqlens_q_local_np, virt_q_cu_seqlens_np, virt_k_seqlens_np, \ - virt_block_table_tensor = make_local_attention_virtual_batches( - self.model_config.attention_chunk_size, - query_start_loc_cpu.numpy(), - seq_lens_cpu.numpy(), - block_table_tensor, - self.block_size, - ) - local_query_start_loc = torch.from_numpy(virt_q_cu_seqlens_np).to( - self.device, non_blocking=True) - local_seqused_k = torch.from_numpy(virt_k_seqlens_np).to( - self.device, non_blocking=True) - local_max_query_len = seqlens_q_local_np.max() - local_max_seq_len = virt_k_seqlens_np.max() - local_scheduler_metadata = schedule( - batch_size=local_query_start_loc.shape[0] - 1, - cu_query_lens=local_query_start_loc, - max_query_len=local_max_query_len, - seqlens=local_seqused_k, - max_seq_len=local_max_seq_len, - causal=True) - - local_attn_metadata = 
FlashAttentionMetadata.LocalAttentionMetadata( - local_query_start_loc=local_query_start_loc, - local_seqused_k=local_seqused_k, - local_block_table=virt_block_table_tensor, - local_max_query_len=local_max_query_len, - local_max_seq_len=local_max_seq_len, - local_scheduler_metadata=local_scheduler_metadata, - ) - use_cascade = common_prefix_len > 0 if use_cascade: @@ -371,7 +324,6 @@ def schedule(batch_size, cu_query_lens, max_query_len, seqlens, cu_prefix_query_lens=cu_prefix_query_lens, prefix_kv_lens=prefix_kv_lens, suffix_kv_lens=suffix_kv_lens, - local_attn_metadata=local_attn_metadata, prefix_scheduler_metadata=prefix_scheduler_metadata, max_num_splits=max_num_splits, ) @@ -517,27 +469,13 @@ def forward( layer._q_scale) query = query.reshape((num_tokens, num_heads, head_size)) - # Compute attention and update output up to `num_actual_tokens`. - use_local_attn = \ - (self.use_irope and attn_metadata.local_attn_metadata is not None) - - if not attn_metadata.use_cascade or use_local_attn: - if use_local_attn: - assert attn_metadata.local_attn_metadata is not None - local_metadata = attn_metadata.local_attn_metadata - cu_seqlens_q = local_metadata.local_query_start_loc - seqused_k = local_metadata.local_seqused_k - max_seqlen_q = local_metadata.local_max_query_len - max_seqlen_k = local_metadata.local_max_seq_len - block_table = local_metadata.local_block_table - scheduler_metadata = local_metadata.local_scheduler_metadata - else: - cu_seqlens_q = attn_metadata.query_start_loc - seqused_k = attn_metadata.seq_lens - max_seqlen_q = attn_metadata.max_query_len - max_seqlen_k = attn_metadata.max_seq_len - block_table = attn_metadata.block_table - scheduler_metadata = attn_metadata.scheduler_metadata + if not attn_metadata.use_cascade: + cu_seqlens_q = attn_metadata.query_start_loc + seqused_k = attn_metadata.seq_lens + max_seqlen_q = attn_metadata.max_query_len + max_seqlen_k = attn_metadata.max_seq_len + block_table = attn_metadata.block_table + scheduler_metadata = attn_metadata.scheduler_metadata descale_shape = (cu_seqlens_q.shape[0] - 1, key.shape[1]) @@ -565,8 +503,6 @@ def forward( ) return output - assert not use_local_attn, ( - "Cascade attention does not support local attention.") # Cascade attention (rare case). 
cascade_attention( output[:num_actual_tokens], diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index 2abfb457b84..7f3c4ed129c 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -496,10 +496,6 @@ def __init__( kv_sharing_target_layer_name: Optional[int] = None, use_irope: bool = False, ) -> None: - if use_irope: - logger.warning_once( - "Using irope in FlashInfer is not supported yet, it will fall" - " back to global attention for long context.") self.num_heads = num_heads self.head_size = head_size self.scale = float(scale) @@ -514,6 +510,7 @@ def __init__( self.kv_cache_dtype = kv_cache_dtype self.logits_soft_cap = logits_soft_cap self.kv_sharing_target_layer_name = kv_sharing_target_layer_name + self.use_irope = use_irope self.num_queries_per_kv = self.num_heads // self.num_kv_heads diff --git a/vllm/v1/attention/backends/rocm_aiter_fa.py b/vllm/v1/attention/backends/rocm_aiter_fa.py index 46802bf5c2a..43fe30a9a89 100644 --- a/vllm/v1/attention/backends/rocm_aiter_fa.py +++ b/vllm/v1/attention/backends/rocm_aiter_fa.py @@ -13,8 +13,6 @@ from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.platforms import current_platform -from vllm.v1.attention.backends.flash_attn import ( - make_local_attention_virtual_batches) from vllm.v1.attention.backends.utils import CommonAttentionMetadata from vllm.v1.kv_cache_interface import AttentionSpec @@ -201,9 +199,7 @@ def build(self, max_seq_len = int(common_attn_metadata.seq_lens_cpu.max()) total_tokens = int(common_attn_metadata.seq_lens_cpu.sum()) query_start_loc = common_attn_metadata.query_start_loc - query_start_loc_cpu = common_attn_metadata.query_start_loc_cpu seq_lens = common_attn_metadata.seq_lens - seq_lens_cpu = common_attn_metadata.seq_lens_cpu block_table_tensor = common_attn_metadata.block_table_tensor slot_mapping = common_attn_metadata.slot_mapping @@ -215,56 +211,6 @@ def build(self, dtype=cu_seq_lens.dtype, out=cu_seq_lens[1:]) - def schedule(batch_size, cu_query_lens, max_query_len, seqlens, - max_seq_len, causal): - return None - - # for local attention - local_attn_metadata = None - if self.model_config.attention_chunk_size is not None: - seqlens_q_local_np, virt_q_cu_seqlens_np, virt_k_seqlens_np, \ - virt_block_table_tensor = make_local_attention_virtual_batches( - self.model_config.attention_chunk_size, - query_start_loc_cpu.numpy(), - seq_lens_cpu.numpy(), - block_table_tensor, - self.block_size, - ) - local_query_start_loc = torch.from_numpy(virt_q_cu_seqlens_np).to( - self.device, non_blocking=True) - local_seqused_k = torch.from_numpy(virt_k_seqlens_np).to( - self.device, non_blocking=True) - local_max_query_len = seqlens_q_local_np.max().item() - local_max_seq_len = virt_k_seqlens_np.max().item() - local_scheduler_metadata = schedule( - batch_size=local_query_start_loc.shape[0] - 1, - cu_query_lens=local_query_start_loc, - max_query_len=local_max_query_len, - seqlens=local_seqused_k, - max_seq_len=local_max_seq_len, - causal=True) - - local_cu_seq_lens = torch.zeros(virt_k_seqlens_np.shape[0] + 1, - dtype=torch.int32, - device=self.device) - local_cu_seq_lens[1:] = torch.cumsum( - torch.from_numpy(virt_k_seqlens_np).to(device=self.device, - dtype=torch.int32, - non_blocking=True), - dim=0) - - - local_attn_metadata = \ - AiterFlashAttentionMetadata.LocalAttentionMetadata( - local_query_start_loc=local_query_start_loc, - local_seqused_k=local_seqused_k, - 
local_block_table=virt_block_table_tensor, - local_max_query_len=local_max_query_len, - local_max_seq_len=local_max_seq_len, - local_cu_seq_lens=local_cu_seq_lens, - local_scheduler_metadata=local_scheduler_metadata, - ) - use_cascade = common_prefix_len > 0 cu_prefix_query_lens = None @@ -286,7 +232,6 @@ def schedule(batch_size, cu_query_lens, max_query_len, seqlens, cu_prefix_query_lens=cu_prefix_query_lens, prefix_kv_lens=prefix_kv_lens, suffix_kv_lens=suffix_kv_lens, - local_attn_metadata=local_attn_metadata, ) return attn_metadata @@ -377,19 +322,6 @@ class AiterFlashAttentionMetadata: prefix_kv_lens: Optional[torch.Tensor] suffix_kv_lens: Optional[torch.Tensor] - # for local attention - @dataclass - class LocalAttentionMetadata: - local_query_start_loc: torch.Tensor - local_seqused_k: torch.Tensor - local_block_table: torch.Tensor - local_max_query_len: int - local_max_seq_len: int - local_cu_seq_lens: torch.Tensor - local_scheduler_metadata: Optional[torch.Tensor] - - local_attn_metadata: Optional[LocalAttentionMetadata] = None - class AiterFlashAttentionImpl(AttentionImpl): @@ -521,25 +453,12 @@ def forward( layer._q_scale) query = query.reshape((num_tokens, num_heads, head_size)) - # Compute attention and update output up to `num_actual_tokens`. - use_local_attn = \ - (self.use_irope and attn_metadata.local_attn_metadata is not None) - - if not attn_metadata.use_cascade or use_local_attn: - if use_local_attn: - assert attn_metadata.local_attn_metadata is not None - local_metadata = attn_metadata.local_attn_metadata - cu_seqlens_q = local_metadata.local_query_start_loc - seqused_k = local_metadata.local_seqused_k - max_seqlen_q = local_metadata.local_max_query_len - max_seqlen_k = local_metadata.local_max_seq_len - block_table = local_metadata.local_block_table - else: - cu_seqlens_q = attn_metadata.query_start_loc - seqused_k = attn_metadata.seq_lens - max_seqlen_q = attn_metadata.max_query_len - max_seqlen_k = attn_metadata.max_seq_len - block_table = attn_metadata.block_table + if not attn_metadata.use_cascade: + cu_seqlens_q = attn_metadata.query_start_loc + seqused_k = attn_metadata.seq_lens + max_seqlen_q = attn_metadata.max_query_len + max_seqlen_k = attn_metadata.max_seq_len + block_table = attn_metadata.block_table if max_seqlen_q > 1: cu_seq_lens = attn_metadata.cu_seq_lens @@ -557,9 +476,7 @@ def forward( alibi_slopes=self.alibi_slopes, window_size=self.sliding_window, block_table=block_table, - cu_seqlens_k=(cu_seq_lens if not use_local_attn else - local_metadata.local_cu_seq_lens), - ) + cu_seqlens_k=cu_seq_lens) _, num_heads, head_size = query.shape _PARTITION_SIZE_ROCM = 256 diff --git a/vllm/v1/attention/backends/triton_attn.py b/vllm/v1/attention/backends/triton_attn.py index ee95b5af6e4..79796ac1492 100644 --- a/vllm/v1/attention/backends/triton_attn.py +++ b/vllm/v1/attention/backends/triton_attn.py @@ -18,9 +18,8 @@ from vllm.logger import init_logger from vllm.platforms import current_platform from vllm.v1.attention.backends.flash_attn import FlashAttentionMetadata -from vllm.v1.attention.backends.utils import ( - AttentionMetadataBuilder, CommonAttentionMetadata, - make_local_attention_virtual_batches) +from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, + CommonAttentionMetadata) from vllm.v1.kv_cache_interface import AttentionSpec logger = init_logger(__name__) @@ -55,18 +54,6 @@ class TritonAttentionMetadata: scheduler_metadata: Optional[torch.Tensor] = None prefix_scheduler_metadata: Optional[torch.Tensor] = None - # for local attention 
- @dataclass - class LocalAttentionMetadata: - local_query_start_loc: torch.Tensor - local_seqused_k: torch.Tensor - local_block_table: torch.Tensor - local_max_query_len: int - local_max_seq_len: int - local_scheduler_metadata: Optional[torch.Tensor] - - local_attn_metadata: Optional[LocalAttentionMetadata] = None - class TritonAttentionMetadataBuilder( AttentionMetadataBuilder[TritonAttentionMetadata]): @@ -111,34 +98,6 @@ def build(self, block_table_tensor = common_attn_metadata.block_table_tensor slot_mapping = common_attn_metadata.slot_mapping - # for local attention - local_attn_metadata = None - if self.attention_chunk_size is not None: - seqlens_q_local_np, virt_q_cu_seqlens_np, virt_k_seqlens_np, \ - virt_block_table_tensor = make_local_attention_virtual_batches( - self.attention_chunk_size, - common_attn_metadata.query_start_loc_cpu.numpy(), - common_attn_metadata.seq_lens_cpu.numpy(), - block_table_tensor, - self.block_size, - ) - local_query_start_loc = torch.from_numpy(virt_q_cu_seqlens_np).to( - self.device, non_blocking=True) - local_seqused_k = torch.from_numpy(virt_k_seqlens_np).to( - self.device, non_blocking=True) - local_max_query_len = seqlens_q_local_np.max().item() - local_max_seq_len = virt_k_seqlens_np.max().item() - - local_attn_metadata = TritonAttentionMetadata \ - .LocalAttentionMetadata( - local_query_start_loc=local_query_start_loc, - local_seqused_k=local_seqused_k, - local_block_table=virt_block_table_tensor, - local_max_query_len=local_max_query_len, - local_max_seq_len=local_max_seq_len, - local_scheduler_metadata=None, - ) - use_cascade = common_prefix_len > 0 if use_cascade: @@ -170,7 +129,6 @@ def build(self, cu_prefix_query_lens=cu_prefix_query_lens, prefix_kv_lens=prefix_kv_lens, suffix_kv_lens=suffix_kv_lens, - local_attn_metadata=local_attn_metadata, prefix_scheduler_metadata=prefix_scheduler_metadata, ) return attn_metadata @@ -384,23 +342,11 @@ def forward( layer._q_scale) query = query.reshape((num_tokens, num_heads, head_size)) - use_local_attn = \ - (self.use_irope and attn_metadata.local_attn_metadata is not None) - - if use_local_attn: - assert attn_metadata.local_attn_metadata is not None - local_metadata = attn_metadata.local_attn_metadata - cu_seqlens_q = local_metadata.local_query_start_loc - seqused_k = local_metadata.local_seqused_k - max_seqlen_q = local_metadata.local_max_query_len - max_seqlen_k = local_metadata.local_max_seq_len - block_table = local_metadata.local_block_table - else: - cu_seqlens_q = attn_metadata.query_start_loc - seqused_k = attn_metadata.seq_lens - max_seqlen_q = attn_metadata.max_query_len - max_seqlen_k = attn_metadata.max_seq_len - block_table = attn_metadata.block_table + cu_seqlens_q = attn_metadata.query_start_loc + seqused_k = attn_metadata.seq_lens + max_seqlen_q = attn_metadata.max_query_len + max_seqlen_k = attn_metadata.max_seq_len + block_table = attn_metadata.block_table if use_prefill_decode_attn: # Compute attention and update output up to `num_actual_tokens`. 
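The utils.py diff that follows reworks `make_local_attention_virtual_batches` so that it consumes and returns a `CommonAttentionMetadata`, letting any backend reuse its ordinary causal attention path for chunked local attention. The snippet below is a hand-written sketch of the virtual-batch idea only (it is not the vLLM implementation); it assumes a single pure-prefill request of 10 tokens and an attention chunk size of 4.

# Illustrative sketch: each chunk of the request becomes its own "virtual
# batch", so a kernel that only implements plain causal attention still
# produces the chunk-local attention pattern.
attn_chunk_size = 4
seq_len = 10

# Keys per virtual batch: the request is cut into ceil(10 / 4) = 3 chunks.
seqlens_k_local = [
    min(attn_chunk_size, seq_len - start)
    for start in range(0, seq_len, attn_chunk_size)
]
assert seqlens_k_local == [4, 4, 2]

# With no previously computed tokens, each chunk's queries are exactly its
# keys, so the per-virtual-batch query lengths match the key lengths.
seqlens_q_local = list(seqlens_k_local)
cu_seqlens_q_local = [0]
for q_len in seqlens_q_local:
    cu_seqlens_q_local.append(cu_seqlens_q_local[-1] + q_len)
assert cu_seqlens_q_local == [0, 4, 8, 10]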
diff --git a/vllm/v1/attention/backends/utils.py b/vllm/v1/attention/backends/utils.py index db6eaa55864..b6a06b17bca 100644 --- a/vllm/v1/attention/backends/utils.py +++ b/vllm/v1/attention/backends/utils.py @@ -272,11 +272,14 @@ def infer_global_hyperparameters( # block_table_local : shape[local_virtual_batches, pages_per_local_batch] def make_local_attention_virtual_batches( attn_chunk_size: int, - query_start_loc_np: np.ndarray, - seq_lens_np: np.ndarray, - block_table: torch.Tensor, + common_attn_metadata: CommonAttentionMetadata, block_size: int = 0, -) -> tuple[np.ndarray, np.ndarray, np.ndarray, torch.Tensor]: +) -> CommonAttentionMetadata: + query_start_loc_np = common_attn_metadata.query_start_loc_cpu.numpy() + seq_lens_np = common_attn_metadata.seq_lens_cpu.numpy() + block_table = common_attn_metadata.block_table_tensor + device = common_attn_metadata.query_start_loc.device + q_seqlens = query_start_loc_np[1:] - query_start_loc_np[:-1] actual_batch_size = seq_lens_np.shape[0] @@ -339,6 +342,7 @@ def make_local_attention_virtual_batches( attn_chunk_size, dtype=np.int32) seqlens_k_local[cu_num_blocks - 1] = tokens_in_last_block + num_computed_tokens_local = seqlens_k_local - seqlens_q_local k_seqstarts_absolute = np.repeat(seq_lens_np, local_blocks) - \ (rarange * attn_chunk_size + \ @@ -380,8 +384,22 @@ def make_local_attention_virtual_batches( block_table_local = block_table[batch_indices, block_indices]\ .view(virtual_batches, -1) - return seqlens_q_local, cu_seqlens_q_local, seqlens_k_local, \ - block_table_local + query_start_loc_cpu = torch.from_numpy(cu_seqlens_q_local) + seq_lens_cpu = torch.from_numpy(seqlens_k_local) + + return CommonAttentionMetadata( + query_start_loc_cpu=query_start_loc_cpu, + query_start_loc=query_start_loc_cpu.to(device=device, + non_blocking=True), + seq_lens_cpu=seq_lens_cpu, + seq_lens=seq_lens_cpu.to(device=device, non_blocking=True), + num_computed_tokens_cpu=torch.from_numpy(num_computed_tokens_local), + num_reqs=len(seq_lens_cpu), + num_actual_tokens=common_attn_metadata.num_actual_tokens, + max_query_len=seqlens_q_local.max(), + block_table_tensor=block_table_local, + slot_mapping=common_attn_metadata.slot_mapping, + ) def split_decodes_and_prefills( diff --git a/vllm/v1/core/single_type_kv_cache_manager.py b/vllm/v1/core/single_type_kv_cache_manager.py index 5b471803807..1560406c900 100644 --- a/vllm/v1/core/single_type_kv_cache_manager.py +++ b/vllm/v1/core/single_type_kv_cache_manager.py @@ -7,7 +7,8 @@ from vllm.utils import cdiv from vllm.v1.core.block_pool import BlockPool from vllm.v1.core.kv_cache_utils import BlockHash, KVCacheBlock -from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheSpec, +from vllm.v1.kv_cache_interface import (ChunkedLocalAttentionSpec, + FullAttentionSpec, KVCacheSpec, MambaSpec, SlidingWindowSpec) from vllm.v1.request import Request @@ -256,8 +257,10 @@ def find_longest_cache_hit( kv_cache_spec: KVCacheSpec, use_eagle: bool, ) -> tuple[list[KVCacheBlock], ...]: - assert isinstance(kv_cache_spec, FullAttentionSpec), ( - "FullAttentionManager can only be used for full attention groups") + assert isinstance( + kv_cache_spec, (FullAttentionSpec, ChunkedLocalAttentionSpec) + ), "FullAttentionManager can only be used for full attention " \ + "and chunked local attention groups" computed_blocks: tuple[list[KVCacheBlock], ...] 
= tuple( [] for _ in range(len(kv_cache_group_ids))) max_num_blocks = max_length // kv_cache_spec.block_size @@ -432,6 +435,7 @@ def allocate_new_blocks(self, request_id: str, spec_manager_map: dict[type[KVCacheSpec], type[SingleTypeKVCacheManager]] = { FullAttentionSpec: FullAttentionManager, + ChunkedLocalAttentionSpec: FullAttentionManager, SlidingWindowSpec: SlidingWindowManager, MambaSpec: MambaManager, } diff --git a/vllm/v1/kv_cache_interface.py b/vllm/v1/kv_cache_interface.py index 43456a987de..6726709955f 100644 --- a/vllm/v1/kv_cache_interface.py +++ b/vllm/v1/kv_cache_interface.py @@ -125,6 +125,21 @@ def merge(cls, specs: list[Self]) -> Self: return merged_spec +@dataclass +class ChunkedLocalAttentionSpec(AttentionSpec): + attention_chunk_size: int + + def max_memory_usage_bytes(self, vllm_config: VllmConfig) -> int: + max_model_len = vllm_config.model_config.max_model_len + return cdiv(max_model_len, self.block_size) * self.page_size_bytes + + @property + def type_id(self) -> str: + return ( + f"local_attention_{self.attention_chunk_size}_{self.block_size}_{self.page_size_bytes}" + ) # noqa + + @dataclass class SlidingWindowSpec(AttentionSpec): sliding_window: int diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 29f519393e4..fc7f2538881 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -44,11 +44,14 @@ GiB_bytes, LazyLoader, check_use_alibi, get_dtype_size, is_pin_memory_available, round_up) from vllm.v1.attention.backends.mamba_attn import Mamba2AttentionBackend -from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, - CommonAttentionMetadata) +from vllm.v1.attention.backends.utils import ( + AttentionMetadataBuilder, CommonAttentionMetadata, + make_local_attention_virtual_batches) from vllm.v1.core.encoder_cache_manager import compute_encoder_budget -from vllm.v1.kv_cache_interface import (AttentionSpec, FullAttentionSpec, - KVCacheConfig, KVCacheSpec, MambaSpec, +from vllm.v1.kv_cache_interface import (AttentionSpec, + ChunkedLocalAttentionSpec, + FullAttentionSpec, KVCacheConfig, + KVCacheSpec, MambaSpec, SlidingWindowSpec) from vllm.v1.outputs import (EMPTY_MODEL_RUNNER_OUTPUT, LogprobsTensors, ModelRunnerOutput) @@ -705,6 +708,12 @@ def _prepare_inputs( spec_decode_common_attn_metadata is None: spec_decode_common_attn_metadata = common_attn_metadata + if isinstance(kv_cache_group_spec.kv_cache_spec, + ChunkedLocalAttentionSpec): + common_attn_metadata = make_local_attention_virtual_batches( + kv_cache_group_spec.kv_cache_spec.attention_chunk_size, + common_attn_metadata, self.cache_config.block_size) + # Prepare for cascade attention if enabled & beneficial. 
common_prefix_len = 0 builder = self.attn_metadata_builders[kv_cache_group_id] @@ -2589,6 +2598,8 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: # TODO: Support other attention modules, e.g., cross-attention if attn_module.attn_type == AttentionType.DECODER: + use_local_attention = (self.attention_chunk_size is not None + and attn_module.impl.use_irope) if attn_module.sliding_window is not None: kv_cache_spec[layer_name] = SlidingWindowSpec( block_size=block_size, @@ -2597,6 +2608,14 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: dtype=self.kv_cache_dtype, sliding_window=attn_module.sliding_window, use_mla=use_mla) + elif use_local_attention: + kv_cache_spec[layer_name] = (ChunkedLocalAttentionSpec( + block_size=block_size, + num_kv_heads=attn_module.num_kv_heads, + head_size=attn_module.head_size, + dtype=self.kv_cache_dtype, + attention_chunk_size=self.attention_chunk_size, + use_mla=use_mla)) else: kv_cache_spec[layer_name] = FullAttentionSpec( block_size=block_size, From ffb33798212755d6fa52ffdbc33c880e5398ceeb Mon Sep 17 00:00:00 2001 From: 22quinn <33176974+22quinn@users.noreply.github.com> Date: Thu, 17 Jul 2025 21:12:23 -0700 Subject: [PATCH 167/552] [Doc] Add inplace weights loading example (#19640) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: x22x22 --- .../skip_loading_weights_in_engine_init.py | 53 +++++++++++++++++++ 1 file changed, 53 insertions(+) create mode 100644 examples/offline_inference/skip_loading_weights_in_engine_init.py diff --git a/examples/offline_inference/skip_loading_weights_in_engine_init.py b/examples/offline_inference/skip_loading_weights_in_engine_init.py new file mode 100644 index 00000000000..1a616817dd2 --- /dev/null +++ b/examples/offline_inference/skip_loading_weights_in_engine_init.py @@ -0,0 +1,53 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +from vllm import LLM, RequestOutput, SamplingParams + +# Sample prompts. +prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", +] +# Create a sampling params object. 
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + +def print_prompts_and_outputs(outputs: list[RequestOutput]) -> None: + print("-" * 60) + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}") + print(f"Output: {generated_text!r}") + print("-" * 60) + + +def main(): + # Create an LLM without loading real weights + llm = LLM( + model="Qwen/Qwen3-0.6B", + load_format="dummy", + enforce_eager=True, + tensor_parallel_size=4, + ) + outputs = llm.generate(prompts, sampling_params) + print("\nOutputs do not make sense:") + print_prompts_and_outputs(outputs) + + # Update load format from `dummy` to `auto` + llm.collective_rpc( + "update_config", args=({"load_config": {"load_format": "auto"}},) + ) + # Now reload real weights inplace + llm.collective_rpc("reload_weights") + + # Check outputs make sense + outputs = llm.generate(prompts, sampling_params) + print("\nOutputs make sense after loading real weights:") + print_prompts_and_outputs(outputs) + + +if __name__ == "__main__": + main() From 76c463e2022a9716176fdf72f6e84fd8ba9a430d Mon Sep 17 00:00:00 2001 From: Shu Wang Date: Thu, 17 Jul 2025 23:32:45 -0500 Subject: [PATCH 168/552] [Core] FlashInfer CUTLASS fused MoE backend (NVFP4) (#20037) Signed-off-by: shuw Signed-off-by: mgoin Co-authored-by: mgoin Signed-off-by: x22x22 --- vllm/_custom_ops.py | 22 +- vllm/envs.py | 5 + .../layers/fused_moe/batched_deep_gemm_moe.py | 36 +-- .../batched_triton_or_deep_gemm_moe.py | 7 +- .../model_executor/layers/fused_moe/config.py | 16 + .../layers/fused_moe/cutlass_moe.py | 284 +++++++++++++++--- .../layers/fused_moe/deep_gemm_moe.py | 3 +- .../fused_moe/deepep_ht_prepare_finalize.py | 19 +- .../fused_moe/deepep_ll_prepare_finalize.py | 19 +- .../fused_moe/flashinfer_cutlass_moe.py | 198 ++++++++++++ .../flashinfer_cutlass_prepare_finalize.py | 114 +++++++ .../layers/fused_moe/fused_batched_moe.py | 36 +-- .../layers/fused_moe/fused_moe.py | 1 + vllm/model_executor/layers/fused_moe/layer.py | 36 ++- .../layers/fused_moe/modular_kernel.py | 99 +++--- .../layers/fused_moe/pplx_prepare_finalize.py | 30 +- .../layers/fused_moe/prepare_finalize.py | 44 +-- .../layers/fused_moe/triton_deep_gemm_moe.py | 37 +-- vllm/model_executor/layers/fused_moe/utils.py | 32 +- .../compressed_tensors_moe.py | 10 +- .../layers/quantization/modelopt.py | 211 +++++++++++-- vllm/utils/flashinfer.py | 107 +++++++ 22 files changed, 1095 insertions(+), 271 deletions(-) create mode 100644 vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py create mode 100644 vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py create mode 100644 vllm/utils/flashinfer.py diff --git a/vllm/_custom_ops.py b/vllm/_custom_ops.py index 81f4f6bdada..cf296a3b534 100644 --- a/vllm/_custom_ops.py +++ b/vllm/_custom_ops.py @@ -956,11 +956,11 @@ def cutlass_moe_mm(out_tensors: torch.Tensor, a_tensors: torch.Tensor, c_strides, per_act_token, per_out_ch) -def cutlass_fp4_moe_mm(a_tensors: torch.Tensor, b_tensors: torch.Tensor, - a_scales: torch.Tensor, b_scales: torch.Tensor, - alphas: torch.Tensor, problem_sizes: torch.Tensor, - expert_offsets: torch.Tensor, sf_offsets: torch.Tensor, - out_dtype: torch.dtype, device: torch.device): +def cutlass_fp4_moe_mm(out_tensors: torch.Tensor, a_tensors: torch.Tensor, + b_tensors: torch.Tensor, a_scales: torch.Tensor, + b_scales: torch.Tensor, alphas: torch.Tensor, + problem_sizes: torch.Tensor, + expert_offsets: torch.Tensor, sf_offsets: torch.Tensor): """ An 
FP4 Blockscaled Group Gemm that takes in a_tensors, b_tensors and runs the gemms for each combination based on the specified problem sizes. @@ -977,14 +977,10 @@ def cutlass_fp4_moe_mm(a_tensors: torch.Tensor, b_tensors: torch.Tensor, - problem_sizes: MxNxK sizes of each expert's multiplication in two grouped MMs used in the fused MoE operation. """ - m_topk = a_tensors.shape[0] - n = b_tensors.shape[1] - c_shape = (m_topk, n) - c = torch.empty(c_shape, device=device, dtype=out_dtype) - torch.ops._C.cutlass_fp4_group_mm(c, a_tensors, b_tensors, a_scales, - b_scales, alphas, problem_sizes, - expert_offsets, sf_offsets) - return c.to(out_dtype) + return torch.ops._C.cutlass_fp4_group_mm(out_tensors, a_tensors, b_tensors, + a_scales, b_scales, alphas, + problem_sizes, expert_offsets, + sf_offsets) # aqlm diff --git a/vllm/envs.py b/vllm/envs.py index ba0c55160b7..261cc7855b7 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -119,6 +119,7 @@ VLLM_TPU_BUCKET_PADDING_GAP: int = 0 VLLM_TPU_MOST_MODEL_LEN: Optional[int] = None VLLM_USE_DEEP_GEMM: bool = False + VLLM_USE_FLASHINFER_MOE: bool = False VLLM_XGRAMMAR_CACHE_MB: int = 0 VLLM_MSGPACK_ZERO_COPY_THRESHOLD: int = 256 VLLM_ALLOW_INSECURE_SERIALIZATION: bool = False @@ -853,6 +854,10 @@ def get_vllm_port() -> Optional[int]: "VLLM_USE_DEEP_GEMM": lambda: bool(int(os.getenv("VLLM_USE_DEEP_GEMM", "0"))), + # Allow use of FlashInfer CUTLASS kernels for fused moe ops. + "VLLM_USE_FLASHINFER_MOE": + lambda: bool(int(os.getenv("VLLM_USE_FLASHINFER_MOE", "0"))), + # Control the cache sized used by the xgrammar compiler. The default # of 512 MB should be enough for roughly 1000 JSON schemas. # It can be changed with this variable if needed for some reason. diff --git a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py index e61d350388e..628aa5c7bb0 100644 --- a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py @@ -1,6 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from typing import Any, Optional import torch @@ -255,28 +255,18 @@ def workspace_shapes( output = (num_experts, max_num_tokens * num_dispatchers, K) return (workspace13, workspace2, output, a.dtype) - def apply( - self, - output: torch.Tensor, - hidden_states: torch.Tensor, - w1: torch.Tensor, - w2: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - activation: str, - global_num_experts: int, - expert_map: Optional[torch.Tensor], - w1_scale: Optional[torch.Tensor], - w2_scale: Optional[torch.Tensor], - w1_zp: Optional[torch.Tensor], - w2_zp: Optional[torch.Tensor], - a1q_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - workspace13: torch.Tensor, - workspace2: torch.Tensor, - expert_tokens_meta: Optional[mk.ExpertTokensMetadata], - apply_router_weight_on_input: bool, - ): + def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, + w1: torch.Tensor, w2: torch.Tensor, topk_weights: torch.Tensor, + topk_ids: torch.Tensor, activation: str, global_num_experts: int, + expert_map: Optional[torch.Tensor], + w1_scale: Optional[torch.Tensor], + w2_scale: Optional[torch.Tensor], w1_zp: Optional[torch.Tensor], + w2_zp: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, + workspace2: torch.Tensor, + expert_tokens_meta: 
Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]]): assert expert_tokens_meta is not None expert_num_tokens = expert_tokens_meta.expert_num_tokens diff --git a/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py index 1a63b323734..fc30e84e665 100644 --- a/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py @@ -1,6 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from typing import Any, Optional import torch @@ -142,7 +142,8 @@ def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], - apply_router_weight_on_input: bool): + apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]]): experts = (self.batched_deep_gemm_experts if self.allow_deep_gemm else self.batched_triton_experts) assert experts is not None @@ -150,4 +151,4 @@ def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, activation, global_num_experts, expert_map, w1_scale, w2_scale, w1_zp, w2_zp, a1q_scale, a2_scale, workspace13, workspace2, expert_tokens_meta, - apply_router_weight_on_input) + apply_router_weight_on_input, extra_expert_args) diff --git a/vllm/model_executor/layers/fused_moe/config.py b/vllm/model_executor/layers/fused_moe/config.py index def1c2b4556..9bebb6a65fc 100644 --- a/vllm/model_executor/layers/fused_moe/config.py +++ b/vllm/model_executor/layers/fused_moe/config.py @@ -15,6 +15,7 @@ from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) from vllm.utils import cdiv +from vllm.utils.flashinfer import has_flashinfer_cutlass_fused_moe logger = init_logger(__name__) @@ -188,6 +189,11 @@ def use_deepep_ll_kernels(self): return (self.use_all2all_kernels and envs.VLLM_ALL2ALL_BACKEND == "deepep_low_latency") + @property + def use_flashinfer_cutlass_kernels(self): + return (envs.VLLM_USE_FLASHINFER_MOE + and has_flashinfer_cutlass_fused_moe()) + @staticmethod def make(tp_size_: int, dp_size_: int, vllm_parallel_config: ParallelConfig) -> "FusedMoEParallelConfig": @@ -392,6 +398,10 @@ def use_deepep_ht_kernels(self): def use_deepep_ll_kernels(self): return self.moe_parallel_config.use_deepep_ll_kernels + @property + def use_flashinfer_cutlass_kernels(self): + return self.moe_parallel_config.use_flashinfer_cutlass_kernels + @staticmethod def make( num_experts: int, @@ -435,6 +445,12 @@ def make( if quant_dtype is None and isinstance(quant_config, Fp8Config): quant_dtype = torch.float8_e4m3fn + from vllm.model_executor.layers.quantization.modelopt import ( + ModelOptNvFp4Config) + if quant_dtype is None and isinstance(quant_config, + ModelOptNvFp4Config): + quant_dtype = torch.uint8 + if weight_quant is not None: per_out_ch_quant = ( weight_quant.strategy == QuantizationStrategy.CHANNEL) diff --git a/vllm/model_executor/layers/fused_moe/cutlass_moe.py b/vllm/model_executor/layers/fused_moe/cutlass_moe.py index a1f87ba92a5..facc01a5ba8 100644 --- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py +++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright 
contributors to the vLLM project """ CUTLASS based Fused MoE kernels.""" -from typing import Callable, Optional +from typing import Any, Callable, Optional import torch @@ -14,7 +14,8 @@ from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( TopKWeightAndReduceDelegate) from vllm.model_executor.layers.fused_moe.utils import (_fp8_quantize, - _resize_cache) + _resize_cache, + extract_required_args) from vllm.scalar_type import scalar_types logger = init_logger(__name__) @@ -298,7 +299,8 @@ def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], - apply_router_weight_on_input: bool): + apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]]): assert w1_zp is None, "w1_zp is not supported in CUTLASS MoE" assert w2_zp is None, "w2_zp is not supported in CUTLASS MoE" @@ -431,23 +433,28 @@ def cutlass_moe_fp8( FLOAT8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max -def cutlass_moe_fp4(a: torch.Tensor, - a1_gscale: torch.Tensor, - w1_fp4: torch.Tensor, - w1_blockscale: torch.Tensor, - w1_alphas: torch.Tensor, - a2_gscale: torch.Tensor, - w2_fp4: torch.Tensor, - w2_blockscale: torch.Tensor, - w2_alphas: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - m: int, - n: int, - k: int, - e: int, - device: torch.device, - apply_router_weight_on_input: bool = False): +def run_cutlass_moe_fp4( + output: torch.Tensor, + a: torch.Tensor, + a1_gscale: torch.Tensor, + w1_fp4: torch.Tensor, + w1_blockscale: torch.Tensor, + w1_alphas: torch.Tensor, + a2_gscale: torch.Tensor, + w2_fp4: torch.Tensor, + w2_blockscale: torch.Tensor, + w2_alphas: torch.Tensor, + topk_weights: torch.Tensor, + topk_ids: torch.Tensor, + workspace13: torch.Tensor, + workspace2: torch.Tensor, + m: int, + n: int, + k: int, + e: int, + device: torch.device, + apply_router_weight_on_input: bool = False, +) -> None: """ MoE implementation for FP4 Inputs @@ -487,16 +494,16 @@ def cutlass_moe_fp4(a: torch.Tensor, assert (e_w1 == e_w2 and e_w1 == e), ("Number of experts must match", " between weights.") - assert (k_a // 2 == half_k_w1 + assert (k_a == half_k_w1 * 2 and k == k_w2), ("Hidden size mismatch between a, w1 and w2") - assert (nx2_w1 == n * 2 and half_n_w2 == n // 2), ("mismatch in " - "expected `n`") + assert (nx2_w1 == n * 2 and half_n_w2 * 2 == n), ("mismatch in " + "expected `n`") assert (m == m_a), "input shape mismatch" assert 2 * half_k_w1 == k_w2, "Hidden size mismatch w2 and w1" assert a.dtype in [torch.half, torch.bfloat16], "Invalid input dtype" assert (topk_weights.size(0) == m and topk_ids.size(0) == m), ("topk must be provided for each row of a") - + topk = topk_ids.size(1) out_dtype = a.dtype num_topk = topk_ids.size(1) @@ -523,7 +530,6 @@ def cutlass_moe_fp4(a: torch.Tensor, blockscale_offsets) a = ops.shuffle_rows(a, a_map) - rep_a_fp4, rep_a_blockscale = ops.scaled_fp4_experts_quant( a, a1_gscale, @@ -531,34 +537,220 @@ def cutlass_moe_fp4(a: torch.Tensor, blockscale_offsets, num_topk, ) - - c1 = ops.cutlass_fp4_moe_mm(rep_a_fp4, w1_fp4, rep_a_blockscale, - w1_blockscale, w1_alphas, problem_sizes1, - expert_offsets[:-1], blockscale_offsets[:-1], - out_dtype, device) + c1 = _resize_cache(workspace13, (m * topk, n * 2)) + c2 = _resize_cache(workspace2, (m * topk, n)) + c3 = _resize_cache(workspace13, (m * topk, k)) + ops.cutlass_fp4_moe_mm(c1, rep_a_fp4, w1_fp4, rep_a_blockscale, + w1_blockscale, 
w1_alphas, problem_sizes1, + expert_offsets[:-1], blockscale_offsets[:-1]) del rep_a_fp4, rep_a_blockscale - # hidden size dimension is split to one halfpytho sized tensor. - intermediate = torch.empty((m * num_topk, w1_fp4.size(1) // 2), - device=device, - dtype=out_dtype) - - torch.ops._C.silu_and_mul(intermediate, c1) - + torch.ops._C.silu_and_mul(c2, c1) int_fp4, int_blockscale = ops.scaled_fp4_experts_quant( - intermediate, a2_gscale, expert_offsets, blockscale_offsets, num_topk) + c2, a2_gscale, expert_offsets, blockscale_offsets, num_topk) - c2 = ops.cutlass_fp4_moe_mm(int_fp4, w2_fp4, int_blockscale, w2_blockscale, - w2_alphas, problem_sizes2, expert_offsets[:-1], - blockscale_offsets[:-1], out_dtype, device) + ops.cutlass_fp4_moe_mm(c3, int_fp4, w2_fp4, int_blockscale, w2_blockscale, + w2_alphas, problem_sizes2, expert_offsets[:-1], + blockscale_offsets[:-1]) del int_fp4, int_blockscale - c2 = ops.shuffle_rows(c2, c_map) + c3 = ops.shuffle_rows(c3, c_map) + + assert output.dtype == out_dtype if not apply_router_weight_on_input: - out = (c2.view(m, num_topk, k) * - topk_weights.view(m, num_topk, 1).to(out_dtype)).sum(dim=1) + output.copy_( + (c3.view(m, num_topk, k) * + topk_weights.view(m, num_topk, 1).to(out_dtype)).sum(dim=1), + non_blocking=True) else: - out = c2.view(m, num_topk, k).sum(dim=1) - return out.to(dtype=out_dtype) + output.copy_(c3.view(m, num_topk, k).sum(dim=1), non_blocking=True) + return + + +class CutlassExpertsFp4(mk.FusedMoEPermuteExpertsUnpermute): + + def __init__( + self, + max_experts_per_worker: int, + out_dtype: torch.dtype, + per_act_token_quant: bool, + per_out_ch_quant: bool, + block_shape: Optional[list[int]] = None, + use_batched_format: bool = False, + ): + super().__init__( + FusedMoEQuantConfig( + quant_dtype=torch.uint8, + per_act_token_quant=per_act_token_quant, + per_out_ch_quant=per_out_ch_quant, + block_shape=block_shape, + )) + self.max_experts_per_worker = max_experts_per_worker + self.out_dtype = out_dtype + self.use_batched_format = use_batched_format + + @property + def activation_formats( + self + ) -> tuple[mk.FusedMoEActivationFormat, mk.FusedMoEActivationFormat]: + if self.use_batched_format: + return (mk.FusedMoEActivationFormat.BatchedExperts, + mk.FusedMoEActivationFormat.BatchedExperts) + else: + return (mk.FusedMoEActivationFormat.Standard, + mk.FusedMoEActivationFormat.Standard) + + def supports_expert_map(self) -> bool: + return False + + def supports_chunking(self) -> bool: + return True + + def finalize_weight_and_reduce_impl(self) -> mk.TopKWeightAndReduce: + # Let PrepareAndFinalize::finalize() decide the impl. + return TopKWeightAndReduceDelegate() + + def workspace_shapes( + self, + a: torch.Tensor, + aq: torch.Tensor, + M: int, + N: int, + K: int, + topk: int, + global_num_experts: int, + local_num_experts: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: + workspace1: tuple[int, ...] = () + workspace2: tuple[int, ...] = () + output: tuple[int, ...] 
= () + if self.use_batched_format: + padded_M = aq.size(1) + workspace1 = (self.max_experts_per_worker, padded_M, max(N, K)) + workspace2 = (self.max_experts_per_worker, padded_M, (N // 2)) + output = (self.max_experts_per_worker, padded_M, K) + else: + workspace1 = (M * topk, max(2 * N, K)) + workspace2 = (M * topk, N) + output = (M, K) + return (workspace1, workspace2, output, + self.out_dtype if self.out_dtype is not None else a.dtype) + + def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, + w1: torch.Tensor, w2: torch.Tensor, topk_weights: torch.Tensor, + topk_ids: torch.Tensor, activation: str, global_num_experts: int, + expert_map: Optional[torch.Tensor], w1_scale: torch.Tensor, + w2_scale: torch.Tensor, w1_zp: Optional[torch.Tensor], + w2_zp: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], + a2_scale: torch.Tensor, workspace13: Optional[torch.Tensor], + workspace2: Optional[torch.Tensor], + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]]): + required_keys = [ + "g1_alphas", "g2_alphas", "a1_gscale", "a2_gscale", "m", "n", "k", + "e", "device" + ] + (g1_alphas, g2_alphas, a1_gscale, a2_gscale, m, n, k, e, + device) = extract_required_args(extra_expert_args, required_keys) + run_cutlass_moe_fp4( + output=output, + a=hidden_states, + a1_gscale=a1_gscale, + w1_fp4=w1, + w1_blockscale=w1_scale, + w1_alphas=g1_alphas, + a2_gscale=a2_gscale, + w2_fp4=w2, + w2_blockscale=w2_scale, + w2_alphas=g2_alphas, + topk_weights=topk_weights, + topk_ids=topk_ids, + workspace13=workspace13, + workspace2=workspace2, + m=m, + n=n, + k=k, + e=e, + device=device, + apply_router_weight_on_input=apply_router_weight_on_input, + ) + + +def cutlass_moe_fp4( + a: torch.Tensor, + w1_fp4: torch.Tensor, + w2_fp4: torch.Tensor, + w1_blockscale: torch.Tensor, + w2_blockscale: torch.Tensor, + g1_alphas: torch.Tensor, + g2_alphas: torch.Tensor, + a1_gscale: torch.Tensor, + a2_gscale: torch.Tensor, + topk_weights: torch.Tensor, + topk_ids: torch.Tensor, + m: int, + n: int, + k: int, + e: int, + device: torch.device, + expert_map: Optional[torch.Tensor] = None, + apply_router_weight_on_input: bool = False) -> torch.Tensor: + assert expert_map is None, ("Expert Parallelism / expert_map " + "is currently not supported for " + "ModelOptNvFp4FusedMoE's cutlass_moe_fp4.") + fn = mk.FusedMoEModularKernel( + MoEPrepareAndFinalizeNoEP(), + CutlassExpertsFp4( + max_experts_per_worker=e, + out_dtype=a.dtype, + per_act_token_quant=False, + per_out_ch_quant=False, + use_batched_format=False, + ), + ) + extra_expert_args = { + 'g1_alphas': g1_alphas, + 'g2_alphas': g2_alphas, + 'a1_gscale': a1_gscale, + 'a2_gscale': a2_gscale, + 'm': m, + 'n': n, + 'k': k, + 'e': e, + 'device': device, + } + + # NVFP4 requires two levels of quantization, which involves computing some + # scaling factors dynamically. This makes it incompatible with the typical + # prepare -> MoE -> finalize pipeline. Move the quantization logic into the + # MoE body. + extra_prepare_args = { + 'skip_quant': True, + } + # Similar reason as above. 
+ extra_finalize_args = { + 'skip_weight_reduce': True, + } + return fn( + hidden_states=a, + w1=w1_fp4, + w2=w2_fp4, + topk_weights=topk_weights, + topk_ids=topk_ids, + inplace=False, + activation="silu", + global_num_experts=e, + expert_map=None, + w1_scale=w1_blockscale, + w2_scale=w2_blockscale, + a1_scale=None, + a2_scale=None, + apply_router_weight_on_input=apply_router_weight_on_input, + extra_expert_args=extra_expert_args, + extra_prepare_args=extra_prepare_args, + extra_finalize_args=extra_finalize_args, + ) def _valid_cutlass_block_scaled_grouped_gemm( diff --git a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py index f0c4ca5e52b..b89e5ac6f09 100644 --- a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import functools -from typing import Optional +from typing import Any, Optional import torch @@ -152,6 +152,7 @@ def apply( workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]], ): assert self.block_shape is not None assert a1q_scale is not None diff --git a/vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py index e10927c4dce..7016ff34c3a 100644 --- a/vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py @@ -1,6 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from typing import Any, Optional import deep_ep import torch @@ -127,16 +127,12 @@ def _do_dispatch(self, tokens: torch.Tensor, expert_topk_weights) def prepare( - self, - a1: torch.Tensor, - a1_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - num_experts: int, - expert_map: Optional[torch.Tensor], - apply_router_weight_on_input: bool, + self, a1: torch.Tensor, a1_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], topk_weights: torch.Tensor, + topk_ids: torch.Tensor, num_experts: int, + expert_map: Optional[torch.Tensor], apply_router_weight_on_input: bool, quant_config: FusedMoEQuantConfig, + extra_prepare_args: Optional[dict[str, Any]] ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[mk.ExpertTokensMetadata], Optional[torch.Tensor], Optional[torch.Tensor]]: @@ -191,7 +187,8 @@ def prepare( def finalize(self, output: torch.Tensor, fused_expert_output: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, apply_router_weight_on_input: bool, - weight_and_reduce_impl: mk.TopKWeightAndReduce) -> None: + weight_and_reduce_impl: mk.TopKWeightAndReduce, + extra_finalize_args: Optional[dict[str, Any]]) -> None: assert self.handle is not None diff --git a/vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py index b04f0197584..57871ca250a 100644 --- a/vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py @@ -1,6 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing 
import Optional, Union +from typing import Any, Optional, Union import deep_ep import torch @@ -111,16 +111,12 @@ def _do_quant( return x, x_scales def prepare( - self, - a1: torch.Tensor, - a1_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - num_experts: int, - expert_map: Optional[torch.Tensor], - apply_router_weight_on_input: bool, + self, a1: torch.Tensor, a1_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], topk_weights: torch.Tensor, + topk_ids: torch.Tensor, num_experts: int, + expert_map: Optional[torch.Tensor], apply_router_weight_on_input: bool, quant_config: FusedMoEQuantConfig, + extra_prepare_args: Optional[dict[str, Any]] ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[mk.ExpertTokensMetadata], Optional[torch.Tensor], Optional[torch.Tensor]]: @@ -169,7 +165,8 @@ def prepare( def finalize(self, output: torch.Tensor, fused_expert_output: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, apply_router_weight_on_input: bool, - weight_and_reduce_impl: mk.TopKWeightAndReduce) -> None: + weight_and_reduce_impl: mk.TopKWeightAndReduce, + extra_finalize_args: Optional[dict[str, Any]]) -> None: assert isinstance( weight_and_reduce_impl, TopKWeightAndReduceDelegate ), ("Weight application and reduction happens in the combine kernel.") diff --git a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py new file mode 100644 index 00000000000..1753c4f6e23 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py @@ -0,0 +1,198 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from typing import Any, Optional + +import torch + +import vllm.model_executor.layers.fused_moe.modular_kernel as mk +from vllm.logger import init_logger +from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig +from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( + TopKWeightAndReduceDelegate) +from vllm.model_executor.layers.fused_moe.utils import extract_required_args +from vllm.utils.flashinfer import (flashinfer_cutlass_fused_moe, + has_flashinfer_cutlass_fused_moe) + +logger = init_logger(__name__) + + +def is_valid_flashinfer_cutlass_fused_moe(hidden_states: torch.Tensor, + w1: torch.Tensor, + w2: torch.Tensor) -> bool: + """ + Check if the given problem size is supported by the FlashInfer CUTLASS MoE + kernel. 
+ """ + if not has_flashinfer_cutlass_fused_moe(): + logger.debug_once("FlashInferExperts disabled: " + "flashinfer_cutlass_fused_moe not available.") + return False + # Data type checks + if (w1.dtype != torch.uint8 or w2.dtype != torch.uint8 + or hidden_states.dtype + not in [torch.float32, torch.float16, torch.bfloat16]): + logger.debug_once( + "FlashInferExperts disabled: w1/w2 must be torch.uint8 " + f"(got w1={w1.dtype}, w2={w2.dtype}), hidden_states must be " + f"float32, float16, or bfloat16 (got {hidden_states.dtype}).") + return False + return True + + +class FlashInferExperts(mk.FusedMoEPermuteExpertsUnpermute): + + def __init__( + self, + use_nvfp4_w4a4: bool = False, + use_fp8_w8a8: bool = False, + use_dp: bool = False, + ep_rank: int = 0, + ep_size: int = 1, + tp_rank: int = 0, + tp_size: int = 1, + num_dispatchers: Optional[int] = None, + use_batched_format: bool = False, + ): + super().__init__( + FusedMoEQuantConfig( + quant_dtype=torch.uint8, + per_act_token_quant=False, + block_shape=None, + )) + self.use_nvfp4_w4a4 = use_nvfp4_w4a4 + self.use_fp8_w8a8 = use_fp8_w8a8 + self.ep_rank = ep_rank + self.ep_size = ep_size + self.tp_rank = tp_rank + self.tp_size = tp_size + self.use_dp = use_dp + assert not use_batched_format or num_dispatchers is not None + self.num_dispatchers = num_dispatchers + + @property + def activation_formats( + self + ) -> tuple[mk.FusedMoEActivationFormat, mk.FusedMoEActivationFormat]: + return (mk.FusedMoEActivationFormat.Standard, + mk.FusedMoEActivationFormat.Standard) + + def supports_expert_map(self) -> bool: + return False + + def supports_chunking(self) -> bool: + # This refers to TP chunking; DP chunking is handled separately. + return True + + def finalize_weight_and_reduce_impl(self) -> mk.TopKWeightAndReduce: + # Let PrepareAndFinalize::finalize() decide the impl. + return TopKWeightAndReduceDelegate() + + def workspace_shapes( + self, + a: torch.Tensor, + aq: torch.Tensor, + M: int, + N: int, + K: int, + topk: int, + global_num_experts: int, + local_num_experts: int, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + ) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]: + # We use global_num_experts due to how moe_align_block_size handles + # expert_maps. + """ + Compute the shapes for the temporary and final outputs of the two gemms + and activation in the fused expert function. Since the gemms are + independent, the workspace for the first gemm can be shared with the + workspace for the last gemm. + + Returns a tuple of: + - workspace13 shape tuple: must be large enough to hold the + result of either expert gemm. + - workspace2 shape tuple: must be large enough to hold the + result of the activation function. + - output shape tuple: must be exact size of the final gemm output. + - Workspace type: The dtype to use for the workspace tensors. + - Note: in order for activation chunking to work, the first dimension + of each tuple must be the number of tokens. + """ + assert self.use_nvfp4_w4a4 is True, ("Only nvfp4 quantization is " + "currently supported.") + aq_m, aq_n = aq.shape + workspace2 = () + output_shape = (aq_m, aq_n * 2) + workspace_dtype = a.dtype + workspace1 = output_shape + # The workspace is determined by `aq`, since it comes after any + # potential communication op and is involved in the expert computation. 
+ return (workspace1, workspace2, output_shape, workspace_dtype) + + def apply( + self, + output: torch.Tensor, + hidden_states: torch.Tensor, + w1: torch.Tensor, + w2: torch.Tensor, + topk_weights: torch.Tensor, + topk_ids: torch.Tensor, + activation: str, + global_num_experts: int, + expert_map: Optional[torch.Tensor], + w1_scale: Optional[torch.Tensor], + w2_scale: Optional[torch.Tensor], + w1_zp: Optional[torch.Tensor], + w2_zp: Optional[torch.Tensor], + a1q_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], # Not used + workspace13: Optional[torch.Tensor], + workspace2: Optional[torch.Tensor], + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: Optional[bool], + extra_expert_args: Optional[dict[str, Any]], + ): + assert extra_expert_args is not None, \ + "extra_expert_args must be provided" + required_keys = [ + 'g1_alphas', 'g2_alphas', 'a1_gscale', 'a2_gscale', 'out_dtype' + ] + + g1_alphas, g2_alphas, a1_gscale, a2_gscale, out_dtype = ( + extract_required_args(extra_expert_args, required_keys)) + + # Flashinfer CUTLASS kernel takes scalar global scales, + # min because inv_scale. + assert self.use_nvfp4_w4a4 is True, ("Only nvfp4 quantization is " + "currently supported.") + + # Ensure w1_scale and w2_scale are not None before calling view + assert w1_scale is not None and w2_scale is not None, ( + "w1_scale and w2_scale must not " + "be None for FlashInferExperts") + + assert not apply_router_weight_on_input + + quant_scales = [ + a1_gscale, + w1_scale.view(torch.int32), + g1_alphas, + a2_gscale, + w2_scale.view(torch.int32), + g2_alphas, + ] + _ = flashinfer_cutlass_fused_moe( + hidden_states, + topk_ids.to(torch.int), + topk_weights, + # FlashInfer API requires weight to be long for nvfp4 + w1.view(torch.long), + w2.view(torch.long), + output_dtype=out_dtype, + quant_scales=quant_scales, + input_sf=a1q_scale, + tp_size=self.tp_size, + tp_rank=self.tp_rank, + ep_size=self.ep_size, + ep_rank=self.ep_rank, + output=output, + ) diff --git a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py new file mode 100644 index 00000000000..49819504c8e --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py @@ -0,0 +1,114 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from typing import Any, Optional + +import torch + +import vllm.envs as envs +import vllm.model_executor.layers.fused_moe.modular_kernel as mk +from vllm.distributed import get_dp_group +from vllm.forward_context import get_forward_context +from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig +from vllm.model_executor.layers.fused_moe.utils import ( + extract_required_args, moe_kernel_quantize_input) +from vllm.utils.flashinfer import fp4_swizzle_blockscale + + +def get_local_sizes(local_tokens): + cu_sizes = get_forward_context().dp_metadata.cu_tokens_across_dp_cpu + sizes = [cu_sizes[0].item()] + for i in range(1, len(cu_sizes)): + sizes.append((cu_sizes[i] - cu_sizes[i - 1]).item()) + max_num_tokens = envs.VLLM_MOE_DP_CHUNK_SIZE + sizes_chunked = [max_num_tokens] * len(sizes) + if local_tokens < max_num_tokens: + # When the number of local tokens is less than max_num_tokens, all other + # ranks will also have fewer than max_num_tokens. The remaining tokens + # are accounted for as residual. 
+ sizes_chunked = [x % max_num_tokens for x in sizes] + + return sizes_chunked + + +class FlashInferCutlassMoEPrepareAndFinalize(mk.FusedMoEPrepareAndFinalize): + + def __init__( + self, + quant_dtype: Optional[torch.dtype] = None, + per_channel_quant: bool = False, + block_shape: Optional[list[int]] = None, + num_dispatchers: int = 1, + ): + super().__init__() + self.per_channel_quant = per_channel_quant + self.block_shape = block_shape + self.quant_dtype = quant_dtype + self.num_dispatchers_ = num_dispatchers + + @property + def activation_format(self) -> mk.FusedMoEActivationFormat: + return mk.FusedMoEActivationFormat.Standard + + def max_num_tokens_per_rank(self) -> Optional[int]: + return None + + def topk_indices_dtype(self) -> Optional[torch.dtype]: + return None + + def num_dispatchers(self) -> int: + return self.num_dispatchers_ + + def prepare( + self, + a1: torch.Tensor, + a1_scale: Optional[torch.Tensor], # Not used + a2_scale: Optional[torch.Tensor], # Not used + topk_weights: torch.Tensor, + topk_ids: torch.Tensor, + num_experts: int, + expert_map: Optional[torch.Tensor], + apply_router_weight_on_input: bool, + quant_config: FusedMoEQuantConfig, + extra_prepare_args: Optional[dict[str, Any]] + ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor], + Optional[torch.Tensor], Optional[torch.Tensor]]: + + assert not apply_router_weight_on_input + + (a1_gscale, use_dp, local_tokens) = extract_required_args( + extra_prepare_args, ['a1_gscale', 'use_dp', 'local_tokens']) + + a1q, a1q_scale = moe_kernel_quantize_input( + a1, + a1_gscale, + quant_config.quant_dtype, + self.per_channel_quant, + self.block_shape, + is_fp4_scale_swizzled=not use_dp, # Swizzling after communication + ) + if use_dp: + topk_weights, topk_ids, a1q, a1q_scale = \ + get_dp_group().all_gatherv([topk_weights, topk_ids, a1q, a1q_scale], # noqa: E501 + dim=0, + sizes=get_local_sizes(local_tokens)) + a1_m, a1_n = a1q.shape + a1q_scale = fp4_swizzle_blockscale(a1q_scale, a1_m, a1_n * 2) + + return a1q, a1q_scale, None, topk_ids, topk_weights + + def finalize(self, output: torch.Tensor, fused_expert_output: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, + apply_router_weight_on_input: bool, + weight_and_reduce_impl: mk.TopKWeightAndReduce, + extra_finalize_args: Optional[dict[str, Any]]) -> None: + + (use_dp, + local_tokens) = extract_required_args(extra_finalize_args, + ['use_dp', 'local_tokens']) + if use_dp: + fused_expert_output = get_dp_group().reduce_scatterv( + fused_expert_output, + dim=0, + sizes=get_local_sizes(local_tokens), + ) + output.copy_(fused_expert_output) diff --git a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py index ab8a281b390..9a5c85e120c 100644 --- a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Fused batched MoE kernel.""" -from typing import Optional +from typing import Any, Optional import torch @@ -496,16 +496,12 @@ def num_dispatchers(self) -> int: return self.num_dispatchers_ def prepare( - self, - a1: torch.Tensor, - a1_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - num_experts: int, - expert_map: Optional[torch.Tensor], - apply_router_weight_on_input: bool, + self, a1: torch.Tensor, a1_scale: 
Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], topk_weights: torch.Tensor, + topk_ids: torch.Tensor, num_experts: int, + expert_map: Optional[torch.Tensor], apply_router_weight_on_input: bool, quant_config: FusedMoEQuantConfig, + extra_prepare_args: Optional[dict[str, Any]] ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[mk.ExpertTokensMetadata], Optional[torch.Tensor], Optional[torch.Tensor]]: @@ -594,15 +590,11 @@ def prepare( return b_a1, b_a1_scale, expert_tokens_meta, None, None - def finalize( - self, - output: torch.Tensor, - fused_expert_output: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - apply_router_weight_on_input: bool, - weight_and_reduce_impl: mk.TopKWeightAndReduce, - ) -> None: + def finalize(self, output: torch.Tensor, fused_expert_output: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, + apply_router_weight_on_input: bool, + weight_and_reduce_impl: mk.TopKWeightAndReduce, + extra_finalize_args: Optional[dict[str, Any]]) -> None: if isinstance(weight_and_reduce_impl, TopKWeightAndReduceDelegate): weight_and_reduce_impl = TopKWeightAndReduceNaiveBatched(self.rank) weight_and_reduce_impl.apply( @@ -706,7 +698,8 @@ def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], - apply_router_weight_on_input: bool): + apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]]): assert hidden_states.dim() == 3 assert expert_tokens_meta is not None expert_num_tokens = expert_tokens_meta.expert_num_tokens @@ -911,7 +904,8 @@ def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], - apply_router_weight_on_input: bool): + apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]]): # Check constraints. if self.use_int4_w4a16: assert hidden_states.size(-1) // 2 == w1.size(2), ( diff --git a/vllm/model_executor/layers/fused_moe/fused_moe.py b/vllm/model_executor/layers/fused_moe/fused_moe.py index ddda87c441b..45936026007 100644 --- a/vllm/model_executor/layers/fused_moe/fused_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_moe.py @@ -1646,6 +1646,7 @@ def apply( workspace2: torch.Tensor, expert_tokens_meta: Optional[mk.ExpertTokensMetadata], apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]], ): # Check constraints. 
if self.use_int4_w4a16: diff --git a/vllm/model_executor/layers/fused_moe/layer.py b/vllm/model_executor/layers/fused_moe/layer.py index b3cee55e8ba..4b8a37fcc73 100644 --- a/vllm/model_executor/layers/fused_moe/layer.py +++ b/vllm/model_executor/layers/fused_moe/layer.py @@ -34,6 +34,7 @@ from vllm.platforms import current_platform from vllm.platforms.interface import CpuArchEnum from vllm.utils import direct_register_custom_op, has_deep_ep, has_pplx +from vllm.utils.flashinfer import has_flashinfer if current_platform.is_cuda_alike(): from .fused_batched_moe import BatchedTritonExperts @@ -45,6 +46,9 @@ from .deepep_ht_prepare_finalize import DeepEPHTPrepareAndFinalize from .deepep_ll_prepare_finalize import (DEEPEP_QUANT_BLOCK_SHAPE, DeepEPLLPrepareAndFinalize) + if has_flashinfer(): + from .flashinfer_cutlass_prepare_finalize import ( + FlashInferCutlassMoEPrepareAndFinalize) else: fused_experts = None # type: ignore FusedMoEPermuteExpertsUnpermute = None # type: ignore @@ -99,6 +103,9 @@ def maybe_make_prepare_finalize( prepare_finalize: Optional[FusedMoEPrepareAndFinalize] = None + if moe.use_flashinfer_cutlass_kernels: + prepare_finalize = FlashInferCutlassMoEPrepareAndFinalize( + quant_dtype=moe.quant_dtype, ) if moe.use_pplx_kernels: hidden_dim_bytes, hidden_scale_bytes = pplx_hidden_dim_scale_bytes( moe.max_num_tokens, @@ -204,6 +211,12 @@ def select_gemm_impl( f"{self.__class__.__name__} must select appropriate gemm " "implementation based on the prepare_finalize") + def maybe_swap_experts_impl( + self, + moe_parallel_config: FusedMoEParallelConfig, + ): + pass + @abstractmethod def apply( self, @@ -744,12 +757,15 @@ def __init__( moe_quant_params["intermediate_size_full"] = intermediate_size self.quant_method.create_weights(layer=self, **moe_quant_params) + if isinstance(self.quant_method, FusedMoEMethodBase): + self.quant_method.maybe_swap_experts_impl(self.moe_parallel_config) # Chunked all2all staging tensor self.batched_hidden_states: Optional[torch.Tensor] = None self.batched_router_logits: Optional[torch.Tensor] = None if (self.moe_parallel_config.use_pplx_kernels - or self.moe_parallel_config.use_deepep_ll_kernels): + or self.moe_parallel_config.use_deepep_ll_kernels + or self.moe_parallel_config.use_flashinfer_cutlass_kernels): self.batched_hidden_states = torch.zeros( (moe.max_num_tokens, self.hidden_size), dtype=moe.in_dtype, @@ -801,6 +817,10 @@ def use_deepep_ht_kernels(self): def use_deepep_ll_kernels(self): return self.moe_parallel_config.use_deepep_ll_kernels + @property + def use_flashinfer_cutlass_kernels(self): + return self.moe_parallel_config.use_flashinfer_cutlass_kernels + def _load_per_tensor_weight_scale(self, shard_id: str, param: torch.nn.Parameter, loaded_weight: torch.Tensor, @@ -1402,9 +1422,9 @@ def process_chunk(chunk_start, chunk_end, skip_result_store=False): final_hidden_states, non_blocking=True) ctx = get_forward_context() + # flashinfer_cutlass_kernels can handle: optional DP + TP/EP max_tokens_across_dp = ctx.dp_metadata.max_tokens_across_dp_cpu moe_dp_chunk_size_per_rank = self.moe_config.max_num_tokens - num_tokens = full_hidden_states.size(0) for chunk_start_ in range(0, max_tokens_across_dp, moe_dp_chunk_size_per_rank): @@ -1424,13 +1444,20 @@ def process_chunk(chunk_start, chunk_end, skip_result_store=False): def forward_impl(self, hidden_states: torch.Tensor, router_logits: torch.Tensor): assert self.quant_method is not None + # Route to the chunked forward path using the FlashInfer Cutlass kernel + # only when data parallelism (DP) 
is enabled. + use_flashinfer_cutlass_kernels = ( + self.dp_size > 1 + and self.moe_parallel_config.use_flashinfer_cutlass_kernels) if (self.moe_parallel_config.use_pplx_kernels - or self.moe_parallel_config.use_deepep_ll_kernels): + or self.moe_parallel_config.use_deepep_ll_kernels + or use_flashinfer_cutlass_kernels): return self.forward_impl_chunked(hidden_states, router_logits) do_naive_dispatch_combine: bool = ( self.dp_size > 1 - and not self.moe_parallel_config.use_deepep_ht_kernels) + and not self.moe_parallel_config.use_deepep_ht_kernels + and not self.moe_parallel_config.use_flashinfer_cutlass_kernels) if do_naive_dispatch_combine: hidden_states, router_logits = get_ep_group().dispatch( hidden_states, router_logits) @@ -1460,7 +1487,6 @@ def forward_impl(self, hidden_states: torch.Tensor, if do_naive_dispatch_combine: final_hidden_states = get_ep_group().combine(final_hidden_states) - if self.reduce_results and (self.tp_size > 1 or self.ep_size > 1): # Default set to False. (May have to add shared expert outputs. final_hidden_states = self.maybe_all_reduce_tensor_model_parallel( diff --git a/vllm/model_executor/layers/fused_moe/modular_kernel.py b/vllm/model_executor/layers/fused_moe/modular_kernel.py index bc4eb3b1932..6262904e4dc 100644 --- a/vllm/model_executor/layers/fused_moe/modular_kernel.py +++ b/vllm/model_executor/layers/fused_moe/modular_kernel.py @@ -4,7 +4,7 @@ from dataclasses import dataclass from enum import Enum from math import prod -from typing import Optional, final +from typing import Any, Optional, final import torch @@ -150,16 +150,12 @@ class FusedMoEPrepareAndFinalize(ABC): @abstractmethod def prepare( - self, - a1: torch.Tensor, - a1_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - num_experts: int, - expert_map: Optional[torch.Tensor], - apply_router_weight_on_input: bool, + self, a1: torch.Tensor, a1_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], topk_weights: torch.Tensor, + topk_ids: torch.Tensor, num_experts: int, + expert_map: Optional[torch.Tensor], apply_router_weight_on_input: bool, quant_config: FusedMoEQuantConfig, + extra_prepare_args: Optional[dict[str, Any]] ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[ExpertTokensMetadata], Optional[torch.Tensor], Optional[torch.Tensor]]: @@ -190,15 +186,11 @@ def prepare( raise NotImplementedError @abstractmethod - def finalize( - self, - output: torch.Tensor, - fused_expert_output: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - apply_router_weight_on_input: bool, - weight_and_reduce_impl: TopKWeightAndReduce, - ) -> None: + def finalize(self, output: torch.Tensor, fused_expert_output: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, + apply_router_weight_on_input: bool, + weight_and_reduce_impl: TopKWeightAndReduce, + extra_finalize_args: Optional[dict[str, Any]]) -> None: """ Perform any combine plus apply weights and perform a reduction on the fused experts output. @@ -376,6 +368,7 @@ def apply( workspace2: torch.Tensor, expert_tokens_meta: Optional[ExpertTokensMetadata], apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]], ): """ This function computes the intermediate result of a Mixture of Experts @@ -460,21 +453,19 @@ def __init__( f"{fused_experts.__class__.__name__}." 
f"{fused_experts.activation_formats[0]}") - def _do_fused_experts(self, fused_out: Optional[torch.Tensor], - a1: torch.Tensor, a1q: torch.Tensor, - w1: torch.Tensor, w2: torch.Tensor, - topk_weights: torch.Tensor, topk_ids: torch.Tensor, - activation: str, global_num_experts: int, - local_num_experts: int, - expert_map: Optional[torch.Tensor], - w1_scale: Optional[torch.Tensor], - w2_scale: Optional[torch.Tensor], - w1_zp: Optional[torch.Tensor], - w2_zp: Optional[torch.Tensor], - a1q_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - expert_tokens_meta: Optional[ExpertTokensMetadata], - apply_router_weight_on_input: bool) -> torch.Tensor: + def _do_fused_experts( + self, fused_out: Optional[torch.Tensor], a1: torch.Tensor, + a1q: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, + activation: str, global_num_experts: int, local_num_experts: int, + expert_map: Optional[torch.Tensor], + w1_scale: Optional[torch.Tensor], w2_scale: Optional[torch.Tensor], + w1_zp: Optional[torch.Tensor], w2_zp: Optional[torch.Tensor], + a1q_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], + expert_tokens_meta: Optional[ExpertTokensMetadata], + apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]]) -> torch.Tensor: _, M, N, K, top_k = _moe_problem_size(a1q, w1, w2, topk_ids) @@ -517,7 +508,8 @@ def _do_fused_experts(self, fused_out: Optional[torch.Tensor], workspace13=workspace13, workspace2=workspace2, expert_tokens_meta=expert_tokens_meta, - apply_router_weight_on_input=apply_router_weight_on_input) + apply_router_weight_on_input=apply_router_weight_on_input, + extra_expert_args=extra_expert_args) return fused_out @@ -541,6 +533,7 @@ def _maybe_chunk_fused_experts( a2_scale: Optional[torch.Tensor], expert_tokens_meta: Optional[ExpertTokensMetadata], apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]], ) -> torch.Tensor: _, M, N, K, top_k = _moe_problem_size(a1q, w1, w2, topk_ids) @@ -568,7 +561,8 @@ def _maybe_chunk_fused_experts( a1q_scale=a1q_scale, a2_scale=a2_scale, expert_tokens_meta=expert_tokens_meta, - apply_router_weight_on_input=apply_router_weight_on_input) + apply_router_weight_on_input=apply_router_weight_on_input, + extra_expert_args=extra_expert_args) # Chunking required case assert num_chunks > 1 @@ -624,6 +618,15 @@ def slice_expert_tokens_metadata( expert_num_tokens=c_expert_num_tokens, expert_num_tokens_cpu=c_expert_num_tokens_cpu) + m = None + if extra_expert_args is not None and 'm' in extra_expert_args: + m = extra_expert_args.get('m') + + if extra_expert_args is not None: + chunked_extra_expert_args = extra_expert_args + else: + chunked_extra_expert_args = {} + for chunk_idx in range(num_chunks): c_a1q, c_a1q_scale, c_a2_scale, c_topk_ids, c_topk_weights = ( slice_input_tensors(chunk_idx)) @@ -634,6 +637,11 @@ def slice_expert_tokens_metadata( expert_tokens_meta, c_topk_ids, local_num_experts, expert_map) + s = chunk_idx * CHUNK_SIZE + e = min(s + CHUNK_SIZE, M) + + if m is not None: + chunked_extra_expert_args['m'] = e - s self._do_fused_experts( fused_out=slice_output_tensor(chunk_idx), a1=a1, @@ -653,7 +661,8 @@ def slice_expert_tokens_metadata( a1q_scale=c_a1q_scale, a2_scale=c_a2_scale, expert_tokens_meta=c_expert_tokens_meta, - apply_router_weight_on_input=apply_router_weight_on_input) + apply_router_weight_on_input=apply_router_weight_on_input, + extra_expert_args=chunked_extra_expert_args) return fused_out @@ -675,6 +684,9 @@ def 
forward( a1_scale: Optional[torch.Tensor] = None, a2_scale: Optional[torch.Tensor] = None, apply_router_weight_on_input: bool = False, + extra_expert_args: Optional[dict] = None, + extra_prepare_args: Optional[dict] = None, + extra_finalize_args: Optional[dict] = None, ) -> torch.Tensor: """ This function computes a Mixture of Experts (MoE) layer using two sets @@ -707,6 +719,12 @@ def forward( - apply_router_weight_on_input (bool): When true, the topk weights are applied directly on the inputs. This is only applicable when topk is 1. + - extra_expert_args (Optional[dict]): Extra keyword arguments to pass to + fused_experts.apply. + - extra_prepare_args (Optional[dict]): Extra keyword arguments to pass + to prepare. + - extra_finalize_args (Optional[dict]): Extra keyword arguments to pass + to finalize. Returns: - torch.Tensor: The output tensor after applying the MoE layer. @@ -730,6 +748,7 @@ def forward( expert_map, apply_router_weight_on_input, self.fused_experts.quant_config, + extra_prepare_args, ) # Maybe prepare gathered topk_ids and topk_weights from other EP ranks. @@ -766,11 +785,13 @@ def forward( a1q_scale=a1q_scale, a2_scale=a2_scale, expert_tokens_meta=expert_tokens_meta, - apply_router_weight_on_input=apply_router_weight_on_input) + apply_router_weight_on_input=apply_router_weight_on_input, + extra_expert_args=extra_expert_args) self.prepare_finalize.finalize( output, fused_out, topk_weights, topk_ids, apply_router_weight_on_input, - self.fused_experts.finalize_weight_and_reduce_impl()) + self.fused_experts.finalize_weight_and_reduce_impl(), + extra_finalize_args) return output diff --git a/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py index 5a23a9f1ab0..46931f2dd7c 100644 --- a/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py @@ -1,6 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from typing import Any, Optional import pplx_kernels as pplx import torch @@ -89,16 +89,12 @@ def num_dispatchers(self) -> int: return self.num_dispatchers_ def prepare( - self, - a1: torch.Tensor, - a1_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - num_experts: int, - expert_map: Optional[torch.Tensor], - apply_router_weight_on_input: bool, + self, a1: torch.Tensor, a1_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], topk_weights: torch.Tensor, + topk_ids: torch.Tensor, num_experts: int, + expert_map: Optional[torch.Tensor], apply_router_weight_on_input: bool, quant_config: FusedMoEQuantConfig, + extra_prepare_args: Optional[dict[str, Any]] ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[mk.ExpertTokensMetadata], Optional[torch.Tensor], Optional[torch.Tensor]]: @@ -217,15 +213,11 @@ def prepare( return expert_x, expert_x_scale, expert_tokens_meta, None, None - def finalize( - self, - output: torch.Tensor, - fused_expert_output: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - apply_router_weight_on_input: bool, - weight_and_reduce_impl: mk.TopKWeightAndReduce, - ) -> None: + def finalize(self, output: torch.Tensor, fused_expert_output: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, + apply_router_weight_on_input: bool, + weight_and_reduce_impl: mk.TopKWeightAndReduce, + extra_finalize_args: 
Optional[dict[str, Any]]) -> None: assert isinstance( weight_and_reduce_impl, TopKWeightAndReduceDelegate ), ("Weight application and reduction happens in the combine kernel.") diff --git a/vllm/model_executor/layers/fused_moe/prepare_finalize.py b/vllm/model_executor/layers/fused_moe/prepare_finalize.py index b15c00c44b5..696c7cdba9a 100644 --- a/vllm/model_executor/layers/fused_moe/prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/prepare_finalize.py @@ -1,6 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from typing import Any, Optional import torch @@ -38,6 +38,7 @@ def prepare( expert_map: Optional[torch.Tensor], apply_router_weight_on_input: bool, quant_config: FusedMoEQuantConfig, + extra_prepare_args: Optional[dict[str, Any]], ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[mk.ExpertTokensMetadata], Optional[torch.Tensor], Optional[torch.Tensor]]: @@ -48,26 +49,33 @@ def prepare( assert topk == 1, \ "apply_router_weight_on_input is only implemented for topk=1" a1.mul_(topk_weights.to(a1.dtype)) + + if (extra_prepare_args is not None + and extra_prepare_args.get("skip_quant", True)): + # Skip quantization if explicitly requested + return a1, None, None, None, None + a1q, a1q_scale = moe_kernel_quantize_input( a1, a1_scale, quant_config.quant_dtype, quant_config.per_act_token_quant, quant_config.block_shape) return a1q, a1q_scale, None, None, None - def finalize( - self, - output: torch.Tensor, - fused_expert_output: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - apply_router_weight_on_input: bool, - weight_and_reduce_impl: mk.TopKWeightAndReduce, - ) -> None: - if isinstance(weight_and_reduce_impl, TopKWeightAndReduceDelegate): - weight_and_reduce_impl = TopKWeightAndReduceContiguous() - weight_and_reduce_impl.apply( - output=output, - fused_expert_output=fused_expert_output, - topk_weights=topk_weights, - topk_ids=topk_ids, - apply_router_weight_on_input=apply_router_weight_on_input) + def finalize(self, output: torch.Tensor, fused_expert_output: torch.Tensor, + topk_weights: torch.Tensor, topk_ids: torch.Tensor, + apply_router_weight_on_input: bool, + weight_and_reduce_impl: mk.TopKWeightAndReduce, + extra_finalize_args: Optional[dict[str, Any]]) -> None: + if (extra_finalize_args is not None + and extra_finalize_args.get("skip_weight_reduce", True)): + assert output.shape == fused_expert_output.shape + output.copy_(fused_expert_output) + else: + if isinstance(weight_and_reduce_impl, TopKWeightAndReduceDelegate): + weight_and_reduce_impl = TopKWeightAndReduceContiguous() + weight_and_reduce_impl.apply( + output=output, + fused_expert_output=fused_expert_output, + topk_weights=topk_weights, + topk_ids=topk_ids, + apply_router_weight_on_input=apply_router_weight_on_input) diff --git a/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py index 51b95c9aa92..1b31368c79c 100644 --- a/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py @@ -1,6 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from typing import Any, Optional import torch @@ -119,28 +119,18 @@ def workspace_shapes( local_num_experts, expert_tokens_meta) - def apply( - self, - output: torch.Tensor, - hidden_states: torch.Tensor, - w1: 
torch.Tensor, - w2: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - activation: str, - global_num_experts: int, - expert_map: Optional[torch.Tensor], - w1_scale: Optional[torch.Tensor], - w2_scale: Optional[torch.Tensor], - w1_zp: Optional[torch.Tensor], - w2_zp: Optional[torch.Tensor], - a1q_scale: Optional[torch.Tensor], - a2_scale: Optional[torch.Tensor], - workspace13: torch.Tensor, - workspace2: torch.Tensor, - expert_tokens_meta: Optional[mk.ExpertTokensMetadata], - apply_router_weight_on_input: bool, - ): + def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, + w1: torch.Tensor, w2: torch.Tensor, topk_weights: torch.Tensor, + topk_ids: torch.Tensor, activation: str, global_num_experts: int, + expert_map: Optional[torch.Tensor], + w1_scale: Optional[torch.Tensor], + w2_scale: Optional[torch.Tensor], w1_zp: Optional[torch.Tensor], + w2_zp: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], + a2_scale: Optional[torch.Tensor], workspace13: torch.Tensor, + workspace2: torch.Tensor, + expert_tokens_meta: Optional[mk.ExpertTokensMetadata], + apply_router_weight_on_input: bool, + extra_expert_args: Optional[dict[str, Any]]): use_deep_gemm = (self.allow_deep_gemm and (_valid_deep_gemm(hidden_states, w1, w2) or is_blackwell_deep_gemm_used())) @@ -168,4 +158,5 @@ def apply( workspace2, expert_tokens_meta, apply_router_weight_on_input, + extra_expert_args, ) diff --git a/vllm/model_executor/layers/fused_moe/utils.py b/vllm/model_executor/layers/fused_moe/utils.py index c120d964b3c..966471b5c59 100644 --- a/vllm/model_executor/layers/fused_moe/utils.py +++ b/vllm/model_executor/layers/fused_moe/utils.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from math import prod -from typing import Optional, Union +from typing import Any, Optional, Union import torch @@ -15,6 +15,7 @@ from vllm.platforms import current_platform from vllm.triton_utils import tl, triton from vllm.utils import cdiv +from vllm.utils.flashinfer import fp4_quantize @triton.jit @@ -98,6 +99,16 @@ def _resize_cache(x: torch.Tensor, v: tuple[int, ...]) -> torch.Tensor: return x.flatten()[:prod(v)].view(*v) +def _fp4_quantize( + A: torch.Tensor, + A_scale: Optional[torch.Tensor], + is_sf_swizzled_layout: bool, +) -> tuple[torch.Tensor, torch.Tensor]: + return fp4_quantize(A, + A_scale, + is_sf_swizzled_layout=is_sf_swizzled_layout) + + def _fp8_quantize( A: torch.Tensor, A_scale: Optional[torch.Tensor], @@ -172,11 +183,16 @@ def moe_kernel_quantize_input( quant_dtype: Union[None, torch.dtype, str], per_act_token_quant: bool, block_shape: Optional[list[int]] = None, + is_fp4_scale_swizzled: bool = True, ) -> tuple[torch.Tensor, Optional[torch.Tensor]]: if quant_dtype == torch.float8_e4m3fn: return _fp8_quantize(A, A_scale, per_act_token_quant, block_shape) elif quant_dtype == torch.int8: return _int8_quantize(A, A_scale, per_act_token_quant, block_shape) + elif quant_dtype == torch.uint8: # nvfp4 + return _fp4_quantize(A, + A_scale, + is_sf_swizzled_layout=is_fp4_scale_swizzled) elif quant_dtype == "mxfp4": return _mxfp4_quantize(A, A_scale, per_act_token_quant, block_shape) else: @@ -236,3 +252,17 @@ def _validate_scale_shape( assert block_shape is not None expected = (a.shape[0], cdiv(a.shape[1], block_shape[1])) assert a_scale.shape == expected, f"{a_scale.shape} == {expected}" + + +def extract_required_args( + extra_args: Optional[dict[str, Any]], + required_keys: list[str], +) -> tuple[Any, ...]: + if 
extra_args is None: + raise ValueError("`extra_args` must be provided.") + + missing_keys = [k for k in required_keys if k not in extra_args] + if missing_keys: + raise ValueError(f"Missing keys in `extra_args`: {missing_keys}") + + return tuple(extra_args[k] for k in required_keys) diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py index fcf8ea023f6..1a31410c338 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py @@ -339,19 +339,19 @@ def apply( return cutlass_moe_fp4( a=x, w1_fp4=layer.w13_weight, - w1_blockscale=layer.w13_blockscale_swizzled, - w1_alphas=layer.g1_alphas, w2_fp4=layer.w2_weight, + w1_blockscale=layer.w13_blockscale_swizzled, w2_blockscale=layer.w2_blockscale_swizzled, - w2_alphas=layer.g2_alphas, + g1_alphas=layer.g1_alphas, + g2_alphas=layer.g2_alphas, + a1_gscale=layer.w13_input_scale_quant, + a2_gscale=layer.w2_input_scale_quant, topk_weights=topk_weights, topk_ids=topk_ids, m=x.shape[0], n=layer.w2_weight.shape[2] * 2, k=x.shape[1], e=layer.w13_weight.shape[0], - a1_gscale=layer.w13_input_scale_quant, - a2_gscale=layer.w2_input_scale_quant, device=x.device, apply_router_weight_on_input=apply_router_weight_on_input).to( x.dtype) diff --git a/vllm/model_executor/layers/quantization/modelopt.py b/vllm/model_executor/layers/quantization/modelopt.py index 788f0a9116f..3807899fc3e 100644 --- a/vllm/model_executor/layers/quantization/modelopt.py +++ b/vllm/model_executor/layers/quantization/modelopt.py @@ -7,9 +7,15 @@ from torch.nn import Module from torch.nn.parameter import Parameter +import vllm.envs as envs +import vllm.model_executor.layers.fused_moe.modular_kernel as mk from vllm._custom_ops import (cutlass_scaled_fp4_mm, cutlass_scaled_mm_supports_fp4, scaled_fp4_quant) +from vllm.distributed import get_ep_group from vllm.logger import init_logger +from vllm.model_executor.layers.fused_moe.config import FusedMoEParallelConfig +from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize import ( # noqa: E501 + FlashInferCutlassMoEPrepareAndFinalize) from vllm.model_executor.layers.fused_moe.layer import ( FusedMoE, FusedMoEMethodBase, FusedMoeWeightScaleSupported) from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase, @@ -713,6 +719,18 @@ def __init__(self, quant_config: ModelOptNvFp4Config): self.quant_config = quant_config self.cutlass_nvfp4_supported = cutlass_fp4_supported() self.use_marlin = False + self.allow_flashinfer_cutlass = False + + if envs.VLLM_USE_FLASHINFER_MOE: + if self.cutlass_nvfp4_supported and current_platform.is_cuda() \ + and current_platform.is_device_capability(100): + logger.info_once( + "Using FlashInfer kernels for ModelOptNvFp4FusedMoE.") + self.allow_flashinfer_cutlass = True + else: + logger.warning_once( + "Flashinfer CUTLASS Fused MoE not supported " + "or found on the current platform.") if not self.cutlass_nvfp4_supported: if is_fp4_marlin_supported(): @@ -722,6 +740,73 @@ def __init__(self, quant_config: ModelOptNvFp4Config): " quantization. 
Please use Blackwell and" " above.") + self.fused_experts = None # type: ignore + + def maybe_swap_experts_impl( + self, + moe_parallel_config: FusedMoEParallelConfig, + ): + if not self.allow_flashinfer_cutlass: + return + + logger.debug_once("FlashInferExperts") + # default to TP/EP case only + + experts_kwargs: dict[str, Any] = { + "use_nvfp4_w4a4": True, + "use_dp": moe_parallel_config.dp_size > 1, + "ep_rank": moe_parallel_config.ep_rank, + "ep_size": moe_parallel_config.ep_size, + "tp_rank": moe_parallel_config.tp_rank, + "tp_size": moe_parallel_config.tp_size, + } + + from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import ( # noqa: E501 + FlashInferExperts) + experts = FlashInferExperts(**experts_kwargs) + self.fused_experts = mk.FusedMoEModularKernel( + FlashInferCutlassMoEPrepareAndFinalize( + quant_dtype=torch.uint8, + #meaning 2x e2m1 packed in one, kernel requirement + ), + experts, + ) + + # This method update self.fused_experts + # only prepare_finalize is not None call select_gemm_impl + # so when native cutlass fp4, fused_expert is in fuse_moe.py fused_expert + # when it's not called(TP case), we still have 2 kernels to use. + def select_gemm_impl(self, prepare_finalize, + moe) -> mk.FusedMoEPermuteExpertsUnpermute: + + assert moe is not None + assert prepare_finalize is not None + experts = None + all2all_manager = get_ep_group().device_communicator.all2all_manager + assert all2all_manager is not None + if self.allow_flashinfer_cutlass: + from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import ( # noqa: E501 + FlashInferExperts) + logger.debug_once("Using FlashInferExperts") + experts = FlashInferExperts( + use_nvfp4_w4a4=True, + use_dp=moe.moe_parallel_config.dp_size > 1, + ep_rank=moe.moe_parallel_config.ep_rank, + ep_size=moe.moe_parallel_config.ep_size, + tp_rank=moe.moe_parallel_config.tp_rank, + tp_size=moe.moe_parallel_config.tp_size, + ) + else: + assert moe.dp_size > 1 + logger.debug_once("Using CutlassExpertsFp4") + # Currently CutlassExpertsFp4 doesn't support DP + raise ValueError( + "CutlassExpertsFp4 doesn't support DP. " + "Use flashinfer CUTLASS FusedMoE(VLLM_USE_FLASHINFER_MOE)" + " backend instead.") + + return experts + def uses_weight_scale_2_pattern(self) -> bool: """ FP4 variants use 'weight_scale_2' pattern for per-tensor weight scales. @@ -842,8 +927,30 @@ def swizzle_blockscale(self, scale: torch.tensor): if scale_ndim == 2 else swizzled_scale.reshape(B, M, K)) def process_weights_after_loading(self, layer: torch.nn.Module) -> None: - # GEMM 1 + # The FlashInfer Cutlass fused MoE kernel expects the combined weights + # to be ordered as [w3, w1], unlike the standard [w1, w3] layout. 
+ gemm1_weight = layer.w13_weight.data + gemm1_weight_scale = layer.w13_weight_scale.data + + if self.allow_flashinfer_cutlass: + dim = -2 + size = gemm1_weight.size(dim) + assert size % 2 == 0, f"Expected even size in dim {dim}, got {size}" + half = size // 2 + + # Reorder weight + w1, w3 = gemm1_weight.split(half, dim=dim) + gemm1_weight = torch.cat([w3, w1], dim=dim).contiguous() + + # Reorder scale + s1, s3 = gemm1_weight_scale.split(half, dim=dim) + gemm1_weight_scale = torch.cat([s3, s1], dim=dim).contiguous() + + layer.w13_weight = Parameter(gemm1_weight, requires_grad=False) + layer.w13_weight_scale = Parameter(gemm1_weight_scale, + requires_grad=False) + if not torch.allclose(layer.w13_weight_scale_2[:, 0], layer.w13_weight_scale_2[:, 1]): logger.warning_once( @@ -874,9 +981,6 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None: layer.w13_input_scale_quant = Parameter( (1 / w13_input_scale).to(torch.float32), requires_grad=False) - layer.w13_weight = Parameter(layer.w13_weight.data, - requires_grad=False) - # GEMM 2 layer.g2_alphas = Parameter( (layer.w2_input_scale * layer.w2_weight_scale_2).to(torch.float32), @@ -961,31 +1065,74 @@ def apply( global_num_experts=global_num_experts, expert_map=expert_map) - assert expert_map is None, ("Expert Parallelism / expert_map " - "is currently not supported for " - "ModelOptNvFp4FusedMoE.") - - from vllm.model_executor.layers.fused_moe.cutlass_moe import ( - cutlass_moe_fp4) - - # Cutlass moe takes in activations in BF16/Half precision - # and fp4 quantized weights loaded from the checkpoint - return cutlass_moe_fp4( - a=x, - w1_fp4=layer.w13_weight, - w1_blockscale=layer.w13_blockscale_swizzled, - w1_alphas=layer.g1_alphas, - w2_fp4=layer.w2_weight, - w2_blockscale=layer.w2_blockscale_swizzled, - w2_alphas=layer.g2_alphas, - topk_weights=topk_weights, - topk_ids=topk_ids, - m=x.shape[0], - n=layer.w2_weight.shape[2] * 2, - k=x.shape[1], - e=layer.w13_weight.shape[0], - a1_gscale=layer.w13_input_scale_quant, - a2_gscale=layer.w2_input_scale_quant, - device=x.device, - apply_router_weight_on_input=apply_router_weight_on_input).to( - x.dtype) + if self.fused_experts is None: + # If no modular kernel is provided, use cutlass_moe_fp4 for TP case + # only (no EP). + from vllm.model_executor.layers.fused_moe.cutlass_moe import ( + cutlass_moe_fp4) + out = cutlass_moe_fp4( + a=x, + w1_fp4=layer.w13_weight, + w2_fp4=layer.w2_weight, + w1_blockscale=layer.w13_blockscale_swizzled, + w2_blockscale=layer.w2_blockscale_swizzled, + g1_alphas=layer.g1_alphas, + g2_alphas=layer.g2_alphas, + a1_gscale=layer.w13_input_scale_quant, + a2_gscale=layer.w2_input_scale_quant, + topk_weights=topk_weights, + topk_ids=topk_ids, + m=x.shape[0], + n=layer.w2_weight.shape[2] * 2, + k=x.shape[1], + e=layer.w13_weight.shape[0], + device=x.device, + expert_map=expert_map, + apply_router_weight_on_input=apply_router_weight_on_input) + else: + # TP or DP case + from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import ( # noqa: E501 + is_valid_flashinfer_cutlass_fused_moe) + assert is_valid_flashinfer_cutlass_fused_moe( + x, layer.w13_weight, layer.w2_weight), ( + "Flashinfer CUTLASS Fused MoE not applicable!") + + a1_gscale = torch.min(layer.w13_input_scale_quant) + a2_gscale = torch.min(layer.w2_input_scale_quant) + extra_expert_args = { + 'g1_alphas': layer.g1_alphas, + 'g2_alphas': layer.g2_alphas, + 'out_dtype': x.dtype, + # Avoid confusion with a1_scale and a2_scale + # where are batch size related. 
+ 'a1_gscale': a1_gscale, + 'a2_gscale': a2_gscale, + } + extra_prepare_args = { + 'use_dp': layer.dp_size > 1, + 'local_tokens': x.shape[0], + 'a1_gscale': a1_gscale, + } + extra_finalize_args = { + 'use_dp': layer.dp_size > 1, + 'local_tokens': x.shape[0], + } + + out = self.fused_experts( + hidden_states=x, + w1=layer.w13_weight, + w2=layer.w2_weight, + topk_weights=topk_weights, + topk_ids=topk_ids, + inplace=False, # TODO(shuw): fix later, now output is high prec + activation=activation, + global_num_experts=global_num_experts, + expert_map=expert_map, + w1_scale=layer.w13_blockscale_swizzled, + w2_scale=layer.w2_blockscale_swizzled, + apply_router_weight_on_input=apply_router_weight_on_input, + extra_expert_args=extra_expert_args, + extra_prepare_args=extra_prepare_args, + extra_finalize_args=extra_finalize_args, + ) + return out diff --git a/vllm/utils/flashinfer.py b/vllm/utils/flashinfer.py new file mode 100644 index 00000000000..dbd2dc39304 --- /dev/null +++ b/vllm/utils/flashinfer.py @@ -0,0 +1,107 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Compatibility wrapper for FlashInfer API changes. + +Users of vLLM should always import **only** these wrappers. +""" +from __future__ import annotations + +import contextlib +import functools +import importlib +import importlib.util +from typing import Any, Callable, NoReturn + +from vllm.logger import init_logger + +logger = init_logger(__name__) + + +@functools.cache +def has_flashinfer() -> bool: + """Return ``True`` if FlashInfer is available.""" + # Use find_spec to check if the module exists without importing it + # This avoids potential CUDA initialization side effects + return importlib.util.find_spec("flashinfer") is not None + + +def _missing(*_: Any, **__: Any) -> NoReturn: + """Placeholder for unavailable FlashInfer backend.""" + raise RuntimeError( + "FlashInfer backend is not available. 
Please install the package " + "to enable FlashInfer kernels: " + "https://github.com/flashinfer-ai/flashinfer") + + +def _get_submodule(module_name: str) -> Any | None: + """Safely import a submodule and return it, or None if not available.""" + try: + return importlib.import_module(module_name) + except (ImportError, ModuleNotFoundError): + return None + + +# General lazy import wrapper +def _lazy_import_wrapper(module_name: str, + attr_name: str, + fallback_fn: Callable[..., Any] = _missing): + """Create a lazy import wrapper for a specific function.""" + + @functools.cache + def _get_impl(): + if not has_flashinfer(): + return None + mod = _get_submodule(module_name) + return getattr(mod, attr_name, None) if mod else None + + def wrapper(*args, **kwargs): + impl = _get_impl() + if impl is None: + return fallback_fn(*args, **kwargs) + return impl(*args, **kwargs) + + return wrapper + + +# Create lazy wrappers for each function +flashinfer_cutlass_fused_moe = _lazy_import_wrapper("flashinfer.fused_moe", + "cutlass_fused_moe") +fp4_quantize = _lazy_import_wrapper("flashinfer", "fp4_quantize") +fp4_swizzle_blockscale = _lazy_import_wrapper("flashinfer", + "fp4_swizzle_blockscale") + +# Special case for autotune since it returns a context manager +autotune = _lazy_import_wrapper( + "flashinfer.autotuner", + "autotune", + fallback_fn=lambda *args, **kwargs: contextlib.nullcontext()) + + +@functools.cache +def has_flashinfer_cutlass_fused_moe() -> bool: + """Return ``True`` if FlashInfer CUTLASS fused MoE is available.""" + if not has_flashinfer(): + return False + + # Check if all required functions are available + required_functions = [ + ("flashinfer.fused_moe", "cutlass_fused_moe"), + ("flashinfer", "fp4_quantize"), + ("flashinfer", "fp4_swizzle_blockscale"), + ] + + for module_name, attr_name in required_functions: + mod = _get_submodule(module_name) + if not mod or not hasattr(mod, attr_name): + return False + return True + + +__all__ = [ + "has_flashinfer", + "has_flashinfer_cutlass_fused_moe", + "flashinfer_cutlass_fused_moe", + "fp4_quantize", + "fp4_swizzle_blockscale", + "autotune", +] From 83754ca3d4eb3d8a5cb665686ea735abe9f665c8 Mon Sep 17 00:00:00 2001 From: shixianc <49539556+shixianc@users.noreply.github.com> Date: Thu, 17 Jul 2025 21:34:43 -0700 Subject: [PATCH 169/552] [Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm (#20911) Signed-off-by: Shixian Cui Co-authored-by: Shixian Cui Signed-off-by: x22x22 --- .../cutlass_w8a8/moe/grouped_mm_c3x.cu | 49 +++++++++---- .../cutlass_w8a8/moe/grouped_mm_c3x.cuh | 67 ++++++++++++------ .../quantization/cutlass_w8a8/moe/moe_data.cu | 68 +++++++++++++------ tests/kernels/moe/test_cutlass_moe.py | 1 + 4 files changed, 135 insertions(+), 50 deletions(-) diff --git a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu index c88e134ae40..b024482208d 100644 --- a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu +++ b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu @@ -29,19 +29,36 @@ struct sm90_fp8_config_default { template typename Epilogue> -struct sm90_fp8_config_M16 { - // M in [1, 16] +struct sm90_fp8_config_M4 { + // M in [1, 4] static_assert(std::is_same()); using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8FastAccum; using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; - using TileShape = cute::Shape; - using ClusterShape = cute::Shape; + using TileShape = cute::Shape; + using 
ClusterShape = cute::Shape; using Cutlass3xGemm = cutlass_3x_group_gemm; + KernelSchedule, EpilogueSchedule, true>; +}; + +template typename Epilogue> +struct sm90_fp8_config_M64 { + // M in (4, 64] + static_assert(std::is_same()); + using KernelSchedule = + cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8FastAccum; + using EpilogueSchedule = + cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; + using TileShape = cute::Shape; + using ClusterShape = cute::Shape; + + using Cutlass3xGemm = + cutlass_3x_group_gemm; }; template ::Cutlass3xGemm; using Cutlass3xGemmK8192 = typename sm90_fp8_config_K8192< InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; - using Cutlass3xGemmM16 = typename sm90_fp8_config_M16< + using Cutlass3xGemmM4 = typename sm90_fp8_config_M4< + InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; + using Cutlass3xGemmM64 = typename sm90_fp8_config_M64< InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; using Cutlass3xGemmDefault = typename sm90_fp8_config_default< InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; @@ -111,18 +130,24 @@ void run_cutlass_moe_mm_sm90( uint32_t const n = out_tensors.size(1); uint32_t const k = a_tensors.size(1); - if (n >= 8192) { - cutlass_group_gemm_caller( + // Use swap_ab for M <= 64 by default to reduce padding + if (m <= 4) { + cutlass_group_gemm_caller( out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes, a_strides, b_strides, c_strides, per_act_token, per_out_ch); - } else if (k >= 8192) { - cutlass_group_gemm_caller( + } else if (m <= 64) { + cutlass_group_gemm_caller( out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes, a_strides, b_strides, c_strides, per_act_token, per_out_ch); - } else if (m <= 16) { - cutlass_group_gemm_caller( + } else if (n >= 8192) { + cutlass_group_gemm_caller( + out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, + problem_sizes, a_strides, b_strides, c_strides, per_act_token, + per_out_ch); + } else if (k >= 8192) { + cutlass_group_gemm_caller( out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes, a_strides, b_strides, c_strides, per_act_token, per_out_ch); diff --git a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cuh b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cuh index bbd82d72e95..3225378a6ca 100644 --- a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cuh +++ b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cuh @@ -22,14 +22,23 @@ using ArchTag = cutlass::arch::Sm90; using OperatorClass = cutlass::arch::OpClassTensorOp; using LayoutA = cutlass::layout::RowMajor; +using LayoutA_Transpose = + typename cutlass::layout::LayoutTranspose::type; using LayoutB = cutlass::layout::ColumnMajor; -using LayoutC = cutlass::layout::RowMajor; +using LayoutB_Transpose = + typename cutlass::layout::LayoutTranspose::type; +using LayoutD = cutlass::layout::RowMajor; +using LayoutD_Transpose = + typename cutlass::layout::LayoutTranspose::type; +using LayoutC = LayoutD; +using LayoutC_Transpose = LayoutD_Transpose; template typename Epilogue_, typename TileShape, typename ClusterShape, typename KernelSchedule, - typename EpilogueSchedule> + typename EpilogueSchedule, bool swap_ab_ = false> struct cutlass_3x_group_gemm { + static constexpr bool swap_ab = swap_ab_; using ElementAB = ElementAB_; using ElementC = void; using ElementD = ElementC_; @@ -37,9 +46,6 @@ struct cutlass_3x_group_gemm { using Epilogue = Epilogue_; - using 
StrideC = - cute::remove_pointer_t, cute::Int<0>>>; - static constexpr int AlignmentAB = 128 / cutlass::sizeof_bits::value; static constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; @@ -50,19 +56,26 @@ struct cutlass_3x_group_gemm { typename cutlass::epilogue::collective::CollectiveBuilder< ArchTag, OperatorClass, TileShape, ClusterShape, cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator, - ElementAccumulator, ElementC, LayoutC*, AlignmentC, ElementD, - LayoutC*, AlignmentC, EpilogueSchedule, EVTCompute>::CollectiveOp; + ElementAccumulator, ElementC, + conditional_t, AlignmentC, + ElementD, conditional_t, + AlignmentC, EpilogueSchedule, EVTCompute>::CollectiveOp; static constexpr size_t CEStorageSize = sizeof(typename CollectiveEpilogue::SharedStorage); using Stages = typename cutlass::gemm::collective::StageCountAutoCarveout< static_cast(CEStorageSize)>; - using CollectiveMainloop = + using CollectiveMainloop = conditional_t< + swap_ab, + typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, ElementAB, LayoutB_Transpose*, AlignmentAB, + ElementAB, LayoutA_Transpose*, AlignmentAB, ElementAccumulator, + TileShape, ClusterShape, Stages, KernelSchedule>::CollectiveOp, typename cutlass::gemm::collective::CollectiveBuilder< ArchTag, OperatorClass, ElementAB, LayoutA*, AlignmentAB, ElementAB, LayoutB*, AlignmentAB, ElementAccumulator, TileShape, ClusterShape, - Stages, KernelSchedule>::CollectiveOp; + Stages, KernelSchedule>::CollectiveOp>; using KernelType = enable_sm90_only>; @@ -78,12 +91,12 @@ void cutlass_group_gemm_caller( torch::Tensor const& problem_sizes, torch::Tensor const& a_strides, torch::Tensor const& b_strides, torch::Tensor const& c_strides, bool per_act_token, bool per_out_ch) { + static constexpr bool swap_ab = Gemm::swap_ab; + using ElementAB = typename Gemm::ElementAB; using ElementD = typename Gemm::ElementD; int num_experts = static_cast(expert_offsets.size(0)); - int k_size = a_tensors.size(1); - int n_size = out_tensors.size(1); auto stream = at::cuda::getCurrentCUDAStream(a_tensors.device().index()); @@ -110,19 +123,35 @@ void cutlass_group_gemm_caller( problem_sizes.data_ptr()); ProblemShape prob_shape{num_experts, problem_sizes_as_shapes, nullptr}; - typename GemmKernel::MainloopArguments mainloop_args{ - static_cast(a_ptrs.data_ptr()), - static_cast(a_strides.data_ptr()), - static_cast(b_ptrs.data_ptr()), - static_cast(b_strides.data_ptr())}; + typename GemmKernel::MainloopArguments mainloop_args; + if constexpr (swap_ab) { + mainloop_args = typename GemmKernel::MainloopArguments{ + static_cast(b_ptrs.data_ptr()), + static_cast(b_strides.data_ptr()), + static_cast(a_ptrs.data_ptr()), + static_cast(a_strides.data_ptr())}; + } else { + mainloop_args = typename GemmKernel::MainloopArguments{ + static_cast(a_ptrs.data_ptr()), + static_cast(a_strides.data_ptr()), + static_cast(b_ptrs.data_ptr()), + static_cast(b_strides.data_ptr())}; + } // Currently, we are only able to do broadcast on either all or none a_scales // and on either all or none b_scales typename GemmKernel::EpilogueArguments epilogue_args{ Gemm::Epilogue::prepare_args( - static_cast(a_scales_ptrs.data_ptr()), - static_cast(b_scales_ptrs.data_ptr()), - per_act_token, per_out_ch), + swap_ab ? static_cast( + b_scales_ptrs.data_ptr()) + : static_cast( + a_scales_ptrs.data_ptr()), + swap_ab ? static_cast( + a_scales_ptrs.data_ptr()) + : static_cast( + b_scales_ptrs.data_ptr()), + swap_ab ? per_out_ch : per_act_token, + swap_ab ? 
per_act_token : per_out_ch), nullptr, static_cast(c_strides.data_ptr()), static_cast(out_ptrs.data_ptr()), static_cast(c_strides.data_ptr())}; diff --git a/csrc/quantization/cutlass_w8a8/moe/moe_data.cu b/csrc/quantization/cutlass_w8a8/moe/moe_data.cu index 80c6589ab17..623c9a2f096 100644 --- a/csrc/quantization/cutlass_w8a8/moe/moe_data.cu +++ b/csrc/quantization/cutlass_w8a8/moe/moe_data.cu @@ -6,7 +6,10 @@ #include constexpr uint64_t THREADS_PER_EXPERT = 512; +// threshold must match the dispatch logic in run_cutlass_moe_mm_sm90() +constexpr int SWAP_AB_THRESHOLD = 64; +template __global__ void compute_problem_sizes(const int32_t* __restrict__ topk_ids, int32_t* problem_sizes1, int32_t* problem_sizes2, @@ -24,40 +27,53 @@ __global__ void compute_problem_sizes(const int32_t* __restrict__ topk_ids, if (threadIdx.x == 0) { int final_occurrences = atomic_buffer[expert_id]; - problem_sizes1[expert_id * 3] = final_occurrences; - problem_sizes1[expert_id * 3 + 1] = 2 * n; - problem_sizes1[expert_id * 3 + 2] = k; - problem_sizes2[expert_id * 3] = final_occurrences; - problem_sizes2[expert_id * 3 + 1] = k; - problem_sizes2[expert_id * 3 + 2] = n; + if constexpr (!SWAP_AB) { + problem_sizes1[expert_id * 3] = final_occurrences; + problem_sizes1[expert_id * 3 + 1] = 2 * n; + problem_sizes1[expert_id * 3 + 2] = k; + problem_sizes2[expert_id * 3] = final_occurrences; + problem_sizes2[expert_id * 3 + 1] = k; + problem_sizes2[expert_id * 3 + 2] = n; + } else { + problem_sizes1[expert_id * 3] = 2 * n; + problem_sizes1[expert_id * 3 + 1] = final_occurrences; + problem_sizes1[expert_id * 3 + 2] = k; + problem_sizes2[expert_id * 3] = k; + problem_sizes2[expert_id * 3 + 1] = final_occurrences; + problem_sizes2[expert_id * 3 + 2] = n; + } } } __global__ void compute_expert_offsets( const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets, - int32_t* atomic_buffer, const int num_experts) { + int32_t* atomic_buffer, const int num_experts, const int topk_length) { int32_t tot_offset = 0; expert_offsets[0] = 0; for (int i = 0; i < num_experts; ++i) { atomic_buffer[i] = tot_offset; - tot_offset += problem_sizes1[i * 3]; + tot_offset += topk_length > SWAP_AB_THRESHOLD ? problem_sizes1[i * 3] + : problem_sizes1[i * 3 + 1]; expert_offsets[i + 1] = tot_offset; } } __global__ void compute_expert_blockscale_offsets( const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets, - int32_t* blockscale_offsets, int32_t* atomic_buffer, - const int num_experts) { + int32_t* blockscale_offsets, int32_t* atomic_buffer, const int num_experts, + const int topk_length) { int32_t tot_offset = 0; int32_t tot_offset_round = 0; expert_offsets[0] = 0; blockscale_offsets[0] = 0; for (int i = 0; i < num_experts; ++i) { + int32_t cur_offset = topk_length > SWAP_AB_THRESHOLD + ? 
problem_sizes1[i * 3] + : problem_sizes1[i * 3 + 1]; atomic_buffer[i] = tot_offset; - tot_offset += problem_sizes1[i * 3]; + tot_offset += cur_offset; expert_offsets[i + 1] = tot_offset; - tot_offset_round += (problem_sizes1[i * 3] + (128 - 1)) / 128 * 128; + tot_offset_round += (cur_offset + (128 - 1)) / 128 * 128; blockscale_offsets[i + 1] = tot_offset_round; } } @@ -102,22 +118,36 @@ void get_cutlass_moe_mm_data_caller( torch::Tensor atomic_buffer = torch::zeros(num_experts, options_int32); int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel()); - compute_problem_sizes<<>>( - static_cast(topk_ids.data_ptr()), - static_cast(problem_sizes1.data_ptr()), - static_cast(problem_sizes2.data_ptr()), - static_cast(atomic_buffer.data_ptr()), topk_ids.numel(), n, k); + + if (topk_ids.numel() > SWAP_AB_THRESHOLD) { + compute_problem_sizes<<>>( + static_cast(topk_ids.data_ptr()), + static_cast(problem_sizes1.data_ptr()), + static_cast(problem_sizes2.data_ptr()), + static_cast(atomic_buffer.data_ptr()), topk_ids.numel(), n, + k); + } else { + compute_problem_sizes<<>>( + static_cast(topk_ids.data_ptr()), + static_cast(problem_sizes1.data_ptr()), + static_cast(problem_sizes2.data_ptr()), + static_cast(atomic_buffer.data_ptr()), topk_ids.numel(), n, + k); + } + if (blockscale_offsets.has_value()) { compute_expert_blockscale_offsets<<<1, 1, 0, stream>>>( static_cast(problem_sizes1.data_ptr()), static_cast(expert_offsets.data_ptr()), static_cast(blockscale_offsets.value().data_ptr()), - static_cast(atomic_buffer.data_ptr()), num_experts); + static_cast(atomic_buffer.data_ptr()), num_experts, + topk_ids.numel()); } else { compute_expert_offsets<<<1, 1, 0, stream>>>( static_cast(problem_sizes1.data_ptr()), static_cast(expert_offsets.data_ptr()), - static_cast(atomic_buffer.data_ptr()), num_experts); + static_cast(atomic_buffer.data_ptr()), num_experts, + topk_ids.numel()); } compute_arg_sorts<<>>( static_cast(topk_ids.data_ptr()), diff --git a/tests/kernels/moe/test_cutlass_moe.py b/tests/kernels/moe/test_cutlass_moe.py index 5fb49c2da4f..37727b75b07 100644 --- a/tests/kernels/moe/test_cutlass_moe.py +++ b/tests/kernels/moe/test_cutlass_moe.py @@ -25,6 +25,7 @@ (2, 1024, 1536), (2, 3072, 1024), (2, 3072, 1536), + (7, 3072, 1536), (64, 1024, 1024), (64, 1024, 1536), (64, 3072, 1024), From 32e1fa5a40ebb4587e0085a667f2393de25f6499 Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Thu, 17 Jul 2025 21:57:02 -0700 Subject: [PATCH 170/552] [Misc] Do not print async output warning for v1 (#21151) Signed-off-by: Woosuk Kwon Signed-off-by: x22x22 --- vllm/platforms/cuda.py | 2 +- vllm/platforms/rocm.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index 03f0c15270b..240724a675a 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -99,7 +99,7 @@ def get_device_total_memory(cls, device_id: int = 0) -> int: @classmethod def is_async_output_supported(cls, enforce_eager: Optional[bool]) -> bool: - if enforce_eager: + if enforce_eager and not envs.VLLM_USE_V1: logger.warning( "To see benefits of async output processing, enable CUDA " "graph. 
Since, enforce-eager is enabled, async output " diff --git a/vllm/platforms/rocm.py b/vllm/platforms/rocm.py index 04637f5c7aa..e9e18d3fe8e 100644 --- a/vllm/platforms/rocm.py +++ b/vllm/platforms/rocm.py @@ -299,7 +299,7 @@ def get_device_total_memory(cls, device_id: int = 0) -> int: @classmethod def is_async_output_supported(cls, enforce_eager: Optional[bool]) -> bool: - if enforce_eager: + if enforce_eager and not envs.VLLM_USE_V1: logger.warning( "To see benefits of async output processing, enable CUDA " "graph. Since, enforce-eager is enabled, async output " From cbfeff1c42bc22441d026e0f8f319616c78a29e4 Mon Sep 17 00:00:00 2001 From: Jialin Ouyang Date: Thu, 17 Jul 2025 23:22:08 -0700 Subject: [PATCH 171/552] [benchmark] Sending request strictly follows the random intervals (#21108) Signed-off-by: Jialin Ouyang Signed-off-by: x22x22 --- vllm/benchmarks/serve.py | 57 ++++++++++++++++++++++++++++------------ 1 file changed, 40 insertions(+), 17 deletions(-) diff --git a/vllm/benchmarks/serve.py b/vllm/benchmarks/serve.py index 8b16fea9e3d..a4d51936320 100644 --- a/vllm/benchmarks/serve.py +++ b/vllm/benchmarks/serve.py @@ -138,31 +138,54 @@ async def get_request( input_requests = list(input_requests) total_requests = len(input_requests) - request_index = 0 + assert total_requests > 0, "No requests provided." - for request in input_requests: + # Precompute delays among requests to minimize request send laggings + request_rates = [] + delay_ts = [] + for request_index, request in enumerate(input_requests): current_request_rate = _get_current_request_rate(ramp_up_strategy, ramp_up_start_rps, ramp_up_end_rps, request_index, total_requests, request_rate) - - yield request, current_request_rate - - request_index += 1 - + request_rates.append(current_request_rate) if current_request_rate == float("inf"): - # If the request rate is infinity, then we don't need to wait. - continue - - theta = 1.0 / (current_request_rate * burstiness) - - # Sample the request interval from the gamma distribution. - # If burstiness is 1, it follows exponential distribution. - interval = np.random.gamma(shape=burstiness, scale=theta) - # The next request will be sent after the interval. - await asyncio.sleep(interval) + delay_ts.append(0) + else: + theta = 1.0 / (current_request_rate * burstiness) + + # Sample the request interval from the gamma distribution. + # If burstiness is 1, it follows exponential distribution. + delay_ts.append(np.random.gamma(shape=burstiness, scale=theta)) + + # Calculate the cumulative delay time from the first sent out requests. + for i in range(1, len(delay_ts)): + delay_ts[i] += delay_ts[i - 1] + if ramp_up_strategy is None and delay_ts[-1] != 0: + # When ramp_up_strategy is not set, we assume the request rate is fixed + # and all requests should be sent in target_total_delay_s, the following + # logic would re-scale delay time to ensure the final delay_ts + # align with target_total_delay_s. + # + # NOTE: If we simply accumulate the random delta values + # from the gamma distribution, their sum would have 1-2% gap + # from target_total_delay_s. The purpose of the following logic is to + # close the gap for stablizing the throughput data + # from different random seeds. 
+ target_total_delay_s = total_requests / request_rate + normalize_factor = target_total_delay_s / delay_ts[-1] + delay_ts = [delay * normalize_factor for delay in delay_ts] + + start_ts = time.time() + request_index = 0 + for request_index, request in enumerate(input_requests): + current_ts = time.time() + sleep_interval_s = start_ts + delay_ts[request_index] - current_ts + if sleep_interval_s > 0: + await asyncio.sleep(sleep_interval_s) + yield request, request_rates[request_index] def calculate_metrics( From 6332719bc330dc024158050967af91f26ba42a82 Mon Sep 17 00:00:00 2001 From: Roger Wang Date: Fri, 18 Jul 2025 00:13:57 -0700 Subject: [PATCH 172/552] [Misc] Make MM embedding merge interface explicit in model runner (#21147) Signed-off-by: Roger Wang Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/v1/worker/gpu_model_runner.py | 9 ++++----- vllm/v1/worker/tpu_model_runner.py | 9 ++++----- 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index fc7f2538881..60fb78c060c 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -1328,11 +1328,10 @@ def execute_model( # embeddings), we always use embeddings (rather than token ids) # as input to the multimodal model, even when the input is text. input_ids = self.input_ids[:num_scheduled_tokens] - if mm_embeds: - inputs_embeds = self.model.get_input_embeddings( - input_ids, mm_embeds) - else: - inputs_embeds = self.model.get_input_embeddings(input_ids) + inputs_embeds = self.model.get_input_embeddings( + input_ids=input_ids, + multimodal_embeddings=mm_embeds or None, + ) # TODO(woosuk): Avoid the copy. Optimize. self.inputs_embeds[:num_scheduled_tokens].copy_(inputs_embeds) inputs_embeds = self.inputs_embeds[:num_input_tokens] diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index ad62d204381..8565df42973 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -937,11 +937,10 @@ def _get_model_inputs(self, input_ids: torch.Tensor, # NOTE(woosuk): To unify token ids and soft tokens (vision # embeddings), we always use embeddings (rather than token ids) # as input to the multimodal model, even when the input is text. - if mm_embeds: - inputs_embeds = self.model.get_input_embeddings( - input_ids, mm_embeds) - else: - inputs_embeds = self.model.get_input_embeddings(input_ids) + inputs_embeds = self.model.get_input_embeddings( + input_ids=input_ids, + multimodal_embeddings=mm_embeds, + ) return None, inputs_embeds else: # For text-only models, we use token ids as input. 
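
A minimal sketch of the embedding-merge contract the runner change above leans on: a single `get_input_embeddings` call that takes the token ids plus an optional `multimodal_embeddings` argument and simply returns the text-token embeddings when that argument is None. The class, names, and shapes below are illustrative assumptions for this note only, not vLLM's actual model interface.

    from typing import Optional

    import torch
    import torch.nn as nn


    class ToyMultimodalModel(nn.Module):
        """Hypothetical model exposing the unified embedding entry point."""

        def __init__(self, vocab_size: int = 1024, hidden_size: int = 16):
            super().__init__()
            self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

        def get_input_embeddings(
            self,
            input_ids: torch.Tensor,
            multimodal_embeddings: Optional[list[torch.Tensor]] = None,
        ) -> torch.Tensor:
            # Always start from the text-token embeddings.
            inputs_embeds = self.embed_tokens(input_ids)
            if multimodal_embeddings:
                # A real model would scatter these rows into the placeholder
                # token positions; the point here is only that the merge
                # happens inside the model, so the runner needs one call site.
                for emb in multimodal_embeddings:
                    assert emb.shape[-1] == inputs_embeds.shape[-1]
            return inputs_embeds


    if __name__ == "__main__":
        model = ToyMultimodalModel()
        ids = torch.randint(0, 1024, (8,))
        # Text-only and multimodal requests now share the same call shape.
        out = model.get_input_embeddings(input_ids=ids,
                                         multimodal_embeddings=None)
        print(out.shape)  # torch.Size([8, 16])
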
From cfcc945f52b235bdbdb9c9c405089c0b3b7122b0 Mon Sep 17 00:00:00 2001 From: "wang.yuqi" Date: Fri, 18 Jul 2025 15:15:07 +0800 Subject: [PATCH 173/552] [Model] Re-add the implicit conversion feature for as_seq_cls_model (#21103) Signed-off-by: wang.yuqi Signed-off-by: x22x22 --- tests/models/registry.py | 32 ++++++++++------ tests/models/test_initialization.py | 29 ++++++++++---- tests/models/test_transformers.py | 35 +++++++++++++++++ vllm/config.py | 46 ++++++++++++----------- vllm/model_executor/model_loader/utils.py | 30 +++++++++++++-- vllm/model_executor/models/adapters.py | 15 +++++--- vllm/model_executor/models/gemma.py | 4 -- vllm/model_executor/models/llama.py | 4 -- vllm/model_executor/models/qwen2.py | 4 -- vllm/model_executor/models/qwen3.py | 4 -- vllm/model_executor/models/registry.py | 37 ++++++++++++++---- 11 files changed, 165 insertions(+), 75 deletions(-) diff --git a/tests/models/registry.py b/tests/models/registry.py index 2adfa859a1c..56ae501021f 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -265,7 +265,6 @@ def check_available_online( "Qwen2MoeForCausalLM": _HfExamplesInfo("Qwen/Qwen1.5-MoE-A2.7B-Chat"), "Qwen3ForCausalLM": _HfExamplesInfo("Qwen/Qwen3-8B"), "Qwen3MoeForCausalLM": _HfExamplesInfo("Qwen/Qwen3-30B-A3B"), - "Qwen3ForSequenceClassification": _HfExamplesInfo("tomaarsen/Qwen3-Reranker-0.6B-seq-cls"), # noqa: E501 "RWForCausalLM": _HfExamplesInfo("tiiuae/falcon-40b"), "StableLMEpochForCausalLM": _HfExamplesInfo("stabilityai/stablelm-zephyr-3b"), # noqa: E501 "StableLmForCausalLM": _HfExamplesInfo("stabilityai/stablelm-3b-4e1t"), @@ -292,7 +291,6 @@ def check_available_online( # [Text-only] "BertModel": _HfExamplesInfo("BAAI/bge-base-en-v1.5", v0_only=True), "Gemma2Model": _HfExamplesInfo("BAAI/bge-multilingual-gemma2", v0_only=True), # noqa: E501 - "GPT2ForSequenceClassification": _HfExamplesInfo("nie3e/sentiment-polish-gpt2-small"), # noqa: E501 "GritLM": _HfExamplesInfo("parasail-ai/GritLM-7B-vllm"), "GteModel": _HfExamplesInfo("Snowflake/snowflake-arctic-embed-m-v2.0", trust_remote_code=True), @@ -311,7 +309,6 @@ def check_available_online( "Qwen2Model": _HfExamplesInfo("ssmits/Qwen2-7B-Instruct-embed-base"), "Qwen2ForRewardModel": _HfExamplesInfo("Qwen/Qwen2.5-Math-RM-72B"), "Qwen2ForProcessRewardModel": _HfExamplesInfo("Qwen/Qwen2.5-Math-PRM-7B"), - "Qwen2ForSequenceClassification": _HfExamplesInfo("jason9693/Qwen2.5-1.5B-apeach"), # noqa: E501 "RobertaModel": _HfExamplesInfo("sentence-transformers/stsb-roberta-base-v2", v0_only=True), # noqa: E501 "RobertaForMaskedLM": _HfExamplesInfo("sentence-transformers/all-roberta-large-v1", v0_only=True), # noqa: E501 "XLMRobertaModel": _HfExamplesInfo("intfloat/multilingual-e5-small", v0_only=True), # noqa: E501 @@ -324,20 +321,29 @@ def check_available_online( is_available_online=False), # noqa: E501 } -_CROSS_ENCODER_EXAMPLE_MODELS = { - # [Text-only] +_SEQUENCE_CLASSIFICATION_EXAMPLE_MODELS = { + # [Decoder-only] + "GPT2ForSequenceClassification": _HfExamplesInfo("nie3e/sentiment-polish-gpt2-small"), # noqa: E501 + + # [Cross-encoder] "BertForSequenceClassification": _HfExamplesInfo("cross-encoder/ms-marco-MiniLM-L-6-v2", v0_only=True), # noqa: E501 - "GemmaForSequenceClassification": _HfExamplesInfo("BAAI/bge-reranker-v2-gemma", # noqa: E501 - v0_only=True, - hf_overrides={"architectures": ["GemmaForSequenceClassification"], # noqa: E501 - "classifier_from_token": ["Yes"], # noqa: E501 - "method": "no_post_processing"}), # noqa: E501 - "LlamaForSequenceClassification": 
_HfExamplesInfo("Skywork/Skywork-Reward-V2-Llama-3.2-1B"), # noqa: E501 "ModernBertForSequenceClassification": _HfExamplesInfo("Alibaba-NLP/gte-reranker-modernbert-base", v0_only=True), # noqa: E501 "RobertaForSequenceClassification": _HfExamplesInfo("cross-encoder/quora-roberta-base", v0_only=True), # noqa: E501 "XLMRobertaForSequenceClassification": _HfExamplesInfo("BAAI/bge-reranker-v2-m3", v0_only=True), # noqa: E501 } +_AUTOMATIC_CONVERTED_MODELS = { + # Use as_seq_cls_model for automatic conversion + "GemmaForSequenceClassification": _HfExamplesInfo("BAAI/bge-reranker-v2-gemma", # noqa: E501 + v0_only=True, + hf_overrides={"architectures": ["GemmaForSequenceClassification"], # noqa: E501 + "classifier_from_token": ["Yes"], # noqa: E501 + "method": "no_post_processing"}), # noqa: E501 + "LlamaForSequenceClassification": _HfExamplesInfo("Skywork/Skywork-Reward-V2-Llama-3.2-1B"), # noqa: E501 + "Qwen2ForSequenceClassification": _HfExamplesInfo("jason9693/Qwen2.5-1.5B-apeach"), # noqa: E501 + "Qwen3ForSequenceClassification": _HfExamplesInfo("tomaarsen/Qwen3-Reranker-0.6B-seq-cls"), # noqa: E501 +} + _MULTIMODAL_EXAMPLE_MODELS = { # [Decoder-only] "AriaForConditionalGeneration": _HfExamplesInfo("rhymes-ai/Aria"), @@ -449,6 +455,7 @@ def check_available_online( "JinaVLForRanking": _HfExamplesInfo("jinaai/jina-reranker-m0"), # noqa: E501 } + _SPECULATIVE_DECODING_EXAMPLE_MODELS = { "EAGLEModel": _HfExamplesInfo("JackFram/llama-68m", speculative_model="abhigoyal/vllm-eagle-llama-68m-random"), # noqa: E501 @@ -489,7 +496,7 @@ def check_available_online( _EXAMPLE_MODELS = { **_TEXT_GENERATION_EXAMPLE_MODELS, **_EMBEDDING_EXAMPLE_MODELS, - **_CROSS_ENCODER_EXAMPLE_MODELS, + **_SEQUENCE_CLASSIFICATION_EXAMPLE_MODELS, **_MULTIMODAL_EXAMPLE_MODELS, **_SPECULATIVE_DECODING_EXAMPLE_MODELS, **_TRANSFORMERS_MODELS, @@ -522,3 +529,4 @@ def find_hf_info(self, model_id: str) -> _HfExamplesInfo: HF_EXAMPLE_MODELS = HfExampleModels(_EXAMPLE_MODELS) +AUTO_EXAMPLE_MODELS = HfExampleModels(_AUTOMATIC_CONVERTED_MODELS) diff --git a/tests/models/test_initialization.py b/tests/models/test_initialization.py index 52005e74ef7..14d243012b2 100644 --- a/tests/models/test_initialization.py +++ b/tests/models/test_initialization.py @@ -13,20 +13,21 @@ from vllm.v1.engine.core import EngineCore as V1EngineCore from ..utils import create_new_process_for_each_test -from .registry import HF_EXAMPLE_MODELS +from .registry import AUTO_EXAMPLE_MODELS, HF_EXAMPLE_MODELS, HfExampleModels -@pytest.mark.parametrize("model_arch", HF_EXAMPLE_MODELS.get_supported_archs()) @create_new_process_for_each_test() -def test_can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch): - """The reason for using create_new_process_for_each_test is to avoid - the WARNING: - "We must use the 'spawn' multiprocessing start method. Overriding +def can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch, + EXAMPLE_MODELS: HfExampleModels): + """The reason for using create_new_process_for_each_test is to avoid + the WARNING: + "We must use the 'spawn' multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'." - The spawn process causes the _initialize_kv_caches_v1 function below to + The spawn process causes the _initialize_kv_caches_v1 function below to become ineffective. 
""" - model_info = HF_EXAMPLE_MODELS.get_hf_info(model_arch) + + model_info = EXAMPLE_MODELS.get_hf_info(model_arch) model_info.check_available_online(on_fail="skip") model_info.check_transformers_version(on_fail="skip") @@ -127,3 +128,15 @@ def _initialize_kv_caches_v1(self, vllm_config): load_format="dummy", hf_overrides=hf_overrides, ) + + +@pytest.mark.parametrize("model_arch", HF_EXAMPLE_MODELS.get_supported_archs()) +def test_can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch): + can_initialize(model_arch, monkeypatch, HF_EXAMPLE_MODELS) + + +@pytest.mark.parametrize("model_arch", + AUTO_EXAMPLE_MODELS.get_supported_archs()) +def test_implicit_converted_models(model_arch: str, + monkeypatch: pytest.MonkeyPatch): + can_initialize(model_arch, monkeypatch, AUTO_EXAMPLE_MODELS) diff --git a/tests/models/test_transformers.py b/tests/models/test_transformers.py index b7b99ce41cb..b87290e96a2 100644 --- a/tests/models/test_transformers.py +++ b/tests/models/test_transformers.py @@ -138,3 +138,38 @@ def test_quantization( name_0="transformers", name_1="vllm", ) + + +@pytest.mark.parametrize( + "model", + ["jason9693/Qwen2.5-1.5B-apeach"], +) +@pytest.mark.parametrize("dtype", ["half"]) +def test_classify( + hf_runner, + vllm_runner, + example_prompts, + model: str, + dtype: str, + monkeypatch, +) -> None: + import torch + from transformers import AutoModelForSequenceClassification + + with vllm_runner(model, + max_model_len=512, + dtype=dtype, + model_impl="transformers") as vllm_model: + vllm_outputs = vllm_model.classify(example_prompts) + + with hf_runner(model, + dtype=dtype, + auto_cls=AutoModelForSequenceClassification) as hf_model: + hf_outputs = hf_model.classify(example_prompts) + + for hf_output, vllm_output in zip(hf_outputs, vllm_outputs): + hf_output = torch.tensor(hf_output) + vllm_output = torch.tensor(vllm_output) + + assert torch.allclose(hf_output, vllm_output, + 1e-3 if dtype == "float" else 1e-2) diff --git a/vllm/config.py b/vllm/config.py index 41997488fa6..075aae9467c 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -551,7 +551,7 @@ def __post_init__(self) -> None: # For pooling models, self.task is used to indicate the # user-selected task if self.task == "score": - if self.registry.is_cross_encoder_model(self.architectures): + if self._is_classify_task(self.architectures): self.task = "classify" else: self.task = "embed" @@ -806,6 +806,12 @@ def _verify_tokenizer_mode(self) -> None: f"one of {get_args(TokenizerMode)}.") self.tokenizer_mode = tokenizer_mode + def _is_classify_task(self, architectures: list[str]): + for arch in architectures: + if arch.endswith("ForSequenceClassification"): + return True + return self.registry.is_cross_encoder_model(architectures) + def _get_preferred_pooling_task( self, architectures: list[str], @@ -813,14 +819,11 @@ def _get_preferred_pooling_task( model_id = self.model if get_pooling_config(model_id, self.revision): return "embed" - if self.registry.is_cross_encoder_model(architectures): - return "classify" if self.registry.is_transcription_model(architectures): return "transcription" suffix_to_preferred_task: list[tuple[str, _ResolvedTask]] = [ # Other models follow this pattern - ("ForSequenceClassification", "classify"), ("EmbeddingModel", "embed"), ("RewardModel", "reward"), ] @@ -878,11 +881,14 @@ def _get_supported_tasks( self, task_option: TaskOption, ) -> dict[RunnerType, list[_ResolvedTask]]: - return { - "generate": self._get_supported_generation_tasks(task_option), - "pooling": 
self._get_supported_pooling_tasks(task_option), - "draft": ["draft"] - } + if self._is_classify_task(self.architectures): + return {"generate": [], "pooling": ["classify"], "draft": []} + else: + return { + "generate": self._get_supported_generation_tasks(task_option), + "pooling": self._get_supported_pooling_tasks(task_option), + "draft": ["draft"] + } def _get_supported_runner_types( self, @@ -925,12 +931,16 @@ def _resolve_runner( f"Available tasks for runner={task_runner!r}: " f"{supported_tasks[task_runner]}") + if "classify" in supported_tasks.get("pooling", []): + # When multiple pooling tasks are present, default to + # pooling (eg cross-encoder) for non-standard architectures. + return "pooling" + suffix_to_preferred_runner: list[tuple[str, RunnerType]] = [ ("ForCausalLM", "generate"), ("ForConditionalGeneration", "generate"), ("ChatModel", "generate"), ("LMHeadModel", "generate"), - ("ForSequenceClassification", "pooling"), ("EmbeddingModel", "pooling"), ("RewardModel", "pooling"), ] @@ -940,10 +950,6 @@ def _resolve_runner( if arch.endswith(suffix) and pref_runner in supported_runner_types: return pref_runner - if "classify" in supported_tasks.get("pooling", []): - # When multiple pooling tasks are present, default to - # pooling (eg cross-encoder) for non-standard architectures. - return "pooling" if "generate" in supported_runner_types: return "generate" if "pooling" in supported_runner_types: @@ -1525,7 +1531,7 @@ def is_v1_compatible(self) -> bool: @property def is_matryoshka(self) -> bool: - return (hasattr(self.hf_config, "matryoshka_dimensions") + return (bool(getattr(self.hf_config, "matryoshka_dimensions", None)) or getattr(self.hf_config, "is_matryoshka", False)) @property @@ -1539,13 +1545,11 @@ def use_pad_token(self) -> bool: return getattr(self.hf_config, "use_pad_token", True) def get_and_verify_max_len(self, max_model_len: int): - # For pooling models, the tokenizer's `model_max_length` is often a - # reliable source for the maximum sequence length. However, for - # generative models, this can be incorrect and unduly limit the - # context window (e.g., DeepSeek-R1). Therefore, we only consider - # tokenizer_config for pooling models. + # Consider max_model_len in tokenizer_config only when + # pooling models use absolute position_embedding. 
tokenizer_config = None - if self.runner_type == "pooling": + if (self.runner_type == "pooling" and getattr( + self.hf_config, "position_embedding_type", "") == "absolute"): tokenizer_config = try_get_tokenizer_config( self.tokenizer, trust_remote_code=self.trust_remote_code, diff --git a/vllm/model_executor/model_loader/utils.py b/vllm/model_executor/model_loader/utils.py index 8e5f332ba7c..190d1f006bc 100644 --- a/vllm/model_executor/model_loader/utils.py +++ b/vllm/model_executor/model_loader/utils.py @@ -22,7 +22,8 @@ QuantizationConfig, QuantizeMethodBase) from vllm.model_executor.models import ModelRegistry from vllm.model_executor.models.adapters import (as_embedding_model, - as_reward_model) + as_reward_model, + as_seq_cls_model) from vllm.model_executor.models.interfaces import SupportsQuant from vllm.utils import is_pin_memory_available @@ -238,9 +239,29 @@ def get_model_architecture( vllm_supported_archs = ModelRegistry.get_supported_archs() vllm_not_supported = not any(arch in vllm_supported_archs for arch in architectures) + + if vllm_not_supported: + # try automatic conversion in adapters.py + for arch in architectures: + if not arch.endswith("ForSequenceClassification"): + continue + + assert model_config.task == "classify" + causal_lm_arch = arch.replace("ForSequenceClassification", + "ForCausalLM") + causal_lm_arch_vllm_supported = (causal_lm_arch + in vllm_supported_archs) + if not causal_lm_arch_vllm_supported: + continue + + architectures = [causal_lm_arch] + vllm_not_supported = False + break + if (model_config.model_impl == ModelImpl.TRANSFORMERS or model_config.model_impl != ModelImpl.VLLM and vllm_not_supported): architectures = resolve_transformers_arch(model_config, architectures) + logger.debug_once("Resolve transformers arch %s", str(architectures)) elif (model_config.quantization is not None and model_config.quantization not in mixtral_supported and "MixtralForCausalLM" in architectures): @@ -248,12 +269,13 @@ def get_model_architecture( model_cls, arch = ModelRegistry.resolve_model_cls(architectures) if model_config.task == "embed": + logger.debug_once("Automatic conversion using `as_embedding_model`.") model_cls = as_embedding_model(model_cls) elif model_config.task == "classify": - # Cannot automatically run as_seq_cls_model, - # otherwise it will cause a circular reference on is_cross_encoder_model - pass + logger.debug_once("Automatic conversion using `as_seq_cls_model`.") + model_cls = as_seq_cls_model(model_cls) elif model_config.task == "reward": + logger.debug_once("Automatic conversion using `as_reward_model`.") model_cls = as_reward_model(model_cls) return model_cls, arch diff --git a/vllm/model_executor/models/adapters.py b/vllm/model_executor/models/adapters.py index f319c0c4441..31b1d9a8b3c 100644 --- a/vllm/model_executor/models/adapters.py +++ b/vllm/model_executor/models/adapters.py @@ -331,13 +331,13 @@ def load_weights_using_from_2_way_softmax( false_id = tokenizer.convert_tokens_to_ids(tokens[0]) true_id = tokenizer.convert_tokens_to_ids(tokens[1]) - weight = model.lm_head.weight.data[[true_id]].to( + score_weight = model.lm_head.weight.data[[true_id]].to( torch.float32) - model.lm_head.weight.data[[false_id]].to( torch.float32) param = model.score.weight weight_loader = getattr(param, "weight_loader", default_weight_loader) - weight_loader(param, weight) + weight_loader(param, score_weight) del model.lm_head loaded_weights.add("score.weight") @@ -350,6 +350,8 @@ def load_weights_no_post_processing(model, torch.Tensor]]): from 
vllm.model_executor.layers.vocab_parallel_embedding import ( ParallelLMHead) + from vllm.model_executor.model_loader.weight_utils import ( + default_weight_loader) from vllm.model_executor.models.utils import AutoWeightsLoader model_config = model.vllm_config.model_config @@ -357,8 +359,6 @@ def load_weights_no_post_processing(model, tokens = cast(list[int], tokens) assert len(tokens) > 0 - device = model.score.weight.device - if model.config.tie_word_embeddings: model.lm_head = model.model.embed_tokens else: @@ -376,8 +376,11 @@ def load_weights_no_post_processing(model, trust_remote_code=model_config.trust_remote_code) token_ids = [tokenizer.convert_tokens_to_ids(t) for t in tokens] - score_weight = model.lm_head.weight.data[token_ids].to(device) - model.score.weight.data.copy_(score_weight) + score_weight = model.lm_head.weight.data[token_ids] + + param = model.score.weight + weight_loader = getattr(param, "weight_loader", default_weight_loader) + weight_loader(param, score_weight) del model.lm_head loaded_weights.add("score.weight") diff --git a/vllm/model_executor/models/gemma.py b/vllm/model_executor/models/gemma.py index bc8179f886f..59c3102add4 100644 --- a/vllm/model_executor/models/gemma.py +++ b/vllm/model_executor/models/gemma.py @@ -43,7 +43,6 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .adapters import as_seq_cls_model from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, @@ -426,6 +425,3 @@ def load_weights(self, weights: Iterable[tuple[str, if self.config.tie_word_embeddings else None), ) return loader.load_weights(weights) - - -GemmaForSequenceClassification = as_seq_cls_model(GemmaForCausalLM) diff --git a/vllm/model_executor/models/llama.py b/vllm/model_executor/models/llama.py index 2434ac9d205..48ec611df12 100644 --- a/vllm/model_executor/models/llama.py +++ b/vllm/model_executor/models/llama.py @@ -49,7 +49,6 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .adapters import as_seq_cls_model from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, PPMissingLayer, extract_layer_index, is_pp_missing_parameter, @@ -646,6 +645,3 @@ def permute(w: torch.Tensor, n_heads: int): name = name.replace(item, mapping[item]) return name, loaded_weight - - -LlamaForSequenceClassification = as_seq_cls_model(LlamaForCausalLM) diff --git a/vllm/model_executor/models/qwen2.py b/vllm/model_executor/models/qwen2.py index 7ef9d248da4..23f65b99c22 100644 --- a/vllm/model_executor/models/qwen2.py +++ b/vllm/model_executor/models/qwen2.py @@ -50,7 +50,6 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .adapters import as_seq_cls_model from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, PPMissingLayer, extract_layer_index, is_pp_missing_parameter, @@ -496,6 +495,3 @@ def load_weights(self, weights: Iterable[tuple[str, if self.config.tie_word_embeddings else None), ) return loader.load_weights(weights) - - -Qwen2ForSequenceClassification = as_seq_cls_model(Qwen2ForCausalLM) diff --git a/vllm/model_executor/models/qwen3.py b/vllm/model_executor/models/qwen3.py index de99a76f289..393ce41a91a 100644 --- a/vllm/model_executor/models/qwen3.py +++ b/vllm/model_executor/models/qwen3.py @@ -44,7 +44,6 @@ 
from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .adapters import as_seq_cls_model from .interfaces import SupportsLoRA, SupportsPP from .qwen2 import Qwen2MLP as Qwen3MLP from .qwen2 import Qwen2Model @@ -320,6 +319,3 @@ def load_weights(self, weights: Iterable[tuple[str, if self.config.tie_word_embeddings else None), ) return loader.load_weights(weights) - - -Qwen3ForSequenceClassification = as_seq_cls_model(Qwen3ForCausalLM) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 52fdb910891..fd831727ab2 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -12,7 +12,7 @@ import tempfile from abc import ABC, abstractmethod from collections.abc import Set -from dataclasses import dataclass, field +from dataclasses import asdict, dataclass, field from functools import lru_cache from typing import Callable, Optional, TypeVar, Union @@ -181,10 +181,6 @@ "ModernBertForSequenceClassification": ("modernbert", "ModernBertForSequenceClassification"), # [Auto-converted (see adapters.py)] - "GemmaForSequenceClassification": ("gemma", "GemmaForSequenceClassification"), # noqa: E501 - "Qwen2ForSequenceClassification": ("qwen2", "Qwen2ForSequenceClassification"), # noqa: E501 - "Qwen3ForSequenceClassification": ("qwen3", "Qwen3ForSequenceClassification"), # noqa: E501 - "LlamaForSequenceClassification": ("llama", "LlamaForSequenceClassification"), # noqa: E501 "JinaVLForRanking": ("jina_vl", "JinaVLForSequenceClassification"), # noqa: E501, } @@ -462,10 +458,26 @@ def _try_load_model_cls(self, return _try_load_model_cls(model_arch, self.models[model_arch]) def _try_inspect_model_cls(self, model_arch: str) -> Optional[_ModelInfo]: - if model_arch not in self.models: - return None + if model_arch in self.models: + return _try_inspect_model_cls(model_arch, self.models[model_arch]) + + if model_arch.endswith("ForSequenceClassification"): + causal_lm_arch = model_arch.replace("ForSequenceClassification", + "ForCausalLM") + if causal_lm_arch not in self.models: + return None + + info = _try_inspect_model_cls(causal_lm_arch, + self.models[causal_lm_arch]) - return _try_inspect_model_cls(model_arch, self.models[model_arch]) + info = _ModelInfo(**dict( + asdict(info), **{ + "architecture": model_arch, + "supports_cross_encoding": True + })) + return info + + return None def _normalize_archs( self, @@ -480,6 +492,15 @@ def _normalize_archs( normalized_arch = list( filter(lambda model: model in self.models, architectures)) + # try automatic conversion in adapters.py + for arch in architectures: + if not arch.endswith("ForSequenceClassification"): + continue + causal_lm_arch = arch.replace("ForSequenceClassification", + "ForCausalLM") + if causal_lm_arch in self.models: + normalized_arch.append(arch) + # make sure Transformers backend is put at the last as a fallback if len(normalized_arch) != len(architectures): normalized_arch.append("TransformersForCausalLM") From 80141408a486f511e2d14e065dc6bdd18c6eefc3 Mon Sep 17 00:00:00 2001 From: "wang.yuqi" Date: Fri, 18 Jul 2025 17:10:47 +0800 Subject: [PATCH 174/552] [Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. 
(#20750) Signed-off-by: wang.yuqi Signed-off-by: x22x22 --- tests/tokenization/test_do_lower_case.py | 18 ++++++++++++++++++ vllm/transformers_utils/tokenizer.py | 14 ++++++++++++++ 2 files changed, 32 insertions(+) create mode 100644 tests/tokenization/test_do_lower_case.py diff --git a/tests/tokenization/test_do_lower_case.py b/tests/tokenization/test_do_lower_case.py new file mode 100644 index 00000000000..7aa655e1c3b --- /dev/null +++ b/tests/tokenization/test_do_lower_case.py @@ -0,0 +1,18 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import pytest + +from vllm.transformers_utils.tokenizer import get_tokenizer + +TOKENIZER_NAMES = ["BAAI/bge-base-en"] + + +@pytest.mark.parametrize("tokenizer_name", TOKENIZER_NAMES) +@pytest.mark.parametrize("n_tokens", [510]) +def test_special_tokens(tokenizer_name: str, n_tokens: int): + tokenizer = get_tokenizer(tokenizer_name, revision="main") + + prompts = '[UNK]' * n_tokens + prompt_token_ids = tokenizer.encode(prompts) + assert len(prompt_token_ids) == n_tokens + 2 diff --git a/vllm/transformers_utils/tokenizer.py b/vllm/transformers_utils/tokenizer.py index 01d1769f0e5..25dd71d877f 100644 --- a/vllm/transformers_utils/tokenizer.py +++ b/vllm/transformers_utils/tokenizer.py @@ -16,6 +16,8 @@ from vllm import envs from vllm.logger import init_logger +from vllm.transformers_utils.config import ( + get_sentence_transformer_tokenizer_config) from vllm.transformers_utils.tokenizers import MistralTokenizer from vllm.transformers_utils.utils import check_gguf_file from vllm.utils import make_async @@ -256,6 +258,18 @@ def get_tokenizer( else: raise e + # The special_tokens in tokenizer should also be + # controlled by do_lower_case in encoder_config + encoder_config = get_sentence_transformer_tokenizer_config( + tokenizer_name, revision) + if isinstance(encoder_config, dict) and encoder_config.get( + "do_lower_case", False): + special_tokens_map = { + k: v.lower() + for k, v in tokenizer.special_tokens_map.items() + } + tokenizer.add_special_tokens(special_tokens_map) + # NOTE: We can remove this after https://github.com/THUDM/ChatGLM3/issues/1324 if type(tokenizer).__name__ in ("ChatGLMTokenizer", "ChatGLM4Tokenizer"): From d76405d2742c5b7b85f9d3b0db8275776122749e Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Fri, 18 Jul 2025 18:55:10 +0800 Subject: [PATCH 175/552] [Doc] Fix typo in model name (#21178) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- docs/models/supported_models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index e7ceca81087..de95e2f21ce 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -578,7 +578,7 @@ Specified using `--task generate`. | `FuyuForCausalLM` | Fuyu | T + I | `adept/fuyu-8b`, etc. | | ✅︎ | ✅︎ | | `Gemma3ForConditionalGeneration` | Gemma 3 | T + I+ | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ | ⚠️ | | `GLM4VForCausalLM`^ | GLM-4V | T + I | `THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220`, etc. | ✅︎ | ✅︎ | ✅︎ | -| `Glm4vForConditionalGeneration` | GLM-4.1V-Thinking | T + IE+ + VE+ | `THUDM/GLM-4.1V-9B-Thinkg`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `Glm4vForConditionalGeneration` | GLM-4.1V-Thinking | T + IE+ + VE+ | `THUDM/GLM-4.1V-9B-Thinking`, etc. 
| ✅︎ | ✅︎ | ✅︎ | | `GraniteSpeechForConditionalGeneration` | Granite Speech | T + A | `ibm-granite/granite-speech-3.3-8b` | ✅︎ | ✅︎ | ✅︎ | | `H2OVLChatModel` | H2OVL | T + IE+ | `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc. | | ✅︎ | ✅︎ | | `Idefics3ForConditionalGeneration` | Idefics3 | T + I | `HuggingFaceM4/Idefics3-8B-Llama3`, etc. | ✅︎ | | ✅︎ | From 11098c0357dde6437abf1d683f235aa7c8ec4d54 Mon Sep 17 00:00:00 2001 From: ElizaWszola Date: Fri, 18 Jul 2025 12:55:52 +0200 Subject: [PATCH 176/552] [Bugfix] Allocate less memory in non-batched CUTLASS MoE (#21121) Signed-off-by: ElizaWszola Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/cutlass_moe.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/cutlass_moe.py b/vllm/model_executor/layers/fused_moe/cutlass_moe.py index facc01a5ba8..ff49d7bb780 100644 --- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py +++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py @@ -283,8 +283,8 @@ def workspace_shapes( (N // 2)) output = (self.max_experts_per_worker, padded_M, K) else: - workspace1 = (M * topk, max(2 * N, K)) - workspace2 = (M * topk, N) + workspace1 = (M * topk, max(N, K)) + workspace2 = (M * topk, N // 2) output = (M * topk, K) return (workspace1, workspace2, output, self.out_dtype if self.out_dtype is not None else a.dtype) From 20a43f6fab232b3578775d47a6e02a98f61c8e8b Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Fri, 18 Jul 2025 20:41:17 +0800 Subject: [PATCH 177/552] [Core] Set pooling params based on task and model (#21128) Signed-off-by: DarkLight1337 --- tests/models/language/pooling/test_gritlm.py | 26 ++- vllm/entrypoints/llm.py | 49 +++-- vllm/entrypoints/openai/protocol.py | 8 +- .../openai/serving_classification.py | 32 +++ vllm/entrypoints/openai/serving_embedding.py | 20 +- vllm/entrypoints/openai/serving_engine.py | 18 +- vllm/entrypoints/openai/serving_pooling.py | 5 + vllm/entrypoints/openai/serving_score.py | 30 ++- vllm/executor/executor_base.py | 7 + vllm/model_executor/layers/pooler.py | 149 +++++++++----- vllm/model_executor/models/bert.py | 12 +- vllm/model_executor/models/gritlm.py | 185 +++++++++++------- vllm/model_executor/models/interfaces.py | 7 - vllm/model_executor/models/modernbert.py | 12 +- vllm/pooling_params.py | 41 ++-- vllm/v1/engine/core.py | 6 + vllm/v1/worker/cpu_model_runner.py | 4 - vllm/v1/worker/gpu_input_batch.py | 19 +- vllm/v1/worker/gpu_model_runner.py | 48 ++++- vllm/v1/worker/gpu_worker.py | 4 + vllm/v1/worker/tpu_model_runner.py | 14 +- vllm/v1/worker/tpu_worker.py | 4 + vllm/worker/model_runner_base.py | 14 +- vllm/worker/pooling_model_runner.py | 16 +- 24 files changed, 499 insertions(+), 231 deletions(-) diff --git a/tests/models/language/pooling/test_gritlm.py b/tests/models/language/pooling/test_gritlm.py index c2f70bb647a..1274657991b 100644 --- a/tests/models/language/pooling/test_gritlm.py +++ b/tests/models/language/pooling/test_gritlm.py @@ -2,9 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from __future__ import annotations -import importlib.util -from array import array - +import numpy as np import openai import pytest from scipy.spatial.distance import cosine @@ -14,10 +12,6 @@ from ....utils import RemoteOpenAIServer -# GritLM embedding implementation is only supported by XFormers backend. 
-pytestmark = pytest.mark.skipif(not importlib.util.find_spec("xformers"), - reason="GritLM requires XFormers") - MODEL_NAME = "parasail-ai/GritLM-7B-vllm" MAX_MODEL_LEN = 4000 @@ -26,11 +20,11 @@ def _arr(arr): """ Convert a list of integers to an array of integers. """ - return array("i", arr) + return np.array(arr) def test_find_array(): - from vllm.model_executor.models.gritlm import GritLMPooler + from vllm.model_executor.models.gritlm import GritLMMeanPool model_config = ModelConfig( MODEL_NAME, @@ -41,17 +35,19 @@ def test_find_array(): dtype="bfloat16", seed=0, ) - pooler = GritLMPooler(model_config=model_config) + pooling = GritLMMeanPool(model_config=model_config) arr = _arr([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) - assert pooler._find_array(arr, _arr([3, 4, 5]), start_idx=0) == 3 - assert pooler._find_array(arr, _arr([3, 4, 5]), start_idx=1) == 3 - assert pooler._find_array(arr, _arr([3, 4, 5]), start_idx=5) == -1 - assert pooler._find_array(arr, _arr([3, 5]), start_idx=0) == -1 + assert pooling._find_array(arr, _arr([3, 4, 5]), start_idx=0) == 3 + assert pooling._find_array(arr, _arr([3, 4, 5]), start_idx=1) == 3 + assert pooling._find_array(arr, _arr([3, 4, 5]), start_idx=5) == -1 + assert pooling._find_array(arr, _arr([3, 4, 5]), end_idx=3) == -1 + assert pooling._find_array(arr, _arr([3, 4, 5]), end_idx=4) == 3 + assert pooling._find_array(arr, _arr([3, 5]), start_idx=0) == -1 with pytest.raises(ValueError): - pooler._find_array(arr, _arr([3, 4, 5]), start_idx=-1) + pooling._find_array(arr, _arr([3, 4, 5]), start_idx=-1) def run_llm_encode( diff --git a/vllm/entrypoints/llm.py b/vllm/entrypoints/llm.py index e7398ecc23c..78f9d32d811 100644 --- a/vllm/entrypoints/llm.py +++ b/vllm/entrypoints/llm.py @@ -44,7 +44,7 @@ from vllm.outputs import (ClassificationRequestOutput, EmbeddingRequestOutput, PoolingRequestOutput, RequestOutput, ScoringRequestOutput) -from vllm.pooling_params import PoolingParams +from vllm.pooling_params import PoolingParams, PoolingTask from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import (BeamSearchParams, GuidedDecodingParams, RequestOutputKind, SamplingParams) @@ -964,6 +964,7 @@ def encode( use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, + pooling_task: PoolingTask = "encode", ) -> list[PoolingRequestOutput]: ... @@ -979,6 +980,7 @@ def encode( use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, + pooling_task: PoolingTask = "encode", ) -> list[PoolingRequestOutput]: ... @@ -994,6 +996,7 @@ def encode( use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, + pooling_task: PoolingTask = "encode", ) -> list[PoolingRequestOutput]: ... @@ -1010,6 +1013,7 @@ def encode( use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, + pooling_task: PoolingTask = "encode", ) -> list[PoolingRequestOutput]: ... 
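The encode() overloads in this hunk all gain the same `pooling_task` keyword. As a rough usage sketch only (assuming an embedding model such as `intfloat/e5-mistral-7b-instruct` is loaded; any pooling-capable model would do):

    from vllm import LLM

    llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

    # pooling_task tells PoolingParams.verify() which task to pin for the
    # request; the embed()/classify() helpers below forward "embed" /
    # "classify" here automatically, so most callers never set it directly.
    outputs = llm.encode(
        ["vLLM pools hidden states for embedding models."],
        pooling_task="embed",
    )
    print(outputs[0].outputs.data.shape)  # pooled embedding for the prompt
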
@@ -1026,6 +1030,7 @@ def encode( use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, + pooling_task: PoolingTask = "encode", ) -> list[PoolingRequestOutput]: ... @@ -1040,6 +1045,7 @@ def encode( use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, + pooling_task: PoolingTask = "encode", ) -> list[PoolingRequestOutput]: ... @@ -1059,6 +1065,7 @@ def encode( use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, + pooling_task: PoolingTask = "encode", ) -> list[PoolingRequestOutput]: """Apply pooling to the hidden states corresponding to the input prompts. @@ -1080,6 +1087,7 @@ def encode( lora_request: LoRA request to use for generation, if any. prompt_adapter_request: Prompt Adapter request to use for generation, if any. + pooling_task: Override the pooling task to use. Returns: A list of `PoolingRequestOutput` objects containing the @@ -1116,11 +1124,12 @@ def encode( if pooling_params is None: # Use default pooling params. pooling_params = PoolingParams() - elif isinstance(pooling_params, PoolingParams): - pooling_params.verify(model_config) + + if isinstance(pooling_params, PoolingParams): + pooling_params.verify(pooling_task, model_config) else: for pooling_param in pooling_params: - pooling_param.verify(model_config) + pooling_param.verify(pooling_task, model_config) tokenization_kwargs = dict[str, Any]() _validate_truncation_size(model_config.max_model_len, @@ -1181,12 +1190,15 @@ def embed( raise ValueError("Embedding API is not supported by this model. " "Please set `--task embed`.") - items = self.encode(prompts, - truncate_prompt_tokens=truncate_prompt_tokens, - use_tqdm=use_tqdm, - pooling_params=pooling_params, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + items = self.encode( + prompts, + truncate_prompt_tokens=truncate_prompt_tokens, + use_tqdm=use_tqdm, + pooling_params=pooling_params, + lora_request=lora_request, + prompt_adapter_request=prompt_adapter_request, + pooling_task="embed", + ) return [EmbeddingRequestOutput.from_base(item) for item in items] @@ -1228,10 +1240,13 @@ def classify( "Classification API is not supported by this model. 
" "Please set `--task classify`.") - items = self.encode(prompts, - use_tqdm=use_tqdm, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + items = self.encode( + prompts, + use_tqdm=use_tqdm, + lora_request=lora_request, + prompt_adapter_request=prompt_adapter_request, + pooling_task="classify", + ) return [ClassificationRequestOutput.from_base(item) for item in items] @@ -1251,7 +1266,9 @@ def _embedding_score( truncate_prompt_tokens=truncate_prompt_tokens, use_tqdm=use_tqdm, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + prompt_adapter_request=prompt_adapter_request, + pooling_task="embed", + ) encoded_output_1: list[PoolingRequestOutput] = encoded_output[ 0:len(text_1)] @@ -1287,7 +1304,7 @@ def _cross_encoding_score( if len(data_1) == 1: data_1 = data_1 * len(data_2) - pooling_params = PoolingParams(use_cross_encoder=True) + pooling_params = PoolingParams(task="score") tokenization_kwargs: dict[str, Any] = {} _validate_truncation_size(self.llm_engine.model_config.max_model_len, truncate_prompt_tokens, tokenization_kwargs) diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py index a421ed1fc32..95e5bcd3bae 100644 --- a/vllm/entrypoints/openai/protocol.py +++ b/vllm/entrypoints/openai/protocol.py @@ -1347,8 +1347,8 @@ class ScoreRequest(OpenAIBaseModel): # --8<-- [end:score-extra-params] - def to_pooling_params(self, *, use_cross_encoder: bool = False): - return PoolingParams(use_cross_encoder=use_cross_encoder) + def to_pooling_params(self): + return PoolingParams() class RerankRequest(OpenAIBaseModel): @@ -1375,8 +1375,8 @@ class RerankRequest(OpenAIBaseModel): # --8<-- [end:rerank-extra-params] - def to_pooling_params(self, *, use_cross_encoder: bool = False): - return PoolingParams(use_cross_encoder=use_cross_encoder) + def to_pooling_params(self): + return PoolingParams() class RerankDocument(BaseModel): diff --git a/vllm/entrypoints/openai/serving_classification.py b/vllm/entrypoints/openai/serving_classification.py index 3ac4f01ea60..e4ea5ab8dc5 100644 --- a/vllm/entrypoints/openai/serving_classification.py +++ b/vllm/entrypoints/openai/serving_classification.py @@ -6,6 +6,7 @@ import numpy as np from fastapi import Request +from typing_extensions import override from vllm.config import ModelConfig from vllm.engine.protocol import EngineClient @@ -21,12 +22,14 @@ from vllm.entrypoints.openai.serving_models import OpenAIServingModels from vllm.logger import init_logger from vllm.outputs import ClassificationOutput, PoolingRequestOutput +from vllm.pooling_params import PoolingParams logger = init_logger(__name__) class ClassificationMixin(OpenAIServing): + @override async def _preprocess( self, ctx: ServeContext, @@ -75,6 +78,7 @@ async def _preprocess( logger.exception("Error in preprocessing prompt inputs") return self.create_error_response(str(e)) + @override def _build_response( self, ctx: ServeContext, @@ -158,3 +162,31 @@ async def create_classify( ) return await super().handle(ctx) # type: ignore + + @override + def _validate_request( + self, + ctx: ClassificationServeContext, + ) -> Optional[ErrorResponse]: + if error := super()._validate_request(ctx): + return error + + ctx.truncate_prompt_tokens = ctx.request.truncate_prompt_tokens + + return None + + @override + def _create_pooling_params( + self, + ctx: ClassificationServeContext, + ) -> Union[PoolingParams, ErrorResponse]: + pooling_params = super()._create_pooling_params(ctx) + if isinstance(pooling_params, ErrorResponse): + 
return pooling_params + + try: + pooling_params.verify("classify", self.model_config) + except ValueError as e: + return self.create_error_response(str(e)) + + return pooling_params diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index a5f816a66a8..f3b82dac899 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -32,7 +32,8 @@ from vllm.inputs.data import TokensPrompt as EngineTokensPrompt from vllm.logger import init_logger from vllm.outputs import (EmbeddingOutput, EmbeddingRequestOutput, - PoolingRequestOutput, RequestOutput) + PoolingRequestOutput) +from vllm.pooling_params import PoolingParams logger = init_logger(__name__) @@ -54,6 +55,7 @@ def _get_embedding( class EmbeddingMixin(OpenAIServing): + @override async def _preprocess( self, ctx: ServeContext, @@ -106,6 +108,7 @@ async def _preprocess( logger.exception("Error in preprocessing prompt inputs") return self.create_error_response(str(e)) + @override def _build_response( self, ctx: ServeContext, @@ -906,11 +909,20 @@ def _validate_request( ctx.truncate_prompt_tokens = ctx.request.truncate_prompt_tokens - pooling_params = ctx.request.to_pooling_params() + return None + + @override + def _create_pooling_params( + self, + ctx: ServeContext[EmbeddingRequest], + ) -> Union[PoolingParams, ErrorResponse]: + pooling_params = super()._create_pooling_params(ctx) + if isinstance(pooling_params, ErrorResponse): + return pooling_params try: - pooling_params.verify(self.model_config) + pooling_params.verify("embed", self.model_config) except ValueError as e: return self.create_error_response(str(e)) - return None + return pooling_params diff --git a/vllm/entrypoints/openai/serving_engine.py b/vllm/entrypoints/openai/serving_engine.py index 462317a0878..393e32f0ed9 100644 --- a/vllm/entrypoints/openai/serving_engine.py +++ b/vllm/entrypoints/openai/serving_engine.py @@ -305,6 +305,16 @@ def _validate_request(self, ctx: ServeContext) -> Optional[ErrorResponse]: " Please, select a smaller truncation size.") return None + def _create_pooling_params( + self, + ctx: ServeContext, + ) -> Union[PoolingParams, ErrorResponse]: + if not hasattr(ctx.request, "to_pooling_params"): + return self.create_error_response( + "Request type does not support pooling parameters") + + return ctx.request.to_pooling_params() + async def _prepare_generators( self, ctx: ServeContext, @@ -318,11 +328,9 @@ async def _prepare_generators( trace_headers = (None if ctx.raw_request is None else await self._get_trace_headers(ctx.raw_request.headers)) - if not hasattr(ctx.request, "to_pooling_params"): - return self.create_error_response( - "Request type does not support pooling parameters") - - pooling_params = ctx.request.to_pooling_params() + pooling_params = self._create_pooling_params(ctx) + if isinstance(pooling_params, ErrorResponse): + return pooling_params if ctx.engine_prompts is None: return self.create_error_response( diff --git a/vllm/entrypoints/openai/serving_pooling.py b/vllm/entrypoints/openai/serving_pooling.py index c2ed50d04d1..eec21087b99 100644 --- a/vllm/entrypoints/openai/serving_pooling.py +++ b/vllm/entrypoints/openai/serving_pooling.py @@ -142,6 +142,11 @@ async def create_pooling( try: pooling_params = request.to_pooling_params() + try: + pooling_params.verify("encode", self.model_config) + except ValueError as e: + return self.create_error_response(str(e)) + for i, engine_prompt in enumerate(engine_prompts): request_id_item = 
f"{request_id}-{i}" diff --git a/vllm/entrypoints/openai/serving_score.py b/vllm/entrypoints/openai/serving_score.py index 8d47a417f9c..35f6581768a 100644 --- a/vllm/entrypoints/openai/serving_score.py +++ b/vllm/entrypoints/openai/serving_score.py @@ -55,14 +55,13 @@ async def _embedding_score( texts_1: list[str], texts_2: list[str], request: Union[RerankRequest, ScoreRequest], - request_id=str, + request_id: str, tokenization_kwargs: Optional[dict[str, Any]] = None, lora_request: Optional[Union[LoRARequest, None]] = None, prompt_adapter_request: Optional[Union[PromptAdapterRequest, None]] = None, trace_headers: Optional[Mapping[str, str]] = None, - ) -> list[PoolingRequestOutput]: - + ) -> Union[list[PoolingRequestOutput], ErrorResponse]: input_texts = texts_1 + texts_2 engine_prompts: list[TokensPrompt] = [] @@ -89,6 +88,11 @@ async def _embedding_score( generators: list[AsyncGenerator[PoolingRequestOutput, None]] = [] pooling_params = request.to_pooling_params() + try: + pooling_params.verify("embed", self.model_config) + except ValueError as e: + return self.create_error_response(str(e)) + for i, engine_prompt in enumerate(engine_prompts): request_id_item = f"{request_id}-{i}" @@ -169,14 +173,13 @@ async def _cross_encoding_score( data_1: Union[list[str], list[ScoreContentPartParam]], data_2: Union[list[str], list[ScoreContentPartParam]], request: Union[RerankRequest, ScoreRequest], - request_id=str, + request_id: str, tokenization_kwargs: Optional[dict[str, Any]] = None, lora_request: Optional[Union[LoRARequest, None]] = None, prompt_adapter_request: Optional[Union[PromptAdapterRequest, None]] = None, trace_headers: Optional[Mapping[str, str]] = None, - ) -> list[PoolingRequestOutput]: - + ) -> Union[list[PoolingRequestOutput], ErrorResponse]: request_prompts: list[str] = [] engine_prompts: list[TokensPrompt] = [] @@ -245,7 +248,12 @@ async def _cross_encoding_score( # Schedule the request and get the result generator. 
generators: list[AsyncGenerator[PoolingRequestOutput, None]] = [] - pooling_params = request.to_pooling_params(use_cross_encoder=True) + pooling_params = request.to_pooling_params() + + try: + pooling_params.verify("score", self.model_config) + except ValueError as e: + return self.create_error_response(str(e)) for i, engine_prompt in enumerate(engine_prompts): request_id_item = f"{request_id}-{i}" @@ -286,8 +294,7 @@ async def _run_scoring( request_id: str, raw_request: Optional[Request] = None, truncate_prompt_tokens: Optional[int] = None, - ) -> list[PoolingRequestOutput]: - + ) -> Union[list[PoolingRequestOutput], ErrorResponse]: ( lora_request, prompt_adapter_request, @@ -374,6 +381,8 @@ async def create_score( raw_request, request.truncate_prompt_tokens, ) + if isinstance(final_res_batch, ErrorResponse): + return final_res_batch return self.request_output_to_score_response( final_res_batch, @@ -420,6 +429,9 @@ async def do_rerank( raw_request, request.truncate_prompt_tokens, ) + if isinstance(final_res_batch, ErrorResponse): + return final_res_batch + return self.request_output_to_rerank_response( final_res_batch, request_id, diff --git a/vllm/executor/executor_base.py b/vllm/executor/executor_base.py index 99e12201c96..ca9f1376b9f 100644 --- a/vllm/executor/executor_base.py +++ b/vllm/executor/executor_base.py @@ -4,6 +4,7 @@ import asyncio import time from abc import ABC, abstractmethod +from functools import cached_property from typing import (Any, Awaitable, Callable, Dict, List, Optional, Set, Tuple, Union) @@ -15,6 +16,7 @@ from vllm.logger import init_logger from vllm.lora.request import LoRARequest from vllm.model_executor.layers.sampler import SamplerOutput +from vllm.pooling_params import PoolingTask from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sequence import ExecuteModelRequest, PoolerOutput from vllm.utils import make_async @@ -135,6 +137,11 @@ def rpc_func(worker: WorkerBase) -> _R: return self.collective_rpc(rpc_func) + @cached_property # Avoid unnecessary RPC calls + def supported_pooling_tasks(self) -> tuple[PoolingTask, ...]: + output = self.collective_rpc("get_supported_pooling_tasks") + return tuple({task for tasks in output for task in tasks}) + def execute_model( self, execute_model_req: ExecuteModelRequest ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]: diff --git a/vllm/model_executor/layers/pooler.py b/vllm/model_executor/layers/pooler.py index 74916492f57..6a474b8e73a 100644 --- a/vllm/model_executor/layers/pooler.py +++ b/vllm/model_executor/layers/pooler.py @@ -3,7 +3,7 @@ from abc import ABC, abstractmethod from dataclasses import dataclass from enum import IntEnum -from typing import Callable, Literal, Optional, TypeVar, Union +from typing import Callable, Optional, TypeVar, Union import torch import torch.nn as nn @@ -15,13 +15,12 @@ from vllm.model_executor.pooling_metadata import ( # noqa: E501 PoolingMetadata as V0PoolingMetadata) from vllm.model_executor.pooling_metadata import PoolingTensors -from vllm.pooling_params import PoolingParams +from vllm.pooling_params import PoolingParams, PoolingTask from vllm.sequence import PoolerOutput, PoolingSequenceGroupOutput from vllm.utils import resolve_obj_by_qualname from vllm.v1.pool.metadata import PoolingMetadata as V1PoolingMetadata PoolingMetadata = Union[V0PoolingMetadata, V1PoolingMetadata] -PoolingTask = Literal["encode", "embed", "classify", "score"] class PoolingType(IntEnum): @@ -67,6 +66,15 @@ def from_config_with_defaults( ) +@dataclass(frozen=True) +class 
PoolingParamsUpdate: + requires_token_ids: bool = False + """Set this flag to enable `get_prompt_token_ids` for your pooler.""" + + def apply(self, params: PoolingParams) -> None: + params.requires_token_ids = self.requires_token_ids + + class Pooler(nn.Module, ABC): """The interface required for all poolers used in pooling models in vLLM.""" @@ -93,7 +101,10 @@ def from_config_with_defaults( return SimplePooler.from_config(resolved_config) - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: """ Construct the pooling parameters to use for a task, or `None` if the task is not supported. @@ -121,6 +132,23 @@ def get_prompt_lens( pooling_metadata, hidden_states.device).prompt_lens +def get_prompt_token_ids( + pooling_metadata: PoolingMetadata) -> list[torch.Tensor]: + if isinstance(pooling_metadata, V1PoolingMetadata): + assert pooling_metadata.prompt_token_ids is not None, ( + "Please set `requires_token_ids=True` in `get_pooling_updates`") + + return [ + pooling_metadata.prompt_token_ids[i, :num] + for i, num in enumerate(pooling_metadata.prompt_lens) + ] + + return [ + torch.tensor(seq_data_i.prompt_token_ids) + for seq_data_i in pooling_metadata.seq_data.values() + ] + + def get_classification_activation_function(config: PretrainedConfig): return PoolerClassify() @@ -165,7 +193,10 @@ def from_pooling_type(pooling_type: PoolingType) -> "PoolingMethod": raise NotImplementedError(f"Unsupported method: {pooling_type}") @abstractmethod - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: raise NotImplementedError @abstractmethod @@ -206,11 +237,14 @@ def forward( class CLSPool(PoolingMethod): - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: # The equalities are split up to keep mypy happy if (task == "encode" or task == "embed" or task == "classify" or task == "score"): - return PoolingParams() + return PoolingParamsUpdate() assert_never(task) @@ -236,11 +270,14 @@ def forward_all( class LastPool(PoolingMethod): - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: # The equalities are split up to keep mypy happy if (task == "encode" or task == "embed" or task == "classify" or task == "score"): - return PoolingParams() + return PoolingParamsUpdate() assert_never(task) @@ -262,9 +299,12 @@ def forward_all( class AllPool(PoolingMethod): - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: if task == "encode": - return PoolingParams() + return PoolingParamsUpdate() # The equalities are split up to keep mypy happy if task == "embed" or task == "classify" or task == "score": @@ -299,11 +339,14 @@ def forward_all( class MeanPool(PoolingMethod): - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: # The equalities are split up to keep mypy happy if (task == "encode" or task == "embed" or task == "classify" or task == "score"): - return PoolingParams() + return PoolingParamsUpdate() 
assert_never(task) @@ -520,8 +563,11 @@ def __init__(self, pooling: PoolingMethod, head: PoolerHead) -> None: self.pooling = pooling self.head = head - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: - return self.pooling.get_pooling_params(task) + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: + return self.pooling.get_pooling_updates(task) def forward( self, @@ -559,27 +605,13 @@ def __init__( self.step_tag_id = step_tag_id self.returned_token_ids = returned_token_ids - def get_prompt_token_ids( - self, - pooling_metadata: PoolingMetadata, - ) -> list[torch.Tensor]: - if isinstance(pooling_metadata, V1PoolingMetadata): - return [ - pooling_metadata.prompt_token_ids[i, :num] - for i, num in enumerate(pooling_metadata.prompt_lens) - ] - return [ - torch.tensor(seq_data_i.prompt_token_ids) - for seq_data_i in pooling_metadata.seq_data.values() - ] - def extract_states( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: PoolingMetadata, ) -> Union[list[torch.Tensor], torch.Tensor]: pooled_data_lst = self.pooling(hidden_states, pooling_metadata) - prompt_token_ids = self.get_prompt_token_ids(pooling_metadata) + prompt_token_ids = get_prompt_token_ids(pooling_metadata) pooled_data = list[torch.Tensor]() returned_token_ids = self.returned_token_ids @@ -595,9 +627,12 @@ def extract_states( return pooled_data - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: if task == "encode": - return PoolingParams(logits_processing_needs_token_ids=True) + return PoolingParamsUpdate(requires_token_ids=True) # The equalities are split up to keep mypy happy if task == "embed" or task == "classify" or task == "score": @@ -650,19 +685,24 @@ def __init__( self.cross_encoder_act_fn = get_cross_encoder_activation_function( config.hf_config) if act_fn is None else act_fn - def _get_act_fn(self, use_cross_encoder: bool): - return (self.cross_encoder_act_fn - if use_cross_encoder else self.classification_act_fn) + def _get_act_fn(self, task: PoolingTask): + if task == "encode" or task == "classify": + return self.classification_act_fn + if task == "score": + return self.cross_encoder_act_fn + + raise ValueError(f"Unsupported task: {task!r}") + + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: + # The equalities are split up to keep mypy happy + if task == "encode" or task == "classify" or task == "score": + return PoolingParamsUpdate() - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: - if task == "encode": - return PoolingParams() if task == "embed": return None - if task == "classify": - return PoolingParams() - if task == "score": - return PoolingParams(use_cross_encoder=True) assert_never(task) @@ -682,27 +722,28 @@ def forward( else: pooled_output = [self.classifier(data) for data in pooled_data] + task_list: list[PoolingTask] if isinstance(pooling_metadata, V0PoolingMetadata): - use_cross_encoder_list = [ - pooling_param.use_cross_encoder - for _, pooling_param in pooling_metadata.seq_groups + task_list = [ + task for _, pooling_param in pooling_metadata.seq_groups + if (task := pooling_param.task) is not None ] else: - use_cross_encoder_list = [ - pooling_param.use_cross_encoder - for pooling_param in pooling_metadata.pooling_params + task_list = [ + task for pooling_param in pooling_metadata.pooling_params + if (task := 
pooling_param.task) is not None ] + assert len(task_list) == len(pooled_output) + # shape of scores: (batch_size, num_labels) - if all(use_cross_encoder == use_cross_encoder_list[0] - for use_cross_encoder in use_cross_encoder_list): - act_fn = self._get_act_fn(use_cross_encoder_list[0]) + if len(set(task_list)) <= 1: + act_fn = self._get_act_fn(task_list[0]) scores = act_fn(pooled_output) else: scores = torch.stack([ - self._get_act_fn(use_cross_encoder)(vecs) - for use_cross_encoder, vecs in zip(use_cross_encoder_list, - pooled_output) + self._get_act_fn(task)(vecs) + for task, vecs in zip(task_list, pooled_output) ]) return build_output(scores) diff --git a/vllm/model_executor/models/bert.py b/vllm/model_executor/models/bert.py index bd4445c49a0..006f547bb46 100644 --- a/vllm/model_executor/models/bert.py +++ b/vllm/model_executor/models/bert.py @@ -18,13 +18,14 @@ QKVParallelLinear, RowParallelLinear) from vllm.model_executor.layers.pooler import (ClassifierPooler, Pooler, - PoolingMethod, PoolingTask, + PoolingMethod, + PoolingParamsUpdate, PoolingType) from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.pooling_params import PoolingParams +from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors from .interfaces import SupportsCrossEncoding, SupportsQuant, SupportsV0Only @@ -91,8 +92,11 @@ def __init__(self, config: BertConfig): self.dense = nn.Linear(config.hidden_size, config.hidden_size) self.activation = nn.Tanh() - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: - return self.pooling.get_pooling_params(task) + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: + return self.pooling.get_pooling_updates(task) def forward( self, diff --git a/vllm/model_executor/models/gritlm.py b/vllm/model_executor/models/gritlm.py index ba0e22892d8..8443482119b 100644 --- a/vllm/model_executor/models/gritlm.py +++ b/vllm/model_executor/models/gritlm.py @@ -1,18 +1,24 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from array import array +from typing import Optional, Union +import numpy as np import torch import torch.nn as nn +from typing_extensions import assert_never from vllm.config import ModelConfig, VllmConfig from vllm.logger import init_logger -from vllm.model_executor.layers.pooler import PoolerHead, PoolerNormalize +from vllm.model_executor.layers.pooler import (Pooler, PoolerHead, + PoolerNormalize, + PoolingParamsUpdate, + build_output, get_prompt_lens, + get_prompt_token_ids) from vllm.model_executor.models.llama import LlamaForCausalLM -from vllm.model_executor.pooling_metadata import (PoolingMetadata, - PoolingTensors) -from vllm.sequence import PoolerOutput, PoolingSequenceGroupOutput +from vllm.model_executor.pooling_metadata import PoolingMetadata +from vllm.pooling_params import PoolingTask +from vllm.sequence import PoolerOutput from vllm.transformers_utils.tokenizer import cached_tokenizer_from_config from .interfaces import SupportsV0Only @@ -20,7 +26,8 @@ logger = init_logger(__name__) -class GritLMPooler(nn.Module): +class GritLMMeanPool(nn.Module): + """As `MeanPool`, but only includes non-instruction tokens.""" def __init__(self, model_config: ModelConfig): super().__init__() @@ -38,8 +45,8 @@ def __init__(self, model_config: 
ModelConfig): for tok in ["", "▁<", "<", "|", "embed", ">", "<0x0A>", "user"] } - def tokens_to_ids(tokens: list[str]) -> array: - return array("i", [self.token_ids[token] for token in tokens]) + def tokens_to_ids(tokens: list[str]) -> np.ndarray: + return np.array([self.token_ids[token] for token in tokens]) self.user_pattern_ids = tokens_to_ids( ["▁<", "|", "user", "|", ">", "<0x0A>"]) @@ -48,32 +55,44 @@ def tokens_to_ids(tokens: list[str]) -> array: self.embed_pattern_ids = tokens_to_ids( ["▁<", "|", "embed", "|", ">", "<0x0A>"]) - self.head = PoolerHead(PoolerNormalize()) - - def _find_array(self, arr: array, target: array, start_idx: int) -> int: + def _find_array( + self, + arr: np.ndarray, + target: np.ndarray, + start_idx: int = 0, + end_idx: Optional[int] = None, + ) -> int: """ - Find the first occurrence of target in arr starting from start_idx. + Find the first occurrence of `target` in `arr` starting from + `start_idx`. Args: - arr: The array to search within - target: The consecutive subsequence to find - start_idx: The starting index to search from + arr: The array to search within. + target: The consecutive subsequence to find. + start_idx: The starting index to search from (inclusive). + end_idx: The ending index to search from (exclusive). Returns: - int: The index of the first occurrence of target in arr. + The index of the first occurrence of `target` in `arr`. """ if start_idx < 0: - raise ValueError("start_idx must be non-negative") - if not target or not arr: - raise ValueError("Empty arr or target not allowed") + raise ValueError("`start_idx` must be non-negative") + if len(arr) == 0 or len(target) == 0: + raise ValueError("Empty `arr` or `target` not allowed") + arr_len = len(arr) target_len = len(target) - for i in range(start_idx, len(arr) - target_len + 1): - if arr[i:i + target_len] == target: + + if end_idx is None: + end_idx = arr_len + + for i in range(start_idx, min(end_idx, arr_len - target_len + 1)): + if (arr[i:i + target_len] == target).all(): return i + return -1 - def _get_instruction_len(self, prompt_token_ids: array) -> int: + def _get_instruction_len(self, prompt_token_ids: np.ndarray) -> int: """ Get the length of the instruction in the prompt. @@ -83,7 +102,6 @@ def _get_instruction_len(self, prompt_token_ids: array) -> int: The pattern matching is done using integers instead of strings because the prompt is given as a list of token IDs. """ - instruction_len = 0 # Return no instruction in case of missing BOS token. @@ -98,7 +116,8 @@ def _get_instruction_len(self, prompt_token_ids: array) -> int: embed_pattern_ids = self.embed_pattern_ids if self._find_array(prompt_token_ids, self.user_pattern_ids, - start_idx=1) == 1: + start_idx=1, + end_idx=2) == 1: embed_pattern_ids = self.embed_newline_pattern_ids # Find the embed pattern in the prompt. @@ -116,64 +135,92 @@ def _get_instruction_len(self, prompt_token_ids: array) -> int: return instruction_len - def forward( + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: + # The equalities are split up to keep mypy happy + if task == "encode" or task == "embed": + return PoolingParamsUpdate(requires_token_ids=True) + + if task == "classify" or task == "score": + return None + + assert_never(task) + + def forward_one( self, hidden_states: torch.Tensor, - pooling_metadata: PoolingMetadata, - ) -> PoolerOutput: - """ - Pool the hidden states by summing the embeddings of - non-instruction tokens. 
- """ - prompts_token_ids = [ - token_ids.prompt_token_ids_array - for _, token_ids in pooling_metadata.seq_data.items() - ] + prompt_len: Optional[torch.Tensor] = None, + instr_len: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + assert prompt_len is None or prompt_len == hidden_states.shape[0], \ + "partial prefill not supported with MEAN pooling" + + return hidden_states[instr_len:].mean(dim=0, dtype=torch.float32) + + def forward_all( + self, + hidden_states: torch.Tensor, + prompt_lens: torch.Tensor, + instr_lens: torch.Tensor, + ) -> Union[list[torch.Tensor], torch.Tensor]: + offset = 0 + pooled_data = list[torch.Tensor]() + + for prompt_len, instr_len in zip(prompt_lens, instr_lens): + pooled_data.append(hidden_states[offset + instr_len:offset + + prompt_len].mean( + dim=0, dtype=torch.float32)) + offset += prompt_len - instruction_lens = torch.tensor( + return pooled_data + + def forward( + self, + hidden_states: Union[torch.Tensor, list[torch.Tensor]], + pooling_metadata: PoolingMetadata, + ) -> Union[list[torch.Tensor], torch.Tensor]: + prompt_lens = get_prompt_lens(hidden_states, pooling_metadata) + instr_lens = torch.tensor( [ - self._get_instruction_len(prompt_token_ids) - for prompt_token_ids in prompts_token_ids + self._get_instruction_len(token_ids.cpu().numpy()) + for token_ids in get_prompt_token_ids(pooling_metadata) ], - device=hidden_states.device, + device=prompt_lens.device, ) - prompt_lens = PoolingTensors.from_pooling_metadata( - pooling_metadata, hidden_states.device).prompt_lens - - mask = torch.zeros_like(hidden_states, dtype=torch.bool) - - start_idx = 0 - for prompt_len, instruction_len in zip(prompt_lens, instruction_lens): - end_idx = start_idx + prompt_len - mask[start_idx + instruction_len:end_idx] = True - start_idx = end_idx + if isinstance(hidden_states, list): + return [ + self.forward_one(h, prompt_len, instr_len) for h, prompt_len, + instr_len in zip(hidden_states, prompt_lens, instr_lens) + ] - masked_hidden_states = hidden_states.masked_fill(~mask, 0.0) + return self.forward_all(hidden_states, prompt_lens, instr_lens) - sum_embeddings = torch.zeros(len(prompt_lens), - hidden_states.size(1), - device=hidden_states.device) - start_idx = 0 - for i, prompt_len in enumerate(prompt_lens): - end_idx = start_idx + prompt_len - sum_embeddings[i] = masked_hidden_states[start_idx:end_idx].sum( - dim=0) - start_idx = end_idx +class GritLMPooler(Pooler): - num_non_instruction_tokens = prompt_lens - instruction_lens - mean_embeddings = sum_embeddings / num_non_instruction_tokens.unsqueeze( - 1) + def __init__(self, model_config: ModelConfig): + super().__init__() - pooled_data = self.head(mean_embeddings, - pooling_metadata=pooling_metadata) + self.pooling = GritLMMeanPool(model_config) + self.head = PoolerHead(PoolerNormalize()) - pooled_outputs = [ - PoolingSequenceGroupOutput(data) for data in pooled_data - ] + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: + return self.pooling.get_pooling_updates(task) - return PoolerOutput(outputs=pooled_outputs) + def forward( + self, + hidden_states: torch.Tensor, + pooling_metadata: PoolingMetadata, + ) -> PoolerOutput: + pooled_data = self.pooling(hidden_states, pooling_metadata) + pooled_data = self.head(pooled_data, pooling_metadata) + return build_output(pooled_data) class GritLM(LlamaForCausalLM, SupportsV0Only): @@ -202,7 +249,7 @@ def __init__( prefix: str = "", **kwargs, ) -> None: - # Use full attention for pooling + # Use full attention for pooling (this is 
why V1 is not supported yet) if vllm_config.model_config.runner_type == "pooling": hf_config = vllm_config.model_config.hf_config hf_config.is_causal = False diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 417f9059449..b60f1a5b6ff 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -599,13 +599,6 @@ def supports_cross_encoding( return is_pooling_model(model) and _supports_cross_encoding(model) -def has_step_pooler(model: Union[type[object], object]) -> bool: - """Check if the model uses step pooler.""" - from vllm.model_executor.layers.pooler import StepPooler - - return is_pooling_model(model) and isinstance(model.pooler, StepPooler) - - class SupportsQuant: """The interface required for all models that support quantization.""" diff --git a/vllm/model_executor/models/modernbert.py b/vllm/model_executor/models/modernbert.py index 94a7ddcc01c..74986f9f573 100644 --- a/vllm/model_executor/models/modernbert.py +++ b/vllm/model_executor/models/modernbert.py @@ -14,14 +14,15 @@ from vllm.model_executor.layers.linear import (QKVParallelLinear, RowParallelLinear) from vllm.model_executor.layers.pooler import (ClassifierPooler, Pooler, - PoolingMethod, PoolingTask, + PoolingMethod, + PoolingParamsUpdate, PoolingType) from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.pooling_params import PoolingParams +from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors from .interfaces import SupportsCrossEncoding, SupportsV0Only @@ -270,8 +271,11 @@ def __init__(self, config: ModernBertConfig): eps=config.norm_eps, bias=config.norm_bias) - def get_pooling_params(self, task: PoolingTask) -> Optional[PoolingParams]: - return self.pooling.get_pooling_params(task) + def get_pooling_updates( + self, + task: PoolingTask, + ) -> Optional[PoolingParamsUpdate]: + return self.pooling.get_pooling_updates(task) def forward( self, diff --git a/vllm/pooling_params.py b/vllm/pooling_params.py index 1a7305727e1..868facbe255 100644 --- a/vllm/pooling_params.py +++ b/vllm/pooling_params.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import TYPE_CHECKING, Optional +from typing import TYPE_CHECKING, Literal, Optional import msgspec @@ -10,12 +10,14 @@ if TYPE_CHECKING: from vllm.config import ModelConfig +PoolingTask = Literal["encode", "embed", "classify", "score"] + class PoolingParams( msgspec.Struct, omit_defaults=True, # type: ignore[call-arg] array_like=True): # type: ignore[call-arg] - """API parameters for pooling models. This + """API parameters for pooling models. 
Attributes: dimensions: Reduce the dimensions of embeddings @@ -24,24 +26,33 @@ class PoolingParams( dimensions: Optional[int] = None - use_cross_encoder: bool = False - """Internal use only.""" + output_kind: RequestOutputKind = RequestOutputKind.FINAL_ONLY - logits_processing_needs_token_ids: bool = False + task: Optional[PoolingTask] = None """Internal use only.""" - output_kind: RequestOutputKind = RequestOutputKind.FINAL_ONLY + requires_token_ids: bool = False + """Internal use only.""" def clone(self) -> "PoolingParams": """Returns a deep copy of the PoolingParams instance.""" return PoolingParams( dimensions=self.dimensions, - use_cross_encoder=self.use_cross_encoder, - logits_processing_needs_token_ids=self. - logits_processing_needs_token_ids, + task=self.task, + requires_token_ids=self.requires_token_ids, ) - def verify(self, model_config: "ModelConfig") -> None: + def verify(self, task: PoolingTask, model_config: "ModelConfig") -> None: + if self.task is None: + self.task = task + elif self.task != task: + msg = f"You cannot overwrite {self.task=!r} with {task=!r}!" + raise ValueError(msg) + + # NOTE: Task validation needs to done against the model instance, + # which is not available in model config. So, it's not included + # in this method + if self.dimensions is not None: if not model_config.is_matryoshka: raise ValueError( @@ -61,12 +72,10 @@ def verify(self, model_config: "ModelConfig") -> None: raise ValueError("Dimensions must be greater than 0") def __repr__(self) -> str: - return ( - f"PoolingParams(" - f"dimensions={self.dimensions}, " - f"use_cross_encoder={self.use_cross_encoder}, " - f"logits_processing_needs_token_ids={self.logits_processing_needs_token_ids})" - ) + return (f"PoolingParams(" + f"dimensions={self.dimensions}, " + f"task={self.task}, " + f"requires_token_ids={self.requires_token_ids})") def __post_init__(self) -> None: assert self.output_kind == RequestOutputKind.FINAL_ONLY,\ diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index f5c59bef478..b3210197750 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -181,6 +181,12 @@ def _initialize_kv_caches( def add_request(self, request: EngineCoreRequest): """Add request to the scheduler.""" + if pooling_params := request.pooling_params: + supported_pooling_tasks = ( + self.model_executor.supported_pooling_tasks) + if pooling_params.task not in supported_pooling_tasks: + raise ValueError(f"Unsupported task: {pooling_params.task!r} " + f"Supported tasks: {supported_pooling_tasks}") if request.mm_hashes is not None: # Here, if hash exists for a multimodal input, then it will be diff --git a/vllm/v1/worker/cpu_model_runner.py b/vllm/v1/worker/cpu_model_runner.py index 410a54e7466..c315dcb1832 100644 --- a/vllm/v1/worker/cpu_model_runner.py +++ b/vllm/v1/worker/cpu_model_runner.py @@ -8,7 +8,6 @@ from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.model_executor.model_loader import get_model -from vllm.model_executor.models.interfaces import has_step_pooler from vllm.v1.worker.gpu_model_runner import GPUModelRunner logger = init_logger(__name__) @@ -54,9 +53,6 @@ def load_model(self) -> None: logger.info("Starting to load model %s...", self.model_config.model) self.model = get_model(vllm_config=self.vllm_config) - if has_step_pooler(self.model): - self.input_batch.logits_processing_needs_token_ids = True - if self.lora_config: self.model = self.load_lora_model(self.model, self.model_config, self.scheduler_config, diff --git 
a/vllm/v1/worker/gpu_input_batch.py b/vllm/v1/worker/gpu_input_batch.py index 1a79d72be0a..a242c7fca5e 100644 --- a/vllm/v1/worker/gpu_input_batch.py +++ b/vllm/v1/worker/gpu_input_batch.py @@ -70,7 +70,6 @@ def __init__( vocab_size: int, block_sizes: list[int], # The block_size of each kv cache group is_spec_decode: bool = False, - logits_processing_needs_token_ids: bool = False, ): self.is_spec_decode = is_spec_decode self.max_num_reqs = max_num_reqs @@ -79,8 +78,6 @@ def __init__( self.device = device self.pin_memory = pin_memory self.vocab_size = vocab_size - self.logits_processing_needs_token_ids = ( - logits_processing_needs_token_ids) self._req_ids: list[Optional[str]] = [] self.req_id_to_index: dict[str, int] = {} @@ -233,6 +230,9 @@ def __init__( # req_index -> bad_words_token_ids self.bad_words_token_ids: dict[int, list[list[int]]] = {} + self.logits_processing_needs_token_ids = np.zeros(max_num_reqs, + dtype=bool) + self.req_output_token_ids: list[Optional[list[int]]] = [] # This is updated each time the batch constituents change. @@ -365,9 +365,12 @@ def add_request( if sampling_params.bad_words_token_ids: self.bad_words_token_ids[ req_index] = sampling_params.bad_words_token_ids + elif pooling_params := request.pooling_params: + self.pooling_params[req_id] = pooling_params + self.logits_processing_needs_token_ids[req_index] = ( + pooling_params.requires_token_ids) else: - assert request.pooling_params is not None - self.pooling_params[req_id] = request.pooling_params + raise NotImplementedError(request) # Add request lora ID if request.lora_request: @@ -620,9 +623,9 @@ def _make_sampling_metadata(self) -> SamplingMetadata: copy_slice(self.repetition_penalties_cpu_tensor, self.repetition_penalties, num_reqs) - needs_prompt_token_ids = (not self.no_penalties or - (self.num_reqs > 0 - and self.logits_processing_needs_token_ids)) + needs_prompt_token_ids = ( + not self.no_penalties + or self.logits_processing_needs_token_ids[:num_reqs].any()) if needs_prompt_token_ids: # The prompt tokens are used only for applying penalties or # step pooling during the sampling/pooling process. 
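A minimal sketch of the task pinning that `PoolingParams.verify()` now performs (illustrative only; the `ModelConfig` construction mirrors the GritLM test earlier in this patch, and the model name is just an example of a pooling model):

    from vllm.config import ModelConfig
    from vllm.pooling_params import PoolingParams

    MODEL_NAME = "intfloat/e5-mistral-7b-instruct"

    # Any pooling-capable model config works here.
    model_config = ModelConfig(
        MODEL_NAME,
        task="embed",
        tokenizer=MODEL_NAME,
        tokenizer_mode="auto",
        trust_remote_code=False,
        dtype="bfloat16",
        seed=0,
    )

    params = PoolingParams()              # task starts out as None
    params.verify("embed", model_config)  # first call pins task="embed"
    assert params.task == "embed"

    # A conflicting task on the same params object is rejected.
    try:
        params.verify("classify", model_config)
    except ValueError as err:
        print(err)  # cannot overwrite task='embed' with task='classify'

Downstream, `EngineCore.add_request` additionally checks `pooling_params.task` against `model_executor.supported_pooling_tasks`, so a task the loaded pooler cannot serve is rejected before scheduling.
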
diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 60fb78c060c..c3eeb6c2e39 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -4,7 +4,7 @@ import gc import time from contextlib import contextmanager -from typing import TYPE_CHECKING, Any, Optional, Union +from typing import TYPE_CHECKING, Any, Optional, Union, cast, get_args import numpy as np import torch @@ -32,12 +32,13 @@ from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaBase from vllm.model_executor.layers.rotary_embedding import MRotaryEmbedding from vllm.model_executor.model_loader import TensorizerLoader, get_model_loader -from vllm.model_executor.models.interfaces import (has_step_pooler, - is_mixture_of_experts) +from vllm.model_executor.models.interfaces import is_mixture_of_experts +from vllm.model_executor.models.interfaces_base import (VllmModelForPooling, + is_pooling_model) from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.inputs import MultiModalKwargs, PlaceholderRange from vllm.multimodal.utils import group_mm_inputs_by_modality -from vllm.pooling_params import PoolingParams +from vllm.pooling_params import PoolingParams, PoolingTask from vllm.sampling_params import SamplingType from vllm.sequence import IntermediateTensors from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, DeviceMemoryProfiler, @@ -404,6 +405,7 @@ def _update_states(self, scheduler_output: "SchedulerOutput") -> None: req_id = new_req_data.req_id sampling_params = new_req_data.sampling_params pooling_params = new_req_data.pooling_params + if sampling_params and \ sampling_params.sampling_type == SamplingType.RANDOM_SEED: generator = torch.Generator(device=self.device) @@ -411,6 +413,18 @@ def _update_states(self, scheduler_output: "SchedulerOutput") -> None: else: generator = None + if pooling_params: + assert pooling_params.task is not None, ( + "You did not set `task` in the API") + + model = cast(VllmModelForPooling, self.model) + to_update = (model.pooler.get_pooling_updates( + pooling_params.task)) + assert to_update is not None, ( + f"{pooling_params.task=} is not supported by the model") + + to_update.apply(pooling_params) + self.requests[req_id] = CachedRequestState( req_id=req_id, prompt_token_ids=new_req_data.prompt_token_ids, @@ -1092,6 +1106,16 @@ def _gather_mm_embeddings( def get_model(self) -> nn.Module: return self.model + def get_supported_pooling_tasks(self) -> list[PoolingTask]: + model = self.get_model() + if not is_pooling_model(model): + return [] + + return [ + task for task in get_args(PoolingTask) + if model.pooler.get_pooling_updates(task) + ] + def apply_grammar_bitmask( self, scheduler_output: "SchedulerOutput", @@ -1737,8 +1761,6 @@ def load_model(self) -> None: ) model_loader.load_weights(self.model, model_config=self.model_config) - if has_step_pooler(self.model): - self.input_batch.logits_processing_needs_token_ids = True if self.lora_config: self.model = self.load_lora_model(self.model, self.model_config, @@ -2138,17 +2160,25 @@ def _dummy_pooler_run( req_num_tokens = num_tokens // num_reqs + model = cast(VllmModelForPooling, self.model) + dummy_task = self.get_supported_pooling_tasks()[0] + dummy_pooling_params = PoolingParams(task=dummy_task) + + to_update = model.pooler.get_pooling_updates(dummy_task) + assert to_update is not None + to_update.apply(dummy_pooling_params) + dummy_metadata = PoolingMetadata( prompt_lens=torch.tensor([h.shape[0] for h in hidden_states_list], device=self.device), 
prompt_token_ids=torch.zeros((num_reqs, req_num_tokens), dtype=torch.int32, device=self.device), - pooling_params=[PoolingParams()] * num_reqs) + pooling_params=[dummy_pooling_params] * num_reqs) try: - pooler_output = self.model.pooler(hidden_states=hidden_states_list, - pooling_metadata=dummy_metadata) + pooler_output = model.pooler(hidden_states=hidden_states_list, + pooling_metadata=dummy_metadata) except RuntimeError as e: if 'out of memory' in str(e): raise RuntimeError( diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py index 6458b55777a..1610d0ecee2 100644 --- a/vllm/v1/worker/gpu_worker.py +++ b/vllm/v1/worker/gpu_worker.py @@ -23,6 +23,7 @@ from vllm.lora.request import LoRARequest from vllm.model_executor import set_random_seed from vllm.platforms import current_platform +from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors from vllm.utils import GiB_bytes, MemorySnapshot, memory_profiling from vllm.v1.kv_cache_interface import KVCacheConfig, KVCacheSpec @@ -309,6 +310,9 @@ def compile_or_warm_up_model(self) -> None: def get_model(self) -> nn.Module: return self.model_runner.get_model() + def get_supported_pooling_tasks(self) -> list[PoolingTask]: + return self.model_runner.get_supported_pooling_tasks() + @torch.inference_mode() def execute_model( self, diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 8565df42973..1b55e5d61aa 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -3,7 +3,7 @@ import bisect import gc import time -from typing import TYPE_CHECKING, Any, Optional, cast +from typing import TYPE_CHECKING, Any, Optional, cast, get_args from unittest.mock import patch import numpy as np @@ -25,10 +25,12 @@ from vllm.lora.layers import BaseLayerWithLoRA from vllm.model_executor.model_loader import get_model_loader from vllm.model_executor.model_loader.tpu import TPUModelLoader +from vllm.model_executor.models.interfaces_base import is_pooling_model from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.inputs import (BatchedTensorInputs, MultiModalKwargs, PlaceholderRange) from vllm.multimodal.utils import group_mm_inputs_by_modality +from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, LayerBlockType, cdiv, is_pin_memory_available, prev_power_of_2) @@ -483,6 +485,16 @@ def _update_states(self, scheduler_output: "SchedulerOutput") -> bool: def get_model(self) -> nn.Module: return self.model + def get_supported_pooling_tasks(self) -> list[PoolingTask]: + model = self.get_model() + if not is_pooling_model(model): + return [] + + return [ + task for task in get_args(PoolingTask) + if model.pooler.get_pooling_updates(task) + ] + def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: """ Generates the KVCacheSpec by parsing the kv cache format from each diff --git a/vllm/v1/worker/tpu_worker.py b/vllm/v1/worker/tpu_worker.py index c4bf40d6654..592d9fc17c9 100644 --- a/vllm/v1/worker/tpu_worker.py +++ b/vllm/v1/worker/tpu_worker.py @@ -19,6 +19,7 @@ from vllm.lora.request import LoRARequest from vllm.model_executor import set_random_seed from vllm.platforms import current_platform +from vllm.pooling_params import PoolingTask from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, cdiv from vllm.v1.attention.backends.pallas import TPU_HEAD_SIZE_ALIGNMENT from vllm.v1.core.sched.output import SchedulerOutput @@ -275,6 +276,9 @@ def 
compile_or_warm_up_model(self) -> None: def get_model(self) -> nn.Module: return self.model_runner.get_model() + def get_supported_pooling_tasks(self) -> list[PoolingTask]: + return self.model_runner.get_supported_pooling_tasks() + def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: return self.model_runner.get_kv_cache_spec() diff --git a/vllm/worker/model_runner_base.py b/vllm/worker/model_runner_base.py index d567ce4a6e7..b0737dfe319 100644 --- a/vllm/worker/model_runner_base.py +++ b/vllm/worker/model_runner_base.py @@ -4,7 +4,7 @@ import dataclasses from abc import ABC, abstractmethod from typing import (TYPE_CHECKING, Any, Dict, Generic, List, Optional, Type, - TypeVar) + TypeVar, get_args) import torch import torch.nn as nn @@ -12,6 +12,8 @@ from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.model_executor.layers.sampler import SamplerOutput +from vllm.model_executor.models.interfaces_base import is_pooling_model +from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors, SequenceGroupMetadata if TYPE_CHECKING: @@ -223,6 +225,16 @@ def prepare_model_input( def get_model(self) -> nn.Module: raise NotImplementedError + def get_supported_pooling_tasks(self) -> list[PoolingTask]: + model = self.get_model() + if not is_pooling_model(model): + return [] + + return [ + task for task in get_args(PoolingTask) + if model.pooler.get_pooling_updates(task) + ] + def execute_model( self, model_input: T, diff --git a/vllm/worker/pooling_model_runner.py b/vllm/worker/pooling_model_runner.py index f80955f71a5..2c3f4eb3ad4 100644 --- a/vllm/worker/pooling_model_runner.py +++ b/vllm/worker/pooling_model_runner.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import dataclasses -from typing import Any, Dict, List, Optional, Tuple, Type, Union +from typing import Any, Dict, List, Optional, Tuple, Type, Union, cast import torch @@ -10,6 +10,7 @@ from vllm.distributed import get_pp_group from vllm.forward_context import set_forward_context from vllm.logger import init_logger +from vllm.model_executor.models.interfaces_base import VllmModelForPooling from vllm.model_executor.pooling_metadata import PoolingMetadata from vllm.multimodal import MultiModalKwargs from vllm.pooling_params import PoolingParams @@ -195,7 +196,20 @@ def _prepare_pooling( seq_groups: List[Tuple[List[int], PoolingParams]] = [] for i, seq_group_metadata in enumerate(seq_group_metadata_list): seq_ids = list(seq_group_metadata.seq_data.keys()) + pooling_params = seq_group_metadata.pooling_params + assert pooling_params is not None + assert pooling_params.task is not None, ( + "You did not set `task` in the API") + + to_update = (cast(VllmModelForPooling, + self.model).pooler.get_pooling_updates( + pooling_params.task)) + assert to_update is not None, ( + f"{pooling_params.task=} is not supported by the model") + + to_update.apply(pooling_params) + seq_groups.append((seq_ids, pooling_params)) seq_data: Dict[int, SequenceData] = {} From 11dac90634cbb05a4cc99c6a6caf00f0cdceda88 Mon Sep 17 00:00:00 2001 From: Thomas Parnell Date: Fri, 18 Jul 2025 14:52:52 +0200 Subject: [PATCH 178/552] Let GraniteMoeAttention use YaRN (#21174) Signed-off-by: Thomas Parnell Signed-off-by: x22x22 --- vllm/model_executor/models/granitemoe.py | 6 +++++- vllm/model_executor/models/granitemoeshared.py | 2 ++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/vllm/model_executor/models/granitemoe.py b/vllm/model_executor/models/granitemoe.py index 
142b0e96729..7d31854dce8 100644 --- a/vllm/model_executor/models/granitemoe.py +++ b/vllm/model_executor/models/granitemoe.py @@ -24,7 +24,7 @@ # limitations under the License. """Inference-only GraniteMoe model.""" from collections.abc import Iterable -from typing import Optional +from typing import Any, Optional import torch from torch import nn @@ -113,6 +113,7 @@ def __init__( num_kv_heads: int, max_position: int = 4096 * 32, rope_theta: float = 10000, + rope_scaling: Optional[dict[str, Any]] = None, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, attention_multiplier: Optional[float] = None, @@ -163,6 +164,7 @@ def __init__( max_position=max_position, base=int(self.rope_theta), is_neox_style=True, + rope_scaling=rope_scaling, ) self.attn = Attention(self.num_heads, self.head_dim, @@ -198,12 +200,14 @@ def __init__( self.hidden_size = config.hidden_size # Requires transformers > 4.32.0 rope_theta = getattr(config, "rope_theta", 10000) + rope_scaling = getattr(config, "rope_scaling", None) self.self_attn = GraniteMoeAttention( hidden_size=self.hidden_size, num_heads=config.num_attention_heads, max_position=config.max_position_embeddings, num_kv_heads=config.num_key_value_heads, rope_theta=rope_theta, + rope_scaling=rope_scaling, cache_config=cache_config, quant_config=quant_config, prefix=f"{prefix}.self_attn", diff --git a/vllm/model_executor/models/granitemoeshared.py b/vllm/model_executor/models/granitemoeshared.py index 7303f485378..1e2e8544179 100644 --- a/vllm/model_executor/models/granitemoeshared.py +++ b/vllm/model_executor/models/granitemoeshared.py @@ -81,12 +81,14 @@ def __init__( self.hidden_size = config.hidden_size # Requires transformers > 4.32.0 rope_theta = getattr(config, "rope_theta", 10000) + rope_scaling = getattr(config, "rope_scaling", None) self.self_attn = GraniteMoeAttention( hidden_size=self.hidden_size, num_heads=config.num_attention_heads, max_position=config.max_position_embeddings, num_kv_heads=config.num_key_value_heads, rope_theta=rope_theta, + rope_scaling=rope_scaling, cache_config=cache_config, quant_config=quant_config, prefix=f"{prefix}.self_attn", From 5a10812e5ac45565c4dfd816c9caeed91974eead Mon Sep 17 00:00:00 2001 From: Richard Zou Date: Fri, 18 Jul 2025 09:51:12 -0400 Subject: [PATCH 179/552] [CI] Update CODEOWNERS for vllm/compilation (#21185) Signed-off-by: Richard Zou Signed-off-by: x22x22 --- .github/CODEOWNERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 7def035b792..97f9e7dc157 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -16,7 +16,7 @@ /vllm/lora @jeejeelee /vllm/reasoning @aarnphm /vllm/entrypoints @aarnphm -/vllm/compilation @zou3519 @youkaichao +/vllm/compilation @zou3519 @youkaichao @ProExpertProg CMakeLists.txt @tlrmchlsmth @LucasWilkinson # Any change to the VllmConfig changes can have a large user-facing impact, From 42db0647b6838151597016469daae988c27a899e Mon Sep 17 00:00:00 2001 From: x22x22 Date: Mon, 21 Jul 2025 01:19:23 +0800 Subject: [PATCH 180/552] In the EmbeddingMixin class, add validation for pooling parameters to ensure consistency between the task and the model configuration. In case the validation fails, return an error response. 
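The validation added in this commit follows a verify-then-respond pattern: pooling parameters are checked against what the loaded model supports, and a failure is turned into an error response instead of an unhandled exception. The sketch below illustrates that pattern only; the classes and the `verify` signature are placeholders and do not mirror vLLM's actual `PoolingParams` API.

```python
# Hedged sketch of validate-then-respond for embedding requests.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorResponse:
    message: str


@dataclass
class FakePoolingParams:
    task: Optional[str] = None

    def verify(self, task: str, supported_tasks: set[str]) -> None:
        if task not in supported_tasks:
            raise ValueError(f"Task {task!r} is not supported by this model")
        self.task = task


def prepare_embedding_params(params: FakePoolingParams,
                             supported_tasks: set[str]):
    try:
        # Check consistency with the model before any work is scheduled.
        params.verify("embed", supported_tasks)
    except ValueError as e:
        # Surface a structured error rather than raising, as described in
        # the commit message above.
        return ErrorResponse(str(e))
    return params


print(prepare_embedding_params(FakePoolingParams(), {"embed"}))
print(prepare_embedding_params(FakePoolingParams(), {"classify"}))
```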
Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index f3b82dac899..84c7a32dc98 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -434,6 +434,12 @@ async def _prepare_generators( pooling_params = ctx.request.to_pooling_params() + # Verify and set the task for pooling params + try: + pooling_params.verify("embed", self.model_config) + except ValueError as e: + return self.create_error_response(str(e)) + if ctx.engine_prompts is None: return self.create_error_response( "Engine prompts not available") From d078772b3978e8a8dcf07b49d43fcef0e6e1b9c7 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Thu, 24 Jul 2025 22:58:08 +0800 Subject: [PATCH 181/552] Within the EmbeddingMixin class, the error message format for input length validation has been optimized to provide clearer feedback when the input exceeds the maximum length. Error messages are dynamically generated based on either the maximum embedding input length or the maximum context length. Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 22 +++++++++++--------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 84c7a32dc98..64f432db729 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -368,19 +368,21 @@ def _validate_input( if max_embed_len is not None: # Use max_embed_len for validation instead of max_model_len effective_max_len = max_embed_len - validation_error_msg = ( - f"This model's maximum embedding input length is " - f"{max_embed_len} tokens. However, you requested " - f"{token_num} tokens in the input for embedding " - f"generation. Please reduce the length of the input.") + length_type = "maximum embedding input length" + max_length_value = max_embed_len else: # Fall back to max_model_len validation (original behavior) effective_max_len = self.max_model_len - validation_error_msg = ( - f"This model's maximum context length is " - f"{self.max_model_len} tokens. However, you requested " - f"{token_num} tokens in the input for embedding " - f"generation. Please reduce the length of the input.") + length_type = "maximum context length" + max_length_value = self.max_model_len + + validation_error_msg = ( + "This model's {length_type} is {max_length} tokens. " + "However, you requested {token_num} tokens in the input for " + "embedding generation. Please reduce the length of the input." + ).format(length_type=length_type, + max_length=max_length_value, + token_num=token_num) # Check if input exceeds effective max length if token_num > effective_max_len: From 127d51636c707f366b084169edb99f310c412b99 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Thu, 24 Jul 2025 23:36:40 +0800 Subject: [PATCH 182/552] In the OpenAIServing class, an additional check for the tokenizer type was added to ensure that it supports the required methods. Type safety was implemented for the apply_hf_chat_template and decode methods to ensure that appropriate errors are thrown in the absence of support. This change has enhanced the robustness and maintainability of the code. 
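The tokenizer change described here is essentially defensive dispatch: only known tokenizer types are routed to the HF chat-template path, and `decode` is called only when the object actually provides it, otherwise an actionable error is raised. A simplified sketch of that idea, using stand-in classes rather than the transformers tokenizer types:

```python
# Sketch only: type-checked dispatch with a hasattr fallback.
class HfLikeTokenizer:
    def decode(self, token_ids: list[int]) -> str:
        return " ".join(f"<{t}>" for t in token_ids)


class OpaqueTokenizer:
    pass


def render_prompt(tokenizer, token_ids: list[int]) -> str:
    if isinstance(tokenizer, HfLikeTokenizer):
        # The "supported" path, analogous to PreTrainedTokenizer(Fast).
        return tokenizer.decode(token_ids)
    if hasattr(tokenizer, "decode"):
        # Best-effort fallback for other tokenizer implementations.
        return tokenizer.decode(token_ids)
    # Fail loudly with a clear message instead of an AttributeError.
    raise ValueError(
        f"Unsupported tokenizer type: {type(tokenizer).__name__}")


print(render_prompt(HfLikeTokenizer(), [1, 2, 3]))
try:
    render_prompt(OpaqueTokenizer(), [1, 2, 3])
except ValueError as e:
    print(e)
```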
Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_engine.py | 58 ++++++++++++++++++----- 1 file changed, 45 insertions(+), 13 deletions(-) diff --git a/vllm/entrypoints/openai/serving_engine.py b/vllm/entrypoints/openai/serving_engine.py index 393e32f0ed9..14bcbafc6ab 100644 --- a/vllm/entrypoints/openai/serving_engine.py +++ b/vllm/entrypoints/openai/serving_engine.py @@ -16,6 +16,7 @@ from fastapi import Request from pydantic import BaseModel, ConfigDict, Field from starlette.datastructures import Headers +from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast from typing_extensions import TypeIs if sys.version_info >= (3, 12): @@ -525,8 +526,13 @@ def _get_message_types(self, request: AnyRequest) -> set[str]: if (isinstance(message, dict) and "content" in message and isinstance(message["content"], list)): for content_dict in message["content"]: - if "type" in content_dict: - message_types.add(content_dict["type"].split("_")[0]) + # Check if content_dict has a "type" key and it's a string + if isinstance(content_dict, dict): + type_value = content_dict.get("type") + if isinstance(type_value, str): + # Split on "_" and take the first part + base_type = type_value.split("_")[0] + message_types.add(base_type) return message_types async def _normalize_prompt_text_to_input( @@ -900,12 +906,23 @@ async def _preprocess_chat( **_chat_template_kwargs, ) else: - request_prompt = apply_hf_chat_template( - tokenizer=tokenizer, - conversation=conversation, - model_config=model_config, - **_chat_template_kwargs, - ) + # Type check for apply_hf_chat_template which only accepts + # PreTrainedTokenizer or PreTrainedTokenizerFast + if isinstance(tokenizer, + (PreTrainedTokenizer, PreTrainedTokenizerFast)): + request_prompt = apply_hf_chat_template( + tokenizer=tokenizer, + conversation=conversation, + model_config=model_config, + **_chat_template_kwargs, + ) + else: + # For other tokenizer types, we need to handle this differently + # This shouldn't happen in normal operation, but we handle it + # for type safety + raise ValueError( + f"Unsupported tokenizer type for HF chat template: " + f"{type(tokenizer)}") mm_data = await mm_data_future @@ -935,9 +952,16 @@ async def _preprocess_chat( # For MistralTokenizer assert is_list_of(request_prompt, int), ( "Prompt has to be either a string or a list of token ids") - prompt_inputs = TextTokensPrompt( - prompt=tokenizer.decode(request_prompt), - prompt_token_ids=request_prompt) + # Type check for decode method + if hasattr(tokenizer, 'decode'): + decoded_prompt = tokenizer.decode(request_prompt) + else: + # Fallback for tokenizers without decode method + raise ValueError( + f"Tokenizer {type(tokenizer)} does not support " + f"decode method") + prompt_inputs = TextTokensPrompt(prompt=decoded_prompt, + prompt_token_ids=request_prompt) engine_prompt = EngineTokensPrompt( prompt_token_ids=prompt_inputs["prompt_token_ids"]) @@ -997,7 +1021,9 @@ def _log_inputs( elif isinstance(inputs, list): prompt_token_ids = inputs elif 'prompt_embeds' in inputs: - prompt_embeds = inputs.get("prompt_embeds") + # Cast to proper type for log_inputs + prompt_embeds = cast(Optional[torch.Tensor], + inputs.get("prompt_embeds")) else: prompt = inputs["prompt"] prompt_token_ids = inputs["prompt_token_ids"] @@ -1046,7 +1072,13 @@ def _get_decoded_token(logprob: Logprob, if logprob.decoded_token is not None: return logprob.decoded_token - return tokenizer.decode(token_id) + + # Type check for decode method + if hasattr(tokenizer, 'decode'): + return 
tokenizer.decode(token_id) + else: + # Fallback for tokenizers without decode method + return f"token_id:{token_id}" def _is_model_supported(self, model_name: Optional[str]) -> bool: if not model_name: From 679a74c29798cbe4dd2a886b9fe59c2066017f15 Mon Sep 17 00:00:00 2001 From: Richard Zou Date: Fri, 18 Jul 2025 14:10:21 -0400 Subject: [PATCH 183/552] [Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 (#19346) Signed-off-by: rzou Signed-off-by: x22x22 --- csrc/torch_bindings.cpp | 12 ++++++++---- vllm/attention/ops/rocm_aiter_mla.py | 8 ++++++-- vllm/model_executor/layers/fused_moe/fused_moe.py | 8 +++++--- 3 files changed, 19 insertions(+), 9 deletions(-) diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp index 23e9212a2f1..79e2575974b 100644 --- a/csrc/torch_bindings.cpp +++ b/csrc/torch_bindings.cpp @@ -20,13 +20,17 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { // vLLM custom ops // - // The default behavior in PyTorch 2.6 is "requires_contiguous", so we need + // The default behavior in PyTorch 2.6 was changed to "requires_contiguous", + // so we need // to override this for many GEMMs with the following tag. Otherwise, // torch.compile will force all input tensors to be contiguous(), which // will break many custom ops that require column-major weight matrices. - // TODO: remove this for PyTorch 2.8, when the default is planned to switch - // to match exact eager-mode strides. - at::Tag stride_tag = at::Tag::needs_fixed_stride_order; + // This was a bug and PyTorch 2.7 has since fixed this. +#if TORCH_VERSION_MAJOR == 2 && TORCH_VERSION_MINOR == 6 + #define stride_tag at::Tag::needs_fixed_stride_order +#else + #define stride_tag +#endif ops.def("weak_ref_tensor(Tensor input) -> Tensor"); ops.impl("weak_ref_tensor", torch::kCUDA, &weak_ref_tensor); diff --git a/vllm/attention/ops/rocm_aiter_mla.py b/vllm/attention/ops/rocm_aiter_mla.py index cce6b463946..d91cda255ff 100644 --- a/vllm/attention/ops/rocm_aiter_mla.py +++ b/vllm/attention/ops/rocm_aiter_mla.py @@ -6,7 +6,7 @@ import torch from vllm.platforms import current_platform -from vllm.utils import direct_register_custom_op +from vllm.utils import direct_register_custom_op, is_torch_equal_or_newer def get_aiter_mla_metadata(max_batch_size: int, block_size: int, @@ -93,8 +93,12 @@ def mla_decode_fwd_fake( if current_platform.is_rocm(): + if is_torch_equal_or_newer("2.7.0"): + tags = () + else: + tags = (torch.Tag.needs_fixed_stride_order, ), direct_register_custom_op(op_name="rocm_aiter_mla_decode_fwd", op_func=mla_decode_fwd_impl, mutates_args=["o"], fake_impl=mla_decode_fwd_fake, - tags=[torch.Tag.needs_fixed_stride_order]) + tags=tags) diff --git a/vllm/model_executor/layers/fused_moe/fused_moe.py b/vllm/model_executor/layers/fused_moe/fused_moe.py index 45936026007..aec5d7b252e 100644 --- a/vllm/model_executor/layers/fused_moe/fused_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_moe.py @@ -33,7 +33,7 @@ dequant_mxfp4) from vllm.platforms import current_platform from vllm.triton_utils import tl, triton -from vllm.utils import direct_register_custom_op +from vllm.utils import direct_register_custom_op, is_torch_equal_or_newer from vllm.utils.deep_gemm import is_blackwell_deep_gemm_used from .rocm_aiter_fused_moe import is_rocm_aiter_moe_enabled @@ -1056,7 +1056,8 @@ def inplace_fused_experts_fake( op_func=inplace_fused_experts, mutates_args=["hidden_states"], fake_impl=inplace_fused_experts_fake, - tags=(torch.Tag.needs_fixed_stride_order, ), + tags=(() if 
is_torch_equal_or_newer("2.7.0") else + (torch.Tag.needs_fixed_stride_order, )), ) @@ -1122,7 +1123,8 @@ def outplace_fused_experts_fake( op_func=outplace_fused_experts, mutates_args=[], fake_impl=outplace_fused_experts_fake, - tags=(torch.Tag.needs_fixed_stride_order, ), + tags=(() if is_torch_equal_or_newer("2.7.0") else + (torch.Tag.needs_fixed_stride_order, )), ) From 484282191cc898cf7f3a60f278488cc579027fdd Mon Sep 17 00:00:00 2001 From: JialinOuyang-Meta Date: Fri, 18 Jul 2025 12:34:40 -0700 Subject: [PATCH 184/552] [Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue (#21005) Signed-off-by: Jialin Ouyang Signed-off-by: x22x22 --- benchmarks/kv_cache/benchmark_block_pool.py | 108 ++++++++++++++++++++ tests/v1/core/test_kv_cache_utils.py | 28 ++--- tests/v1/core/test_prefix_caching.py | 26 ++--- vllm/v1/core/kv_cache_utils.py | 106 +++++++++++++------ 4 files changed, 210 insertions(+), 58 deletions(-) create mode 100644 benchmarks/kv_cache/benchmark_block_pool.py diff --git a/benchmarks/kv_cache/benchmark_block_pool.py b/benchmarks/kv_cache/benchmark_block_pool.py new file mode 100644 index 00000000000..134551bb612 --- /dev/null +++ b/benchmarks/kv_cache/benchmark_block_pool.py @@ -0,0 +1,108 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import gc +import time +from typing import Optional + +from tabulate import tabulate + +from vllm.utils import FlexibleArgumentParser +from vllm.v1.core.block_pool import BlockPool + + +class Metric: + def __init__(self) -> None: + self.cnt: int = 0 + self.sum_v: int = 0 + self.max_v: Optional[int] = None + + def update(self, v: int) -> None: + self.cnt += 1 + self.sum_v += v + if self.max_v is None: + self.max_v = v + else: + self.max_v = max(self.max_v, v) + + def avg_v(self) -> float: + return self.sum_v * 1.0 / self.cnt + + +def main(args): + rows = [] + for allocate_block in args.allocate_blocks: + # Enforce a GC collect ahead to minimize the impact among runs + gc.collect() + block_pool = BlockPool(num_gpu_blocks=args.num_gpu_blocks, enable_caching=True) + + get_blocks_metric: Metric = Metric() + free_blocks_metric: Metric = Metric() + for _ in range(args.num_iteration): + t1 = time.monotonic_ns() + blocks = block_pool.get_new_blocks(allocate_block) + t2 = time.monotonic_ns() + block_pool.free_blocks(blocks) + t3 = time.monotonic_ns() + get_blocks_metric.update(t2 - t1) + free_blocks_metric.update(t3 - t2) + + if get_blocks_metric.max_v is not None and free_blocks_metric.max_v is not None: + rows.append( + [ + get_blocks_metric.cnt, + args.num_gpu_blocks, + allocate_block, + get_blocks_metric.avg_v() / 1000000, + get_blocks_metric.max_v / 1000000.0, + free_blocks_metric.avg_v() / 1000000, + free_blocks_metric.max_v / 1000000.0, + ] + ) + else: + print( + "No valid metrics found." + f" {get_blocks_metric.max_v=} {free_blocks_metric.max_v=}" + ) + + print( + tabulate( + rows, + headers=[ + "Iterations", + "Total\nBlocks", + "Allocated\nBlocks", + "Get Blocks\nAvg (ms)", + "Get Blocks\nMax (ms)", + "Free Blocks\nAvg (ms)", + "Free Blocks\nMax (ms)", + ], + tablefmt="grid", + floatfmt=".6f", + ) + ) + + +def invoke_main() -> None: + parser = FlexibleArgumentParser( + description="Benchmark the performance of BlockPool for KV Cache." 
+ ) + parser.add_argument("--num-gpu-blocks", type=int, default=100000) + parser.add_argument( + "--num-iteration", + type=int, + default=1000, + help="Number of iterations to run to stablize final data readings", + ) + parser.add_argument( + "--allocate-blocks", + type=int, + nargs="*", + default=[10, 50, 100, 500, 1000], + help="Number of blocks to allocate", + ) + args = parser.parse_args() + main(args) + + +if __name__ == "__main__": + invoke_main() # pragma: no cover diff --git a/tests/v1/core/test_kv_cache_utils.py b/tests/v1/core/test_kv_cache_utils.py index 0676cb3eb65..68b06015690 100644 --- a/tests/v1/core/test_kv_cache_utils.py +++ b/tests/v1/core/test_kv_cache_utils.py @@ -132,8 +132,8 @@ def test_free_kv_cache_block_queue_initialization(): block = KVCacheBlock(block_id=0) queue = FreeKVCacheBlockQueue([block]) assert queue.num_free_blocks == 1 - assert queue.free_list_head == block - assert queue.free_list_tail == block + assert queue.fake_free_list_head.next_free_block is block + assert queue.fake_free_list_tail.prev_free_block is block def test_free_kv_cache_block_queue_operations(): @@ -145,36 +145,38 @@ def test_free_kv_cache_block_queue_operations(): # Check initial state assert queue.num_free_blocks == 5 - assert queue.free_list_head == blocks[0] - assert queue.free_list_tail == blocks[4] + assert queue.fake_free_list_head.next_free_block is blocks[0] + assert queue.fake_free_list_tail.prev_free_block is blocks[4] # Pop the first block block1 = queue.popleft() assert block1 == blocks[0] assert queue.num_free_blocks == 4 - assert queue.free_list_head == blocks[1] - assert queue.free_list_tail == blocks[4] + assert queue.fake_free_list_head.next_free_block is blocks[1] + assert queue.fake_free_list_tail.prev_free_block is blocks[4] # Remove a block from the middle block_to_remove = blocks[2] queue.remove(block_to_remove) assert queue.num_free_blocks == 3 - assert blocks[1].next_free_block == blocks[3] - assert blocks[3].prev_free_block == blocks[1] + assert blocks[1].next_free_block is blocks[3] + assert blocks[3].prev_free_block is blocks[1] # Append a block back queue.append(block_to_remove) assert queue.num_free_blocks == 4 - assert queue.free_list_tail == block_to_remove - assert block_to_remove.prev_free_block == blocks[4] - assert block_to_remove.next_free_block is None + assert queue.fake_free_list_tail.prev_free_block is block_to_remove + assert block_to_remove.prev_free_block is blocks[4] + assert block_to_remove.next_free_block is queue.fake_free_list_tail # Pop blocks until empty for _ in range(4): queue.popleft() assert queue.num_free_blocks == 0 - assert queue.free_list_head is None - assert queue.free_list_tail is None + assert (queue.fake_free_list_head.next_free_block + is queue.fake_free_list_tail) + assert (queue.fake_free_list_tail.prev_free_block + is queue.fake_free_list_head) # Attempt to pop from an empty queue with pytest.raises(ValueError) as e: diff --git a/tests/v1/core/test_prefix_caching.py b/tests/v1/core/test_prefix_caching.py index f31bdf74f4a..b7f583de1f6 100644 --- a/tests/v1/core/test_prefix_caching.py +++ b/tests/v1/core/test_prefix_caching.py @@ -155,13 +155,14 @@ def test_prefill(hash_algo): assert block.ref_cnt == 2 # At this point, we should have 5 free blocks left. - assert manager.block_pool.free_block_queue.num_free_blocks == 5 + free_block_queue = manager.block_pool.free_block_queue + assert free_block_queue.num_free_blocks == 5 manager.free(req0) manager.free(req1) # All blocks should be available. 
- assert manager.block_pool.free_block_queue.num_free_blocks == 10 + assert free_block_queue.num_free_blocks == 10 # The order should be # [unallocated (6, 7, 8, 9, 10)] # [unique_req0 (4)] @@ -188,14 +189,10 @@ def test_prefill(hash_algo): # Although we only have 6 free blocks, we have 8 blocks in # the free block queue due to lazy removal. - assert manager.block_pool.free_block_queue.num_free_blocks == 6 - assert all([ - b.ref_cnt == 0 - for b in manager.block_pool.free_block_queue.get_all_free_blocks() - ]) - assert len([ - b for b in manager.block_pool.free_block_queue.get_all_free_blocks() - ]) == 6 + assert free_block_queue.num_free_blocks == 6 + assert all( + [b.ref_cnt == 0 for b in free_block_queue.get_all_free_blocks()]) + assert len([b for b in free_block_queue.get_all_free_blocks()]) == 6 manager.free(req2) @@ -209,9 +206,12 @@ def test_prefill(hash_algo): computed_blocks) # This block ID order also checks the eviction order. assert blocks.get_block_ids() == ([7, 8, 9, 10, 4, 5, 6, 3, 2, 1], ) - assert manager.block_pool.free_block_queue.num_free_blocks == 0 - assert manager.block_pool.free_block_queue.free_list_head is None - assert manager.block_pool.free_block_queue.free_list_tail is None + + assert free_block_queue.num_free_blocks == 0 + assert (free_block_queue.fake_free_list_head.next_free_block + is free_block_queue.fake_free_list_tail) + assert (free_block_queue.fake_free_list_tail.prev_free_block + is free_block_queue.fake_free_list_head) def test_prefill_hybrid_model(): diff --git a/vllm/v1/core/kv_cache_utils.py b/vllm/v1/core/kv_cache_utils.py index 6067a127e97..b1fab0d34de 100644 --- a/vllm/v1/core/kv_cache_utils.py +++ b/vllm/v1/core/kv_cache_utils.py @@ -212,27 +212,65 @@ class FreeKVCacheBlockQueue: def __init__(self, blocks: list[KVCacheBlock]) -> None: self.num_free_blocks = len(blocks) - # Initialize the doubly linked list of free blocks. - self.free_list_head: Optional[KVCacheBlock] = blocks[0] - self.free_list_tail: Optional[KVCacheBlock] = blocks[-1] + # Initialize doubly links of consecutive blocks for i in range(self.num_free_blocks): if i > 0: blocks[i].prev_free_block = blocks[i - 1] if i < self.num_free_blocks - 1: blocks[i].next_free_block = blocks[i + 1] + # Create a fake head and a tail block for the doubly linked list to + # reduce branching in the code + # + # The implementation garenteed that the fake head and tail + # are NEVER got popped, so we could safely assume each real blocks + # in the queue has prev and next blocks. + self.fake_free_list_head = KVCacheBlock(block_id=-1) + self.fake_free_list_tail = KVCacheBlock(block_id=-1) + if self.num_free_blocks > 0: + # Connect fake_head and fake_tail to the first and last block + # respectively. + self.fake_free_list_head.next_free_block = blocks[0] + blocks[0].prev_free_block = self.fake_free_list_head + self.fake_free_list_tail.prev_free_block = blocks[-1] + blocks[-1].next_free_block = self.fake_free_list_tail + else: + # For empty list, simply connect the fake head and tail. + self.fake_free_list_head.next_free_block = self.fake_free_list_tail + self.fake_free_list_tail.prev_free_block = self.fake_free_list_head + def popleft(self) -> KVCacheBlock: """Pop the first free block and reduce num_free_blocks by 1. Returns: The first free block. 
""" - if not self.free_list_head: + if (self.fake_free_list_head.next_free_block + is self.fake_free_list_tail + or self.fake_free_list_head.next_free_block is None): + assert self.num_free_blocks == 0, ( + f"num_free_blocks ({self.num_free_blocks}) is out of sync " + "with the free list.") raise ValueError("No free blocks available") - block = self.free_list_head - self.remove(block) - return block + first_block: KVCacheBlock = self.fake_free_list_head.next_free_block + + if first_block.next_free_block is None: + # This should not happen if the block is from the free list. + # It indicates a bug in the caller's logic. + raise RuntimeError("Invalid block found in popleft() " + "which doesn't have a valid next_free_block") + + # Connect fake_head and the next block of first_block (i.e. second block + # or fake tail). + self.fake_free_list_head.next_free_block = first_block.next_free_block + first_block.next_free_block.prev_free_block = self.fake_free_list_head + + # Remove the block from the linked list. + first_block.prev_free_block = first_block.next_free_block = None + + self.num_free_blocks -= 1 + return first_block def remove(self, block: KVCacheBlock) -> None: """Remove a block in the free list and reduce num_free_blocks by 1. @@ -240,19 +278,15 @@ def remove(self, block: KVCacheBlock) -> None: Args: block: The block to remove. """ - if block.prev_free_block is not None: - # Link the previous block to the next block. - block.prev_free_block.next_free_block = block.next_free_block - if block.next_free_block is not None: - # Link the next block to the previous block. - block.next_free_block.prev_free_block = block.prev_free_block - - if block == self.free_list_head: - # Update the head if the block is the head. - self.free_list_head = block.next_free_block - if block == self.free_list_tail: - # Update the tail if the block is the tail. - self.free_list_tail = block.prev_free_block + if block.prev_free_block is None or block.next_free_block is None: + # This should not happen if the block is from the free list. + # It indicates a bug in the caller's logic. + raise RuntimeError(f"remove() called on an invalid block: {block}") + + # Link the previous block to the next block. + block.prev_free_block.next_free_block = block.next_free_block + # Link the next block to the previous block. + block.next_free_block.prev_free_block = block.prev_free_block # Remove the block from the linked list. block.prev_free_block = block.next_free_block = None @@ -265,17 +299,19 @@ def append(self, block: KVCacheBlock) -> None: Args: block: The block to append. """ - if self.free_list_tail is not None: - # Link the last block to the new block. - self.free_list_tail.next_free_block = block - block.prev_free_block = self.free_list_tail - self.free_list_tail = block - else: - # The free list is empty. - assert self.free_list_head is None - self.free_list_head = self.free_list_tail = block + if self.fake_free_list_tail.prev_free_block is None: + raise RuntimeError( + "prev_free_block of fake_free_list_tail should always exist") + last_block: KVCacheBlock = self.fake_free_list_tail.prev_free_block + + # Connect the new block after the last block. + last_block.next_free_block = block + block.prev_free_block = last_block + + # Connect the fake tail after the new block. 
+ block.next_free_block = self.fake_free_list_tail + self.fake_free_list_tail.prev_free_block = block - block.next_free_block = None self.num_free_blocks += 1 def get_all_free_blocks(self) -> list[KVCacheBlock]: @@ -285,8 +321,14 @@ def get_all_free_blocks(self) -> list[KVCacheBlock]: A list of free blocks. """ ret = [] - curr_block = self.free_list_head - while curr_block is not None: + if self.fake_free_list_head.next_free_block is None: + raise RuntimeError( + "next_free_block of fake_free_list_head should always exist") + # Start from the first block + curr_block: KVCacheBlock = self.fake_free_list_head.next_free_block + # As long as next_free_block is available, we haven't reached to + # the fake tail yet. + while curr_block.next_free_block is not None: ret.append(curr_block) curr_block = curr_block.next_free_block return ret From 55447fe1ce828ba2d459b1b7bbe7ec7b0e79b188 Mon Sep 17 00:00:00 2001 From: hax0r31337 <65506006+hax0r31337@users.noreply.github.com> Date: Sat, 19 Jul 2025 00:40:18 +0200 Subject: [PATCH 185/552] [Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) (#21077) Signed-off-by: hax0r31337 Signed-off-by: x22x22 --- vllm/attention/layer.py | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/vllm/attention/layer.py b/vllm/attention/layer.py index f9c2d4f4983..b6b93ff4a0a 100644 --- a/vllm/attention/layer.py +++ b/vllm/attention/layer.py @@ -16,6 +16,7 @@ has_kv_transfer_group, is_v1_kv_transfer_group) from vllm.forward_context import ForwardContext, get_forward_context +from vllm.logger import init_logger from vllm.model_executor.layers.linear import UnquantizedLinearMethod from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) @@ -23,6 +24,34 @@ from vllm.platforms import _Backend, current_platform from vllm.utils import direct_register_custom_op +logger = init_logger(__name__) +USE_XFORMERS_OPS = None + + +def check_xformers_availability(): + global USE_XFORMERS_OPS + if USE_XFORMERS_OPS is not None: + return USE_XFORMERS_OPS + + if current_platform.is_cuda() and current_platform.has_device_capability( + 100): + # Xformers FA is not compatible with B200 + USE_XFORMERS_OPS = False + else: + try: + from importlib.util import find_spec + + find_spec("xformers.ops") + USE_XFORMERS_OPS = True + except ImportError: + USE_XFORMERS_OPS = False + + # the warning only needs to be shown once + if not USE_XFORMERS_OPS: + logger.warning("Xformers is not available, falling back.") + + return USE_XFORMERS_OPS + class Attention(nn.Module): """Attention layer. 
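The `check_xformers_availability` helper introduced above uses a probe-once-and-cache pattern for an optional dependency: the check runs a single time and the result is memoized in a module-level global. A minimal sketch of the same idea, with illustrative names and a plain `find_spec` probe in place of the device-capability checks used in the real code:

```python
# Sketch only: cache an optional-dependency probe in a module-level global.
from importlib.util import find_spec
from typing import Optional

_HAS_XFORMERS: Optional[bool] = None


def has_xformers() -> bool:
    global _HAS_XFORMERS
    if _HAS_XFORMERS is None:
        # Probe once; later calls return the cached result.
        _HAS_XFORMERS = find_spec("xformers") is not None
    return _HAS_XFORMERS


print("xformers available:", has_xformers())
```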
@@ -314,6 +343,10 @@ def __init__( _Backend.TORCH_SDPA, _Backend.XFORMERS, _Backend.PALLAS_VLLM_V1 } else _Backend.TORCH_SDPA + if (self.attn_backend == _Backend.XFORMERS + and not check_xformers_availability()): + self.attn_backend = _Backend.TORCH_SDPA + def forward( self, query: torch.Tensor, From 53b712af159db0aecd490c901a036fdef397b41c Mon Sep 17 00:00:00 2001 From: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Date: Fri, 18 Jul 2025 17:46:09 -0700 Subject: [PATCH 186/552] Elastic Expert Parallel Initial Support (#20775) Signed-off-by: Rui Qiao Signed-off-by: x22x22 --- examples/online_serving/elastic_ep/bench.sh | 57 ++++ examples/online_serving/elastic_ep/scale.py | 53 ++++ .../elastic_ep/serve_deepseek_v2.sh | 72 +++++ tools/ep_kernels/elastic_ep/eep_nvshmem.patch | 92 +++++++ .../elastic_ep/install_eep_libraries.sh | 86 ++++++ vllm/config.py | 13 + vllm/distributed/eplb/eplb_state.py | 252 +++++++++++++++--- vllm/distributed/eplb/rebalance_execute.py | 117 ++++++++ vllm/engine/protocol.py | 6 + vllm/entrypoints/openai/api_server.py | 105 ++++++++ vllm/executor/uniproc_executor.py | 9 + vllm/model_executor/layers/fused_moe/layer.py | 39 ++- vllm/model_executor/models/deepseek_v2.py | 23 +- vllm/model_executor/models/interfaces.py | 7 + vllm/v1/engine/__init__.py | 16 ++ vllm/v1/engine/async_llm.py | 58 ++++ vllm/v1/engine/coordinator.py | 32 ++- vllm/v1/engine/core.py | 69 ++++- vllm/v1/engine/core_client.py | 189 ++++++++++++- vllm/v1/engine/utils.py | 225 +++++++++++++++- vllm/v1/executor/ray_distributed_executor.py | 9 + vllm/v1/worker/cpu_model_runner.py | 2 +- vllm/v1/worker/gpu_model_runner.py | 37 ++- vllm/v1/worker/gpu_worker.py | 159 ++++++++++- 24 files changed, 1659 insertions(+), 68 deletions(-) create mode 100644 examples/online_serving/elastic_ep/bench.sh create mode 100644 examples/online_serving/elastic_ep/scale.py create mode 100644 examples/online_serving/elastic_ep/serve_deepseek_v2.sh create mode 100644 tools/ep_kernels/elastic_ep/eep_nvshmem.patch create mode 100644 tools/ep_kernels/elastic_ep/install_eep_libraries.sh diff --git a/examples/online_serving/elastic_ep/bench.sh b/examples/online_serving/elastic_ep/bench.sh new file mode 100644 index 00000000000..e4763146561 --- /dev/null +++ b/examples/online_serving/elastic_ep/bench.sh @@ -0,0 +1,57 @@ +#!/bin/bash + +MODEL_NAME="deepseek-ai/DeepSeek-V2-Lite" +LOCAL_MODEL_PATH="/models/models--deepseek-ai--DeepSeek-V2-Lite/snapshots/604d5664dddd88a0433dbae533b7fe9472482de0" +HOST="localhost" +PORT=8006 +NUM_PROMPTS=20 +REQUEST_RATE=5 + +# Parse command line arguments +while [[ $# -gt 0 ]]; do + case $1 in + --model) + MODEL_NAME="$2" + shift 2 + ;; + --local-model) + MODEL_NAME=$LOCAL_MODEL_PATH + shift + ;; + --host) + HOST="$2" + shift 2 + ;; + --port) + PORT="$2" + shift 2 + ;; + --num-prompts) + NUM_PROMPTS="$2" + shift 2 + ;; + --request-rate) + REQUEST_RATE="$2" + shift 2 + ;; + -h|--help) + echo "Usage: $0 [OPTIONS]" + echo "Options:" + echo " --model MODEL_NAME Set model name or path (default: deepseek-ai/DeepSeek-V2-Lite)" + echo " --local-model Use local model path (convenience option)" + exit 0 + ;; + *) + echo "Unknown option: $1" + echo "Use -h or --help for usage information" + exit 1 + ;; + esac +done + +vllm bench serve \ + --model $MODEL_NAME \ + --host $HOST \ + --port $PORT \ + --num-prompts $NUM_PROMPTS \ + --request-rate $REQUEST_RATE diff --git a/examples/online_serving/elastic_ep/scale.py b/examples/online_serving/elastic_ep/scale.py new file mode 100644 index 
00000000000..a93c299e323 --- /dev/null +++ b/examples/online_serving/elastic_ep/scale.py @@ -0,0 +1,53 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import argparse +import json +import sys + +import requests + + +def scale(host, port, new_dp_size): + url = f"http://{host}:{port}/scale_elastic_ep" + payload = {"new_data_parallel_size": new_dp_size} + headers = {"Content-Type": "application/json"} + + print(f"Sending scale request to {url}") + print(f"Payload: {json.dumps(payload, indent=2)}") + + try: + response = requests.post(url, json=payload, headers=headers, timeout=300) + + print(f"Status Code: {response.status_code}") + print(f"Response: {response.text}") + + if response.status_code == 200: + print("Scale up/down request successful!") + return True + else: + print("Scale up/down request failed!") + return False + + except requests.exceptions.RequestException as e: + print(f"Request failed: {e}") + return False + + +def main(): + parser = argparse.ArgumentParser(description="Test scale up/down functionality") + parser.add_argument("--host", default="localhost", help="API server host") + parser.add_argument("--port", type=int, default=8006, help="API server port") + parser.add_argument( + "--new-dp-size", type=int, default=2, help="New data parallel size" + ) + + args = parser.parse_args() + + success = scale(args.host, args.port, args.new_dp_size) + sys.exit(0 if success else 1) + + +if __name__ == "__main__": + main() diff --git a/examples/online_serving/elastic_ep/serve_deepseek_v2.sh b/examples/online_serving/elastic_ep/serve_deepseek_v2.sh new file mode 100644 index 00000000000..1234ebba4d8 --- /dev/null +++ b/examples/online_serving/elastic_ep/serve_deepseek_v2.sh @@ -0,0 +1,72 @@ +#!/bin/bash + +HOST="0.0.0.0" +PORT=8006 +DATA_PARALLEL_SIZE=4 +REDUNDANT_EXPERTS=0 +LOCAL_MODEL_PATH="/models/models--deepseek-ai--DeepSeek-V2-Lite/snapshots/604d5664dddd88a0433dbae533b7fe9472482de0" +MODEL_NAME="deepseek-ai/DeepSeek-V2-Lite" + +while [[ $# -gt 0 ]]; do + case $1 in + --dp) + DATA_PARALLEL_SIZE="$2" + shift 2 + ;; + --re) + REDUNDANT_EXPERTS="$2" + shift 2 + ;; + --host) + HOST="$2" + shift 2 + ;; + --port) + PORT="$2" + shift 2 + ;; + --model) + MODEL_NAME="$2" + shift 2 + ;; + --local-model) + MODEL_NAME=$LOCAL_MODEL_PATH + shift + ;; + -h|--help) + echo "Usage: $0 [OPTIONS]" + echo "Options:" + echo " --dp SIZE Set data parallel size (default: 4)" + echo " --re SIZE Set redundant experts (default: 0)" + echo " --host HOST Set host address (default: 0.0.0.0)" + echo " --port PORT Set port number (default: 8006)" + echo " --model MODEL_NAME Set model name or path" + echo " -h, --help Show this help message" + exit 0 + ;; + *) + echo "Unknown option: $1" + echo "Use -h or --help for usage information" + exit 1 + ;; + esac +done + +echo "Starting vLLM server for $MODEL_NAME with data parallel size: $DATA_PARALLEL_SIZE and redundant experts: $REDUNDANT_EXPERTS" + +export RAY_DEDUP_LOGS=0 +export VLLM_USE_V1=1 +export VLLM_ALL2ALL_BACKEND="pplx" +export VLLM_USE_DEEP_GEMM=1 + +vllm serve $MODEL_NAME \ + --data-parallel-size $DATA_PARALLEL_SIZE \ + --data-parallel-size-local $DATA_PARALLEL_SIZE \ + --data-parallel-backend ray \ + --enforce-eager \ + --enable-expert-parallel \ + --enable-eplb \ + --num-redundant-experts $REDUNDANT_EXPERTS \ + --trust-remote-code \ + --host $HOST \ + --port $PORT diff --git a/tools/ep_kernels/elastic_ep/eep_nvshmem.patch 
b/tools/ep_kernels/elastic_ep/eep_nvshmem.patch new file mode 100644 index 00000000000..5ebdaea58dd --- /dev/null +++ b/tools/ep_kernels/elastic_ep/eep_nvshmem.patch @@ -0,0 +1,92 @@ +From 18c0599c2f07ec965132efa25961dc8179c2dda3 Mon Sep 17 00:00:00 2001 +From: Yongji Wu +Date: Tue, 20 May 2025 13:41:12 -0700 +Subject: [PATCH] fix reinit issues due to states not cleaned up + +fix double free +--- + src/host/init/init.cu | 10 ++++++++++ + .../internal/host/nvshmemi_mem_transport.hpp | 15 +++++++++++++++ + src/modules/bootstrap/uid/bootstrap_uid.cpp | 5 +++++ + 3 files changed, 30 insertions(+) + +diff --git a/src/host/init/init.cu b/src/host/init/init.cu +index b1c5dbf..1fecb4b 100644 +--- a/src/host/init/init.cu ++++ b/src/host/init/init.cu +@@ -43,6 +43,8 @@ + #include "internal/host/nvshmemi_types.h" + #include "internal/host/shared_memory.h" + #include "internal/host/nvshmemi_symmetric_heap.hpp" ++// eep-dev ++#include "internal/host/nvshmemi_mem_transport.hpp" + + extern __constant__ nvshmemi_device_host_state_t nvshmemi_device_state_d; + static std::map registered_device_states; +@@ -1293,6 +1295,14 @@ void nvshmemid_hostlib_finalize(void *device_ctx, void *transport_device_ctx) { + /* Multi-init Multi-fini*/ + nvshmemi_state = NULL; + nvshmemi_device_state.nvshmemi_is_nvshmem_initialized = 0; ++ ++ // eep-dev ++ nvshmemi_mem_p2p_transport::destroy_instance(); ++ nvshmemi_mem_remote_transport::destroy_instance(); ++ free(nvshmemi_default_session); ++ nvshmemi_default_session = nullptr; ++ nvshmemi_device_state.nvshmemi_is_nvshmem_bootstrapped = false; ++ + nvshmemi_is_device_state_ready = false; + } else + nvshmemi_boot_handle.barrier(&nvshmemi_boot_handle); +diff --git a/src/include/internal/host/nvshmemi_mem_transport.hpp b/src/include/internal/host/nvshmemi_mem_transport.hpp +index 2495844..e4f408a 100644 +--- a/src/include/internal/host/nvshmemi_mem_transport.hpp ++++ b/src/include/internal/host/nvshmemi_mem_transport.hpp +@@ -36,6 +36,13 @@ class nvshmemi_mem_p2p_transport final { + return p2p_objref_; + } + } ++ // eep-dev ++ static void destroy_instance(void) { ++ if (p2p_objref_ != nullptr) { ++ delete p2p_objref_; ++ p2p_objref_ = nullptr; ++ } ++ } + + void print_mem_handle(int pe_id, int transport_idx, nvshmemi_symmetric_heap &obj); + +@@ -87,6 +94,14 @@ class nvshmemi_mem_remote_transport final { + } + } + ++ // eep-dev ++ static void destroy_instance(void) { ++ if (remote_objref_ != nullptr) { ++ delete remote_objref_; ++ remote_objref_ = nullptr; ++ } ++ } ++ + int gather_mem_handles(nvshmemi_symmetric_heap &obj, uint64_t heap_offset, size_t size); + /* On-demand registration and release of memory */ + int register_mem_handle(nvshmem_mem_handle_t *local_handles, int transport_idx, +diff --git a/src/modules/bootstrap/uid/bootstrap_uid.cpp b/src/modules/bootstrap/uid/bootstrap_uid.cpp +index a1fa748..788fa96 100644 +--- a/src/modules/bootstrap/uid/bootstrap_uid.cpp ++++ b/src/modules/bootstrap/uid/bootstrap_uid.cpp +@@ -630,6 +630,11 @@ int nvshmemi_bootstrap_plugin_pre_init(bootstrap_handle_t* handle, const int abi + // Discover the network for bootstrap, if not done previously. 
+ // This code needs to be stateful to be able to be called multiple times by the caller + BOOTSTRAP_CHECK(bootstrap_net_init()); ++ // eep-dev ++ if (handle->pre_init_ops != nullptr) { ++ BOOTSTRAP_PTR_FREE(handle->pre_init_ops); ++ handle->pre_init_ops = nullptr; ++ } + if (handle->pre_init_ops == nullptr) { + BOOTSTRAP_CALLOC(&handle->pre_init_ops, 1); + handle->pre_init_ops->get_unique_id = bootstrap_get_unique_id; +-- +2.43.0 + diff --git a/tools/ep_kernels/elastic_ep/install_eep_libraries.sh b/tools/ep_kernels/elastic_ep/install_eep_libraries.sh new file mode 100644 index 00000000000..9d7dc1032f5 --- /dev/null +++ b/tools/ep_kernels/elastic_ep/install_eep_libraries.sh @@ -0,0 +1,86 @@ +#!/bin/bash + +set -ex + +# Default workspace directory +WORKSPACE=$(pwd)/eep_kernels_workspace +INSTALL_NVSHMEM=true + +# Parse command line arguments +while getopts "w:n" opt; do + case $opt in + w) + WORKSPACE="$OPTARG" + ;; + n) + INSTALL_NVSHMEM=false + ;; + \?) + echo "Invalid option: -$OPTARG" >&2 + exit 1 + ;; + esac +done + +if [ ! -d "$WORKSPACE" ]; then + mkdir -p $WORKSPACE +fi + + +# install dependencies if not installed +pip3 install cmake torch ninja + +# build nvshmem +pushd $WORKSPACE +# Reset NVSHMEM build if requested +if [ "$INSTALL_NVSHMEM" = true ]; then + mkdir -p nvshmem_src + wget https://developer.download.nvidia.com/compute/redist/nvshmem/3.2.5/source/nvshmem_src_3.2.5-1.txz + tar -xvf nvshmem_src_3.2.5-1.txz -C nvshmem_src --strip-components=1 + pushd nvshmem_src + wget https://github.com/deepseek-ai/DeepEP/raw/main/third-party/nvshmem.patch + git init + git apply -vvv nvshmem.patch + git apply --reject --whitespace=fix ../../eep_nvshmem.patch +else + pushd nvshmem_src +fi + +# assume CUDA_HOME is set correctly +if [ -z "$CUDA_HOME" ]; then + echo "CUDA_HOME is not set, please set it to your CUDA installation directory." + exit 1 +fi + +# disable all features except IBGDA +export NVSHMEM_IBGDA_SUPPORT=1 + +export NVSHMEM_SHMEM_SUPPORT=0 +export NVSHMEM_UCX_SUPPORT=0 +export NVSHMEM_USE_NCCL=0 +export NVSHMEM_PMIX_SUPPORT=0 +export NVSHMEM_TIMEOUT_DEVICE_POLLING=0 +export NVSHMEM_USE_GDRCOPY=0 +export NVSHMEM_IBRC_SUPPORT=0 +export NVSHMEM_BUILD_TESTS=0 +export NVSHMEM_BUILD_EXAMPLES=0 +export NVSHMEM_MPI_SUPPORT=0 +export NVSHMEM_BUILD_HYDRA_LAUNCHER=0 +export NVSHMEM_BUILD_TXZ_PACKAGE=0 +export NVSHMEM_TIMEOUT_DEVICE_POLLING=0 + +cmake -G Ninja -S . -B $WORKSPACE/nvshmem_build/ -DCMAKE_INSTALL_PREFIX=$WORKSPACE/nvshmem_install +cmake --build $WORKSPACE/nvshmem_build/ --target install + +popd + +export CMAKE_PREFIX_PATH=$WORKSPACE/nvshmem_install:$CMAKE_PREFIX_PATH + +# build and install pplx, require pytorch installed +pushd $WORKSPACE +git clone https://github.com/ppl-ai/pplx-kernels +cd pplx-kernels +# see https://github.com/pypa/pip/issues/9955#issuecomment-838065925 +# PIP_NO_BUILD_ISOLATION=0 disables build isolation +PIP_NO_BUILD_ISOLATION=0 TORCH_CUDA_ARCH_LIST=9.0a+PTX pip install . 
--no-deps -v + diff --git a/vllm/config.py b/vllm/config.py index 075aae9467c..ef0bd9a3d0d 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -2008,6 +2008,19 @@ def has_unfinished_dp(dp_group: "ProcessGroup", aggregated_has_unfinished = bool(tensor.item()) return aggregated_has_unfinished + @staticmethod + def sync_kv_cache_memory_size(dp_group: "ProcessGroup", + kv_cache_memory: int) -> int: + if kv_cache_memory == -1: + kv_cache_memory = torch.iinfo(torch.int64).max + tensor = torch.tensor([kv_cache_memory], + dtype=torch.int64, + device="cpu") + # we cannot use broadcast for stateless dp group since it depends + # on global rank + torch.distributed.all_reduce(tensor, op=ReduceOp.MIN, group=dp_group) + return tensor.item() + def compute_hash(self): """ Provide a hash that uniquely identifies all the configs diff --git a/vllm/distributed/eplb/eplb_state.py b/vllm/distributed/eplb/eplb_state.py index 6b0a126ca9b..af646208496 100644 --- a/vllm/distributed/eplb/eplb_state.py +++ b/vllm/distributed/eplb/eplb_state.py @@ -29,12 +29,15 @@ import time from collections.abc import Sequence from dataclasses import dataclass +from typing import Optional, Union import torch -from torch.distributed import all_gather, all_reduce +from torch.distributed import ProcessGroup, all_gather, all_reduce from vllm.config import ParallelConfig -from vllm.distributed.parallel_state import get_ep_group, get_node_count +from vllm.distributed.parallel_state import (get_ep_group, get_node_count, + in_the_same_node_as) +from vllm.distributed.utils import StatelessProcessGroup from vllm.logger import init_logger from vllm.model_executor.models.interfaces import MixtureOfExperts @@ -172,6 +175,9 @@ def build( model: MixtureOfExperts, device: torch.device, parallel_config: ParallelConfig, + global_expert_load: Optional[torch.Tensor] = None, + old_global_expert_indices: Optional[torch.Tensor] = None, + rank_mapping: Optional[dict[int, int]] = None, ) -> "EplbState": """ Build the initial EPLB state. @@ -185,8 +191,16 @@ def build( physical_to_logical_map_list, device=device, ) + # Assuming 8 GPUs per node, this supports up to + # (1023 + 1) / 8 = 128 nodes for now. 
+ # TODO(rui): make this configurable + MAX_EXPERT_REDUNDANCY = 1023 + assert model.num_redundant_experts <= MAX_EXPERT_REDUNDANCY, ( + f"num_redundant_experts {model.num_redundant_experts} " + f"must be less than or equal to {MAX_EXPERT_REDUNDANCY}") + max_slots_per_logical_expert = MAX_EXPERT_REDUNDANCY + 1 logical_to_physical_map = torch.full( - (model.num_logical_experts, model.num_redundant_experts + 1), + (model.num_logical_experts, max_slots_per_logical_expert), -1, device=device, ) @@ -235,11 +249,63 @@ def build( expert_rearrangement_step = max( 0, eplb_step_interval - eplb_step_interval // 4) + if global_expert_load is not None: + ep_group = get_ep_group().device_group + assert global_expert_load.shape == (model.num_moe_layers, + model.num_logical_experts) + assert global_expert_load.dtype == torch.int64 + + num_replicas = model.num_physical_experts + num_groups = model.num_expert_groups + num_nodes = get_node_count() + num_gpus = ep_group.size() + + if num_gpus % num_nodes != 0: + num_nodes = 1 + logger.warning_once( + f"num_gpus % num_nodes != 0, " + "not using hierarchical rearrangement algorithm.\n" + f"{num_gpus=}, {num_nodes=}") + + # Get new expert mappings + ( + new_physical_to_logical_map, + new_logical_to_physical_map, + new_logical_replica_count, + ) = (rebalance_experts( + global_expert_load, + num_replicas, + num_groups, + num_nodes, + num_gpus, + )) + + max_physical_slots = new_logical_to_physical_map.shape[-1] + assert max_physical_slots <= logical_to_physical_map.shape[-1] + new_logical_to_physical_map = torch.nn.functional.pad( + new_logical_to_physical_map, + (0, logical_to_physical_map.shape[-1] - max_physical_slots), + value=-1, + ) + physical_to_logical_map = new_physical_to_logical_map.to(device) + logical_to_physical_map.copy_(new_logical_to_physical_map) + logical_replica_count.copy_(new_logical_replica_count) + model.set_eplb_state( expert_load_pass, logical_to_physical_map, logical_replica_count, ) + if global_expert_load is not None: + rearrange_expert_weights_inplace( + old_global_expert_indices, + new_physical_to_logical_map, + model.expert_weights, + ep_group, + False, + rank_mapping, + ) + expert_rearrangement_step = 0 return cls( physical_to_logical_map, @@ -337,7 +403,10 @@ def step(self, def rearrange(self, model: MixtureOfExperts, - is_profile: bool = False) -> None: + is_profile: bool = False, + execute_shuffle: bool = True, + global_expert_load: Optional[torch.Tensor] = None, + rank_mapping: Optional[dict[int, int]] = None) -> None: """ Rearrange the experts according to the current load. 
""" @@ -353,42 +422,79 @@ def rearrange(self, logger.info("Rearranging experts %s...", "(profile)" if is_profile else "") - # This mapping is only used here, so we do not store it in the state - physical_expert_start = ep_rank * model.num_local_physical_experts - physical_expert_end = (physical_expert_start + - model.num_local_physical_experts) - # (num_moe_layers, num_local_physical_experts) - local_physical_to_logical_map = self.physical_to_logical_map[ - :, - physical_expert_start:physical_expert_end, - ] + if global_expert_load is None: + # This mapping is only used here, so we do not store it in the state + physical_expert_start = ep_rank * model.num_local_physical_experts + physical_expert_end = (physical_expert_start + + model.num_local_physical_experts) + # (num_moe_layers, num_local_physical_experts) + local_physical_to_logical_map = self.physical_to_logical_map[ + :, + physical_expert_start:physical_expert_end, + ] - # Map the local physical expert load to global logical experts - logical_expert_load_window = torch.zeros( - self.expert_load_window_size, - model.num_moe_layers, - model.num_logical_experts, - dtype=self.expert_load_window.dtype, - device=self.expert_load_window.device, - ) - logical_expert_load_window.scatter_add_( - dim=-1, - index=local_physical_to_logical_map.unsqueeze(0).expand_as( - self.expert_load_window).long(), - src=self.expert_load_window, - ) + # Map the local physical expert load to global logical experts + logical_expert_load_window = torch.zeros( + self.expert_load_window_size, + model.num_moe_layers, + model.num_logical_experts, + dtype=self.expert_load_window.dtype, + device=self.expert_load_window.device, + ) + logical_expert_load_window.scatter_add_( + dim=-1, + index=local_physical_to_logical_map.unsqueeze(0).expand_as( + self.expert_load_window).long(), + src=self.expert_load_window, + ) - # Perform all-reduce to get the expert load across all ranks - global_expert_load_window = logical_expert_load_window.sum(dim=0) - all_reduce(global_expert_load_window, group=ep_group) + if not execute_shuffle: + metadata = torch.tensor( + [ + model.num_moe_layers, model.num_logical_experts, + self.physical_to_logical_map.shape[1] + ], + dtype=torch.int32, + device="cpu", + ) + torch.distributed.broadcast(metadata, + group=get_ep_group().cpu_group, + group_src=0) + + # Perform all-reduce to get the expert load across all ranks + global_expert_load_window = logical_expert_load_window.sum(dim=0) + all_reduce(global_expert_load_window, group=ep_group) + + if not execute_shuffle: + # (num_moe_layers, old_num_physical_experts) + old_global_expert_indices = self.physical_to_logical_map + torch.distributed.broadcast(old_global_expert_indices, + group=ep_group, + group_src=0) + return global_expert_load_window + else: + assert execute_shuffle + global_expert_load_window = global_expert_load # TODO(bowen): Treat differently for prefill and decode nodes num_replicas = model.num_physical_experts num_groups = model.num_expert_groups - num_nodes = get_node_count() - num_gpus = ep_group.size() + if rank_mapping is not None and len(rank_mapping) == ep_group.size(): + # NOTE(yongji): scale down, we need to rebalance the experts on + # remaining GPUs, transfer the experts while we haven't shutdown + # the GPUs to be released. 
+            cpu_group = get_ep_group().cpu_group
+            num_nodes = _node_count_with_rank_mapping(cpu_group, rank_mapping)
+            num_gpus = sum(new_rank != -1
+                           for new_rank in rank_mapping.values())
+            num_replicas = num_replicas // ep_group.size(
+            ) * num_gpus  # handle num replicas change
+        else:
+            num_nodes = get_node_count()
+            num_gpus = ep_group.size()
 
         if num_gpus % num_nodes != 0:
+            num_nodes = 1
             logger.warning_once(
                 f"num_gpus % num_nodes != 0, "
                 "not using hierarchical rearrangement algorithm.\n"
                 f"{num_gpus=}, {num_nodes=}")
@@ -414,10 +520,24 @@ def rearrange(self,
             model.expert_weights,
             ep_group,
             is_profile,
+            rank_mapping,
         )
 
         if not is_profile:
-            self.physical_to_logical_map.copy_(new_physical_to_logical_map)
+            if self.physical_to_logical_map.shape[
+                    1] != new_physical_to_logical_map.shape[1]:
+                self.physical_to_logical_map = new_physical_to_logical_map.to(
+                    self.physical_to_logical_map.device)
+            else:
+                self.physical_to_logical_map.copy_(new_physical_to_logical_map)
+            max_physical_slots = new_logical_to_physical_map.shape[-1]
+            assert max_physical_slots <= self.logical_to_physical_map.shape[-1]
+            new_logical_to_physical_map = torch.nn.functional.pad(
+                new_logical_to_physical_map,
+                (0,
+                 self.logical_to_physical_map.shape[-1] - max_physical_slots),
+                value=-1,
+            )
             self.logical_to_physical_map.copy_(new_logical_to_physical_map)
             self.logical_replica_count.copy_(new_logical_replica_count)
 
@@ -430,3 +550,69 @@ def rearrange(self,
             " (profile) " if is_profile else " ",
             time_end - time_start,
         )
+
+    @staticmethod
+    def recv_state() -> tuple[torch.Tensor, torch.Tensor]:
+        """
+        Receive the expert load and old placement from the master rank.
+        """
+        ep_group = get_ep_group()
+        metadata = torch.empty(3, dtype=torch.int32, device="cpu")
+        torch.distributed.broadcast(metadata,
+                                    group=ep_group.cpu_group,
+                                    group_src=0)
+        num_moe_layers, num_logical_experts, num_old_physical_experts = (
+            metadata.tolist())
+        global_expert_load = torch.zeros(
+            (num_moe_layers, num_logical_experts),
+            dtype=torch.int64,
+            device=ep_group.device,
+        )
+        all_reduce(global_expert_load, group=ep_group.device_group)
+        old_global_expert_indices = torch.empty(
+            (num_moe_layers, num_old_physical_experts),
+            dtype=torch.int64,
+            device=ep_group.device,
+        )
+        torch.distributed.broadcast(old_global_expert_indices,
+                                    group=ep_group.device_group,
+                                    group_src=0)
+
+        return global_expert_load, old_global_expert_indices
+
+
+def _node_count_with_rank_mapping(
+    pg: Union[ProcessGroup, StatelessProcessGroup],
+    rank_mapping: dict[int, int],
+) -> int:
+    if isinstance(pg, ProcessGroup):
+        world_size = torch.distributed.get_world_size(group=pg)
+    else:
+        world_size = pg.world_size
+
+    if world_size == 1:
+        return 1
+
+    # Build node assignment map
+    node_assignment = [0] * world_size  # rank -> node_id
+    next_node_id = 0
+
+    for current_rank in range(world_size):
+        if node_assignment[current_rank] != 0:
+            continue  # Already assigned to a node
+
+        assert current_rank in rank_mapping
+        if rank_mapping[current_rank] == -1:
+            continue  # Pending shutdown
+
+        # Assign current rank to a new node
+        next_node_id += 1
+        node_assignment[current_rank] = next_node_id
+
+        # Find all ranks on the same node as current_rank
+        same_node_flags = in_the_same_node_as(pg, current_rank)
+        for other_rank, is_same_node in enumerate(same_node_flags):
+            if is_same_node and node_assignment[other_rank] == 0:
+                node_assignment[other_rank] = next_node_id
+
+    return next_node_id
diff --git a/vllm/distributed/eplb/rebalance_execute.py b/vllm/distributed/eplb/rebalance_execute.py
index 
2ef8587b559..f8a7d1170bb 100644 --- a/vllm/distributed/eplb/rebalance_execute.py +++ b/vllm/distributed/eplb/rebalance_execute.py @@ -8,6 +8,7 @@ from collections.abc import Iterable, MutableSequence, Sequence from functools import partial +from typing import Optional import torch from torch.distributed import (P2POp, ProcessGroup, all_gather, @@ -127,6 +128,8 @@ def shuffle_layer( dst_global = local2global(dst) if is_received_locally[dst]: continue + if old_indices[src_global] == -1 or new_indices[dst_global] == -1: + continue if old_indices[src_global] == new_indices[dst_global]: is_received_locally[dst] = True for weight, buffer in zip(expert_weights, @@ -139,6 +142,8 @@ def shuffle_layer( experts_send_loc: dict[int, int] = {} for src in range(num_local_experts): expert = old_indices[local2global(src)] + if expert == -1: + continue if expert in experts_send_loc: continue experts_send_loc[expert] = src @@ -181,6 +186,8 @@ def shuffle_layer( if is_received_locally[dst]: continue expert = new_indices[local2global(dst)] + if expert == -1: + continue if expert in experts_recv_loc: continue experts_recv_loc[expert] = dst @@ -227,6 +234,8 @@ def shuffle_layer( weight[dst].copy_(buffer[dst]) else: expert = new_indices[local2global(dst)] + if expert == -1: + continue src = experts_recv_loc[expert] for weight, buffer in zip(expert_weights, expert_weights_buffer): weight[dst].copy_(buffer[src]) @@ -238,6 +247,7 @@ def rearrange_expert_weights_inplace( expert_weights: Sequence[Iterable[torch.Tensor]], ep_group: ProcessGroup, is_profile: bool = False, + rank_mapping: Optional[dict[int, int]] = None, ) -> None: """ Rearranges the expert weights in place according to the new expert indices. @@ -256,7 +266,28 @@ def rearrange_expert_weights_inplace( is_profile (bool): If `True`, do not perform any actual weight copy. This is used during profile run, where we only perform dummy communications to reserve enough memory for the buffers. + rank_mapping: A dictionary mapping old rank to new rank. """ + if rank_mapping is not None: + if len(rank_mapping) == ep_group.size(): + # scale down + new_global_expert_indices = \ + _map_new_expert_indices_with_rank_mapping( + new_global_expert_indices, + rank_mapping, + ) + else: + # scale up + old_global_expert_indices = \ + _map_old_expert_indices_with_rank_mapping( + old_global_expert_indices, + rank_mapping, + ep_group.size(), + ) + + assert old_global_expert_indices.shape[ + 1] == new_global_expert_indices.shape[1] + num_moe_layers, num_physical_experts = old_global_expert_indices.shape assert len(expert_weights) == num_moe_layers @@ -304,4 +335,90 @@ def rearrange_expert_weights_inplace( ) +def _map_old_expert_indices_with_rank_mapping( + old_global_expert_indices: torch.Tensor, + rank_mapping: dict[int, int], + new_ep_size: int, +) -> torch.Tensor: + """ + Map the old global expert indices to the new global expert indices. + + Args: + old_global_expert_indices: + Shape (num_layers, old_ep_size * num_local_physical_experts). + rank_mapping: Mapping from old rank to new rank. + new_ep_size: New expert parallelism size. + + Returns: + Mapped expert indices with shape + (num_layers, new_ep_size * num_local_physical_experts). 
+ """ + num_layers, old_num_physical_experts = old_global_expert_indices.shape + assert rank_mapping, "Rank mapping is required" + + # Get sizes from parameters and rank_mapping + old_ep_size = len(rank_mapping) + num_local_physical_experts = old_num_physical_experts // old_ep_size + new_num_physical_experts = new_ep_size * num_local_physical_experts + + # Create mapped tensor with new shape, initialized to -1 + mapped_expert_indices = torch.full( + (num_layers, new_num_physical_experts), + fill_value=-1, + dtype=old_global_expert_indices.dtype, + device=old_global_expert_indices.device, + ) + + # Handle rank mapping (scale up/down with rank changes) + for old_rank in range(old_ep_size): + new_rank = rank_mapping.get(old_rank) + if new_rank is not None and new_rank >= 0 and new_rank < new_ep_size: + # This old rank exists in the new configuration + old_start_idx = old_rank * num_local_physical_experts + old_end_idx = (old_rank + 1) * num_local_physical_experts + new_start_idx = new_rank * num_local_physical_experts + new_end_idx = (new_rank + 1) * num_local_physical_experts + + mapped_expert_indices[:, new_start_idx:new_end_idx] = \ + old_global_expert_indices[:, old_start_idx:old_end_idx] + # If new_rank is None or >= new_ep_size, the experts remain -1 + # (scale down case) + + return mapped_expert_indices + + +def _map_new_expert_indices_with_rank_mapping( + new_global_expert_indices: torch.Tensor, + rank_mapping: dict[int, int], +) -> torch.Tensor: + num_layers, new_num_physical_experts = new_global_expert_indices.shape + assert rank_mapping, "Rank mapping is required" + + # Get sizes from parameters and rank_mapping + old_ep_size = len(rank_mapping) + new_ep_size = sum(new_rank != -1 for new_rank in rank_mapping.values()) + num_local_physical_experts = new_num_physical_experts // new_ep_size + old_num_physical_experts = old_ep_size * num_local_physical_experts + + mapped_expert_indices = torch.full( + (num_layers, old_num_physical_experts), + fill_value=-1, + dtype=new_global_expert_indices.dtype, + device=new_global_expert_indices.device, + ) + + for old_rank in range(old_ep_size): + new_rank = rank_mapping[old_rank] + if new_rank >= 0 and new_rank < new_ep_size: + old_start_idx = old_rank * num_local_physical_experts + old_end_idx = (old_rank + 1) * num_local_physical_experts + new_start_idx = new_rank * num_local_physical_experts + new_end_idx = (new_rank + 1) * num_local_physical_experts + + mapped_expert_indices[:, old_start_idx:old_end_idx] = \ + new_global_expert_indices[:, new_start_idx:new_end_idx] + + return mapped_expert_indices + + __all__ = ["rearrange_expert_weights_inplace"] diff --git a/vllm/engine/protocol.py b/vllm/engine/protocol.py index 8688fcc82cd..f5cc9c47405 100644 --- a/vllm/engine/protocol.py +++ b/vllm/engine/protocol.py @@ -324,3 +324,9 @@ async def is_sleeping(self) -> bool: async def add_lora(self, lora_request: LoRARequest) -> None: """Load a new LoRA adapter into the engine for future requests.""" ... 
+ + async def scale_elastic_ep(self, + new_data_parallel_size: int, + drain_timeout: int = 300) -> None: + """Scale the engine""" + raise NotImplementedError diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index c2185acbf0c..3f0c1c85dee 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1018,6 +1018,73 @@ async def is_sleeping(raw_request: Request): return JSONResponse(content={"is_sleeping": is_sleeping}) +@router.post("/scale_elastic_ep", + dependencies=[Depends(validate_json_request)], + responses={ + HTTPStatus.OK.value: { + "model": dict + }, + HTTPStatus.BAD_REQUEST.value: { + "model": ErrorResponse + }, + HTTPStatus.REQUEST_TIMEOUT.value: { + "model": ErrorResponse + }, + HTTPStatus.INTERNAL_SERVER_ERROR.value: { + "model": ErrorResponse + }, + }) +async def scale_elastic_ep(raw_request: Request): + try: + body = await raw_request.json() + except json.JSONDecodeError as e: + raise HTTPException(status_code=400, + detail="Invalid JSON format") from e # noqa: B904 + + new_data_parallel_size = body.get("new_data_parallel_size") + drain_timeout = body.get("drain_timeout", 120) # Default 2 minutes + + if new_data_parallel_size is None: + raise HTTPException(status_code=400, + detail="new_data_parallel_size is required") + + if not isinstance(new_data_parallel_size, + int) or new_data_parallel_size <= 0: + raise HTTPException( + status_code=400, + detail="new_data_parallel_size must be a positive integer") + + if not isinstance(drain_timeout, int) or drain_timeout <= 0: + raise HTTPException(status_code=400, + detail="drain_timeout must be a positive integer") + + # Set scaling flag to prevent new requests + global _scaling_elastic_ep + _scaling_elastic_ep = True + client = engine_client(raw_request) + try: + await client.scale_elastic_ep(new_data_parallel_size, drain_timeout) + return JSONResponse({ + "message": + f"Scaled to {new_data_parallel_size} " + "data parallel engines", + }) + except TimeoutError as e: + raise HTTPException(status_code=408, + detail="Scale failed due to request drain timeout " + f"after {drain_timeout} seconds") from e + except Exception as e: + logger.error("Scale failed: %s", e) + raise HTTPException(status_code=500, detail="Scale failed") from e + finally: + _scaling_elastic_ep = False + + +@router.post("/is_scaling_elastic_ep") +async def is_scaling_elastic_ep(raw_request: Request): + return JSONResponse({"is_scaling_elastic_ep": _scaling_elastic_ep}) + + # TODO: RequestType = TypeForm[BaseModel] when recognized by type checkers # (requires typing_extensions >= 4.13) RequestType = Any @@ -1216,6 +1283,41 @@ async def send_with_request_id(message: Message) -> None: return self.app(scope, receive, send_with_request_id) +# Global variable to track scaling state +_scaling_elastic_ep = False + + +class ScalingMiddleware: + """ + Middleware that checks if the model is currently scaling and + returns a 503 Service Unavailable response if it is. + + This middleware applies to all HTTP requests and prevents + processing when the model is in a scaling state. 
+ """ + + def __init__(self, app: ASGIApp) -> None: + self.app = app + + def __call__(self, scope: Scope, receive: Receive, + send: Send) -> Awaitable[None]: + if scope["type"] != "http": + return self.app(scope, receive, send) + + # Check global scaling state + global _scaling_elastic_ep + if _scaling_elastic_ep: + # Return 503 Service Unavailable response + response = JSONResponse(content={ + "error": + "The model is currently scaling. Please try again later." + }, + status_code=503) + return response(scope, receive, send) + + return self.app(scope, receive, send) + + def _extract_content_from_chunk(chunk_data: dict) -> str: """Extract content from a streaming response chunk.""" try: @@ -1404,6 +1506,9 @@ async def validation_exception_handler(_: Request, if args.enable_request_id_headers: app.add_middleware(XRequestIdMiddleware) + # Add scaling middleware to check for scaling state + app.add_middleware(ScalingMiddleware) + if envs.VLLM_DEBUG_LOG_API_SERVER_RESPONSE: logger.warning("CAUTION: Enabling log response in the API Server. " "This can include sensitive information and should be " diff --git a/vllm/executor/uniproc_executor.py b/vllm/executor/uniproc_executor.py index 7ebeb4a2255..aabc9ed9b80 100644 --- a/vllm/executor/uniproc_executor.py +++ b/vllm/executor/uniproc_executor.py @@ -12,6 +12,7 @@ from vllm.logger import init_logger from vllm.utils import (get_distributed_init_method, get_ip, get_open_port, run_method) +from vllm.v1.engine import ReconfigureDistributedRequest, ReconfigureRankType from vllm.worker.worker_base import WorkerWrapperBase logger = init_logger(__name__) @@ -62,6 +63,14 @@ def check_health(self) -> None: # it's running. return + def reinitialize_distributed( + self, reconfig_request: ReconfigureDistributedRequest) -> None: + self.driver_worker.reinitialize_distributed(reconfig_request) + if reconfig_request.new_data_parallel_rank == \ + ReconfigureRankType.SHUTDOWN_CURRENT_RANK: + self.shutdown() + return + UniProcExecutorAsync = UniProcExecutor diff --git a/vllm/model_executor/layers/fused_moe/layer.py b/vllm/model_executor/layers/fused_moe/layer.py index 4b8a37fcc73..4a6a3b95ec7 100644 --- a/vllm/model_executor/layers/fused_moe/layer.py +++ b/vllm/model_executor/layers/fused_moe/layer.py @@ -265,9 +265,6 @@ def select_gemm_impl( prepare_finalize: FusedMoEPrepareAndFinalize, moe: FusedMoEConfig, ) -> FusedMoEPermuteExpertsUnpermute: - - assert self.fused_experts == fused_experts - if (prepare_finalize.activation_format == FusedMoEActivationFormat.BatchedExperts): logger.debug("BatchedTritonExperts %s", self.moe) @@ -375,8 +372,10 @@ def apply( logical_replica_count: Optional[torch.Tensor] = None, ) -> torch.Tensor: if enable_eplb: - raise NotImplementedError( - "EPLB not supported for `UnquantizedFusedMoEMethod` yet.") + assert expert_load_view is not None + assert logical_to_physical_map is not None + assert logical_replica_count is not None + assert isinstance(layer, FusedMoE) return self.forward( x=x, @@ -393,7 +392,12 @@ def apply( scoring_func=scoring_func, e_score_correction_bias=e_score_correction_bias, activation=activation, - apply_router_weight_on_input=apply_router_weight_on_input) + apply_router_weight_on_input=apply_router_weight_on_input, + enable_eplb=enable_eplb, + expert_load_view=expert_load_view, + logical_to_physical_map=logical_to_physical_map, + logical_replica_count=logical_replica_count, + ) def forward_cuda( self, @@ -412,6 +416,10 @@ def forward_cuda( e_score_correction_bias: Optional[torch.Tensor] = None, 
apply_router_weight_on_input: bool = False, activation: str = "silu", + enable_eplb: bool = False, + expert_load_view: Optional[torch.Tensor] = None, + logical_to_physical_map: Optional[torch.Tensor] = None, + logical_replica_count: Optional[torch.Tensor] = None, ) -> torch.Tensor: topk_weights, topk_ids = FusedMoE.select_experts( @@ -425,7 +433,12 @@ def forward_cuda( custom_routing_function=custom_routing_function, scoring_func=scoring_func, e_score_correction_bias=e_score_correction_bias, - indices_type=self.topk_indices_dtype) + indices_type=self.topk_indices_dtype, + enable_eplb=enable_eplb, + expert_map=expert_map, + expert_load_view=expert_load_view, + logical_to_physical_map=logical_to_physical_map, + logical_replica_count=logical_replica_count) if self.rocm_aiter_moe_enabled: return self.rocm_aiter_fused_experts( @@ -730,7 +743,8 @@ def __init__( if self.enable_eplb: from vllm.model_executor.layers.quantization.fp8 import ( Fp8MoEMethod) - if not isinstance(quant_method, Fp8MoEMethod): + if not isinstance(quant_method, + (Fp8MoEMethod, UnquantizedFusedMoEMethod)): # TODO: Add support for additional quantization methods. # The implementation for other quantization methods does not # contain essential differences, but the current quant API @@ -821,6 +835,15 @@ def use_deepep_ll_kernels(self): def use_flashinfer_cutlass_kernels(self): return self.moe_parallel_config.use_flashinfer_cutlass_kernels + def update_expert_map(self): + # ep_size and ep_rank should already be updated + assert self.expert_map is not None + with self.expert_map.device: + self.local_num_experts, self.expert_map = determine_expert_map( + ep_size=self.ep_size, + ep_rank=self.ep_rank, + global_num_experts=self.global_num_experts) + def _load_per_tensor_weight_scale(self, shard_id: str, param: torch.nn.Parameter, loaded_weight: torch.Tensor, diff --git a/vllm/model_executor/models/deepseek_v2.py b/vllm/model_executor/models/deepseek_v2.py index 8d36dda65b5..5106b9914b5 100644 --- a/vllm/model_executor/models/deepseek_v2.py +++ b/vllm/model_executor/models/deepseek_v2.py @@ -776,6 +776,24 @@ def set_eplb_state( logical_replica_count=logical_replica_count, ) + def update_physical_experts_metadata( + self, + num_physical_experts: int, + num_local_physical_experts: int, + ) -> None: + assert self.num_local_physical_experts == num_local_physical_experts + self.num_physical_experts = num_physical_experts + self.num_local_physical_experts = num_local_physical_experts + self.num_redundant_experts = (num_physical_experts - + self.num_logical_experts) + for layer in self.model.layers: + if isinstance(layer.mlp, DeepseekV2MoE): + moe = layer.mlp + moe.n_local_physical_experts = num_local_physical_experts + moe.n_physical_experts = num_physical_experts + moe.n_redundant_experts = self.num_redundant_experts + moe.experts.update_expert_map() + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: return self.model.get_input_embeddings(input_ids) @@ -931,9 +949,8 @@ class DeepseekV3ForCausalLM(DeepseekV2ForCausalLM): def get_spec_layer_idx_from_weight_name(config: PretrainedConfig, weight_name: str) -> Optional[int]: - if hasattr(config, - "num_nextn_predict_layers") and (config.num_nextn_predict_layers - > 0): + if (hasattr(config, "num_nextn_predict_layers") + and config.num_nextn_predict_layers > 0): layer_idx = config.num_hidden_layers for i in range(config.num_nextn_predict_layers): if weight_name.startswith(f"model.layers.{layer_idx+i}."): diff --git a/vllm/model_executor/models/interfaces.py 
b/vllm/model_executor/models/interfaces.py index b60f1a5b6ff..7f3efde4347 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -543,6 +543,13 @@ def set_eplb_state( """ ... + def update_physical_experts_metadata( + self, + num_physical_experts: int, + num_local_physical_experts: int, + ) -> None: + ... + def is_mixture_of_experts(model: object) -> TypeIs[MixtureOfExperts]: return isinstance(model, MixtureOfExperts) diff --git a/vllm/v1/engine/__init__.py b/vllm/v1/engine/__init__.py index 921ccd708cd..79dc80d8fc5 100644 --- a/vllm/v1/engine/__init__.py +++ b/vllm/v1/engine/__init__.py @@ -177,3 +177,19 @@ class EngineCoreRequestType(enum.Enum): UTILITY = b'\x03' # Sentinel used within EngineCoreProc. EXECUTOR_FAILED = b'\x04' + + +class ReconfigureDistributedRequest(msgspec.Struct): + new_data_parallel_size: int + new_data_parallel_rank: int + new_data_parallel_rank_local: int + new_data_parallel_master_ip: str + new_data_parallel_master_port: int + + +class ReconfigureRankType(enum.IntEnum): + """ + Rank type for reconfiguring distributed request. + """ + KEEP_CURRENT_RANK = -1 + SHUTDOWN_CURRENT_RANK = -2 diff --git a/vllm/v1/engine/async_llm.py b/vllm/v1/engine/async_llm.py index 3754570dfaa..6395d2c1875 100644 --- a/vllm/v1/engine/async_llm.py +++ b/vllm/v1/engine/async_llm.py @@ -1,6 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import asyncio +import time from collections.abc import AsyncGenerator, Mapping from copy import copy from typing import Any, Optional, Union @@ -608,6 +609,63 @@ async def collective_rpc(self, return await self.engine_core.collective_rpc_async( method, timeout, args, kwargs) + async def wait_for_requests_to_drain(self, drain_timeout: int = 300): + """Wait for all requests to be drained.""" + start_time = time.time() + while time.time() - start_time < drain_timeout: + if not self.engine_core.dp_engines_running(): + logger.info("Engines are idle, requests have been drained") + return + + logger.info( + "Engines are still running, waiting for requests to drain...") + await asyncio.sleep(1) # Wait 1 second before checking again + + raise TimeoutError(f"Timeout reached after {drain_timeout} seconds " + "waiting for requests to drain.") + + async def scale_elastic_ep(self, + new_data_parallel_size: int, + drain_timeout: int = 300): + """ + Scale up or down the data parallel size by adding or removing + engine cores. 
+        Args:
+            new_data_parallel_size: The new number of data parallel workers
+            drain_timeout:
+                Maximum time to wait for requests to drain (seconds)
+        """
+        old_data_parallel_size = \
+            self.vllm_config.parallel_config.data_parallel_size
+        if old_data_parallel_size == new_data_parallel_size:
+            logger.info("Data parallel size is already %s, skipping scale",
+                        new_data_parallel_size)
+            return
+        logger.info(
+            "Waiting for requests to drain before "
+            "scaling to %s engines...", new_data_parallel_size)
+        await self.wait_for_requests_to_drain(drain_timeout)
+        logger.info(
+            "Requests have been drained, proceeding with scale "
+            "to %s engines", new_data_parallel_size)
+        await self.engine_core.scale_elastic_ep(new_data_parallel_size)
+        self.vllm_config.parallel_config.data_parallel_size = \
+            new_data_parallel_size
+
+        # recreate stat loggers
+        if new_data_parallel_size > old_data_parallel_size:
+            stat_loggers: list[list[StatLoggerBase]] = setup_default_loggers(
+                vllm_config=self.vllm_config,
+                log_stats=self.log_stats,
+                engine_num=new_data_parallel_size,
+                custom_stat_loggers=None,
+            )
+            num_new_engines = len(stat_loggers) - len(self.stat_loggers)
+            self.stat_loggers.extend(stat_loggers[-num_new_engines:])
+        else:
+            for _ in range(old_data_parallel_size - new_data_parallel_size):
+                self.stat_loggers.pop()
+
     @property
     def is_running(self) -> bool:
         # Is None before the loop is started.
diff --git a/vllm/v1/engine/coordinator.py b/vllm/v1/engine/coordinator.py
index b3e7a2e85b8..005e71647aa 100644
--- a/vllm/v1/engine/coordinator.py
+++ b/vllm/v1/engine/coordinator.py
@@ -200,11 +200,41 @@ def process_input_socket(self, front_publish_address: str,
                     # Ignore subscription messages.
                     continue
 
+                decoded = msgspec.msgpack.decode(buffer)
+                if isinstance(decoded, (list, tuple)) and len(
+                        decoded) == 2 and decoded[0] == "SCALE_ELASTIC_EP":
+                    # Handle scale up/down notification
+                    new_engine_count = decoded[1]
+                    current_count = len(self.engines)
+                    if new_engine_count > current_count:
+                        for _ in range(new_engine_count - current_count):
+                            self.engines.append(EngineState())
+                        # NOTE(yongji): handle the case where newly started
+                        # engines have current_wave = 0 while the existing
+                        # engines just finished a wave and engines_running
+                        # isn't updated yet at the CoordinatorProc. Requests
+                        # routed to the newly started engines may not wake
+                        # up the existing engines as long as
+                        # 0 < request.wave < existing engines' current_wave;
+                        # note that 0 is the wave number reported by the
+                        # new engines, so we mark the engines as not
+                        # running here.
+                        self.engines_running = False
+                        logger.info(
+                            "DPCoordinator scaled up from %s to %s "
+                            "engines", current_count, new_engine_count)
+                    else:
+                        self.engines = self.engines[:new_engine_count]
+                        logger.info(
+                            "DPCoordinator scaled down from %s to %s "
+                            "engines", current_count, new_engine_count)
+                    continue  # Skip normal engine notification processing
+
                 # We received a message on the front-end XPUB socket,
                 # from an API server sending a new request while the
                 # engines are paused, so that we can wake the other
                 # engines. 
- engine_to_exclude, wave = msgspec.msgpack.decode(buffer) + engine_to_exclude, wave = decoded if not self.engines_running: if wave < self.current_wave: # If the wave number is stale, ensure the message diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index b3210197750..ca636bf5a6f 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -32,7 +32,9 @@ from vllm.v1.core.sched.output import SchedulerOutput from vllm.v1.core.sched.scheduler import Scheduler as V1Scheduler from vllm.v1.engine import (EngineCoreOutputs, EngineCoreRequest, - EngineCoreRequestType, UtilityOutput) + EngineCoreRequestType, + ReconfigureDistributedRequest, ReconfigureRankType, + UtilityOutput) from vllm.v1.engine.mm_input_cache import MirroredProcessingCache from vllm.v1.engine.utils import EngineHandshakeMetadata, EngineZmqAddresses from vllm.v1.executor.abstract import Executor @@ -77,6 +79,8 @@ def __init__(self, self.model_executor.register_failure_callback( executor_fail_callback) + self.available_gpu_memory_for_kv_cache = -1 + # Setup KV Caches and update CacheConfig after profiling. num_gpu_blocks, num_cpu_blocks, kv_cache_config = \ self._initialize_kv_caches(vllm_config) @@ -137,12 +141,23 @@ def _initialize_kv_caches( # Get all kv cache needed by the model kv_cache_specs = self.model_executor.get_kv_cache_specs() - # Profiles the peak memory usage of the model to determine how much - # memory can be allocated for kv cache. has_kv_cache = any(kv_cache_spec for kv_cache_spec in kv_cache_specs) if has_kv_cache: - available_gpu_memory = \ - self.model_executor.determine_available_memory() + if os.environ.get("VLLM_ELASTIC_EP_SCALE_UP_LAUNCH") == "1": + dp_group = getattr(self, "dp_group", None) + assert dp_group is not None + self.available_gpu_memory_for_kv_cache = \ + ParallelConfig.sync_kv_cache_memory_size(dp_group, -1) + available_gpu_memory = [ + self.available_gpu_memory_for_kv_cache + ] * len(kv_cache_specs) + else: + # Profiles the peak memory usage of the model to determine how + # much memory can be allocated for kv cache. 
+ available_gpu_memory = ( + self.model_executor.determine_available_memory()) + self.available_gpu_memory_for_kv_cache = \ + available_gpu_memory[0] else: # Attention free models don't need memory for kv cache available_gpu_memory = [0] * len(kv_cache_specs) @@ -989,6 +1004,50 @@ def _has_global_unfinished_reqs(self, local_unfinished: bool) -> bool: return ParallelConfig.has_unfinished_dp(self.dp_group, local_unfinished) + def reinitialize_distributed( + self, reconfig_request: ReconfigureDistributedRequest) -> None: + stateless_destroy_torch_distributed_process_group(self.dp_group) + self.shutdown() + + parallel_config = self.vllm_config.parallel_config + old_dp_size = parallel_config.data_parallel_size + parallel_config.data_parallel_size = \ + reconfig_request.new_data_parallel_size + if reconfig_request.new_data_parallel_rank != -1: + parallel_config.data_parallel_rank = \ + reconfig_request.new_data_parallel_rank + # local rank specifies device visibility, it should not be changed + assert reconfig_request.new_data_parallel_rank_local == \ + ReconfigureRankType.KEEP_CURRENT_RANK + parallel_config.data_parallel_master_ip = \ + reconfig_request.new_data_parallel_master_ip + parallel_config.data_parallel_master_port = \ + reconfig_request.new_data_parallel_master_port + if reconfig_request.new_data_parallel_rank != -2: + self.dp_rank = parallel_config.data_parallel_rank + self.dp_group = parallel_config.stateless_init_dp_group() + reconfig_request.new_data_parallel_master_port = \ + parallel_config.data_parallel_master_port + + self.model_executor.reinitialize_distributed(reconfig_request) + if reconfig_request.new_data_parallel_size > old_dp_size: + assert self.available_gpu_memory_for_kv_cache > 0 + # pass available_gpu_memory_for_kv_cache from existing + # engine-cores to new engine-cores so they can directly + # use it in _initialize_kv_caches() rather than profiling. 
+ ParallelConfig.sync_kv_cache_memory_size( + self.dp_group, self.available_gpu_memory_for_kv_cache) + # NOTE(yongji): newly joined workers require dummy_run even + # CUDA graph is not used + self.model_executor.collective_rpc("compile_or_warm_up_model") + if reconfig_request.new_data_parallel_rank == \ + ReconfigureRankType.SHUTDOWN_CURRENT_RANK: + self.shutdown() + logger.info("DPEngineCoreProc %s shutdown", self.dp_rank) + else: + logger.info("Distributed environment reinitialized for DP rank %s", + self.dp_rank) + class DPEngineCoreActor(DPEngineCoreProc): """ diff --git a/vllm/v1/engine/core_client.py b/vllm/v1/engine/core_client.py index dafaa15f777..82fc1fa9937 100644 --- a/vllm/v1/engine/core_client.py +++ b/vllm/v1/engine/core_client.py @@ -21,9 +21,11 @@ from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.lora.request import LoRARequest -from vllm.utils import get_open_zmq_inproc_path, make_zmq_socket +from vllm.utils import get_open_port, get_open_zmq_inproc_path, make_zmq_socket from vllm.v1.engine import (EngineCoreOutputs, EngineCoreRequest, - EngineCoreRequestType, UtilityOutput) + EngineCoreRequestType, + ReconfigureDistributedRequest, ReconfigureRankType, + UtilityOutput) from vllm.v1.engine.coordinator import DPCoordinator from vllm.v1.engine.core import EngineCore, EngineCoreProc from vllm.v1.engine.exceptions import EngineDeadError @@ -162,6 +164,9 @@ def dp_engines_running(self) -> bool: running state.""" raise NotImplementedError + async def scale_elastic_ep(self, new_data_parallel_size: int) -> None: + raise NotImplementedError + async def get_output_async(self) -> EngineCoreOutputs: raise NotImplementedError @@ -910,14 +915,30 @@ async def run_engine_stats_update_task(): events = await poller.poll() if not self.engines_running and len(events) == 2 or ( events[0][0] == first_req_rcv_socket): - # Send a message to notify the coordinator that + # Check if this is a regular request notification or + # scale up notification + buf = first_req_rcv_socket.recv( + flags=zmq.NOBLOCK).result() + + decoded = msgspec.msgpack.decode(buf) + if isinstance( + decoded, + (list, tuple)) and len(decoded) == 2 and decoded[ + 0] == "SCALE_ELASTIC_EP": + # Extract new engine count from the decoded message + new_engine_count = decoded[1] + # Send scale up notification to coordinator + scale_msg = msgspec.msgpack.encode( + ("SCALE_ELASTIC_EP", new_engine_count)) + await socket.send(scale_msg) + continue + # we're sending a request while the engines are # paused, so that it can wake the others up # (to run dummy EP loop). 
+ assert decoded[0] == "FIRST_REQ" + target_eng_index = decoded[1] self.engines_running = True - buf = first_req_rcv_socket.recv( - flags=zmq.NOBLOCK).result() - target_eng_index = int.from_bytes(buf, "little") msg = msgspec.msgpack.encode( (target_eng_index, self.current_wave)) await socket.send(msg) @@ -953,7 +974,8 @@ async def add_request_async(self, request: EngineCoreRequest) -> None: chosen_engine) if not self.engines_running: # Notify coordinator that we're sending a request - await self.first_req_send_socket.send(chosen_engine) + req_msg = msgspec.msgpack.encode(("FIRST_REQ", chosen_engine)) + await self.first_req_send_socket.send(req_msg) await to_await @@ -1047,3 +1069,156 @@ async def _abort_requests(self, request_ids: list[str], engine: EngineIdentity) -> None: await self._send_input(EngineCoreRequestType.ABORT, request_ids, engine) + + async def _send_reconfig_message( + self, reconfig_request: ReconfigureDistributedRequest, + engine: EngineIdentity) -> asyncio.Future: + """Send reconfiguration message and return the result future without + waiting for completion.""" + call_id = uuid.uuid1().int >> 64 + future = asyncio.get_running_loop().create_future() + self.utility_results[call_id] = future + message = (EngineCoreRequestType.UTILITY.value, *self.encoder.encode( + (self.client_index, call_id, "reinitialize_distributed", + (reconfig_request, )))) + await self._send_input_message(message, engine, reconfig_request) + self._ensure_output_queue_task() + return future + + async def scale_elastic_ep(self, new_data_parallel_size: int) -> None: + """Scale elastic EP data parallel size""" + cur_data_parallel_size = len(self.core_engines) + + assert new_data_parallel_size != cur_data_parallel_size, ( + f"new_data_parallel_size {new_data_parallel_size} must be " + f"different from cur_data_parallel_size {cur_data_parallel_size}") + + assert self.vllm_config.parallel_config.data_parallel_backend == \ + "ray", ("Only ray DP backend supports scaling elastic EP") + + scale_up = new_data_parallel_size > cur_data_parallel_size + + if scale_up: + await self._scale_up_elastic_ep(cur_data_parallel_size, + new_data_parallel_size) + else: + await self._scale_down_elastic_ep(cur_data_parallel_size, + new_data_parallel_size) + + async def _scale_up_elastic_ep(self, cur_data_parallel_size: int, + new_data_parallel_size: int) -> None: + """Scale up the data parallel size by creating new engine cores + and reconfiguring existing ones.""" + cur_data_parallel_size = len(self.core_engines) + + # Phase 1: Send reconfigure messages to all existing engines and wait + # for them to be sent + reconfig_futures = [] + self.vllm_config.parallel_config.data_parallel_master_port = \ + get_open_port() + for engine in self.core_engines: + reconfig_request = ReconfigureDistributedRequest( + new_data_parallel_size=new_data_parallel_size, + new_data_parallel_rank=ReconfigureRankType.KEEP_CURRENT_RANK, + new_data_parallel_rank_local=\ + ReconfigureRankType.KEEP_CURRENT_RANK, + new_data_parallel_master_ip=self.vllm_config.parallel_config. + data_parallel_master_ip, + new_data_parallel_master_port=self.vllm_config.parallel_config. 
+ data_parallel_master_port) + future = await self._send_reconfig_message(reconfig_request, + engine) + reconfig_futures.append(future) + + logger.info("All reconfigure messages sent, starting engine creation") + + # Phase 2: Create new engines now that reconfig messages have been sent + # self.resources.engine_manager is guaranteed to be + # CoreEngineActorManager for RayDPClient + assert isinstance(self.resources.engine_manager, + CoreEngineActorManager) + self.resources.engine_manager.scale_up_elastic_ep( + self.vllm_config, new_data_parallel_size) + + # Create new CoreEngine objects for the new engines + new_engine_identities = set() + for i in range(cur_data_parallel_size, new_data_parallel_size): + new_engine = i.to_bytes(2, "little") + self.core_engines.append(new_engine) + new_engine_identities.add(new_engine) + + # Wait for ready messages from new engines on the input socket + sync_input_socket = zmq.Socket.shadow(self.input_socket) + while new_engine_identities: + if not sync_input_socket.poll(timeout=600_000): + raise TimeoutError( + "Timed out waiting for new engines to send initial " + "message on input socket.") + identity, _ = sync_input_socket.recv_multipart() + new_engine_identities.discard(identity) + + # Phase 3: Wait for all existing engines to complete reconfiguration + logger.info("Waiting for existing engines to complete reconfiguration") + await asyncio.gather(*reconfig_futures) + + # Notify coordinator about scale up through existing + # stats_update_task connection + self._ensure_stats_update_task() + scale_up_marker = msgspec.msgpack.encode( + ("SCALE_ELASTIC_EP", new_data_parallel_size)) + await self.first_req_send_socket.send(scale_up_marker) + + # Update the parallel config + self.vllm_config.parallel_config.data_parallel_size = \ + new_data_parallel_size + logger.info( + "[Elastic EP] Scale up completed, new data parallel size: %s", + new_data_parallel_size) + + async def _scale_down_elastic_ep(self, cur_data_parallel_size: int, + new_data_parallel_size: int) -> None: + """Scale down the data parallel size by shutting down and + reconfiguring existing engine cores.""" + cur_data_parallel_size = len(self.core_engines) + + self.vllm_config.parallel_config.data_parallel_master_port = \ + get_open_port() + + reconfig_futures = [] + for cur_dp_rank, engine in enumerate(self.core_engines): + reconfig_request = ReconfigureDistributedRequest( + new_data_parallel_size=new_data_parallel_size, + new_data_parallel_rank=ReconfigureRankType.KEEP_CURRENT_RANK, + new_data_parallel_rank_local=\ + ReconfigureRankType.KEEP_CURRENT_RANK, + new_data_parallel_master_ip=self.vllm_config.parallel_config. + data_parallel_master_ip, + new_data_parallel_master_port=self.vllm_config.parallel_config. 
+ data_parallel_master_port) + if cur_dp_rank >= new_data_parallel_size: + reconfig_request.new_data_parallel_rank = \ + ReconfigureRankType.SHUTDOWN_CURRENT_RANK + future = await self._send_reconfig_message(reconfig_request, + engine) + reconfig_futures.append(future) + + for _ in range(new_data_parallel_size, cur_data_parallel_size): + self.core_engines.pop() + + await asyncio.gather(*reconfig_futures) + + assert isinstance(self.resources.engine_manager, + CoreEngineActorManager) + self.resources.engine_manager.scale_down_elastic_ep( + cur_data_parallel_size, new_data_parallel_size) + + self._ensure_stats_update_task() + scale_down_marker = msgspec.msgpack.encode( + ("SCALE_ELASTIC_EP", new_data_parallel_size)) + await self.first_req_send_socket.send(scale_down_marker) + + self.vllm_config.parallel_config.data_parallel_size = \ + new_data_parallel_size + logger.info( + "[Elastic EP] Scale down completed, new data parallel size: %s", + new_data_parallel_size) diff --git a/vllm/v1/engine/utils.py b/vllm/v1/engine/utils.py index ae104bd6eb9..6dde477576b 100644 --- a/vllm/v1/engine/utils.py +++ b/vllm/v1/engine/utils.py @@ -174,16 +174,21 @@ def __init__( self.local_engine_actors: list[ray.ActorHandle] = [] self.remote_engine_actors: list[ray.ActorHandle] = [] + + env_vars_list = get_env_vars_to_copy(destination="DPEngineCoreActor") + self.env_vars_dict = { + name: os.environ[name] + for name in env_vars_list if name in os.environ + } + runtime_env = RuntimeEnv(env_vars=self.env_vars_dict) + + self.addresses = addresses + self.executor_class = executor_class + self.log_stats = log_stats dp_size = vllm_config.parallel_config.data_parallel_size local_engine_count = \ vllm_config.parallel_config.data_parallel_size_local world_size = vllm_config.parallel_config.world_size - env_vars_set = get_env_vars_to_copy(destination="DPEngineCoreActor") - env_vars_dict = { - name: os.environ[name] - for name in env_vars_set if name in os.environ - } - runtime_env = RuntimeEnv(env_vars=env_vars_dict) if ray.is_initialized(): logger.info( @@ -208,6 +213,7 @@ def __init__( assert len(placement_groups) == dp_size, ( "Number of placement groups must match data parallel size") + self.placement_group_is_local = [] refs = [] for index in range(dp_size): local_index = local_dp_ranks[index] @@ -231,6 +237,7 @@ def __init__( self.local_engine_actors.append(actor) else: self.remote_engine_actors.append(actor) + self.placement_group_is_local.append(local_client) refs.append(actor.wait_for_init.remote()) ray.get(refs) @@ -242,6 +249,9 @@ def __init__( def create_dp_placement_groups( vllm_config: VllmConfig ) -> tuple[list["PlacementGroup"], list[int]]: + """ + Create placement groups for data parallel. 
+ """ import ray from ray._private.state import available_resources_per_node @@ -250,10 +260,11 @@ def create_dp_placement_groups( logger.info("Creating placement groups for data parallel") dp_master_ip = \ vllm_config.parallel_config.data_parallel_master_ip - dp_size = vllm_config.parallel_config.data_parallel_size + num_pg_to_create = vllm_config.parallel_config.data_parallel_size local_engine_count = \ vllm_config.parallel_config.data_parallel_size_local + nodes = list_nodes() nodes = sorted(list_nodes(), key=lambda node: node.node_ip != dp_master_ip) assert nodes[0].node_ip == dp_master_ip, ( @@ -293,7 +304,7 @@ def create_dp_placement_groups( local_dp_ranks.append(i) else: for i in range(available_engine_count): - if len(placement_groups) == dp_size: + if len(placement_groups) == num_pg_to_create: break bundles = [{"GPU": 1.0}] * world_size + [{"CPU": 1.0}] pg = ray.util.placement_group( @@ -305,6 +316,204 @@ def create_dp_placement_groups( local_dp_ranks.append(i) return placement_groups, local_dp_ranks + @staticmethod + def add_dp_placement_groups( + old_vllm_config: VllmConfig, new_data_parallel_size: int + ) -> tuple[list["PlacementGroup"], list[int]]: + """ + Add placement groups for new data parallel size. + """ + import ray + from ray._private.state import (available_resources_per_node, + total_resources_per_node) + from ray.util.state import list_nodes + + old_dp_size = old_vllm_config.parallel_config.data_parallel_size + num_pg_to_create = new_data_parallel_size - old_dp_size + + if num_pg_to_create <= 0: + return [], [] + + dp_master_ip = old_vllm_config.parallel_config.data_parallel_master_ip + world_size = old_vllm_config.parallel_config.world_size + + nodes = list_nodes() + nodes = sorted(nodes, key=lambda node: node.node_ip != dp_master_ip) + assert nodes[0].node_ip == dp_master_ip, ( + "The first node must be the head node") + assert len(nodes) == 1 or nodes[1].node_ip != dp_master_ip, ( + "There can only be one head node") + + available_resources = available_resources_per_node() + total_resources = total_resources_per_node() + + placement_groups = [] + local_dp_ranks = [] + num_pg_created = 0 + + for node in nodes: + if num_pg_created >= num_pg_to_create: + break + + node_ip = node.node_ip + node_id = node.node_id + available_gpus = int(available_resources[node_id]["GPU"]) + + # Get total GPUs on this node from the node's resources + # Ray stores node resources with node ID as key + total_gpus = int(total_resources[node_id]["GPU"]) + + # Calculate used GPUs and used engines on this node + used_gpus = max(0, total_gpus - available_gpus) + used_engines_on_node = used_gpus // world_size + + # Calculate how many new engines this node can accommodate + available_engine_count = available_gpus // world_size + + # Create placement groups for new engines on this node + for i in range(available_engine_count): + if num_pg_created >= num_pg_to_create: + break + + rank = old_dp_size + num_pg_created + + # Create bundles with node constraint for master node + if node_ip == dp_master_ip: + bundles = [{ + "GPU": 1.0, + "node:" + dp_master_ip: 0.001 + }] * world_size + [{ + "CPU": 1.0 + }] + else: + bundles = [{"GPU": 1.0}] * world_size + [{"CPU": 1.0}] + + pg = ray.util.placement_group( + name=f"dp_rank_{rank}", + strategy="STRICT_PACK", + bundles=bundles, + ) + placement_groups.append(pg) + + # Local rank starts from the number of engines already used + # on this node + local_rank = used_engines_on_node + i + local_dp_ranks.append(local_rank) + num_pg_created += 1 + + return 
placement_groups, local_dp_ranks + + def scale_up_elastic_ep(self, cur_vllm_config: VllmConfig, + new_data_parallel_size: int) -> None: + import copy + + import ray + from ray.runtime_env import RuntimeEnv + from ray.util.scheduling_strategies import ( + PlacementGroupSchedulingStrategy) + + from vllm.v1.engine.core import DPEngineCoreActor + + cur_data_parallel_size = len(self.local_engine_actors) + \ + len(self.remote_engine_actors) + + assert new_data_parallel_size > cur_data_parallel_size, ( + f"New data parallel size {new_data_parallel_size} must be greater " + f"than current data parallel size {cur_data_parallel_size} " + "for scale up") + + placement_groups, local_dp_ranks = \ + self.add_dp_placement_groups( + cur_vllm_config, new_data_parallel_size) + + world_size = cur_vllm_config.parallel_config.world_size + dp_master_ip = cur_vllm_config.parallel_config.data_parallel_master_ip + new_local_engines = 0 + + runtime_env = RuntimeEnv(env_vars=self.env_vars_dict + | {"VLLM_ELASTIC_EP_SCALE_UP_LAUNCH": "1"}) + for i, (pg, + local_rank) in enumerate(zip(placement_groups, + local_dp_ranks)): + rank = cur_data_parallel_size + i + dp_vllm_config = copy.deepcopy(cur_vllm_config) + dp_vllm_config.parallel_config.data_parallel_size = \ + new_data_parallel_size + dp_vllm_config.parallel_config.placement_group = pg + + # Check if this placement group is on the head node + local_client = any( + bundle.get("node:" + dp_master_ip, 0) > 0 + for bundle in pg.bundle_specs) + + if local_client: + new_local_engines += 1 + # Update data_parallel_size_local + dp_vllm_config.parallel_config.data_parallel_size_local = ( + cur_vllm_config.parallel_config.data_parallel_size_local + + new_local_engines) + + actor = ray.remote(DPEngineCoreActor).options( + scheduling_strategy=PlacementGroupSchedulingStrategy( + placement_group=pg, + placement_group_bundle_index=world_size, + ), + runtime_env=runtime_env).remote( + vllm_config=dp_vllm_config, + executor_class=self.executor_class, + log_stats=self.log_stats, + local_client=local_client, + addresses=self.addresses, + dp_rank=rank, + local_dp_rank=local_rank) + + if local_client: + self.local_engine_actors.append(actor) + else: + self.remote_engine_actors.append(actor) + self.created_placement_groups.append(pg) + self.placement_group_is_local.append(local_client) + + ray.get([ + actor.wait_for_init.remote() + for actor in (self.local_engine_actors[-new_local_engines:] + if new_local_engines > 0 else []) + + self.remote_engine_actors[-(len(placement_groups) - + new_local_engines):] + ]) + + actors = (self.local_engine_actors[-new_local_engines:] + if new_local_engines > 0 else []) + \ + self.remote_engine_actors[-(len(placement_groups) - + new_local_engines):] + + for actor in actors: + self.run_refs.append(actor.run.remote()) + + cur_vllm_config.parallel_config.data_parallel_size = \ + new_data_parallel_size + # Update old_vllm_config with new data_parallel_size_local if any new + # local engines were added + if new_local_engines > 0: + cur_vllm_config.parallel_config.data_parallel_size_local += \ + new_local_engines + + def scale_down_elastic_ep(self, cur_data_parallel_size: int, + new_data_parallel_size: int) -> None: + import ray + assert cur_data_parallel_size > new_data_parallel_size, ( + f"cur_data_parallel_size {cur_data_parallel_size} must be greater " + f"than new_data_parallel_size {new_data_parallel_size} " + "for scale down") + for _ in range(cur_data_parallel_size - new_data_parallel_size): + pg = self.created_placement_groups.pop() + is_local = 
self.placement_group_is_local.pop()
+            if is_local:
+                self.local_engine_actors.pop()
+            else:
+                self.remote_engine_actors.pop()
+            ray.util.remove_placement_group(pg)
+
     def get_run_refs(self):
         return self.run_refs
diff --git a/vllm/v1/executor/ray_distributed_executor.py b/vllm/v1/executor/ray_distributed_executor.py
index daca7c0faf6..eb659e4f9e4 100644
--- a/vllm/v1/executor/ray_distributed_executor.py
+++ b/vllm/v1/executor/ray_distributed_executor.py
@@ -6,6 +6,7 @@
 
 from vllm.executor.ray_distributed_executor import (  # noqa
     RayDistributedExecutor as RayDistributedExecutorV0)
+from vllm.v1.engine import ReconfigureDistributedRequest, ReconfigureRankType
 from vllm.v1.executor.abstract import Executor
 from vllm.v1.outputs import ModelRunnerOutput
 
@@ -62,3 +63,11 @@ def execute_model(
         # When PP is used, we return a FutureWrapper immediately so that
         # the scheduler can yield to the next batch.
         return FutureWrapper(refs[0])
+
+    def reinitialize_distributed(
+            self, reconfig_request: ReconfigureDistributedRequest) -> None:
+        self._run_workers("reinitialize_distributed", reconfig_request)
+        if reconfig_request.new_data_parallel_rank == \
+            ReconfigureRankType.SHUTDOWN_CURRENT_RANK:
+            self.shutdown()
+        return
diff --git a/vllm/v1/worker/cpu_model_runner.py b/vllm/v1/worker/cpu_model_runner.py
index c315dcb1832..136a9f08e82 100644
--- a/vllm/v1/worker/cpu_model_runner.py
+++ b/vllm/v1/worker/cpu_model_runner.py
@@ -49,7 +49,7 @@ def replace_tensor(obj: Any, cpu_attr_name: str,
                 if k.endswith("_cpu") and isinstance(v, torch.Tensor):
                     replace_tensor(self.input_batch.block_table, k, k[:-4])
 
-    def load_model(self) -> None:
+    def load_model(self, eep_scale_up: bool = False) -> None:
         logger.info("Starting to load model %s...", self.model_config.model)
         self.model = get_model(vllm_config=self.vllm_config)
 
diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py
index c3eeb6c2e39..06d0214c4d6 100644
--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -1745,8 +1745,40 @@ def update_config(self, overrides: dict[str, Any]) -> None:
             new_config = update_config(config, config_overrides)
             setattr(self, config_name, new_config)
 
-    def load_model(self) -> None:
+    def load_model(self, eep_scale_up: bool = False) -> None:
+        """
+        Args:
+            eep_scale_up: whether the model loading is for elastic EP scale-up. 
+ """ logger.info("Starting to load model %s...", self.model_config.model) + if eep_scale_up: + from vllm.distributed.parallel_state import get_ep_group + num_local_physical_experts = torch.empty(1, + dtype=torch.int32, + device="cpu") + torch.distributed.broadcast(num_local_physical_experts, + group=get_ep_group().cpu_group, + group_src=0) + num_local_physical_experts = int(num_local_physical_experts.item()) + new_ep_size = get_ep_group().world_size + global_expert_load, old_global_expert_indices = ( + EplbState.recv_state()) + num_logical_experts = global_expert_load.shape[1] + self.parallel_config.num_redundant_experts = ( + num_local_physical_experts * new_ep_size - num_logical_experts) + assert old_global_expert_indices.shape[ + 1] % num_local_physical_experts == 0 + old_ep_size = old_global_expert_indices.shape[ + 1] // num_local_physical_experts + rank_mapping = { + old_ep_rank: old_ep_rank + for old_ep_rank in range(old_ep_size) + } + else: + global_expert_load = None + old_global_expert_indices = None + rank_mapping = None + with DeviceMemoryProfiler() as m: # noqa: SIM117 time_before_load = time.perf_counter() model_loader = get_model_loader(self.load_config) @@ -1788,6 +1820,9 @@ def load_model(self) -> None: self.model, self.device, self.parallel_config, + global_expert_load, + old_global_expert_indices, + rank_mapping, ) def save_tensorized_model( diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py index 1610d0ecee2..2201481fa5b 100644 --- a/vllm/v1/worker/gpu_worker.py +++ b/vllm/v1/worker/gpu_worker.py @@ -26,6 +26,7 @@ from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors from vllm.utils import GiB_bytes, MemorySnapshot, memory_profiling +from vllm.v1.engine import ReconfigureDistributedRequest, ReconfigureRankType from vllm.v1.kv_cache_interface import KVCacheConfig, KVCacheSpec from vllm.v1.outputs import EMPTY_MODEL_RUNNER_OUTPUT, ModelRunnerOutput from vllm.v1.utils import report_usage_stats @@ -191,8 +192,9 @@ def load_model(self) -> None: else: from contextlib import nullcontext context = nullcontext() + eep_scale_up = os.environ.get("VLLM_ELASTIC_EP_SCALE_UP_LAUNCH") == "1" with context: - self.model_runner.load_model() + self.model_runner.load_model(eep_scale_up=eep_scale_up) def update_config(self, overrides: dict[str, Any]) -> None: self.model_runner.update_config(overrides) @@ -384,6 +386,161 @@ def check_health(self) -> None: # worker will always be healthy as long as it's running. 
return + def _eplb_before_scale_down(self, old_ep_size: int, + new_ep_size: int) -> None: + from vllm.distributed.parallel_state import get_ep_group + if get_ep_group().rank == 0: + logger.info("[Elastic EP] Starting expert resharding " + "before scaling down...") + rank_mapping = { + old_ep_rank: old_ep_rank if old_ep_rank < new_ep_size else -1 + for old_ep_rank in range(old_ep_size) + } + assert self.model_runner.eplb_state is not None + self.model_runner.eplb_state.rearrange(self.model_runner.model, + execute_shuffle=True, + global_expert_load=None, + rank_mapping=rank_mapping) + torch.cuda.synchronize() + if get_ep_group().rank == 0: + logger.info("[Elastic EP] Expert resharding completed!") + + def _eplb_after_scale_up( + self, old_ep_size: int, new_ep_size: int, + global_expert_load: Optional[torch.Tensor]) -> None: + from vllm.distributed.parallel_state import get_ep_group + if get_ep_group().rank == 0: + logger.info("[Elastic EP] Starting expert resharding " + "after scaling up...") + rank_mapping = { + old_ep_rank: old_ep_rank + for old_ep_rank in range(old_ep_size) + } + assert self.model_runner.eplb_state is not None + self.model_runner.eplb_state.rearrange( + self.model_runner.model, + execute_shuffle=True, + global_expert_load=global_expert_load, + rank_mapping=rank_mapping) + if get_ep_group().rank == 0: + logger.info("[Elastic EP] Expert resharding completed!") + + def _reconfigure_parallel_config( + self, reconfig_request: ReconfigureDistributedRequest) -> None: + """ + Update parallel config with provided reconfig_request + """ + parallel_config = self.vllm_config.parallel_config + parallel_config.data_parallel_size = \ + reconfig_request.new_data_parallel_size + if reconfig_request.new_data_parallel_rank != \ + ReconfigureRankType.KEEP_CURRENT_RANK: + parallel_config.data_parallel_rank = \ + reconfig_request.new_data_parallel_rank + if reconfig_request.new_data_parallel_rank_local != \ + ReconfigureRankType.KEEP_CURRENT_RANK: + parallel_config.data_parallel_rank_local = \ + reconfig_request.new_data_parallel_rank_local + parallel_config.data_parallel_master_ip = \ + reconfig_request.new_data_parallel_master_ip + parallel_config.data_parallel_master_port = \ + reconfig_request.new_data_parallel_master_port + + def _reconfigure_moe(self, old_ep_size: int, + new_ep_size: int) -> Optional[torch.Tensor]: + """ + Reconfigure MoE modules with provided reconfig_request + + Return the global expert load if new_ep_size > old_ep_size, + otherwise None + """ + from vllm.distributed.parallel_state import ( + get_dp_group, get_ep_group, prepare_communication_buffer_for_model) + from vllm.model_executor.layers.fused_moe.layer import ( + FusedMoEParallelConfig) + + parallel_config = self.vllm_config.parallel_config + moe_modules = [ + module for module in self.model_runner.model.modules() + if module.__class__.__name__ == "FusedMoE" + ] + num_local_experts = moe_modules[0].moe_config.num_local_experts + assert all(module.moe_config.num_local_experts == num_local_experts + for module in moe_modules), ( + "All MoE modules must have the same number of experts") + for module in moe_modules: + module.moe_config.num_experts = num_local_experts * new_ep_size + module.global_num_experts = module.moe_config.num_experts + module.moe_parallel_config = FusedMoEParallelConfig.make( + tp_size_=get_tp_group().world_size, + dp_size_=get_dp_group().world_size, + vllm_parallel_config=parallel_config, + ) + module.moe_config.moe_parallel_config = module.moe_parallel_config + if new_ep_size < old_ep_size: + 
num_local_physical_experts = num_local_experts + assert self.model_runner.eplb_state is not None + new_physical_experts = \ + self.model_runner.eplb_state.physical_to_logical_map.shape[1] + parallel_config.num_redundant_experts = ( + new_physical_experts - + self.model_runner.eplb_state.logical_replica_count.shape[1]) + global_expert_load = None + else: + num_local_physical_experts = torch.tensor([num_local_experts], + dtype=torch.int32, + device="cpu") + torch.distributed.broadcast(num_local_physical_experts, + group=get_ep_group().cpu_group, + group_src=0) + num_local_physical_experts = num_local_physical_experts.item() + new_physical_experts = num_local_physical_experts * new_ep_size + assert self.model_runner.eplb_state is not None + global_expert_load = self.model_runner.eplb_state.rearrange( + self.model_runner.model, execute_shuffle=False) + parallel_config.num_redundant_experts = ( + new_physical_experts - global_expert_load.shape[1]) + prepare_communication_buffer_for_model(self.model_runner.model) + self.model_runner.model.update_physical_experts_metadata( + num_physical_experts=new_physical_experts, + num_local_physical_experts=num_local_physical_experts) + return global_expert_load + + def reinitialize_distributed( + self, reconfig_request: ReconfigureDistributedRequest) -> None: + from vllm.config import set_current_vllm_config + from vllm.distributed.parallel_state import ( + cleanup_dist_env_and_memory, get_ep_group) + + old_ep_size = get_ep_group().world_size + old_ep_rank = get_ep_group().rank + new_ep_size = reconfig_request.new_data_parallel_size * get_tp_group( + ).world_size * get_pp_group().world_size + if new_ep_size < old_ep_size: + self._eplb_before_scale_down(old_ep_size, new_ep_size) + + cleanup_dist_env_and_memory() + + if reconfig_request.new_data_parallel_rank == \ + ReconfigureRankType.SHUTDOWN_CURRENT_RANK: + assert old_ep_rank >= new_ep_size + # shutdown + return + + self._reconfigure_parallel_config(reconfig_request) + + with set_current_vllm_config(self.vllm_config): + init_worker_distributed_environment(self.vllm_config, self.rank, + self.distributed_init_method, + self.local_rank) + + global_expert_load = self._reconfigure_moe(old_ep_size, new_ep_size) + + if new_ep_size > old_ep_size: + assert global_expert_load is not None + self._eplb_after_scale_up(old_ep_size, new_ep_size, + global_expert_load) + def save_sharded_state( self, path: str, From 795d8d5a6ce06d408cd3c635099c2e42c22221fd Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Sat, 19 Jul 2025 08:52:02 +0800 Subject: [PATCH 187/552] [Quantization] Enable BNB support for more MoE models (#21100) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- docs/models/supported_models.md | 8 +- vllm/model_executor/models/bailing_moe.py | 21 +- vllm/model_executor/models/ernie45_moe.py | 153 +++++++------- vllm/model_executor/models/grok1.py | 24 ++- vllm/model_executor/models/hunyuan_v1_moe.py | 198 ++++++++++--------- 5 files changed, 223 insertions(+), 181 deletions(-) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index de95e2f21ce..11a7f2440a4 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -316,7 +316,7 @@ Specified using `--task generate`. | `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. 
| | ✅︎ | ✅︎ | | `BaiChuanForCausalLM` | Baichuan2, Baichuan | `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | -| `BailingMoeForCausalLM` | Ling | `inclusionAI/Ling-lite-1.5`, `inclusionAI/Ling-plus`, etc. | | ✅︎ | ✅︎ | +| `BailingMoeForCausalLM` | Ling | `inclusionAI/Ling-lite-1.5`, `inclusionAI/Ling-plus`, etc. | ✅︎ | ✅︎ | ✅︎ | | `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | ✅︎ | | `BloomForCausalLM` | BLOOM, BLOOMZ, BLOOMChat | `bigscience/bloom`, `bigscience/bloomz`, etc. | | ✅︎ | | | `BartForConditionalGeneration` | BART | `facebook/bart-base`, `facebook/bart-large-cnn`, etc. | | | | @@ -328,8 +328,8 @@ Specified using `--task generate`. | `DeepseekV2ForCausalLM` | DeepSeek-V2 | `deepseek-ai/DeepSeek-V2`, `deepseek-ai/DeepSeek-V2-Chat`, etc. | | ✅︎ | ✅︎ | | `DeepseekV3ForCausalLM` | DeepSeek-V3 | `deepseek-ai/DeepSeek-V3-Base`, `deepseek-ai/DeepSeek-V3`, etc. | | ✅︎ | ✅︎ | | `Dots1ForCausalLM` | dots.llm1 | `rednote-hilab/dots.llm1.base`, `rednote-hilab/dots.llm1.inst`, etc. | | ✅︎ | ✅︎ | -| `Ernie4_5_ForCausalLM` | Ernie4.5 | `baidu/ERNIE-4.5-0.3B-PT`, etc. | | ✅︎ | ✅︎ | -| `Ernie4_5_MoeForCausalLM` | Ernie4.5MoE | `baidu/ERNIE-4.5-21B-A3B-PT`, `baidu/ERNIE-4.5-300B-A47B-PT`, etc. | | ✅︎ | ✅︎ | +| `Ernie4_5_ForCausalLM` | Ernie4.5 | `baidu/ERNIE-4.5-0.3B-PT`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `Ernie4_5_MoeForCausalLM` | Ernie4.5MoE | `baidu/ERNIE-4.5-21B-A3B-PT`, `baidu/ERNIE-4.5-300B-A47B-PT`, etc. |✅︎| ✅︎ | ✅︎ | | `ExaoneForCausalLM` | EXAONE-3 | `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Fairseq2LlamaForCausalLM` | Llama (fairseq2 format) | `mgleize/fairseq2-dummy-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `FalconForCausalLM` | Falcon | `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc. | | ✅︎ | ✅︎ | @@ -351,7 +351,7 @@ Specified using `--task generate`. | `GraniteMoeSharedForCausalLM` | Granite MoE Shared | `ibm-research/moe-7b-1b-active-shared-experts` (test model) | ✅︎ | ✅︎ | ✅︎ | | `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | | | `Grok1ModelForCausalLM` | Grok1 | `hpcai-tech/grok-1`. | ✅︎ | ✅︎ | ✅︎ | -| `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | | | ✅︎ | +| `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | ✅︎ | | ✅︎ | | `InternLMForCausalLM` | InternLM | `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc. | ✅︎ | ✅︎ | ✅︎ | | `InternLM2ForCausalLM` | InternLM2 | `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc. | ✅︎ | ✅︎ | ✅︎ | | `InternLM3ForCausalLM` | InternLM3 | `internlm/internlm3-8b-instruct`, etc. 
| ✅︎ | ✅︎ | ✅︎ | diff --git a/vllm/model_executor/models/bailing_moe.py b/vllm/model_executor/models/bailing_moe.py index ccfc3997e45..853c13b135e 100644 --- a/vllm/model_executor/models/bailing_moe.py +++ b/vllm/model_executor/models/bailing_moe.py @@ -53,7 +53,7 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .interfaces import SupportsPP +from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, PPMissingLayer, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, maybe_prefix) @@ -374,6 +374,14 @@ def forward( hidden_states, _ = self.norm(hidden_states, residual) return hidden_states + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.num_experts, + ) + def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: stacked_params_mapping = [ @@ -381,14 +389,10 @@ def load_weights(self, weights: Iterable[tuple[str, ("gate_up_proj", "gate_proj", 0), ("gate_up_proj", "up_proj", 1), ] - expert_params_mapping = FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="gate_proj", - ckpt_down_proj_name="down_proj", - ckpt_up_proj_name="up_proj", - num_experts=self.config.num_experts) params_dict = dict(self.named_parameters(remove_duplicate=False)) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: if self.config.norm_head and "lm_head.weight" in name: loaded_weight = F.normalize(loaded_weight, @@ -449,7 +453,7 @@ def load_weights(self, weights: Iterable[tuple[str, return loaded_params -class BailingMoeForCausalLM(nn.Module, SupportsPP): +class BailingMoeForCausalLM(nn.Module, SupportsPP, SupportsLoRA): packed_modules_mapping = { "query_key_value": ["query_key_value"], @@ -518,3 +522,6 @@ def load_weights(self, weights: Iterable[tuple[str, if self.config.tie_word_embeddings else None), ) return loader.load_weights(weights) + + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return self.model.get_expert_mapping() diff --git a/vllm/model_executor/models/ernie45_moe.py b/vllm/model_executor/models/ernie45_moe.py index e7a50ff7a1c..984003e62d1 100644 --- a/vllm/model_executor/models/ernie45_moe.py +++ b/vllm/model_executor/models/ernie45_moe.py @@ -51,8 +51,8 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .interfaces import SupportsPP -from .utils import (PPMissingLayer, extract_layer_index, +from .interfaces import SupportsLoRA, SupportsPP +from .utils import (AutoWeightsLoader, PPMissingLayer, extract_layer_index, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, maybe_prefix) @@ -427,66 +427,15 @@ def forward( return hidden_states + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: -class Ernie4_5_MoeForCausalLM(nn.Module, SupportsPP): - packed_modules_mapping = { - "qkv_proj": [ - "q_proj", - "k_proj", - "v_proj", - ], - "gate_up_proj": [ - "gate_proj", - "up_proj", - ], - } - - fall_back_to_pt_during_load = False - - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - config = vllm_config.model_config.hf_config - quant_config = vllm_config.quant_config - self.config = config - self.quant_config = quant_config - 
self.model = Ernie4_5_MoeModel(vllm_config=vllm_config, - prefix=maybe_prefix(prefix, "model")) - - if get_pp_group().is_last_rank: - self.lm_head = ParallelLMHead(config.vocab_size, - config.hidden_size, - quant_config=quant_config) - else: - self.lm_head = PPMissingLayer() - - if self.config.tie_word_embeddings: - self.lm_head.weight = self.model.embed_tokens.weight - self.logits_processor = LogitsProcessor(config.vocab_size) - self.make_empty_intermediate_tensors = ( - self.model.make_empty_intermediate_tensors) - - def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: - return self.model.get_input_embeddings(input_ids) - - def forward( - self, - input_ids: torch.Tensor, - positions: torch.Tensor, - intermediate_tensors: Optional[IntermediateTensors] = None, - inputs_embeds: Optional[torch.Tensor] = None, - ) -> Union[torch.Tensor, IntermediateTensors]: - hidden_states = self.model(input_ids, positions, intermediate_tensors, - inputs_embeds) - return hidden_states - - def compute_logits( - self, - hidden_states: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[torch.Tensor]: - logits = self.logits_processor(self.lm_head, hidden_states, - sampling_metadata) - return logits + # Params for weights, fp8 weight scales, fp8 activation scales + # (param_name, weight_name, expert_id, shard_id) + return FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.moe_num_experts) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: @@ -499,16 +448,9 @@ def load_weights(self, weights: Iterable[tuple[str, ("gate_up_proj", "up_proj", 1), ] - # Params for weights, fp8 weight scales, fp8 activation scales - # (param_name, weight_name, expert_id, shard_id) - expert_params_mapping = FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="gate_proj", - ckpt_down_proj_name="down_proj", - ckpt_up_proj_name="up_proj", - num_experts=self.config.moe_num_experts) - params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: if self.config.tie_word_embeddings and name.endswith( "lm_head.weight"): @@ -581,3 +523,76 @@ def load_weights(self, weights: Iterable[tuple[str, weight_loader(param, loaded_weight) loaded_params.add(name) return loaded_params + + +class Ernie4_5_MoeForCausalLM(nn.Module, SupportsPP, SupportsLoRA): + packed_modules_mapping = { + "qkv_proj": [ + "q_proj", + "k_proj", + "v_proj", + ], + "gate_up_proj": [ + "gate_proj", + "up_proj", + ], + } + + fall_back_to_pt_during_load = False + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + config = vllm_config.model_config.hf_config + quant_config = vllm_config.quant_config + self.config = config + self.quant_config = quant_config + self.model = Ernie4_5_MoeModel(vllm_config=vllm_config, + prefix=maybe_prefix(prefix, "model")) + + if get_pp_group().is_last_rank: + self.lm_head = ParallelLMHead(config.vocab_size, + config.hidden_size, + quant_config=quant_config) + else: + self.lm_head = PPMissingLayer() + + if self.config.tie_word_embeddings: + self.lm_head.weight = self.model.embed_tokens.weight + self.logits_processor = LogitsProcessor(config.vocab_size) + self.make_empty_intermediate_tensors = ( + self.model.make_empty_intermediate_tensors) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return 
self.model.get_input_embeddings(input_ids) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + hidden_states = self.model(input_ids, positions, intermediate_tensors, + inputs_embeds) + return hidden_states + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + logits = self.logits_processor(self.lm_head, hidden_states, + sampling_metadata) + return logits + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + loader = AutoWeightsLoader( + self, + skip_prefixes=(["lm_head."] + if self.config.tie_word_embeddings else None), + ) + return loader.load_weights(weights) + + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return self.model.get_expert_mapping() diff --git a/vllm/model_executor/models/grok1.py b/vllm/model_executor/models/grok1.py index 2d930527b2b..3659249cd8b 100644 --- a/vllm/model_executor/models/grok1.py +++ b/vllm/model_executor/models/grok1.py @@ -360,6 +360,16 @@ def forward( hidden_states, _ = self.norm(hidden_states, residual) return hidden_states + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + # Map Grok1's unique expert parameter names to standard names + # Grok1 uses "num_experts" in its config + num_experts = getattr(self.config, "num_experts", 8) + return FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="linear", # Grok1 specific + ckpt_down_proj_name="linear_1", # Grok1 specific + ckpt_up_proj_name="linear_v", # Grok1 specific + num_experts=num_experts) + def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: stacked_params_mapping = [ @@ -369,18 +379,9 @@ def load_weights(self, weights: Iterable[tuple[str, ("qkv_proj", "v_proj", "v"), ] - # Map Grok1's unique expert parameter names to standard names - # Grok1 uses "num_experts" in its config - num_experts = getattr(self.config, "num_experts", 8) - expert_params_mapping = FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="linear", # Grok1 specific - ckpt_down_proj_name="linear_1", # Grok1 specific - ckpt_up_proj_name="linear_v", # Grok1 specific - num_experts=num_experts) - params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() - + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: if (self.quant_config is not None and (scale_name := self.quant_config.get_cache_scale(name))): @@ -544,3 +545,6 @@ def load_weights(self, weights: Iterable[tuple[str, skip_prefixes=skip_prefixes, ) return loader.load_weights(weights) + + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return self.model.get_expert_mapping() diff --git a/vllm/model_executor/models/hunyuan_v1_moe.py b/vllm/model_executor/models/hunyuan_v1_moe.py index 43ffba00721..b3baec98b0f 100644 --- a/vllm/model_executor/models/hunyuan_v1_moe.py +++ b/vllm/model_executor/models/hunyuan_v1_moe.py @@ -56,7 +56,9 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .utils import PPMissingLayer, is_pp_missing_parameter, make_layers +from .interfaces import SupportsLoRA +from .utils import (AutoWeightsLoader, PPMissingLayer, is_pp_missing_parameter, + make_layers) def _get_cla_factor(config: PretrainedConfig) -> int: @@ -617,86 +619,6 @@ def forward( 
hidden_states, _ = self.norm(hidden_states, residual) return hidden_states - -class HunYuanMoEV1ForCausalLM(nn.Module): - packed_modules_mapping = { - "qkv_proj": [ - "q_proj", - "k_proj", - "v_proj", - ], - "gate_up_proj": [ - "gate_proj", - "up_proj", - ], - } - - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - - config = vllm_config.model_config.hf_config - quant_config = vllm_config.quant_config - lora_config = vllm_config.lora_config - self.config = config - self.quant_config = quant_config - self.lora_config = lora_config - - self.model = HunYuanModel(vllm_config=vllm_config, prefix="model") - if get_pp_group().is_last_rank: - self.unpadded_vocab_size = config.vocab_size - if lora_config: - self.unpadded_vocab_size += lora_config.lora_extra_vocab_size - self.lm_head = ParallelLMHead( - self.unpadded_vocab_size, - config.hidden_size, - org_num_embeddings=config.vocab_size, - padding_size=DEFAULT_VOCAB_PADDING_SIZE, - quant_config=quant_config, - ) - if config.tie_word_embeddings: - self.lm_head.weight = self.model.embed_tokens.weight - - logit_scale = getattr(config, "logit_scale", 1.0) - self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, - config.vocab_size, - logit_scale) - else: - self.lm_head = PPMissingLayer() - - def forward( - self, - input_ids: torch.Tensor, - positions: torch.Tensor, - intermediate_tensors: Optional[IntermediateTensors] = None, - inputs_embeds: Optional[torch.Tensor] = None, - ) -> Union[torch.Tensor, IntermediateTensors]: - model_output = self.model(input_ids, positions, intermediate_tensors, - inputs_embeds) - return model_output - - def compute_logits( - self, - hidden_states: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[torch.Tensor]: - logits = self.logits_processor(self.lm_head, hidden_states, - sampling_metadata) - return logits - - def make_empty_intermediate_tensors( - self, batch_size: int, dtype: torch.dtype, - device: torch.device) -> IntermediateTensors: - return IntermediateTensors({ - "hidden_states": - torch.zeros((batch_size, self.config.hidden_size), - dtype=dtype, - device=device), - "residual": - torch.zeros((batch_size, self.config.hidden_size), - dtype=dtype, - device=device), - }) - def _split_qkv_weight(self, qkv: torch.Tensor): num_attention_heads = self.config.num_attention_heads num_kv_heads = getattr(self.config, "num_key_value_heads", @@ -719,6 +641,17 @@ def _split_qkv_weight(self, qkv: torch.Tensor): v = v.reshape(-1, hidden_size) return torch.concat((q, k, v)) + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + + # Params for weights, fp8 weight scales, fp8 activation scales + # (param_name, weight_name, expert_id, shard_id) + return FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.num_experts, + ) + def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): cla_factor = _get_cla_factor(self.config) stacked_params_mapping = [ @@ -745,16 +678,9 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): ), ] - # Params for weights, fp8 weight scales, fp8 activation scales - # (param_name, weight_name, expert_id, shard_id) - expert_params_mapping = FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="gate_proj", - ckpt_down_proj_name="down_proj", - ckpt_up_proj_name="up_proj", - num_experts=self.config.num_experts, - ) - params_dict = dict(self.named_parameters()) + loaded_params: set[str] = 
set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: if "rotary_emb.inv_freq" in name: continue @@ -806,7 +732,7 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): param = params_dict[name] weight_loader = param.weight_loader weight_loader(param, loaded_weight, shard_id) - + loaded_params.add(name) is_found = True break if is_found: @@ -885,3 +811,93 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): weight_loader = getattr(param, "weight_loader", default_weight_loader) weight_loader(param, loaded_weight) + loaded_params.add(name) + return loaded_params + + +class HunYuanMoEV1ForCausalLM(nn.Module, SupportsLoRA): + packed_modules_mapping = { + "qkv_proj": [ + "q_proj", + "k_proj", + "v_proj", + ], + "gate_up_proj": [ + "gate_proj", + "up_proj", + ], + } + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + + config = vllm_config.model_config.hf_config + quant_config = vllm_config.quant_config + self.config = config + self.quant_config = quant_config + + self.model = HunYuanModel(vllm_config=vllm_config, prefix="model") + if get_pp_group().is_last_rank: + self.unpadded_vocab_size = config.vocab_size + self.lm_head = ParallelLMHead( + self.unpadded_vocab_size, + config.hidden_size, + org_num_embeddings=config.vocab_size, + padding_size=DEFAULT_VOCAB_PADDING_SIZE, + quant_config=quant_config, + ) + if config.tie_word_embeddings: + self.lm_head.weight = self.model.embed_tokens.weight + + logit_scale = getattr(config, "logit_scale", 1.0) + self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, + config.vocab_size, + logit_scale) + else: + self.lm_head = PPMissingLayer() + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + model_output = self.model(input_ids, positions, intermediate_tensors, + inputs_embeds) + return model_output + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + logits = self.logits_processor(self.lm_head, hidden_states, + sampling_metadata) + return logits + + def make_empty_intermediate_tensors( + self, batch_size: int, dtype: torch.dtype, + device: torch.device) -> IntermediateTensors: + return IntermediateTensors({ + "hidden_states": + torch.zeros((batch_size, self.config.hidden_size), + dtype=dtype, + device=device), + "residual": + torch.zeros((batch_size, self.config.hidden_size), + dtype=dtype, + device=device), + }) + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + loader = AutoWeightsLoader( + self, + skip_prefixes=(["lm_head."] + if self.config.tie_word_embeddings else None), + ) + return loader.load_weights(weights) + + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return self.model.get_expert_mapping() From 655aa4f8f6c2be71d46b6f3480010cd705fa32af Mon Sep 17 00:00:00 2001 From: Lucia Fang <116399278+luccafong@users.noreply.github.com> Date: Sat, 19 Jul 2025 11:48:38 +0800 Subject: [PATCH 188/552] [Core] Support Local Chunked Attention for Hybrid KV Cache (#19351) Signed-off-by: Lucia Fang Signed-off-by: Lu Fang Signed-off-by: Lu Fang Co-authored-by: Lu Fang Signed-off-by: x22x22 --- tests/v1/core/test_specialized_manager.py | 157 ++++++++++++++++++- vllm/attention/layer.py | 1 + vllm/config.py | 7 + 
vllm/v1/attention/backends/flash_attn.py | 3 +- vllm/v1/attention/backends/utils.py | 1 + vllm/v1/core/kv_cache_utils.py | 19 ++- vllm/v1/core/single_type_kv_cache_manager.py | 125 ++++++++++++++- vllm/v1/kv_cache_interface.py | 49 ++++-- vllm/v1/worker/gpu_model_runner.py | 8 + 9 files changed, 351 insertions(+), 19 deletions(-) diff --git a/tests/v1/core/test_specialized_manager.py b/tests/v1/core/test_specialized_manager.py index a9e1898df93..b67c05bd7ac 100644 --- a/tests/v1/core/test_specialized_manager.py +++ b/tests/v1/core/test_specialized_manager.py @@ -1,13 +1,17 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import random + import torch from vllm.v1.core.block_pool import BlockPool from vllm.v1.core.kv_cache_utils import (BlockHash, BlockHashWithGroupId, KVCacheBlock) -from vllm.v1.core.single_type_kv_cache_manager import SlidingWindowManager -from vllm.v1.kv_cache_interface import SlidingWindowSpec +from vllm.v1.core.single_type_kv_cache_manager import ( + ChunkedLocalAttentionManager, SlidingWindowManager) +from vllm.v1.kv_cache_interface import (ChunkedLocalAttentionSpec, + SlidingWindowSpec) def get_sliding_window_manager(sliding_window_spec, block_pool): @@ -17,6 +21,80 @@ def get_sliding_window_manager(sliding_window_spec, block_pool): kv_cache_group_id=0) +def get_chunked_local_attention_manager(chunked_local_attention_spec, + block_pool): + return ChunkedLocalAttentionManager(chunked_local_attention_spec, + block_pool, + caching_hash_fn=lambda x: x, + kv_cache_group_id=0) + + +def test_chunked_local_attention_possible_cached_prefix(): + block_size = 2 + chunked_local_attention_spec = ChunkedLocalAttentionSpec( + block_size=block_size, + num_kv_heads=1, + head_size=1, + dtype=torch.float32, + attention_chunk_size=4, + use_mla=False, + ) + + block_pool = BlockPool(num_gpu_blocks=100, enable_caching=True) + manager = get_chunked_local_attention_manager(chunked_local_attention_spec, + block_pool) + + def run_one_case(block_is_cached, tail_token, expect_length): + block_hash_list = [ + BlockHash(i, ()) for i in range(len(block_is_cached)) + ] + + block_pool.cached_block_hash_to_block.clear() + + # Mock the block pool with the cached blocks + for i, (block_hash, + is_cached) in enumerate(zip(block_hash_list, block_is_cached)): + if is_cached: + block_pool.cached_block_hash_to_block[BlockHashWithGroupId( + block_hash, 0)] = { + i: block_pool.blocks[i + 10], + } + + computed_blocks = manager.find_longest_cache_hit( + block_hashes=block_hash_list, + max_length=len(block_hash_list) * block_size + tail_token, + kv_cache_group_ids=[0], + block_pool=block_pool, + kv_cache_spec=chunked_local_attention_spec, + use_eagle=False)[0] + assert len(computed_blocks) == expect_length + + assert all(block == block_pool.null_block + for block in computed_blocks[:(expect_length - 1) // 2]) + + run_one_case([True], 0, 1) + run_one_case([True], 1, 1) + run_one_case([True, False], 0, 2) + run_one_case([True, False], 1, 2) + run_one_case([True, True], 0, 2) + run_one_case([True, True], 1, 2) + run_one_case([True, True, False], 0, 2) + run_one_case([True, True, False], 1, 2) + run_one_case([True, True, True], 0, 3) + run_one_case([True, True, True], 1, 3) + run_one_case([True, True, True, False], 0, 4) + run_one_case([True, True, True, False], 1, 4) + run_one_case([random.choice([True, False])] * 8 + [True], 1, 9) + run_one_case([random.choice([True, False])] * 8 + [False], 1, 8) + run_one_case([random.choice([True, False])] * 8 + [True, 
True], 1, 10) + run_one_case([random.choice([True, False])] * 8 + [True, False], 0, 10) + run_one_case([random.choice([True, False])] * 8 + [True, False], 1, 10) + run_one_case([random.choice([True, False])] * 8 + [False, True], 0, 10) + run_one_case([random.choice([True, False])] * 8 + [False, True], 1, 10) + run_one_case([random.choice([True, False])] * 8 + [False, False], 0, 10) + run_one_case([random.choice([True, False])] * 8 + [False, False], 1, 10) + + def test_sliding_window_possible_cached_prefix(): block_size = 2 sliding_window_spec = SlidingWindowSpec( @@ -84,6 +162,58 @@ def run_one_case(block_is_cached, expect_length): ], 8) +def test_chunked_local_attention_remove_skipped_blocks(): + attention_spec = ChunkedLocalAttentionSpec( + block_size=2, + num_kv_heads=1, + head_size=1, + dtype=torch.float32, + attention_chunk_size=4, + use_mla=False, + ) + + block_pool = BlockPool(num_gpu_blocks=2000, enable_caching=True) + + manager = get_chunked_local_attention_manager(attention_spec, block_pool) + + null_block_id = block_pool.null_block.block_id + + def id_to_block_table(ids) -> list[KVCacheBlock]: + return [ + KVCacheBlock(id_) + if id_ != null_block_id else block_pool.null_block for id_ in ids + ] + + def assert_block_id(block_table: list[KVCacheBlock], ids: list[int]): + for block, id_ in zip(block_table, ids): + if id_ == null_block_id: + assert block == block_pool.null_block + else: + assert block.block_id == id_ + + original_block_ids = [ + 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010 + ] + block_table = id_to_block_table(original_block_ids) + manager.req_to_blocks["test"] = block_table + + manager.remove_skipped_blocks("test", 0) + assert_block_id(block_table, original_block_ids) + + # For 4th token (0-indexed), token 0-3 is out of the local attention window. + manager.remove_skipped_blocks("test", 4) + assert_block_id(block_table, [null_block_id] * 2) + + # For 6th token (0-indexed), token 4 - 6 are in local attention window, + # token 0 - 3 are out, 2 blocks can be removed. + manager.remove_skipped_blocks("test", 6) + assert_block_id(block_table, [null_block_id] * 2 + original_block_ids[2:]) + # For 12th token (0-indexed), + # token 0-11 are out, 6 block can be removed. 
+ manager.remove_skipped_blocks("test", 12) + assert_block_id(block_table, [null_block_id] * 6) + + def test_sliding_window_remove_skipped_blocks(): sliding_window_spec = SlidingWindowSpec( block_size=2, @@ -172,3 +302,26 @@ def test_get_num_blocks_to_allocate(): cached_blocks_1) == 20 assert manager.get_num_blocks_to_allocate("2", 20 * block_size, cached_blocks_2) == 15 + + +def test_chunked_local_attention_get_num_blocks_to_allocate(): + block_size = 2 + attention_spec = ChunkedLocalAttentionSpec( + block_size=block_size, + num_kv_heads=1, + head_size=1, + dtype=torch.float32, + attention_chunk_size=4, # Placeholder value, not related to test result + use_mla=False, + ) + + block_pool = BlockPool(num_gpu_blocks=100, enable_caching=True) + manager = get_chunked_local_attention_manager(attention_spec, block_pool) + cached_blocks_1 = [KVCacheBlock(i + 1) for i in range(10)] + cached_blocks_2 = [block_pool.null_block for _ in range(5) + ] + [KVCacheBlock(i + 1) for i in range(5)] + + assert manager.get_num_blocks_to_allocate("1", 20 * block_size, + cached_blocks_1) == 20 + assert manager.get_num_blocks_to_allocate("2", 20 * block_size, + cached_blocks_2) == 15 diff --git a/vllm/attention/layer.py b/vllm/attention/layer.py index b6b93ff4a0a..d0677525d31 100644 --- a/vllm/attention/layer.py +++ b/vllm/attention/layer.py @@ -172,6 +172,7 @@ def __init__( kv_sharing_target_layer_name, **extra_impl_args) self.backend = backend_name_to_enum(attn_backend.get_name()) self.dtype = dtype + self.use_irope = extra_impl_args.get("use_irope", False) # For cuda-alike (CUDA and ROCM) and cpu platforms, we control how # torch.compile works by registering the attention as one giant diff --git a/vllm/config.py b/vllm/config.py index ef0bd9a3d0d..270027a4b5a 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -4751,6 +4751,13 @@ def __post_init__(self): if self.kv_events_config is not None: # Hybrid KV cache manager is not compatible with KV events. self.scheduler_config.disable_hybrid_kv_cache_manager = True + if self.model_config is not None and \ + self.model_config.attention_chunk_size is not None and \ + self.speculative_config is not None and \ + self.speculative_config.use_eagle(): + # Hybrid KV cache manager is not yet supported with chunked + # local attention + eagle. + self.scheduler_config.disable_hybrid_kv_cache_manager = True def update_sizes_for_sequence_parallelism(self, possible_sizes: list) -> list: diff --git a/vllm/v1/attention/backends/flash_attn.py b/vllm/v1/attention/backends/flash_attn.py index d5b30ac685a..a37bf2a7115 100755 --- a/vllm/v1/attention/backends/flash_attn.py +++ b/vllm/v1/attention/backends/flash_attn.py @@ -538,6 +538,7 @@ def use_cascade_attention( num_kv_heads: int, use_alibi: bool, use_sliding_window: bool, + use_local_attention: bool, num_sms: int, ) -> bool: """Decide whether to use cascade attention. @@ -553,7 +554,7 @@ def use_cascade_attention( if common_prefix_len < 256: return False # Cascade attention is currently not supported with these variants. - if use_alibi or use_sliding_window: + if use_alibi or use_sliding_window or use_local_attention: return False # Too few queries. Probably not worth using cascade attention. # We use an arbitrary threshold of 8 queries. TODO: Tune this threshold. 
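For reference, the expected prefix lengths in the `ChunkedLocalAttentionManager` cache-hit tests above follow a simple rule: blocks that fall entirely before the start of the current local-attention chunk are treated as already computed (null blocks), and only the blocks inside the window are looked up until the first miss. The following standalone sketch of that arithmetic uses the same block size (2) and chunk size (4) as the tests; the helper name and signature are illustrative and not taken from vLLM itself.

```python
def expected_prefix_blocks(block_is_cached: list[bool], tail_tokens: int,
                           block_size: int = 2, chunk_size: int = 4) -> int:
    # Total query length covered by the hashed blocks plus any partial tail.
    max_length = len(block_is_cached) * block_size + tail_tokens
    max_num_blocks = max_length // block_size
    # Tokens before the current chunk boundary are outside the local window,
    # so their blocks are marked computed and filled with null blocks.
    window_start = (max_length // chunk_size) * chunk_size if max_length > 0 else 0
    first_lookup_block = window_start // block_size
    hit_blocks = first_lookup_block
    # Inside the window, stop at the first block that is not cached.
    for i in range(first_lookup_block, max_num_blocks):
        if not block_is_cached[i]:
            break
        hit_blocks += 1
    return hit_blocks

# Matches run_one_case([True, True, False], 0, 2) and
# run_one_case([True, True, True, False], 1, 4) above.
assert expected_prefix_blocks([True, True, False], tail_tokens=0) == 2
assert expected_prefix_blocks([True, True, True, False], tail_tokens=1) == 4
```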
diff --git a/vllm/v1/attention/backends/utils.py b/vllm/v1/attention/backends/utils.py index b6a06b17bca..65c3baa6784 100644 --- a/vllm/v1/attention/backends/utils.py +++ b/vllm/v1/attention/backends/utils.py @@ -120,6 +120,7 @@ def use_cascade_attention( num_kv_heads: int, use_alibi: bool, use_sliding_window: bool, + use_local_attention: bool, num_sms: int, ) -> bool: return False diff --git a/vllm/v1/core/kv_cache_utils.py b/vllm/v1/core/kv_cache_utils.py index b1fab0d34de..457d95cc738 100644 --- a/vllm/v1/core/kv_cache_utils.py +++ b/vllm/v1/core/kv_cache_utils.py @@ -11,7 +11,8 @@ from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.utils import GiB_bytes, cdiv, sha256_cbor_64bit -from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheConfig, +from vllm.v1.kv_cache_interface import (ChunkedLocalAttentionSpec, + FullAttentionSpec, KVCacheConfig, KVCacheGroupSpec, KVCacheSpec, KVCacheTensor, SlidingWindowSpec) from vllm.v1.metrics.stats import PrefixCacheStats @@ -976,7 +977,11 @@ def is_hybrid(kv_cache_spec: dict[str, KVCacheSpec]) -> bool: isinstance(spec, FullAttentionSpec) for spec in kv_cache_spec.values()) has_sliding_window = any( isinstance(spec, SlidingWindowSpec) for spec in kv_cache_spec.values()) - if has_full_attention and has_sliding_window: + has_chunked_local_attention = any( + isinstance(spec, ChunkedLocalAttentionSpec) + for spec in kv_cache_spec.values()) + if has_full_attention and (has_sliding_window + or has_chunked_local_attention): for layer_name, spec in kv_cache_spec.items(): if isinstance(spec, SlidingWindowSpec): kv_cache_spec[layer_name] = FullAttentionSpec( @@ -987,6 +992,15 @@ def is_hybrid(kv_cache_spec: dict[str, KVCacheSpec]) -> bool: use_mla=spec.use_mla, sliding_window=spec.sliding_window, ) + elif isinstance(spec, ChunkedLocalAttentionSpec): + kv_cache_spec[layer_name] = FullAttentionSpec( + block_size=spec.block_size, + num_kv_heads=spec.num_kv_heads, + head_size=spec.head_size, + dtype=spec.dtype, + use_mla=spec.use_mla, + attention_chunk_size=spec.attention_chunk_size, + ) if is_hybrid(kv_cache_spec): raise ValueError("Hybrid KV cache manager is disabled but failed to " @@ -1010,7 +1024,6 @@ def get_kv_cache_config( The generated KVCacheConfigs """ check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory) - if vllm_config.scheduler_config.disable_hybrid_kv_cache_manager: unify_hybrid_kv_cache_specs(kv_cache_spec) diff --git a/vllm/v1/core/single_type_kv_cache_manager.py b/vllm/v1/core/single_type_kv_cache_manager.py index 1560406c900..65a196e044a 100644 --- a/vllm/v1/core/single_type_kv_cache_manager.py +++ b/vllm/v1/core/single_type_kv_cache_manager.py @@ -394,6 +394,129 @@ def get_num_common_prefix_blocks(self, request_id: str, return 0 +class ChunkedLocalAttentionManager(SingleTypeKVCacheManager): + + def __init__(self, kv_cache_spec: ChunkedLocalAttentionSpec, + block_pool: BlockPool, **kwargs) -> None: + super().__init__(kv_cache_spec, block_pool, **kwargs) + self.attention_chunk_size = kv_cache_spec.attention_chunk_size + self._null_block = block_pool.null_block + + @classmethod + def find_longest_cache_hit( + cls, + block_hashes: list[BlockHash], + max_length: int, + kv_cache_group_ids: list[int], + block_pool: BlockPool, + kv_cache_spec: KVCacheSpec, + use_eagle: bool, + ) -> tuple[list[KVCacheBlock], ...]: + """ + For chunked local attention, we need to find the longest cache hit + prefix of the blocks that is not longer than `max_length`. 
The prefix + should be a common prefix hit for all the kv cache groups in + `kv_cache_group_ids`. If no cache hit is found, return an empty list. + note we mark as computed if the whole block is outside of the local + window, and set the block as null. Examples: + + 1. Attention chunk size of 8, block size of 4, max length of 15 + for next token at 15th (zero-indexed), 8th - 14th tokens are in + the window(needs lookup), 0th - 7th are not in the window, + so they are already marked as computed. We check the complete + block3 (8th - 11th tokens), Assume block 3 is hit, we will return + [null, null, block 3], otherwise, we return [null, null] + + 2. Attention chunk size of 8, block size of 4, max length of 16 + for next token at 16th (zero-indexed), 0th - 15th tokens are not + in the window, so they are already marked as computed. + we return 4 blocks[null, null, null, null] + + Args: + block_hashes: The block hashes of the request. + max_length: The maximum length of the cache hit prefix. + kv_cache_group_ids: The ids of the kv cache groups. + block_pool: The block pool. + kv_cache_spec: The kv cache spec. + use_eagle: Whether to use eagle. + + Returns: + A list of cached blocks + """ + assert isinstance(kv_cache_spec, ChunkedLocalAttentionSpec), ( + "ChunkedLocalAttentionManager can only be used for " + + "chunked local attention groups") + assert use_eagle is False, ("Hybrid KV cache is not supported for " + + "eagle + chunked local attention.") + max_num_blocks = max_length // kv_cache_spec.block_size + if max_length > 0: + local_attention_start_idx = (max_length // + kv_cache_spec.attention_chunk_size * + kv_cache_spec.attention_chunk_size) + else: + local_attention_start_idx = 0 + # we marked blocks out of window as computed + # with null blocks, and blocks inside window based on cache lookup + # result [null] [null] ... [null] [hit block 1 (1st block contain + # last window)] [hit block 2] ... [hit block x] + local_attention_start_block_idx = (local_attention_start_idx // + kv_cache_spec.block_size) + computed_blocks: tuple[list[KVCacheBlock], ...] = tuple( + [block_pool.null_block] * local_attention_start_block_idx + for _ in range(len(kv_cache_group_ids))) + for i in range(local_attention_start_block_idx, max_num_blocks): + block_hash = block_hashes[i] + if cached_block := block_pool.get_cached_block( + block_hash, kv_cache_group_ids): + for computed, cached in zip(computed_blocks, cached_block): + computed.append(cached) + else: + break + return computed_blocks + + def remove_skipped_blocks(self, request_id: str, + num_computed_tokens: int) -> None: + # Remove the blocks that are no longer be in the chunked attention + # window and skipped during the attention computation. + + # [chunk 0][chunk 1]local_attention_start_idx ... current + # we computed previous number of chunks to get the idx of + # current chunk window starting offset, + # e.g. for computed 1024 tokens, the 1024th token (0 indexed) + # is in the second chunk, there are 1 prev chunk, the start idx + # is 1024. for 1023, it will be 0. 
+ num_cached_block = self.num_cached_block.get(request_id, 0) + local_attention_start_idx = ( + num_computed_tokens + ) // self.attention_chunk_size * self.attention_chunk_size + first_useful_block_idx = local_attention_start_idx // self.block_size + if num_cached_block > 0: + # Make sure we don't delete the last cached block + first_useful_block_idx = min(first_useful_block_idx, + num_cached_block - 1) + # if block size = 128, 0 -> block 0, 1024 (= 128 * 8) -> + # block 8, 372 (= 128 * 2 + 116) -> block 2 + blocks = self.req_to_blocks[request_id] + removed_blocks: list[KVCacheBlock] = [] + # we need to keep the last block to get the previous hash key + for i in range(first_useful_block_idx - 1, -1, -1): + if blocks[i] == self._null_block: + # If the block is already a null block, the blocks before it + # should also have been set to null blocks by the previous calls + # to this function. + break + removed_blocks.append(blocks[i]) + blocks[i] = self._null_block + self.block_pool.free_blocks(removed_blocks) + + def get_num_common_prefix_blocks(self, request_id: str, + num_running_requests: int) -> int: + """ + cascade attention is not supported by chunked local attention. + """ + return 0 + + class MambaManager(SingleTypeKVCacheManager): @classmethod @@ -435,8 +558,8 @@ def allocate_new_blocks(self, request_id: str, spec_manager_map: dict[type[KVCacheSpec], type[SingleTypeKVCacheManager]] = { FullAttentionSpec: FullAttentionManager, - ChunkedLocalAttentionSpec: FullAttentionManager, SlidingWindowSpec: SlidingWindowManager, + ChunkedLocalAttentionSpec: ChunkedLocalAttentionManager, MambaSpec: MambaManager, } diff --git a/vllm/v1/kv_cache_interface.py b/vllm/v1/kv_cache_interface.py index 6726709955f..bec31a7a058 100644 --- a/vllm/v1/kv_cache_interface.py +++ b/vllm/v1/kv_cache_interface.py @@ -87,6 +87,7 @@ def page_size_bytes(self) -> int: @dataclass class FullAttentionSpec(AttentionSpec): sliding_window: Optional[int] = None + attention_chunk_size: Optional[int] = None """ When hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, sliding @@ -105,6 +106,17 @@ def max_memory_usage_bytes(self, vllm_config: VllmConfig) -> int: max_model_len = vllm_config.model_config.max_model_len return cdiv(max_model_len, self.block_size) * self.page_size_bytes + @classmethod + def merge_window_sizes(cls, window_sizes: set[int]) -> Optional[int]: + if len(window_sizes) == 0: + return None + elif len(window_sizes) == 1: + return window_sizes.pop() + else: + raise ValueError( + "All attention layers in the same KV cache group must have the " + "same window size.") + @classmethod def merge(cls, specs: list[Self]) -> Self: """ @@ -114,14 +126,17 @@ def merge(cls, specs: list[Self]) -> Self: merged_spec = super().merge(specs) sliding_window = set(spec.sliding_window for spec in specs if spec.sliding_window is not None) - if len(sliding_window) == 0: - merged_spec.sliding_window = None - elif len(sliding_window) == 1: - merged_spec.sliding_window = sliding_window.pop() - else: - raise ValueError( - "All sliding window layers in the same KV cache group " - "must have the same window size.") + attention_chunk_size = set(spec.attention_chunk_size for spec in specs + if spec.attention_chunk_size is not None) + + merged_spec.sliding_window = cls.merge_window_sizes(sliding_window) + merged_spec.attention_chunk_size = ( + cls.merge_window_sizes(attention_chunk_size)) + assert ( + (merged_spec.sliding_window is not None) + + (merged_spec.attention_chunk_size 
is not None) <= 1 + ), ("Model with both sliding window layers and chunked local attention " + "layers is not supported.") return merged_spec @@ -129,16 +144,26 @@ def merge(cls, specs: list[Self]) -> Self: class ChunkedLocalAttentionSpec(AttentionSpec): attention_chunk_size: int - def max_memory_usage_bytes(self, vllm_config: VllmConfig) -> int: - max_model_len = vllm_config.model_config.max_model_len - return cdiv(max_model_len, self.block_size) * self.page_size_bytes - @property def type_id(self) -> str: return ( f"local_attention_{self.attention_chunk_size}_{self.block_size}_{self.page_size_bytes}" ) # noqa + def max_memory_usage_bytes(self, vllm_config: VllmConfig) -> int: + max_model_len = vllm_config.model_config.max_model_len + max_num_batched_tokens = ( + vllm_config.scheduler_config.max_num_batched_tokens) + + # During chunked prefill, we allocate KV cache for at most + # `self.attention_chunk_size` computed tokens plus the newly scheduled + # tokens. And we won't allocate KV cache for more than `max_model_len` + # tokens. + num_tokens = min(self.attention_chunk_size + max_num_batched_tokens, + max_model_len) + + return cdiv(num_tokens, self.block_size) * self.page_size_bytes + @dataclass class SlidingWindowSpec(AttentionSpec): diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 06d0214c4d6..9620bf6a795 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -862,6 +862,10 @@ def _compute_cascade_attn_prefix_len( use_sliding_window = (isinstance(kv_cache_spec, SlidingWindowSpec) or (isinstance(kv_cache_spec, FullAttentionSpec) and kv_cache_spec.sliding_window is not None)) + use_local_attention = ( + isinstance(kv_cache_spec, ChunkedLocalAttentionSpec) + or (isinstance(kv_cache_spec, FullAttentionSpec) + and kv_cache_spec.attention_chunk_size is not None)) assert isinstance(kv_cache_spec, AttentionSpec) use_cascade = attn_metadata_builder.use_cascade_attention( common_prefix_len=common_prefix_len, @@ -870,6 +874,7 @@ def _compute_cascade_attn_prefix_len( num_kv_heads=kv_cache_spec.num_kv_heads, use_alibi=self.use_alibi, use_sliding_window=use_sliding_window, + use_local_attention=use_local_attention, num_sms=self.num_sms, ) return common_prefix_len if use_cascade else 0 @@ -2672,6 +2677,9 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: dtype=self.kv_cache_dtype, sliding_window=attn_module.sliding_window, use_mla=use_mla) + assert not use_local_attention, ( + "attention module can not be with ", + "both local attention and sliding window") elif use_local_attention: kv_cache_spec[layer_name] = (ChunkedLocalAttentionSpec( block_size=block_size, From 97e9862d314a63c7e9d09468269c7ccd18e69a8a Mon Sep 17 00:00:00 2001 From: Varun Sundar Rabindranath Date: Sat, 19 Jul 2025 09:45:03 +0530 Subject: [PATCH 189/552] [Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 (#21183) Signed-off-by: Varun Sundar Rabindranath Co-authored-by: Varun Sundar Rabindranath Signed-off-by: x22x22 --- vllm/lora/models.py | 19 +++++++++++++++++-- vllm/lora/utils.py | 16 ++++++++++------ 2 files changed, 27 insertions(+), 8 deletions(-) diff --git a/vllm/lora/models.py b/vllm/lora/models.py index 521bb079da4..633674d5fb2 100644 --- a/vllm/lora/models.py +++ b/vllm/lora/models.py @@ -498,6 +498,14 @@ def remove_all_adapters(self): self._active_adapters.clear() def _create_lora_modules(self): + + def _parent_module(module_name: str) -> str: + # module name is a dot separated name. 
+ # for example: + # - given an input 'x.y.z' return 'x.y' + # - given an input 'x' return '' + return module_name.rpartition('.')[0] + for module_name, module in self.model.named_modules( remove_duplicate=False): if isinstance(module, PPMissingLayer): @@ -529,10 +537,17 @@ def _create_lora_modules(self): new_module.scaling_factor_to_offset # (yard1): TODO make this more robust if "lm_head" in module_name: + logits_processor_module_name = 'logits_processor' + parent_module = _parent_module(module_name) + if parent_module: + logits_processor_module_name = ( + f"{parent_module}.{logits_processor_module_name}") + logits_processor_module = self.model.get_submodule( - "logits_processor") + logits_processor_module_name) + new_module = replace_submodule( - self.model, "logits_processor", + self.model, logits_processor_module_name, from_layer_logits_processor(logits_processor_module, module, self.lora_slots, self.lora_config, diff --git a/vllm/lora/utils.py b/vllm/lora/utils.py index 6b3291e9c92..7148ffe1494 100644 --- a/vllm/lora/utils.py +++ b/vllm/lora/utils.py @@ -188,16 +188,20 @@ def get_supported_lora_modules(model: nn.Module) -> list[str]: """ In vLLM, all linear layers support LoRA. """ + supported_lora_modules: set[str] = set() - # step1: traverse the model to get all the linear subfixes. for name, module in model.named_modules(): + # get the embedding modules if the module's embedding_modules + # is not empty. + embedding_modules = getattr(module, "embedding_modules", None) + if embedding_modules is not None: + for name in embedding_modules: + supported_lora_modules.add(name) + + # get all the linear subfixes. if isinstance(module, (LinearBase, )): supported_lora_modules.add(name.split(".")[-1]) - # step 2: get the embedding modules if the model's mbedding_modules - # is not empty. 
- if model.embedding_modules: - for name in model.embedding_modules: - supported_lora_modules.add(name) + return list(supported_lora_modules) From cf382a7d47db0abd42cbd716a5012ad5695c94d8 Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Fri, 18 Jul 2025 21:47:50 -0700 Subject: [PATCH 190/552] [V0 Deprecation] Remove V0 Spec Decode workers (#21152) Signed-off-by: Woosuk Kwon Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 14 - .github/CODEOWNERS | 1 - .github/mergify.yml | 3 - pyproject.toml | 1 - tests/core/test_serialization.py | 2 +- tests/core/utils.py | 134 +- tests/metrics/test_metrics.py | 146 -- tests/models/registry.py | 8 +- tests/models/test_registry.py | 14 +- tests/samplers/test_rejection_sampler.py | 577 ------- .../test_typical_acceptance_sampler.py | 480 ------ tests/spec_decode/__init__.py | 0 tests/spec_decode/conftest.py | 12 - tests/spec_decode/e2e/__init__.py | 0 tests/spec_decode/e2e/conftest.py | 307 ---- tests/spec_decode/e2e/test_compatibility.py | 66 - .../spec_decode/e2e/test_eagle_correctness.py | 480 ------ tests/spec_decode/e2e/test_integration.py | 161 -- .../e2e/test_integration_dist_tp2.py | 247 --- .../e2e/test_integration_dist_tp4.py | 123 -- tests/spec_decode/e2e/test_logprobs.py | 315 ---- .../e2e/test_medusa_correctness.py | 417 ------ tests/spec_decode/e2e/test_mlp_correctness.py | 533 ------- tests/spec_decode/e2e/test_mtp_correctness.py | 333 ----- .../e2e/test_multistep_correctness.py | 842 ----------- .../spec_decode/e2e/test_ngram_correctness.py | 392 ----- tests/spec_decode/e2e/test_seed.py | 70 - tests/spec_decode/test_batch_expansion.py | 110 -- tests/spec_decode/test_dynamic_spec_decode.py | 90 -- tests/spec_decode/test_memory_usage.py | 91 -- tests/spec_decode/test_metrics.py | 205 --- tests/spec_decode/test_multi_step_worker.py | 838 ----------- tests/spec_decode/test_ngram_worker.py | 221 --- tests/spec_decode/test_scorer.py | 116 -- tests/spec_decode/test_spec_decode_worker.py | 945 ------------ tests/spec_decode/test_utils.py | 150 -- tests/spec_decode/utils.py | 290 ---- tests/test_sequence.py | 1 - tests/v1/test_oracle.py | 6 - tools/mypy.sh | 1 - vllm/config.py | 61 +- vllm/engine/arg_utils.py | 28 +- vllm/engine/llm_engine.py | 8 - vllm/engine/metrics.py | 66 - vllm/engine/metrics_types.py | 12 +- vllm/engine/output_processor/multi_step.py | 5 - .../layers/rejection_sampler.py | 406 ----- vllm/model_executor/layers/sampler.py | 12 +- .../layers/spec_decode_base_sampler.py | 259 ---- .../layers/typical_acceptance_sampler.py | 166 --- vllm/model_executor/models/eagle.py | 261 ---- vllm/model_executor/models/registry.py | 5 +- vllm/platforms/cuda.py | 12 +- vllm/platforms/rocm.py | 11 +- vllm/sequence.py | 14 +- vllm/spec_decode/__init__.py | 0 vllm/spec_decode/batch_expansion.py | 506 ------- vllm/spec_decode/draft_model_runner.py | 349 ----- vllm/spec_decode/interfaces.py | 99 -- vllm/spec_decode/medusa_worker.py | 138 -- vllm/spec_decode/metrics.py | 213 --- vllm/spec_decode/mlp_speculator_worker.py | 94 -- vllm/spec_decode/mqa_scorer.py | 160 -- vllm/spec_decode/multi_step_worker.py | 423 ------ vllm/spec_decode/ngram_worker.py | 196 --- vllm/spec_decode/proposer_worker_base.py | 59 - .../spec_decode/smaller_tp_proposer_worker.py | 196 --- vllm/spec_decode/spec_decode_worker.py | 1326 ----------------- vllm/spec_decode/target_model_runner.py | 45 - vllm/spec_decode/top1_proposer.py | 275 ---- vllm/spec_decode/util.py | 277 ---- vllm/transformers_utils/configs/eagle.py | 40 +- vllm/worker/worker_base.py | 2 - 73 files 
changed, 191 insertions(+), 14275 deletions(-) delete mode 100644 tests/samplers/test_rejection_sampler.py delete mode 100644 tests/samplers/test_typical_acceptance_sampler.py delete mode 100644 tests/spec_decode/__init__.py delete mode 100644 tests/spec_decode/conftest.py delete mode 100644 tests/spec_decode/e2e/__init__.py delete mode 100644 tests/spec_decode/e2e/conftest.py delete mode 100644 tests/spec_decode/e2e/test_compatibility.py delete mode 100644 tests/spec_decode/e2e/test_eagle_correctness.py delete mode 100644 tests/spec_decode/e2e/test_integration.py delete mode 100644 tests/spec_decode/e2e/test_integration_dist_tp2.py delete mode 100644 tests/spec_decode/e2e/test_integration_dist_tp4.py delete mode 100644 tests/spec_decode/e2e/test_logprobs.py delete mode 100644 tests/spec_decode/e2e/test_medusa_correctness.py delete mode 100644 tests/spec_decode/e2e/test_mlp_correctness.py delete mode 100644 tests/spec_decode/e2e/test_mtp_correctness.py delete mode 100644 tests/spec_decode/e2e/test_multistep_correctness.py delete mode 100644 tests/spec_decode/e2e/test_ngram_correctness.py delete mode 100644 tests/spec_decode/e2e/test_seed.py delete mode 100644 tests/spec_decode/test_batch_expansion.py delete mode 100644 tests/spec_decode/test_dynamic_spec_decode.py delete mode 100644 tests/spec_decode/test_memory_usage.py delete mode 100644 tests/spec_decode/test_metrics.py delete mode 100644 tests/spec_decode/test_multi_step_worker.py delete mode 100644 tests/spec_decode/test_ngram_worker.py delete mode 100644 tests/spec_decode/test_scorer.py delete mode 100644 tests/spec_decode/test_spec_decode_worker.py delete mode 100644 tests/spec_decode/test_utils.py delete mode 100644 tests/spec_decode/utils.py delete mode 100644 vllm/model_executor/layers/rejection_sampler.py delete mode 100644 vllm/model_executor/layers/spec_decode_base_sampler.py delete mode 100644 vllm/model_executor/layers/typical_acceptance_sampler.py delete mode 100644 vllm/model_executor/models/eagle.py delete mode 100644 vllm/spec_decode/__init__.py delete mode 100644 vllm/spec_decode/batch_expansion.py delete mode 100644 vllm/spec_decode/draft_model_runner.py delete mode 100644 vllm/spec_decode/interfaces.py delete mode 100644 vllm/spec_decode/medusa_worker.py delete mode 100644 vllm/spec_decode/metrics.py delete mode 100644 vllm/spec_decode/mlp_speculator_worker.py delete mode 100644 vllm/spec_decode/mqa_scorer.py delete mode 100644 vllm/spec_decode/multi_step_worker.py delete mode 100644 vllm/spec_decode/ngram_worker.py delete mode 100644 vllm/spec_decode/proposer_worker_base.py delete mode 100644 vllm/spec_decode/smaller_tp_proposer_worker.py delete mode 100644 vllm/spec_decode/spec_decode_worker.py delete mode 100644 vllm/spec_decode/target_model_runner.py delete mode 100644 vllm/spec_decode/top1_proposer.py delete mode 100644 vllm/spec_decode/util.py diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index bbbcfb745d5..7f1848b4bfb 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -159,7 +159,6 @@ steps: - tests/distributed/test_utils - tests/distributed/test_pynccl - tests/distributed/test_events - - tests/spec_decode/e2e/test_integration_dist_tp4 - tests/compile/test_basic_correctness - examples/offline_inference/rlhf.py - examples/offline_inference/rlhf_colocate.py @@ -182,7 +181,6 @@ steps: - pytest -v -s compile/test_basic_correctness.py - pytest -v -s distributed/test_pynccl.py - pytest -v -s distributed/test_events.py - - pytest -v -s 
spec_decode/e2e/test_integration_dist_tp4.py # TODO: create a dedicated test section for multi-GPU example tests # when we have multiple distributed example tests - pushd ../examples/offline_inference @@ -330,17 +328,6 @@ steps: - pytest -v -s samplers - VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers -- label: Speculative decoding tests # 40min - mirror_hardwares: [amdexperimental] - source_file_dependencies: - - vllm/spec_decode - - tests/spec_decode - - vllm/model_executor/models/eagle.py - commands: - - pytest -v -s spec_decode/e2e/test_multistep_correctness.py - - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py --ignore=spec_decode/e2e/test_mtp_correctness.py - - pytest -v -s spec_decode/e2e/test_eagle_correctness.py - - label: LoRA Test %N # 15min each mirror_hardwares: [amdexperimental, amdproduction] source_file_dependencies: @@ -726,7 +713,6 @@ steps: - pytest -v -s distributed/test_sequence_parallel.py # this test fails consistently. # TODO: investigate and fix - # - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py - VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py - VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/test_disagg.py - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 97f9e7dc157..8c68bc8f02b 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -43,7 +43,6 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson /tests/multimodal @DarkLight1337 @ywang96 /tests/prefix_caching @comaniac @KuntaiDu /tests/quantization @mgoin @robertgshaw2-redhat -/tests/spec_decode @njhill @LiuXiaoxuanPKU /tests/test_inputs.py @DarkLight1337 @ywang96 /tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm /tests/v1/structured_output @mgoin @russellb @aarnphm diff --git a/.github/mergify.yml b/.github/mergify.yml index fccce82d50d..5c878ac0206 100644 --- a/.github/mergify.yml +++ b/.github/mergify.yml @@ -164,10 +164,7 @@ pull_request_rules: description: Automatically apply speculative-decoding label conditions: - or: - - files~=^vllm/spec_decode/ - files~=^vllm/v1/spec_decode/ - - files=vllm/model_executor/layers/spec_decode_base_sampler.py - - files~=^tests/spec_decode/ - files~=^tests/v1/spec_decode/ - files~=^examples/.*(spec_decode|mlpspeculator|eagle|speculation).*\.py - files~=^vllm/model_executor/models/.*eagle.*\.py diff --git a/pyproject.toml b/pyproject.toml index 85a112ff51c..0c8d2f82d1d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -73,7 +73,6 @@ line-length = 80 "vllm/engine/**/*.py" = ["UP006", "UP035"] "vllm/executor/**/*.py" = ["UP006", "UP035"] "vllm/prompt_adapter/**/*.py" = ["UP006", "UP035"] -"vllm/spec_decode/**/*.py" = ["UP006", "UP035"] "vllm/worker/**/*.py" = ["UP006", "UP035"] # Python 3.8 typing - skip utils for ROCm "vllm/utils/__init__.py" = ["UP006", "UP035"] diff --git a/tests/core/test_serialization.py b/tests/core/test_serialization.py index 8281298d663..ee9ac2129f2 100644 --- a/tests/core/test_serialization.py +++ b/tests/core/test_serialization.py @@ -6,7 +6,7 @@ from vllm.executor.msgspec_utils import decode_hook, encode_hook from vllm.sequence import ExecuteModelRequest -from ..spec_decode.utils import create_batch +from .utils import create_batch def test_msgspec_serialization(): diff --git a/tests/core/utils.py b/tests/core/utils.py index b746c178646..033fffd2c4e 100644 --- a/tests/core/utils.py +++ b/tests/core/utils.py @@ -4,15 +4,16 @@ 
import time from collections import defaultdict from collections.abc import Sequence as GenericSequence -from typing import Any, Optional +from itertools import count +from typing import Any, Optional, Union import torch -from vllm import SamplingParams from vllm.core.scheduler import Scheduler, SchedulerOutputs from vllm.inputs import EncoderDecoderInputs, embeds_inputs, token_inputs from vllm.lora.request import LoRARequest -from vllm.sequence import (Logprob, Sequence, SequenceGroup, +from vllm.sampling_params import SamplingParams +from vllm.sequence import (Logprob, Sequence, SequenceData, SequenceGroup, SequenceGroupMetadata) @@ -262,3 +263,130 @@ def last_schedule_ret( self, ) -> tuple[list[SequenceGroupMetadata], SchedulerOutputs, Any]: _, _, ret = self.call_history["schedule"][-1] return ret + + +def create_seq_group_metadata_from_prompts( + prompts: list[list[int]], + num_gpu_blocks: int, + block_size: int, + final_prompt_lens: list[int], + continuations: Optional[list[list[int]]] = None, + seq_ids: Optional[list[int]] = None, +) -> list[SequenceGroupMetadata]: + + if continuations is None: + continuations = [[] for _ in prompts] + + if seq_ids is None: + seq_ids = list(i for i, _ in enumerate(prompts)) + + free_gpu_blocks = list(range(num_gpu_blocks)) + + block_allocations = { + i: [ + free_gpu_blocks.pop() + for _ in range(round_up_to_next_block(final_len, block_size)) + ] + for i, final_len in enumerate(final_prompt_lens) + } + + seq_grou_metadata_list = [] + for i, (prompt_token_ids, + cont_token_ids) in enumerate(zip(prompts, continuations)): + data = SequenceData.from_seqs(prompt_token_ids, cont_token_ids) + data.update_num_computed_tokens( + len(prompt_token_ids) + len(cont_token_ids) - 1) + seq_data = {i: data} + seq_grou_metadata_list.append( + SequenceGroupMetadata( + request_id=str(i), + is_prompt=len(cont_token_ids) == 0, + seq_data=seq_data, + sampling_params=SamplingParams(temperature=0.0), + block_tables={i: block_allocations[i][:]}, + )) + return seq_grou_metadata_list + + +def create_chunked_seq_group_metadata_from_prompt( + prompt: list[int], + num_gpu_blocks: int, + chunk_size: int, + block_size: int, + seq_id: Optional[int] = None) -> list[SequenceGroupMetadata]: + + if seq_id is None: + seq_id = 0 + + free_gpu_blocks = list(range(num_gpu_blocks)) + + block_allocations = [ + free_gpu_blocks.pop() + for _ in range(round_up_to_next_block(len(prompt), block_size)) + ] + + seq_group_metadata_list = [] + for i, idx in enumerate(range(0, len(prompt), chunk_size)): + chunk_ids = prompt[idx:idx + chunk_size] + data = SequenceData.from_seqs(prompt) + data.update_num_computed_tokens(idx) + seq_data = {i: data} + seq_group_metadata_list.append( + SequenceGroupMetadata( + request_id=str(seq_id), + is_prompt=True, + do_sample=idx + chunk_size >= len(prompt), # terminal chunk + seq_data=seq_data, + sampling_params=SamplingParams(temperature=0.0), + block_tables={i: block_allocations}, + token_chunk_size=len(chunk_ids))) + return seq_group_metadata_list + + +def create_batch(batch_size, + k, + prompt_len: Union[int, list[int]] = 10, + prev_output_token_len: int = 10, + seq_ids: Optional[list[int]] = None, + num_gpu_blocks: Optional[int] = None, + block_size: Optional[int] = None, + prefill_chunk_size: Optional[int] = None): + if block_size is None: + block_size = 8 + + if num_gpu_blocks is None: + num_gpu_blocks = 2048 // block_size + + iterator = count() + + if isinstance(prompt_len, int): + prompt_lens = [prompt_len for _ in range(batch_size)] + else: + prompt_lens = 
prompt_len + + prompts = [[next(iterator) for _ in range(p_len)] for p_len in prompt_lens] + + if prefill_chunk_size: + # Create a batch of chunked prompts. + if not seq_ids: + seq_ids = list(range(len(prompts))) + seq_group_metadata_list = [] + for p, sid in zip(prompts, seq_ids): + seq_group_metadata_list += \ + create_chunked_seq_group_metadata_from_prompt( + p, num_gpu_blocks, prefill_chunk_size, block_size, sid) + seq_group_metadata_list = seq_group_metadata_list[:batch_size] + prev_output_tokens = [] + else: + prev_output_tokens = [[ + next(iterator) for _ in range(prev_output_token_len) + ] for _ in range(batch_size)] + final_prompt_lens = [ + len(prompt) + len(prev_output_token) + k + 1 + for prompt, prev_output_token in zip(prompts, prev_output_tokens) + ] + + seq_group_metadata_list = create_seq_group_metadata_from_prompts( + prompts, num_gpu_blocks, block_size, final_prompt_lens, + prev_output_tokens, seq_ids) + return seq_group_metadata_list, prompts, prev_output_tokens diff --git a/tests/metrics/test_metrics.py b/tests/metrics/test_metrics.py index 7bb5d8980d6..54dbb747de0 100644 --- a/tests/metrics/test_metrics.py +++ b/tests/metrics/test_metrics.py @@ -1,15 +1,12 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import time - import pytest import ray from prometheus_client import REGISTRY import vllm.envs as envs from vllm import EngineArgs, LLMEngine -from vllm.distributed import cleanup_dist_env_and_memory from vllm.engine.arg_utils import AsyncEngineArgs from vllm.engine.async_llm_engine import AsyncLLMEngine from vllm.engine.metrics import RayPrometheusStatLogger @@ -232,149 +229,6 @@ def test_engine_log_metrics_regression( assert_metrics(model, engine, disable_log_stats, len(example_prompts)) -@pytest.mark.parametrize("model", MODELS) -@pytest.mark.parametrize("dtype", ["half"]) -@pytest.mark.parametrize("max_tokens", [10]) -def test_metric_spec_decode( - vllm_runner, - example_prompts, - model: str, - dtype: str, - max_tokens: int, -) -> None: - k = 5 - - with vllm_runner( - model, - dtype=dtype, - disable_log_stats=False, - gpu_memory_utilization=0.4, - speculative_config={ - "model": model, - "num_speculative_tokens": k, - }, - ) as vllm_model: - - # Force log interval to be 0 to catch all metrics. - stat_logger = vllm_model.model.llm_engine.stat_loggers['prometheus'] - stat_logger.local_interval = 0 - - # Note that the purpose of this test is to verify spec decode - # metrics instead of functional correctness, so the expected values - # are intended to be loose. - metric_name_to_expected_fn = { - "gauge_spec_decode_draft_acceptance_rate": lambda v: 0 <= v <= 1, - "gauge_spec_decode_efficiency": lambda v: 0 <= v <= 1, - "counter_spec_decode_num_accepted_tokens": lambda v: 0 <= v <= k, - "counter_spec_decode_num_draft_tokens": lambda v: v == k, - "counter_spec_decode_num_emitted_tokens": - lambda v: 0 <= v <= k + 1, - } - - # Use one request to better inspect the metrics. 
- prompts = example_prompts[:1] - - _ = vllm_model.generate_greedy(prompts, max_tokens) - for metric_name, is_expected in metric_name_to_expected_fn.items(): - metric_val = getattr( - stat_logger.metrics, - metric_name).labels(**stat_logger.labels)._value.get() - assert is_expected(metric_val), ( - f"the value of metric {metric_name} ({metric_val}) " - "does not meet expectation") - - -@pytest.mark.parametrize("model", MODELS) -@pytest.mark.parametrize("dtype", ["half"]) -@pytest.mark.parametrize("max_tokens", [10]) -@pytest.mark.parametrize("log_interval", [1, 3, 5, 7]) -def test_metric_spec_decode_interval( - vllm_runner, - example_prompts, - model: str, - dtype: str, - max_tokens: int, - log_interval: int, -) -> None: - k = 5 - - engine_args = EngineArgs( - model=model, - dtype=dtype, - disable_log_stats=False, - gpu_memory_utilization=0.4, - speculative_config={ - "model": model, - "num_speculative_tokens": k, - }, - enforce_eager=True, - ) - - engine = LLMEngine.from_engine_args(engine_args) - - try: - - engine.add_request( - "request-id-0", - example_prompts[0], - SamplingParams(max_tokens=max_tokens), - ) - - # set log internal - stat_logger = engine.stat_loggers['prometheus'] - stat_logger.local_interval = log_interval - - # prefill - engine.step() - - # wait for 5 seconds to ensure that spec decode metrics - # get triggered in first decode step - time.sleep(5) - - # first decode step should trigger async collection of metrics - engine.step() - - # wait one second to allow H2D transfer to finish - time.sleep(1) - - # second decode step should now be able to collect the spec - # decode stats and the request should also be finished - engine.step() - - # must have finisehd now - assert not engine.has_unfinished_requests() - - # wait to ensure logging occurs - time.sleep(log_interval) - - # force logging - engine.step() - - # Note that the purpose of this test is to verify spec decode - # metrics instead of functional correctness, so the expected values - # are intended to be loose. 
- metric_name_to_expected_fn = { - "gauge_spec_decode_draft_acceptance_rate": lambda v: 0 <= v <= 1, - "gauge_spec_decode_efficiency": lambda v: 0 <= v <= 1, - "counter_spec_decode_num_accepted_tokens": lambda v: 0 <= v <= k, - "counter_spec_decode_num_draft_tokens": lambda v: v == k, - "counter_spec_decode_num_emitted_tokens": - lambda v: 0 <= v <= k + 1, - } - - for metric_name, is_expected in metric_name_to_expected_fn.items(): - metric_val = getattr( - stat_logger.metrics, - metric_name).labels(**stat_logger.labels)._value.get() - assert is_expected(metric_val), ( - f"the value of metric {metric_name} ({metric_val}) " - "does not meet expectation") - - finally: - del engine - cleanup_dist_env_and_memory() - - def assert_metrics(model: str, engine: LLMEngine, disable_log_stats: bool, num_requests: int) -> None: if disable_log_stats: diff --git a/tests/models/registry.py b/tests/models/registry.py index 56ae501021f..3ffa7f81a1a 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -457,12 +457,12 @@ def check_available_online( _SPECULATIVE_DECODING_EXAMPLE_MODELS = { - "EAGLEModel": _HfExamplesInfo("JackFram/llama-68m", - speculative_model="abhigoyal/vllm-eagle-llama-68m-random"), # noqa: E501 "MedusaModel": _HfExamplesInfo("JackFram/llama-68m", speculative_model="abhigoyal/vllm-medusa-llama-68m-random"), # noqa: E501 - "MLPSpeculatorPreTrainedModel": _HfExamplesInfo("JackFram/llama-160m", - speculative_model="ibm-ai-platform/llama-160m-accelerator"), # noqa: E501 + # Temporarily disabled. + # TODO(woosuk): Re-enable this once the MLP Speculator is supported in V1. + # "MLPSpeculatorPreTrainedModel": _HfExamplesInfo("JackFram/llama-160m", + # speculative_model="ibm-ai-platform/llama-160m-accelerator"), # noqa: E501 "DeepSeekMTPModel": _HfExamplesInfo("luccafong/deepseek_mtp_main_random", speculative_model="luccafong/deepseek_mtp_draft_random", # noqa: E501 trust_remote_code=True), diff --git a/tests/models/test_registry.py b/tests/models/test_registry.py index 01b2260abe8..1ce90070c5c 100644 --- a/tests/models/test_registry.py +++ b/tests/models/test_registry.py @@ -72,11 +72,15 @@ def test_registry_model_property(model_arch, is_mm, init_cuda, is_ce): @create_new_process_for_each_test() -@pytest.mark.parametrize("model_arch,is_pp,init_cuda", [ - ("MLPSpeculatorPreTrainedModel", False, False), - ("DeepseekV2ForCausalLM", True, False), - ("Qwen2VLForConditionalGeneration", True, True), -]) +@pytest.mark.parametrize( + "model_arch,is_pp,init_cuda", + [ + # TODO(woosuk): Re-enable this once the MLP Speculator is supported + # in V1. 
+ # ("MLPSpeculatorPreTrainedModel", False, False), + ("DeepseekV2ForCausalLM", True, False), + ("Qwen2VLForConditionalGeneration", True, True), + ]) def test_registry_is_pp(model_arch, is_pp, init_cuda): assert ModelRegistry.is_pp_supported_model(model_arch) is is_pp diff --git a/tests/samplers/test_rejection_sampler.py b/tests/samplers/test_rejection_sampler.py deleted file mode 100644 index 3b93c64113d..00000000000 --- a/tests/samplers/test_rejection_sampler.py +++ /dev/null @@ -1,577 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""Tests for rejection sampling.""" - -import pytest -import torch -import torch.nn.functional as F - -from vllm.model_executor.layers.rejection_sampler import RejectionSampler -from vllm.model_executor.utils import set_random_seed - - -@pytest.fixture(scope="function", autouse=True) -def use_v0_only(monkeypatch): - """ - This file tests V0 internals, so set VLLM_USE_V1=0. - """ - monkeypatch.setenv('VLLM_USE_V1', '0') - - -CUDA_DEVICES = [ - f"cuda:{i}" for i in range(1 if torch.cuda.device_count() == 1 else 2) -] - - -def mock_causal_accepted_tensor( - k: int, last_accepted_indices: torch.Tensor) -> torch.Tensor: - """Generate an "accepted" tensor which should yield causally-accepted tokens - up to last accepted indices. - - Tokens after last_accepted_indices+1 may also be accepted, although they - will not be causally accepted. - """ - batch_size = last_accepted_indices.shape[0] - - accepted = (torch.arange(k).expand(batch_size, k) - <= last_accepted_indices.unsqueeze(-1).broadcast_to( - batch_size, k)) - - # Sprinkle accepted values after the contiguous initial accepted values. - # This replicates the behavior of rejection sampling, which may "accept" - # a token that cannot be accepted because of causality. - sprinkle_candidates = (torch.arange(k).expand( - batch_size, - k) > last_accepted_indices.unsqueeze(-1).broadcast_to(batch_size, k) + - 1) - sprinkle = torch.rand(batch_size, k) > 0.5 - accepted[sprinkle_candidates] = sprinkle[sprinkle_candidates] - return accepted - - -@pytest.mark.parametrize("seed", list(range(10))) -@pytest.mark.parametrize( - "which_tokens_accepted", - ["all_tokens_accepted", "no_tokens_accepted", "some_tokens_accepted"]) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@pytest.mark.parametrize("use_flashinfer", [True, False]) -@torch.inference_mode() -def test_correct_output_format(which_tokens_accepted: str, seed: int, - device: str, use_flashinfer: bool): - """Verify the output has correct format given predetermined accepted matrix. 
- """ - set_random_seed(seed) - torch.set_default_device(device) - - batch_size = 10 - k = 5 - vocab_size = 3000 - - if which_tokens_accepted == "all_tokens_accepted": - accepted = mock_causal_accepted_tensor( - k, -1 + k * torch.ones((batch_size, ), dtype=torch.long)) - elif which_tokens_accepted == "no_tokens_accepted": - accepted = mock_causal_accepted_tensor( - k, -torch.ones((batch_size, ), dtype=torch.long)) - elif which_tokens_accepted == "some_tokens_accepted": - last_accepted_indices = torch.randint(low=-1, - high=k, - size=(batch_size, )) - accepted = mock_causal_accepted_tensor(k, last_accepted_indices) - else: - raise AssertionError() - - recovered_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - - rejection_sampler = RejectionSampler(use_flashinfer=use_flashinfer) - rejection_sampler.init_gpu_tensors(device=device) - output_token_ids = rejection_sampler._create_output( # pylint: disable=protected-access - accepted, - recovered_token_ids, - draft_token_ids, - bonus_token_ids, - ) - - expected_bonus_token_ids = bonus_token_ids.clone() - - if which_tokens_accepted == "all_tokens_accepted": - # Expect all tokens to be equal to draft tokens. - assert torch.equal(output_token_ids[:, :-1], draft_token_ids) - - # Expect all bonus tokens to be included. - assert torch.equal(output_token_ids[:, -1:], expected_bonus_token_ids) - elif which_tokens_accepted == "no_tokens_accepted": - # Expect first token to be equal to recovered tokens. - assert torch.equal(output_token_ids[:, 0], recovered_token_ids[:, 0]) - - # Expect everything else to be -1. - assert torch.equal(output_token_ids[:, 1:], - torch.ones_like(output_token_ids[:, 1:]) * -1) - elif which_tokens_accepted == "some_tokens_accepted": - recovered_plus_bonus = torch.cat( - (recovered_token_ids, expected_bonus_token_ids), dim=-1) - # Assert first rejected token is a recovered token or bonus token. - assert torch.equal( - recovered_plus_bonus[torch.arange(0, batch_size), - last_accepted_indices + 1], - output_token_ids[torch.arange(0, batch_size), - last_accepted_indices + 1]) - - # Assert every subsequent token is -1. 
- subsequent_mask = torch.arange(0, k + 1).expand( - batch_size, k + 1) >= (last_accepted_indices + 2).unsqueeze(-1) - assert torch.all(output_token_ids[subsequent_mask] == -1) - - -@pytest.mark.parametrize("k", list(range(1, 6))) -@pytest.mark.parametrize("vocab_size", [30_000, 50_000]) -@pytest.mark.parametrize("batch_size", list(range(1, 32))) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@pytest.mark.parametrize("use_flashinfer", [True, False]) -@torch.inference_mode() -def test_no_crash_with_varying_dims(k: int, vocab_size: int, batch_size: int, - device: str, use_flashinfer: bool): - torch.set_default_device(device) - rejection_sampler = RejectionSampler(use_flashinfer=use_flashinfer) - rejection_sampler.init_gpu_tensors(device=device) - - draft_probs = torch.rand(batch_size, k, vocab_size, dtype=torch.float32) - target_probs = torch.rand(batch_size, - k + 1, - vocab_size, - dtype=torch.float32) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - - rejection_sampler(target_probs, bonus_token_ids, draft_probs, - draft_token_ids) - - -@pytest.mark.parametrize("frac_seeded", [0.0, 0.25, 0.5, 1.0]) -@pytest.mark.parametrize("k", [1, 3, 6]) -@pytest.mark.parametrize("vocab_size", [30_000, 50_000]) -@pytest.mark.parametrize("batch_size", [1, 8, 32, 128]) -@pytest.mark.parametrize("n_rep", [100]) -@pytest.mark.parametrize("device", CUDA_DEVICES) -# @pytest.mark.parametrize("use_flashinfer", [True, False]) -# Not testing FlashInfer now, since 0.2.3 API removed the ability -# to pass in uniform samples. -@pytest.mark.parametrize("use_flashinfer", [False]) -@torch.inference_mode() -def test_deterministic_when_seeded(k: int, vocab_size: int, batch_size: int, - frac_seeded: float, n_rep: int, device: str, - use_flashinfer: bool): - torch.set_default_device(device) - rejection_sampler = RejectionSampler(use_flashinfer=use_flashinfer) - rejection_sampler.init_gpu_tensors(device=device) - - draft_probs = torch.rand(batch_size, k, vocab_size, dtype=torch.float32) - target_probs = torch.rand(batch_size, - k + 1, - vocab_size, - dtype=torch.float32) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - - seeded_mask = torch.rand(batch_size, dtype=torch.float32) <= frac_seeded - - results = [] - for _ in range(n_rep): - seeded_seqs = { - i: torch.Generator(device=device).manual_seed(i) - for i in range(batch_size) if seeded_mask[i] - } - results.append( - rejection_sampler(target_probs, bonus_token_ids, draft_probs, - draft_token_ids, seeded_seqs)) - - for i in range(batch_size): - if seeded_mask[i]: - for j in range(1, n_rep): - assert torch.equal(results[j][i], results[0][i]) - - -@pytest.mark.parametrize("k", [1, 3, 6]) -@pytest.mark.parametrize("vocab_size", [30_000, 50_000]) -@pytest.mark.parametrize("batch_size", [3, 8, 32, 128]) -@pytest.mark.parametrize("device", CUDA_DEVICES) -# @pytest.mark.parametrize("use_flashinfer", [True, False]) -# Not testing FlashInfer now, since 0.2.3 API removed the ability -# to pass in uniform samples. 
-@pytest.mark.parametrize("use_flashinfer", [False]) -@torch.inference_mode() -def test_mixed_seeded_batch(k: int, vocab_size: int, batch_size: int, - device: str, use_flashinfer: bool): - torch.set_default_device(device) - set_random_seed(0) - draft_probs = torch.rand(batch_size, k, vocab_size, dtype=torch.float32) - target_probs = torch.rand(batch_size, - k + 1, - vocab_size, - dtype=torch.float32) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - - single_batches = [] - for i in range(batch_size): - single_batches.append((draft_probs[i].clone().unsqueeze(0), - draft_token_ids[i].clone().unsqueeze(0), - target_probs[i].clone().unsqueeze(0), - bonus_token_ids[i].clone().unsqueeze(0), - draft_token_ids[i].clone().unsqueeze(0))) - - set_random_seed(0) - rejection_sampler = RejectionSampler(use_flashinfer=use_flashinfer) - rejection_sampler.init_gpu_tensors(device=device) - - results = [] - seeded_seqs = { - i: torch.Generator(device=device).manual_seed(i) - for i in range(1, batch_size) # 0 is seed None - } - batch_result = rejection_sampler(target_probs.clone(), - bonus_token_ids.clone(), - draft_probs.clone(), - draft_token_ids.clone(), seeded_seqs) - - set_random_seed(0) - - rejection_sampler = RejectionSampler(use_flashinfer=use_flashinfer) - rejection_sampler.init_gpu_tensors(device=device) - for i in range(batch_size): - request_seeded_seqs = { - 0: torch.Generator(device=device).manual_seed(i) - } if seeded_seqs.get(i) is not None else None - (draft_probs, draft_token_ids, target_probs, bonus_token_ids, - draft_token_ids) = single_batches[i] - results.append( - rejection_sampler(target_probs, bonus_token_ids, draft_probs, - draft_token_ids, request_seeded_seqs)) - for i in range(batch_size): - assert torch.equal(batch_result[i], results[i].squeeze(0)) - - -@pytest.mark.parametrize("k", [1, 3, 6]) -@pytest.mark.parametrize("vocab_size", [30_000, 50_000]) -@pytest.mark.parametrize("batch_size", [1, 8, 32, 128]) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_compare_nonflashinfer_backend(k: int, vocab_size: int, - batch_size: int, device: str): - """ - Test the flashinfer and nonflashinfer backend generate - the same output metrics. - """ - - pytest.skip("Not testing FlashInfer now, since 0.2.3 API removed " - "the ability to pass in uniform samples.") - - torch.set_default_device(device) - torch.manual_seed(0) - draft_probs = torch.rand(batch_size, k, vocab_size, dtype=torch.float32) - target_probs = torch.rand(batch_size, - k + 1, - vocab_size, - dtype=torch.float32) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - - num_accepted_tokens = [] - num_emitted_tokens = [] - num_draft_tokens = [] - - def get_seeded_seqs(): - return { - i: torch.Generator(device=device).manual_seed(i) - for i in range(batch_size) - } - - for use_flashinfer in [True, False]: - rejection_sampler = RejectionSampler(use_flashinfer=use_flashinfer) - rejection_sampler.init_gpu_tensors(device=device) - # We use seeded sequences to ensure the same tokens are accepted - # for both flashinfer and nonflashinfer backends. 
- seeded_seqs = get_seeded_seqs() - rejection_sampler(target_probs, bonus_token_ids, draft_probs, - draft_token_ids, seeded_seqs) - num_accepted_tokens.append(rejection_sampler.num_accepted_tokens) - num_emitted_tokens.append(rejection_sampler.num_emitted_tokens) - num_draft_tokens.append(rejection_sampler.num_draft_tokens) - - assert num_accepted_tokens[0] == num_accepted_tokens[1] - assert num_emitted_tokens[0] == num_emitted_tokens[1] - assert num_draft_tokens[0] == num_draft_tokens[1] - - -@pytest.mark.parametrize("above_or_below_vocab_range", ["above", "below"]) -@pytest.mark.parametrize("which_token_ids", - ["bonus_token_ids", "draft_token_ids"]) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@pytest.mark.parametrize("use_flashinfer", [True, False]) -@torch.inference_mode() -def test_raises_when_vocab_oob(above_or_below_vocab_range: str, - which_token_ids: str, device: str, - use_flashinfer: bool): - k = 3 - batch_size = 5 - vocab_size = 30_000 - torch.set_default_device(device) - - rejection_sampler = RejectionSampler(use_flashinfer=use_flashinfer, - strict_mode=True) - rejection_sampler.init_gpu_tensors(device=device) - - draft_probs = torch.rand(batch_size, k, vocab_size, dtype=torch.float32) - target_probs = torch.rand(batch_size, - k + 1, - vocab_size, - dtype=torch.float32) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - - oob_token_ids = None - if which_token_ids == "bonus_token_ids": - oob_token_ids = bonus_token_ids - elif which_token_ids == "draft_token_ids": - oob_token_ids = draft_token_ids - else: - raise AssertionError() - - if above_or_below_vocab_range == "above": - rogue_token_id = vocab_size + 1 - elif above_or_below_vocab_range == "below": - rogue_token_id = -1 - else: - raise AssertionError() - - oob_token_ids[0][0] = rogue_token_id - - with pytest.raises(AssertionError): - rejection_sampler(target_probs, bonus_token_ids, draft_probs, - draft_token_ids) - - -@pytest.mark.parametrize("draft_and_target_probs_equal", [True, False]) -@pytest.mark.parametrize("seed", list(range(5))) -@pytest.mark.parametrize("use_flashinfer", [True, False]) -@torch.inference_mode() -def test_rejection_sampling_approximates_target_distribution( - seed: int, draft_and_target_probs_equal: bool, use_flashinfer: bool): - """Verify rejection sampling approximates target distribution, - despite sampling from a potentially distinct draft distribution. - - This is done by first creating a random target probability - distribution and a random draft probability distribution. We then - sample token ids from the rejection sampler using these draft - and target distributions. The samples are used to estimate - the output probability distribution, which we expect to approximate - the target distribution. - - A basic distance metric is used to determine similarity between - distributions. - - We expect that as we increase the number of samples, - the distance between the observed distribution and the target - distribution decreases. To measure this, we compare the distance - of the observed distribution against both the target distribution - and a uniform random distribution. We expect the distance between - the observed distribution and the target distribution to improve - much more than the distance improvement between the observed - distribution and the random distribution. 
- - When draft_and_target_probs_equal=True, the draft and target - probabilities are exactly equal. Rejection sampling should - still work without any NaNs or exceptions. - """ - torch.set_default_device("cpu") - set_random_seed(seed) - helper = _CorrectnessTestHelper( - vocab_size=10, - rejection_sampler=RejectionSampler(use_flashinfer=use_flashinfer), - ) - - draft_probs, target_probs, reference_probs = helper.generate_probs_for_test( - draft_and_target_probs_equal) - - sample_sizes = [10, 100, 1_000, 10_000, 100_000] - distance_wrt_reference: list[float] = [] - distance_wrt_target: list[float] = [] - - for num_samples in sample_sizes: - (reference_vs_rejsample_dist, - target_vs_rejsample_dist) = helper.run_and_compare_distributions( - draft_probs, - target_probs, - reference_probs, - num_samples, - ) - - distance_wrt_reference.append(reference_vs_rejsample_dist) - distance_wrt_target.append(target_vs_rejsample_dist) - - relative_change_in_distance_wrt_target = get_ratio_first_to_last( - distance_wrt_target) - relative_change_in_distance_wrt_reference = get_ratio_first_to_last( - distance_wrt_reference) - - print(f"{num_samples=} {target_vs_rejsample_dist=:.05f} " - f"{reference_vs_rejsample_dist=:.05f}") - print(f"{num_samples=} {relative_change_in_distance_wrt_target=:.02f} " - f"{relative_change_in_distance_wrt_reference=:.02f}") - - relative_change_in_distance_wrt_target = get_ratio_first_to_last( - distance_wrt_target) - relative_change_in_distance_wrt_reference = get_ratio_first_to_last( - distance_wrt_reference) - - expected_improvement_multiplier = 20 - assert (relative_change_in_distance_wrt_target - > relative_change_in_distance_wrt_reference * - expected_improvement_multiplier) - - -def get_ratio_first_to_last(elements: list[float]) -> float: - return elements[0] / elements[-1] - - -class _CorrectnessTestHelper: - """Class that packages together logic required for the unit-level - rejection sampling correctness test. - """ - - def __init__(self, vocab_size: int, rejection_sampler: RejectionSampler): - self.rejection_sampler = rejection_sampler - self.vocab_size = vocab_size - self.vocab_range = (0, vocab_size) - - self.rejection_sampler.init_gpu_tensors(device=0) - - # Keep test simple, use k=1 - self.k = 1 - - # Bonus tokens not used, but rejection sampler requires - # correct shape. - self.num_bonus_tokens = 1 - - def generate_probs_for_test( - self, draft_and_target_probs_equal: bool - ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: - draft_probs, target_probs = (F.softmax( - torch.rand(self.vocab_size, dtype=torch.float32), - dim=-1, - ) for _ in range(2)) - - num_reference_probs = 100 - reference_probs = F.softmax( - torch.rand(num_reference_probs, - self.vocab_size, - dtype=torch.float32), - dim=-1, - ) - - if draft_and_target_probs_equal: - target_probs = draft_probs.clone() - - return draft_probs, target_probs, reference_probs - - def run_and_compare_distributions(self, draft_probs: torch.Tensor, - target_probs: torch.Tensor, - reference_probs: torch.Tensor, - num_samples: int) -> tuple[float, float]: - # Sample using rejection sampling. - rej_sample_probs = self._estimate_rejection_sampling_pdf( - draft_probs, target_probs, num_samples) - - # Average distance from reference probs. 
- reference_vs_rejsample_dist = torch.dist( - reference_probs, - rej_sample_probs).item() / reference_probs.shape[0] - target_vs_rejsample_dist = torch.dist(target_probs, - rej_sample_probs).item() - - return reference_vs_rejsample_dist, target_vs_rejsample_dist - - def _estimate_rejection_sampling_pdf( - self, - draft_probs: torch.Tensor, - target_probs: torch.Tensor, - num_samples: int, - ) -> torch.Tensor: - # Repeat draft probs num_samples times. - draft_probs = draft_probs.reshape(1, self.k, self.vocab_size).repeat( - num_samples, 1, 1) - - # Repeat target probs num_samples * (k + 1) times. - # Rejection sampler requires bonus token probs, but they aren't used. - target_probs = target_probs.reshape(1, 1, self.vocab_size).repeat( - num_samples, self.k + 1, 1) - - # Randomly sample draft token ids from draft probs. - draft_token_ids = torch.multinomial(draft_probs[:, 0, :], - num_samples=1, - replacement=True).reshape( - num_samples, self.k) - - # Bonus tokens not used but required. - bonus_token_ids = torch.zeros((1, self.num_bonus_tokens), - dtype=torch.int64, - device="cuda").repeat(num_samples, 1) - - # Get output tokens via rejection sampling. - output_token_ids = self.rejection_sampler(target_probs.to("cuda"), - bonus_token_ids.to("cuda"), - draft_probs.to("cuda"), - draft_token_ids.to("cuda")) - - # Remove bonus tokens - output_token_ids = output_token_ids[:, :-1].flatten() - - # Estimate probability density function - hist = torch.histogram(output_token_ids.to(dtype=torch.float, - device="cpu"), - bins=self.vocab_size, - range=self.vocab_range, - density=True) - - return hist.hist diff --git a/tests/samplers/test_typical_acceptance_sampler.py b/tests/samplers/test_typical_acceptance_sampler.py deleted file mode 100644 index 119841470bf..00000000000 --- a/tests/samplers/test_typical_acceptance_sampler.py +++ /dev/null @@ -1,480 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""Tests for rejection sampling.""" - -import pytest -import torch - -from vllm.model_executor.layers.typical_acceptance_sampler import ( - TypicalAcceptanceSampler) -from vllm.model_executor.utils import set_random_seed - -CUDA_DEVICES = [f"cuda:{i}" for i in range(1)] - - -@pytest.fixture(scope="function", autouse=True) -def use_v0_only(monkeypatch): - """ - This file tests V0 internals, so set VLLM_USE_V1=0. - """ - monkeypatch.setenv('VLLM_USE_V1', '0') - - -def get_zero_temperature_prob_dist(batch_size, k, vocab_size): - """ - Generates a fake temperature zero probability distribution. - Returns: - 1. A fake temperature zero probability distribution of shape - [batch_size, k, vocab_size] - 2. Tensor of shape [batch_size, k] containing the token ids - of the probability 1.0 tokens at each position. - """ - # Simulate temperature 0 probability distribution for target probabilities - # and create target probabilities such that only 1 token id has - # probability 1.0 - target_probs = torch.rand(batch_size, k, vocab_size, dtype=torch.float32) - probs = torch.rand(batch_size, k, vocab_size) - _, zero_temperature_token_ids = torch.max(probs, dim=-1) - # set the probability of the tokens with ids in zero_temperature_token_ids - # to 1 and the rest to 0. 
- target_probs = torch.zeros_like(probs).scatter_( - -1, zero_temperature_token_ids.unsqueeze(-1), 1.0) - return target_probs, zero_temperature_token_ids - - -def get_draft_token_ids(batch_size: int, k: int, vocab_size: int, - token_ids_to_exclude: torch.Tensor): - """ - Returns a tensor of shape [batch_size, k] of fake draft token ids - drawn randomly from a vocab of size vocab_size. We however ensure - that token_ids from token_ids_to_exclude are excluded at the - corresponding positions. - """ - draft_token_ids = torch.empty(batch_size, k, dtype=torch.long) - for i in range(batch_size): - for j in range(k): - # Generate a random token ID excluding token_ids_to_exclude[i, j] - while True: - token_id = torch.randint(0, vocab_size, (1, )).item() - if token_id != token_ids_to_exclude[i, j]: - draft_token_ids[i, j] = token_id - break - return draft_token_ids - - -def get_acceptance_sampler( - posterior_threshold: float = 0.03, - posterior_alpha: float = 0.9, - strict_mode: bool = False, -) -> TypicalAcceptanceSampler: - """ - Initializes and returns a TypicalAcceptanceSampler. - """ - return TypicalAcceptanceSampler(posterior_threshold, posterior_alpha, - strict_mode) - - -@pytest.mark.parametrize("k", list(range(1, 6))) -@pytest.mark.parametrize("vocab_size", [30_000, 50_000]) -@pytest.mark.parametrize("batch_size", list(range(1, 32))) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_no_crash_with_varying_dims(k: int, vocab_size: int, batch_size: int, - device: str): - """ - Tests that the TypicalAcceptancSampler forward succeeds for - different combinations of k, vocab_size, batch_size and num devices. - """ - torch.set_default_device(device) - typical_acceptance_sampler = get_acceptance_sampler() - typical_acceptance_sampler.init_gpu_tensors(device=device) - target_with_bonus_probs = torch.rand(batch_size, - k + 1, - vocab_size, - dtype=torch.float32) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - # Verify that sampling succeeds for all cases. - typical_acceptance_sampler(target_with_bonus_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - - -@pytest.mark.parametrize("above_or_below_vocab_range", ["above", "below"]) -@pytest.mark.parametrize("which_token_ids", - ["bonus_token_ids", "draft_token_ids"]) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_raises_when_vocab_oob(above_or_below_vocab_range: str, - which_token_ids: str, device: str): - """ - Tests that we throw an exception of the token ids fall outside - the bound of the provided vocabulary. - """ - k = 3 - batch_size = 5 - vocab_size = 30_000 - torch.set_default_device(device) - typical_acceptance_sampler = get_acceptance_sampler(strict_mode=True) - typical_acceptance_sampler.init_gpu_tensors(device=device) - target_with_bonus_probs = torch.rand(batch_size, - k + 1, - vocab_size, - dtype=torch.float32) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - # Verify that appropriate exceptions are thrown for out - # of bound vocabs. 
- oob_token_ids = None - if which_token_ids == "bonus_token_ids": - oob_token_ids = bonus_token_ids - elif which_token_ids == "draft_token_ids": - oob_token_ids = draft_token_ids - else: - raise AssertionError() - - if above_or_below_vocab_range == "above": - rogue_token_id = vocab_size + 1 - elif above_or_below_vocab_range == "below": - rogue_token_id = -1 - else: - raise AssertionError() - - oob_token_ids[0][0] = rogue_token_id - - with pytest.raises(AssertionError): - typical_acceptance_sampler(target_with_bonus_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - - -@pytest.mark.parametrize("seed", list(range(10))) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_uniform_target_distribution_accepts_all_tokens( - seed: int, device: str): - """ - Test the TypicalAcceptanceSampler with a uniform target probability - distribution. - - This test verifies that when provided with a uniform target probability - distribution, the TypicalAcceptanceSampler accepts all draft tokens. The - entropy of the uniform target distribution being high should lead to all - draft tokens being accepted. - """ - set_random_seed(seed) - k = 3 - batch_size = 5 - vocab_size = 30_000 - torch.set_default_device(device) - typical_acceptance_sampler = get_acceptance_sampler(strict_mode=True) - typical_acceptance_sampler.init_gpu_tensors(device=device) - target_with_bonus_probs = torch.rand(batch_size, - k + 1, - vocab_size, - dtype=torch.float32) - draft_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - output_token_ids = typical_acceptance_sampler( - target_with_bonus_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - # We are using a uniform target probability distribution. - # For a uniform distribution the entropy is very high and it - # should lead to all draft tokens being accepted. Verify that. - assert output_token_ids.shape[0] == batch_size - assert output_token_ids.shape[1] == (k + 1) - assert torch.all(output_token_ids[:, -1] == bonus_token_ids.squeeze()) - - assert torch.all(output_token_ids[:, :k] == draft_token_ids) - - -@pytest.mark.parametrize("seed", list(range(10))) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_temperature_zero_target_distribution(seed: int, device: str): - """ - Test the TypicalAcceptanceSampler with a zero-temperature target - probability distribution. - - This test verifies that when using a zero-temperature target probability - distribution, where only one token has a probability of 1.0, the - TypicalAcceptanceSampler correctly rejects all draft tokens that do not - match this probability. Additionally, it ensures that when all draft - tokens are rejected, the sampler falls back to greedy sampling to select a - single token from the target distribution. 
- """ - set_random_seed(seed) - k = 3 - batch_size = 5 - vocab_size = 30_000 - torch.set_default_device(device) - - typical_acceptance_sampler = get_acceptance_sampler(strict_mode=True) - typical_acceptance_sampler.init_gpu_tensors(device=device) - # Simulate temperature 0 probability distribution for target probabilities - # and create target probabilities such that only 1 token id has - # probability 1.0 - target_with_bonus_probs, zero_temperature_token_ids = \ - get_zero_temperature_prob_dist(batch_size, k + 1, vocab_size) - zero_temperature_token_ids = zero_temperature_token_ids[:, :-1] - # Populate draft_token_ids such that they exclude the token_ids - # with probability = 1.0 - draft_token_ids = get_draft_token_ids(batch_size, k, vocab_size, - zero_temperature_token_ids) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - # The target probaility distribution is a temperature zero distribution - # with zero entropy. Since our draft token ids don't match the probability - # 1.0 tokens in the target distribution we will reject all of them and - # fallback to the greedy sampling for selecting 1 token for each sequence. - # Verify the same. - output_token_ids = typical_acceptance_sampler( - target_with_bonus_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - assert output_token_ids.shape[0] == batch_size - assert output_token_ids.shape[1] == (k + 1) - assert torch.all(output_token_ids[:, -1] == -1) - assert torch.all(output_token_ids[:, 0] == zero_temperature_token_ids[:, - 0]) - - -@pytest.mark.parametrize("seed", list(range(10))) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_mixed_target_distribution(seed: int, device: str): - """ - Test the TypicalAcceptanceSampler with a mixed target probability - distribution. - - This test ensures that the TypicalAcceptanceSampler handles a mixed - target probability distribution correctly. Specifically, it uses a - zero-temperature distribution for some sequences and a uniform - distribution for others. The test verifies that: - - - For sequences with a zero-temperature distribution, only the token - with a probability of 1.0 is accepted, and all other tokens are rejected. - - For sequences with a uniform distribution, all draft tokens are - accepted. - """ - set_random_seed(seed) - k = 3 - batch_size = 4 - vocab_size = 30_000 - torch.set_default_device(device) - typical_acceptance_sampler = get_acceptance_sampler(strict_mode=True) - typical_acceptance_sampler.init_gpu_tensors(device=device) - # For sequences 0 and 2 set the distribution to a temperature - # zero distribution. For sequences 1 and 3 set it to a uniform - # distribution. 
- target_with_bonus_probs, zero_temperature_token_ids = \ - get_zero_temperature_prob_dist(batch_size, k + 1, vocab_size) - zero_temperature_token_ids = zero_temperature_token_ids[:, :-1] - target_probs = target_with_bonus_probs[:, :-1] - draft_token_ids = get_draft_token_ids(batch_size, k, vocab_size, - zero_temperature_token_ids) - uniform_probs = torch.rand(2, k, vocab_size, dtype=torch.float32) - target_probs[[1, 3]] = uniform_probs - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - output_token_ids = typical_acceptance_sampler( - target_with_bonus_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - # verify the shape of output_token_ids - assert output_token_ids.shape[0] == batch_size - assert output_token_ids.shape[1] == (k + 1) - # For sequences 0 and 2 verify that only 1 token is accepted - # which is the token with probability 1.0 in the target distribution - # at position 0. - assert torch.all(output_token_ids[[0, 2], 1:] == -1) - assert (torch.all(output_token_ids[[0, 2], - 0] == zero_temperature_token_ids[[0, 2], - 0])) - # For sequences 1 and 3 verify that all tokens are accepted since the - # target probability distribution is uniform. In addition verify that - # we also accept the bonus tokens. - assert torch.all( - output_token_ids[[1, 3], :-1] == draft_token_ids[[1, 3], :]) - assert torch.all(output_token_ids[[1, 3], -1] != -1) - - -@pytest.mark.parametrize("seed", list(range(10))) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_accept_tokens_partially(seed: int, device: str): - """ - Test the TypicalAcceptanceSampler's behavior when only a subset of draft - tokens should be accepted. - - This test verifies that the TypicalAcceptanceSampler correctly accepts or - rejects draft tokens based on a zero-temperature target probability - distribution. Specifically, it ensures that: - - - When all draft tokens match tokens with a probability of 1.0 in the - target distribution, all draft tokens are accepted. - - When only some draft tokens match tokens with a probability of 1.0 in - the target distribution, only those matching tokens are accepted, and the - rest are rejected. - """ - set_random_seed(seed) - k = 5 - batch_size = 1 - vocab_size = 30_000 - torch.set_default_device(device) - typical_acceptance_sampler = get_acceptance_sampler(strict_mode=True) - typical_acceptance_sampler.init_gpu_tensors(device=device) - # Create a temperature zero target probability distribution and ensure - # all draft token ids correspond to the tokens with 1.0 probability. - # Verify that all of them are accepted. - target_with_bonus_probs, zero_temperature_token_ids = \ - get_zero_temperature_prob_dist(batch_size, k + 1, vocab_size) - zero_temperature_token_ids = zero_temperature_token_ids[:, :-1] - draft_token_ids = zero_temperature_token_ids - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - output_token_ids = typical_acceptance_sampler( - target_with_bonus_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - assert output_token_ids.shape[0] == batch_size - assert output_token_ids.shape[1] == (k + 1) - assert torch.all(output_token_ids[:, 0:-1] == draft_token_ids) - assert torch.all(output_token_ids[:, -1] == bonus_token_ids) - # Next only keep the first 2 draft tokens same as the zero temperature - # tokens. For the remaining 3 choose some other tokens. 
In the - # response we will expect the first 2 tokens to be the same as the - # draft tokens and the recovered token and rest as -1 - draft_token_ids_to_replace = get_draft_token_ids( - batch_size, k, vocab_size, zero_temperature_token_ids) - draft_token_ids = torch.cat( - (draft_token_ids[:, :2], draft_token_ids_to_replace[:, -3:]), dim=1) - output_token_ids = typical_acceptance_sampler( - target_with_bonus_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - assert output_token_ids.shape[0] == batch_size - assert output_token_ids.shape[1] == (k + 1) - assert torch.all(output_token_ids[:, :2] == draft_token_ids[:, :2]) - assert torch.all( - output_token_ids[:, 2] == target_with_bonus_probs.argmax(-1)[:, 2]) - assert torch.all(output_token_ids[:, -3:] == -1) - - -@pytest.mark.parametrize("seed", list(range(1))) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_accept_tokens_set_non_default_posteriors(seed: int, device: str): - """ - Test the TypicalAcceptanceSampler with custom posterior thresholds and - alpha values. This test verifies that by modifying the posterior - thresholds and alpha values we can change the acceptance behavior of the - sampler. - """ - set_random_seed(seed) - k = 5 - batch_size = 1 - vocab_size = 30_000 - torch.set_default_device(device) - typical_acceptance_sampler = get_acceptance_sampler(strict_mode=True) - typical_acceptance_sampler.init_gpu_tensors(device=device) - # Simulate temperature 0 probability distribution for target - # probabilities and create target probabilities such that only 1 token - # id has probability 1.0 and others have a very low probability of - # 0.00001. Populate draft_token_ids such that they exclude the token_ids - # with probability = 1.0. Without any changes to the posterior thresholds - # none of the draft tokens are accepted. - target_probs, zero_temperature_token_ids = get_zero_temperature_prob_dist( - batch_size, k + 1, vocab_size) - zero_temperature_token_ids = zero_temperature_token_ids[:, :-1] - target_probs[target_probs == 0] = 0.00001 - draft_token_ids = get_draft_token_ids(batch_size, k, vocab_size, - zero_temperature_token_ids) - bonus_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, 1), - dtype=torch.int64) - output_token_ids = typical_acceptance_sampler( - target_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - assert output_token_ids.shape[0] == batch_size - assert output_token_ids.shape[1] == (k + 1) - assert torch.all(output_token_ids[:, 1:-1] == -1) - - # Change the posterior threshold values to 0.0 so that we will - # now accept even draft tokens with very low probability in the - # target distribution. Simulate and verify the same. 
- typical_acceptance_sampler = TypicalAcceptanceSampler( - strict_mode=True, posterior_threshold=0.0, posterior_alpha=0.0) - typical_acceptance_sampler.init_gpu_tensors(device=device) - output_token_ids = typical_acceptance_sampler( - target_probs, - bonus_token_ids, - draft_probs=None, - draft_token_ids=draft_token_ids) - assert output_token_ids.shape[0] == batch_size - assert output_token_ids.shape[1] == (k + 1) - assert torch.all(output_token_ids[:, 0:-1] == draft_token_ids) - assert torch.all(output_token_ids[:, -1] == bonus_token_ids) - - -@pytest.mark.parametrize("seed", list(range(10))) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@torch.inference_mode() -def test_get_recovered_token_ids(seed: int, device: str): - """ - Test the TypicalAcceptanceSampler's method for generating - replacement token IDs. - - This test verifies that the `_get_recovered_token_ids` method of the - TypicalAcceptanceSampler correctly identifies the token IDs to be used - as recovered token IDs based on the target probability distribution. - Specifically, it ensures that the method correctly identifies the - tokens with the highest probability for each sequence in the batch. - """ - set_random_seed(seed) - k = 10 - batch_size = 5 - vocab_size = 30_000 - torch.set_default_device(device) - typical_acceptance_sampler = get_acceptance_sampler(strict_mode=True) - typical_acceptance_sampler.init_gpu_tensors(device=device) - target_probs = torch.rand(batch_size, k, vocab_size, dtype=torch.float32) - expected_replacement_tokens = torch.argmax(target_probs, dim=-1) - actual_replacement_tokens = ( - typical_acceptance_sampler._get_recovered_token_ids(target_probs)) - assert torch.all(expected_replacement_tokens == actual_replacement_tokens) diff --git a/tests/spec_decode/__init__.py b/tests/spec_decode/__init__.py deleted file mode 100644 index e69de29bb2d..00000000000 diff --git a/tests/spec_decode/conftest.py b/tests/spec_decode/conftest.py deleted file mode 100644 index 375b248ebed..00000000000 --- a/tests/spec_decode/conftest.py +++ /dev/null @@ -1,12 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import pytest - - -@pytest.fixture(scope="function", autouse=True) -def use_v0_only(monkeypatch): - """ - Since this module is V0 only, set VLLM_USE_V1=0 for - all tests in the module. 
- """ - monkeypatch.setenv('VLLM_USE_V1', '0') diff --git a/tests/spec_decode/e2e/__init__.py b/tests/spec_decode/e2e/__init__.py deleted file mode 100644 index e69de29bb2d..00000000000 diff --git a/tests/spec_decode/e2e/conftest.py b/tests/spec_decode/e2e/conftest.py deleted file mode 100644 index f3fe9db3f79..00000000000 --- a/tests/spec_decode/e2e/conftest.py +++ /dev/null @@ -1,307 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from collections.abc import Sequence -from itertools import cycle -from typing import Optional, Union - -import pytest -import torch - -from vllm import LLM, SamplingParams -from vllm.distributed import cleanup_dist_env_and_memory -from vllm.model_executor.utils import set_random_seed -from vllm.sequence import PromptLogprobs, SampleLogprobs - -from ...models.utils import (TokensTextLogprobs, - TokensTextLogprobsPromptLogprobs, - check_logprobs_close, check_outputs_equal) -from ...utils import RemoteOpenAIServer - -PROMPTS = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", - "San Francisco is know for its", - "Facebook was created in 2004 by", - "Curious George is a", - "Python 3.11 brings improvements to its", -] - - -@pytest.fixture -def test_llm_generator(common_llm_kwargs, per_test_common_llm_kwargs, - test_llm_kwargs, seed): - - def generate(): - kwargs = { - **common_llm_kwargs, - **per_test_common_llm_kwargs, - **test_llm_kwargs, - } - - llm = LLM(**kwargs) - - if seed is not None: - set_random_seed(seed) - - yield llm - - del llm - cleanup_dist_env_and_memory() - - return generate - - -def maybe_assert_ngram_worker(llm): - # Verify the proposer worker is ngram if ngram is specified. - if (llm.llm_engine.speculative_config is not None - and llm.llm_engine.speculative_config.method == "ngram"): - from vllm.spec_decode.ngram_worker import NGramWorker - assert isinstance( - llm.llm_engine.model_executor.driver_worker.proposer_worker, - NGramWorker) - - -def get_output_from_llm_generator( - llm_generator, prompts, - sampling_params) -> tuple[list[str], list[list[int]], float]: - tokens: list[str] = [] - token_ids: list[list[int]] = [] - acceptance_rate: float = -1.0 - for llm in llm_generator(): - maybe_assert_ngram_worker(llm) - - outputs = llm.generate(prompts, sampling_params, use_tqdm=True) - - token_ids = [output.outputs[0].token_ids for output in outputs] - tokens = [output.outputs[0].text for output in outputs] - - # Fetch acceptance rate if logging is enabled. - if stat_loggers := getattr(llm.llm_engine, "stat_loggers", None): - stat_logger = stat_loggers["prometheus"] - acceptance_rate = (stat_logger.metrics. - gauge_spec_decode_draft_acceptance_rate.labels( - **stat_logger.labels)._value.get()) - del llm - - return tokens, token_ids, acceptance_rate - - -def check_logprobs_correctness( - spec_outputs: Sequence[Union[TokensTextLogprobs, - TokensTextLogprobsPromptLogprobs]], - baseline_outputs: Sequence[Union[TokensTextLogprobs, - TokensTextLogprobsPromptLogprobs]], - disable_logprobs: bool = False, -): - """Compare sampled and prompt logprobs between baseline and spec decoding - """ - if not disable_logprobs: - return check_logprobs_close( - outputs_0_lst=baseline_outputs, - outputs_1_lst=spec_outputs, - name_0="org", - name_1="sd", - ) - - # Check correctness when disable_logprobs == True - for spec_output, baseline_output in zip(spec_outputs, baseline_outputs): - # Check generated token logprobs. 
- spec_logprobs = spec_output[2] - baseline_logprobs = baseline_output[2] - _check_logprobs_when_output_disabled(spec_logprobs, - baseline_logprobs, - is_prompt_logprobs=False) - - # Check prompt logprobs too, if they exist - if len(baseline_output) == 4: - assert len(spec_output) == 4 - spec_prompt_logprobs = spec_output[3] - baseline_prompt_logprobs = baseline_output[3] - _check_logprobs_when_output_disabled(spec_prompt_logprobs, - baseline_prompt_logprobs, - is_prompt_logprobs=True) - - -def _check_logprobs_when_output_disabled( - spec_logprobs: Union[Optional[PromptLogprobs], SampleLogprobs], - baseline_logprobs: Union[Optional[PromptLogprobs], SampleLogprobs], - is_prompt_logprobs: bool = False, -): - # Prompt logprobs are optional - if is_prompt_logprobs and baseline_logprobs is None: - assert spec_logprobs is None - return - - assert spec_logprobs is not None - assert baseline_logprobs is not None - assert len(spec_logprobs) == len(baseline_logprobs) - - # For each generated position of the sequence. - for pos, (spec_pos_logprobs, baseline_pos_logprobs) in enumerate( - zip(spec_logprobs, baseline_logprobs)): - - # First prompt logprob is expected to be None - if is_prompt_logprobs and baseline_pos_logprobs is None: - assert spec_pos_logprobs is None - assert pos == 0 - continue - - assert spec_pos_logprobs is not None - assert baseline_pos_logprobs is not None - - # When disabled, the 1 logprob is returned with dummy values for the - # score and rank, but the token id should match the baseline model - assert len(spec_pos_logprobs) == 1 - (spec_pos_logprob_token_id, - spec_pos_logprob) = next(iter(spec_pos_logprobs.items())) - assert spec_pos_logprob.rank == -1 - assert spec_pos_logprob.logprob == 0.0 - if isinstance(spec_pos_logprob_token_id, torch.Tensor): - spec_pos_logprob_token_id = spec_pos_logprob_token_id.item() - assert spec_pos_logprob_token_id in baseline_pos_logprobs - - -def run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size: int, - max_output_len: int, - seed: Optional[int] = 0, - temperature: float = 0.0, - disable_seed: bool = False, - ignore_eos: bool = True, - ensure_all_accepted: bool = False, - expected_acceptance_rate: Optional[float] = None, - logprobs: Optional[int] = None, - prompt_logprobs: Optional[int] = None, - disable_logprobs: bool = False): - - org_args = { - **common_llm_kwargs, - **per_test_common_llm_kwargs, - **baseline_llm_kwargs, - } - - sd_args = { - **common_llm_kwargs, - **per_test_common_llm_kwargs, - **test_llm_kwargs, - } - - prompts = [prompt for prompt, _ in zip(cycle(PROMPTS), range(batch_size))] - - if disable_seed: - seed = None - - sampling_params = SamplingParams(temperature=temperature, - max_tokens=max_output_len, - seed=seed, - ignore_eos=ignore_eos, - logprobs=logprobs, - prompt_logprobs=prompt_logprobs) - - with vllm_runner(**org_args) as vllm_model: - org_outputs = vllm_model.generate_w_logprobs(prompts, sampling_params) - - with vllm_runner(**sd_args) as vllm_model: - if ensure_all_accepted or expected_acceptance_rate is not None: - # Force log interval to be 0 to catch all metrics. - stat_logger = vllm_model.model.llm_engine.stat_loggers[ - 'prometheus'] - stat_logger.local_interval = -100 - - sd_outputs = vllm_model.generate_w_logprobs(prompts, sampling_params) - - if ensure_all_accepted or expected_acceptance_rate is not None: - acceptance_rate = (stat_logger.metrics. 
- gauge_spec_decode_draft_acceptance_rate.labels( - **stat_logger.labels)._value.get()) - - if ensure_all_accepted: - assert True - # FIXME: ci fails to log acceptance rate. - # It works locally. - # assert acceptance_rate == 1.0 - - if expected_acceptance_rate is not None: - assert acceptance_rate >= expected_acceptance_rate - 1e-2 - - # Only pass token entries, not the logprobs - check_outputs_equal(outputs_0_lst=[out[0:2] for out in org_outputs], - outputs_1_lst=[out[0:2] for out in sd_outputs], - name_0="org", - name_1="sd") - - # Check logprobs if requested - if logprobs is not None or prompt_logprobs is not None: - check_logprobs_correctness(sd_outputs, org_outputs, disable_logprobs) - - -def run_equality_correctness_test_tp(model, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size: int, - max_output_len: int, - seed: int = 0, - temperature: float = 0.0, - logprobs: Optional[int] = None): - """Helper method that compares the outputs of both the baseline LLM and - the test LLM. It asserts greedy equality, e.g. that the outputs are exactly - the same when temperature is zero. - """ - arg1 = common_llm_kwargs + per_test_common_llm_kwargs + baseline_llm_kwargs - arg2 = common_llm_kwargs + per_test_common_llm_kwargs + test_llm_kwargs - env1 = env2 = None - - max_wait_seconds = 240 - results = [] - - prompts = [prompt for prompt, _ in zip(cycle(PROMPTS), range(batch_size))] - for args, env in ((arg1, env1), (arg2, env2)): - with RemoteOpenAIServer(model, - args, - env_dict=env, - max_wait_seconds=max_wait_seconds) as server: - client = server.get_client() - - completion = client.completions.create(model=model, - prompt=prompts, - max_tokens=max_output_len, - seed=seed, - temperature=temperature, - logprobs=logprobs) - - results.append({ - "test": - "seeded_sampling", - "text": [choice.text for choice in completion.choices], - "logprobs": [choice.logprobs for choice in completion.choices], - "finish_reason": - [choice.finish_reason for choice in completion.choices], - "usage": - completion.usage, - }) - - n = len(results) // 2 - arg1_results = results[:n] - arg2_results = results[n:] - # Separate logprobs to avoid asserting exact equality. - arg1_logprobs = [r.pop("logprobs") for r in arg1_results] - arg2_logprobs = [r.pop("logprobs") for r in arg2_results] - - for arg1_result, arg2_result in zip(arg1_results, arg2_results): - assert arg1_result == arg2_result, ( - f"Results for {model=} are not the same with {arg1=} and {arg2=}. " - f"{arg1_result=} != {arg2_result=}") - if logprobs: - for logs1, logs2 in zip(arg1_logprobs, arg2_logprobs): - for l1, l2 in zip(logs1, logs2): - assert l1.tokens == l2.tokens diff --git a/tests/spec_decode/e2e/test_compatibility.py b/tests/spec_decode/e2e/test_compatibility.py deleted file mode 100644 index 6c453879a6a..00000000000 --- a/tests/spec_decode/e2e/test_compatibility.py +++ /dev/null @@ -1,66 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import pytest - -from vllm import SamplingParams - -from .conftest import get_output_from_llm_generator - - -@pytest.mark.parametrize("common_llm_kwargs", - [{ - "model": "meta-llama/Llama-3.2-1B-Instruct", - }]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [ - { - # Speculative max model len > overridden max model len should raise. 
- "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "max_model_len": 129, - }, - "max_model_len": 128, - }, - { - # Speculative max model len > draft max model len should raise. - # https://huggingface.co/JackFram/llama-68m/blob/3b606af5198a0b26762d589a3ee3d26ee6fa6c85/config.json#L12 - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "max_model_len": 2048 + 1, - }, - }, - { - # Speculative max model len > target max model len should raise. - # https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/9213176726f574b556790deb65791e0c5aa438b6/config.json#L18 - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "max_model_len": 131072 + 1, - }, - }, - ]) -@pytest.mark.parametrize("test_llm_kwargs", [{}]) -@pytest.mark.parametrize("seed", [1]) -def test_spec_decode_xfail_spec_max_model_len(test_llm_generator): - """Verify that speculative decoding validates speculative_max_model_len. - """ - output_len = 128 - temperature = 0.0 - - prompts = [ - "Hello, my name is", - ] - - sampling_params = SamplingParams( - max_tokens=output_len, - ignore_eos=True, - temperature=temperature, - ) - - with pytest.raises(ValueError, match="cannot be larger than"): - get_output_from_llm_generator(test_llm_generator, prompts, - sampling_params) diff --git a/tests/spec_decode/e2e/test_eagle_correctness.py b/tests/spec_decode/e2e/test_eagle_correctness.py deleted file mode 100644 index 7c369feec41..00000000000 --- a/tests/spec_decode/e2e/test_eagle_correctness.py +++ /dev/null @@ -1,480 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""This docstring details important information on the testing methodology. - -Most of the tests rely on "greedy equality", where we expect the output of -speculative decoding on a sequence to exactly match the output of normal non- -speculative decoding. - -Since speculative decoding with rejection sampling guarantees that the output -distribution matches the target model's output distribution (up to hardware -numerics, see https://arxiv.org/pdf/2302.01318.pdf), we can expect greedy -equality. - -However, we still need to verify below scenario could be passed: - * Batch size 1 greedy equality - * Batch size >1 greedy equality - * Test greedy equality under preemption - * Test greedy equality under various number of speculative tokens. - -With those tests, we can say at least, EAGLE would not break the -correctness for the target model outputs. -""" - -import pytest - -from .conftest import run_equality_correctness_test - -# main model -MAIN_MODEL = "JackFram/llama-68m" - -# speculative model -SPEC_MODEL = "abhigoyal/vllm-eagle-llama-68m-random" - -# max. number of speculative tokens: this corresponds to -# num_heads in the config.json of the speculator model. -MAX_SPEC_TOKENS = 4 - -# precision -PRECISION = "float32" - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. 
- "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("seed", [1]) -def test_eagle_e2e_greedy_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - seed: int): - - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_logprobs": False, - }, -}, { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_logprobs": True, - }, -}]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [1, 6]) -def test_eagle_e2e_greedy_logprobs(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, seed: int, - logprobs: int): - - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - logprobs=logprobs, - prompt_logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "enforce_eager": False, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("seed", [1]) -def test_eagle_e2e_greedy_correctness_cuda_graph( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality with cuda graph enabled and different - batch sizes.""" - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "block_size": 8, - # 2 for small prompt, 256//8 for generated. 
- "num_gpu_blocks_override": 2 + 256 // 8, - "max_model_len": (2 + 256 // 8) * 8, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use small output len for fast test. - 128, - ]) -@pytest.mark.parametrize("batch_size", [4]) -@pytest.mark.parametrize("seed", [1]) -def test_eagle_e2e_greedy_correctness_with_preemption( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality, even when some sequences are preempted mid- - generation. - """ - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": k, - }, - } - # Try a range of num. speculative tokens - for k in range(1, 1 + MAX_SPEC_TOKENS) - ]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_eagle_different_k(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify that eagle speculative decoding produces exact equality - to without spec decode with different values of num_speculative_tokens. - """ - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_by_batch_size": 4, - }, -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_eagle_disable_queue(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify that eagle speculative decoding produces exact equality - to without spec decode when speculation is disabled for large - batch sizes. 
- """ - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": "float16", - - # Main model - "model_name": "meta-llama/Llama-2-7b-chat-hf", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "yuhuili/EAGLE-llama2-chat-7B", - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize("seed", [1]) -def test_llama2_eagle_e2e_greedy_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - output_len: int, seed: int): - - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # 2 for small prompt, 256//16 for generated. - "num_gpu_blocks_override": 2 + 256 // 16, - "max_model_len": (2 + 256 // 16) * 16, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": "float16", - - # Main model - "model_name": "meta-llama/Meta-Llama-3-8B-Instruct", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize("seed", [1]) -def test_llama3_eagle_e2e_greedy_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - output_len: int, seed: int): - - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # 2 for small prompt, 256//16 for generated. - "num_gpu_blocks_override": 2 + 256 // 16, - "max_model_len": (2 + 256 // 16) * 16, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": "float16", - - # Main model - "model_name": "Qwen/Qwen2-7B-Instruct", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "yuhuili/EAGLE-Qwen2-7B-Instruct", - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. 
- 32, - ]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize("seed", [1]) -def test_qwen2_eagle_e2e_greedy_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - output_len: int, seed: int): - - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0) - - -if __name__ == "__main__": - import pytest - pytest.main([__file__]) diff --git a/tests/spec_decode/e2e/test_integration.py b/tests/spec_decode/e2e/test_integration.py deleted file mode 100644 index f15a9224c00..00000000000 --- a/tests/spec_decode/e2e/test_integration.py +++ /dev/null @@ -1,161 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""Tests which cover integration of the speculative decoding framework with -other features, e.g. cuda graphs. -""" - -import pytest - -from .conftest import run_equality_correctness_test - -MAIN_MODEL = "JackFram/llama-68m" - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Verify equality when cuda graphs allowed. - "enforce_eager": False, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [ - { - # Identical models. - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - }, - ]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{}]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize("output_len", [32]) -@pytest.mark.parametrize("seed", [1]) -def test_spec_decode_cuda_graph(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, seed: int): - """Verify spec decode equality when cuda graphs are enabled. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-160m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. 
- "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", []) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - # Explicitly specify draft model quantization - { - "speculative_config": { - "model": "LnL-AI/TinyLlama-1.1B-Chat-v1.0-GPTQ-4bit", - "num_speculative_tokens": 5, - "quantization": "gptq", - }, - }, - # Explicitly specify GPTQ-based draft model to use marlin quantization - { - "speculative_config": { - "model": "LnL-AI/TinyLlama-1.1B-Chat-v1.0-GPTQ-4bit", - "num_speculative_tokens": 5, - "quantization": "marlin", - }, - }, - # Not explicitly specify draft model quantization - { - "speculative_config": { - "model": "LnL-AI/TinyLlama-1.1B-Chat-v1.0-GPTQ-4bit", - "num_speculative_tokens": 5, - "quantization": None, - }, - }, - ]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize("seed", [1]) -def test_speculative_model_quantization_config(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size: int, seed: int): - """Verify spec decode works well with draft model quantization configs. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=32, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": MAIN_MODEL, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - "disable_mqa_scorer": True, - }, -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_mqa_scorer(vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, - output_len: int, seed: int): - """Verify that speculative decoding generates the same output - with batch expansion scorer and mqa scorer. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) diff --git a/tests/spec_decode/e2e/test_integration_dist_tp2.py b/tests/spec_decode/e2e/test_integration_dist_tp2.py deleted file mode 100644 index a18be80c50d..00000000000 --- a/tests/spec_decode/e2e/test_integration_dist_tp2.py +++ /dev/null @@ -1,247 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""Tests which cover integration of the speculative decoding framework with -tensor parallelism. -""" - -import json -from typing import Optional - -import pytest -import torch - -from vllm.platforms import current_platform - -from .conftest import run_equality_correctness_test_tp - - -@pytest.mark.skipif(torch.cuda.device_count() < 2, - reason="Need at least 2 GPUs to run the test.") -@pytest.mark.parametrize( - "common_llm_kwargs", - [[ - # Skip cuda graph recording for fast test. 
- "--enforce-eager", - "--tensor-parallel-size", - "2" - ]]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [[]]) -@pytest.mark.parametrize("baseline_llm_kwargs", [[]]) -@pytest.mark.parametrize("test_llm_kwargs", [ - [ - "--speculative_config", - json.dumps({ - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - }), - ], - [ - "--speculative_config", - json.dumps({ - "model": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - }), - ], -]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_target_model_tp_gt_1(common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, seed: int): - """Verify greedy equality when tensor parallelism is used. - """ - if current_platform.is_rocm(): - pytest.skip("hip is not well-supported yet") - run_equality_correctness_test_tp("JackFram/llama-68m", - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0) - - -@pytest.mark.skipif(torch.cuda.device_count() < 2, - reason="Need at least 2 GPUs to run the test.") -@pytest.mark.parametrize( - "common_llm_kwargs", - [[ - # Skip cuda graph recording for fast test. - "--enforce-eager", - "--tensor_parallel_size", - "2", - - # precision - "--dtype", - "bfloat16", - ]]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [[]]) -@pytest.mark.parametrize("baseline_llm_kwargs", [[]]) -@pytest.mark.parametrize( - "model, test_llm_kwargs", - [("JackFram/llama-68m", [ - "--speculative_config", - json.dumps({ - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "draft_tensor_parallel_size": 1, - }), - ]), - ("ibm-granite/granite-3b-code-instruct", [ - "--speculative_config", - json.dumps({ - "model": "ibm-granite/granite-3b-code-instruct", - "num_speculative_tokens": 5, - "draft_tensor_parallel_size": 1, - }), - ])]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize("seed", [1]) -def test_draft_model_tp_lt_target_model_tp2(model, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - seed: int): - """Verify spec decode works well with smaller tp for draft models. - """ - run_equality_correctness_test_tp(model, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=32, - seed=seed, - temperature=0.0) - - -@pytest.mark.skipif(torch.cuda.device_count() < 2, - reason="Need at least 2 GPUs to run the test.") -@pytest.mark.parametrize( - "common_llm_kwargs", - [[ - # Skip cuda graph recording for fast test. 
- "--enforce-eager", - "--tensor_parallel_size", - "2", - - # precision - "--dtype", - "bfloat16", - ]]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [["--enable-chunked-prefill", "False"], - [ - "--enable-chunked-prefill", "True", "--max-num-batched-tokens", "4", - "--max-num-seqs", "4" - ]]) -@pytest.mark.parametrize("baseline_llm_kwargs", [[]]) -@pytest.mark.parametrize("model, test_llm_kwargs", - [("JackFram/llama-68m", [ - "--speculative_config", - json.dumps({ - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - }), - ]), - ("JackFram/llama-68m", [ - "--speculative_config", - json.dumps({ - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - "draft_tensor_parallel_size": 1, - }), - ])]) -@pytest.mark.parametrize("logprobs", [None]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize("seed", [1]) -def test_spec_decode_chunked_prefill_tp2(model, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - logprobs: Optional[int], - batch_size: int, seed: int): - """Verify spec decode works well with same and different TP size for - the draft model with chunked prefill. - """ - run_equality_correctness_test_tp(model, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=32, - seed=seed, - temperature=0.0, - logprobs=logprobs) - - -@pytest.mark.skipif(torch.cuda.device_count() < 2, - reason="Need at least 2 GPUs to run the test.") -@pytest.mark.parametrize( - "common_llm_kwargs", - [[ - # Skip cuda graph recording for fast test. - "--enforce-eager", - "--tensor_parallel_size", - "2", - - # precision - "--dtype", - "bfloat16", - ]]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [["--enable-chunked-prefill", "False"], - [ - "--enable-chunked-prefill", "True", "--max-num-batched-tokens", "4", - "--max-num-seqs", "4" - ]]) -@pytest.mark.parametrize("baseline_llm_kwargs", [[]]) -@pytest.mark.parametrize("model, test_llm_kwargs", - [("JackFram/llama-68m", [ - "--speculative_config", - json.dumps({ - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - "disable_logprobs": False, - }), - ]), - ("JackFram/llama-68m", [ - "--speculative_config", - json.dumps({ - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - "draft_tensor_parallel_size": 1, - "disable_logprobs": False, - }), - ])]) -@pytest.mark.parametrize("logprobs", [2]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize("seed", [1]) -def test_spec_decode_chunked_prefill_tp2_with_logprobs( - model, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, logprobs: Optional[int], - batch_size: int, seed: int): - """Verify spec decode works well with same and different TP size for - the draft model with chunked prefill. 
- """ - run_equality_correctness_test_tp(model, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=32, - seed=seed, - temperature=0.0, - logprobs=logprobs) diff --git a/tests/spec_decode/e2e/test_integration_dist_tp4.py b/tests/spec_decode/e2e/test_integration_dist_tp4.py deleted file mode 100644 index 039eec8fd2c..00000000000 --- a/tests/spec_decode/e2e/test_integration_dist_tp4.py +++ /dev/null @@ -1,123 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""Tests which cover integration of the speculative decoding framework with -tensor parallelism. -""" - -import json - -import openai -import pytest -import torch - -from .conftest import run_equality_correctness_test_tp - -MAIN_MODEL = "JackFram/llama-68m" -SPEC_MODEL = "JackFram/llama-68m" - - -@pytest.mark.skipif(torch.cuda.device_count() < 4, - reason="Need at least 4 GPUs to run the test.") -@pytest.mark.parametrize( - "common_llm_kwargs", - [[ - # Skip cuda graph recording for fast test. - "--enforce_eager", - "--tensor-parallel-size", - "4", - ]]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [ - [], -]) -@pytest.mark.parametrize("baseline_llm_kwargs", [[]]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - #TODO(wooyeon): add spec_draft_dp=2 case - [ - "--speculative_config", - json.dumps({ - "model": f"{SPEC_MODEL}", - "num_speculative_tokens": 5, - "draft_tensor_parallel_size": 1, - }), - ], - ]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize("seed", [1]) -def test_draft_model_tp_lt_target_model_tp4(common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - seed: int): - """Verify spec decode works well with smaller tp for draft models. - """ - run_equality_correctness_test_tp(MAIN_MODEL, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=32, - seed=seed, - temperature=0.0) - - -@pytest.mark.skipif(torch.cuda.device_count() < 4, - reason="Need at least 4 GPUs to run the test.") -@pytest.mark.parametrize( - "common_llm_kwargs", - [[ - - # Skip cuda graph recording for fast test. - "--enforce-eager", - "--tensor-parallel-size", - "4", - ]]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [[]]) -@pytest.mark.parametrize("baseline_llm_kwargs", [[]]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - [ - # Artificially limit the draft model max model len; this forces vLLM - # to skip speculation once the sequences grow beyond 32-k tokens. - "--speculative_config", - json.dumps({ - "model": f"{SPEC_MODEL}", - "num_speculative_tokens": 5, - "max_model_len": 32, - }), - ], - ]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize( - "output_len", - [ - # This must be a good bit larger than speculative_max_model_len so that - # we can test the case where all seqs are skipped, but still small to - # ensure fast test. - 64, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_skip_speculation(common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, seed: int): - """Verify job failure with RuntimeError when all sequences skip speculation. - We do this by setting the max model len of the draft model to an - artificially low value, such that when the sequences grow beyond it, they - are skipped in speculative decoding. 
- - TODO: fix it to pass without raising Error. (#5814) - """ - with pytest.raises( - (openai.APIConnectionError, openai.InternalServerError)): - run_equality_correctness_test_tp(MAIN_MODEL, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0) diff --git a/tests/spec_decode/e2e/test_logprobs.py b/tests/spec_decode/e2e/test_logprobs.py deleted file mode 100644 index 4de7ee05605..00000000000 --- a/tests/spec_decode/e2e/test_logprobs.py +++ /dev/null @@ -1,315 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from itertools import cycle - -import pytest - -from vllm import SamplingParams - -from ..utils import maybe_enable_chunked_prefill -from .conftest import run_equality_correctness_test - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-160m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - "disable_logprobs": False, - }, -}, { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - "disable_logprobs": True, - }, -}]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 7, - ]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [1, 6]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4, 12]) -def test_logprobs_equality(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int, logprobs: int, prefill_chunk_size: int): - """Verify output logprobs are equal with and without speculative decoding, - as well as with and without chunked prefill. - """ - maybe_enable_chunked_prefill(prefill_chunk_size, common_llm_kwargs) - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0, - logprobs=logprobs, - prompt_logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": "JackFram/llama-160m", - "num_speculative_tokens": 3, - "disable_logprobs": False, - }, -}, { - "speculative_config": { - "model": "JackFram/llama-160m", - "num_speculative_tokens": 6, - "disable_logprobs": False, - }, -}]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. 
- 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [1, 6]) -def test_logprobs_different_k(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - output_len: int, seed: int, logprobs: int): - """Veriy logprob greedy equality with different speculation lens. - """ - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0, - logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [{ - "speculative_config": { - "model": "JackFram/llama-160m", - "num_speculative_tokens": 3, - "disable_logprobs": False, - # Artificially limit the draft model max model len; this forces - # vLLM to skip speculation once the sequences grow beyond 32-k - # tokens. - "max_model_len": 32, - }, - }]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [1]) -def test_logprobs_when_skip_speculation(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - seed: int, logprobs: int): - """Verify logprobs greedy equality when some sequences skip speculation. - """ - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0, - logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": "JackFram/llama-160m", - "num_speculative_tokens": 3, - "disable_logprobs": False, - }, -}]) -@pytest.mark.parametrize("batch_size", [1]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [6]) -def test_logprobs_temp_1(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int, logprobs: int): - """Verify at least one logprob result has num_logprobs+1, which tests the - case where the sampled token is not in top-k logprobs. - - Ideally, this test should validate equality with non-spec by getting - logprobs. This is left as future improvement. 
- """ - temperature = 1.0 - - prompts = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", - "San Francisco is know for its", - "Facebook was created in 2004 by", - "Curious George is a", - "Python 3.11 brings improvements to its", - ] - - prompts = [prompt for prompt, _ in zip(cycle(prompts), range(batch_size))] - - sampling_params = SamplingParams( - max_tokens=output_len, - ignore_eos=True, - temperature=temperature, - logprobs=logprobs, - ) - - sd_args = { - **common_llm_kwargs, - **per_test_common_llm_kwargs, - **test_llm_kwargs, - } - - with vllm_runner(**sd_args) as vllm_model: - sd_outputs = vllm_model.generate_w_logprobs(prompts, sampling_params) - - num_returned_logprobs = [ - len(seq_logprobs) for seq_logprobs in sd_outputs[-1] - ] - - # Assert one of the returned logprobs has > num_logprobs (indicating the - # sampled token is not in top-k). - assert any( - [num_returned > logprobs for num_returned in num_returned_logprobs]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-160m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - "disable_logprobs": True, - }, -}]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("batch_size", [4]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("logprobs", [0]) -def test_logprobs_disabled(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int, logprobs: int): - """Check the behavior when logprobs are disabled. - Token choices should match with the base model. - """ - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - temperature=0.0, - logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) diff --git a/tests/spec_decode/e2e/test_medusa_correctness.py b/tests/spec_decode/e2e/test_medusa_correctness.py deleted file mode 100644 index bc9501bd573..00000000000 --- a/tests/spec_decode/e2e/test_medusa_correctness.py +++ /dev/null @@ -1,417 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""This docstring details important information on the testing methodology. - -Most of the tests rely on "greedy equality", where we expect the output of -speculative decoding on a sequence to exactly match the output of normal non- -speculative decoding. - -Since speculative decoding with rejection sampling guarantees that the output -distribution matches the target model's output distribution (up to hardware -numerics, see https://arxiv.org/pdf/2302.01318.pdf), we can expect greedy -equality. - -However, we still need to verify below scenario could be passed: - * Batch size 1 greedy equality - * Batch size >1 greedy equality - * Test greedy equality under preemption - * Test greedy equality under various number of speculative tokens. 
- -With those tests, we can say at least, Medusa would not break the -correctness for the target model outputs. -""" - -import pytest - -from ..utils import maybe_enable_chunked_prefill -from .conftest import run_equality_correctness_test - -# main model -# lmsys/vicuna-7b-v1.3 was to be used but it's causing -# OOM in CI pipeline, so using a smaller model. -MAIN_MODEL = "JackFram/llama-68m" - -# speculative model -SPEC_MODEL = "abhigoyal/vllm-medusa-llama-68m-random" - -# max number of speculative tokens: this corresponds to -# num_heads in the config.json of the speculator model. -MAX_SPEC_TOKENS = 5 - -# precision -PRECISION = "float32" - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 32]) -def test_medusa_e2e_greedy_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - seed: int, prefill_chunk_size: int): - """Verify greedy equality with different batch size.""" - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. 
- "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_logprobs": False, - }, - }, - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_logprobs": True, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 8, -]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [1, 6]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 32]) -def test_medusa_e2e_greedy_logprobs(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - seed: int, logprobs: int, - prefill_chunk_size: int): - """Verify greedy equality with different batch size.""" - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0, - logprobs=logprobs, - prompt_logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "enforce_eager": False, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 32]) -def test_medusa_e2e_greedy_correctness_cuda_graph( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int, prefill_chunk_size: int): - """Verify greedy equality with cuda graph enabled and different - batch sizes.""" - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "block_size": 16, - # 2 for small prompt, 256//8 for generated. - "num_gpu_blocks_override": 2 + 256 // 8, - "max_model_len": (2 + 256 // 8) * 8, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use small output len for fast test. 
- 128, - ]) -@pytest.mark.parametrize("batch_size", [4]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 32]) -def test_medusa_e2e_greedy_correctness_with_preemption( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int, prefill_chunk_size: int): - """Verify greedy equality, even when some sequences are preempted mid- - generation. - """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": k, - }, - } - # Try a range of num. speculative tokens - for k in range(1, 1 + MAX_SPEC_TOKENS) - ]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 32]) -def test_medusa_different_k(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int, prefill_chunk_size: int): - """Verify that medusa speculative decoding produces exact equality - to without spec decode with different values of num_speculative_tokens. - """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_by_batch_size": 4, - }, -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 32]) -def test_medusa_disable_queue(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - output_len: int, seed: int, - prefill_chunk_size: int): - """Verify that medusa speculative decoding produces exact equality - to without spec decode when speculation is disabled for large - batch sizes. 
- """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_by_batch_size": 4, - "disable_mqa_scorer": True, - }, -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 32]) -def test_mqa_scorer(vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, - output_len: int, seed: int, prefill_chunk_size: int): - """Verify that speculative decoding generates the same output - with batch expansion scorer and mqa scorer. - """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -if __name__ == "__main__": - import pytest - pytest.main([__file__]) diff --git a/tests/spec_decode/e2e/test_mlp_correctness.py b/tests/spec_decode/e2e/test_mlp_correctness.py deleted file mode 100644 index 0e41d93eaa1..00000000000 --- a/tests/spec_decode/e2e/test_mlp_correctness.py +++ /dev/null @@ -1,533 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""This docstring details important information on the testing methodology. - -Most of the tests rely on "greedy equality", where we expect the output of -speculative decoding on a sequence to exactly match the output of normal non- -speculative decoding. - -Since speculative decoding with rejection sampling guarantees that the output -distribution matches the target model's output distribution (up to hardware -numerics, see https://arxiv.org/pdf/2302.01318.pdf), we can expect greedy -equality. - -However, we still need to verify below scenario could be passed: - * Batch size 1 greedy equality - * Batch size >1 greedy equality - * Test greedy equality under preemption - * Test greedy equality under various number of speculative tokens. - -With those tests, we can say at least, MLPSpeculator would not break the -correctness for the target model outputs. -""" - -from unittest.mock import patch - -import pytest - -from vllm.model_executor.layers.vocab_parallel_embedding import pad_vocab_size - -from ..utils import maybe_enable_chunked_prefill -from .conftest import run_equality_correctness_test - -# main model -MAIN_MODEL = "JackFram/llama-160m" - -# speculative model -SPEC_MODEL = "ibm-ai-platform/llama-160m-accelerator" - -# max. number of speculative tokens: this corresponds to -# n_predict in the config.json of the speculator model. 
-MAX_SPEC_TOKENS = 3 - -# precision -PRECISION = "float32" - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [4, 32]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 32]) -def test_mlp_e2e_greedy_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - seed: int, prefill_chunk_size: int): - """Verify greedy equality with different batch size.""" - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - "disable_logprobs": False, - }, - }, - { - "speculative_config": { - "model": SPEC_MODEL, - "disable_logprobs": True, - }, - }, -]) -@pytest.mark.parametrize("output_len", [8]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [1, 6]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -def test_mlp_e2e_greedy_logprobs(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, seed: int, - logprobs: int, prefill_chunk_size: int): - """Verify greedy equality with different batch size.""" - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - # NOTE Test is sensitive enough st if we don't enable chunked prefill - # scheduling on baseline too, we get slightly different logprobs, ending - # up sampling different tokens at the tail (ie top tokens don't change). - # TL;DR: sd+cp == org+cp but sd+cp != org..is this expected? - maybe_enable_chunked_prefill(prefill_chunk_size, baseline_llm_kwargs) - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0, - logprobs=logprobs, - prompt_logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. 
- "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - }, - }, -]) -@pytest.mark.parametrize("output_len", [2048]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -def test_mlp_e2e_acceptance_rate(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - prefill_chunk_size: int, seed: int): - """Verify acceptance rate with different batch size and large output - length.""" - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - temperature=0.0, - seed=seed, - expected_acceptance_rate=0.48) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - - # Speculative config - "speculative_config": { - "model": SPEC_MODEL, - }, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{"seed": 1}]) -@pytest.mark.parametrize("test_llm_kwargs", [{"seed": 5}]) -@pytest.mark.parametrize("output_len", [64]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("temperature", [1.0]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -@pytest.mark.parametrize("seed", [1]) -def test_mlp_e2e_seeded_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - temperature: float, - prefill_chunk_size: int, seed: int): - """Verify seeded runs produce the same output.""" - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - maybe_enable_chunked_prefill(prefill_chunk_size, baseline_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - temperature=temperature, - seed=seed) - - # Ensure this same test does fail if we _don't_ include per-request seeds - with pytest.raises(AssertionError): - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - temperature=temperature, - seed=seed, - disable_seed=True) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "block_size": 16, - # 2 for small prompt, 256//8 for generated. - "num_gpu_blocks_override": 2 + 256 // 8, - "max_model_len": (2 + 256 // 8) * 8, - - # Skip cuda graph recording for fast test. 
- "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - }, - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use small output len for fast test. - 128, - ]) -@pytest.mark.parametrize("batch_size", [4]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -@pytest.mark.parametrize("seed", [1]) -def test_mlp_e2e_greedy_correctness_with_preemption( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - prefill_chunk_size: int, seed: int): - """Verify greedy equality, even when some sequences are preempted mid- - generation. - """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "block_size": 16, - # 2 for small prompt, 256//8 for generated. - "num_gpu_blocks_override": 2 + 256 // 8, - "max_model_len": (2 + 256 // 8) * 8, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": SPEC_MODEL, - }, - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use small output len for fast test. - 128, - ]) -@pytest.mark.parametrize("batch_size", [4]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -def test_mlp_e2e_greedy_correctness_with_padding( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - prefill_chunk_size: int, seed: int): - """Verify greedy equality when the vocab dimension is padded - """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - - # Default pad_to is 64, test model has vocab_size of 32000 - def patched_pad_vocab_size(vocab_size, pad_to=None): - return pad_vocab_size(vocab_size, pad_to=32064) - - with patch( - "vllm.model_executor.layers.vocab_parallel_embedding.pad_vocab_size", - patched_pad_vocab_size): - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - { - "speculative_config": { - "model": SPEC_MODEL, - "num_speculative_tokens": k, - }, - } - # Try a range of num. 
speculative tokens - for k in range(1, 1 + MAX_SPEC_TOKENS) - ]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -@pytest.mark.parametrize("seed", [1]) -def test_mlp_different_k(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - prefill_chunk_size: int, seed: int, output_len: int): - """Verify that mlp speculative decoding produces exact equality - to without spec decode with different values of num_speculative_tokens. - """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": SPEC_MODEL, - "disable_by_batch_size": 4, - }, -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -# Speculative decoding is disabled when sequences reach decoding and the batch -# consists of single-token requests. Hence we set `max_num_seqs` -# >= `speculative_disable_by_batch_size` to test feature interaction. -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -@pytest.mark.parametrize("seed", [1]) -def test_mlp_disable_queue(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - prefill_chunk_size: int, seed: int, - output_len: int): - """Verify that mlp speculative decoding produces exact equality - to without spec decode when speculation is disabled for large - batch sizes. - """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": MAIN_MODEL, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": SPEC_MODEL, - "disable_mqa_scorer": True, - }, -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -@pytest.mark.parametrize("seed", [1]) -def test_mqa_scorer(vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, - output_len: int, prefill_chunk_size: int, seed: int): - """Verify that speculative decoding generates the same output - with batch expansion scorer and mqa scorer. 
- """ - maybe_enable_chunked_prefill(prefill_chunk_size, test_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) diff --git a/tests/spec_decode/e2e/test_mtp_correctness.py b/tests/spec_decode/e2e/test_mtp_correctness.py deleted file mode 100644 index d9c7be8ffe7..00000000000 --- a/tests/spec_decode/e2e/test_mtp_correctness.py +++ /dev/null @@ -1,333 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""This docstring details important information on the testing methodology. - -Most of the tests rely on "greedy equality", where we expect the output of -speculative decoding on a sequence to exactly match the output of normal non- -speculative decoding. - -Since speculative decoding with rejection sampling guarantees that the output -distribution matches the target model's output distribution (up to hardware -numerics, see https://arxiv.org/pdf/2302.01318.pdf), we can expect greedy -equality. - -However, we still need to verify below scenario could be passed: - * Batch size 1 greedy equality - * Batch size >1 greedy equality - * Test greedy equality under preemption - * Test greedy equality under various number of speculative tokens. - -With those tests, we can say at least, mtp would not break the -correctness for the target model outputs. -""" - -import pytest - -from .conftest import run_equality_correctness_test - -# main model -MAIN_MODEL = "luccafong/deepseek_mtp_main_random" - -# max. number of speculative tokens: this corresponds to -# num_nextn_predict_layers in the config.json of the speculator model. -MAX_SPEC_TOKENS = 1 - -# precision -PRECISION = "bfloat16" - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - - # GPU memory utilization - "gpu_memory_utilization": 0.85 - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("seed", [1]) -def test_mtp_e2e_greedy_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - seed: int): - - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. 
- "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - - # GPU memory utilization - "gpu_memory_utilization": 0.85 - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_logprobs": False, - }, - }, - { - "speculative_config": { - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_logprobs": True, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [1, 6]) -def test_mtp_e2e_greedy_logprobs(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, seed: int, - logprobs: int): - - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - output_len, - seed, - logprobs=logprobs, - prompt_logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "enforce_eager": False, - - # Print spec metrics. - "disable_log_stats": False, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - "gpu_memory_utilization": 0.85 - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 128, -]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("seed", [1]) -def test_mtp_e2e_greedy_correctness_cuda_graph(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size: int, - output_len: int, seed: int): - """Verify greedy equality with cuda graph enabled and different - batch sizes.""" - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "block_size": 8, - # 2 for small prompt, 256//8 for generated. - "num_gpu_blocks_override": 2 + 256 // 8, - "max_model_len": (2 + 256 // 8) * 8, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - - # GPU memory utilization - "gpu_memory_utilization": 0.9 - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "num_speculative_tokens": MAX_SPEC_TOKENS, - }, - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use small output len for fast test. - 128, - ]) -@pytest.mark.parametrize("batch_size", [4]) -@pytest.mark.parametrize("seed", [1]) -def test_mtp_e2e_greedy_correctness_with_preemption( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality, even when some sequences are preempted mid- - generation. 
- """ - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - - # GPU memory utilization - "gpu_memory_utilization": 0.9 - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - { - "speculative_config": { - "num_speculative_tokens": k, - }, - } - # Try a range of num. speculative tokens - for k in range(1, 1 + MAX_SPEC_TOKENS) - ]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_mtp_different_k(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify that mtp speculative decoding produces exact equality - to without spec decode with different values of num_speculative_tokens. - """ - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Precision - "dtype": PRECISION, - - # Main model - "model_name": MAIN_MODEL, - - # GPU memory utilization - "gpu_memory_utilization": 0.9 - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "num_speculative_tokens": MAX_SPEC_TOKENS, - "disable_by_batch_size": 4 - }, -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_mtp_disable_queue(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify that mtp speculative decoding produces exact equality - to without spec decode when speculation is disabled for large - batch sizes. - """ - run_equality_correctness_test(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size, output_len, seed) - - -if __name__ == "__main__": - import pytest - pytest.main([__file__]) diff --git a/tests/spec_decode/e2e/test_multistep_correctness.py b/tests/spec_decode/e2e/test_multistep_correctness.py deleted file mode 100644 index ccc8e745ab3..00000000000 --- a/tests/spec_decode/e2e/test_multistep_correctness.py +++ /dev/null @@ -1,842 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""The tests in this file verify end-to-end speculative decoding correctness. - -This docstring details important information on the testing methodology. - -Most of the tests rely on "greedy equality", where we expect the output of -speculative decoding on a sequence to exactly match the output of normal non- -speculative decoding. 
- -Since speculative decoding with rejection sampling guarantees that the output -distribution matches the target model's output distribution (up to hardware -numerics, see https://arxiv.org/pdf/2302.01318.pdf), we can expect greedy -equality. This gives us good coverage of temp=0. - -At temp=0, the TypicalAcceptanceSampler ensures that only the tokens with the -highest probability in the target distribution are accepted. Therefore, we can -expect greedy equality for the TypicalAcceptanceSampler at temp=0. - -For temp>0, we rely on unit tests on the rejection sampler to verify that the -output distribution is the same with spec decode vs. no spec decode (this would -be prohibitively expensive to run with a real model). Similarly, for the -TypicalAcceptance sampler also, we rely on unit tests to validate temp>0 -test cases. - -NOTE: Speculative decoding's distribution equality requires that the measured -distributions of the target model and proposal model be deterministic given the -same input. vLLM largely guarantees this. - -@cadedaniel has seen cases where the output probabilities of a draft/target -model change slightly with certain batch sizes or prompts, even with Torch -determinism flags set. It is unclear if this is a bug in vLLM, due to non- -determinism in on-device batched operations, a bug in vLLM's spec decode -implementation, or the "hardware numerics" limitations. Either way, rejection -sampling ensures the output distribution matches the target model, but it breaks -greedy-equality tests for those batch sizes/prompts. -""" - -from itertools import cycle - -import pytest -from transformers import AutoTokenizer - -from vllm import SamplingParams - -from ...utils import create_new_process_for_each_test -from .conftest import (get_output_from_llm_generator, - run_equality_correctness_test) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Use a small model for a fast test. - # Note this is repeated in the test body; to initialize a tokenizer. - "model": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": False, - }, - { - # Chunked prefill enabled with small value - # to make sure we get mixed batches. - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - }, - { - # Verify the detokenizer assertions in the test work when spec - # decode is disabled. - }, - ]) -@pytest.mark.parametrize("test_llm_kwargs", [{}]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_spec_decode_e2e_with_detokenization(test_llm_generator, - batch_size: int): - """Run generation with speculative decoding on a batch. Verify the engine - generates the correct number of tokens (via ignore_eos=True), and that the - detokenization matches HF transformers. 
- """ - output_len = 32 - temperature = 0.0 - - prompts = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", - ] - - prompts = [prompt for prompt, _ in zip(cycle(prompts), range(batch_size))] - - sampling_params = SamplingParams( - max_tokens=output_len, - ignore_eos=True, - temperature=temperature, - ) - - batch_tokens, batch_token_ids, _ = get_output_from_llm_generator( - test_llm_generator, prompts, sampling_params) - - # Expect a generation for each prompt in the batch. - assert len(batch_token_ids) == len(prompts) - - # Expect each generation to have expected number of tokens (note ignore_eos - # is True). - assert [len(token_ids) - for token_ids in batch_token_ids] == ([output_len] * batch_size) - - # Expect detokenized string to match. - tok = AutoTokenizer.from_pretrained("JackFram/llama-68m") - for actual_tokens, actual_token_ids in zip(batch_tokens, batch_token_ids): - expected_tokens = tok.decode(actual_token_ids) - print(f"{actual_token_ids=}") - assert actual_tokens.strip() == expected_tokens.strip() - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [ - # Try two different tiny base models. - # Note that one is equal to the draft model, another isn't. - { - "model_name": "JackFram/llama-68m", - }, - { - "model_name": "JackFram/llama-160m", - }, - ]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "disable_logprobs": False, - }, - "enable_chunked_prefill": False, -}, { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 3, - "disable_logprobs": False, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4, -}]) -@pytest.mark.parametrize( - "output_len", - [ - # Use long output len for the small model test. - 10, - ]) -@pytest.mark.parametrize("batch_size", [1]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_spec_decode_e2e_greedy_correctness_tiny_model_bs1( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality on a tiny model with batch size of one. - - Since this test is cheaper than other e2e correctness tests, we generate - with a higher output_len. - - When the draft model is the same as the target model, we further check - whether all speculative tokens are accepted. - """ - ensure_all_accepted = per_test_common_llm_kwargs.get( - "model_name") == test_llm_kwargs.get("speculative_config")["model"] - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - prompt_logprobs=2, - logprobs=2, - disable_logprobs=False, - temperature=0.0, - ensure_all_accepted=ensure_all_accepted) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. 
- "disable_log_stats": False, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [ - # Try two different tiny base models. - # Note that one is equal to the draft model, another isn't. - { - "model_name": "JackFram/llama-68m", - }, - { - "model_name": "JackFram/llama-160m", - }, - ]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use small output len for fast test. - 256, - ]) -@pytest.mark.parametrize("batch_size", [64]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_spec_decode_e2e_greedy_correctness_tiny_model_large_bs( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality on a tiny model and large batch size. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [ - # Try two different tiny base models. - # Note that one is equal to the draft model, another isn't. - { - "model_name": "JackFram/llama-68m", - }, - { - "model_name": "JackFram/llama-160m", - }, - ]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - }, -]) -@pytest.mark.parametrize("max_output_len", [ - 256, -]) -@pytest.mark.parametrize("batch_size", [32]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_spec_decode_e2e_greedy_correctness_tiny_model_large_bs_diff_output_len( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, - max_output_len: int, seed: int): - """Verify greedy equality on a tiny model, with a large batch size, and when - sampling respects the EOS token. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len, - seed=seed, - temperature=0.0, - ignore_eos=False) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # A "real" model (not tiny). - "model_name": "meta-llama/Llama-2-7b-chat-hf", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. 
- "disable_log_stats": False, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - }, -]) -@pytest.mark.parametrize("batch_size", [1]) -@pytest.mark.parametrize( - "output_len", - [ - # Use decently long output len for a high quality test. - 256, - ]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_spec_decode_e2e_greedy_correctness_real_model_bs1( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality on a "real" model and batch size of 1. This is - separate from large BS tests to make identifying the source of bugs easier. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # A "real" model (not tiny). - "model_name": "meta-llama/Llama-2-7b-chat-hf", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - }, -]) -@pytest.mark.parametrize("batch_size", [32]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 64, - ]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_spec_decode_e2e_greedy_correctness_real_model_large_bs( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality with a "real" model on a nontrivial batch size. - This is the closest test to a real production workload. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "block_size": 16, - # 2 for small prompt, 256//8 for generated. - "num_gpu_blocks_override": 2 + 256 // 8, - "max_model_len": (2 + 256 // 8) * 8, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - # The original model is float32, keep it for numerical stability. 
- "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [ - { - "model_name": "JackFram/llama-160m", - }, -]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use small output len for fast test. - 256, - ]) -@pytest.mark.parametrize("batch_size", [4]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_spec_decode_e2e_greedy_correctness_with_preemption( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality, even when some sequences are preempted mid- - generation. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-160m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize( - "per_test_common_llm_kwargs", - [ - # https://github.com/triton-lang/triton/issues/2266 tl.dot - # doesn't support embedding < 16 - { - "block_size": 16, - }, - { - "block_size": 32, - }, - ]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - }, -]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_spec_decode_different_block_size(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - seed: int): - """Verify greedy equality over different block sizes. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-160m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - { - - # Artificially limit the draft model max model len; this forces vLLM - # to skip speculation once the sequences grow beyond 32-k tokens. 
- "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "max_model_len": 32, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "max_model_len": 32, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4, - }, - ]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize( - "output_len", - [ - # This must be a good bit larger than speculative_max_model_len so that - # we can test the case where all seqs are skipped, but still small to - # ensure fast test. - 64, - ]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_skip_speculation(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality when some (or all) sequences skip speculation. - We do this by setting the max model len of the draft model to an - artificially low value, such that when the sequences grow beyond it, they - are skipped in speculative decoding. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-160m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "disable_by_batch_size": 2, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": 5, - "disable_by_batch_size": 2, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4, - }, -]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize("output_len", [10]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_disable_speculation(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality when all sequences disable speculation. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": k, - }, - "enable_chunked_prefill": False, - } - # Try a range of common k, as well as large speculation. 
- for k in [1, 2, 3, 4, 5, 6, 7, 8, 9, 63] - ] + [{ - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": k, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4, - } for k in [1, 2, 3, 4, 5, 6, 7, 8, 9, 63]]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_many_k(vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, - output_len: int, seed: int): - """Verify that speculative decoding produces exact equality to without spec - decode with many different values of k. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-160m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - { - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": k, - "acceptance_method": "typical_acceptance_sampler", - }, - "enable_chunked_prefill": False - } - # Try a range of common k. - for k in [1, 2, 3] - ] + [{ - "speculative_config": { - "model": "JackFram/llama-68m", - "num_speculative_tokens": k, - "acceptance_method": "typical_acceptance_sampler", - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - } for k in [1, 2, 3]]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -@create_new_process_for_each_test() -def test_typical_acceptance_sampling(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - seed: int): - """Verify that speculative decoding produces exact equality to without spec - decode with TypicalAcceptanceSampler as the draft token acceptance - sampling method. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) diff --git a/tests/spec_decode/e2e/test_ngram_correctness.py b/tests/spec_decode/e2e/test_ngram_correctness.py deleted file mode 100644 index 58d1a6ca7ad..00000000000 --- a/tests/spec_decode/e2e/test_ngram_correctness.py +++ /dev/null @@ -1,392 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""This docstring details important information on the testing methodology. - -Most of the tests rely on "greedy equality", where we expect the output of -speculative decoding on a sequence to exactly match the output of normal non- -speculative decoding. 
- -Since speculative decoding with rejection sampling guarantees that the output -distribution matches the target model's output distribution (up to hardware -numerics, see https://arxiv.org/pdf/2302.01318.pdf), we can expect greedy -equality. - -For ngram lookup, its idea comes from https://github.com/apoorvumang/prompt-lookup-decoding, -and is merged into transform code base: https://github.com/huggingface/transformers/pull/27775. -Since there is no model is needed for generate the proposal, we could make -the testcase much simpler than drafter multi-step one. - -However, we still need to verify below scenario could be passed: - * Batch size 1 greedy equality - * Batch size >1 greedy equality - * Test greedy equality under preemption - * Test greedy equality under various ngram sizes / speculative sizes - -With those tests, we can say at least, ngram spec would not break the -correctness for the target model outputs. -""" - -import pytest - -from ..utils import maybe_enable_chunked_prefill -from .conftest import run_equality_correctness_test - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [ - { - "model_name": "JackFram/llama-68m", - }, -]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - "disable_mqa_scorer": False, - }, - }, - { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - "disable_mqa_scorer": True, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 256, -]) -@pytest.mark.parametrize("batch_size", [1, 32]) -@pytest.mark.parametrize("prefill_chunk_size", [-1, 4]) -@pytest.mark.parametrize("seed", [1]) -def test_ngram_e2e_greedy_correctness(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, - prefill_chunk_size: int, seed: int): - """Verify greedy equality on a tiny model with different batch size.""" - maybe_enable_chunked_prefill(prefill_chunk_size, common_llm_kwargs) - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # Print spec metrics. - "disable_log_stats": False, - - # The original model is float32, keep it for numerical stability. 
- "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [ - { - "model_name": "JackFram/llama-68m", - }, -]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - "disable_logprobs": False, - }, - }, - { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - "disable_logprobs": True, - }, - }, -]) -@pytest.mark.parametrize("output_len", [ - 8, -]) -@pytest.mark.parametrize("batch_size", [8]) -@pytest.mark.parametrize("seed", [1]) -@pytest.mark.parametrize("logprobs", [1, 6]) -def test_ngram_e2e_greedy_logprobs(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, - batch_size: int, output_len: int, seed: int, - logprobs: int): - """Verify greedy equality on a tiny model with different batch size.""" - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0, - logprobs=logprobs, - prompt_logprobs=logprobs, - disable_logprobs=test_llm_kwargs["speculative_config"] - ["disable_logprobs"]) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "block_size": 16, - # 2 for small prompt, 256//8 for generated. - "num_gpu_blocks_override": 2 + 256 // 8, - "max_model_len": (2 + 256 // 8) * 8, - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [ - { - "model_name": "JackFram/llama-160m", - }, -]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [ - { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - }, - "enable_chunked_prefill": False, - }, - { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - "disable_mqa_scorer": True, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 - }, -]) -@pytest.mark.parametrize( - "output_len", - [ - # Use small output len for fast test. - 256, - ]) -@pytest.mark.parametrize("batch_size", [4]) -@pytest.mark.parametrize("seed", [1]) -def test_ngram_e2e_greedy_correctness_with_preemption( - vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs, - baseline_llm_kwargs, test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify greedy equality, even when some sequences are preempted mid- - generation. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - temperature=0, - seed=seed) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. 
- "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize( - "test_llm_kwargs", - [ - { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": k, - "prompt_lookup_max": 3, - }, - } - # Try a range of common k, as well as large speculation. - for k in [1, 3, 5] - ] + [ - { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": k, - "prompt_lookup_max": 1, - }, - } - # Try a range of common k, as well as large speculation. - for k in [1, 3, 5] - ]) -@pytest.mark.parametrize("batch_size", [2]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_ngram_different_k(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify that ngram speculative decoding produces exact equality - to without spec decode with many different values of k and - different ngram prompt_lookup_max. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. - "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - "disable_by_batch_size": 4 - }, -}, { - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - "disable_by_batch_size": 4, - "disable_mqa_scorer": True, - }, - "enable_chunked_prefill": True, - "max_num_batched_tokens": 4, - "max_num_seqs": 4 -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_ngram_disable_queue(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify that ngram speculative decoding produces exact equality - to without spec decode with many different values of k and - different ngram prompt_lookup_max. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # The original model is float32, keep it for numerical stability. 
- "dtype": "float32", - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{}]) -@pytest.mark.parametrize("test_llm_kwargs", [{ - "speculative_config": { - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 3, - "disable_mqa_scorer": True, - }, -}]) -@pytest.mark.parametrize("batch_size", [1, 5]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. - 32, - ]) -@pytest.mark.parametrize("seed", [1]) -def test_ngram_scorer(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, output_len: int, - seed: int): - """Verify that ngram speculative decoding generates the same output - with batch expansion scorer and mqa scorer. - """ - run_equality_correctness_test(vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - seed=seed, - temperature=0.0) diff --git a/tests/spec_decode/e2e/test_seed.py b/tests/spec_decode/e2e/test_seed.py deleted file mode 100644 index 4cf373809db..00000000000 --- a/tests/spec_decode/e2e/test_seed.py +++ /dev/null @@ -1,70 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import pytest - -from .conftest import run_equality_correctness_test - -# main model -MAIN_MODEL = "JackFram/llama-68m" - -# speculative model -SPEC_MODEL = "JackFram/llama-160m" - - -@pytest.mark.parametrize( - "common_llm_kwargs", - [{ - "model_name": "JackFram/llama-68m", - - # Skip cuda graph recording for fast test. - "enforce_eager": True, - - # speculative config - "speculative_config": { - "model": "JackFram/llama-160m", - "num_speculative_tokens": 3, - }, - }]) -@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}]) -@pytest.mark.parametrize("baseline_llm_kwargs", [{"seed": 1}]) -@pytest.mark.parametrize("test_llm_kwargs", [{"seed": 5}]) -@pytest.mark.parametrize("batch_size", [1, 8, 32]) -@pytest.mark.parametrize("temperature", [0.1, 1.0]) -@pytest.mark.parametrize( - "output_len", - [ - # Use smaller output len for fast test. 
- 20, - ]) -def test_seeded_consistency(vllm_runner, common_llm_kwargs, - per_test_common_llm_kwargs, baseline_llm_kwargs, - test_llm_kwargs, batch_size: int, - temperature: float, output_len: int): - """Verify outputs are consistent across multiple runs with same seed - """ - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - temperature=temperature, - disable_seed=False, - ) - - # Ensure this same test does fail if we _don't_ include per-request seeds - with pytest.raises(AssertionError): - run_equality_correctness_test( - vllm_runner, - common_llm_kwargs, - per_test_common_llm_kwargs, - baseline_llm_kwargs, - test_llm_kwargs, - batch_size, - max_output_len=output_len, - temperature=temperature, - disable_seed=True, - ) diff --git a/tests/spec_decode/test_batch_expansion.py b/tests/spec_decode/test_batch_expansion.py deleted file mode 100644 index d20c549b090..00000000000 --- a/tests/spec_decode/test_batch_expansion.py +++ /dev/null @@ -1,110 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import pytest -import torch - -from vllm.spec_decode.batch_expansion import BatchExpansionTop1Scorer - -from .utils import create_seq_group_metadata_from_prompts, mock_worker - - -@pytest.mark.parametrize('num_target_seq_ids', [100]) -@pytest.mark.skip_global_cleanup -def test_create_target_seq_id_iterator(num_target_seq_ids: int): - """Verify all new sequence ids are greater than all input - seq ids. - """ - scorer = BatchExpansionTop1Scorer(mock_worker(), 'cuda:0', 32_000) - - all_seq_ids = [ - [1, 3, 5, 7], - list(range(100)) + [0], - [100], - ] - - for seq_ids in all_seq_ids: - max_seq_id = max(seq_ids) - iterator = scorer._create_target_seq_id_iterator(seq_ids) # pylint: disable=protected-access - for _ in range(num_target_seq_ids): - assert next(iterator) > max_seq_id - - -@pytest.mark.parametrize('k', [1, 2, 6]) -@pytest.mark.skip_global_cleanup -def test_get_token_ids_to_score(k: int): - """Verify correct tokens are selected for scoring. - """ - proposal_token_ids = torch.tensor( - list(range(k)), - dtype=torch.int64, - device='cuda', - ) - - expected_output: list[list[int]] = [ - [], - ] - for i in range(proposal_token_ids.shape[0]): - expected_output.append(proposal_token_ids[:i + 1].tolist()) - - scorer = BatchExpansionTop1Scorer(mock_worker(), 'cuda:0', 32_000) - actual_output = scorer._get_token_ids_to_score(proposal_token_ids.tolist()) # pylint: disable=protected-access - - actual_output = [ - x.tolist() if isinstance(x, torch.Tensor) else x for x in actual_output - ] - - assert actual_output == expected_output - - -@pytest.mark.parametrize('k', [1, 2, 6]) -@pytest.mark.skip_global_cleanup -def test_create_single_target_seq_group_metadata(k: int): - """Verify correct creation of a batch-expanded seq group metadata. 
- """ - - prompt_tokens = [1, 2, 3] - prev_output_tokens = [4, 5, 6] - - token_ids = list(range(k)) - - num_tokens_processed = len(prompt_tokens) + len(prev_output_tokens) - 1 - - final_seq_len = len(prompt_tokens) + len(prev_output_tokens) + len( - token_ids) - - block_size = 32 - input_seq_group_metadata = create_seq_group_metadata_from_prompts( - [prompt_tokens], 2048 // block_size, block_size, [final_seq_len], - [prev_output_tokens], [num_tokens_processed])[0] - - input_seq_id = list(input_seq_group_metadata.seq_data.keys())[0] - target_seq_id = 100 - - scorer = BatchExpansionTop1Scorer(mock_worker(), 'cuda:0', 32_000) - output = scorer._create_single_target_seq_group_metadata( # pylint: disable=protected-access - input_seq_group_metadata, - input_seq_id, - target_seq_id, - token_ids, - input_seq_group_metadata.sampling_params, - ) - - assert output.request_id == input_seq_group_metadata.request_id - assert output.sampling_params.repetition_penalty == \ - input_seq_group_metadata.sampling_params.repetition_penalty - assert output.sampling_params.temperature == \ - input_seq_group_metadata.sampling_params.temperature - assert output.sampling_params.top_p == \ - input_seq_group_metadata.sampling_params.top_p - assert output.sampling_params.top_k == \ - input_seq_group_metadata.sampling_params.top_k - assert len(output.seq_data) == 1 - assert output.seq_data[target_seq_id].get_prompt_token_ids() == tuple( - prompt_tokens) - assert output.seq_data[target_seq_id].get_output_token_ids() == tuple( - prev_output_tokens + token_ids) - - assert len(output.block_tables) == 1 - assert output.block_tables[ - target_seq_id] == input_seq_group_metadata.block_tables[input_seq_id] diff --git a/tests/spec_decode/test_dynamic_spec_decode.py b/tests/spec_decode/test_dynamic_spec_decode.py deleted file mode 100644 index 407786ad3c6..00000000000 --- a/tests/spec_decode/test_dynamic_spec_decode.py +++ /dev/null @@ -1,90 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from unittest.mock import MagicMock, patch - -import pytest -import torch - -from vllm.sequence import ExecuteModelRequest -from vllm.spec_decode.metrics import AsyncMetricsCollector -from vllm.spec_decode.multi_step_worker import MultiStepWorker -from vllm.spec_decode.spec_decode_worker import SpecDecodeWorker -from vllm.spec_decode.top1_proposer import Top1Proposer - -from .test_utils import mock_spec_decode_sampler -from .utils import create_batch, mock_worker - - -@pytest.mark.parametrize('queue_size', [4]) -@pytest.mark.parametrize('batch_size', [1]) -@pytest.mark.parametrize('k', [1]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_disable_spec_tokens(queue_size: int, batch_size: int, k: int, - acceptance_sampler_method: str): - """Verify that speculative tokens are disabled when the batch size - exceeds the threshold. 
- """ - disable_by_batch_size = 3 - draft_worker = mock_worker(cls=MultiStepWorker) - target_worker = mock_worker() - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - worker = SpecDecodeWorker(proposer_worker=draft_worker, - scorer_worker=target_worker, - spec_decode_sampler=mock_spec_decode_sampler( - acceptance_sampler_method), - disable_logprobs=False, - metrics_collector=metrics_collector, - disable_by_batch_size=disable_by_batch_size) - - exception_secret = 'artificial stop' - draft_worker.get_spec_proposals.side_effect = ValueError(exception_secret) - - seq_group_metadata_list, _, _ = create_batch(batch_size, k) - execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k, - running_queue_size=queue_size) - - if queue_size > disable_by_batch_size: - with patch.object(worker, - '_run_no_spec', - side_effect=ValueError(exception_secret)), \ - pytest.raises(ValueError, match=exception_secret): - worker.execute_model(execute_model_req=execute_model_req) - - # When the batch size is larger than the threshold, - # we expect no speculative tokens (0). - expected_num_spec_tokens = None if queue_size < disable_by_batch_size else 0 - assert seq_group_metadata_list[ - 0].num_speculative_tokens == expected_num_spec_tokens - - draft_worker.sampler_output.side_effect = ValueError(exception_secret) - - proposer = Top1Proposer( - worker=draft_worker, - device='cpu', # not used - vocab_size=100, # not used - # Must be long enough to avoid being skipped due to length. - max_proposal_len=1024, - ) - - if queue_size < disable_by_batch_size: - # Should raise exception when executing the mocked draft model. - with pytest.raises(ValueError, match=exception_secret): - proposer.get_spec_proposals( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k), - seq_ids_with_bonus_token_in_last_step=set()) - else: - # Should not execute the draft model because spec decode is disabled - # for all requests. Accordingly, the proposal length should be 0. - proposals = proposer.get_spec_proposals( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k), - seq_ids_with_bonus_token_in_last_step=set()) - assert proposals.proposal_lens.tolist() == [0] * batch_size diff --git a/tests/spec_decode/test_memory_usage.py b/tests/spec_decode/test_memory_usage.py deleted file mode 100644 index 5d9dd3f72a7..00000000000 --- a/tests/spec_decode/test_memory_usage.py +++ /dev/null @@ -1,91 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""This docstring details important information on the testing methodology. - -This test verifies that memory usage remains constant (or never grows) when -we enable / disable speculation via --speculative-disable-by-batch-size. - -There are a lot of things we try to keep track of between batches of requests -and if certain tensors are not freed from memory, can result in CUDA ooms. - -This is particularly relevant for production situations where speculation might -be enabled during off hours, but disabled once traffic peaks during the workday. -Since traffic will stay high for a long period of time, verifying we do not -increase our memory usage over time is essential to prevent possible CUDA ooms. 
-""" - -import torch - -import vllm -from tests.core.utils import create_dummy_prompt -from vllm.sequence import SequenceGroup - -ITERATIONS = 100 -MAIN_MODEL = "JackFram/llama-68m" - -# speculative model -SPEC_MODEL = "abhigoyal/vllm-medusa-llama-68m-random" - -BATCH_SIZE = 5 -SPEC_DISABLE_BATCH_SIZE = 2 - - -def add_seq_group_to_engine(engine: vllm.LLMEngine, seq_group: SequenceGroup): - scheduler = engine.scheduler[0] - scheduler.add_seq_group(seq_group) - - -""" -Since we are using a batch size greater than the disabled batch size, -we can ensure we go through the _no_spec codepath for most of our engine steps. -""" - - -def test_memory_usage_no_spec(): - previous_memory_allocated = None - llm = vllm.LLM(model=MAIN_MODEL, - speculative_config={ - "model": SPEC_MODEL, - "num_speculative_tokens": 3, - "disable_by_batch_size": SPEC_DISABLE_BATCH_SIZE, - }) - - batch_sequences = set() - engine = llm.llm_engine - - for i in range(ITERATIONS): - seq, seq_group = create_dummy_prompt(request_id=str(i), - prompt_length=10, - min_tokens=10, - max_tokens=10) - - add_seq_group_to_engine(engine, seq_group) - - batch_sequences.add(seq) - engine.step() - for seq in list(batch_sequences): - if seq.is_finished(): - batch_sequences.remove(seq) - - # If we aren't at our batch size yet, continue - if len(batch_sequences) <= BATCH_SIZE: - continue - - # Otherwise, loop until at least one request is done - while not any(seq.is_finished() for seq in batch_sequences): - engine.step() - - # Remove it from the set - for seq in list(batch_sequences): - if seq.is_finished(): - batch_sequences.remove(seq) - - # At this point, we are always at the case where we have finished - # processing some number of requests from the batch after running - # several _no_spec executions. The memory should not have - # increased between the previous time this was recorded and the - # current time. - if previous_memory_allocated is None: - previous_memory_allocated = torch.cuda.memory_allocated() - else: - assert previous_memory_allocated == torch.cuda.memory_allocated() diff --git a/tests/spec_decode/test_metrics.py b/tests/spec_decode/test_metrics.py deleted file mode 100644 index e8de410f8a9..00000000000 --- a/tests/spec_decode/test_metrics.py +++ /dev/null @@ -1,205 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import math -from unittest.mock import MagicMock - -import pytest -import torch - -from vllm.spec_decode.metrics import AsyncMetricsCollector - - -def test_initial_call_returns_none(): - """Expect first call to get metrics to return None. - """ - spec_decode_sampler = MagicMock() - spec_decode_sampler.num_accepted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_emitted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_draft_tokens = 0 - - collector = AsyncMetricsCollector(spec_decode_sampler) - collector.init_gpu_tensors(rank=0) - maybe_metrics = collector.maybe_collect_rejsample_metrics(k=5) - assert maybe_metrics is None - - -def test_second_call_returns_metrics(): - """Expect second call to not return None. 
- """ - spec_decode_sampler = MagicMock() - spec_decode_sampler.num_accepted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_emitted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_draft_tokens = 0 - - collect_interval_s = 5.0 - timer = MagicMock() - timer.side_effect = [ - 0.0, collect_interval_s + 0.1, collect_interval_s + 0.2 - ] - - collector = AsyncMetricsCollector(spec_decode_sampler=spec_decode_sampler, - timer=timer, - collect_interval_s=collect_interval_s) - collector.init_gpu_tensors(rank=0) - _ = collector.maybe_collect_rejsample_metrics(k=5) - metrics = collector.maybe_collect_rejsample_metrics(k=5) - assert metrics is not None - - -@pytest.mark.parametrize("rank", [1, 2, 3, 4]) -def test_nonzero_rank_noop(rank): - """Verify nonzero ranks don't collect metrics. - """ - spec_decode_sampler = MagicMock() - spec_decode_sampler.num_accepted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_emitted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_draft_tokens = 0 - - collector = AsyncMetricsCollector(spec_decode_sampler) - collector.init_gpu_tensors(rank=rank) - _ = collector.maybe_collect_rejsample_metrics(k=5) - metrics = collector.maybe_collect_rejsample_metrics(k=5) - assert metrics is None - - -def test_noop_until_time(): - """Verify metrics aren't collected until enough time passes. - """ - spec_decode_sampler = MagicMock() - spec_decode_sampler.num_accepted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_emitted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_draft_tokens = 0 - - collect_interval_s = 5.0 - timer = MagicMock() - timer.side_effect = [ - 0.0, collect_interval_s - 0.1, collect_interval_s - 0.1, - collect_interval_s + 0.1, collect_interval_s + 0.1 - ] - - collector = AsyncMetricsCollector(spec_decode_sampler=spec_decode_sampler, - timer=timer, - collect_interval_s=collect_interval_s) - collector.init_gpu_tensors(rank=0) - - _ = collector.maybe_collect_rejsample_metrics(k=5) - metrics = collector.maybe_collect_rejsample_metrics(k=5) - assert metrics is None - - _ = collector.maybe_collect_rejsample_metrics(k=5) - metrics = collector.maybe_collect_rejsample_metrics(k=5) - assert metrics is not None - - -def test_timer_is_reset(): - """Verify that the internal timer inside AsyncMetricsCollector - is reset after collection. 
- """ - spec_decode_sampler = MagicMock() - spec_decode_sampler.num_accepted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_emitted_tokens = torch.tensor(0, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_draft_tokens = 0 - - collect_interval_s = 5.0 - timer = MagicMock() - timer.side_effect = [ - 0.0, - collect_interval_s + 0.1, - collect_interval_s + 0.1, - collect_interval_s + 0.2, - collect_interval_s + 0.2, - 2 * collect_interval_s + 0.1, - 2 * collect_interval_s + 0.1, - ] - - collector = AsyncMetricsCollector(spec_decode_sampler=spec_decode_sampler, - timer=timer, - collect_interval_s=collect_interval_s) - collector.init_gpu_tensors(rank=0) - - _ = collector.maybe_collect_rejsample_metrics(k=5) - metrics = collector.maybe_collect_rejsample_metrics(k=5) - assert metrics is not None - - _ = collector.maybe_collect_rejsample_metrics(k=5) - metrics = collector.maybe_collect_rejsample_metrics(k=5) - assert metrics is None - - _ = collector.maybe_collect_rejsample_metrics(k=5) - metrics = collector.maybe_collect_rejsample_metrics(k=5) - assert metrics is not None - - -@pytest.mark.parametrize("has_data", [True, False]) -def test_initial_metrics_has_correct_values(has_data: bool): - """Test correctness of metrics data. - """ - if has_data: - num_accepted_tokens = 103 - num_emitted_tokens = 104 - num_draft_tokens = 105 - else: - num_accepted_tokens = 0 - num_emitted_tokens = 0 - num_draft_tokens = 0 - k = 5 - - max_num_emitted_tokens = AsyncMetricsCollector.get_max_num_emitted_tokens( - num_draft_tokens, k) - - spec_decode_sampler = MagicMock() - spec_decode_sampler.num_accepted_tokens = torch.tensor(num_accepted_tokens, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_emitted_tokens = torch.tensor(num_emitted_tokens, - dtype=torch.long, - device='cuda') - spec_decode_sampler.num_draft_tokens = num_draft_tokens - - collect_interval_s = 5.0 - timer = MagicMock() - timer.side_effect = [ - 0.0, collect_interval_s + 0.1, collect_interval_s + 0.2 - ] - - collector = AsyncMetricsCollector(spec_decode_sampler=spec_decode_sampler, - timer=timer, - collect_interval_s=collect_interval_s) - collector.init_gpu_tensors(rank=0) - _ = collector.maybe_collect_rejsample_metrics(k) - metrics = collector.maybe_collect_rejsample_metrics(k) - - assert metrics.num_spec_tokens == k - assert metrics.accepted_tokens == num_accepted_tokens - assert metrics.draft_tokens == num_draft_tokens - assert metrics.emitted_tokens == num_emitted_tokens - - if has_data: - assert (metrics.draft_acceptance_rate == num_accepted_tokens / - num_draft_tokens) - assert (metrics.system_efficiency == num_emitted_tokens / - max_num_emitted_tokens) - else: - assert math.isnan(metrics.draft_acceptance_rate) - assert math.isnan(metrics.system_efficiency) diff --git a/tests/spec_decode/test_multi_step_worker.py b/tests/spec_decode/test_multi_step_worker.py deleted file mode 100644 index f2d93203b8e..00000000000 --- a/tests/spec_decode/test_multi_step_worker.py +++ /dev/null @@ -1,838 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import random -from unittest.mock import MagicMock - -import pytest -import torch - -from vllm.attention.selector import (_Backend, - global_force_attn_backend_context_manager) -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.model_executor.utils import set_random_seed -from vllm.sequence import (ExecuteModelRequest, HiddenStates, Logprob, - 
get_all_seq_ids) -from vllm.spec_decode.draft_model_runner import TP1DraftModelRunner -from vllm.spec_decode.multi_step_worker import MultiStepWorker -from vllm.spec_decode.top1_proposer import Top1Proposer -from vllm.worker.worker import Worker - -from .utils import (assert_logprobs_dict_allclose, create_batch, - create_seq_group_metadata_from_prompts, create_worker, - patch_execute_model_with_seeds, zero_kv_cache) - - -@pytest.mark.parametrize('num_steps', list(range(1, 17))) -def test_assert_enough_kv_space(num_steps: int): - """Test that the multi step worker checks for sufficient space in the KV - cache. It should throw if it cannot run all the steps. - """ - block_size = 16 - num_gpu_blocks = 2048 // block_size - - prompts = [ - list(range(block_size * 3)), - list(range(block_size * 2)), - ] - - prev_output_tokens = [ - list(range(block_size * 1)), - list(range(block_size * 2)), - ] - - final_prompt_lens = [ - len(prompt + output) + num_steps - for prompt, output in zip(prompts, prev_output_tokens) - ] - - inputs = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - final_prompt_lens, - continuations=prev_output_tokens) - - assert_enough_kv_space = MultiStepWorker._assert_enough_kv_space # pylint: disable=protected-access - worker = MagicMock() - worker.model_runner.block_size = block_size - - for seq_group_metadata in inputs: - original_block_tables = seq_group_metadata.block_tables - - # No exception. - assert_enough_kv_space(worker, inputs, num_steps) - - seq_group_metadata.block_tables = { - seq_id: [] - for seq_id, physical_blocks in original_block_tables.items() - } - - # Expect exception. - with pytest.raises(ValueError, - match='times but found insufficient KV space for'): - assert_enough_kv_space(worker, inputs, num_steps) - - seq_group_metadata.block_tables = original_block_tables - - -@torch.inference_mode() -def test_same_output_for_single_step(): - """Verify the multi step worker produces the same output as the normal - worker for num_steps=1. 
- """ - seed = 100 - model_name = 'JackFram/llama-68m' - - block_size = 32 - num_gpu_blocks = 2048 // block_size - multi_step_worker = create_worker( - MultiStepWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - model_runner_cls=TP1DraftModelRunner, - ) - worker = create_worker( - Worker, - model_name, - block_size, - num_gpu_blocks, - seed, - ) - # multi_step_worker.model_runner = worker.model_runner - # multi_step_worker.cache_engine = worker.cache_engine - - num_steps = 1 - - prompts = [ - [1, 2, 3, 4, 5], - [6, 7, 8, 9, 10], - ] - - final_prompt_lens = [len(prompt) + num_steps for prompt in prompts] - - multi_step_seq_group = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - final_prompt_lens=final_prompt_lens) - - zero_kv_cache(multi_step_worker.cache_engine) - set_random_seed(seed) - actual_output, _ = multi_step_worker.sampler_output( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=multi_step_seq_group), - sample_len=num_steps, - seq_ids_with_bonus_token_in_last_step=set()) - assert len(actual_output) == num_steps - actual_output = actual_output[0] - - single_step_seq_group = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - final_prompt_lens=final_prompt_lens) - - zero_kv_cache(worker.cache_engine) - set_random_seed(seed) - expected_output = worker.execute_model( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=single_step_seq_group))[0] - - actual_token_ids = [ - output.samples[0].output_token for output in actual_output - ] - actual_logprobs = [output.samples[0].logprobs for output in actual_output] - - expected_token_ids = [ - output.samples[0].output_token for output in expected_output - ] - expected_logprobs = [ - output.samples[0].logprobs for output in expected_output - ] - - assert actual_token_ids == expected_token_ids - - print(f'{actual_logprobs=}') - print(f'{expected_logprobs=}') - assert_logprobs_dict_allclose(actual_logprobs, expected_logprobs) - - -@torch.inference_mode() -def test_same_output_for_multi_step(): - """Verify the multi-step worker produces the same output as the normal - worker when num_steps > 1. This test runs the multi-step worker once, and - then runs the worker num_steps times, and compares the output. - """ - seed = 100 - model_name = 'JackFram/llama-68m' - - block_size = 16 - num_gpu_blocks = 2048 // block_size - multi_step_worker = create_worker( - MultiStepWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - ) - - worker = create_worker( - Worker, - model_name, - block_size, - num_gpu_blocks, - seed, - ) - - # Make sure we go over the block boundary. - num_steps = block_size + 1 - - random.seed(seed) - prompts = [[ - random.randint(0, 1000) for _ in range(random.randint(10, 20)) - ] for _ in range(10)] - - final_prompt_lens = [len(prompt) + num_steps for prompt in prompts] - - rand_seeds = list(random.randint(0, 100) for _ in range(num_steps)) - multi_step_worker.execute_model = patch_execute_model_with_seeds( - multi_step_worker, rand_seeds) - worker.execute_model = patch_execute_model_with_seeds(worker, rand_seeds) - - continuations = [[1] for _ in prompts] - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=continuations, - final_prompt_lens=final_prompt_lens) - - # Run multi-step. 
- zero_kv_cache(multi_step_worker.cache_engine) - set_random_seed(seed) - multi_step_output, _ = multi_step_worker.sampler_output( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list), - sample_len=num_steps, - seq_ids_with_bonus_token_in_last_step=set()) - - # Run single-step repeatedly. - zero_kv_cache(worker.cache_engine) - single_step_output: list[SamplerOutput] = [] - continuations = [[1] for _ in prompts] - set_random_seed(seed) - - for _ in multi_step_output: - - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=continuations, - final_prompt_lens=final_prompt_lens) - - single_step_output.extend( - worker.execute_model(execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list))) - - # Append output tokens to new sequence data. - for i, seq_group_output in enumerate(single_step_output[-1]): - continuations[i].append(seq_group_output.samples[0].output_token) - - # Get token ids and logprobs for comparison. - multi_step_output_logprobs: list[list[dict[int, - Logprob]]] = [[] - for _ in prompts] - single_step_output_logprobs: list[list[dict[int, - Logprob]]] = [[] - for _ in prompts] - - multi_step_output_token_ids: list[list[int]] = [[] for _ in prompts] - single_step_output_token_ids: list[list[int]] = [[] for _ in prompts] - for i, _ in enumerate(prompts): - for multi_step, single_step in zip(multi_step_output, - single_step_output): - multi_step_output_token_ids[i].append( - multi_step[i].samples[0].output_token) - single_step_output_token_ids[i].append( - single_step[i].samples[0].output_token) - - multi_step_output_logprobs[i].append( - multi_step[i].samples[0].logprobs) - single_step_output_logprobs[i].append( - single_step[i].samples[0].logprobs) - - # Print per-sequence token ids - for i, (multi_step_tokens, single_step_tokens) in enumerate( - zip(multi_step_output_token_ids, single_step_output_token_ids)): - print(f'{i=} {multi_step_tokens=}') - print(f'{i=} {single_step_tokens=}') - print(f'{i=} equal {multi_step_tokens == single_step_tokens}') - - # Assert token ids are equal. - for multi_step_tokens, single_step_tokens in zip( - multi_step_output_token_ids, single_step_output_token_ids): - assert multi_step_tokens == single_step_tokens - - # Assert logprobs are equal. - for multi_step_logprobs, single_step_logprobs in zip( - multi_step_output_logprobs, single_step_output_logprobs): - assert_logprobs_dict_allclose(multi_step_logprobs, - single_step_logprobs) - - -@torch.inference_mode() -def test_multi_step_with_batch_expansion_correct_output(): - """ - In this test we verify that the MultiStepWorker is able to handle bonus - tokens correctly. The test verifies that if a sequence has a - bonus token then the MultiStepWorker is able to expand the batch by adding - new sequences corresponding to the sequences with bonus tokens. The - expanded batch is then used for predicting the next tokens. 
- """ - seed = 100 - model_name = 'JackFram/llama-68m' - - block_size = 16 - num_gpu_blocks = 2048 // block_size - batch_size = 128 - multi_step_worker = create_worker( - MultiStepWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - model_runner_cls=TP1DraftModelRunner, - ) - multi_step_worker.set_include_gpu_probs_tensor() - worker = create_worker( - Worker, - model_name, - block_size, - num_gpu_blocks, - seed, - ) - random.seed(seed) - prompts = [[0] for _ in range(batch_size)] - num_steps = 2 - final_prompt_lens = [(num_steps + 1) for prompt in prompts] - rand_seeds = list(random.randint(0, 100) for _ in range(num_steps)) - multi_step_worker.execute_model = patch_execute_model_with_seeds( - multi_step_worker, rand_seeds) - worker.execute_model = patch_execute_model_with_seeds(worker, rand_seeds) - # Create the test continuations - continuations = [[random.randint(0, 1000)] for _ in prompts] - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=continuations, - final_prompt_lens=final_prompt_lens) - - # Run single-step twice to generate 2 tokens. This - # will simulate the bonus token case with the second token - # being the bonus token. - zero_kv_cache(worker.cache_engine) - single_step_output: list[SamplerOutput] = [] - set_random_seed(seed) - for _ in range(num_steps): - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=continuations, - final_prompt_lens=final_prompt_lens) - single_step_output.extend( - worker.execute_model(execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list))) - # Append output tokens to new sequence data. - for i, seq_group_output in enumerate(single_step_output[-1]): - continuations[i].append(seq_group_output.samples[0].output_token) - - # Create continuations for the MultiStepWorker. The continuations have - # 2 tokens in order to simulate the bonus token case. - multi_step_continuations = [] - for continuation in continuations: - multi_step_continuations.append(continuation[:2]) - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=multi_step_continuations, - final_prompt_lens=final_prompt_lens) - - # Run multi-step and verify that the third token prediction is accurate - # for all sequences. - zero_kv_cache(multi_step_worker.cache_engine) - all_seq_ids = {i for i in range(batch_size)} - multi_step_output, _ = multi_step_worker.sampler_output( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list), - sample_len=1, - seq_ids_with_bonus_token_in_last_step=all_seq_ids) - for index, output in enumerate(multi_step_output[-1].outputs): - assert (continuations[index][-1] == output.samples[0].output_token) - - -@torch.inference_mode() -def test_multi_step_with_batch_expansion_incorrect_output(): - """ - Tests the MultiStepWorker's ability to handle batch expansion with bonus - tokens in a negative case scenario. This test provides the MultiStepWorker - with a batch containing sequences with bonus tokens but specifies the - sequence IDs with bonus tokens incorrectly. The test verifies that the - MultiStepWorker generates correct tokens for the sequences where the - sequence ID is specified correctly and incorrect tokens for those where - the sequence ID is specified incorrectly. 
- """ - seed = 100 - model_name = 'JackFram/llama-68m' - - block_size = 16 - num_gpu_blocks = 2048 // block_size - batch_size = 128 - multi_step_worker = create_worker( - MultiStepWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - model_runner_cls=TP1DraftModelRunner, - ) - multi_step_worker.set_include_gpu_probs_tensor() - worker = create_worker( - Worker, - model_name, - block_size, - num_gpu_blocks, - seed, - ) - random.seed(seed) - prompts = [[0] for _ in range(batch_size)] - num_steps = 2 - final_prompt_lens = [(num_steps + 1) for prompt in prompts] - rand_seeds = list(random.randint(0, 100) for _ in range(num_steps)) - multi_step_worker.execute_model = patch_execute_model_with_seeds( - multi_step_worker, rand_seeds) - worker.execute_model = patch_execute_model_with_seeds(worker, rand_seeds) - # Create the test continuations - continuations = [[random.randint(0, 1000)] for _ in prompts] - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=continuations, - final_prompt_lens=final_prompt_lens) - # Run single-step twice to generate 2 tokens. This - # will simulate the bonus token case with the second token - # being the bonus token. - zero_kv_cache(worker.cache_engine) - single_step_output: list[SamplerOutput] = [] - set_random_seed(seed) - for _ in range(num_steps): - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=continuations, - final_prompt_lens=final_prompt_lens) - single_step_output.extend( - worker.execute_model(execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list))) - # Append output tokens to new sequence data. - for i, seq_group_output in enumerate(single_step_output[-1]): - continuations[i].append(seq_group_output.samples[0].output_token) - - # Create continuations for the MultiStepWorker. The continuations have - # 2 tokens in order to simulate the bonus token case. - multi_step_continuations = [] - for continuation in continuations: - multi_step_continuations.append(continuation[:2]) - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=multi_step_continuations, - final_prompt_lens=final_prompt_lens) - - # Run multi-step. In this run INCORRECTLY specify that only the odd number - # sequences have bonus tokens. Verify that with this setting the third token - # prediction is accurate only for the odd numbered sequences. Also verify - # that the prediction might be wrong for some of the even numbered - # sequences. - zero_kv_cache(multi_step_worker.cache_engine) - set_random_seed(seed) - odd_seq_ids = {i for i in range(batch_size) if i % 2 != 0} - multi_step_output, _ = multi_step_worker.sampler_output( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list), - sample_len=1, - seq_ids_with_bonus_token_in_last_step=odd_seq_ids) - num_mismatch = 0 - for index, output in enumerate(multi_step_output[-1].outputs): - if (index % 2) != 0: - assert (continuations[index][-1] == output.samples[0].output_token) - elif (continuations[index][-1] != output.samples[0].output_token): - num_mismatch += 1 - # The prediction is accurate for some of the sequences even without proper - # handling of the bonus tokens. Hence verify that the number of sequences - # for which there is a mismatch is > 0. 
- assert (num_mismatch > 0) - - -@torch.inference_mode() -@pytest.mark.parametrize('num_steps', [1, 2, 3, 4]) -# The choice of backends forces the multi_step_worker to choose between -# the vanilla model_runner and TP1DraftModelRunner and that we can test -# both code paths. -@pytest.mark.parametrize('attn_backend', - [_Backend.XFORMERS, _Backend.FLASH_ATTN]) -def test_multi_step_correct_kvcache(num_steps, attn_backend): - """Verify that the KV cache of the draft model - is correctly updated for sequences with bonus token. - """ - seed = 100 - model_name = "JackFram/llama-68m" - - block_size = 16 - num_gpu_blocks = 2048 // block_size - batch_size = 1 - - with global_force_attn_backend_context_manager(attn_backend): - dtype = 'float16' if attn_backend == _Backend.FLASH_ATTN else 'float32' - multi_step_worker = create_worker(MultiStepWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - model_runner_cls=TP1DraftModelRunner, - dtype=dtype) - multi_step_worker.set_include_gpu_probs_tensor() - worker = create_worker(Worker, - model_name, - block_size, - num_gpu_blocks, - seed, - dtype=dtype) - - prompts = [[0] for _ in range(batch_size)] - # Already generate two tokens for the sequence - # so that we can simulate the bonus token case - multi_step_continuations = [[ - random.randint(0, 1000), - random.randint(0, 1000) - ] for _ in prompts] - final_prompt_lens = [len(prompt) + 2 + num_steps for prompt in prompts] - - seq_ids_with_bonus_token_in_last_step = set(range(batch_size)) - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=multi_step_continuations, - final_prompt_lens=final_prompt_lens) - - # Run multi-step. - zero_kv_cache(multi_step_worker.cache_engine) - multi_step_worker.sampler_output(execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list), - sample_len=num_steps, - seq_ids_with_bonus_token_in_last_step= - seq_ids_with_bonus_token_in_last_step) - - # Run single-step repeatedly. - zero_kv_cache(worker.cache_engine) - # Generate the kv cache for the bonus token first - single_step_continuations = [c[:1] for c in multi_step_continuations] - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=single_step_continuations, - final_prompt_lens=final_prompt_lens) - single_step_output = worker.execute_model( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list)) - for _ in range(num_steps): - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - continuations=multi_step_continuations, - final_prompt_lens=final_prompt_lens) - - single_step_output = worker.execute_model( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list)) - - for i, seq_group_output in enumerate(single_step_output[-1]): - multi_step_continuations[i].append( - seq_group_output.samples[0].output_token) - - # Verify that the KV cache of the single-step and - # multi-step workers are the same. 
- single_step_gpu_cache = worker.cache_engine[0].gpu_cache - multi_step_gpu_cache = multi_step_worker.cache_engine[0].gpu_cache - num_layers = len(single_step_gpu_cache) - allclose = lambda a, b: torch.allclose( - a.cuda(), b.cuda(), rtol=1e-2, atol=1e-2) - for i in range(num_layers): - assert allclose(single_step_gpu_cache[i][0], - multi_step_gpu_cache[i][0]) - assert allclose(single_step_gpu_cache[i][1], - multi_step_gpu_cache[i][1]) - - -@torch.inference_mode() -def test_draft_proposals_full_speculation_len(): - """Verify Top1Proposer correctly handles case where all sequences - can speculate. - """ - k = 10 - batch_size = 32 - vocab_size = 32_000 - device = 'cuda:0' - - draft_worker = MagicMock() - proposer = Top1Proposer( - worker=draft_worker, - device=device, - vocab_size=vocab_size, - max_proposal_len=2048, - ) - draft_worker.sampler_output.return_value = [ - SamplerOutput( - outputs=[], - sampled_token_probs=torch.rand(batch_size, - vocab_size, - device=device, - dtype=torch.float32), - logprobs=torch.rand(batch_size, - vocab_size, - device=device, - dtype=torch.float32), - sampled_token_ids=torch.randint(low=0, - high=vocab_size, - size=(batch_size, ), - device=device, - dtype=torch.long), - ) for _ in range(k) - ], True - - seq_group_metadata_list, _, _ = create_batch(batch_size, k) - - proposals = proposer.get_spec_proposals( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k), - seq_ids_with_bonus_token_in_last_step=set()) - - assert torch.is_tensor(proposals.proposal_token_ids) - assert torch.is_tensor(proposals.proposal_probs) - - assert proposals.proposal_token_ids.shape == torch.Size([batch_size, k]) - assert proposals.proposal_probs.shape[:-1] == torch.Size([batch_size, k]) - - assert proposals.proposal_lens.shape == torch.Size([batch_size]) - assert proposals.proposal_lens.tolist() == [k for _ in range(batch_size)] - - -@torch.inference_mode() -def test_draft_proposals_no_speculations(): - """Verify Top1Proposer correctly handles case where no sequences - can speculate. - """ - k = 10 - batch_size = 32 - vocab_size = 32_000 - device = 'cuda:0' - prompt_len = 10 - - draft_worker = MagicMock() - proposer = Top1Proposer( - worker=draft_worker, - device=device, - vocab_size=vocab_size, - max_proposal_len=prompt_len + k - 1, - ) - - seq_group_metadata_list, _, _ = create_batch(batch_size, - k, - prompt_len=prompt_len) - - proposals = proposer.get_spec_proposals( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k), - seq_ids_with_bonus_token_in_last_step=set()) - - assert torch.is_tensor(proposals.proposal_token_ids) - assert torch.is_tensor(proposals.proposal_probs) - - assert proposals.proposal_token_ids.shape == torch.Size([batch_size, k]) - assert proposals.proposal_probs.shape[:-1] == torch.Size([batch_size, k]) - - assert proposals.proposal_lens.shape == torch.Size([batch_size]) - assert proposals.proposal_lens.tolist() == [0 for _ in range(batch_size)] - - -@torch.inference_mode() -def test_draft_proposals_mixed_k(): - """Verify Top1Proposer correctly handles case some sequences can - speculate and some can't. 
- """ - k = 10 - batch_size = 32 - vocab_size = 32_000 - device = 'cuda:0' - - small_prompt_len = 5 - long_prompt_len = 10 - prev_output_token_len = 20 - - expected_num_proposal_seqs = 6 - expected_num_no_proposal_seqs = batch_size - expected_num_proposal_seqs - - prompt_len = [ - small_prompt_len for _ in range(expected_num_proposal_seqs - 1) - ] + [long_prompt_len - for _ in range(expected_num_no_proposal_seqs)] + [small_prompt_len] - - draft_worker = MagicMock() - proposer = Top1Proposer( - worker=draft_worker, - device=device, - vocab_size=vocab_size, - max_proposal_len=long_prompt_len + prev_output_token_len + k - 1, - ) - - draft_worker.sampler_output.return_value = [ - SamplerOutput( - outputs=[], - sampled_token_probs=torch.rand(expected_num_proposal_seqs, - vocab_size, - device=device, - dtype=torch.float32), - logprobs=torch.rand(expected_num_proposal_seqs, - vocab_size, - device=device, - dtype=torch.float32), - sampled_token_ids=torch.randint( - low=0, - high=vocab_size, - size=(expected_num_proposal_seqs, ), - device=device, - dtype=torch.long), - ) for _ in range(k) - ], True - - seq_group_metadata_list, _, _ = create_batch( - batch_size, - k, - prompt_len=prompt_len, - prev_output_token_len=prev_output_token_len, - ) - - proposals = proposer.get_spec_proposals( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k), - seq_ids_with_bonus_token_in_last_step=set()) - - assert torch.is_tensor(proposals.proposal_token_ids) - assert torch.is_tensor(proposals.proposal_probs) - - assert proposals.proposal_token_ids.shape == torch.Size([batch_size, k]) - assert proposals.proposal_probs.shape[:-1] == torch.Size([batch_size, k]) - - assert proposals.proposal_lens.shape == torch.Size([batch_size]) - assert proposals.proposal_lens.tolist() == [ - k for _ in range(expected_num_proposal_seqs - 1) - ] + [0 for _ in range(expected_num_no_proposal_seqs)] + [k] - - -@torch.inference_mode() -def test_use_draft_model_runner_advance_step(): - """Verify that draft model runner triggers advance step - when applicable. - """ - seed = 100 - model_name = 'JackFram/llama-68m' - - k = 5 - batch_size = 32 - block_size = 32 - num_gpu_blocks = 2048 // block_size - worker = create_worker( - MultiStepWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - model_runner_cls=TP1DraftModelRunner, - ) - - # Mock "_gpu_advance_step" to raise an exception when called. - exception_secret = "artificial stop" - worker.model_runner._gpu_advance_step = MagicMock() - worker.model_runner._gpu_advance_step.side_effect = ValueError( - exception_secret) - - seq_group_metadata_list, _, _ = create_batch(batch_size, - k, - block_size=block_size, - num_gpu_blocks=num_gpu_blocks) - - # Fallback (should not call) when num_steps=1. - execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k, - num_steps=1) - worker.execute_model(execute_model_req=execute_model_req) - - # Expect exception if _gpu_advance_step is called. 
- execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k, - num_steps=k) - - with pytest.raises(ValueError, match=exception_secret): - worker.execute_model(execute_model_req=execute_model_req) - call_args_list = worker.model_runner._gpu_advance_step.call_args_list - assert len(call_args_list) == 1 - - -@torch.inference_mode() -def test_expand_execute_model_request_sync_with_expand_hidden_states(): - """ - In this test we verify that the logic for expanding the - seq_group_metadata_list remains in sync with the expansion logic of - the HiddenStates in _expand_execute_model_request. - """ - k = 5 - batch_size = 16 - seq_with_bonus_token_in_last_step = [1, 3, 8, 10, 13, 15] - - seq_group_metadata_list, _, _ = create_batch(batch_size, k) - - execute_model_request = ExecuteModelRequest( - seq_group_metadata_list, - previous_hidden_states=HiddenStates( - torch.arange(batch_size), seq_group_metadata_list, - torch.arange(batch_size, 2 * batch_size))) - - expanded_execute_model_request, orig_seq_group_ids = MultiStepWorker.\ - _expand_execute_model_request(execute_model_request, - seq_with_bonus_token_in_last_step) - - all_seq_ids = torch.tensor( - get_all_seq_ids( - expanded_execute_model_request.seq_group_metadata_list)) - ref_expanded_hidden_states = all_seq_ids + batch_size - ref_expanded_hidden_states[orig_seq_group_ids] -= batch_size - - assert (ref_expanded_hidden_states == expanded_execute_model_request. - previous_hidden_states.hidden_states).all().item() diff --git a/tests/spec_decode/test_ngram_worker.py b/tests/spec_decode/test_ngram_worker.py deleted file mode 100644 index 8a7c1148568..00000000000 --- a/tests/spec_decode/test_ngram_worker.py +++ /dev/null @@ -1,221 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import torch - -from vllm.sequence import ExecuteModelRequest -from vllm.spec_decode.ngram_worker import NGramWorker -from vllm.spec_decode.top1_proposer import Top1Proposer - -from .utils import create_seq_group_metadata_from_prompts, create_worker - - -def test_ngram_algo_correctness_for_single_no_match(): - """Verify our ngram algo find the right candidate in the prompt - - For the scenario cannot find any candidate in one single batch - """ - block_size = 32 - num_gpu_blocks = 2048 // block_size - seed = 100 - model_name = 'JackFram/llama-68m' - vocab_size = 32_000 - device = 'cuda:0' - - ngram_worker = create_worker( - NGramWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - ) - - proposer = Top1Proposer( - worker=ngram_worker, - device=device, - vocab_size=vocab_size, - max_proposal_len=20, - ) - - # set ngram window [1, 3], which is window=1/2/3 - ngram_worker.set_ngram_window_size(1, 3) - - prompts = [ - # shall find no candidate - [1, 2, 3, 4, 5, 6, 7], - ] - - proposal_len = 5 - final_prompt_lens = [len(prompt) + proposal_len for prompt in prompts] - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - final_prompt_lens=final_prompt_lens) - - proposals = proposer.get_spec_proposals( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=proposal_len), - seq_ids_with_bonus_token_in_last_step=None) - - assert torch.is_tensor(proposals.proposal_token_ids) - assert torch.is_tensor(proposals.proposal_probs) - - assert proposals.proposal_token_ids.shape == torch.Size([1, proposal_len]) - assert 
proposals.proposal_probs.shape[:-1] == torch.Size([1, proposal_len]) - assert proposals.proposal_lens.shape == torch.Size([1]) - assert proposals.proposal_lens.tolist() == [0] - - -def test_ngram_algo_correctness_for_batches_not_match_all(): - """Verify our ngram algo find the right candidate in the prompt - - For the scenario find some candidate not full in batchs - """ - block_size = 32 - num_gpu_blocks = 2048 // block_size - seed = 100 - model_name = 'JackFram/llama-68m' - vocab_size = 32_000 - device = 'cuda:0' - - ngram_worker = create_worker( - NGramWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - ) - - proposer = Top1Proposer( - worker=ngram_worker, - device=device, - vocab_size=vocab_size, - max_proposal_len=20, - ) - - # set ngram window [1, 3], which is window=1/2/3 - ngram_worker.set_ngram_window_size(1, 3) - - prompts = [ - # shall find no candidate - [1, 2, 3, 4, 5, 6, 7], - # shall find candidate 12,13,14,15,16 - [11, 12, 13, 14, 15, 16, 11], - # shall find candidate 23,24,25,26,21 - [21, 21, 22, 23, 24, 25, 26, 21, 22], - # shall find candidate 34,35,36,37,38 - [31, 32, 31, 32, 33, 34, 35, 36, 37, 38, 31, 32, 33], - # shall find no candidate as exceed max_proposal_len - [ - 31, 32, 31, 32, 31, 32, 31, 32, 31, 32, 31, 32, 33, 34, 35, 36, 37, - 38, 31, 32, 33 - ], - ] - - proposal_len = 5 - final_prompt_lens = [len(prompt) + proposal_len for prompt in prompts] - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - final_prompt_lens=final_prompt_lens) - for sg in seq_group_metadata_list: - sg.is_prompt = False - proposals = proposer.get_spec_proposals( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=proposal_len), - seq_ids_with_bonus_token_in_last_step=None) - - assert torch.is_tensor(proposals.proposal_token_ids) - assert torch.is_tensor(proposals.proposal_probs) - - assert proposals.proposal_token_ids.shape == torch.Size([5, proposal_len]) - assert proposals.proposal_probs.shape[:-1] == torch.Size([5, proposal_len]) - assert proposals.proposal_lens.shape == torch.Size([5]) - - # the first sequence has no match so proposal_len should be overwritten to 0 - assert proposals.proposal_lens.tolist( - ) == [0] + [proposal_len for _ in range(3)] + [0] - - for i in range(proposal_len): - assert proposals.proposal_token_ids[0][i] == -1 - assert proposals.proposal_token_ids[1][i] == prompts[1][i + 1] - assert proposals.proposal_token_ids[2][i] == prompts[2][i + 3] - assert proposals.proposal_token_ids[3][i] == prompts[3][i + 5] - assert proposals.proposal_token_ids[4][i] == -1 - - -def test_ngram_algo_correctness_for_batches_match_all(): - """Verify our ngram algo find the right candidate in the prompt - - For the scenario find candidate in all batches - """ - - block_size = 32 - num_gpu_blocks = 2048 // block_size - seed = 100 - model_name = 'JackFram/llama-68m' - vocab_size = 32_000 - device = 'cuda:0' - - ngram_worker = create_worker( - NGramWorker, - model_name, - block_size, - num_gpu_blocks, - seed, - ) - - proposer = Top1Proposer( - worker=ngram_worker, - device=device, - vocab_size=vocab_size, - max_proposal_len=20, - ) - - # set ngram window [0, 3], which is window=1/2/3 - ngram_worker.set_ngram_window_size(1, 3) - - prompts = [ - # shall find candidate 12,13,14,15,16 - [11, 12, 13, 14, 15, 16, 11], - # shall find candidate 23,24,25,26,21 - [21, 21, 22, 23, 24, 25, 26, 21, 22], - # shall find candidate 34,35,36,37,38 - [31, 32, 31, 32, 33, 
34, 35, 36, 37, 38, 31, 32, 33], - ] - - proposal_len = 5 - final_prompt_lens = [len(prompt) + proposal_len for prompt in prompts] - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, - num_gpu_blocks, - block_size, - final_prompt_lens=final_prompt_lens) - - # Normally drafter is run on decode requests only; here we check the output - # of the ngram worker as it is the sole proposer that has no forward. - for sg in seq_group_metadata_list: - sg.is_prompt = False - proposals = proposer.get_spec_proposals( - execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=proposal_len), - seq_ids_with_bonus_token_in_last_step=None) - - assert torch.is_tensor(proposals.proposal_token_ids) - assert torch.is_tensor(proposals.proposal_probs) - - assert proposals.proposal_token_ids.shape == torch.Size([3, proposal_len]) - assert proposals.proposal_probs.shape[:-1] == torch.Size([3, proposal_len]) - assert proposals.proposal_lens.shape == torch.Size([3]) - - assert proposals.proposal_lens.tolist() == [proposal_len for _ in range(3)] - - for i in range(proposal_len): - assert proposals.proposal_token_ids[0][i] == prompts[0][i + 1] - assert proposals.proposal_token_ids[1][i] == prompts[1][i + 3] - assert proposals.proposal_token_ids[2][i] == prompts[2][i + 5] diff --git a/tests/spec_decode/test_scorer.py b/tests/spec_decode/test_scorer.py deleted file mode 100644 index 55fcf005574..00000000000 --- a/tests/spec_decode/test_scorer.py +++ /dev/null @@ -1,116 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import random - -import pytest -import torch - -from vllm.sequence import ExecuteModelRequest -from vllm.spec_decode.batch_expansion import BatchExpansionTop1Scorer -from vllm.spec_decode.interfaces import SpeculativeProposals, SpeculativeScores -from vllm.spec_decode.mqa_scorer import MQAScorer -from vllm.worker.worker import Worker - -from .utils import create_batch, create_worker - - -def create_proposal(propose_lens: list[int], vocab_size: int, - device: str) -> SpeculativeProposals: - batch_size = len(propose_lens) - max_propose_len = max(propose_lens) - proposal_probs = torch.rand((batch_size, max_propose_len, vocab_size), - device=device) - - proposal_token_ids = torch.full((batch_size, max_propose_len), - fill_value=-1, - device=device) - for i in range(batch_size): - proposal_token_ids[i][:propose_lens[i]] = torch.argmax( - proposal_probs[i][:propose_lens[i]], dim=-1) - - propose_lens = torch.tensor(propose_lens, device=device) - return SpeculativeProposals(proposal_token_ids, proposal_probs, - propose_lens) - - -def assert_score_equal(score1: SpeculativeScores, - score2: SpeculativeScores) -> None: - assert torch.allclose(score1.probs, score2.probs) - assert torch.allclose(score1.logprobs, score2.logprobs) - assert torch.equal( - score1.token_ids, - score2.token_ids), f"{score1.token_ids}, {score2.token_ids}" - - -@pytest.mark.parametrize('model_name', ['facebook/opt-125m']) -@pytest.mark.parametrize('batch_size', [1, 2, 4, 8, 16]) -@pytest.mark.parametrize('max_propose_len', [1, 3, 5]) -@pytest.mark.parametrize('mixed_propose_len', [True]) -@pytest.mark.parametrize('device', ['cuda']) -@pytest.mark.parametrize('prefill_chunking', [False, True]) -def test_scorer(model_name: str, batch_size: int, max_propose_len: int, - mixed_propose_len: bool, device: str, - prefill_chunking: bool) -> None: - """ - Compare the batch expansion scorer and mqa scorer 
return the same score. - We test for both queries with the same propose length and different - propose length, as well as mixed prefill-decode batches. - """ - seed = 0 - block_size = 32 - num_gpu_blocks = 2048 // block_size - scorer_worker = create_worker(Worker, model_name, block_size, - num_gpu_blocks, seed) - scorer_worker.model_runner.disable_logprobs = True # accessed by mqa_scorer - scorer_worker.model_runner.sampler.include_gpu_probs_tensor = True - scorer_worker.model_runner.sampler.should_modify_greedy_probs_inplace = True - - vocab_size = scorer_worker.vocab_size - - if not mixed_propose_len: - propose_lens = [max_propose_len] * batch_size - else: - # There must be at least 1 decode request, otherwise - # we have nothing to score (`_run_no_spec`). - non_zero_cnt = random.randint(1, batch_size) - propose_lens = [max_propose_len - ] * non_zero_cnt + [0] * (batch_size - non_zero_cnt) - random.shuffle(propose_lens) - - seq_group_metadatalist, _, _ = create_batch(batch_size, - max_propose_len, - block_size=block_size, - num_gpu_blocks=num_gpu_blocks) - - if mixed_propose_len and prefill_chunking and (n_prefills := - batch_size - non_zero_cnt): - prefill, _, _ = create_batch(n_prefills, - None, - prefill_chunk_size=4, - block_size=block_size, - num_gpu_blocks=num_gpu_blocks, - seq_ids=list( - range(batch_size, - batch_size + n_prefills))) - # re-order to guarantee prefill|decode order - target_group_metadatalist = [ - seq_group_metadatalist[i] for i, p in enumerate(propose_lens) - if p > 0 - ] - seq_group_metadatalist = prefill + target_group_metadatalist - propose_lens = [0] * n_prefills + [p for p in propose_lens if p > 0] - - proposals = create_proposal(propose_lens, vocab_size, device) - requests = ExecuteModelRequest(seq_group_metadatalist, - num_lookahead_slots=max_propose_len) - - batch_expansion_scorer = BatchExpansionTop1Scorer(scorer_worker, device, - vocab_size) - batch_expansion_score = batch_expansion_scorer.score_proposals( - requests, proposals) - - mqa_scorer = MQAScorer(scorer_worker, device, vocab_size) - mqa_score = mqa_scorer.score_proposals(requests, proposals) - - assert_score_equal(batch_expansion_score, mqa_score) diff --git a/tests/spec_decode/test_spec_decode_worker.py b/tests/spec_decode/test_spec_decode_worker.py deleted file mode 100644 index 8aceaadff8d..00000000000 --- a/tests/spec_decode/test_spec_decode_worker.py +++ /dev/null @@ -1,945 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import random -from collections import defaultdict -from types import SimpleNamespace -from unittest.mock import MagicMock - -import pytest -import torch - -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.model_executor.utils import set_random_seed -from vllm.sequence import ExecuteModelRequest, SequenceOutput -from vllm.spec_decode.batch_expansion import BatchExpansionTop1Scorer -from vllm.spec_decode.draft_model_runner import TP1DraftModelRunner -from vllm.spec_decode.interfaces import SpeculativeProposals -from vllm.spec_decode.metrics import (AsyncMetricsCollector, - SpecDecodeWorkerMetrics) -from vllm.spec_decode.multi_step_worker import MultiStepWorker -from vllm.spec_decode.spec_decode_worker import (SpecDecodeWorker, - split_num_cache_blocks_evenly) -from vllm.worker.worker import Worker - -from .test_utils import mock_spec_decode_sampler -from .utils import (create_batch, create_sampler_output_list, create_worker, - mock_worker) - - -@pytest.mark.parametrize('k', [1, 2, 
6]) -@pytest.mark.parametrize('batch_size', [1, 2, 32]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_correctly_calls_draft_model(k: int, batch_size: int, - acceptance_sampler_method: str): - """Verify SpecDecodeWorker calls the draft worker with correct - inputs. Everything else is mocked out. - """ - draft_worker = mock_worker(cls=MultiStepWorker) - target_worker = mock_worker() - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - worker = SpecDecodeWorker( - draft_worker, - target_worker, - mock_spec_decode_sampler(acceptance_sampler_method), - disable_logprobs=False, - metrics_collector=metrics_collector) - exception_secret = 'artificial stop' - draft_worker.get_spec_proposals.side_effect = ValueError(exception_secret) - - seq_group_metadata_list, _, _ = create_batch(batch_size, k) - execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, num_lookahead_slots=k) - - with pytest.raises(ValueError, match=exception_secret): - worker.execute_model(execute_model_req=execute_model_req) - - call_args_list = draft_worker.get_spec_proposals.call_args_list - assert len(call_args_list) == 1 - - for args, _ in call_args_list: - actual_execute_model_data = args[0] - assert actual_execute_model_data == execute_model_req - - -@pytest.mark.parametrize('k', [1, 2, 6]) -@pytest.mark.parametrize('batch_size', [1, 2, 32]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_batch_expansion_correctly_calls_target_model( - k: int, batch_size: int, acceptance_sampler_method: str): - """Verify SpecDecodeWorker calls the target model with correct - inputs with batch expansion. Everything else is mocked out. 
- """ - draft_worker = mock_worker(cls=MultiStepWorker, use_spec=False) - target_worker = mock_worker(use_spec=False) - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - - draft_worker.device = 'cuda' - target_worker.device = 'cuda' - - set_random_seed(1) - - worker = SpecDecodeWorker( - draft_worker, - target_worker, - mock_spec_decode_sampler(acceptance_sampler_method), - disable_logprobs=False, - metrics_collector=metrics_collector, - disable_mqa_scorer=True) - worker.init_device() - - vocab_size = 32_000 - - proposal_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64, - device='cuda') - proposal_probs = torch.rand(batch_size, - k, - vocab_size, - dtype=torch.float32, - device='cuda') - proposal_lens = torch.ones(batch_size, dtype=torch.int64, - device='cuda') * k - - seq_group_metadata_list, prompts, prev_output_tokens = create_batch( - batch_size, k) - - draft_worker.get_spec_proposals.return_value = SpeculativeProposals( - proposal_token_ids=proposal_token_ids, - proposal_probs=proposal_probs, - proposal_lens=proposal_lens) - - exception_secret = 'artificial stop' - target_worker.execute_model.side_effect = ValueError(exception_secret) - - with pytest.raises(ValueError, match=exception_secret): - worker.execute_model(execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k)) - - seen_contexts: list[list[int]] = [] - - call_args_list = target_worker.execute_model.call_args_list - assert len(call_args_list) == 1 - for _, kwargs in call_args_list: - seq_group_metadata_list = kwargs[ - "execute_model_req"].seq_group_metadata_list - - assert len(seq_group_metadata_list) == (k + 1) * batch_size - for seq_group_metadata in seq_group_metadata_list: - for seq_data in seq_group_metadata.seq_data.values(): - seen_contexts.append(seq_data.get_token_ids()) - - expected_seen_contexts: list[list[int]] = [] - - for prompt, prev_generated, draft_tokens in zip( - prompts, prev_output_tokens, proposal_token_ids.tolist()): - - for i in range(len(draft_tokens) + 1): - expected_seen_contexts.append(prompt + prev_generated + - draft_tokens[:i]) - - seen_contexts.sort() - expected_seen_contexts.sort() - assert expected_seen_contexts == seen_contexts - - -@pytest.mark.parametrize('k', [1, 2, 6]) -@pytest.mark.parametrize('batch_size', [1, 2, 32]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_correctly_calls_spec_decode_sampler(k: int, batch_size: int, - acceptance_sampler_method: str): - """Verify SpecDecodeWorker calls the rejection sampler with - correct inputs. Everything else is mocked out. 
- """ - vocab_size = 32_000 - - draft_worker = mock_worker(cls=MultiStepWorker, - vocab_size=vocab_size, - use_spec=False) - target_worker = mock_worker(vocab_size=vocab_size, use_spec=False) - spec_decode_sampler = mock_spec_decode_sampler(acceptance_sampler_method) - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - draft_worker.device = 'cuda' - target_worker.device = 'cuda' - - set_random_seed(1) - - worker = SpecDecodeWorker(draft_worker, - target_worker, - spec_decode_sampler, - disable_logprobs=False, - metrics_collector=metrics_collector) - worker.init_device() - - proposal_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64, - device='cuda') - proposal_probs = torch.rand(batch_size, - k, - vocab_size, - dtype=torch.float32, - device='cuda') - - proposal_lens = torch.ones(batch_size, dtype=torch.int64, - device='cuda') * k - - seq_group_metadata_list, _, _ = create_batch(batch_size, k) - - draft_worker.get_spec_proposals.return_value = SpeculativeProposals( - proposal_token_ids=proposal_token_ids, - proposal_probs=proposal_probs, - proposal_lens=proposal_lens) - - target_token_ids = torch.randint(low=0, - high=vocab_size, - size=(1, batch_size * (k + 1)), - dtype=torch.int64, - device='cuda') - target_token_probs = torch.rand(1, - batch_size * (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - target_token_logprobs = torch.rand(1, - batch_size * (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - target_output = create_sampler_output_list(target_token_ids, - target_token_probs, - target_token_logprobs) - - target_worker.execute_model.return_value = [target_output[0]] - - exception_secret = 'artificial stop' - - spec_decode_sampler.side_effect = ValueError(exception_secret) - - with pytest.raises(ValueError, match=exception_secret): - worker.execute_model(execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k)) - - assert len(spec_decode_sampler.call_args_list) == 1 - _, kwargs = spec_decode_sampler.call_args_list[0] - actual = SimpleNamespace(**kwargs) - - assert torch.equal(actual.bonus_token_ids, - target_token_ids.reshape(batch_size, k + 1)[:, -1:]) - assert torch.equal(actual.target_with_bonus_probs, - target_token_probs.reshape(batch_size, k + 1, -1)) - assert torch.equal(actual.draft_token_ids, proposal_token_ids) - assert torch.equal(actual.draft_probs, proposal_probs) - - -@pytest.mark.parametrize('k', [1, 2, 6]) -@pytest.mark.parametrize('batch_size', [1, 2, 32]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_correctly_formats_output(k: int, batch_size: int, - acceptance_sampler_method: str): - """Verify SpecDecodeWorker formats sampler output correctly. - Everything else is mocked out. 
- """ - vocab_size = 32_000 - - draft_worker = mock_worker(cls=MultiStepWorker, - vocab_size=vocab_size, - use_spec=False) - target_worker = mock_worker(vocab_size=vocab_size, use_spec=False) - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - draft_worker.device = 'cuda' - target_worker.device = 'cuda' - - set_random_seed(1) - spec_decode_sampler = mock_spec_decode_sampler(acceptance_sampler_method) - worker = SpecDecodeWorker(draft_worker, - target_worker, - spec_decode_sampler, - disable_logprobs=False, - metrics_collector=metrics_collector) - worker.init_device() - - proposal_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64, - device='cuda') - proposal_probs = torch.rand(batch_size, - k, - vocab_size, - dtype=torch.float32, - device='cuda') - - proposal_lens = torch.ones(batch_size, dtype=torch.int64, - device='cuda') * k - - seq_group_metadata_list, _, _ = create_batch(batch_size, k) - - draft_worker.get_spec_proposals.return_value = SpeculativeProposals( - proposal_token_ids=proposal_token_ids, - proposal_probs=proposal_probs, - proposal_lens=proposal_lens) - - target_token_ids = torch.randint(low=0, - high=vocab_size, - size=(1, batch_size * (k + 1)), - dtype=torch.int64, - device='cuda') - target_token_probs = torch.rand(1, - batch_size * (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - target_token_logprobs = torch.rand(1, - batch_size * (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - target_output = create_sampler_output_list(target_token_ids, - target_token_probs, - target_token_logprobs) - - target_worker.execute_model.return_value = [target_output[0]] - - spec_decode_sampler_output = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k + 1), - dtype=torch.int64, - device='cuda') - for i in range(batch_size): - minimum_accepted_tokens = 1 - spec_decode_sampler_output[i][ - -random.randint(minimum_accepted_tokens, k + 1):] = -1 - - spec_decode_sampler.return_value = spec_decode_sampler_output - output = worker.execute_model(execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k)) - - expected_output = create_sampler_output_list( - token_ids=spec_decode_sampler_output.transpose(0, 1), - probs=[None for _ in range(k + 1)], - logprobs=[None for _ in range(k + 1)]) - - seq_ids = [ - next(iter(seq_group_metadata.seq_data.keys())) - for seq_group_metadata in seq_group_metadata_list - ] - actual_output_by_seq: dict[int, list[SequenceOutput]] = { - seq_id: [] - for seq_id in seq_ids - } - expected_output_by_seq: dict[int, list[SequenceOutput]] = { - seq_id: [] - for seq_id in seq_ids - } - - for step in output: - for seq_group in step: - for sample in seq_group.samples: - seq_id = sample.parent_seq_id - actual_output_by_seq[seq_id].append(sample) - - for step in expected_output: - for seq_group in step: - for sample in seq_group.samples: - seq_id = sample.parent_seq_id - expected_output_by_seq[seq_id].append(sample) - - all_seen_seq_ids = set( - list(actual_output_by_seq.keys()) + - list(expected_output_by_seq.keys())) - for seq_id in all_seen_seq_ids: - actual_by_step = actual_output_by_seq[seq_id] - expected_by_step = expected_output_by_seq[seq_id] - - for i in range(k + 1): - if i >= len(actual_by_step): - assert expected_by_step[i].output_token == -1 - continue - assert actual_by_step[i].output_token == expected_by_step[ - i].output_token - - -@pytest.mark.parametrize('k', [1, 2]) -@pytest.mark.parametrize('batch_size', [1]) 
-@pytest.mark.parametrize('returns_metrics', [True, False]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_collects_metrics(k: int, batch_size: int, returns_metrics: bool, - acceptance_sampler_method: str): - """Verify SpecDecodeWorker collects metrics. - """ - vocab_size = 32_000 - - draft_worker = mock_worker(cls=MultiStepWorker, - vocab_size=vocab_size, - use_spec=False) - target_worker = mock_worker(vocab_size=vocab_size, use_spec=False) - spec_decode_sampler = mock_spec_decode_sampler(acceptance_sampler_method) - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - draft_worker.device = 'cuda' - target_worker.device = 'cuda' - - set_random_seed(1) - - worker = SpecDecodeWorker(draft_worker, - target_worker, - spec_decode_sampler, - disable_logprobs=False, - metrics_collector=metrics_collector) - worker.init_device() - - proposal_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k), - dtype=torch.int64, - device='cuda') - proposal_probs = torch.rand(batch_size, - k, - vocab_size, - dtype=torch.float32, - device='cuda') - - proposal_lens = torch.ones(batch_size, dtype=torch.int64, - device='cuda') * k - - seq_group_metadata_list, _, _ = create_batch(batch_size, k) - - draft_worker.get_spec_proposals.return_value = SpeculativeProposals( - proposal_token_ids=proposal_token_ids, - proposal_probs=proposal_probs, - proposal_lens=proposal_lens) - - target_token_ids = torch.randint(low=0, - high=vocab_size, - size=(1, batch_size * (k + 1)), - dtype=torch.int64, - device='cuda') - target_token_probs = torch.rand(1, - batch_size * (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - target_token_logprobs = torch.rand(1, - batch_size * (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - target_output = create_sampler_output_list(target_token_ids, - target_token_probs, - target_token_logprobs) - - target_worker.execute_model.return_value = [target_output[0]] - - spec_decode_sampler_output = torch.randint(low=0, - high=vocab_size, - size=(batch_size, k + 1), - dtype=torch.int64, - device='cuda') - for i in range(batch_size): - minimum_accepted_tokens = 1 - spec_decode_sampler_output[i][ - -random.randint(minimum_accepted_tokens, k + 1):] = -1 - spec_decode_sampler.return_value = spec_decode_sampler_output - - mock_rejsample_metrics = MagicMock( - spec=SpecDecodeWorkerMetrics) if returns_metrics else None - metrics_collector.maybe_collect_rejsample_metrics.return_value = ( - mock_rejsample_metrics) - - output = worker.execute_model(execute_model_req=ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k)) - assert output[0].spec_decode_worker_metrics == mock_rejsample_metrics - - call_args_list = ( - metrics_collector.maybe_collect_rejsample_metrics.call_args_list) - assert len(call_args_list) == 1 - args, kwargs = call_args_list[0] - assert args[0] == k or kwargs.get('k', -1) == k - - -@pytest.mark.parametrize('k', [0]) -@pytest.mark.parametrize('batch_size', [1, 2, 32]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_k_equals_zero(k: int, batch_size: int, - acceptance_sampler_method: str): - """Verify that the SpecDecodeWorker calls the draft and target workers - when k is zero. This happens during prefill. 
- """ - draft_worker = mock_worker(cls=MultiStepWorker) - target_worker = mock_worker() - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - - sampler_output = MagicMock(spec=SamplerOutput) - sampler_output.hidden_states = None - target_worker.execute_model.return_value = [sampler_output] - - draft_worker.device = 'cuda' - target_worker.device = 'cuda' - - set_random_seed(1) - - worker = SpecDecodeWorker( - proposer_worker=draft_worker, - scorer_worker=target_worker, - spec_decode_sampler=mock_spec_decode_sampler( - acceptance_sampler_method), - disable_logprobs=False, - metrics_collector=metrics_collector, - ) - - seq_group_metadata_list, _, _ = create_batch(batch_size, - k, - prev_output_token_len=0) - execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, num_lookahead_slots=k) - - out = worker.execute_model(execute_model_req=execute_model_req) - - assert len(out) == 1, f"expected only one token output when {k=}" - assert out[0].sampled_token_probs is None, ( - "expect gpu tensor references to be None") - assert out[ - 0].sampled_token_ids is None, "expect gpu tensor references to be None" - - draft_worker.execute_model.assert_called_once_with(execute_model_req) - target_worker.execute_model.assert_called_once_with(execute_model_req) - - -@pytest.mark.parametrize('k', [0, 5]) -@pytest.mark.parametrize('batch_size', [0]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_empty_input_batch(k: int, batch_size: int, - acceptance_sampler_method: str): - """Verify that the SpecDecodeWorker calls the draft and target workers - when the input batch is empty. This can happen if the engine communicates - to the workers information without scheduling a batch. - """ - draft_worker = mock_worker(cls=MultiStepWorker) - target_worker = mock_worker() - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - - sampler_output = MagicMock(spec=SamplerOutput) - sampler_output.hidden_states = None - target_worker.execute_model.return_value = [sampler_output] - - draft_worker.device = 'cuda' - target_worker.device = 'cuda' - - set_random_seed(1) - - worker = SpecDecodeWorker( - proposer_worker=draft_worker, - scorer_worker=target_worker, - spec_decode_sampler=mock_spec_decode_sampler( - acceptance_sampler_method), - disable_logprobs=False, - metrics_collector=metrics_collector, - ) - - seq_group_metadata_list, _, _ = create_batch(batch_size, - k, - prev_output_token_len=0) - execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, num_lookahead_slots=k) - - out = worker.execute_model(execute_model_req=execute_model_req) - - assert len(out) == 1, f"expected only one token output when {k=}" - assert out[0].sampled_token_probs is None, ( - "expect gpu tensor references to be None") - assert out[ - 0].sampled_token_ids is None, "expect gpu tensor references to be None" - - draft_worker.execute_model.assert_called_once_with(execute_model_req) - target_worker.execute_model.assert_called_once_with(execute_model_req) - - -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@pytest.mark.skip_global_cleanup -def test_init_device(acceptance_sampler_method: str): - """Verify SpecDecodeWorker invokes proposer/scorer worker init_device, as - well as other GPU initialization. 
- """ - draft_worker = mock_worker(cls=MultiStepWorker, use_spec=False) - target_worker = mock_worker(use_spec=False) - spec_decode_sampler = mock_spec_decode_sampler(acceptance_sampler_method) - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - - worker = SpecDecodeWorker( - proposer_worker=draft_worker, - scorer_worker=target_worker, - spec_decode_sampler=spec_decode_sampler, - disable_logprobs=False, - metrics_collector=metrics_collector, - ) - worker.init_device() - - draft_worker.init_device.assert_called_once() - - target_worker.init_device.assert_called_once() - - metrics_collector.init_tensors.assert_called_once() - spec_decode_sampler.init_tensors.assert_called_once() - - -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@torch.inference_mode() -def test_initialize_cache(acceptance_sampler_method): - """Verify SpecDecodeWorker invokes initialize_cache on proposer/scorer - workers. - """ - draft_worker = mock_worker(cls=MultiStepWorker) - target_worker = mock_worker() - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - - worker = SpecDecodeWorker(proposer_worker=draft_worker, - scorer_worker=target_worker, - spec_decode_sampler=mock_spec_decode_sampler( - acceptance_sampler_method), - metrics_collector=metrics_collector) - - kwargs = {"num_gpu_blocks": 1024, "num_cpu_blocks": 1023} - worker.initialize_cache(**kwargs) - - draft_worker.initialize_cache.assert_called_once_with(**kwargs) - target_worker.initialize_cache.assert_called_once_with(**kwargs) - - -@pytest.mark.parametrize('available_gpu_blocks', [1, 1024]) -@pytest.mark.parametrize('available_cpu_blocks', [500]) -@pytest.mark.parametrize('target_cache_block_size_bytes', [2 * 2 * 4096]) -@pytest.mark.parametrize('draft_kv_size_bytes', [0, 2 * 2 * 768, 2 * 2 * 4096]) -@pytest.mark.parametrize("acceptance_sampler_method", - ["rejection_sampler", "typical_acceptance_sampler"]) -@pytest.mark.skip_global_cleanup -def test_determine_num_available_blocks(available_gpu_blocks: int, - available_cpu_blocks: int, - target_cache_block_size_bytes: int, - draft_kv_size_bytes: int, - acceptance_sampler_method: str): - """Verify SpecDecodeWorker correctly profiles num available GPU blocks. - Specifically, it should run profiling in the scorer worker, and then evenly - split the blocks between proposer and scorer worker. 
- """ - draft_worker = mock_worker(cls=MultiStepWorker) - target_worker = mock_worker() - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - - target_worker.determine_num_available_blocks.return_value = ( - available_gpu_blocks, available_cpu_blocks) - target_worker.get_cache_block_size_bytes.return_value = ( - target_cache_block_size_bytes) - draft_worker.get_cache_block_size_bytes.return_value = draft_kv_size_bytes - - worker = SpecDecodeWorker( - draft_worker, target_worker, - mock_spec_decode_sampler(acceptance_sampler_method), metrics_collector) - - num_gpu_blocks, num_cpu_blocks = worker.determine_num_available_blocks() - - target_worker.determine_num_available_blocks.assert_called_once() - assert num_cpu_blocks == available_cpu_blocks - - assert num_gpu_blocks == split_num_cache_blocks_evenly( - target_cache_block_size_bytes, draft_kv_size_bytes, - available_gpu_blocks) - - -@pytest.mark.parametrize('available_gpu_blocks', - list(range(20)) + [1024, 1024**2]) -@pytest.mark.parametrize('target_cache_block_size_bytes', - [2 * 2 * 4096, 2 * 2 * 8192]) -@pytest.mark.parametrize('draft_kv_size_bytes', [0, 2 * 2 * 768, 2 * 2 * 4096]) -@pytest.mark.skip_global_cleanup -def test_split_num_cache_blocks_evenly(available_gpu_blocks: int, - target_cache_block_size_bytes: int, - draft_kv_size_bytes: int): - """Verify split_num_cache_blocks_evenly does not exceed original memory - allocation in bytes. - """ - num_blocks = split_num_cache_blocks_evenly(target_cache_block_size_bytes, - draft_kv_size_bytes, - available_gpu_blocks) - assert (num_blocks * target_cache_block_size_bytes) + ( - num_blocks * draft_kv_size_bytes) <= (available_gpu_blocks * - target_cache_block_size_bytes) - - -@torch.inference_mode() -def test_populate_seq_ids_with_bonus_tokens(): - """ - Verify that a call to _create_output_sampler_list correctly updates - seq_with_bonus_token_in_last_step. - - seq_with_bonus_token_in_last_step is an internal data structure in - SpecDecodeWorker that tracks the sequence IDs which are assigned bonus - tokens by the target model in their last forward pass. This state is - maintained only for models relying on the KV cache, such as those using - the MultiStepWorker. - """ - batch_size = 10 - k = 5 - vocab_size = 10000 - num_sequences_with_bonus_tokens = 5 - target_worker = mock_worker(vocab_size=vocab_size, use_spec=False) - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - target_worker.execute_model.return_value = [MagicMock(spec=SamplerOutput)] - target_worker.device = 'cuda' - - set_random_seed(1) - draft_worker = mock_worker(cls=MultiStepWorker) - draft_worker.device = 'cuda' - # The sequence_ids attached to each sequence in the batch. 
- # The sequence at index i has seq_id assigned_seq_ids[i] - assigned_seq_ids = list(range(batch_size)) - seq_group_metadata_list, _, _ = create_batch(batch_size, - k, - seq_ids=assigned_seq_ids, - prev_output_token_len=10) - target_token_logprobs = torch.rand(batch_size, (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - accepted_token_ids = torch.randint(low=0, - high=vocab_size, - size=(batch_size, (k + 1)), - dtype=torch.int64, - device='cuda') - expected_request_id_seq_ids_mapping: dict[str, set[int]] = defaultdict(set) - for seq_group_metadata in seq_group_metadata_list: - for seq_id in seq_group_metadata.seq_data: - expected_request_id_seq_ids_mapping[ - seq_group_metadata.request_id].add(seq_id) - # Generate a random sample of sequence indexes with bonus tokens - seq_indexes_with_bonus_tokens = random.sample( - range(batch_size), num_sequences_with_bonus_tokens) - # Create a mask that is True for indices in seq_indexes_with_bonus_tokens - mask = torch.ones(batch_size, dtype=torch.bool, device='cuda') - mask[seq_indexes_with_bonus_tokens] = False - # Set the last token ID to -1 for all indices not in - # seq_indexes_with_bonus_tokens to indicate the lack of bonus token in - # those indices. - accepted_token_ids[mask, -1:] = -1 - worker = SpecDecodeWorker(draft_worker, - target_worker, - mock_spec_decode_sampler("rejection_sampler"), - disable_logprobs=False, - metrics_collector=metrics_collector) - # Initialize _seq_with_bonus_token_in_last_step with a set of sequence IDs. - # This set includes all sequence IDs in the batch as well as an additional - # `num_extra_sequence_ids` sequence IDs. Note that the sequence IDs are in - # the range [0, batch_size + num_extra_sequence_ids). - num_extra_sequence_ids = 10 - worker._seq_with_bonus_token_in_last_step = set( - range(batch_size + num_extra_sequence_ids)) - worker._create_output_sampler_list( - seq_group_metadata_list=seq_group_metadata_list, - accepted_token_ids=accepted_token_ids, - target_logprobs=target_token_logprobs, - prompt_logprobs=None, - k=k, - stage_times=(0, 0, 0)) - # Verify that _seq_with_bonus_token_in_last_step contains the following: - # 1. Sequence IDs that were already present in - # _seq_with_bonus_token_in_last_step but were not part of the current - # batch are retained. - # 2. Of the sequence IDs present in the current batch, only those with a - # bonus token are retained in _seq_with_bonus_token_in_last_step. - # Sequence IDs that are present in the current batch but do not have - # bonus tokens are removed from _seq_with_bonus_token_in_last_step. - expected_seq_ids_with_bonus_tokens = \ - set([assigned_seq_ids[i] for i in seq_indexes_with_bonus_tokens]) - additional_sequence_ids = \ - set(range(batch_size, batch_size + num_extra_sequence_ids)) - assert worker._seq_with_bonus_token_in_last_step == \ - expected_seq_ids_with_bonus_tokens.union(additional_sequence_ids) - assert worker._request_id_seq_id_mapping == \ - expected_request_id_seq_ids_mapping - - -@torch.inference_mode() -def test_handle_finished_requests(): - """ - Test to verify that finished request IDs are appropriately processed to - update the internal state of the SpecDecodeWorker. - - This test initializes the SpecDecodeWorker with mock data, marks certain - requests as finished, and ensures that the corresponding sequence IDs are - correctly removed from the internal mappings. 
- """ - batch_size = 32 - k = 3 - draft_worker = mock_worker(cls=MultiStepWorker) - target_worker = mock_worker() - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - worker = SpecDecodeWorker(draft_worker, target_worker, - mock_spec_decode_sampler("rejection_sampler"), - metrics_collector) - # Initialize the request_id_seq_id_mapping mapping dict with a few fake - # request ids and corresponding sequence ids. - worker._request_id_seq_id_mapping = \ - {'request-1': {1,2,3}, 'request-2': {4,5,6,7}, - 'request-3': {8,9}, 'request-4': {10,11}} - # Initialize seq_with_bonus_token_in_last_step with a few fake - # sequence ids. - worker._seq_with_bonus_token_in_last_step = {1, 4, 5, 8, 9, 10} - exception_secret = 'artificial stop' - draft_worker.get_spec_proposals.side_effect = ValueError(exception_secret) - - seq_group_metadata_list, _, _ = create_batch(batch_size, k) - # Mark requests with ids request-1 and request-3 as finished. - execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=seq_group_metadata_list, - num_lookahead_slots=k, - finished_requests_ids=['request-1', 'request-3']) - - with pytest.raises(ValueError, match=exception_secret): - worker.execute_model(execute_model_req=execute_model_req) - # Verify that request-1 and request-3 are removed from - # request_id_seq_id_mapping - assert worker._request_id_seq_id_mapping == \ - {'request-2': {4,5,6,7}, 'request-4': {10,11}} - # Verify that all sequence ids corresponding to 'request-1' - # and 'request-3' are removed from seq_with_bonus_token_in_last_step. - assert worker._seq_with_bonus_token_in_last_step == \ - {4,5,10} - - -@pytest.mark.parametrize('k', [3]) -@pytest.mark.parametrize('batch_size', [2, 32]) -@pytest.mark.parametrize("batch_composition", - ["prefill_only", "decode_only", "mixed"]) -@torch.inference_mode() -def test_chunked_prefill_flow(k: int, batch_size: int, batch_composition: str): - """ - Verify SpecDecodeWorker calls match the expected flow. - """ - vocab_size = 32_000 - draft_worker = mock_worker(cls=MultiStepWorker) - target_worker = mock_worker() - metrics_collector = MagicMock(spec=AsyncMetricsCollector) - worker = SpecDecodeWorker(draft_worker, - target_worker, - mock_spec_decode_sampler("rejection_sampler"), - disable_logprobs=False, - metrics_collector=metrics_collector) - exception_secret = 'artificial stop' - worker.scorer = mock_worker(BatchExpansionTop1Scorer) - worker.scorer.score_proposals.side_effect = ValueError(exception_secret) - - # Create batch with combination of terminal/non-terminal prefill chunks - # and decodes (different seq_ids). - decodes, _, _ = create_batch(batch_size, k) - # Pre-chunking here, get 'batch_size' chunks. - prefill, _, _ = create_batch(batch_size, - k, - prefill_chunk_size=4, - seq_ids=list(range(batch_size, - batch_size * 2))) - - if batch_composition == "prefill_only": - n_prefills = batch_size - elif batch_composition == "decode_only": - n_prefills = 0 - else: - n_prefills = random.randint(1, batch_size - 1) - n_decodes = batch_size - n_prefills - - prefill = random.sample(prefill, n_prefills) - decodes = random.sample(decodes, n_decodes) - target_group_metadata_list = prefill + decodes - execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=target_group_metadata_list, - # For prefill only batches we expect num_lookahead_slots = 0. 
- num_lookahead_slots=k if n_decodes > 0 else 0) - - target_token_ids = torch.randint(low=0, - high=vocab_size, - size=(1, batch_size * (k + 1)), - dtype=torch.int64, - device='cuda') - target_token_probs = torch.rand(1, - batch_size * (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - target_token_logprobs = torch.rand(1, - batch_size * (k + 1), - vocab_size, - dtype=torch.float32, - device='cuda') - target_output = create_sampler_output_list(target_token_ids, - target_token_probs, - target_token_logprobs) - - target_worker.execute_model.return_value = [target_output[0]] - - if not len(decodes): - worker.execute_model(execute_model_req=execute_model_req) - # no spec run (prefill only) - draft_worker.execute_model.assert_called_once_with(execute_model_req) - target_worker.execute_model.assert_called_once_with(execute_model_req) - else: - # Decode-only run OR mixed batch, scorer call fails (it's mocked) - with pytest.raises(ValueError, match=exception_secret): - worker.execute_model(execute_model_req=execute_model_req) - # but first draft still counted - assert draft_worker.get_spec_proposals.call_count == 1 - - -def test_correctly_load_weight_for_eagle(): - """ - Verify SpecDecodeWorker loads lm_head weight for eagle correctly. - """ - seed = 100 - block_size = 32 - num_gpu_blocks = 8096 // block_size - target_worker = create_worker( - Worker, - "JackFram/llama-68m", - block_size, - num_gpu_blocks, - seed, - ) - draft_worker = create_worker( - MultiStepWorker, - "abhigoyal/vllm-eagle-llama-68m-random", - block_size, - num_gpu_blocks, - seed, - model_runner_cls=TP1DraftModelRunner, - ) - - spec_decode_sampler = mock_spec_decode_sampler("rejection_sampler") - worker = SpecDecodeWorker(draft_worker, - target_worker, - spec_decode_sampler, - disable_logprobs=False) - worker.proposer_worker.maybe_load_lm_head_weight( - target_worker.model_runner.model.lm_head.weight.data) - assert torch.allclose( - worker.proposer_worker.worker.model_runner.model.lm_head.weight.data, - worker.scorer_worker.model_runner.model.lm_head.weight.data) diff --git a/tests/spec_decode/test_utils.py b/tests/spec_decode/test_utils.py deleted file mode 100644 index 9cfc618b9d9..00000000000 --- a/tests/spec_decode/test_utils.py +++ /dev/null @@ -1,150 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from unittest.mock import MagicMock - -import pytest -import torch - -from vllm.model_executor.layers.rejection_sampler import RejectionSampler -from vllm.model_executor.layers.sampler import _get_ranks -from vllm.model_executor.layers.typical_acceptance_sampler import ( - TypicalAcceptanceSampler) -from vllm.sequence import SequenceGroupMetadata, get_all_seq_ids -from vllm.spec_decode.util import (get_sampled_token_logprobs, - split_batch_by_proposal_len) - - -def test_get_all_seq_ids(): - """Verify get_all_seq_ids extracts all seq ids. 
- """ - expected_seq_ids = list(range(10)) + list(range(100, 110)) - - seq_group_metadata_list = [ - SequenceGroupMetadata( - request_id=str(seq_id), - is_prompt=True, - seq_data={ - seq_id: MagicMock(), - }, - sampling_params=MagicMock(), - block_tables={ - seq_id: MagicMock(), - }, - lora_request=None, - ) for seq_id in expected_seq_ids - ] - - actual_seq_ids = get_all_seq_ids(seq_group_metadata_list) - assert actual_seq_ids == expected_seq_ids - - -@pytest.fixture -def fake_sequence_group_metadata(): - seq_ids = list(range(3)) - return [ - SequenceGroupMetadata( - request_id=str(i), - is_prompt=True, - seq_data={ - i: MagicMock(), - }, - sampling_params=MagicMock(), - block_tables={ - i: MagicMock(), - }, - lora_request=None, - ) for i in seq_ids - ] - - -def test_filter_zero_length_proposals(fake_sequence_group_metadata): - proposal_lens = [0, 1, 0] - _, (filtered_groups, - indices) = split_batch_by_proposal_len(fake_sequence_group_metadata, - proposal_lens) - - expected_groups = [ - fake_sequence_group_metadata[0], fake_sequence_group_metadata[2] - ] - expected_indices = [0, 2] - - assert filtered_groups == expected_groups - assert indices == expected_indices - - -def test_filter_non_zero_length_proposals(fake_sequence_group_metadata): - proposal_lens = [0, 1, 2] - (filtered_groups, - indices), _ = split_batch_by_proposal_len(fake_sequence_group_metadata, - proposal_lens) - - expected_groups = [ - fake_sequence_group_metadata[1], fake_sequence_group_metadata[2] - ] - expected_indices = [1, 2] - - assert filtered_groups == expected_groups - assert indices == expected_indices - - -def test_empty_inputs(): - _, (filtered_groups, indices) = split_batch_by_proposal_len([], []) - - assert filtered_groups == [] - assert indices == [] - - -def test_all_zero_with_non_zero_filter(fake_sequence_group_metadata): - proposal_lens = [0, 0, 0] - (filtered_groups, - indices), _ = split_batch_by_proposal_len(fake_sequence_group_metadata, - proposal_lens) - - assert filtered_groups == [] - assert indices == [] - - -def test_all_non_zero_with_zero_filter(fake_sequence_group_metadata): - proposal_lens = [1, 1, 1] - _, (filtered_groups, - indices) = split_batch_by_proposal_len(fake_sequence_group_metadata, - proposal_lens) - - assert filtered_groups == [] - assert indices == [] - - -def mock_spec_decode_sampler(acceptance_sampler_method): - """ - Returns either a RejectionSampler or TypicalAcceptanceSampler - object depending on whether acceptance_sampler_method is - 'rejection_sampler' or 'typical_acceptance_sampler' respectively. - """ - if acceptance_sampler_method == "rejection_sampler": - sampler = MagicMock(spec=RejectionSampler) - sampler.token_id_dtype = torch.int64 - return sampler - elif acceptance_sampler_method == "typical_acceptance_sampler": - sampler = MagicMock(spec=TypicalAcceptanceSampler) - sampler.token_id_dtype = torch.int64 - return sampler - else: - raise ValueError(f"Invalid sampler name {acceptance_sampler_method}") - - -def test_get_sampled_token_logprobs(): - """Verify get_sampled_token_logprobs returns consistent rankings - with regular get_ranks when probabilities match exactly. 
- """ - logprob_tensor = torch.tensor( - [[[-.1, -.1]] * 2]) # shape (num_steps, batch_size, vocab_size) - sampled_token_tensor = torch.tensor([[1, - 0]]) # shape (num_steps, batch_size) - ranks_spec_dec, _ = get_sampled_token_logprobs(logprob_tensor, - sampled_token_tensor) - - ranks_regular = _get_ranks(logprob_tensor.reshape((2, -1)), - sampled_token_tensor.reshape(-1)) - - assert torch.equal(ranks_spec_dec.reshape(-1), ranks_regular) diff --git a/tests/spec_decode/utils.py b/tests/spec_decode/utils.py deleted file mode 100644 index 1733f66feec..00000000000 --- a/tests/spec_decode/utils.py +++ /dev/null @@ -1,290 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from collections.abc import Sequence as GenericSequence -from itertools import count -from typing import Callable, Optional, TypeVar, Union -from unittest.mock import MagicMock - -import torch - -from vllm.engine.arg_utils import EngineArgs -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.model_executor.utils import set_random_seed -from vllm.sampling_params import SamplingParams -from vllm.sequence import (CompletionSequenceGroupOutput, Logprob, - SequenceData, SequenceGroupMetadata, SequenceOutput) -from vllm.utils import get_distributed_init_method, get_ip, get_open_port -from vllm.worker.cache_engine import CacheEngine -from vllm.worker.model_runner import ModelRunner -from vllm.worker.worker import Worker - -T = TypeVar("T", bound=Worker) - - -def round_up_to_next_block(seq_len: int, block_size: int) -> int: - return (seq_len + block_size - 1) // block_size - - -def mock_worker(cls=None, - vocab_size: int = 30_000, - max_model_len: int = 2048, - rank: int = 0, - use_spec: bool = True) -> MagicMock: - if cls is None: - cls = Worker - - spec = cls if use_spec else None - - worker = MagicMock(spec=spec) - worker.vocab_size = vocab_size - worker.max_model_len = max_model_len - worker.rank = rank - worker.device = 'cuda:0' - return worker - - -def patch_execute_model_with_seeds(worker: Worker, rand_seeds: list[int]): - seed_iter = iter(rand_seeds) - original_execute_model = worker.execute_model - - def new_execute_model(*args, **kwargs): - result = original_execute_model(*args, **kwargs) - set_random_seed(next(seed_iter)) - return result - - return new_execute_model - - -def zero_kv_cache(cache_engine: list[CacheEngine]): - assert cache_engine[0].gpu_cache - for key_blocks, value_blocks in cache_engine[0].gpu_cache: - key_blocks.zero_() - value_blocks.zero_() - - -def create_worker(cls: Callable[..., T], - model_name: str, - block_size: int, - num_gpu_blocks: int, - seed: int, - is_driver_worker: bool = True, - enforce_eager: bool = True, - model_runner_cls: Optional[ModelRunner] = None, - dtype: Optional[str] = "auto") -> T: - engine_args = EngineArgs( - model=model_name, - seed=seed, - block_size=block_size, - enforce_eager=enforce_eager, - dtype=dtype, - ) - engine_config = engine_args.create_engine_config() - - distributed_init_method = get_distributed_init_method( - get_ip(), get_open_port()) - - worker = cls( - vllm_config=engine_config, - local_rank=0, - rank=0, - distributed_init_method=distributed_init_method, - is_driver_worker=is_driver_worker, - model_runner_cls=model_runner_cls, - ) - - worker.init_device() - worker.load_model() - - engine_config.cache_config.num_gpu_blocks = num_gpu_blocks - engine_config.cache_config.num_cpu_blocks = 0 - worker.initialize_cache( - num_gpu_blocks=engine_config.cache_config.num_gpu_blocks, 
- num_cpu_blocks=engine_config.cache_config.num_cpu_blocks) - - return worker - - -def create_seq_group_metadata_from_prompts( - prompts: list[list[int]], - num_gpu_blocks: int, - block_size: int, - final_prompt_lens: list[int], - continuations: Optional[list[list[int]]] = None, - seq_ids: Optional[list[int]] = None, -) -> list[SequenceGroupMetadata]: - - if continuations is None: - continuations = [[] for _ in prompts] - - if seq_ids is None: - seq_ids = list(i for i, _ in enumerate(prompts)) - - free_gpu_blocks = list(range(num_gpu_blocks)) - - block_allocations = { - i: [ - free_gpu_blocks.pop() - for _ in range(round_up_to_next_block(final_len, block_size)) - ] - for i, final_len in enumerate(final_prompt_lens) - } - - seq_grou_metadata_list = [] - for i, (prompt_token_ids, - cont_token_ids) in enumerate(zip(prompts, continuations)): - data = SequenceData.from_seqs(prompt_token_ids, cont_token_ids) - data.update_num_computed_tokens( - len(prompt_token_ids) + len(cont_token_ids) - 1) - seq_data = {i: data} - seq_grou_metadata_list.append( - SequenceGroupMetadata( - request_id=str(i), - is_prompt=len(cont_token_ids) == 0, - seq_data=seq_data, - sampling_params=SamplingParams(temperature=0.0), - block_tables={i: block_allocations[i][:]}, - )) - return seq_grou_metadata_list - - -def create_chunked_seq_group_metadata_from_prompt( - prompt: list[int], - num_gpu_blocks: int, - chunk_size: int, - block_size: int, - seq_id: Optional[int] = None) -> list[SequenceGroupMetadata]: - - if seq_id is None: - seq_id = 0 - - free_gpu_blocks = list(range(num_gpu_blocks)) - - block_allocations = [ - free_gpu_blocks.pop() - for _ in range(round_up_to_next_block(len(prompt), block_size)) - ] - - seq_group_metadata_list = [] - for i, idx in enumerate(range(0, len(prompt), chunk_size)): - chunk_ids = prompt[idx:idx + chunk_size] - data = SequenceData.from_seqs(prompt) - data.update_num_computed_tokens(idx) - seq_data = {i: data} - seq_group_metadata_list.append( - SequenceGroupMetadata( - request_id=str(seq_id), - is_prompt=True, - do_sample=idx + chunk_size >= len(prompt), # terminal chunk - seq_data=seq_data, - sampling_params=SamplingParams(temperature=0.0), - block_tables={i: block_allocations}, - token_chunk_size=len(chunk_ids))) - return seq_group_metadata_list - - -def assert_logprobs_dict_allclose( - actual_logprobs: list[dict[int, Logprob]], - expected_logprobs: list[dict[int, Logprob]]) -> None: - for single_step_actual_logprobs, single_step_expected_logprobs in zip( - actual_logprobs, expected_logprobs): - assert set(single_step_actual_logprobs.keys()) == set( - single_step_expected_logprobs.keys()) - for token_id in single_step_actual_logprobs: - actual = torch.tensor( - single_step_actual_logprobs[token_id].logprob) - expected = torch.tensor( - single_step_expected_logprobs[token_id].logprob) - torch.testing.assert_close(actual, expected) - - -def create_sampler_output_list( - token_ids: torch.Tensor, - probs: GenericSequence[Optional[torch.Tensor]], - logprobs: GenericSequence[Optional[torch.Tensor]], - seq_ids: Optional[list[int]] = None) -> list[SamplerOutput]: - num_steps, batch_size = token_ids.shape - token_ids_by_step = token_ids.tolist() - - if seq_ids is None: - seq_ids = list(range(batch_size)) - - return [ - SamplerOutput(outputs=[ - CompletionSequenceGroupOutput( - samples=[ - SequenceOutput( - output_token=token_id, - parent_seq_id=seq_ids[seq_index], - logprobs={token_id: Logprob(0)}, - ) - ], - prompt_logprobs=None, - ) for seq_index, token_id in enumerate(token_ids_by_step[step]) 
- ], - sampled_token_probs=probs[step], - logprobs=logprobs[step], - sampled_token_ids=token_ids[step]) - for step in range(num_steps) - ] - - -def create_batch(batch_size, - k, - prompt_len: Union[int, list[int]] = 10, - prev_output_token_len: int = 10, - seq_ids: Optional[list[int]] = None, - num_gpu_blocks: Optional[int] = None, - block_size: Optional[int] = None, - prefill_chunk_size: Optional[int] = None): - if block_size is None: - block_size = 8 - - if num_gpu_blocks is None: - num_gpu_blocks = 2048 // block_size - - iterator = count() - - if isinstance(prompt_len, int): - prompt_lens = [prompt_len for _ in range(batch_size)] - else: - prompt_lens = prompt_len - - prompts = [[next(iterator) for _ in range(p_len)] for p_len in prompt_lens] - - if prefill_chunk_size: - # Create a batch of chunked prompts. - if not seq_ids: - seq_ids = list(range(len(prompts))) - seq_group_metadata_list = [] - for p, sid in zip(prompts, seq_ids): - seq_group_metadata_list += \ - create_chunked_seq_group_metadata_from_prompt( - p, num_gpu_blocks, prefill_chunk_size, block_size, sid) - seq_group_metadata_list = seq_group_metadata_list[:batch_size] - prev_output_tokens = [] - else: - prev_output_tokens = [[ - next(iterator) for _ in range(prev_output_token_len) - ] for _ in range(batch_size)] - final_prompt_lens = [ - len(prompt) + len(prev_output_token) + k + 1 - for prompt, prev_output_token in zip(prompts, prev_output_tokens) - ] - - seq_group_metadata_list = create_seq_group_metadata_from_prompts( - prompts, num_gpu_blocks, block_size, final_prompt_lens, - prev_output_tokens, seq_ids) - return seq_group_metadata_list, prompts, prev_output_tokens - - -def maybe_enable_chunked_prefill(prefill_chunk_size, llm_kwargs): - if prefill_chunk_size > 0: - llm_kwargs.update( - **{ - "enable_chunked_prefill": True, - "max_num_batched_tokens": prefill_chunk_size, - "max_num_seqs": prefill_chunk_size - }) - else: - llm_kwargs["enable_chunked_prefill"] = False diff --git a/tests/test_sequence.py b/tests/test_sequence.py index a782a3bf771..c734c8514a6 100644 --- a/tests/test_sequence.py +++ b/tests/test_sequence.py @@ -29,7 +29,6 @@ def test_sampler_output_initialization(sampler_output, sample_outputs): assert len(sampler_output) == len(sample_outputs) assert sampler_output.sampled_token_probs is None assert sampler_output.sampled_token_ids is None - assert sampler_output.spec_decode_worker_metrics is None def test_sampler_output_getitem(sampler_output, sample_outputs): diff --git a/tests/v1/test_oracle.py b/tests/v1/test_oracle.py index 7a7ba346a71..39515d710e8 100644 --- a/tests/v1/test_oracle.py +++ b/tests/v1/test_oracle.py @@ -40,12 +40,6 @@ def test_unsupported_configs(monkeypatch): with monkeypatch.context() as m: m.setenv("VLLM_USE_V1", "1") - with pytest.raises(NotImplementedError): - AsyncEngineArgs( - model=MODEL, - kv_cache_dtype="fp8", - ).create_engine_config() - with pytest.raises(NotImplementedError): AsyncEngineArgs( model=MODEL, diff --git a/tools/mypy.sh b/tools/mypy.sh index 77d342da1ec..af4c61233ab 100755 --- a/tools/mypy.sh +++ b/tools/mypy.sh @@ -32,6 +32,5 @@ run_mypy vllm/lora run_mypy vllm/model_executor run_mypy vllm/plugins run_mypy vllm/prompt_adapter -run_mypy vllm/spec_decode run_mypy vllm/worker run_mypy vllm/v1 diff --git a/vllm/config.py b/vllm/config.py index 270027a4b5a..c00ca475d8b 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -2536,8 +2536,6 @@ def __post_init__(self): SpeculativeMethod = Literal["ngram", "eagle", "eagle3", "medusa", "mlp_speculator", "draft_model", 
"deepseek_mtp"] -SpeculativeAcceptanceMethod = Literal["rejection_sampler", - "typical_acceptance_sampler"] @config @@ -2560,13 +2558,6 @@ class SpeculativeConfig: If using `ngram` method, the related configuration `prompt_lookup_max` and `prompt_lookup_min` should be considered.""" - acceptance_method: SpeculativeAcceptanceMethod = "rejection_sampler" - """The method to use for accepting draft tokens:\n - - "rejection_sampler" maps to `RejectionSampler`.\n - - "typical_acceptance_sampler" maps to `TypicalAcceptanceSampler`. - - If using `typical_acceptance_sampler`, the related configuration - `posterior_threshold` and `posterior_alpha` should be considered.""" draft_tensor_parallel_size: Optional[int] = None """The degree of the tensor parallelism for the draft model. Can only be 1 or the same as the target model's tensor parallel size.""" @@ -2593,9 +2584,6 @@ class SpeculativeConfig: will use the default version.""" # Advanced control - disable_mqa_scorer: bool = False - """Disable the MQA scorer and fall back to batch expansion for scoring - proposals.""" disable_by_batch_size: Optional[int] = None """Disable speculative decoding for new incoming requests when the number of enqueued requests is larger than this value, if provided.""" @@ -2608,16 +2596,6 @@ class SpeculativeConfig: """Minimum size of ngram token window when using Ngram proposer, if provided. Defaults to 1.""" - # Typical acceptance sampler configuration - posterior_threshold: Optional[float] = None - """A threshold value that sets a lower bound on the posterior probability - of a token in the target model for it to be accepted. This threshold is - used only when we use the `TypicalAcceptanceSampler` for token acceptance. - """ - posterior_alpha: Optional[float] = None - """Scaling factor for entropy-based threshold, applied when using - `TypicalAcceptanceSampler`.""" - speculative_token_tree: Optional[str] = None """Specifies the tree structure for speculative token generation. """ @@ -2795,8 +2773,8 @@ def __post_init__(self): elif (self.draft_model_config.hf_config.model_type == "mlp_speculator"): self.method = "mlp_speculator" - elif (self.draft_model_config.hf_config.model_type == - "deepseek_mtp"): + elif (self.draft_model_config.hf_config.model_type + in ("deepseek_mtp", "mimo_mtp")): self.method = "deepseek_mtp" if self.num_speculative_tokens > 1: logger.warning( @@ -2806,6 +2784,11 @@ def __post_init__(self): ) else: self.method = "draft_model" + raise NotImplementedError( + "Speculative decoding with draft model is not " + "supported yet. Please consider using other " + "speculative decoding methods such as ngram, medusa, " + "eagle, or deepseek_mtp.") # Replace hf_config for EAGLE draft_model if self.method in ("eagle", "eagle3"): @@ -2864,12 +2847,6 @@ def __post_init__(self): self.target_parallel_config, self.draft_tensor_parallel_size)) - if self.acceptance_method == "typical_acceptance_sampler": - if self.posterior_threshold is None: - self.posterior_threshold = 0.09 - if self.posterior_alpha is None: - self.posterior_alpha = 0.3 - @staticmethod def _maybe_override_draft_max_model_len( speculative_max_model_len: Optional[int], @@ -2975,30 +2952,6 @@ def _verify_args(self) -> Self: if self.draft_model_config: self.draft_model_config.verify_with_parallel_config( self.draft_parallel_config) - # Validate and set draft token acceptance related settings. - - if self.acceptance_method is None: - raise ValueError("acceptance_method is not set. 
" - "Expected values are rejection_sampler or " - "typical_acceptance_sampler.") - - if (self.acceptance_method != 'rejection_sampler' - and self.acceptance_method != 'typical_acceptance_sampler'): - raise ValueError( - "Expected acceptance_method to be either " - "rejection_sampler or typical_acceptance_sampler. Instead it " - f"is {self.acceptance_method}") - - if self.acceptance_method == "typical_acceptance_sampler" and ( - (self.posterior_threshold is not None - and self.posterior_threshold < 0) or - (self.posterior_alpha is not None and self.posterior_alpha < 0)): - raise ValueError( - "Expected the posterior_threshold and posterior_alpha of " - "typical_acceptance_sampler to be > 0. " - "Instead found posterior_threshold = " - f"{self.posterior_threshold} and posterior_alpha = " - f"{self.posterior_alpha}") if (self.disable_by_batch_size is not None and self.disable_by_batch_size < 2): diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index b20defde73e..a7fcf6c354e 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -1417,28 +1417,12 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: return False # V1 supports N-gram, Medusa, and Eagle speculative decoding. - is_ngram_enabled = False - is_eagle_enabled = False - is_medusa_enabled = False - if self.speculative_config is not None: - # This is supported but experimental (handled below). - speculative_method = self.speculative_config.get("method") - if speculative_method: - if speculative_method in ("ngram", "[ngram]"): - is_ngram_enabled = True - elif speculative_method == "medusa": - is_medusa_enabled = True - elif speculative_method in ("eagle", "eagle3", "deepseek_mtp"): - is_eagle_enabled = True - else: - speculative_model = self.speculative_config.get("model") - if speculative_model in ("ngram", "[ngram]"): - is_ngram_enabled = True - if not (is_ngram_enabled or is_eagle_enabled or is_medusa_enabled): - # Other speculative decoding methods are not supported yet. - _raise_or_fallback(feature_name="Speculative Decoding", - recommend_to_remove=False) - return False + if (self.speculative_config is not None + and self.speculative_config.get("method") == "draft_model"): + raise NotImplementedError( + "Speculative decoding with draft model is not supported yet. " + "Please consider using other speculative decoding methods " + "such as ngram, medusa, eagle, or deepseek_mtp.") # No XFormers so far. V1_BACKENDS = [ diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py index 25fa1c3058b..e2f8de1990b 100644 --- a/vllm/engine/llm_engine.py +++ b/vllm/engine/llm_engine.py @@ -1780,13 +1780,6 @@ def _get_stats(self, num_generation_tokens_from_prefill_groups) num_tokens_iter = (num_generation_tokens_iter + num_prompt_tokens_iter) - # Spec decode, if enabled, emits specialized metrics from the worker in - # sampler output. 
- if model_output and isinstance(model_output[0], SamplerOutput) and ( - model_output[0].spec_decode_worker_metrics is not None): - spec_decode_metrics = model_output[0].spec_decode_worker_metrics - else: - spec_decode_metrics = None return Stats( now=now, @@ -1808,7 +1801,6 @@ def _get_stats(self, num_tokens_iter=num_tokens_iter, time_to_first_tokens_iter=time_to_first_tokens_iter, time_per_output_tokens_iter=time_per_output_tokens_iter, - spec_decode_metrics=spec_decode_metrics, num_preemption_iter=num_preemption_iter, # Request stats diff --git a/vllm/engine/metrics.py b/vllm/engine/metrics.py index 8d51f047235..ba8dbd1fad7 100644 --- a/vllm/engine/metrics.py +++ b/vllm/engine/metrics.py @@ -2,7 +2,6 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import time -from typing import TYPE_CHECKING from typing import Counter as CollectionsCounter from typing import Dict, List, Optional, Type, Union, cast @@ -19,9 +18,6 @@ else: ray_metrics = None -if TYPE_CHECKING: - from vllm.spec_decode.metrics import SpecDecodeWorkerMetrics - logger = init_logger(__name__) prometheus_client.disable_created_metrics() @@ -199,30 +195,6 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig): documentation="Count of successfully processed requests.", labelnames=labelnames + [Metrics.labelname_finish_reason]) - # Speculative decoding stats - self.gauge_spec_decode_draft_acceptance_rate = self._gauge_cls( - name="vllm:spec_decode_draft_acceptance_rate", - documentation="Speulative token acceptance rate.", - labelnames=labelnames, - multiprocess_mode="sum") - self.gauge_spec_decode_efficiency = self._gauge_cls( - name="vllm:spec_decode_efficiency", - documentation="Speculative decoding system efficiency.", - labelnames=labelnames, - multiprocess_mode="sum") - self.counter_spec_decode_num_accepted_tokens = (self._counter_cls( - name="vllm:spec_decode_num_accepted_tokens_total", - documentation="Number of accepted tokens.", - labelnames=labelnames)) - self.counter_spec_decode_num_draft_tokens = self._counter_cls( - name="vllm:spec_decode_num_draft_tokens_total", - documentation="Number of draft tokens.", - labelnames=labelnames) - self.counter_spec_decode_num_emitted_tokens = (self._counter_cls( - name="vllm:spec_decode_num_emitted_tokens_total", - documentation="Number of emitted tokens.", - labelnames=labelnames)) - # --8<-- [end:metrics-definitions] @@ -391,9 +363,6 @@ def log(self, stats: Stats) -> None: self.num_prompt_tokens.append(stats.num_prompt_tokens_iter) self.num_generation_tokens.append(stats.num_generation_tokens_iter) - # Update spec decode metrics - self.maybe_update_spec_decode_metrics(stats) - # Log locally every local_interval seconds. 
if local_interval_elapsed(stats.now, self.last_local_log, self.local_interval): @@ -435,10 +404,6 @@ def log(self, stats: Stats) -> None: stats.gpu_prefix_cache_hit_rate * 100, stats.cpu_prefix_cache_hit_rate * 100, ) - if self.spec_decode_metrics is not None: - log_fn( - self._format_spec_decode_metrics_str( - self.spec_decode_metrics)) self._reset(stats, prompt_throughput, generation_throughput) @@ -447,21 +412,9 @@ def _reset(self, stats, prompt_throughput, generation_throughput) -> None: self.num_prompt_tokens = [] self.num_generation_tokens = [] self.last_local_log = stats.now - self.spec_decode_metrics = None self.last_prompt_throughput = prompt_throughput self.last_generation_throughput = generation_throughput - def _format_spec_decode_metrics_str( - self, metrics: "SpecDecodeWorkerMetrics") -> str: - - return ("Speculative metrics: " - f"Draft acceptance rate: {metrics.draft_acceptance_rate:.3f}, " - f"System efficiency: {metrics.system_efficiency:.3f}, " - f"Number of speculative tokens: {metrics.num_spec_tokens}, " - f"Number of accepted tokens: {metrics.accepted_tokens}, " - f"Number of draft tokens: {metrics.draft_tokens}, " - f"Number of emitted tokens: {metrics.emitted_tokens}.") - def info(self, type: str, obj: SupportsMetricsInfo) -> None: raise NotImplementedError @@ -579,33 +532,14 @@ def log(self, stats: Stats): self.num_prompt_tokens.append(stats.num_prompt_tokens_iter) self.num_generation_tokens.append(stats.num_generation_tokens_iter) - # Update spec decode metrics - self.maybe_update_spec_decode_metrics(stats) - # Log locally every local_interval seconds. if local_interval_elapsed(stats.now, self.last_local_log, self.local_interval): - if self.spec_decode_metrics is not None: - self._log_gauge( - self.metrics.gauge_spec_decode_draft_acceptance_rate, - self.spec_decode_metrics.draft_acceptance_rate) - self._log_gauge(self.metrics.gauge_spec_decode_efficiency, - self.spec_decode_metrics.system_efficiency) - self._log_counter( - self.metrics.counter_spec_decode_num_accepted_tokens, - self.spec_decode_metrics.accepted_tokens) - self._log_counter( - self.metrics.counter_spec_decode_num_draft_tokens, - self.spec_decode_metrics.draft_tokens) - self._log_counter( - self.metrics.counter_spec_decode_num_emitted_tokens, - self.spec_decode_metrics.emitted_tokens) # Reset tracked stats for next interval. 
self.num_prompt_tokens = [] self.num_generation_tokens = [] self.last_local_log = stats.now - self.spec_decode_metrics = None def info(self, type: str, obj: SupportsMetricsInfo) -> None: # Info type metrics are syntactic sugar for a gauge permanently set to 1 diff --git a/vllm/engine/metrics_types.py b/vllm/engine/metrics_types.py index 9375dc4c495..3281a9121a9 100644 --- a/vllm/engine/metrics_types.py +++ b/vllm/engine/metrics_types.py @@ -16,10 +16,9 @@ import time from abc import ABC, abstractmethod from dataclasses import dataclass -from typing import List, Optional +from typing import List from vllm.config import SupportsMetricsInfo, VllmConfig -from vllm.spec_decode.metrics import SpecDecodeWorkerMetrics @dataclass @@ -65,8 +64,6 @@ class Stats: running_lora_adapters: List[str] max_lora: str - spec_decode_metrics: Optional["SpecDecodeWorkerMetrics"] = None - class StatLoggerBase(ABC): """Base class for StatLogger.""" @@ -77,7 +74,6 @@ def __init__(self, local_interval: float, vllm_config: VllmConfig) -> None: self.num_generation_tokens: List[int] = [] self.last_local_log = time.time() self.local_interval = local_interval - self.spec_decode_metrics: Optional[SpecDecodeWorkerMetrics] = None @abstractmethod def log(self, stats: Stats) -> None: @@ -86,9 +82,3 @@ def log(self, stats: Stats) -> None: @abstractmethod def info(self, type: str, obj: SupportsMetricsInfo) -> None: raise NotImplementedError - - def maybe_update_spec_decode_metrics(self, stats: Stats): - """Save spec decode metrics (since they are unlikely - to be emitted at same time as log interval).""" - if stats.spec_decode_metrics is not None: - self.spec_decode_metrics = stats.spec_decode_metrics diff --git a/vllm/engine/output_processor/multi_step.py b/vllm/engine/output_processor/multi_step.py index e0fa6a00ecf..8b66ef0dc76 100644 --- a/vllm/engine/output_processor/multi_step.py +++ b/vllm/engine/output_processor/multi_step.py @@ -104,11 +104,6 @@ def process_outputs(self, seqs = sequence_group.get_seqs( status=SequenceStatus.FINISHED_ABORTED) - for output in outputs: - if output.samples[0].output_token != VLLM_INVALID_TOKEN_ID: - sequence_group.metrics.spec_token_acceptance_counts[ - output.step_index] += 1 - assert seqs, "Expected RUNNING or FINISHED_ABORTED sequences" assert len(seqs) == 1, ( "Beam search not supported in multi-step decoding.") diff --git a/vllm/model_executor/layers/rejection_sampler.py b/vllm/model_executor/layers/rejection_sampler.py deleted file mode 100644 index db68f18726d..00000000000 --- a/vllm/model_executor/layers/rejection_sampler.py +++ /dev/null @@ -1,406 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from functools import cached_property -from importlib.util import find_spec -from typing import Optional - -import torch -import torch.jit - -import vllm.envs as envs -from vllm.logger import init_logger -from vllm.model_executor.layers.spec_decode_base_sampler import ( - SpecDecodeStochasticBaseSampler) -from vllm.platforms import current_platform - -logger = init_logger(__name__) - -if find_spec("flashinfer"): - """ - Consider utilizing the FlashInfer rejection sampling kernel initially, - as it employs a dedicated kernel rather than relying on - Torch tensor operations. This design choice helps to fuse operations, - reduce memory I/O, and consequently enhances performance. 
- """ - from flashinfer.sampling import chain_speculative_sampling -else: - chain_speculative_sampling = None - - -class RejectionSampler(SpecDecodeStochasticBaseSampler): - """Apply modified rejection sampling as described in "Accelerating Large - Language Model Decoding with Speculative Sampling" - https://arxiv.org/pdf/2302.01318.pdf. - """ - - def __init__(self, - strict_mode: bool = False, - use_flashinfer: Optional[bool] = None): - """Create a rejection sampler. - - Args: - strict_mode: Whether or not to perform shape/device/dtype checks - during sampling. This catches correctness issues but adds - nontrivial latency. - use_flashinfer: We will use this parameter to determine whether - to use the FlashInfer rejection sampling kernel or not. If it's - None, we will use the default value from the environment variable. - This parameter is only used for testing purposes. - """ - super().__init__(strict_mode=strict_mode) - if use_flashinfer is None: - self.use_flashinfer = envs.VLLM_USE_FLASHINFER_SAMPLER and ( - chain_speculative_sampling is not None) - else: - self.use_flashinfer = use_flashinfer - - if self.use_flashinfer: - logger.info("Use flashinfer for rejection sampling.") - else: - logger.info("Use pytorch for rejection sampling.") - - def forward( - self, - target_with_bonus_probs: torch.Tensor, - bonus_token_ids: torch.Tensor, - draft_probs: torch.Tensor, - draft_token_ids: torch.Tensor, - seeded_seqs: Optional[dict[int, torch.Generator]] = None, - ) -> torch.Tensor: - """Sample token ids using rejection sampling. This accepts or rejects - tokens proposed by the draft model using the probability of each token - according to the draft and target models. - - In the worst case where all draft tokens are rejected, it is guaranteed - one correct token will be emitted. - - In the case where all draft tokens are accepted, a bonus token will be - accepted as its cheap to have the target model score this speculative - sequence. - - Args: - target_with_bonus_probs: The probability distribution - over token ids given context according to the target model. - shape = [batch_size, num_speculative_tokens + 1, vocab_size] - - bonus_token_ids: The "bonus" token ids that are accepted iff all - speculative tokens in a sequence are accepted. - shape = [batch_size, num_bonus_tokens] - - draft_probs: The probability distribution over token ids given - context according to the draft model. - shape = [batch_size, num_speculative_tokens, vocab_size] - - draft_token_ids: The token ids that were sampled from the draft - probabilities. - shape = [batch_size, num_speculative_tokens] - - seeded_seqs: Dict of batch row index to torch generator, for - sequences using seeded generation. - - Returns: - output_token_ids: The token ids sampled via rejection sampling, - or -1 if unable to sample a token because the previous token - was rejected. - shape = [batch_size, num_speculative_tokens + num_bonus_tokens] - """ - # Only perform shape/dtype/device checking in strict mode, as it adds - # overhead. - if self._strict_mode: - self._raise_if_incorrect_input(target_with_bonus_probs, - draft_token_ids, bonus_token_ids, - draft_probs) - - batch_size, k, _ = draft_probs.shape - - # batch_size = 0 when all requests in the batch are - # non_spec requests. In this case, output_token_ids is - # just an empty tensor. 
- if batch_size == 0: - return torch.empty(0, k + 1, device=draft_probs.device, dtype=int) - - # If use Flashinfer chain_speculative_sampling kernel - # for rejection sampling - if self.use_flashinfer and chain_speculative_sampling is not None: - batch_size, k, _ = draft_probs.shape - - (output_token_ids, accepted_token_num, - emitted_token_num) = chain_speculative_sampling( - draft_probs, - draft_token_ids, - target_with_bonus_probs, - ) - - # num_emitted_tokens returned by flashinfer - # does not include the bonus token - # Flashinfer stops at the first token that violates - # the condition p >= q and does not include recovery/bonus token. - # Therefore, we need to add batch_size here. - self.num_accepted_tokens += accepted_token_num.sum() - self.num_emitted_tokens += emitted_token_num.sum() + batch_size - self.num_draft_tokens += batch_size * k - else: - accepted, recovered_token_ids = ( - self._batch_modified_rejection_sampling( - target_with_bonus_probs[:, :-1], - draft_probs, - draft_token_ids, - seeded_seqs, - )) - - output_token_ids = self._create_output( - accepted, - recovered_token_ids, - draft_token_ids, - bonus_token_ids, - ) - - return output_token_ids - - def _batch_modified_rejection_sampling( - self, - target_probs: torch.Tensor, # [batch_size, k, vocab_size] - draft_probs: torch.Tensor, # [batch_size, k, vocab_size] - draft_token_ids: torch.Tensor, # [batch_size, k] - seeded_seqs: Optional[dict[int, torch.Generator]], - ) -> tuple[torch.Tensor, torch.Tensor]: - """Perform modified rejection sampling on each sequence. - - Returns: - A tuple of two tensors: - 0: A bool tensor of which tokens in each sequence is accepted. - shape = [batch_size, k] - 1: Token ids sampled from a recovered distribution, to be used - when a token is rejected. - shape = [batch_size, k] - """ - - batch_size, k, vocab_size = draft_probs.shape - - # shape [batch_size, k] - accepted = self._get_accepted(target_probs, draft_probs, - draft_token_ids, seeded_seqs) - - recovered_probs = self._get_recovered_probs( - target_probs, draft_probs).reshape(batch_size * k, vocab_size) - - # NOTE: the recovered_probs are overwritten by this method. - recovered_token_ids = _multinomial( - recovered_probs, - num_samples=1, - k=k, - seeded_seqs=seeded_seqs or {}, - ).reshape(batch_size, k) - - return accepted, recovered_token_ids - - def _create_uniform_samples(self, - seeded_seqs: Optional[dict[int, - torch.Generator]], - batch_size: int, k: int, - device: torch.device) -> torch.Tensor: - """ - Generates a batch of uniform random samples, with optional seeding - for specific sequences. - - This method creates a tensor of shape `(batch_size, k + 1)` filled - with uniform random values in the range [0, 1). If `seeded_seqs` - is provided, the sequences corresponding to specific indices - will be generated using the provided `torch.Generator` for - reproducibility. The other sequences will be generated without - a seed. - - Args: - seeded_seqs : Optional[dict[int, torch.Generator]] - A dictionary mapping indices in the batch to - `torch.Generator` objects. If `None`, all samples are - generated without a seed. - batch_size : int - The number of sequences to generate. - k : int - The number of random samples per sequence. - device : torch.device - The device on which to allocate the tensor. - - Returns: - uniform_rand : torch.Tensor - A tensor of shape `(batch_size, k + 1)` containing uniform - random values in the range [0, 1). 
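As an aside for readers of this removal: the per-sequence seeding behaviour documented above can be reproduced in isolation with plain PyTorch. The sketch below is illustrative only and is not part of this patch; `make_uniform_samples` is a hypothetical stand-in for the private helper being deleted, and the default CPU device is an assumption.

```python
import torch

def make_uniform_samples(seeded_seqs, batch_size, k, device="cpu"):
    # Rows without a generator share one unseeded draw; seeded rows each use
    # their own torch.Generator so results are reproducible per sequence.
    out = torch.empty(batch_size, k + 1, device=device)
    non_seeded = []
    for idx in range(batch_size):
        gen = seeded_seqs.get(idx)
        if gen is None:
            non_seeded.append(idx)
        else:
            out[idx] = torch.rand(k + 1, device=device, generator=gen)
    if non_seeded:
        out[non_seeded] = torch.rand(len(non_seeded), k + 1, device=device)
    return out

gen = torch.Generator().manual_seed(42)
a = make_uniform_samples({1: gen}, batch_size=3, k=2)
gen = torch.Generator().manual_seed(42)
b = make_uniform_samples({1: gen}, batch_size=3, k=2)
assert torch.equal(a[1], b[1])  # the seeded row is reproducible across calls
```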
- """ - if not seeded_seqs: - return torch.rand(batch_size, k + 1, device=device) - - uniform_rand = torch.empty(batch_size, k + 1, device=device) - - non_seeded_indices = [] - for idx in range(batch_size): - generator = seeded_seqs.get(idx) - if generator is None: - non_seeded_indices.append(idx) - else: - uniform_rand[idx, :] = torch.rand(1, - k + 1, - dtype=self.probs_dtype, - device=device, - generator=generator) - if non_seeded_indices: - uniform_rand[non_seeded_indices, :] = torch.rand( - len(non_seeded_indices), - k + 1, - dtype=self.probs_dtype, - device=device) - return uniform_rand - - def _get_accepted( - self, - target_probs: torch.Tensor, # [batch_size, k, vocab_size] - draft_probs: torch.Tensor, # [batch_size, k, vocab_size] - draft_token_ids: torch.Tensor, # [batch_size, k] - seeded_seqs: Optional[dict[int, torch.Generator]], - ) -> torch.Tensor: - r"""Create bool matrix over the proposed draft tokens. If - True, then a token can be accepted, else it should be - rejected. - - Given $q(\hat{x}_{n+1}|x_1, \dots, x_n)$, the probability of - $\hat{x}_{n+1}$ given context $x_1, \dots, x_n$ according - to the target model, and $p(\hat{x}_{n+1}|x_1, \dots, x_n)$, the - same conditional probability according to the draft model, the token - is accepted with probability: - - $$ - \min\left(1, \frac{q(\hat{x}_{n+1}|x_1, \dots, x_n)} - {p(\hat{x}_{n+1}|x_1, \dots, x_n)}\right) - $$ - - This implementation does not apply causality. When using the output, - if a token is rejected, subsequent tokens should not be used. - - Returns a bool tensor of shape [batch_size, k] specifying which tokens - are accepted. - """ - batch_size, k, _ = draft_probs.shape - batch_indices = torch.arange(batch_size, - device=target_probs.device)[:, None] - probs_indices = torch.arange(k, device=target_probs.device) - - # shape [batch_size, k] - selected_draft_probs = draft_probs[batch_indices, probs_indices, - draft_token_ids] - - # shape [batch_size, k] - selected_target_probs = target_probs[batch_indices, probs_indices, - draft_token_ids] - - uniform_rand = self._create_uniform_samples(seeded_seqs, batch_size, - k - 1, target_probs.device) - - capped_ratio = torch.minimum( - selected_target_probs / selected_draft_probs, - torch.full((1, ), 1, device=target_probs.device)) - accepted = uniform_rand < capped_ratio - - return accepted - - def _get_recovered_probs( - self, - target_probs: torch.Tensor, # [k, vocab_size] - draft_probs: torch.Tensor, # [k, vocab_size] - ) -> torch.Tensor: - r"""Create a probability distribution for each proposed token which can - be sampled if the proposed token is rejected. - - When this routine is applied sequentially, the true distribution of the - target model is recovered (within hardware numerics). - - The probability distribution used in this rejection case is constructed - as follows. Given $q(x|x_1, \dots, x_n)$, the probability of - $x$ given context $x_1, \dots, x_n$ according to the target - model and $p(x|x_1, \dots, x_n)$, the same conditional probability - according to the draft model: - - $$ - x_{n+1} \sim (q(x|x_1, \dots, x_n) - p(x|x_1, \dots, x_n))_+ - $$ - - where $(f(x))_+$ is defined as: - - $$ - (f(x))_+ = \frac{\max(0, f(x))}{\sum_x \max(0, f(x))} - $$ - - See https://github.com/vllm-project/vllm/pull/2336 for a visualization - of the draft, target, and recovered probability distributions. - - Returns a tensor of shape [batch_size, k, vocab_size]. 
- - Note: - This batches operations on GPU and thus constructs the recovered - distribution for all tokens, even if they are accepted. This causes - division-by-zero errors, so we use self._smallest_positive_value to - avoid that. This introduces some drift to the distribution. - """ - _, k, _ = draft_probs.shape - - # shape [batch_size, k, vocab_size] - difference = target_probs - draft_probs - - # TODO(cade): Can we use logprobs instead of probs, and avoid the - # division-by-zero errors without introducing distribution drift? - - # shape [batch_size, k, vocab_size] - f = torch.clamp(difference, min=self._smallest_positive_value) - - # shape [batch_size, k, vocab_size] - recovered_probs = f / torch.sum(f, dim=-1).reshape(-1, k, 1) - - return recovered_probs - - @cached_property - def _smallest_positive_value(self) -> float: - """Return the smallest positive value representable by the probs dtype. - This value is used when constructing a distribution from which to sample - recovered tokens in the first rejection case. - - See _get_recovered_probs for more details - - Note that this isn't actually the smallest positive value representable - by float32, but the smallest positive normal value. - See https://en.wikipedia.org/wiki/Subnormal_number for more information. - """ - return torch.finfo(self.probs_dtype).tiny - - -# torch.multinomial forces a GPU<->CPU sync. -# Therefore, we use an optimized implementation instead that skips the sync. -# Note that we always sample with replacement. -# probs will be modified in place, but this is fine, as we pass -# in a copy already. -@torch.compile(dynamic=True, backend=current_platform.simple_compile_backend) -def _multinomial( - probs: torch.Tensor, - num_samples: int, - k: int, - seeded_seqs: dict[int, torch.Generator], -) -> torch.Tensor: - - if num_samples > 1: - # This is equivalent to torch.repeat_interleaved (which also - # forces a GPU<->CPU sync). - probs = probs[:, None, :].expand(probs.shape[0], num_samples, - probs.shape[1]).contiguous().view( - -1, probs.shape[1]) - q = torch.empty_like(probs) - if not seeded_seqs: - q.exponential_(1.0) - else: - start = 0 - for idx in range(len(q) // k): - end = start + k - generator = seeded_seqs.get(idx) - # Note: generator might be None for non seeded - q[start:end].exponential_(1.0, generator=generator) - start = end - - return probs.div_(q).argmax(dim=1).view(-1, num_samples) diff --git a/vllm/model_executor/layers/sampler.py b/vllm/model_executor/layers/sampler.py index 08840fc40cf..e77eb637c89 100644 --- a/vllm/model_executor/layers/sampler.py +++ b/vllm/model_executor/layers/sampler.py @@ -21,7 +21,6 @@ from vllm.sequence import (VLLM_INVALID_TOKEN_ID, CompletionSequenceGroupOutput, Logprob, PromptLogprobs, SampleLogprobs, SequenceOutput) -from vllm.spec_decode.metrics import SpecDecodeWorkerMetrics if envs.VLLM_USE_FLASHINFER_SAMPLER and find_spec("flashinfer"): # yapf: disable @@ -119,9 +118,6 @@ class SamplerOutput( # specified in lieu of prompt token ids or text. sampled_token_embeds: Optional[torch.Tensor] = None - # Spec decode metrics populated by workers. - spec_decode_worker_metrics: Optional[SpecDecodeWorkerMetrics] = None - # Optional last hidden states from the model. 
hidden_states: Optional[torch.Tensor] = None @@ -159,11 +155,9 @@ def __repr__(self) -> str: else self.sampled_token_probs.shape) sampled_token_ids_repr = ("None" if self.sampled_token_ids is None else self.sampled_token_ids.shape) - return ( - f"SamplerOutput(outputs={self.outputs}, " - f"sampled_token_probs={sampled_token_probs_repr}, " - f"sampled_token_ids={sampled_token_ids_repr}, " - f"spec_decode_worker_metrics={self.spec_decode_worker_metrics})") + return (f"SamplerOutput(outputs={self.outputs}, " + f"sampled_token_probs={sampled_token_probs_repr}, " + f"sampled_token_ids={sampled_token_ids_repr})") class Sampler(nn.Module): diff --git a/vllm/model_executor/layers/spec_decode_base_sampler.py b/vllm/model_executor/layers/spec_decode_base_sampler.py deleted file mode 100644 index 0a36fe9be45..00000000000 --- a/vllm/model_executor/layers/spec_decode_base_sampler.py +++ /dev/null @@ -1,259 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from abc import abstractmethod -from typing import Optional, Union - -import torch -import torch.jit -import torch.nn as nn - -from vllm.platforms import current_platform - - -class SpecDecodeBaseSampler(nn.Module): - """Base class for samplers used for Speculative Decoding verification - step. - """ - - def __init__(self, strict_mode: bool = False): - """Base class constructor. - Args: - strict_mode: Whether or not to perform shape/device/dtype checks - during sampling. This catches correctness issues but adds - nontrivial latency. - """ - super().__init__() - self._strict_mode = strict_mode - - # NOTE: A "bonus token" is accepted iff all proposal tokens are - # accepted. There is always only one possible bonus token. We store this - # value in a variable for readability. - self._num_bonus_tokens = 1 - - self.num_accepted_tokens: Optional[torch.Tensor] = None - self.num_emitted_tokens: Optional[torch.Tensor] = None - self.num_draft_tokens: int = 0 - - def init_gpu_tensors(self, device: Union[int, str]) -> None: - assert self.num_accepted_tokens is None - if isinstance(device, int): - device = f"{current_platform.device_type}:{device}" - elif not isinstance(device, str): - raise ValueError(f"Device must be int or str, get {type(device)}") - self.num_accepted_tokens = torch.tensor(0, - dtype=torch.long, - device=device) - self.num_emitted_tokens = torch.tensor(0, - dtype=torch.long, - device=device) - - def init_tensors(self, - device: Union[int, str], - device_type: Union[torch.device, str] = 'cuda') -> None: - assert self.num_accepted_tokens is None - if isinstance(device_type, torch.device): - device_type = device_type.type - if isinstance(device, int): - device = f"{device_type}:{device}" - self.num_accepted_tokens = torch.tensor(0, - dtype=torch.long, - device=device) - self.num_emitted_tokens = torch.tensor(0, - dtype=torch.long, - device=device) - - @property - def probs_dtype(self): - return torch.float32 - - @property - def token_id_dtype(self): - return torch.int64 - - def _create_output( - self, - accepted: torch.Tensor, # [batch_size, k] - substitute_token_ids: torch.Tensor, # [batch_size, k] - draft_token_ids: torch.Tensor, # [batch_size, k] - bonus_token_ids: torch.Tensor, # [batch_size] - ) -> torch.Tensor: - """Format output. Returns a matrix of token ids. When - a token is rejected via sampling, all subsequent token ids are - set to -1 for the sequence. 
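The output contract just summarized (everything after the first rejection becomes -1, the recovered token fills the first rejected slot, and the bonus token survives only when every draft token is accepted) can be illustrated with a small standalone function. This sketch mirrors the behaviour described above rather than the exact masked in-place implementation that follows, and the names are hypothetical.

```python
import torch

def format_output(accepted, draft_ids, recovered_ids, bonus_ids):
    """accepted/draft_ids/recovered_ids: [batch, k]; bonus_ids: [batch]."""
    batch, k = draft_ids.shape
    # Index of the first rejected position per row (k when nothing is rejected).
    limits = (~accepted).int().argmax(dim=1)
    limits[accepted.all(dim=1)] = k
    pos = torch.arange(k).unsqueeze(0)
    out = torch.full((batch, k + 1), -1, dtype=torch.long)
    # Keep accepted draft tokens that precede the first rejection.
    out[:, :k] = torch.where(pos < limits.unsqueeze(1), draft_ids,
                             torch.full_like(draft_ids, -1))
    # The recovered token replaces the first rejected draft token.
    out[:, :k] = torch.where(pos == limits.unsqueeze(1), recovered_ids,
                             out[:, :k])
    # The bonus token is emitted only when every draft token was accepted.
    out[:, -1] = torch.where(accepted.all(dim=1), bonus_ids,
                             torch.full_like(bonus_ids, -1))
    return out

accepted = torch.tensor([[True, True, True], [True, False, True]])
draft = torch.tensor([[5, 6, 7], [5, 6, 7]])
recovered = torch.tensor([[9, 9, 9], [9, 9, 9]])
bonus = torch.tensor([8, 8])
print(format_output(accepted, draft, recovered, bonus))
# tensor([[ 5,  6,  7,  8],
#         [ 5,  9, -1, -1]])
```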
- - Args: - accepted: A boolean tensor indicating if the corresponding - draft token in draft_token_ids should be accepted or not. - substitute_token_ids: A tensor of token_ids that can be used - as substitutes for the draft token ids if the proposed token - is rejected. - draft_token_ids: A tensor of token ids speculated by the - draft model. - bonus_token_ids: Token ids to use as the bonus token if - all the draft tokens are accepted. - Returns: - A tensor containing the accepted token ids. The shape of the - tensor is [batch_size, k + num_bonus_tokens] - """ - batch_size, k = substitute_token_ids.shape - bonus_token_ids = bonus_token_ids.squeeze(-1) - # Determine the index of the first False value for each row. - limits = (accepted == 0).max(1).indices - limits[~(accepted == 0).any(1)] = k - - # Create masks using the indices. - indices = torch.arange(k, device=accepted.device).unsqueeze(0) - accepted_mask = indices < limits.unsqueeze(1) - after_false_mask = indices == limits.unsqueeze(1) - - # Create an extended output tensor - output_with_bonus_tokens = -torch.ones( - (batch_size, k + self._num_bonus_tokens), - dtype=self.token_id_dtype, - device=accepted.device) - output = output_with_bonus_tokens[:, :k] - - # Fill in the first k columns of the output tensor using masks and data - # tensors. - output[:, :k] = torch.where(accepted_mask, draft_token_ids, - -torch.ones_like(draft_token_ids)) - - # Fill the last column. - # We check output directly as accepted may have True values inconsistent - # with causal acceptance. - output_with_bonus_tokens[:, -1] = torch.where(output[:, -1] != -1, - bonus_token_ids, -1) - - # Fill the recovered token ids. - output.mul_(~after_false_mask).add_( - substitute_token_ids.mul(after_false_mask)) - - self.num_accepted_tokens += accepted.sum() - self.num_emitted_tokens += (output_with_bonus_tokens != -1).sum() - self.num_draft_tokens += batch_size * k - - return output_with_bonus_tokens - - def _raise_if_incorrect_input( - self, - target_with_bonus_probs: torch.Tensor, - draft_token_ids: torch.Tensor, - bonus_token_ids: torch.Tensor, - draft_probs: Optional[torch.Tensor] = None, - ) -> None: - self._raise_if_incorrect_shape(target_with_bonus_probs, - draft_token_ids, bonus_token_ids, - draft_probs) - self._raise_if_incorrect_dtype(target_with_bonus_probs, - draft_token_ids, bonus_token_ids, - draft_probs) - self._raise_if_inconsistent_device(target_with_bonus_probs, - draft_token_ids, bonus_token_ids, - draft_probs) - self._raise_if_out_of_bounds_vocab(target_with_bonus_probs.shape[-1], - draft_token_ids, bonus_token_ids) - - def _raise_if_incorrect_shape( - self, - target_with_bonus_probs: torch.Tensor, - draft_token_ids: torch.Tensor, - bonus_token_ids: torch.Tensor, - draft_probs: Optional[torch.Tensor] = None, - ) -> None: - (target_batch_size, num_target_probs, - target_vocab_size) = target_with_bonus_probs.shape - - # Does not count the extra token - num_target_probs -= 1 - - # validate the shape of draft token ids. 
- draft_token_ids_batch_size, num_draft_token_ids = draft_token_ids.shape - assert draft_token_ids_batch_size == target_batch_size - assert num_draft_token_ids == num_target_probs - - # validate the shape of bonus token ids - bonus_batch_size, num_bonus_tokens = bonus_token_ids.shape - assert bonus_batch_size == target_batch_size - assert num_bonus_tokens == self._num_bonus_tokens - - # validate the shape of draft probs if it is set - if draft_probs is not None: - (draft_batch_size, num_draft_probs, - draft_vocab_size) = draft_probs.shape - assert draft_batch_size == target_batch_size - assert num_draft_probs == num_target_probs - assert (draft_vocab_size == target_vocab_size - ), f"{draft_vocab_size=} {target_vocab_size=}" - - def _raise_if_incorrect_dtype( - self, - target_with_bonus_probs: torch.Tensor, - draft_token_ids: torch.Tensor, - bonus_token_ids: torch.Tensor, - draft_probs: Optional[torch.Tensor] = None, - ) -> None: - assert target_with_bonus_probs.dtype == self.probs_dtype - assert draft_token_ids.dtype == self.token_id_dtype - assert bonus_token_ids.dtype == self.token_id_dtype - if draft_probs is not None: - assert draft_probs.dtype == self.probs_dtype - - def _raise_if_inconsistent_device( - self, - target_with_bonus_probs: torch.Tensor, - draft_token_ids: torch.Tensor, - bonus_token_ids: torch.Tensor, - draft_probs: Optional[torch.Tensor] = None, - ) -> None: - devices = [ - t.device for t in [ - target_with_bonus_probs, bonus_token_ids, draft_probs, - draft_token_ids - ] if t is not None - ] - assert all([devices[0] == device for device in devices]) - - def _raise_if_out_of_bounds_vocab( - self, - vocab_size: int, - draft_token_ids: torch.Tensor, - bonus_token_ids: torch.Tensor, - ) -> None: - assert torch.all(bonus_token_ids < vocab_size) - assert torch.all(bonus_token_ids >= 0) - assert torch.all(draft_token_ids < vocab_size) - assert torch.all(draft_token_ids >= 0) - - -class SpecDecodeDeterministicBaseSampler(SpecDecodeBaseSampler): - """Base class for samplers used for Speculative Decoding verification - step which are deterministic. 
- """ - - @abstractmethod - def forward( - self, - target_with_bonus_probs: torch.Tensor, - bonus_token_ids: torch.Tensor, - draft_probs: torch.Tensor, - draft_token_ids: torch.Tensor, - ) -> torch.Tensor: - raise NotImplementedError - - -class SpecDecodeStochasticBaseSampler(SpecDecodeBaseSampler): - """Base class for samplers used for Speculative Decoding verification - step which are stochastic - """ - - @abstractmethod - def forward( - self, - target_with_bonus_probs: torch.Tensor, - bonus_token_ids: torch.Tensor, - draft_probs: torch.Tensor, - draft_token_ids: torch.Tensor, - seeded_seqs: Optional[dict[int, torch.Generator]] = None, - ) -> torch.Tensor: - raise NotImplementedError diff --git a/vllm/model_executor/layers/typical_acceptance_sampler.py b/vllm/model_executor/layers/typical_acceptance_sampler.py deleted file mode 100644 index 5dabaa5379e..00000000000 --- a/vllm/model_executor/layers/typical_acceptance_sampler.py +++ /dev/null @@ -1,166 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import torch -import torch.jit - -from vllm.model_executor.layers.spec_decode_base_sampler import ( - SpecDecodeDeterministicBaseSampler) - - -class TypicalAcceptanceSampler(SpecDecodeDeterministicBaseSampler): - """Apply typical acceptance sampling as described in section 3.3.1 in - "MEDUSA: Simple LLM Inference Acceleration Framework with - Multiple Decoding Heads" - https://arxiv.org/pdf/2401.10774 - """ - - def __init__( - self, - posterior_threshold: float, - posterior_alpha: float, - strict_mode: bool = False, - ): - """Create a Typical Acceptance Sampler. - - Args: - strict_mode: Whether or not to perform shape/device/dtype checks - during sampling. This catches correctness issues but adds - nontrivial latency. - posterior_threshold : A threshold value that sets a lower bound - on the posterior probability of a token in target model for it - to be accepted. - posterior_alpha : A scaling factor for the entropy-based - threshold in typical acceptance sampling. - """ - self._posterior_threshold = posterior_threshold - self._posterior_alpha = posterior_alpha - super().__init__(strict_mode=strict_mode) - - def forward( - self, - target_with_bonus_probs: torch.Tensor, - bonus_token_ids: torch.Tensor, - draft_probs: torch.Tensor, - draft_token_ids: torch.Tensor, - ) -> torch.Tensor: - """Sample token ids using typical acceptance sampling. This accepts - or rejects tokens proposed by the draft model using the probability - of each token according to the draft and target models. - - In the worst case where all draft tokens are rejected, it is guaranteed - one token will be emitted. - - In the case where all draft tokens are accepted, the bonus token will be - accepted. - - Args: - target_probs: The probability distribution over token ids given - context according to the target model. - shape = [batch_size, num_speculative_tokens, vocab_size] - - bonus_token_ids: The "bonus" token ids that are accepted iff all - speculative tokens in a sequence are accepted. - shape = [batch_size, num_bonus_tokens] - - draft_probs: This parameter is unused by the acceptance sampler. - - draft_token_ids: The token ids that were sampled from the draft - probabilities. - shape = [batch_size, num_speculative_tokens] - - Returns: - output_token_ids: The token ids sampled via rejection sampling, - or -1 if unable to sample a token because the previous token - was rejected. 
- shape = [batch_size, num_speculative_tokens + num_bonus_tokens] - """ - # Only perform shape/dtype/device checking in strict mode, as it adds - # overhead. - if self._strict_mode: - self._raise_if_incorrect_input(target_with_bonus_probs, - draft_token_ids, bonus_token_ids) - target_probs = target_with_bonus_probs[:, :-1] - accepted = self._evaluate_accepted_tokens(target_probs, - draft_token_ids) - recovered_token_ids = self._get_recovered_token_ids(target_probs) - output_token_ids = self._create_output(accepted, recovered_token_ids, - draft_token_ids, - bonus_token_ids) - return output_token_ids - - def _evaluate_accepted_tokens(self, target_probs, draft_token_ids): - r""" - Evaluates and returns a mask of accepted tokens based on the - posterior probabilities. - - Args: - target_probs (torch.Tensor): A tensor of shape - (batch_size, k, vocab_size) representing the probabilities of - each token in the vocabulary for each position in the proposed - sequence. This is the distribution generated by the target - model. - draft_token_ids (torch.Tensor): A tensor of shape (batch_size, k) - representing the proposed token ids. - - A draft token_id x_{n+k} is accepted if it satisfies the - following condition - - $$ - p_{\text{original}}(x_{n+k} | x_1, x_2, \dots, x_{n+k-1}) > - \min \left( \epsilon, \delta * \exp \left( - -H(p_{\text{original}}( - \cdot | x_1, x_2, \ldots, x_{n+k-1})) \right) \right) - $$ - - where $p_{\text{original}}$ corresponds to target_probs - and $\epsilon$ and $\delta$ correspond to hyperparameters - specified using self._posterior_threshold and self._posterior_alpha - - This method computes the posterior probabilities for the given - draft token ids based on the provided target probabilities. It - calculates the entropy of the posterior distribution and determines - a dynamic threshold for each token position using the provided - posterior_threshold and posterior_alpha values. The method then - returns a boolean mask indicating which tokens can be accepted. - - Returns: - torch.Tensor: A boolean tensor of shape (batch_size, k) where each - element indicates whether the corresponding draft token has - been accepted or rejected. True indicates acceptance and false - indicates rejection. - """ - device = target_probs.device - candidates_prob = torch.gather( - target_probs, dim=-1, - index=draft_token_ids.unsqueeze(-1)).squeeze(-1) - # A small constant added to prevent computing the logarithm of zero, - # which can lead to undefined values. - epsilon = 1e-5 - posterior_entropy = -torch.sum( - target_probs * torch.log(target_probs + epsilon), dim=-1) - threshold = torch.minimum( - torch.ones_like(posterior_entropy, device=device) * - self._posterior_threshold, - torch.exp(-posterior_entropy) * self._posterior_alpha, - ) - accepted_mask = candidates_prob > threshold - return accepted_mask - - def _get_recovered_token_ids(self, target_probs): - """ - The recovered token ids will fill the first unmatched token - by the target token. - - Args: - target_probs (torch.Tensor): A tensor of shape - (batch_size, k, vocab_size) containing the target probability - distribution. - - Returns: - torch.Tensor: A tensor of shape (batch_size, k) with the recovered - token ids which are selected from target probs. 
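To make the entropy-based acceptance rule described above concrete (a draft token is accepted when its posterior probability under the target model exceeds min(epsilon, alpha * exp(-H))), here is a minimal standalone sketch. It is not part of the patch; the threshold and alpha values are arbitrary illustrative numbers.

```python
import torch

posterior_threshold = 0.09  # epsilon: hard lower bound on the posterior
posterior_alpha = 0.3       # scaling factor for the entropy-based threshold

# Target-model distribution over a 4-token vocabulary at one position,
# and the draft token proposed for that position.
target_probs = torch.tensor([0.7, 0.2, 0.05, 0.05])
draft_token = 1

eps = 1e-5  # avoids log(0)
entropy = -(target_probs * torch.log(target_probs + eps)).sum()
threshold = min(posterior_threshold,
                (posterior_alpha * torch.exp(-entropy)).item())
accepted = target_probs[draft_token].item() > threshold

print(f"entropy={entropy.item():.3f}, threshold={threshold:.3f}, "
      f"accepted={accepted}")
# With these numbers the threshold is capped at epsilon=0.09, so the
# draft token (posterior 0.2) is accepted.
```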
- """ - max_indices = torch.argmax(target_probs, dim=-1) - - return max_indices diff --git a/vllm/model_executor/models/eagle.py b/vllm/model_executor/models/eagle.py deleted file mode 100644 index c551ecd68ef..00000000000 --- a/vllm/model_executor/models/eagle.py +++ /dev/null @@ -1,261 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from collections.abc import Iterable -from typing import Optional - -import torch -import torch.nn as nn - -from vllm.config import VllmConfig -from vllm.logger import init_logger -from vllm.model_executor.layers.layernorm import RMSNorm -from vllm.model_executor.layers.logits_processor import LogitsProcessor -from vllm.model_executor.layers.vocab_parallel_embedding import ( - DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead) -from vllm.model_executor.model_loader.weight_utils import default_weight_loader -from vllm.model_executor.models import ModelRegistry -from vllm.model_executor.sampling_metadata import SamplingMetadata -from vllm.sequence import IntermediateTensors - -from .utils import maybe_prefix - -logger = init_logger(__name__) - - -class DummyInputLayerNorm(nn.Module): - - def __init__(self, weight=None, bias=None): - super().__init__() - self.weight = nn.Parameter(weight) if weight is not None else None - self.bias = nn.Parameter(bias) if bias is not None else None - - def forward(self, x): - return x - - -class DummyOutputNorm(nn.Module): - - def forward(self, x, residual): - if residual is None: - return x - else: - return x + residual, None - - -class EAGLE(nn.Module): - """This class implements the EAGLE draft model from the paper: https://arxiv.org/pdf/2401.15077 - Reference implementation: https://github.com/SafeAILab/EAGLE - - Differences from reference implementation: - 1. In reference, LlamaDecoderLayer implementation doesn't have - input_layernorm for 1st decoder layer (https://github.com/SafeAILab/EAGLE/blob/7d065d084443fbfd386f88839efd7193c12be869/eagle/model/cnets.py#L427). - Following this approach, our implementation also disables - the input_layernorm for the first decoder layer. - 2. We allow any decoder layer to be used in EAGLE whereas in reference - decoder layer is fixed to be LlamaDecoderLayer. - 3. We have an optional token_map which reduces draft vocab to most - frequently used tokens to give some additional speed-up by reducing - sampling overhead. This is disabled unless the checkpoint file has - explicit token_map tensor and config has an optional attribute - truncated_vocab_size < vocab_size. To use this technique, one has to find - the top-k most frequent tokens in target dataset and add that as a tensor - in the draft checkpoint (using key token_map). Also, the draft config - needs to have truncated_vocab_size (=k) as an attribute. - 4. We allow an enhanced EAGLE architecture similar to the DeepSeek MTP - module with regards to the use of additional RMS norms. The original - EAGLE architecture 1) skips the pre-attention norm in its first - transformer block, and 2) skips the final output norm, both of which we - found to be suboptimal. We also add the support for separate norms - applying to both the token embedding and hidden states before projection - as in DeepSeek MTP, which we found to improve performance as well. 
- """ - - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - config = vllm_config.model_config.hf_config - self.dtype = vllm_config.model_config.dtype - self.config = config - - architectures = getattr(self.config.model, "architectures", []) - model_cls, _ = ModelRegistry.resolve_model_cls(architectures) - - self.model = model_cls(vllm_config=vllm_config, - prefix=maybe_prefix(prefix, "model")) - - self.fc = nn.Linear(config.model.hidden_size * 2, - config.model.hidden_size, - bias=getattr(self.config, "eagle_fc_bias", False)) - - # Modify layer normalization and residual connections as suggested - # in the EAGLE framework: https://github.com/SafeAILab/EAGLE - # While weights and biases are generally not needed, - # they are retained here to support certain unit tests - # (e.g., spec_decode/e2e/test_eagle_correctness.py). - if not hasattr(self.config.model, - "skip_prenorm") or self.config.model.skip_prenorm: - self.model.model.layers[0].input_layernorm = DummyInputLayerNorm( - weight=self.model.model.layers[0].input_layernorm.weight) - - if not hasattr( - self.config.model, - "skip_output_norm") or self.config.model.skip_output_norm: - self.model.model.norm = DummyOutputNorm() - - self.add_para_norm = False - if hasattr(self.config.model, - "add_para_norm") and self.config.model.add_para_norm: - self.enorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) - self.hnorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) - self.add_para_norm = True - - self.orig_vocab_size = config.vocab_size - self.truncated_vocab_size = config.truncated_vocab_size - self.unpadded_vocab_size = self.truncated_vocab_size - - self.lm_head = ParallelLMHead( - self.unpadded_vocab_size, - config.hidden_size, - org_num_embeddings=self.truncated_vocab_size, - padding_size=DEFAULT_VOCAB_PADDING_SIZE, - ) - - logit_scale = getattr(config, "logit_scale", 1.0) - self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, - self.truncated_vocab_size, - logit_scale) - - # Token map is a idx to token mapping to reduce the vocab size for - # the draft model. Using smaller vocab size for draft, containing - # only most frequent tokens reduces the speculation overhead. This - # doesn't affect the acceptance rate much and thus gives more speed - # -up. By default, this is disabled and is only used if the EAGLE - # checkpoint file has token_map tensor. 
- self.token_map = None - - def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: - return self.model.model.get_input_embeddings(input_ids) - - def forward( - self, - input_ids: torch.Tensor, - positions: torch.Tensor, - previous_hidden_states: torch.Tensor, - intermediate_tensors: Optional[IntermediateTensors] = None, - inputs_embeds: Optional[torch.Tensor] = None, - ) -> torch.Tensor: - - if inputs_embeds is None: - inputs_embeds = self.get_input_embeddings(input_ids) - - # Handle both empty previous_hidden_states - # and mismatched batch size - batch_size = inputs_embeds.size(0) - if previous_hidden_states.size(0) == 0 or \ - previous_hidden_states.size(0) != batch_size: - hidden_dim = self.config.model.hidden_size - device = inputs_embeds.device - # Create zero tensor with matching batch size - previous_hidden_states = \ - torch.zeros(batch_size, hidden_dim, device=device) - - if self.add_para_norm: - inputs_embeds = torch.cat([ - self.enorm(inputs_embeds), - self.hnorm(previous_hidden_states) - ], - dim=-1) - else: - inputs_embeds = torch.cat([inputs_embeds, previous_hidden_states], - dim=-1) - - inputs_embeds = self.fc(inputs_embeds) - - inputs_embeds[positions == 0] = 0 # masking inputs at position=0 - - hidden_states = self.model.model( - input_ids=None, - inputs_embeds=inputs_embeds, - positions=positions, - intermediate_tensors=intermediate_tensors, - ) - return hidden_states - - def compute_logits(self, hidden_states: torch.Tensor, - sampling_metadata: SamplingMetadata) -> torch.Tensor: - logits = self.logits_processor(self.lm_head, hidden_states, - sampling_metadata) - - if self.token_map is not None: - _logits = logits - logits = -torch.inf * torch.ones( - size=(*_logits.shape[:-1], self.orig_vocab_size), - device=_logits.device, - dtype=_logits.dtype) - - logits[..., self.token_map] = _logits - - return logits - - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): - # This implementation is incompatible with https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B - # due to missing lm_head weights and its config being that of a - # Llama model. 
Here's a compatible version with the same weights: - # https://huggingface.co/abhigoyal/EAGLE-LLaMA3-Instruct-8B-vllm - # Also, here's an example script for converting trained EAGLE - # checkpoint to vLLM compatible version: https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d - model_weights = {} - for name, loaded_weight in weights: - if name == "token_map": - if self.config.truncated_vocab_size < self.config.vocab_size: - self.token_map = nn.Parameter(loaded_weight, - requires_grad=False) - elif name.startswith("fc.weight"): - weight_loader = getattr(self.fc.weight, "weight_loader", - default_weight_loader) - weight_loader(self.fc.weight, loaded_weight) - elif name.startswith("fc.bias"): - if self.fc.bias is not None: - weight_loader = getattr(self.fc.bias, "weight_loader", - default_weight_loader) - weight_loader(self.fc.bias, loaded_weight) - else: - logger.warning_once("Found bias in the loaded weights but " - "the model config doesn't have bias.") - elif name.startswith("enorm.weight"): - weight_loader = getattr(self.enorm.weight, "weight_loader", - default_weight_loader) - weight_loader(self.enorm.weight, loaded_weight) - elif name.startswith("hnorm.weight"): - weight_loader = getattr(self.hnorm.weight, "weight_loader", - default_weight_loader) - weight_loader(self.hnorm.weight, loaded_weight) - elif name.startswith("model.lm_head.") or name.startswith( - "model.model."): - model_weights[name.split("model.", 1)[-1]] = loaded_weight - elif name.startswith("lm_head.") or name.startswith("model."): - model_weights[name] = loaded_weight - else: - model_weights[f"model.{name}"] = loaded_weight - - if "lm_head.weight" in model_weights: - lm_head_weight = model_weights.pop("lm_head.weight") - - if self.token_map is not None and\ - lm_head_weight.shape[0] > self.token_map.shape[0]: - - lm_head_weight = lm_head_weight[self.token_map] - - else: - # NOTE(Shangming): initialize the placeholder for lm_head weight. - lm_head_weight = torch.zeros( - self.lm_head.org_vocab_size, - self.lm_head.embedding_dim, - dtype=self.dtype, - ) - - weight_loader = getattr(self.lm_head.weight, "weight_loader", - default_weight_loader) - weight_loader(self.lm_head.weight, lm_head_weight) - - self.model.load_weights(model_weights.items()) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index fd831727ab2..d5233c28b19 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -239,14 +239,15 @@ _SPECULATIVE_DECODING_MODELS = { "MiMoMTPModel": ("mimo_mtp", "MiMoMTP"), - "EAGLEModel": ("eagle", "EAGLE"), "EagleLlamaForCausalLM": ("llama_eagle", "EagleLlamaForCausalLM"), "EagleLlama4ForCausalLM": ("llama4_eagle", "EagleLlama4ForCausalLM"), "EagleMiniCPMForCausalLM": ("minicpm_eagle", "EagleMiniCPMForCausalLM"), "Eagle3LlamaForCausalLM": ("llama_eagle3", "Eagle3LlamaForCausalLM"), "DeepSeekMTPModel": ("deepseek_mtp", "DeepSeekMTP"), "MedusaModel": ("medusa", "Medusa"), - "MLPSpeculatorPreTrainedModel": ("mlp_speculator", "MLPSpeculator"), + # Temporarily disabled. + # # TODO(woosuk): Re-enable this once the MLP Speculator is supported in V1. 
+ # "MLPSpeculatorPreTrainedModel": ("mlp_speculator", "MLPSpeculator"), } _TRANSFORMERS_MODELS = { diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index 240724a675a..962e2b3aab6 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -132,14 +132,10 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None: parallel_config.worker_cls = \ "vllm.worker.multi_step_worker.MultiStepWorker" elif vllm_config.speculative_config: - if envs.VLLM_USE_V1: - parallel_config.worker_cls = \ - "vllm.v1.worker.gpu_worker.Worker" - else: - parallel_config.worker_cls = \ - "vllm.spec_decode.spec_decode_worker.create_spec_worker" - parallel_config.sd_worker_cls = \ - "vllm.worker.worker.Worker" + if not envs.VLLM_USE_V1: + raise NotImplementedError( + "Speculative decoding is not supported on vLLM V0.") + parallel_config.worker_cls = "vllm.v1.worker.gpu_worker.Worker" else: if envs.VLLM_USE_V1: parallel_config.worker_cls = \ diff --git a/vllm/platforms/rocm.py b/vllm/platforms/rocm.py index e9e18d3fe8e..0bf9262776b 100644 --- a/vllm/platforms/rocm.py +++ b/vllm/platforms/rocm.py @@ -326,15 +326,10 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None: parallel_config.worker_cls = \ "vllm.worker.multi_step_worker.MultiStepWorker" elif vllm_config.speculative_config: - if envs.VLLM_USE_V1: + if not envs.VLLM_USE_V1: raise NotImplementedError( - "Speculative decoding is not yet supported on vLLM V1." - ) - else: - parallel_config.worker_cls = \ - "vllm.spec_decode.spec_decode_worker.create_spec_worker" - parallel_config.sd_worker_cls = \ - "vllm.worker.worker.Worker" + "Speculative decoding is not supported on vLLM V0.") + parallel_config.worker_cls = "vllm.v1.worker.gpu_worker.Worker" else: if envs.VLLM_USE_V1: parallel_config.worker_cls = \ diff --git a/vllm/sequence.py b/vllm/sequence.py index ffe890eb2da..87ba74c6853 100644 --- a/vllm/sequence.py +++ b/vllm/sequence.py @@ -112,13 +112,6 @@ class RequestMetrics: model_execute_time: The time spent in the model execute function. This will include model forward, block/sync across workers, cpu-gpu sync time and sampling time. - spec_token_acceptance_counts: number of accepted speculative tokens at - each position; the first token is from - the target model and is always accepted; - e.g., when it's [10, 8, 4, 2] for a req, - it means there were 10 forward passes in - total, and there were 8, 4, 2 accepted - tokens at 1st, 2nd, 3rd speculation step. """ arrival_time: float last_token_time: float @@ -129,7 +122,6 @@ class RequestMetrics: scheduler_time: Optional[float] = None model_forward_time: Optional[float] = None model_execute_time: Optional[float] = None - spec_token_acceptance_counts: Optional[list[int]] = None class SequenceDataDelta( @@ -748,9 +740,7 @@ def __init__(self, last_token_time=arrival_time, first_scheduled_time=None, first_token_time=None, - time_in_queue=None, - spec_token_acceptance_counts=[0] * - draft_size) + time_in_queue=None) self.last_token_latency = 0.0 self.lora_request = lora_request self.prompt_logprobs: Optional[PromptLogprobs] = None @@ -1390,8 +1380,6 @@ class ExecuteModelRequest( previous_hidden_states: Optional[HiddenStates] = None # The number of forward steps to run. num_steps: int = 1 - # The step index for spec model input. - spec_step_idx: Optional[int] = None # Finished request ids since last step. finished_requests_ids: list[str] = msgspec.field(default_factory=list) # The last sampled token ids for multi step decoding. 
diff --git a/vllm/spec_decode/__init__.py b/vllm/spec_decode/__init__.py deleted file mode 100644 index e69de29bb2d..00000000000 diff --git a/vllm/spec_decode/batch_expansion.py b/vllm/spec_decode/batch_expansion.py deleted file mode 100644 index f9b882469a4..00000000000 --- a/vllm/spec_decode/batch_expansion.py +++ /dev/null @@ -1,506 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from array import array -from itertools import chain, count -from typing import Iterator, List, Optional, Tuple - -import torch - -from vllm import SamplingParams -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.sequence import (VLLM_INVALID_TOKEN_ID, VLLM_TOKEN_ID_ARRAY_TYPE, - ExecuteModelRequest, SequenceData, - SequenceGroupMetadata, get_all_seq_ids) -from vllm.spec_decode.interfaces import (SpeculativeProposals, - SpeculativeScorer, SpeculativeScores) -from vllm.spec_decode.util import nvtx_range, split_batch_by_proposal_len - -SeqId = int -TargetSeqId = int -TokenId = int - -DEFAULT_SIMPLE_SAMPLING_PARAMS = SamplingParams() - - -class BatchExpansionTop1Scorer(SpeculativeScorer): - """Implements a speculative scorer that uses batch expansion to get - probabilities of speculative tokens according to the scoring model. - - Batch expansion converts a list of sequences and multiple query positions - to a new batch of sequences, each with a single query position. This allows - for MQA-like scoring in speculative decoding without requiring an MQA - kernel. - - It is strictly less efficient than MQA scoring. - - It only supports scoring the top1 proposal tokens of the proposer, instead - of topk/tree. - """ - - @nvtx_range("BatchExpansionTop1Scorer.score_proposals") - def score_proposals( - self, - execute_model_req: ExecuteModelRequest, - proposals: SpeculativeProposals, - ) -> SpeculativeScores: - """Score the proposed tokens via the scorer model. - - This converts each input sequence to a set of k+1 target sequences. The - target sequences have the unique continuations to be scored and a - unique sequence ID that is different from all input sequence ids. - - If a speculative sequence length would exceed the max model length, then - no speculation is produced for that sequence. - - Args: - execute_model_req: The execution request. - proposals: The speculative proposals to score. - Returns: - SpeculativeScores: The scores of each speculative token, along with - which sequences were ignored during scoring. - """ - - # TODO(cade) perform this on GPU to remove blocking call. - proposal_lens_list = proposals.proposal_lens.tolist() - proposal_token_ids_list = proposals.proposal_token_ids.tolist() - - # Filter the list to ignore invalid proposals. 
- proposal_token_ids_list_without_skips = [ - proposals for proposals in proposal_token_ids_list - if VLLM_INVALID_TOKEN_ID not in proposals - ] - - (spec_indices, non_spec_indices, target_seq_group_metadata_list, - num_scoring_tokens) = self._expand_batch( - seq_group_metadata_list=execute_model_req.seq_group_metadata_list, - proposal_token_ids_list=proposal_token_ids_list_without_skips, - proposal_lens_list=proposal_lens_list, - ) - - target_sampler_output = self._scorer_worker.execute_model( - execute_model_req=execute_model_req.clone( - seq_group_metadata_list=target_seq_group_metadata_list)) - assert len(target_sampler_output) == 1, "expected single-step output" - target_sampler_output = target_sampler_output[0] - - if not non_spec_indices: - # All sequence groups in batch have spec decoding enabled - return self._contract_batch_all_spec( - target_sampler_output=target_sampler_output, - proposals=proposals, - ) - else: - # Batch has a mix of spec decode enabled and disabled seq groups - return self._contract_batch( - execute_model_req.seq_group_metadata_list, - target_sampler_output=target_sampler_output, - proposals=proposals, - num_scoring_tokens=num_scoring_tokens, - non_spec_indices=non_spec_indices, - spec_indices=spec_indices, - k=execute_model_req.num_lookahead_slots, - ) - - def _expand_batch( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - proposal_token_ids_list: List[List[TokenId]], - proposal_lens_list: List[int], - ) -> Tuple[List[int], List[int], List[SequenceGroupMetadata], int]: - """Given the input sequences and potentially multiple corresponding - proposal tokens, create a new batch where each sequence has a single - query token. - """ - - # vLLM currently only supports proposal lens equal to zero or the batch - # proposal len. This adds some complexity (splitting the batch into spec - # and non spec sequences) and should be removed in the future. It can be - # done by supporting per-sequence proposal lens. - (spec_seqs, spec_indices), (non_spec_seqs, non_spec_indices) = \ - split_batch_by_proposal_len( - seq_group_metadata_list, proposal_lens_list) - - spec_expanded_seqs = self._create_scoring_model_input( - seq_group_metadata_list=spec_seqs, - proposal_token_ids=proposal_token_ids_list, - # NOTE: We determine the seq ids in the expanded batch using the - # full seq_group_metadata_list, instead of only spec_seqs. - target_seq_ids_iter=self._create_target_seq_id_iterator( - seq_ids=get_all_seq_ids(seq_group_metadata_list)), - ) - - num_scoring_tokens = len(spec_expanded_seqs) - # Batch speculative and non-speculative (e.g. chunked prefill) requests - # but make sure order is prefill|decode due to backend requirement. - target_seq_group_metadata_list = non_spec_seqs + spec_expanded_seqs - - return (spec_indices, non_spec_indices, target_seq_group_metadata_list, - num_scoring_tokens) - - def _contract_non_speculative( - self, scores: SpeculativeScores, - seq_group_metadata_list: List[SequenceGroupMetadata], - non_spec_indices: List[int], non_spec_outputs: SpeculativeScores, - has_prompt_log: bool) -> SpeculativeScores: - """ - Augment input `scores` with non-speculative requests outputs. - This includes decode requests with speculation turned off, as well - as prefill requests when `enable_chunked_prefill` is set. - For the latter, prefills are further separated into terminal and - non-terminal chunks (from which no token is sampled). 
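The index arithmetic described above (with prompt logprobs enabled, each prefill contributes `token_chunk_size` entries to the flattened sampler output and only the last one is the sampled token, while each decode contributes exactly one) can be illustrated with a tiny standalone example. The request layout below is invented for illustration and is not part of the patch.

```python
import torch

# Per-request sizes in the flattened sampler output, prefill|decode order:
# a prefill chunk of 4 tokens, a prefill chunk of 3 tokens, then two decodes.
sizes = torch.tensor([4, 3, 1, 1])

# The sampled token for each request sits at the end of its slice.
sampled_idx = torch.cumsum(sizes, dim=0) - 1
print(sampled_idx.tolist())  # [3, 6, 7, 8]

flat_token_ids = torch.arange(100, 100 + int(sizes.sum()))
print(flat_token_ids[sampled_idx].tolist())  # [103, 106, 107, 108]
```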
- """ - if not non_spec_indices: - return scores - - if has_prompt_log: - # When prompt_logprobs is enabled, prefills yield output token - # (and respective prob) in the last entry (prompt|out): - # [.|.|.|prefill0_out|.|prefill1_out|decode0_out|..]. - # With chunked prefill, non-terminal chunks have -1 on each - # position: they're still picked, but they're discarded later. - seq_meta = seq_group_metadata_list - nospec_sizes = torch.tensor([ - seq_meta[i].token_chunk_size if seq_meta[i].is_prompt else 1 - for i in non_spec_indices - ]) - nospec_sampled_token_idxs = torch.cumsum(nospec_sizes, 0).add_(-1) - else: - # In this case only sampled tokens are returned, select all. - nospec_sampled_token_idxs = list( - range(len(non_spec_outputs.token_ids))) - - scores.token_ids[non_spec_indices, :1] = \ - non_spec_outputs.token_ids[nospec_sampled_token_idxs].unsqueeze(1) - scores.probs[non_spec_indices, :1, :] = \ - non_spec_outputs.probs[nospec_sampled_token_idxs].unsqueeze(1) - scores.logprobs[non_spec_indices, :1, :] = \ - non_spec_outputs.logprobs[nospec_sampled_token_idxs].unsqueeze(1) - if scores.hidden_states is not None: - assert non_spec_outputs.hidden_states is not None - scores.hidden_states[non_spec_indices, :1, :] = \ - non_spec_outputs.hidden_states[nospec_sampled_token_idxs].unsqueeze(1) - return scores - - def _contract_batch( - self, - contracted_seq_group_metadata_list: List[SequenceGroupMetadata], - target_sampler_output: SamplerOutput, - proposals: SpeculativeProposals, num_scoring_tokens: int, - non_spec_indices: List[int], spec_indices: List[int], - k: int) -> SpeculativeScores: - """Contract the expanded batch back into its original size. - This maps the scores of speculative tokens back to their original - sequences. - - contracted_bs is the original batch size, and the batch size that the - target_sampler_output will be contracted to. - """ - contracted_bs = len(contracted_seq_group_metadata_list) - (target_token_ids, target_probs, target_logprobs, target_hidden_states, - non_spec_target_token_ids, non_spec_target_probs, - non_spec_target_logprobs, - non_spec_target_hidden_states) = self._split_scoring_output( - target_sampler_output, num_scoring_tokens) - - # Map distinct sequences used to score each token - # of shape [batch_size * k + 1] back to [batch_size, k + 1]. 
- expanded_batch_size, k = proposals.proposal_token_ids.shape - - # The number of tokens in the expanded batch used for speculation is - # equal to the total expanded batch size minus the number of samples for - # non-speculative sequences, prefill chunks with no out tokens included - non_spec_expanded_bs = len(non_spec_indices) - spec_expanded_bs = expanded_batch_size - non_spec_expanded_bs - - target_token_ids = target_token_ids.reshape(spec_expanded_bs, k + 1) - target_probs = target_probs.reshape(*target_token_ids.shape, - self._vocab_size) - target_logprobs = target_logprobs.reshape(target_probs.shape) - - if target_hidden_states is not None: - target_hidden_states = target_hidden_states.reshape( - *target_token_ids.shape, target_hidden_states.shape[-1]) - - all_tokens = target_token_ids.new_full(size=(contracted_bs, k + 1), - fill_value=-1) - all_probs = target_probs.new_zeros(*all_tokens.shape, self._vocab_size) - all_logprobs = target_logprobs.new_full(size=all_probs.shape, - fill_value=-float("inf")) - - if target_sampler_output.hidden_states is not None: - all_hidden_states = target_hidden_states.new_zeros( - size=(contracted_bs, k + 1, target_hidden_states.shape[-1])) - else: - all_hidden_states = None - - has_prompt_log = any((sg.sampling_params.prompt_logprobs - and sg.sampling_params.prompt_logprobs > 0) - for sg in contracted_seq_group_metadata_list) - # When prompt logprobs is enabled, lens of returned tensors go from - # n_sampled (requests with do_sample=True) to n_prompt+n_prefills. - # We adjust stride accordingly to get the generated tokens and - # their probs, but pass on prompt_logprobs as is. - prompt_logprobs = None - if (not self._scorer_worker.model_runner.disable_logprobs\ - and has_prompt_log): - prompt_logprobs = [ - o.prompt_logprobs for o in target_sampler_output.outputs - ] - elif not has_prompt_log: - # When prompt logprobs are not to be returned, - # we can ignore non-terminal chunks (no out token). - non_spec_indices = [ - idx for idx in non_spec_indices - if contracted_seq_group_metadata_list[idx].do_sample - ] - - # "Contract" speculative. - if spec_indices: - all_tokens[spec_indices] = target_token_ids - all_probs[spec_indices] = target_probs - all_logprobs[spec_indices] = target_logprobs - if all_hidden_states is not None: - all_hidden_states[spec_indices] = target_hidden_states - - spec_scores = SpeculativeScores(probs=all_probs, - token_ids=all_tokens, - logprobs=all_logprobs, - hidden_states=all_hidden_states, - prompt_logprobs=prompt_logprobs) - - non_spec_outputs = SpeculativeScores( - probs=non_spec_target_probs, - token_ids=non_spec_target_token_ids, - logprobs=non_spec_target_logprobs, - hidden_states=non_spec_target_hidden_states) - # Contract remaining nonspec entries based on non_spec_indices, if any. - return self._contract_non_speculative( - spec_scores, contracted_seq_group_metadata_list, non_spec_indices, - non_spec_outputs, has_prompt_log) - - def _contract_batch_all_spec( - self, - target_sampler_output: SamplerOutput, - proposals: SpeculativeProposals, - ) -> SpeculativeScores: - """Contract the expanded batch back into its original size. - This maps the scores of speculative tokens back to their original - sequences. - - It assumes all sequences in the batch were previously expanded. - """ - - # Map distinct sequences used to score each token - # of shape [batch_size * k + 1] back to [batch_size, k + 1]. 
- contracted_bs, k = proposals.proposal_token_ids.shape - - # Reshape tensors to original batch size - target_token_ids = target_sampler_output.sampled_token_ids.reshape( - contracted_bs, k + 1) - target_probs = target_sampler_output.sampled_token_probs.reshape( - *target_token_ids.shape, self._vocab_size) - target_logprobs = target_sampler_output.logprobs.reshape( - target_probs.shape) - target_hidden_states = target_sampler_output.hidden_states - if target_hidden_states is not None: - target_hidden_states = target_hidden_states.reshape( - *target_token_ids.shape, target_hidden_states.shape[-1]) - - return SpeculativeScores(probs=target_probs, - token_ids=target_token_ids, - logprobs=target_logprobs, - hidden_states=target_hidden_states, - prompt_logprobs=None) - - def _create_scoring_model_input( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - proposal_token_ids: List[List[TokenId]], # shape: [batch_size, k] - target_seq_ids_iter: Iterator[TargetSeqId], - ) -> List[SequenceGroupMetadata]: - """Given the original input sequences and proposed tokens from the draft - model, create a list of target sequences that can be used for scoring. - - target_seq_ids_iter provides sequence ids for the expanded batch, - fulfilling the requirement that no seq id in the expanded batch is equal - to the seq id in the original batch. - """ - - if not seq_group_metadata_list: - return [] - - target_seq_group_metadata = list( - chain.from_iterable( - self._create_target_seq_group_metadata( - seq_group_metadata, - proposal_token_ids, - i, - target_seq_ids_iter, - ) for i, seq_group_metadata in enumerate( - seq_group_metadata_list))) - - return target_seq_group_metadata - - def _create_target_seq_group_metadata( - self, - input_seq_group_metadata: SequenceGroupMetadata, - proposal_token_ids: List[List[TokenId]], # shape: [batch_size, k] - batch_index: int, - target_seq_ids_iter: Iterator[TargetSeqId], - ) -> List[SequenceGroupMetadata]: - """Given an input sequence group metadata and a list of draft tokens, - create a list of target SequenceGroupMetadata, one for each - token id that needs to be scored. - - Naive speculative decoding requires K target model scores, one for each - draft model token. However one can add a bonus token such that if each - token is accepted, then a final token may be sampled from the model. - This function creates K+1 target SequenceGroupMetadata to take - advantage of the bonus token. - """ - assert len(input_seq_group_metadata.seq_data) == 1, ( - "Beam search " - "not supported in speculative decoding") - input_seq_id = next(iter(input_seq_group_metadata.seq_data.keys())) - - token_ids_to_score = self._get_token_ids_to_score( - proposal_token_ids[batch_index]) - - sampling_params = input_seq_group_metadata.sampling_params - target_seq_group_metadata_list: List[SequenceGroupMetadata] = [] - for i, token_ids in enumerate(token_ids_to_score): - target_seq_group_metadata_list.append( - self._create_single_target_seq_group_metadata( - input_seq_group_metadata, - input_seq_id, - next(target_seq_ids_iter), - token_ids, - sampling_params=sampling_params, - )) - - return target_seq_group_metadata_list - - @staticmethod - def _create_single_target_seq_group_metadata( - seq_group_metadata: SequenceGroupMetadata, - seq_id: SeqId, - target_seq_id: TargetSeqId, - token_ids: List[TokenId], - sampling_params: SamplingParams, - ) -> SequenceGroupMetadata: - """Create a single target SequenceGroupMetadata. - - Args: - seq_group_metadata: The metadata for the input sequence. 
- seq_id: The input sequence ID. - target_seq_id: The corresponding target sequence ID. - token_ids: The list of token ids that are to be appended to the - input sequence. - """ - seq_data = seq_group_metadata.seq_data[seq_id] - prompt_token_ids = seq_data.prompt_token_ids_array - new_output_token_ids = [*seq_data.get_output_token_ids(), *token_ids] - mrope_position_delta = seq_data.mrope_position_delta - - new_seq_data_dict = { - target_seq_id: - SequenceData( - prompt_token_ids, - _output_token_ids=array(VLLM_TOKEN_ID_ARRAY_TYPE, - new_output_token_ids), - ), - } - # This is a hack. Technically, spec decoding should compute - # num_lookahead slots at one shot, but instead, it expands the batch - # and evaluate one by one right now. context_len is seq_len - 1 because - # the kv cache is filled by a previous batch in the batch expansion. - for data in new_seq_data_dict.values(): - data.update_num_computed_tokens(data.get_len() - 1) - data.mrope_position_delta = mrope_position_delta - - return SequenceGroupMetadata( - request_id=seq_group_metadata.request_id, - is_prompt=seq_group_metadata.is_prompt, - seq_data=new_seq_data_dict, - sampling_params=sampling_params, - block_tables={ - target_seq_id: seq_group_metadata.block_tables[seq_id], - }, - lora_request=None, - token_chunk_size=1, - ) - - @staticmethod - def _split_scoring_output( - sampler_output: SamplerOutput, num_scoring_tokens: int - ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, - Optional[torch.Tensor], torch.Tensor, torch.Tensor, - torch.Tensor, Optional[torch.Tensor]]: - """Split the target model output into speculative and non-speculative - output. - """ - - # vLLM currently only supports proposal lens equal to zero or the batch - # proposal len. This adds some complexity (splitting the batch into spec - # and non spec sequences) and should be removed in the future. It can be - # done by supporting per-sequence proposal lens. - # - # First samples are non-speculative, latter samples are from speculative - # scoring (prefill|decode order). - split_sizes = (sampler_output.sampled_token_ids.numel() - - num_scoring_tokens, num_scoring_tokens) - (non_spec_probs, - spec_probs) = sampler_output.sampled_token_probs.split(split_sizes) - (non_spec_sampled_tokens, spec_sampled_tokens - ) = sampler_output.sampled_token_ids.flatten().split(split_sizes) - (non_spec_logprobs, - spec_logprobs) = sampler_output.logprobs.split(split_sizes) - - if sampler_output.hidden_states is not None: - (non_spec_hidden_states, spec_hidden_states - ) = sampler_output.hidden_states.split(split_sizes) - else: - non_spec_hidden_states, spec_hidden_states = None, None - - return (spec_sampled_tokens, spec_probs, spec_logprobs, - spec_hidden_states, non_spec_sampled_tokens, non_spec_probs, - non_spec_logprobs, non_spec_hidden_states) - - @staticmethod - def _create_target_seq_id_iterator( - seq_ids: List[SeqId]) -> Iterator[TargetSeqId]: - """Create an iterator for creating target sequence ids. - Target sequence ids are distinct from sequence ids because we create a - distinct target sequence id for each proposal token to be scored. - - This implementation increments a counter starting at 1 + max of all - provided input sequence ids. - """ - return count(start=max(seq_ids) + 1) - - @staticmethod - def _get_token_ids_to_score( - full_spec_token_ids: List[TokenId] # shape: [k] - ) -> List[List[TokenId]]: - """Given an int tensor of proposal token ids, return a list of - token ids that should be scored. - - Returns k+1 output lists. 
The additional one is used for generating the - bonus token. - - Example: - Input: [0, 1, 2, 3] (k=4) - Output: (k+1 lists) - [] - [0] - [0, 1] - [0, 1, 2] - [0, 1, 2, 3] - """ - empty_token_ids: List[TokenId] = [] - - token_ids_to_score = [empty_token_ids] - token_ids_to_score.extend(full_spec_token_ids[:i + 1] - for i in range(len(full_spec_token_ids))) - return token_ids_to_score diff --git a/vllm/spec_decode/draft_model_runner.py b/vllm/spec_decode/draft_model_runner.py deleted file mode 100644 index 96646ec9471..00000000000 --- a/vllm/spec_decode/draft_model_runner.py +++ /dev/null @@ -1,349 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from typing import List, Optional - -import torch - -from vllm.forward_context import set_forward_context -from vllm.model_executor.layers.sampler import SamplerOutput - -try: - try: - from vllm.attention.backends.flash_attn import FlashAttentionMetadata - except (ModuleNotFoundError, ImportError): - # vllm_flash_attn is not installed, try the ROCm FA metadata - from vllm.attention.backends.rocm_flash_attn import ( - ROCmFlashAttentionMetadata as FlashAttentionMetadata) -except (ModuleNotFoundError, ImportError) as err: - raise RuntimeError( - "Draft model speculative decoding currently only supports " - "CUDA and ROCm flash attention backend.") from err - -from vllm.logger import init_logger -from vllm.multimodal import MultiModalKwargs -from vllm.sequence import ExecuteModelRequest, IntermediateTensors -from vllm.worker.model_runner_base import (ModelRunnerBase, - ModelRunnerInputBase, - ModelRunnerWrapperBase) - -logger = init_logger(__name__) - -# A flag to enable debug prints for the updated input tensors -# before each step. -debug_advance_input = False -# A flag to allow GPU advance step for draft model runner. -# Set to False for debugging. -allow_gpu_advance_step = True - - -class TP1DraftModelRunner(ModelRunnerWrapperBase): - """Specialized model runner for speculative decoding draft model. - Since the draft model always execute k forward passes consecutively to - generate k speculative tokens in a single speculative decoding step, - we could get rid of most CPU-GPU synchronization and data transfer - overheads by keeping model input and output tensors on GPU all the time. - - TODOs: - 1. Currently supports only flash-attn, add support for other attn_backends. - 2. Support TP > 1 (this requires some designs because we do not expect - any broadcasting inside execute_model). 
- """ - - def __init__(self, model_runner: ModelRunnerBase): - super().__init__(model_runner) - - self.indices_of_seq_with_bonus_tokens = None - - def _update_sampling_metadata(self, sampling_metadata, num_seqs, - num_queries): - - assert sampling_metadata.num_prompts == 0 - assert len(sampling_metadata.seq_groups) == num_queries - assert sampling_metadata.selected_token_indices.shape == ( - num_queries, ) - # assert sampling_metadata.categorized_sample_indices == TODO: Add if needed # noqa: E501 - - # Verify that all sequences are decodes - for i in range(num_queries): - seq_group = sampling_metadata.seq_groups[i] - - assert seq_group.is_prompt is False # No prompt - assert seq_group.prompt_logprob_indices == [] # No prompt - assert seq_group.sample_indices == [i] # Simple - - def _gpu_advance_step(self, model_input: ModelRunnerInputBase, - last_output: SamplerOutput) -> ModelRunnerInputBase: - # Currently, we expect "decode mode" only - assert not model_input.is_prompt - - # Get num_seqs - num_seqs = len(model_input.seq_lens) - num_queries = len(model_input.query_lens) - - # Get output tokens GPU tensor - sampled_token_ids = last_output.sampled_token_ids - assert sampled_token_ids is not None - - # Update attn_metadata - attn_metadata = model_input.attn_metadata - assert isinstance(attn_metadata, FlashAttentionMetadata) - - attn_metadata.advance_step(model_input, sampled_token_ids, - self.block_size, num_seqs, num_queries) - - # Update sampling_metadata - sampling_metadata = model_input.sampling_metadata - self._update_sampling_metadata(sampling_metadata, num_seqs, - num_queries) - - # Create new input - new_model_input = self._model_input_cls( - input_tokens=model_input.input_tokens, - input_positions=model_input.input_positions, - attn_metadata=attn_metadata, - seq_lens=attn_metadata.seq_lens, - query_lens=model_input.query_lens, - lora_mapping=model_input.lora_mapping, - lora_requests=model_input.lora_requests, - multi_modal_kwargs=model_input.multi_modal_kwargs, - sampling_metadata=model_input.sampling_metadata, - is_prompt=False, - ) - - # Ensure we skip CPU samples - assert new_model_input.sampling_metadata.skip_sampler_cpu_output is True - # We can reuse sampling tensors since every decode iteration is the same - new_model_input.sampling_metadata.reuse_sampling_tensors = True - - if debug_advance_input: - logger.debug("NEW INPUT: ") - logger.debug(" input_tokens = %s", new_model_input.input_tokens) - logger.debug(" input_positions = %s", - new_model_input.input_positions) - logger.debug(" seq_lens = %d", new_model_input.seq_lens) - logger.debug(" query_lens = %d", new_model_input.query_lens) - logger.debug(" attn_metadata:") - logger.debug(" seq_lens_tensor: %s", - attn_metadata.seq_lens_tensor) - logger.debug(" slot_mapping: %s", attn_metadata.slot_mapping) - logger.debug(" block_tables: %s", attn_metadata.block_tables) - - return new_model_input - - def supports_gpu_multi_step(self, execute_model_req: ExecuteModelRequest): - """Determines if draft_model_runner GPU multi-step can be used. - Currently required conditions are: - 1. Only decodes - 2. Only flash-attn - 3. No LORA - 4. 
No prompt_adapter_config - """ - if not allow_gpu_advance_step: - return False - - # We allow multi-step GPU only in decode mode - for seq_group in execute_model_req.seq_group_metadata_list: - if seq_group.is_prompt: - return False - - # TODO: Add support for other attn backends - if self.attn_backend.get_name() not in ("FLASH_ATTN", ): - return False - - # TODO: Add support for LORA - if self.lora_config: - return False - - # TODO: Add soft-tuning prompt adapter support - return not self.prompt_adapter_config - - def set_indices_of_seq_with_bonus_tokens(self, - indices_of_seq_with_bonus_tokens): - self.indices_of_seq_with_bonus_tokens = indices_of_seq_with_bonus_tokens - - @torch.inference_mode() - def execute_model( - self, - model_input: ModelRunnerInputBase, - kv_caches: List[torch.Tensor], - previous_hidden_states: Optional[torch.Tensor] = None, - intermediate_tensors: Optional[IntermediateTensors] = None, - num_steps: int = 1, - **kwargs, - ) -> Optional[List[SamplerOutput]]: - """Executes num_steps forward passes with advacement of input tensors - on the GPU. Look at supports_gpu_multi_step(..) for pre-conditions. - - Optimizations used: - 1. Input tensors are updated on the GPU directly - 2. Skips GPU=>CPU serialization of sampler outputs (we don't need - them since we do batch expansion later that uses GPU outputs) - 3. Reuses sampling tensors (since we run only decodes and they have - a repeating sampling logic) - """ - - # When num_steps == 1, we execute the fallback here for the GPU - # advance_step, which runs prepare_inputs on CPU and for each spec - # iteration invokes this function only once - # (Look at multi-step-worker code) - is_fallback = num_steps == 1 - if not is_fallback: - # Since we do not broadcast data inside execute_model anymore, - # we need to figure out the best way to support TP > 1 in this - # case, because we will at least need to broadcast the sampled - # tokens to all workers. - if not self.is_driver_worker: - raise ValueError("TP1DraftModelRunner only supports TP=1.") - - # Sanity - if self.lora_config is not None: - raise ValueError("TP1DraftModelRunner has no support for LORA") - if self.prompt_adapter_config is not None: - raise ValueError("TP1DraftModelRunner has no support for " - "prompt_adapter_config") - if model_input.inputs_embeds is not None: - raise ValueError("TP1DraftModelRunner has no support for " - "inputs_embeds") - if model_input.multi_modal_kwargs: - raise ValueError( - "TP1DraftModelRunner has no support for multi_modal_kwargs" - ) - else: - if self.lora_config: - assert model_input.lora_requests is not None - assert model_input.lora_mapping is not None - self.set_active_loras(model_input.lora_requests, - model_input.lora_mapping) - - if self.prompt_adapter_config: - assert model_input.prompt_adapter_requests is not None - assert model_input.prompt_adapter_mapping is not None - self.set_active_prompt_adapters( - model_input.prompt_adapter_requests, - model_input.prompt_adapter_mapping) - - self.attn_state.begin_forward(model_input) - - # Detect exec mode - assert model_input.attn_metadata is not None - use_cuda_graph = False - if model_input.attn_metadata.num_prefills > 0: - # In this case, execute_model(..) was called directly - if num_steps > 1: - raise ValueError( - "execute_model(..) of draft_model_runner can be called " - "directly only with a single-step prefill") - else: - # We can skip CPU samples for spec token generation. 
- # (We do allow CPU samples for num_steps == 1 to support the - # fallback case, where supports_gpu_multi_step(..) does not pass) - model_input.sampling_metadata.skip_sampler_cpu_output = ( - not is_fallback) - - # Attn attr defines if we use cuda graphs - use_cuda_graph = model_input.attn_metadata.use_cuda_graph - - # Get model - if use_cuda_graph: - if model_input.inputs_embeds is None: - graph_batch_size = model_input.input_tokens.shape[0] - model_executable = ( - self.graph_runners[model_input.virtual_engine][( - graph_batch_size, False)]) - else: - graph_batch_size = model_input.inputs_embeds.shape[0] - model_executable = ( - self.graph_runners[model_input.virtual_engine][( - graph_batch_size, True)]) - - if previous_hidden_states is not None: - hidden_states = torch.cat([ - previous_hidden_states, - torch.empty([ - graph_batch_size - previous_hidden_states.shape[0], - *previous_hidden_states.shape[1:] - ], - dtype=previous_hidden_states.dtype, - device=previous_hidden_states.device) - ]) - else: - hidden_states = None - else: - model_executable = self.model - hidden_states = previous_hidden_states - - outputs: List[SamplerOutput] = [] - for step in range(num_steps): - multi_modal_kwargs = model_input.multi_modal_kwargs or {} - - model_execute_kwargs = {"previous_hidden_states": hidden_states} \ - if previous_hidden_states is not None else {} - - compute_logits_kwargs = {} - # Run model - if hasattr(self.model.config, "num_nextn_predict_layers"): - # for DeepSeek MTP only to use the corresponding layer for - # each step - spec_step_idx = kwargs.get("spec_step_idx", step) - model_execute_kwargs["spec_step_idx"] = spec_step_idx - compute_logits_kwargs["spec_step_idx"] = spec_step_idx - with set_forward_context(model_input.attn_metadata, - self.vllm_config): - hidden_states = model_executable( - input_ids=model_input.input_tokens, - inputs_embeds=None, - positions=model_input.input_positions, - intermediate_tensors=intermediate_tensors, - **MultiModalKwargs.as_kwargs( - multi_modal_kwargs, - device=self.device, - ), - **model_execute_kwargs, - ) - - # Compute the logits. - logits = self.model.compute_logits(hidden_states, - model_input.sampling_metadata, - **compute_logits_kwargs) - if not self.is_driver_worker: - return [] - # Sample the next token. - output = self.model_runner.sampler( - logits=logits, - sampling_metadata=model_input.sampling_metadata, - ) - outputs.append(output) - - if self.return_hidden_states and is_fallback: - if use_cuda_graph: - indices = model_input.sampling_metadata\ - .selected_token_indices - output.hidden_states = hidden_states[:len(indices)] - else: - output.hidden_states = hidden_states - - if model_input.attn_metadata.num_prefills == 0 \ - and self.indices_of_seq_with_bonus_tokens is not None: - assert output.sampled_token_ids is not None - # output.sampled_token_ids should be of shape (num_seqs, 1) - nums_seqs, num_tokens_per_seq = output.sampled_token_ids.shape - assert num_tokens_per_seq == 1 - count = 0 - for i in range(nums_seqs): - bonus_seq_idx = self.indices_of_seq_with_bonus_tokens[ - count] - if i != bonus_seq_idx: - # The following might cause a cpu->gpu sync - # However, the performance impact is negligible as we - # benchmarked on H100. 
- output.sampled_token_ids[ - i, :] = model_input.input_tokens[bonus_seq_idx] - else: - count += 1 - - # Prepare inputs for the next step - if step != num_steps - 1: - model_input = self._gpu_advance_step(model_input, outputs[-1]) - - return outputs diff --git a/vllm/spec_decode/interfaces.py b/vllm/spec_decode/interfaces.py deleted file mode 100644 index 70ec1590e7a..00000000000 --- a/vllm/spec_decode/interfaces.py +++ /dev/null @@ -1,99 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from abc import ABC, abstractmethod -from dataclasses import dataclass -from typing import List, Optional, Set, Union - -import torch - -from vllm.sequence import ExecuteModelRequest, PromptLogprobs -from vllm.worker.worker_base import WorkerBase - - -@dataclass -class SpeculativeProposals: - """Datastructure used to represent proposal tokens from some proposer. It - also tracks how many speculative tokens each sequence has. - """ - - # Speculative proposal tokens. - proposal_token_ids: torch.Tensor - - # Probabilities of the proposal tokens according to the proposer. - proposal_probs: torch.Tensor - - # The valid length of each proposal; can be zero. - proposal_lens: torch.Tensor - - # A flag to mark that there's no available proposals - no_proposals: bool = False - - def __repr__(self): - return (f"SpeculativeProposals(" - f"proposal_token_ids={self.proposal_token_ids}, " - f"proposal_probs={self.proposal_probs.shape}, " - f"proposal_lens={self.proposal_lens})") - - -@dataclass -class SpeculativeScores: - """Datastructure used to represent the scores of speculative tokens - according to the scoring model. - """ - - # Probabilities of the speculative tokens according to the scoring model. - probs: torch.Tensor - - # Log-probabilities of the speculative tokens according to the scoring - # model. These values can be used to generate Logprob objects that are - # returned to the user. - logprobs: torch.Tensor - - # Token ids sampled from the scoring model. Used for speculative bonus - # tokens and also non-speculative normal decoding. - token_ids: torch.Tensor - - # Optional last hidden states from the scoring model. - hidden_states: Optional[torch.Tensor] = None - - # Scoring model may also return logprobs for prompt tokens - # for each request, when chunked prefill is enabled. - prompt_logprobs: Optional[List[PromptLogprobs]] = None - - def __repr__(self): - return (f"SpeculativeScores(" - f"probs={self.probs.shape}, " - f"token_ids={self.token_ids.shape})") - - -class SpeculativeProposer(ABC): - - @abstractmethod - def get_spec_proposals( - self, - execute_model_req: ExecuteModelRequest, - # If set, this contains all sequence IDs that were assigned - # bonus tokens in their last forward pass. 
- seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> SpeculativeProposals: - raise NotImplementedError - - -class SpeculativeScorer(ABC): - - def __init__(self, scorer_worker: WorkerBase, - device: Union[torch.device, str], vocab_size: int): - self._scorer_worker = scorer_worker - if isinstance(device, torch.device): - device = device.type - self._device = device - self._vocab_size = vocab_size - - @abstractmethod - def score_proposals( - self, - execute_model_req: ExecuteModelRequest, - proposals: SpeculativeProposals, - ) -> SpeculativeScores: - raise NotImplementedError diff --git a/vllm/spec_decode/medusa_worker.py b/vllm/spec_decode/medusa_worker.py deleted file mode 100644 index 82b5a79fa7c..00000000000 --- a/vllm/spec_decode/medusa_worker.py +++ /dev/null @@ -1,138 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import weakref -from typing import List, Optional, Set, Tuple - -import torch - -from vllm.model_executor import SamplingMetadata -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.sequence import ExecuteModelRequest, SequenceGroupMetadata -from vllm.spec_decode.interfaces import SpeculativeProposals -from vllm.spec_decode.proposer_worker_base import NonLLMProposerWorkerBase -from vllm.spec_decode.top1_proposer import Top1Proposer -from vllm.worker.worker_base import DelegateWorkerBase - - -class MedusaWorker(NonLLMProposerWorkerBase, DelegateWorkerBase): - """Worker for Medusa. - """ - - def __init__(self, *args, **kwargs): - DelegateWorkerBase.__init__(self, *args, **kwargs) - # Lazy initialization list. - self._proposer: Top1Proposer - - def init_device(self): - self.worker.init_device() - - self._proposer = Top1Proposer( - weakref.proxy(self), # type: ignore[arg-type] - self.device, - self.vocab_size, - max_proposal_len=self.max_model_len, - ) - - def set_include_gpu_probs_tensor(self): - pass - - def set_should_modify_greedy_probs_inplace(self): - pass - - @torch.inference_mode() - def sampler_output( - self, - execute_model_req: ExecuteModelRequest, - sample_len: int, - # Unused parameter. - seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> Tuple[List[SamplerOutput], bool]: - """Run the model forward pass to generate sample_len future tokens. - Returns the list of sampler output, one per layer, along with indicator - of whether torch tensor in sampler output need to be transposed in - latter sampler_output_to_torch logic. - - For medusa worker, this indicator shall be False. - """ - self._raise_if_unsupported(execute_model_req) - - seq_group_metadata_list = execute_model_req.seq_group_metadata_list - - seq_lens, query_lens = self._prepare_input_tensors( - seq_group_metadata_list) - - generators = self.model_runner.get_generators( - execute_model_req.finished_requests_ids) - sampling_metadata = SamplingMetadata.prepare( - seq_group_metadata_list, seq_lens, query_lens, self.device, - self.model_runner.pin_memory, generators) - - model_outputs = self.model_runner.model.generate_proposals( - previous_hidden_states=execute_model_req.previous_hidden_states. 
- hidden_states, - sampling_metadata=sampling_metadata) - - return model_outputs, False - - def _prepare_input_tensors( - self, - seq_group_metadata_list: Optional[List[SequenceGroupMetadata]], - ) -> Tuple[List[int], List[int]]: - if not seq_group_metadata_list: - return [], [] - - seq_lens: List[int] = [] - query_lens: List[int] = [] - - for seq_group_metadata in seq_group_metadata_list: - is_prompt = seq_group_metadata.is_prompt - - for seq_data in seq_group_metadata.seq_data.values(): - seq_data_len = seq_data.get_len() - if is_prompt: - context_len = seq_data.get_num_computed_tokens() - seq_len = min( - seq_data_len, - context_len + seq_group_metadata.token_chunk_size) - seq_lens.append(seq_len) - query_lens.append(seq_len - context_len) - else: - seq_lens.append(seq_data_len) - query_lens.append(1) - - return seq_lens, query_lens - - def get_spec_proposals( - self, - execute_model_req: ExecuteModelRequest, - seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> SpeculativeProposals: - """Produce speculations given an input batch of sequences. The number of - speculative tokens per sequence is determined by max_proposal_len. - """ - - return self._proposer.get_spec_proposals( - execute_model_req, seq_ids_with_bonus_token_in_last_step) - - def _raise_if_unsupported( - self, - execute_model_req: ExecuteModelRequest, - ) -> None: - """MedusaWorker does not yet implement support for cache swap - operations or beam search. - """ - if any([ - execute_model_req.blocks_to_swap_in, - execute_model_req.blocks_to_swap_out, - execute_model_req.blocks_to_copy - ]): - raise NotImplementedError( - "MedusaWorker does not support cache operations") - - if any( - len(seq_group_metadata.seq_data.keys()) != 1 - for seq_group_metadata in - execute_model_req.seq_group_metadata_list): - raise NotImplementedError( - "MedusaWorker does not support beam search.") diff --git a/vllm/spec_decode/metrics.py b/vllm/spec_decode/metrics.py deleted file mode 100644 index a4784cad962..00000000000 --- a/vllm/spec_decode/metrics.py +++ /dev/null @@ -1,213 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import time -from typing import Callable, Optional, Union - -import msgspec -import torch - -from vllm.model_executor.layers.spec_decode_base_sampler import ( - SpecDecodeBaseSampler) -from vllm.platforms import current_platform -from vllm.utils import is_pin_memory_available - - -class SpecDecodeWorkerMetrics( - msgspec.Struct, - omit_defaults=True, # type: ignore[call-arg] - array_like=True): # type: ignore[call-arg] - """Dataclass holding metrics emitted from the spec decode worker. - """ - - # The empirical acceptance rate of the proposal method on a per-token basis. - # This is useful for evaluating how well the proposal method aligns with the - # scoring method. - draft_acceptance_rate: float - - # The empirical efficiency, measured as the number of tokens emitted by the - # system divided by the number of tokens that could be emitted by the system - # if the proposal method were perfect. - system_efficiency: float - - # The number of speculative tokens produced by the proposal method. - draft_tokens: int - - # The number of tokens emitted by the entire system. - emitted_tokens: int - - # The number of tokens accepted by the scoring model and verification - # routine, e.g. Llama2-70B and lossless rejection sampling. 
- # - # NOTE: Any token accepted by the verification routine is considered - # accepted (regardless of if the speculative prefix is also accepted). The - # user will usually see less accepted tokens. This metric is helpful when - # evaluating alignment of the proposal method with the scoring model. - accepted_tokens: int - - # The number of speculative tokens per sequence. - num_spec_tokens: int - - -Timer = Callable[[], float] - - -class AsyncMetricsCollector: - """Class which copies rejection/typical-acceptance sampler metrics - from the device to CPU on a non-default Torch stream. - """ - - def __init__(self, - spec_decode_sampler: SpecDecodeBaseSampler, - timer: Optional[Timer] = None, - collect_interval_s: float = 5.0): - self.spec_decode_sampler = spec_decode_sampler - self._timer = time.time if timer is None else timer - - self._rank: Optional[int] = None - - # We don't have a device set yet. - self._copy_stream: Optional[torch.cuda.Stream] = None - - self._in_flight_copy: Optional[torch.cuda.Event] = None - - pin_memory = is_pin_memory_available() - self._aggregate_num_accepted_tokens = torch.tensor( - 0, dtype=torch.long, device="cpu", pin_memory=pin_memory) - self._aggregate_num_emitted_tokens = torch.tensor( - 0, dtype=torch.long, device="cpu", pin_memory=pin_memory) - self._aggregate_num_draft_tokens = 0 - - self._rejsample_metrics_collect_interval_s = collect_interval_s - self._last_metrics_collect_time = self._timer() - - def init_gpu_tensors(self, rank: int) -> None: - self._rank = rank - self._copy_stream = torch.cuda.Stream() - - def init_tensors(self, - rank: int, - device_type: Union[torch.device, str] = 'cuda') -> None: - self._rank = rank - if isinstance(device_type, torch.device): - device_type = device_type.type - stream = current_platform.Stream - if stream is not None: - self._copy_stream = stream() - - def maybe_collect_rejsample_metrics( - self, k: int) -> Optional[SpecDecodeWorkerMetrics]: - # Skip for any platform that doesn't have device Event - if current_platform.Event is None: - return None - - # If a copy was initiated in the previous call, collect and return. - if self._in_flight_copy is not None: - ready_event = self._in_flight_copy - self._in_flight_copy = None - return self._collect_rejsample_metrics(k, ready_event) - - # Otherwise, check if we should start a new copy. - if self._should_collect_rejsample_metrics(self._timer()): - assert self._in_flight_copy is None - self._in_flight_copy = self._copy_rejsample_metrics_async() - - return None - - def _should_collect_rejsample_metrics(self, now: float) -> bool: - """Return whether or not this iteration should print sampling - metrics. - """ - if self._rank != 0: - return False - - return now - self._last_metrics_collect_time >= self._rejsample_metrics_collect_interval_s # noqa: E501 - - def _copy_rejsample_metrics_async(self) -> torch.cuda.Event: - """Copy rejection/typical-acceptance sampling metrics - (number of accepted tokens, etc) to CPU asynchronously. - - Returns a device event recording when the copy is complete. - """ - assert self._copy_stream is not None - self._copy_stream.wait_stream(current_platform.current_stream()) - - with current_platform.stream(self._copy_stream): - self._aggregate_num_accepted_tokens.copy_( - self.spec_decode_sampler.num_accepted_tokens, - non_blocking=True) - self._aggregate_num_emitted_tokens.copy_( - self.spec_decode_sampler.num_emitted_tokens, non_blocking=True) - # Number of draft tokens is calculated on CPU, so no copy is - # required. 
- self._aggregate_num_draft_tokens = ( - self.spec_decode_sampler.num_draft_tokens) - - aggregate_metrics_ready = current_platform.Event() - aggregate_metrics_ready.record(self._copy_stream) - - return aggregate_metrics_ready - - def _collect_rejsample_metrics( - self, k: int, - ready_event: torch.cuda.Event) -> SpecDecodeWorkerMetrics: - """Create metrics object from statistics copied asynchronously. - - Args: - k: int. The number of speculative tokens; used to determine system - efficiency. - ready_event: torch.cuda.Event. The CUDA event recording when the - async GPU->CPU copy is complete. - """ - - ready_event.synchronize() - - # update time of last collection - self._last_metrics_collect_time = self._timer() - - accepted_tokens = self._aggregate_num_accepted_tokens.item() - emitted_tokens = self._aggregate_num_emitted_tokens.item() - draft_tokens = self._aggregate_num_draft_tokens - - max_num_emitted_tokens = self.get_max_num_emitted_tokens( - draft_tokens, k) - - if draft_tokens > 0: - draft_acceptance_rate = accepted_tokens / draft_tokens - else: - draft_acceptance_rate = float("nan") - - if max_num_emitted_tokens > 0: - system_efficiency = emitted_tokens / max_num_emitted_tokens - else: - system_efficiency = float("nan") - - return SpecDecodeWorkerMetrics( - num_spec_tokens=k, - draft_acceptance_rate=draft_acceptance_rate, - system_efficiency=system_efficiency, - accepted_tokens=accepted_tokens, - draft_tokens=draft_tokens, - emitted_tokens=emitted_tokens, - ) - - @staticmethod - def get_max_num_emitted_tokens(draft_tokens: int, k: int) -> int: - """Calculate the number of emitted tokens, assuming all tokens are - accepted. - - This is equal to the number of sequences that have been speculated on, - times (speculation len + 1). The +1 comes from the bonus token. - """ - # Determine the number of sequences that have been speculated on. Since - # the batch size can be variable, we divide by k. - assert draft_tokens % k == 0 - total_num_spec_seqs = draft_tokens // k - - # A single sequence may emit k accepted tokens and one bonus token in - # the best case. - num_emitted_per_seq_if_all_accepted = k + 1 - - # The max num of emitted tokens is the number of speculated sequences - # times the max emitted per seq. - return total_num_spec_seqs * num_emitted_per_seq_if_all_accepted diff --git a/vllm/spec_decode/mlp_speculator_worker.py b/vllm/spec_decode/mlp_speculator_worker.py deleted file mode 100644 index 8e8c05d2636..00000000000 --- a/vllm/spec_decode/mlp_speculator_worker.py +++ /dev/null @@ -1,94 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from typing import List, Optional, Set, Tuple - -import torch - -from vllm.model_executor import SamplingMetadata -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.sequence import ExecuteModelRequest, SequenceGroupMetadata -from vllm.spec_decode.multi_step_worker import MultiStepWorker -from vllm.spec_decode.proposer_worker_base import NonLLMProposerWorkerBase - - -class MLPSpeculatorWorker(NonLLMProposerWorkerBase, MultiStepWorker): - """Worker for MLPSpeculator models. - - Not currently compatible with LoRA or chunked prefill. - """ - - @torch.inference_mode() - def sampler_output( - self, - execute_model_req: ExecuteModelRequest, - sample_len: int, - # Unused parameter. MLPSpeculatorWorker does not use the KV Cache and - # therefore does not need this parameter. 
- seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> Tuple[List[SamplerOutput], bool]: - """Run the model forward pass to generate sample_len future tokens. - Returns the list of sampler output, one per layer, along with indicator - of whether torch tensor in sampler output need to be transposed in - latter sampler_output_to_torch logic. - - For mlp spec worker, this indicator shall be True. - """ - self._raise_if_unsupported(execute_model_req) - - seq_group_metadata_list = execute_model_req.seq_group_metadata_list - - (input_tokens, seq_lens, - query_lens) = self._prepare_input_tensors(seq_group_metadata_list) - - generators = self.model_runner.get_generators( - execute_model_req.finished_requests_ids) - sampling_metadata = SamplingMetadata.prepare( - seq_group_metadata_list, seq_lens, query_lens, self.device, - self.model_runner.pin_memory, generators) - - model_outputs = self.model_runner.model.generate_proposals( - input_ids=input_tokens, - previous_hidden_states=execute_model_req.previous_hidden_states. - hidden_states, - num_predict_tokens=sample_len, - sampling_metadata=sampling_metadata) - - assert len(model_outputs) == sample_len - - return model_outputs, True - - def _prepare_input_tensors( - self, - seq_group_metadata_list: Optional[List[SequenceGroupMetadata]], - ) -> Tuple[torch.Tensor, List[int], List[int]]: - if not seq_group_metadata_list: - return torch.empty(0, device=self.device), [], [] - - input_tokens: List[int] = [] - seq_lens: List[int] = [] - query_lens: List[int] = [] - - for seq_group_metadata in seq_group_metadata_list: - is_prompt = seq_group_metadata.is_prompt - - for seq_data in seq_group_metadata.seq_data.values(): - seq_data_len = seq_data.get_len() - if is_prompt: - context_len = seq_data.get_num_computed_tokens() - seq_len = min( - seq_data_len, - context_len + seq_group_metadata.token_chunk_size) - tokens = seq_data.get_token_ids()[context_len:seq_len] - seq_lens.append(seq_len) - input_tokens.extend(tokens) - query_lens.append(seq_len - context_len) - else: - seq_lens.append(seq_data_len) - input_tokens.append(seq_data.get_last_token_id()) - query_lens.append(1) - - input_tokens_tensor = torch.tensor(input_tokens, - dtype=torch.long, - device=self.device) - return input_tokens_tensor, seq_lens, query_lens diff --git a/vllm/spec_decode/mqa_scorer.py b/vllm/spec_decode/mqa_scorer.py deleted file mode 100644 index 18e7b055a67..00000000000 --- a/vllm/spec_decode/mqa_scorer.py +++ /dev/null @@ -1,160 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from vllm.sequence import (ExecuteModelRequest, SequenceData, - SequenceGroupMetadata, get_all_seq_ids) -from vllm.spec_decode.interfaces import (SpeculativeProposals, - SpeculativeScorer, SpeculativeScores) - -SeqId = int -TargetSeqId = int - - -class MQAScorer(SpeculativeScorer): - - def score_proposals( - self, - execute_model_req: ExecuteModelRequest, - proposals: SpeculativeProposals, - ) -> SpeculativeScores: - target_seq_group_metadata_list = [] - target_seq_id_start = max( - get_all_seq_ids(execute_model_req.seq_group_metadata_list)) + 1 - all_proposal_tokens = proposals.proposal_token_ids.tolist() - all_proposal_lengths = proposals.proposal_lens.tolist() - for i, seq_group_metadata in enumerate( - execute_model_req.seq_group_metadata_list): - if all_proposal_lengths[i] == 0: - # Keep prompt seqs untouched (keep computed_tokens for chunks). 
- target_seq_group_metadata_list.append(seq_group_metadata) - continue - - seq_data_dict = seq_group_metadata.seq_data - assert len(seq_data_dict) == 1 - seq_id = next(iter(seq_data_dict.keys())) - - seq_data: SequenceData = seq_data_dict[seq_id] - prompt_token_ids = seq_data.get_prompt_token_ids() - output_token_ids = seq_data.get_output_token_ids() - proposal_token_ids = all_proposal_tokens[ - i][:all_proposal_lengths[i]] - new_output_token_ids = [*output_token_ids, *proposal_token_ids] - - target_seq_id = target_seq_id_start + i - new_seq_data = SequenceData.from_seqs( - prompt_token_ids=prompt_token_ids, - output_token_ids=new_output_token_ids, - ) - new_seq_data.update_num_computed_tokens( - len(prompt_token_ids) + len(output_token_ids) - 1) - - # Ensure that the new decode sequence has at least one token. - assert len(output_token_ids) >= 1 - new_seq_data_dict = {target_seq_id: new_seq_data} - - new_seq_group_metadata = SequenceGroupMetadata( - request_id=seq_group_metadata.request_id, - is_prompt=seq_group_metadata.is_prompt, - seq_data=new_seq_data_dict, - sampling_params=seq_group_metadata.sampling_params, - block_tables={ - target_seq_id: seq_group_metadata.block_tables[seq_id], - }, - lora_request=None, - ) - target_seq_group_metadata_list.append(new_seq_group_metadata) - - target_sampler_output = self._scorer_worker.execute_model( - execute_model_req=execute_model_req.clone( - seq_group_metadata_list=target_seq_group_metadata_list)) - - target_sampler_output = target_sampler_output[0] - - k = execute_model_req.num_lookahead_slots - bs = len(execute_model_req.seq_group_metadata_list) - target_token_ids = target_sampler_output.sampled_token_ids - target_probs = target_sampler_output.sampled_token_probs - target_logprobs = target_sampler_output.logprobs - prompt_logprobs = None - - # If all requests have the same number of query tokens, we can avoid - # the for loop to build output for better performance. - if min(all_proposal_lengths) == k: - # Regular decodes only. - assert all(not sg.is_prompt - for sg in target_seq_group_metadata_list - if sg.is_prompt) - bs, _ = proposals.proposal_token_ids.shape - all_tokens = target_token_ids.reshape(bs, k + 1) - all_probs = target_probs.reshape(bs, k + 1, self._vocab_size) - all_logprobs = target_logprobs.reshape(bs, k + 1, self._vocab_size) - else: - # We either have decodes with different lens or prefill+decodes. - all_tokens = target_token_ids.new_full(size=(bs, k + 1), - fill_value=-1) - all_probs = target_probs.new_zeros(*all_tokens.shape, - self._vocab_size) - all_logprobs = target_logprobs.new_full(size=all_probs.shape, - fill_value=-float("inf")) - target_token_ids = target_token_ids.flatten() - - # When prompt logprobs is enabled, lens of returned tensors go from - # n_sampled (requests with do_sample=True) to n_prompt+n_prefills. - # We adjust stride accordingly to get the generated tokens and - # their probs, but pass on prompt_logprobs as is, since it may be - # that n_prompts >> K. - has_prompt_log = any((sg.sampling_params.prompt_logprobs - and sg.sampling_params.prompt_logprobs > 0) - for sg in target_seq_group_metadata_list) - # TODO (NickLucche) we should surface `disable_logprobs` as to not - # break abstraction to get its value. - if (not self._scorer_worker.model_runner.disable_logprobs\ - and has_prompt_log): - prompt_logprobs = [ - o.prompt_logprobs for o in target_sampler_output.outputs - ] - - # Split loop into prefill|decode for readability. 
- start_loc, i = 0, 0 - while i < len(target_seq_group_metadata_list - ) and target_seq_group_metadata_list[i].is_prompt: - seq_meta = target_seq_group_metadata_list[i] - end_loc = start_loc - if has_prompt_log: - end_loc += seq_meta.token_chunk_size - elif seq_meta.do_sample: - end_loc += 1 - - # Skip chunks with no output tokens. - if seq_meta.do_sample: - # Get sampled token (last position in chunk) and its prob. - all_tokens[i, 0] = target_token_ids[end_loc - 1] - all_probs[i, 0] = target_probs[end_loc - 1] - all_logprobs[i, 0] = target_logprobs[end_loc - 1] - - i += 1 - start_loc = end_loc - # Decodes. - while i < len(target_seq_group_metadata_list): - proposed_len, seq_meta = all_proposal_lengths[ - i], target_seq_group_metadata_list[i] - output_len = proposed_len + 1 - end_loc = start_loc + output_len - all_tokens[ - i, :output_len] = target_token_ids[start_loc:end_loc] - all_probs[i, :output_len] = target_probs[start_loc:end_loc] - all_logprobs[ - i, :output_len] = target_logprobs[start_loc:end_loc] - start_loc = end_loc - i += 1 - - hidden_states = None - if target_sampler_output.hidden_states is not None: - hidden_states = target_sampler_output.hidden_states.reshape( - bs, (k + 1), -1) - - return SpeculativeScores(probs=all_probs, - token_ids=all_tokens, - logprobs=all_logprobs, - hidden_states=hidden_states, - prompt_logprobs=prompt_logprobs) diff --git a/vllm/spec_decode/multi_step_worker.py b/vllm/spec_decode/multi_step_worker.py deleted file mode 100644 index 4a9bbe44d89..00000000000 --- a/vllm/spec_decode/multi_step_worker.py +++ /dev/null @@ -1,423 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import copy -import weakref -from typing import Dict, List, Set, Tuple - -import torch - -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.model_executor.model_loader.weight_utils import default_weight_loader -from vllm.platforms import current_platform -from vllm.sequence import (ExecuteModelRequest, HiddenStates, SequenceData, - SequenceGroupMetadata) - -if current_platform.is_cuda_alike(): - from vllm.spec_decode.draft_model_runner import TP1DraftModelRunner - -from vllm.spec_decode.interfaces import (SpeculativeProposals, - SpeculativeProposer) -from vllm.spec_decode.proposer_worker_base import ProposerWorkerBase -from vllm.spec_decode.top1_proposer import Top1Proposer -from vllm.worker.worker_base import DelegateWorkerBase - - -class MultiStepWorker(ProposerWorkerBase, DelegateWorkerBase): - """The MultiStepWorker is equivalent to a Worker except that it allows - multiple forward passes in a single call, assuming the scheduler has - allocated enough space to store the additional KV. This reduces overhead - by invoking the scheduler less. - - The MultiStepWorker does not support cache swap operations, or beam search. - Cache swap operations do not require large modifications. On the other hand, - beam search requires memory allocations during sequence forks and thus - requires more thought for MultiStepWorker support. - """ - - def __init__(self, *args, **kwargs): - DelegateWorkerBase.__init__(self, *args, **kwargs) - # Lazy initialization list. 
- self._proposer: SpeculativeProposer - - def init_device(self) -> None: - self.worker.init_device() - self._proposer = Top1Proposer( - weakref.proxy(self), # type: ignore[arg-type] - self.device, - self.vocab_size, - max_proposal_len=self.max_model_len, - ) - - def set_include_gpu_probs_tensor(self) -> None: - # Need include_gpu_probs_tensor for MultiStepWorker - self.model_runner.sampler.include_gpu_probs_tensor = True - if hasattr(self.model_runner.model, "sampler"): - (self.model_runner.model.sampler.include_gpu_probs_tensor) = True - - def set_should_modify_greedy_probs_inplace(self) -> None: - self.model_runner.sampler.should_modify_greedy_probs_inplace = True - if hasattr(self.model_runner.model, "sampler"): - (self.model_runner.model.sampler.should_modify_greedy_probs_inplace - ) = True - - @torch.inference_mode() - def sampler_output( - self, - execute_model_req: ExecuteModelRequest, - sample_len: int, - seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> Tuple[List[SamplerOutput], bool]: - """Run the model forward pass sample_len times. Returns the list of - sampler output, one per model forward pass, along with indicator of - whether torch tensor in sampler output need to be transposed in latter - sampler_output_to_torch logic. - - For multi step worker, this indicator shall be True. - """ - self._raise_if_unsupported(execute_model_req) - # Expand the batch for sequences with a bonus token. - # Perform a forward pass on the expanded batch and filter the - # response to retain only the original sequences' responses. - expanded_request, indices_of_seq_with_bonus_tokens =\ - self._expand_execute_model_request( - execute_model_req, seq_ids_with_bonus_token_in_last_step) - - # Run model sample_len times. - model_outputs: List[SamplerOutput] = [] - if current_platform.is_cuda_alike() and isinstance( - self.model_runner, TP1DraftModelRunner - ) and self.model_runner.supports_gpu_multi_step(expanded_request): - # Here we run the draft_model_runner with multi-step prepare - # on the GPU directly - expanded_request.num_steps = sample_len - self.model_runner.set_indices_of_seq_with_bonus_tokens( - indices_of_seq_with_bonus_tokens) - model_outputs = self.execute_model( - execute_model_req=expanded_request) - else: - # Here we run multi-step directly, with every step prepared - # on the CPU. - # TODO: Remove this branch once DraftModelRunner supports TP>1 - # and other restrictions that are part of DraftModelRunner's - # supports_gpu_multi_step(..) 
- if expanded_request.previous_hidden_states is not None: - self.worker.model_runner.return_hidden_states = True - for _ in range(sample_len): - model_output: List[SamplerOutput] = self.worker.execute_model( - execute_model_req=expanded_request) - assert (len(model_output) == 1 - ), "composing multistep workers not supported" - model_output = model_output[0] - self._maybe_update_previous_hidden_states( - model_output, expanded_request) - - self._append_new_tokens( - model_output, expanded_request.seq_group_metadata_list, - indices_of_seq_with_bonus_tokens) - model_outputs.append(model_output) - - # move indices to device to avoid stream sync - indices_of_seq_with_bonus_tokens = torch.tensor( - indices_of_seq_with_bonus_tokens, device=self.device) - filtered_model_outputs = self._filter_model_output( - model_outputs, indices_of_seq_with_bonus_tokens) - return filtered_model_outputs, True - - @staticmethod - def _maybe_update_previous_hidden_states( - model_output: SamplerOutput, - expanded_request: ExecuteModelRequest) -> None: - """ - Updates the previous hidden states in an expanded request - in-place with the hidden states from the model output. - """ - if expanded_request.previous_hidden_states is not None: - expanded_request.previous_hidden_states = HiddenStates( - model_output.hidden_states, - expanded_request.seq_group_metadata_list) - - @staticmethod - def _expand_execute_model_request( - execute_model_req: ExecuteModelRequest, - seq_with_bonus_token_in_last_step: set, - ) -> Tuple[ExecuteModelRequest, List[int]]: - """ - Expands the execute model request based on sequences with bonus - tokens. - - For each sequence with a bonus token, this method creates a new - sequence without the bonus token and adds it to the execute model - request. The original sequence groups are also retained. The indices - of the original sequence groups are returned for further processing. - - Args: - execute_model_req (ExecuteModelRequest): The original execute - model request. - seq_with_bonus_token_in_last_step (set): Set of sequence IDs that - contain bonus tokens. - - Returns: - Tuple[ExecuteModelRequest, List[int]]: The updated execute model - request with expanded sequences and a list of indices corresponding - to the original sequence groups. - """ - updated_seq_group_metadata_list: List[SequenceGroupMetadata] = [] - updated_execute_model_req = execute_model_req.clone( - updated_seq_group_metadata_list) - indices_of_original_sequence_groups = [] - for seq_group in execute_model_req.seq_group_metadata_list: - seq_group_has_bonus_tokens = False - for seq_id, _ in seq_group.seq_data.items(): - # Identify sequences with bonus tokens in the sequence group. - if seq_id in seq_with_bonus_token_in_last_step: - seq_group_has_bonus_tokens = True - break - if seq_group_has_bonus_tokens: - #Create new sequences without the last bonus token. These new - # sequence have the same sequence id as the original sequence. - # We create a new sequence group and add them there. - updated_seq_group_without_bonus_token = \ - MultiStepWorker._copy_seq_metadata_excluding_last_token( - seq_group, seq_with_bonus_token_in_last_step) - updated_seq_group_metadata_list.append( - updated_seq_group_without_bonus_token) - # Add the original sequence group. - updated_seq_group_metadata_list.append( - MultiStepWorker._shallow_copy_seq_group_metadata(seq_group)) - # Record the index of the original sequence group. 
- indices_of_original_sequence_groups.append( - len(updated_seq_group_metadata_list) - 1) - - updated_execute_model_req.seq_group_metadata_list =\ - updated_seq_group_metadata_list - - if isinstance(updated_execute_model_req.previous_hidden_states, - HiddenStates): - updated_execute_model_req.previous_hidden_states\ - .expand_with_bonus_tokens(seq_with_bonus_token_in_last_step) - - return updated_execute_model_req, indices_of_original_sequence_groups - - @staticmethod - def _filter_model_output( - expanded_batch_outputs: List[SamplerOutput], - output_indices_to_retain: torch.Tensor) -> List[SamplerOutput]: - """ - Filters the model output to include only the specified sequence - outputs. This method contracts the expanded batch output from the - model to retain the outputs of only those sequences indicated by the - provided indices. - - Args: - expanded_batch_output (List[SamplerOutput]): The expanded output - batch from the model. - output_indices_to_retain (torch.Tensor): Indices of the model - outputs to retain. - - Returns: - List[SamplerOutput]: A list containing the filtered model - outputs for the specified indices. - """ - return [ - SamplerOutput( - outputs=[ - expanded_batch_output.outputs[i] - for i in output_indices_to_retain - ] if len(expanded_batch_output.outputs) > 0 else [], - sampled_token_probs=( - expanded_batch_output. - sampled_token_probs[output_indices_to_retain] - if expanded_batch_output.sampled_token_probs is not None - else None), - logprobs=( - expanded_batch_output.logprobs[output_indices_to_retain] - if expanded_batch_output.logprobs is not None else None), - sampled_token_ids=(expanded_batch_output. - sampled_token_ids[output_indices_to_retain] - if expanded_batch_output.sampled_token_ids - is not None else None)) - for expanded_batch_output in expanded_batch_outputs - ] - - def get_spec_proposals( - self, - execute_model_req: ExecuteModelRequest, - seq_ids_with_bonus_token_in_last_step: set, - ) -> SpeculativeProposals: - """Produce speculations given an input batch of sequences. The number of - speculative tokens per sequence is determined by max_proposal_len. - """ - return self._proposer.get_spec_proposals( - execute_model_req, seq_ids_with_bonus_token_in_last_step) - - @staticmethod - def _append_new_tokens( - model_output: List[SamplerOutput], - seq_group_metadata_list: List[SequenceGroupMetadata], - indices_of_seq_with_bonus_tokens: List[int]) -> None: - """Given model output from a single run, append the tokens to the - sequences. This is normally done outside of the worker, but it is - required if the worker is to perform multiple forward passes. - """ - count = 0 - for index, (seq_group_metadata, sequence_group_outputs) in enumerate( - zip(seq_group_metadata_list, model_output)): - seq_group_metadata.is_prompt = False - - for seq_output in sequence_group_outputs.samples: - # NOTE: Beam search is not supported, so we can assume that - # parent_seq_id == seq_id. 
- seq = seq_group_metadata.seq_data[seq_output.parent_seq_id] - - token_id = seq_output.output_token - token_logprob = seq_output.logprobs[token_id] - # Determine the actual token ID to be generated, - # considering bonus tokens - if index != indices_of_seq_with_bonus_tokens[count]: - bonus_seq_metadata = seq_group_metadata_list[ - indices_of_seq_with_bonus_tokens[count]] - _, bonus_token_seq_data = next( - iter(bonus_seq_metadata.seq_data.items())) - token_id = bonus_token_seq_data.output_token_ids[-1] - else: - count += 1 - - seq.append_token_id(token_id, token_logprob.logprob, - seq_output.output_embed) - seq.update_num_computed_tokens(1) - - @staticmethod - def _shallow_copy_seq_group_metadata( - seq_group_metadata: SequenceGroupMetadata, ) -> SequenceGroupMetadata: - """Copy input data structures to remove side-effects when input data - structures are shared with other modules. - - Helpful when the vLLM scheduler runs in the same process as the worker. - The alternative is deep-copying (or other form of deep copy); this has - performance downsides. - """ - # Shallow-copy the SequenceGroupMetadata. This allows us to - # append tokens and change is_prompt without external side-effects. - # We must shallow-copy seq_group_metadata as is_prompt could change. - new_seq_group_metadata = copy.copy(seq_group_metadata) - - # We must shallow-copy seq_data as we will append token ids - new_seq_data: Dict[int, SequenceData] = {} - for seq_id, old_seq_data in seq_group_metadata.seq_data.items(): - new_seq_data[seq_id] = copy.copy(old_seq_data) - new_seq_data[seq_id].output_token_ids =\ - old_seq_data.output_token_ids[:] - - new_seq_group_metadata.seq_data = new_seq_data - return new_seq_group_metadata - - @staticmethod - def _copy_seq_metadata_excluding_last_token( - seq_group_metadata: SequenceGroupMetadata, - seq_ids_to_copy: Set[int], - ) -> SequenceGroupMetadata: - """ - Creates a shallow copy of the given SequenceGroupMetadata, retaining - only the sequence IDs specified in seq_ids_to_copy. For each of these - sequence IDs, all output_token_ids except the last one are copied. - Sequence IDs not in seq_ids_to_copy are excluded from the copy. - - Parameters: - seq_group_metadata (SequenceGroupMetadata): The original sequence - group metadata. - seq_ids_to_copy (Set[int]): The set of sequence IDs to include in the - copy. - - Returns: - SequenceGroupMetadata: A shallow copy of the sequence group metadata - with the specified modifications. - """ - # Shallow-copy the SequenceGroupMetadata. - new_seq_group_metadata = copy.copy(seq_group_metadata) - # Shallow-copy seq_data and modify the output_token_ids. - new_seq_data: Dict[int, SequenceData] = {} - for seq_id, old_seq_data in seq_group_metadata.seq_data.items(): - if (seq_id in seq_ids_to_copy): - new_seq_data[seq_id] = copy.copy(old_seq_data) - # Copy all the output token ids except the last. - # Also reduce num_computed_tokens by 1 since we are not - # including the last output token. - # NOTE: num_computed_tokens is not directly used by the - # speculative decoding workers, as it is only relevant for - # chunked prefill, which is disabled for speculative decoding. - # However, to maintain consistency in num_computed_tokens, - # we update it here. 
- new_seq_data[seq_id].output_token_ids =\ - old_seq_data.output_token_ids[:-1] - new_seq_data[seq_id].update_num_computed_tokens(-1) - new_seq_group_metadata.seq_data = new_seq_data - return new_seq_group_metadata - - def _assert_enough_kv_space( - self, seq_group_metadata_list: List[SequenceGroupMetadata], - num_steps: int) -> None: - """Assert there are enough physical blocks per sequence to store the - current KV plus additional KV from num_steps tokens. - """ - assert self.model_runner.block_size is not None - for seq_group_metadata in seq_group_metadata_list: - # Only one seq_id is guaranteed because there is no beam search. - seq_id = list(seq_group_metadata.seq_data.keys())[0] - seq = seq_group_metadata.seq_data[seq_id] - - # After num_steps, the seq len will be the current seq len - # plus one token per step. - final_seq_len = seq.get_len() + num_steps - - # We will have final_seq_len - 1 KV because vLLM saves KV for a - # token in the iteration after the token was generated. - required_num_kv_slots = final_seq_len - 1 - - # The allocated number of kv slots is the number of allocated blocks - # times the number of slots of block. - number_physical_blocks = len( - seq_group_metadata.block_tables[seq_id]) - allocated_kv_slots = (number_physical_blocks * - self.model_runner.block_size) - - if required_num_kv_slots > allocated_kv_slots: - request_id = seq_group_metadata.request_id - raise ValueError( - "The worker attempted to run " - f"{num_steps} times but found insufficient KV space for " - f"{request_id=} {seq_id=}. ({allocated_kv_slots=} " - f"{required_num_kv_slots=}).") - - def _raise_if_unsupported( - self, - execute_model_req: ExecuteModelRequest, - ) -> None: - """MultiStepWorker does not yet implement support for cache swap - operations or beam search. - """ - if any([ - execute_model_req.blocks_to_swap_in, - execute_model_req.blocks_to_swap_out, - execute_model_req.blocks_to_copy - ]): - raise NotImplementedError( - "MultiStepWorker does not support cache operations") - - if any( - len(seq_group_metadata.seq_data.keys()) != 1 - for seq_group_metadata in - execute_model_req.seq_group_metadata_list): - raise NotImplementedError( - "MultiStepWorker does not support beam search.") - - def maybe_load_lm_head_weight( - self, - lm_head_weight: torch.Tensor, - ) -> None: - weight_loader = getattr( - self.worker.model_runner.model_runner.model.lm_head.weight, - "weight_loader", default_weight_loader) - weight_loader( - self.worker.model_runner.model_runner.model.lm_head.weight, - lm_head_weight) diff --git a/vllm/spec_decode/ngram_worker.py b/vllm/spec_decode/ngram_worker.py deleted file mode 100644 index 7a1a0e56dc0..00000000000 --- a/vllm/spec_decode/ngram_worker.py +++ /dev/null @@ -1,196 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import weakref -from typing import List, Optional, Set, Tuple - -import torch -import torch.nn as nn - -from vllm.config import VllmConfig -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.sequence import ExecuteModelRequest -from vllm.spec_decode.interfaces import SpeculativeProposals -from vllm.spec_decode.proposer_worker_base import NonLLMProposerWorkerBase -from vllm.spec_decode.top1_proposer import Top1Proposer - - -class _DummyModel(nn.Module): - pass - - -class NGramWorker(NonLLMProposerWorkerBase): - """NGramWorker provides a light drafter without need for model. 
- - Current NGramWorker only implements prompt lookup decoding, - and in future we may also do RAG type drafter and other scenarios - which don't rely on LLM model to give proposals. - """ - - def __init__( - self, - vllm_config: VllmConfig, - local_rank: int, - device_type: str = "cuda", - **kwargs, - ): - super().__init__(vllm_config) - - # Get local_rank/vocab_size from kwargs attribute - self.local_rank = local_rank - self.device_type = device_type - - # Lazy initialization list. - self._proposer: Top1Proposer - - def set_ngram_window_size(self, ngram_prompt_lookup_min: int, - ngram_prompt_lookup_max: int): - # Search valid candidate window between - # ngram_prompt_lookup_min/ngram_prompt_lookup_max - self.ngram_prompt_lookup_max = ngram_prompt_lookup_max - self.ngram_prompt_lookup_min = ngram_prompt_lookup_min - - def init_device(self): - self.device = torch.device(f"{self.device_type}:{self.local_rank}") - - # Current NGramWorker only supports Top1Proposer - self._proposer = Top1Proposer( - weakref.proxy(self), # type: ignore[arg-type] - device=self.device, - vocab_size=self.vocab_size, - ) - - def load_model(self) -> None: - pass # Dummy - - def get_model(self) -> nn.Module: - return _DummyModel() - - def sampler_output( - self, - execute_model_req: ExecuteModelRequest, - sample_len: int, - # Unused parameter. NGramWorker does not use the KV Cache and - # therefore does not need this parameter. - seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> Tuple[Optional[List[Optional[SamplerOutput]]], bool]: - """NGram match algo to pick proposal candidate. Returns the list of - sampler output, one per SequenceGroupMetadata. - - For ngram worker, we already done needed transposed internal, so the - indicator pass to sampler_output_to_torch shall be False. - """ - self._raise_if_unsupported(execute_model_req) - - has_spec_out = False - token_id_list: List[Optional[torch.Tensor]] = [] - token_prob_list: List[Optional[torch.Tensor]] = [] - for idx, seq_group_metadata in enumerate( - execute_model_req.seq_group_metadata_list): - seq_data = next(iter(seq_group_metadata.seq_data.values())) - - seq_len = seq_data.get_len() - # When seq_len is less than 3072 (3K), we use CPU to perform - # the ngram match. Otherwise, we use the device specified in - # the model config (normally GPU). 3072 is a rough threshold - # based on profiling on H100, and it can be adjusted based - # on the actual performance on different hardware. - cur_device = "cpu" if seq_len < 3072 else self.device - input_ids = torch.as_tensor(seq_data.get_token_ids(), - dtype=torch.long, - device=cur_device) - input_length = seq_data.get_len() - - for ngram_size in range( - min(self.ngram_prompt_lookup_max, input_length - 1), - self.ngram_prompt_lookup_min - 1, - -1, - ): - ngram_tensor = input_ids[-ngram_size:] - if ngram_size == 1: - # Do not match itself and do not use unfold and all - matches = (input_ids[:-1] == ngram_tensor) - else: - windows = input_ids.unfold(dimension=0, - size=ngram_size, - step=1) - # Do not match itself - matches = (windows[:-1] == ngram_tensor).all(dim=-1) - - # first_match includes "values" (bool), indicating whether - # the match is found, and "indices", indicating the index - # of the first match. 
- first_match = matches.max(dim=-1) - if first_match.values.item(): - proposal_start_idx = first_match.indices.add_(ngram_size) - spec_indices = ( - proposal_start_idx).repeat(sample_len) + torch.arange( - sample_len, device=cur_device) - spec_indices.clamp_(max=input_ids.shape[-1] - 1) - res = input_ids.gather(dim=-1, - index=spec_indices).to(self.device) - token_id_list.append(res) - token_prob_list.append( - torch.nn.functional.one_hot( - res, - num_classes=self.vocab_size).to(torch.float32)) - has_spec_out = True - break - else: - token_id_list.append(None) - token_prob_list.append(None) - - if not has_spec_out: - return None, False - - outputs: List[Optional[SamplerOutput]] = [] - for idx in range(len(execute_model_req.seq_group_metadata_list)): - if token_id_list[idx] is None: - outputs.append(None) - else: - outputs.append( - SamplerOutput( - outputs=None, - sampled_token_probs=token_prob_list[idx], - logprobs=torch.zeros((sample_len, self.vocab_size), - dtype=torch.float32, - device=self.device), - sampled_token_ids=token_id_list[idx], - )) - - return outputs, False - - def get_spec_proposals( - self, - execute_model_req: ExecuteModelRequest, - # Unused parameter. NGramWorker does not use the KV Cache and - # therefore does not need this parameter. - seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> SpeculativeProposals: - """Produce speculations given an input batch of sequences. The number of - speculative tokens per sequence is determined by max_proposal_len. - """ - return self._proposer.get_spec_proposals( - execute_model_req, seq_ids_with_bonus_token_in_last_step) - - def _raise_if_unsupported( - self, - execute_model_req: ExecuteModelRequest, - ) -> None: - """NGramWorker does not yet implement support for cache swap - operations or beam search. - """ - if any([ - execute_model_req.blocks_to_swap_in, - execute_model_req.blocks_to_swap_out, - execute_model_req.blocks_to_copy - ]): - raise NotImplementedError( - "NGramWorker does not support cache operations") - - if any( - len(seq_group_metadata.seq_data.keys()) != 1 - for seq_group_metadata in - execute_model_req.seq_group_metadata_list): - raise NotImplementedError( - "NGramWorker does not support beam search.") diff --git a/vllm/spec_decode/proposer_worker_base.py b/vllm/spec_decode/proposer_worker_base.py deleted file mode 100644 index fb44275aa93..00000000000 --- a/vllm/spec_decode/proposer_worker_base.py +++ /dev/null @@ -1,59 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from abc import ABC, abstractmethod -from typing import List, Optional, Set, Tuple - -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.sequence import ExecuteModelRequest -from vllm.spec_decode.interfaces import SpeculativeProposer -from vllm.worker.worker_base import LoRANotSupportedWorkerBase - - -class ProposerWorkerBase(LoRANotSupportedWorkerBase, SpeculativeProposer): - """Interface for proposer workers""" - - @abstractmethod - def sampler_output( - self, - execute_model_req: ExecuteModelRequest, - sample_len: int, - # A set containing all sequence IDs that were assigned bonus tokens - # in their last forward pass. This set is used to backfill the KV cache - # with the key-value pairs of the penultimate token in the sequences. - # This parameter is only used by the MultiStepWorker, which relies on - # the KV cache for token generation. It is not used by workers that - # do not utilize the KV cache. 
- seq_ids_with_bonus_token_in_last_step: Set[int] - ) -> Tuple[Optional[List[SamplerOutput]], bool]: - raise NotImplementedError - - def set_include_gpu_probs_tensor(self) -> None: - """Implementation optional""" - pass - - def set_should_modify_greedy_probs_inplace(self) -> None: - """Implementation optional""" - pass - - -class NonLLMProposerWorkerBase(ProposerWorkerBase, ABC): - """Proposer worker which does not use a model with kvcache""" - - def execute_model( - self, - execute_model_req: Optional[ExecuteModelRequest] = None - ) -> List[SamplerOutput]: - """get_spec_proposals is used to get the proposals""" - return [] - - def determine_num_available_blocks(self) -> Tuple[int, int]: - """This is never called on the proposer, only the target model""" - raise NotImplementedError - - def initialize_cache(self, num_gpu_blocks: int, - num_cpu_blocks: int) -> None: - pass - - def get_cache_block_size_bytes(self) -> int: - return 0 diff --git a/vllm/spec_decode/smaller_tp_proposer_worker.py b/vllm/spec_decode/smaller_tp_proposer_worker.py deleted file mode 100644 index 91256cab6e7..00000000000 --- a/vllm/spec_decode/smaller_tp_proposer_worker.py +++ /dev/null @@ -1,196 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from typing import List, Optional, Set, Tuple - -import torch -import torch.nn as nn - -from vllm.distributed.parallel_state import (get_tp_group, - init_model_parallel_group, - patch_tensor_parallel_group) -from vllm.logger import init_logger -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.model_executor.model_loader.weight_utils import default_weight_loader -from vllm.sequence import ExecuteModelRequest -from vllm.spec_decode.interfaces import SpeculativeProposals -from vllm.spec_decode.multi_step_worker import MultiStepWorker -from vllm.spec_decode.proposer_worker_base import ProposerWorkerBase - -logger = init_logger(__name__) - - -class _DummyModel(nn.Module): - pass - - -class SmallerTpProposerWorker(ProposerWorkerBase): - """Class which allows a speculative draft model to run with smaller tensor - parallel degree than target model. - This reduces the communication overhead of small draft models. - - To implement this feature, this class differs behavior based on is_dummy - flag, where dummy means worker that does not participate draft generation. - Participating workers use a smaller tp group by patching vLLM's tensor - parallel group temporarily during forward passes of draft models. - """ - - @classmethod - def maybe_wrap_worker(cls, worker, draft_tensor_parallel_size: int, - target_tensor_parallel_size: int): - """Wrap the worker in a SmallerTpProposerWorker if necessary. - """ - if draft_tensor_parallel_size == target_tensor_parallel_size: - return worker - - # gpu ranks that will generate draft tokens together - draft_ranks = list(range(draft_tensor_parallel_size)) - - logger.info("Wrapping {%s} in {%s}", type(worker), cls) - return cls(worker, draft_ranks) - - def __init__(self, worker: MultiStepWorker, draft_ranks: List[int]): - """Create a SmallerTpProposerWorker. 
- - Args: - worker (~vllm.spec_decode.multi_step_worker.MultiStepWorker): an - actual worker wrapped with this class - draft_ranks (List[int]): if this value is given, only the GPU ranks - written in this value participate in draft generation - """ - self._worker = worker - self._draft_ranks = draft_ranks - - # init during init_device - self._is_dummy = False - self._tp_group = None - - def _patch_tensor_parallel_group(self): - """Temporarily patch the global tp group state with its own tp group - state. - """ - return patch_tensor_parallel_group(self._tp_group) - - def init_device(self) -> None: - self._is_dummy = get_tp_group().rank not in self._draft_ranks - - # dummy workers do nothing - if self._is_dummy: - return - - # creates tp process group containing only a subset of gpu ranks - local_rank = get_tp_group().local_rank - tp_backend = torch.distributed.get_backend(get_tp_group().device_group) - self._tp_group = init_model_parallel_group([self._draft_ranks], - local_rank, tp_backend) - - with self._patch_tensor_parallel_group(): - self._worker.init_device() - - def set_include_gpu_probs_tensor(self) -> None: - if self._is_dummy: - return - - # Need include_gpu_probs_tensor for multi_step_worker - self._worker.set_include_gpu_probs_tensor() - - def set_should_modify_greedy_probs_inplace(self) -> None: - if self._is_dummy: - return - - self._worker.set_should_modify_greedy_probs_inplace() - - def load_model(self) -> None: - if self._is_dummy: - return - - with self._patch_tensor_parallel_group(): - self._worker.load_model() - - def determine_num_available_blocks(self) -> Tuple[int, int]: - if self._is_dummy: - # this case is not used now - return -1, -1 - - with self._patch_tensor_parallel_group(): - return self._worker.determine_num_available_blocks() - - def initialize_cache(self, num_gpu_blocks: int, - num_cpu_blocks: int) -> None: - if self._is_dummy: - return - - with self._patch_tensor_parallel_group(): - self._worker.initialize_cache(num_gpu_blocks, num_cpu_blocks) - - def sampler_output( - self, - execute_model_req: ExecuteModelRequest, - sample_len: int, - seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> Tuple[List[SamplerOutput], bool]: - # Do not check _is_dummy, as it's always called by get_spec_proposals - return self._worker.sampler_output( - execute_model_req, sample_len, - seq_ids_with_bonus_token_in_last_step) - - def get_spec_proposals( - self, - execute_model_req: ExecuteModelRequest, - seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> SpeculativeProposals: - """Produce speculations given an input batch of sequences. The number of - speculative tokens per sequence is determined by max_proposal_len. 
- """ - if self._is_dummy: - return SpeculativeProposals(None, None, None) - - with self._patch_tensor_parallel_group(): - return self._worker.get_spec_proposals( - execute_model_req, seq_ids_with_bonus_token_in_last_step) - - def get_model(self) -> nn.Module: - if self._is_dummy: - return _DummyModel() - - with self._patch_tensor_parallel_group(): - return self._worker.get_model() - - def execute_model( - self, - execute_model_req: Optional[ExecuteModelRequest] = None - ) -> List[SamplerOutput]: - if self._is_dummy: - return [] - - with self._patch_tensor_parallel_group(): - return self._worker.execute_model(execute_model_req) - - def get_cache_block_size_bytes(self) -> int: - if self._is_dummy: - # by returning zero, target worker can use the entire kv cache space - return 0 - - return self._worker.get_cache_block_size_bytes() - - @property - def vocab_size(self) -> int: - return self._worker.vocab_size - - def maybe_load_lm_head_weight( - self, - lm_head_weight: torch.Tensor, - ) -> None: - if self._is_dummy: - return - - with self._patch_tensor_parallel_group(): - weight_loader = getattr( - self._worker.worker.model_runner.model_runner.model.\ - lm_head.weight, - "weight_loader", - default_weight_loader) - weight_loader( - self._worker.worker.model_runner.model_runner.model.\ - lm_head.weight, - lm_head_weight) diff --git a/vllm/spec_decode/spec_decode_worker.py b/vllm/spec_decode/spec_decode_worker.py deleted file mode 100644 index 7dda1cbfe23..00000000000 --- a/vllm/spec_decode/spec_decode_worker.py +++ /dev/null @@ -1,1326 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import copy -from collections import defaultdict -from functools import cached_property -from typing import Any, Dict, List, Optional, Set, Tuple, Type - -import torch -import torch.nn as nn - -from vllm.config import ParallelConfig, SpeculativeConfig, VllmConfig -from vllm.distributed.communication_op import (broadcast_tensor_dict, - get_tp_group, - tensor_model_parallel_gather) -from vllm.distributed.parallel_state import model_parallel_is_initialized -from vllm.logger import init_logger -from vllm.model_executor.layers.rejection_sampler import RejectionSampler -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.model_executor.layers.spec_decode_base_sampler import ( - SpecDecodeBaseSampler, SpecDecodeStochasticBaseSampler) -from vllm.model_executor.layers.typical_acceptance_sampler import ( - TypicalAcceptanceSampler) -from vllm.platforms import current_platform -from vllm.sequence import (VLLM_INVALID_TOKEN_ID, - CompletionSequenceGroupOutput, ExecuteModelRequest, - HiddenStates, SequenceGroupMetadata, - get_all_seq_ids_and_request_ids) -from vllm.spec_decode.batch_expansion import BatchExpansionTop1Scorer - -if current_platform.is_cuda_alike(): - from vllm.spec_decode.draft_model_runner import TP1DraftModelRunner - -from vllm.spec_decode.interfaces import (SpeculativeProposals, - SpeculativeScorer, SpeculativeScores) -from vllm.spec_decode.medusa_worker import MedusaWorker -from vllm.spec_decode.metrics import AsyncMetricsCollector -from vllm.spec_decode.mlp_speculator_worker import MLPSpeculatorWorker -from vllm.spec_decode.mqa_scorer import MQAScorer -from vllm.spec_decode.multi_step_worker import MultiStepWorker -from vllm.spec_decode.ngram_worker import NGramWorker -from vllm.spec_decode.proposer_worker_base import ProposerWorkerBase -from vllm.spec_decode.smaller_tp_proposer_worker import SmallerTpProposerWorker 
-from vllm.spec_decode.target_model_runner import TargetModelRunner -from vllm.spec_decode.util import (Timer, create_logprobs_output, - create_sequence_group_output, - get_all_num_logprobs, - get_sampled_token_logprobs, nvtx_range, - split_batch_by_proposal_len) -from vllm.utils import resolve_obj_by_qualname -from vllm.worker.worker_base import LoRANotSupportedWorkerBase, WorkerBase - -logger = init_logger(__name__) - - -def create_spec_worker(*args, **kwargs) -> "SpecDecodeWorker": - """Helper method that is the entrypoint for Executors which use - WorkerWrapper. It constructs a SpecDecodeWorker from the speculative config. - """ - vllm_config: VllmConfig = kwargs.get("vllm_config") - speculative_config: SpeculativeConfig = vllm_config.speculative_config - assert speculative_config is not None - - if vllm_config.parallel_config.pipeline_parallel_size > 1: - raise NotImplementedError("Speculative decoding is currently " - "incompatible with pipeline parallelism") - - draft_worker_kwargs = kwargs.copy() - - kwargs["model_runner_cls"] = TargetModelRunner - target_worker_config = copy.deepcopy(vllm_config) - target_worker_config.parallel_config.worker_cls =\ - target_worker_config.parallel_config.sd_worker_cls - cls = resolve_obj_by_qualname( - target_worker_config.parallel_config.worker_cls) - target_worker = cls(*args, **kwargs) - # Set the disable_logprobs variable in the TargetModelRunner instance - # as per its value specified in the SpeculativeConfig. - target_worker.model_runner.disable_logprobs =\ - speculative_config.disable_logprobs - - draft_worker_config = copy.deepcopy(vllm_config) - draft_worker_config.model_config = speculative_config.draft_model_config - draft_worker_config.quant_config = VllmConfig._get_quantization_config( - draft_worker_config.model_config, - vllm_config.load_config, - ) - speculative_config.draft_parallel_config.worker_cls =\ - draft_worker_config.parallel_config.sd_worker_cls - draft_worker_config.parallel_config = speculative_config.draft_parallel_config # noqa - # TODO allow draft-model specific load config. - - # Override draft-model specific worker args. - draft_worker_kwargs.update( - vllm_config=draft_worker_config, - ngram_prompt_lookup_max=speculative_config.prompt_lookup_max, - ngram_prompt_lookup_min=speculative_config.prompt_lookup_min, - ) - - spec_decode_worker = SpecDecodeWorker.create_worker( - scorer_worker=target_worker, - draft_worker_kwargs=draft_worker_kwargs, - disable_mqa_scorer=speculative_config.disable_mqa_scorer, - disable_by_batch_size=speculative_config.disable_by_batch_size, - draft_token_acceptance_method=speculative_config.acceptance_method, - typical_acceptance_sampler_posterior_threshold=speculative_config. - posterior_threshold, - typical_acceptance_sampler_posterior_alpha=speculative_config. - posterior_alpha, - disable_logprobs=speculative_config.disable_logprobs, - disable_log_stats=speculative_config.disable_log_stats, - num_speculative_tokens=speculative_config.num_speculative_tokens, - ) - - return spec_decode_worker - - -# Reminder: Please update docs/features/compatibility_matrix.md -# If the feature combo become valid -class SpecDecodeWorker(LoRANotSupportedWorkerBase): - """Worker which implements speculative decoding. - - Speculative decoding reduces decoding per-token latency by using a proposal - method, such as a small draft model, to speculate ahead of a larger LLM. 
The - probabilities of the speculative tokens are then determined by the larger - LLM, after which some verification routine determines which (if any) of the - speculative tokens are accepted by the larger LLM. - - See https://github.com/vllm-project/vllm/pull/2188 and - https://github.com/vllm-project/vllm/pull/3103 for more info. - - The current implementation has the following limitations: - * Only draft-model proposal is implemented (contributions for more forms are - welcome!). - * Only top-1 proposal and scoring are implemented. Tree-attention is left as - future work. - * All sequences in a batch must have the same proposal length, or zero. This - can be improved by having per-sequence speculation in the future. - * The scoring forward pass is done without an MQA kernel, which is - suboptimal especially as the batch size, proposal length, and sequence - lengths grow. Contributions to add a MQA scoring are welcome once - correctness tests pass. - More info here https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit. - """ - - @classmethod - def create_worker( - cls, - scorer_worker: WorkerBase, - draft_worker_kwargs: Dict[str, Any], - disable_mqa_scorer: bool, - disable_by_batch_size: Optional[int], - draft_token_acceptance_method: str, - typical_acceptance_sampler_posterior_threshold: float, - typical_acceptance_sampler_posterior_alpha: float, - disable_logprobs: bool, - disable_log_stats: bool, - num_speculative_tokens: int, - ) -> "SpecDecodeWorker": - - allow_zero_draft_token_step = True - enable_lm_head_weight_load = False - num_spec_prefill_steps = 1 - ngram_prompt_lookup_max = ( - draft_worker_kwargs.pop("ngram_prompt_lookup_max")) - ngram_prompt_lookup_min = ( - draft_worker_kwargs.pop("ngram_prompt_lookup_min")) - draft_model_config = draft_worker_kwargs["vllm_config"].model_config - draft_parallel_config: ParallelConfig = draft_worker_kwargs[ - 'vllm_config'].parallel_config - if ngram_prompt_lookup_max > 0: - draft_worker_kwargs[ - "device_type"] = scorer_worker.device_config.device.type - proposer_worker = NGramWorker(**draft_worker_kwargs) - proposer_worker.set_ngram_window_size(ngram_prompt_lookup_min, - ngram_prompt_lookup_max) - else: - draft_tp = draft_parallel_config.tensor_parallel_size - target_tp = scorer_worker.parallel_config.tensor_parallel_size - - if draft_model_config.hf_config.model_type == "mlp_speculator": - proposer_worker = MLPSpeculatorWorker(**draft_worker_kwargs) - elif draft_model_config.hf_config.model_type == "medusa": - proposer_worker = MedusaWorker(**draft_worker_kwargs) - else: - if draft_tp == 1: - if current_platform.is_cuda_alike(): - draft_worker_kwargs[ - "model_runner_cls"] = TP1DraftModelRunner - else: - if draft_model_config.hf_config.model_type == "eagle": - raise NotImplementedError( - f"{draft_model_config.hf_config.model_type} " - "does not support TP > 1 yet") - - allow_zero_draft_token_step = False - - # Load lm_head weight for eagle in init_device - if draft_model_config.hf_config.model_type == "eagle": - enable_lm_head_weight_load = True - - proposer_worker = MultiStepWorker(**draft_worker_kwargs) - if draft_model_config.hf_config.model_type == "deepseek_mtp": - num_spec_prefill_steps = \ - draft_model_config.hf_config.n_predict - - proposer_worker = SmallerTpProposerWorker.maybe_wrap_worker( - proposer_worker, draft_tp, target_tp) - - logger.info("Configuring SpecDecodeWorker with proposer=%s", - type(proposer_worker)) - - spec_decode_sampler: SpecDecodeBaseSampler = None - if 
draft_token_acceptance_method == "rejection_sampler": - spec_decode_sampler = RejectionSampler() - elif draft_token_acceptance_method == "typical_acceptance_sampler": - spec_decode_sampler = TypicalAcceptanceSampler( - posterior_threshold=\ - typical_acceptance_sampler_posterior_threshold, - posterior_alpha=typical_acceptance_sampler_posterior_alpha, - ) - logger.info( - "[Speculative Decoding] Configuring" - " SpecDecodeWorker with sampler=%s", type(spec_decode_sampler)) - - if not disable_mqa_scorer: - if scorer_worker.model_runner.attn_backend.get_name( - ) != "FLASH_ATTN": - disable_mqa_scorer = True - logger.info( - "[Speculative Decoding] Disabling MQA scorer as the " - "MQA is only available with flash attn backend.") - - if draft_model_config and \ - draft_model_config.max_model_len < \ - scorer_worker.model_config.max_model_len: - disable_mqa_scorer = True - logger.info( - "[Speculative Decoding] Disabling MQA scorer as the " - "draft model max_model_len is smaller than the target " - "model max_model_len.") - - if not scorer_worker.model_runner.model_config.enforce_eager: - disable_mqa_scorer = True - logger.info( - "[Speculative Decoding] Disabling MQA scorer as the " - "target model is not running in eager mode.") - - return SpecDecodeWorker( - proposer_worker, - scorer_worker, - disable_mqa_scorer=disable_mqa_scorer, - disable_logprobs=disable_logprobs, - disable_log_stats=disable_log_stats, - disable_by_batch_size=disable_by_batch_size, - spec_decode_sampler=spec_decode_sampler, - allow_zero_draft_token_step=allow_zero_draft_token_step, - enable_lm_head_weight_load=enable_lm_head_weight_load, - num_spec_prefill_steps=num_spec_prefill_steps) - - def __init__( - self, - proposer_worker: ProposerWorkerBase, - scorer_worker: WorkerBase, - spec_decode_sampler: SpecDecodeBaseSampler, - disable_mqa_scorer: bool = False, - disable_logprobs: bool = False, - disable_log_stats: bool = False, - metrics_collector: Optional[AsyncMetricsCollector] = None, - disable_by_batch_size: Optional[int] = None, - allow_zero_draft_token_step: Optional[bool] = True, - enable_lm_head_weight_load: Optional[bool] = False, - num_spec_prefill_steps: int = 1, - ): - """ - Create a SpecDecodeWorker. - - Args: - proposer_worker: A worker that can produce speculative tokens for - sequences. - scorer_worker: A worker that produces probabilities of speculative - tokens according to some base model. Typically a vanilla vLLM - Worker. - spec_decode_sampler: A Torch module used to perform acceptance - sampling of the draft tokens in the verification step of - speculative decoding. Currently we support two different - types of sampler namely RejectionSampler and - TypicalAcceptanceSampler. 'spec_decode_sampler' is either an - instance of RejectionSampler or TypicalAcceptanceSampler. - disable_mqa_scorer: If set to True, disable the MQA scorer and use - the BatchExpansionTop1Scorer instead. - disable_logprobs: If set to True, token log probabilities will - not be output in both the draft worker and the target worker. - If set to False, log probabilities will be output by both. - disable_log_stats: If set to True, disable periodic printing of - speculative stage times. - disable_by_batch_size: If the batch size is larger than this, - disable speculative decoding for new incoming requests. - metrics_collector: Helper class for collecting metrics; can be set - for testing purposes. 
- allow_zero_draft_token_step: whether to allow a step where the draft - model generates no draft token; should disallow when the tp of - draft model is larger than 1 (TODO: #5814) - enable_lm_head_weight_load: whether to load lm_head weight for - draft models like eagle. - num_spec_prefill_steps: number of speculative prefill steps to run - before the speculative decoding starts. This is only used when - the draft model is a deepseek_mtp model that requires prefill - kv cache separately for each MTP layer. - """ - self.proposer_worker = proposer_worker - self.scorer_worker = scorer_worker - scorer_runner = getattr(self.scorer_worker, "model_runner", None) - self.generators = scorer_runner.get_generators( - ) if scorer_runner else None - self.disable_by_batch_size = disable_by_batch_size or float("inf") - self.spec_decode_sampler = spec_decode_sampler - self._allow_zero_draft_token_step = allow_zero_draft_token_step - self._enable_lm_head_weight_load = enable_lm_head_weight_load - self._metrics = AsyncMetricsCollector( - self.spec_decode_sampler - ) if metrics_collector is None else metrics_collector - # Tracks the sequence IDs that received a bonus token ID in - # their last forward pass. Needed only if KV cache is being - # used for token generation such as in the case of MultiStepWorker. - self._seq_with_bonus_token_in_last_step: Set[int] = set() - # Tracks the currently active request ids and the sequence IDs - # corresponding to them - self._request_id_seq_id_mapping: Dict[str, Set[int]] = defaultdict(set) - # Tracks if the proposer worker uses the KV cache or not. - - self.probs_dtype = self.spec_decode_sampler.probs_dtype - self.token_id_dtype = self.spec_decode_sampler.token_id_dtype - # Lazy initialization. - self.scorer: SpeculativeScorer - self.disable_mqa_scorer = disable_mqa_scorer - - # Hidden states from target model to pass to proposer - # in the subsequent step. - self.previous_hidden_states: Optional[HiddenStates] = None - self._disable_logprobs = disable_logprobs - self._disable_log_stats = disable_log_stats - self._num_spec_prefill_steps = num_spec_prefill_steps - - def init_device(self) -> None: - """Initialize both scorer and proposer models. - """ - # The scorer worker model is initialized first in case the proposer - # model has a smaller TP degree than the target worker. - self.scorer_worker.init_device() - self.proposer_worker.init_device() - - # NOTE(cade): load_model is not part of the WorkerBase interface. 
- self.scorer_worker.load_model() - self.proposer_worker.load_model() - - if self._enable_lm_head_weight_load: - # NOTE(Shangming): gather lm_head weight when tp enabled - target_lm_head_weight: torch.Tensor = tensor_model_parallel_gather( - self.scorer_worker.model_runner.model_runner.model.lm_head.\ - weight.data, - dim=0, - ) - - self.proposer_worker.maybe_load_lm_head_weight( - target_lm_head_weight) - - self._metrics.init_tensors(self.rank, device_type=self.device) - if model_parallel_is_initialized(): - self.spec_decode_sampler.init_tensors(get_tp_group().local_rank, - device_type=self.device) - else: - self.spec_decode_sampler.init_tensors(self.rank, - device_type=self.device) - - scorer_cls: Type[SpeculativeScorer] - if self.disable_mqa_scorer: - scorer_cls = BatchExpansionTop1Scorer - logger.info("[Speculative Decoding] Use batch " - "expansion for scoring proposals.") - else: - scorer_cls = MQAScorer - logger.info( - "[Speculative Decoding] Use MQA scorer for scoring proposals.") - - self.scorer = scorer_cls(scorer_worker=self.scorer_worker, - device=self.device, - vocab_size=self._vocab_size) - - self._configure_model_sampler_for_spec_decode() - - def load_model(self, *args, **kwargs): - pass - - def _configure_model_sampler_for_spec_decode(self): - """Configure model sampler to emit GPU tensors. This allows spec decode - to keep data on device without transferring to CPU and serializing, - which significantly reduces overhead of sampling during verification. - - NOTE(cade): This breaks abstraction boundaries pretty badly. The better - design is to have the "move to CPU and serialize" sampling decision be - done outside of the model/sampler; this way the "last-mile" worker - object which interfaces with the scheduler can serialize and incur the - performance hit as necessary. This allows us to run the worker several - iterations in a row without incurring the "move to CPU and serialize" - performance penalty. - - Since this requires a large change to vLLM, we defer it to later and - temporarily accept this broken abstraction boundary. - - NOTE(cade): This will require a special check if the proposer worker - does not have a sampler (e.g. ngram speculation). - """ - (self.scorer_worker.model_runner.sampler.include_gpu_probs_tensor - ) = True - (self.scorer_worker.model_runner.sampler. - should_modify_greedy_probs_inplace) = True - self.proposer_worker.set_include_gpu_probs_tensor() - self.proposer_worker.set_should_modify_greedy_probs_inplace() - - def determine_num_available_blocks(self) -> Tuple[int, int]: - """Determine the number of cache blocks to use. - - This is done by profiling the scorer model (which is typically the - larger of the two). Then the total memory which would be used by the - scorer cache is divided evenly between the proposer and scorer model KV, - such that the number of blocks is equal in both KV caches. - """ - num_gpu_blocks, num_cpu_blocks = ( - self.scorer_worker.determine_num_available_blocks()) - - scorer_cache_block_size_bytes = ( - self.scorer_worker.get_cache_block_size_bytes()) - proposer_cache_block_size_bytes = ( - self.proposer_worker.get_cache_block_size_bytes()) - - new_num_gpu_blocks = split_num_cache_blocks_evenly( - scorer_cache_block_size_bytes, proposer_cache_block_size_bytes, - num_gpu_blocks) - return new_num_gpu_blocks, num_cpu_blocks - - def initialize_cache(self, num_gpu_blocks: int, - num_cpu_blocks: int) -> None: - """Initialize the cache engine of the scorer and proposer workers. 
- """ - self.scorer_worker.initialize_cache(num_gpu_blocks=num_gpu_blocks, - num_cpu_blocks=num_cpu_blocks) - self.proposer_worker.initialize_cache(num_gpu_blocks=num_gpu_blocks, - num_cpu_blocks=num_cpu_blocks) - - def get_model(self) -> nn.Module: - return self.scorer_worker.get_model() - - @torch.inference_mode() - def execute_model( - self, - execute_model_req: Optional[ExecuteModelRequest] = None - ) -> List[SamplerOutput]: - """Perform speculative decoding on the input batch. - """ - if self.rank != self._driver_rank: - self._run_non_driver_rank() - return [] - - if execute_model_req is None: - # This signals that there's no more requests to process for now. - # All workers are running infinite loop with broadcast_tensor_dict, - # and it stops the loop when the driver broadcasts an empty input. - # Send an empty input to notify all other workers to stop their - # execution loop. - broadcast_tensor_dict({}, src=0) - return [] - - self._track_finished_requests(execute_model_req) - disable_all_speculation = self._should_disable_all_speculation( - execute_model_req) - num_lookahead_slots = execute_model_req.num_lookahead_slots - all_prompt = True - atleast_one_prompt = False - all_zero_spec_tokens = True - for sgm in execute_model_req.seq_group_metadata_list: - all_prompt = all_prompt and sgm.is_prompt - atleast_one_prompt = atleast_one_prompt or sgm.is_prompt - all_zero_spec_tokens = all_zero_spec_tokens and ( - sgm.num_speculative_tokens == 0) - - if all_prompt and execute_model_req.seq_group_metadata_list: - assert num_lookahead_slots == 0, ( - "Prompt only runs should have num_lookahead_slots equal to 0. " - "This should never happen, please file a bug at " - "https://github.com/vllm-project/vllm/issues") - # Speculative decoding is disabled in the following cases: - # 1. Prefill phase: Speculative decoding is not - # used during the prefill phase. - # 2. Auto-disable enabled: The running queue size exceeds - # the specified threshold. - # 3. No request: There are no requests in the batch, or - # none of the requests in the batch have spec decoding enabled. - # In any of these cases, the proposer and scorer workers - # are called normally. - # We expect `num_speculative_tokens` to be None for prefills. - no_spec = (num_lookahead_slots == 0 or disable_all_speculation - or all_zero_spec_tokens) - - # Broadcast how many lookahead slots are scheduled for this step, and - # whether all speculation is disabled, to all non-driver workers. - - # This is required as if the number of draft model runs changes - # dynamically, the non-driver workers won't know unless we perform a - # communication to inform them. - - # no_spec is used to signal non-driver worker about prefill vs decode - # stage. This is needed to ensure that order of execution of proposer - # and scorer is same in both driver and non-driver workers (i.e., - # scorer -> proposer for prefill and proposer -> scorer in decode). This - # order is needed to support models like EAGLE that take scorer states - # as inputs. - broadcast_dict = dict( - num_lookahead_slots=num_lookahead_slots, - no_spec=no_spec, - disable_all_speculation=disable_all_speculation, - # When both chunked prefill and speculative decoding are enabled - # it is possible that the same batch contains both prefill - # and decodes. If that happens in the scorer we run the batch - # as one single forward pass. However, in the proposer we - # run them as 2 different batches - one for prefill and - # the other for decodes. 
The variable indicates to the non-driver - # worker that there are prefills as part of the speculative batch - # and hence it needs to run an extra prefill forward pass. - run_spec_proposer_for_prefill=atleast_one_prompt, - ) - broadcast_tensor_dict(broadcast_dict, src=self._driver_rank) - - assert execute_model_req.seq_group_metadata_list is not None, ( - "speculative decoding requires non-None seq_group_metadata_list") - - self._maybe_disable_speculative_tokens( - disable_all_speculation, execute_model_req.seq_group_metadata_list) - - if no_spec: - return self._run_no_spec(execute_model_req, - skip_proposer=disable_all_speculation) - return self._run_speculative_decoding_step(execute_model_req, - num_lookahead_slots) - - @torch.inference_mode() - def start_worker_execution_loop(self) -> None: - """Execute model loop to perform speculative decoding - in parallel worker.""" - while self._run_non_driver_rank(): - pass - - def _should_disable_all_speculation( - self, execute_model_req: ExecuteModelRequest) -> bool: - # When the batch size is too large, disable speculative decoding - # to stop trading off throughput for latency. - return (execute_model_req.running_queue_size - >= self.disable_by_batch_size) - - def _maybe_disable_speculative_tokens( - self, disable_all_speculation: bool, - seq_group_metadata_list: List[SequenceGroupMetadata]) -> None: - if not disable_all_speculation: - return - - for seq_group_metadata in seq_group_metadata_list: - # Once num_speculative_tokens is set to 0, the spec decode - # of this request will be disabled forever. - # TODO(comaniac): We currently store spec decoding specific - # state in the global data structure, but we should maintain - # this state within spec decode worker. - seq_group_metadata.num_speculative_tokens = 0 - - def _serialize_sampler_output_no_logprobs( - self, execute_model_req: ExecuteModelRequest, - sampler_output: SamplerOutput) -> List[SamplerOutput]: - """ - Creates and returns a `SamplerOutput` with only the token IDs being - serialized to CPU and populated in `CompletionSequenceGroupOutput`. - All other parameters in `CompletionSequenceGroupOutput` related to log - probabilities are skipped. - - Args: - execute_model_req (ExecuteModelRequest): The model request that - was executed. - sampler_output (SamplerOutput): The output from the sampler with - only GPU tensors populated. - - Returns: - SamplerOutput: A new `SamplerOutput` instance containing a list of - `CompletionSequenceGroupOutput` objects with only token IDs - populated. - """ - seq_output_prompt_logprobs = [ - seq.is_prompt and seq.sampling_params.prompt_logprobs is not None - and seq.sampling_params.prompt_logprobs > 0 - for seq in execute_model_req.seq_group_metadata_list - ] - # ignore slots for prompt tokens that are filled with INVALID_TOKEN_ID - sampled_token_ids_list = (sampler_output.sampled_token_ids[torch.where( - # subtracting is faster than testing for equality - sampler_output.sampled_token_ids - VLLM_INVALID_TOKEN_ID)[0]] \ - if any(seq_output_prompt_logprobs) else \ - sampler_output.sampled_token_ids).tolist() - - seq_data_entries = [ - (seq_id, seq_data) for sg in \ - execute_model_req.seq_group_metadata_list \ - for seq_id, seq_data in sg.seq_data.items() - ] - completion_seq_group_output_list: List[ - CompletionSequenceGroupOutput] = [] - output_index = 0 - # Make sure the non-terminal prefill chunks are still aligned with - # their own empty output. 
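The sentinel-filtering trick used a few lines above (`torch.where(sampled_token_ids - VLLM_INVALID_TOKEN_ID)`) is easy to miss: `torch.where` called on a single tensor returns the indices of its non-zero entries, so subtracting the sentinel keeps exactly the positions whose token differs from it. A standalone illustration with a stand-in sentinel value:

```python
import torch

INVALID = -1  # stand-in for VLLM_INVALID_TOKEN_ID
sampled = torch.tensor([[7], [INVALID], [9], [INVALID]])

# torch.where(x) with a single argument returns the indices of non-zero entries,
# so "x - sentinel" is non-zero exactly where the token is not the sentinel.
keep_rows = torch.where(sampled - INVALID)[0]
print(sampled[keep_rows].squeeze(-1).tolist())  # [7, 9]
```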
- for idx, seq_group_meta in enumerate( - execute_model_req.seq_group_metadata_list): - needs_prompt_logprobs = seq_output_prompt_logprobs[idx] - seq_id, seq_data = seq_data_entries[idx] - if needs_prompt_logprobs: - prompt_token_ids = seq_data.get_prompt_token_ids() - - # Some of these sequences may belong to non-terminal chunks, - # which may still have to report logprobs for prompts. - start = 1 if seq_data._num_computed_tokens == 0 \ - else seq_data._num_computed_tokens - end = (seq_data._num_computed_tokens + \ - seq_group_meta.token_chunk_size) - prompt_token_ids = prompt_token_ids[start:end] - prompt_logprobs = [ - create_logprobs_output( - token_id=p_token_id, - token_id_logprob_rank=-1, - token_id_logprob=0.0, - topk_token_ids=[], - topk_logprobs=[], - ) for p_token_id in prompt_token_ids - ] - else: - prompt_logprobs = None - - # Since we can get chunks here, we dont always have a sampled token - # (only on last chunk) but we still have to provide an output. - if not seq_group_meta.do_sample: - completion_seq_group_output_list.append( - CompletionSequenceGroupOutput( - samples=[], prompt_logprobs=prompt_logprobs)) - continue - - # Sequence with output. - completion_seq_group_output_list.append( - create_sequence_group_output( - token_id=sampled_token_ids_list[output_index][0], - token_id_logprob_rank=-1, - token_id_logprob=0.0, - seq_id=seq_id, - topk_token_ids=[], - topk_logprobs=[], - prompt_logprobs=prompt_logprobs)) - output_index += 1 - - return [SamplerOutput(outputs=completion_seq_group_output_list)] - - @nvtx_range("spec_decode_worker._run_no_spec") - def _run_no_spec(self, execute_model_req: ExecuteModelRequest, - skip_proposer: bool) -> List[SamplerOutput]: - """Run a single generation step without any speculation. The input is - sent to the proposer and scorer model so that the KV cache is consistent - between the two. When skip_proposer is True, the proposer model is - not called, meaning that the kv-cache in proposer for requests is not - updated, so they cannot enable spec decode in the rest decoding. - """ - - sampler_output = self.scorer_worker.execute_model(execute_model_req) - assert len(sampler_output) == 1 - sampler_output = sampler_output[0] - - # Store hidden states from target model execution, BxD. - hidden_states = sampler_output.hidden_states - if hidden_states is not None: - # Only decodes and prefill terminal chunks need a hidden state. - seq_group_meta_with_hidden = [ - sg for sg in execute_model_req.seq_group_metadata_list - if sg.do_sample - ] - if any(seq.is_prompt for seq in seq_group_meta_with_hidden): - # Drop hidden_states with no prediction (eg non-terminal chunks) - hidden_states = hidden_states[ - torch.where(sampler_output.sampled_token_ids - - VLLM_INVALID_TOKEN_ID)[0]] - if self.previous_hidden_states is None and len( - seq_group_meta_with_hidden): - self.previous_hidden_states = HiddenStates( - hidden_states, seq_group_meta_with_hidden) - elif self.previous_hidden_states and len( - seq_group_meta_with_hidden): - self.previous_hidden_states.update(hidden_states, - seq_group_meta_with_hidden) - self.previous_hidden_states.prune(seq_group_meta_with_hidden) - - if not skip_proposer: - # We prepare the prefill hidden states here so that there no - # additional complexity in worker for spec_decode vs non_spec_decode - # flow and execute_model doesn't need additional modifications. 
- execute_model_req.previous_hidden_states = \ - prepare_prefill_hidden_states( - sampler_output.prefill_hidden_states) - for i in range(self._num_spec_prefill_steps): - execute_model_req.spec_step_idx = i - self.proposer_worker.execute_model(execute_model_req) - - sampler_output_to_return = (self._serialize_sampler_output_no_logprobs( - execute_model_req=execute_model_req, sampler_output=sampler_output) - if self._disable_logprobs else - [sampler_output]) - - # Clear device tensors from sampler output. This reduces communication - # overhead when the engine runs in a different process than the workers. - sampler_output.sampled_token_probs = None - sampler_output.sampled_token_ids = None - sampler_output.logprobs = None - return sampler_output_to_return - - def _run_non_driver_rank(self) -> bool: - """Run proposer and verifier model in non-driver workers. This is used - for both speculation cases (num_lookahead_slots>0) and non-speculation - cases (e.g. prefill). - - Returns True if there are remaining sequences to process. - """ - assert self.rank != self._driver_rank - - data = broadcast_tensor_dict(src=self._driver_rank) - if not data: - return False - num_lookahead_slots = data["num_lookahead_slots"] - - # In case of prefill, scorer_worker has to be run before proposer so - # that the hidden states can be propagated to proposer when needed. - if data["no_spec"]: - self.scorer_worker.execute_model() - - if not data["disable_all_speculation"]: - # Even if num_lookahead_slots is zero, we want to run the - # proposer model as it may have KV. - # - # We run the proposer once per lookahead slot. In the future we - # should delegate how many times it runs to the proposer. - for _ in range(max(num_lookahead_slots, 1)): - self.proposer_worker.execute_model() - - if not data["no_spec"]: - self.scorer_worker.execute_model() - if data["run_spec_proposer_for_prefill"]: - self.proposer_worker.execute_model() - - return True - - @nvtx_range("spec_decode_worker._run_speculative_decoding_step") - def _run_speculative_decoding_step( - self, execute_model_req: ExecuteModelRequest, - num_lookahead_slots: int) -> List[SamplerOutput]: - """Execute a single step of speculative decoding. - - This invokes the proposer worker to get k speculative tokens for each - sequence, then scores each speculative token using the scoring worker. - - When `enable_chunked_prefill` is set, scorer will batch decodes and - prefills, while proposer will sync its KV-cache by running an extra - forward on prefills. - - Returns a list of SamplerOutput, each containing a single token per - sequence. - """ - # With prefill chunking, expect requests to have prompts first - # so that backend gets prefill|decode. - assert num_lookahead_slots == execute_model_req.num_lookahead_slots - - # Pass last hidden states from target model to proposer - execute_model_req.previous_hidden_states = self.previous_hidden_states - self.previous_hidden_states = None - - with Timer() as proposal_timer: - # Generate proposals using draft worker. 
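The propose → score → verify structure that this method implements can be illustrated with a greedy toy version. Everything below is hypothetical: `draft_logits_fn` and `target_logits_fn` are arbitrary callables returning next-token logits, and the sketch sidesteps batching, KV caches, and probabilistic acceptance. It only shows how a single step can accept several draft tokens plus one bonus token from the target.

```python
import torch

def greedy_spec_decode(target_logits_fn, draft_logits_fn, prompt, k, steps):
    """Greedy speculative decoding: the draft proposes k tokens, the target scores
    every prefix, and the longest matching prefix plus one bonus token is kept."""
    tokens = list(prompt)
    for _ in range(steps):
        # 1. Proposal: the draft model greedily extends the sequence by k tokens.
        draft = list(tokens)
        for _ in range(k):
            draft.append(int(torch.argmax(draft_logits_fn(draft))))
        proposed = draft[len(tokens):]
        # 2. Scoring: the target model's greedy choice after each proposal prefix.
        target_next = [int(torch.argmax(target_logits_fn(draft[:len(tokens) + i])))
                       for i in range(k + 1)]
        # 3. Verification: keep proposals while they match the target, then append
        #    the target's own token at the first mismatch (the bonus token).
        accepted = []
        for i, tok in enumerate(proposed):
            if tok != target_next[i]:
                break
            accepted.append(tok)
        accepted.append(target_next[len(accepted)])
        tokens.extend(accepted)
    return tokens

# Toy usage: both "models" predict (last_token + 1) % 100, so every proposal is
# accepted and each step advances k + 1 tokens.
fn = lambda seq: torch.nn.functional.one_hot(torch.tensor((seq[-1] + 1) % 100), 100).float()
print(greedy_spec_decode(fn, fn, prompt=[1], k=3, steps=2))  # [1, 2, ..., 9]
```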
- proposals = self.proposer_worker.get_spec_proposals( - execute_model_req, self._seq_with_bonus_token_in_last_step) - - if not self._allow_zero_draft_token_step and proposals.no_proposals: - #TODO: Fix it #5814 - raise RuntimeError("Cannot handle cases where distributed draft " - "workers generate no tokens") - - execute_model_req.previous_hidden_states = None - - with Timer() as scoring_timer: - proposal_scores = self.scorer.score_proposals( - execute_model_req, - proposals, - ) - - _, (non_spec_seqs, non_spec_indices) = split_batch_by_proposal_len( - execute_model_req.seq_group_metadata_list, proposals.proposal_lens) - # With prefill chunking enabled, `non_spec_seqs` contains prefills too: - # discard decodes that have already been processed by proposer. - non_spec_indices = [ - idx for idx in non_spec_indices - if execute_model_req.seq_group_metadata_list[idx].is_prompt - ] - if len(non_spec_indices): - all_hidden_states = proposal_scores.hidden_states - if all_hidden_states is not None: - prefill_hidden_states = all_hidden_states[non_spec_indices] - execute_model_req.previous_hidden_states = \ - prepare_prefill_hidden_states(prefill_hidden_states) - # Sync proposer KV cache for prefills. - prefill_req = execute_model_req.clone(non_spec_seqs) - # TODO avoid sampling here? - self.proposer_worker.execute_model(prefill_req) - - with Timer() as verification_timer: - accepted_token_ids, target_logprobs = self._verify_tokens( - execute_model_req.seq_group_metadata_list, proposal_scores, - proposals, execute_model_req.num_lookahead_slots) - - stage_times = (proposal_timer.elapsed_time_ms / num_lookahead_slots, - scoring_timer.elapsed_time_ms, - verification_timer.elapsed_time_ms) - - return self._create_output_sampler_list( - execute_model_req.seq_group_metadata_list, - accepted_token_ids, - target_logprobs=target_logprobs, - prompt_logprobs=proposal_scores.prompt_logprobs - if not self._disable_logprobs else None, - k=execute_model_req.num_lookahead_slots, - stage_times=stage_times) - - @nvtx_range("spec_decode_worker._verify_tokens") - def _verify_tokens( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - proposal_scores: SpeculativeScores, - proposals: SpeculativeProposals, - max_proposal_len: int, - ) -> Tuple[torch.Tensor, torch.Tensor]: - """Determine which speculative tokens are accepted using the - probabilities of each token according to the proposer and scorer models. - - Returns a tuple of Tensors, one for the accepted token ids and one for - the logprobs according to the scoring model. - """ - proposal_lens_list = proposals.proposal_lens.tolist() - - # vLLM currently only supports proposal lens equal to zero or the batch - # proposal len. This adds some complexity (splitting the batch into spec - # and non spec sequences) and should be removed in the future. It can be - # done by supporting per-sequence proposal lens. - (_, spec_indices), (_, non_spec_indices) = split_batch_by_proposal_len( - seq_group_metadata_list, proposal_lens_list) - original_indices = spec_indices + non_spec_indices - - # Get probabilities of target model, including bonus tokens. - proposal_verifier_probs = proposal_scores.probs[spec_indices] - - # Get non-speculative sampled tokens from target model. - non_spec_token_ids = proposal_scores.token_ids[non_spec_indices] - - # Get bonus tokens from target model. - bonus_token_ids = proposal_scores.token_ids[spec_indices, -1:] - - # Get probabilities according to proposal method. 
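For context on what the sampler invoked below computes: the standard acceptance rule from the speculative decoding literature keeps draft token i with probability min(1, p_target(i) / p_draft(i)) and resamples from the clipped residual distribution at the first rejection. The scalar sketch below is illustrative only; it is not the batched `RejectionSampler` used here, the helper name is made up, and it assumes the draft assigned non-zero probability to its own samples.

```python
import torch

def accept_draft_tokens(draft_ids, draft_probs, target_probs, generator=None):
    """Keep draft token i with prob min(1, p_target_i / p_draft_i); on the first
    rejection, resample from the residual distribution and stop."""
    out = []
    for i, tok in enumerate(draft_ids.tolist()):
        p_t, p_d = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand((), generator=generator) < torch.clamp(p_t / p_d, max=1.0):
            out.append(tok)
        else:
            # Residual distribution: clip(p_target - p_draft, 0), renormalized.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0)
            out.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break
    return out
```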
- proposal_probs = proposals.proposal_probs[spec_indices] - - # Get proposed tokens. - proposal_token_ids = proposals.proposal_token_ids[spec_indices] - - # Sampler arguments - sampler_extra_kwargs: Dict[str, Any] = {} - if self.generators and isinstance(self.spec_decode_sampler, - SpecDecodeStochasticBaseSampler): - sampler_extra_kwargs["seeded_seqs"] = { - idx: self.generators[sgm.request_id] - for idx, sgm in enumerate(seq_group_metadata_list) - if sgm.sampling_params.seed is not None - } - - accepted_token_ids = self.spec_decode_sampler( - target_with_bonus_probs=proposal_verifier_probs, - bonus_token_ids=bonus_token_ids, - draft_probs=proposal_probs, - draft_token_ids=proposal_token_ids, - **sampler_extra_kwargs, - ) - # Append output tokens from non-speculative sequences to - # the accepted token ids tensor. - non_spec_token_ids = non_spec_token_ids.expand(-1, max_proposal_len + - 1).clone() - non_spec_token_ids[:, 1:] = -1 - accepted_token_ids = torch.cat( - [accepted_token_ids, non_spec_token_ids]) - logprobs = proposal_scores.logprobs - # Rearrange so that results are in the order of the original seq group - # metadata. - accepted_token_ids[original_indices] = accepted_token_ids.clone() - - # B x K+1 x D - hidden_states = proposal_scores.hidden_states - if hidden_states is not None: - # Only get terminal hidden states for next step - terminal_metadata = [ - sg for sg in seq_group_metadata_list if sg.do_sample - ] - - # Contract hidden states based on accepted tokens - hs_size = hidden_states.shape[-1] - accepted_index = accepted_token_ids + 1 # Convert -1 to 0 - accepted_index = accepted_index.count_nonzero(dim=1).add_(-1) # b - # Drop non-terminal prefill chunks hidden states. - hidden_states = hidden_states[accepted_index != - VLLM_INVALID_TOKEN_ID] - accepted_index = accepted_index[accepted_index != - VLLM_INVALID_TOKEN_ID] - assert len(accepted_index) == hidden_states.shape[0] == len( - terminal_metadata) - index = accepted_index[:, None, None].expand(-1, 1, - hs_size) # b x 1 x d - second_last_token_hidden_states = hidden_states[:, -2] # b x d - hidden_states = hidden_states.gather(1, index).squeeze(1) # b x d - # Store hidden states from target model for subsequent decode step - self.previous_hidden_states = HiddenStates( - hidden_states, terminal_metadata, - second_last_token_hidden_states) - return accepted_token_ids, logprobs - - def _create_output_sampler_list( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - accepted_token_ids: torch.Tensor, # shape: [batch_size, k+1] - target_logprobs: torch.Tensor, # shape: [batch_size, k+1, vocab_size] - prompt_logprobs: Optional[ - torch.Tensor], # shape: [nprompt_tokens, vocab_size] - k: int, - stage_times: Tuple[float, float, float], - ) -> List[SamplerOutput]: - """Given the accepted token ids, create a list of SamplerOutput. - - The output is padded with -1 tokens such that each sequence has - the same number of outputs. - """ - batch_size, num_steps = accepted_token_ids.shape - accepted_token_ids_by_step = accepted_token_ids.transpose(0, 1) - if self._disable_logprobs: - # We are skipping the logprobs. Hence don't serialize the - # logprobs related tensors from the GPU. Instead create - # empty/dummy lists. - (accepted_token_id_ranks_by_step, - accepted_token_id_logprobs_by_step, - topk_logprobs_by_step, topk_indices_by_step) =\ - self._create_dummy_logprob_lists( - batch_size, num_steps, - self.scorer_worker.model_config.max_logprobs) - else: - # Organize input tensors by step instead of by sequence. 
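The by-sequence to by-step reorganisation mentioned in the comment above boils down to a transpose of the `[batch_size, k+1]` accepted-ids tensor, with -1 marking positions after the first rejection. A toy illustration (values are arbitrary):

```python
import torch

# Accepted ids for 3 sequences with k=2 (columns are steps; -1 pads positions
# after the first rejected draft token).
accepted = torch.tensor([[11, 12, 13],
                         [21, -1, -1],
                         [31, 32, -1]])

by_step = accepted.transpose(0, 1)  # shape [k+1, batch_size]
for step, toks in enumerate(by_step.tolist()):
    print(f"step {step}: {toks}")
# step 0: [11, 21, 31]
# step 1: [12, -1, 32]
# step 2: [13, -1, -1]
```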
- target_logprobs_by_step = target_logprobs.transpose(0, 1) - # Serialize all tensors into Python lists. - (accepted_token_id_ranks_by_step, - accepted_token_id_logprobs_by_step, - topk_logprobs_by_step, topk_indices_by_step) =\ - self._create_logprob_lists_from_tensors( - target_logprobs_by_step, accepted_token_ids_by_step, - self.scorer_worker.model_config.max_logprobs) - - # Get the sequence ids and num_logprobs (sampling parameter) in the - # batch. - seq_ids, request_ids_seq_ids_mapping = get_all_seq_ids_and_request_ids( - seq_group_metadata_list) - - num_logprobs_per_seq = get_all_num_logprobs(seq_group_metadata_list) - - # Serialize tensor to CPU Python list. - accepted_token_ids_by_step = accepted_token_ids_by_step.tolist() - - # Construct the output on a per-step, per-sequence basis. - # Non-terminal prefill chunks will end up here as rows with just -1s - # i.e mixed-batch [[-1, 1576], [-1, 29884], [-1, -1], [-1, -1]] while - # terminal chunks will only have one generated token at time 0. - sampler_output_list: List[SamplerOutput] = [] - - # Prefills are not multi-step (return at most 1 token), in order to - # avoid padding or repetition to fit decodes, we separate them. - for i, sg in enumerate(seq_group_metadata_list): - if not sg.is_prompt: - # Requests are ordered as prefills|decodes=>no more prefills. - break - num_logprobs = num_logprobs_per_seq[i] - seq_kwargs = dict(token_id=-1, - token_id_logprob_rank=0, - token_id_logprob=-float('inf'), - topk_token_ids=[-1] * num_logprobs, - topk_logprobs=[-float('inf')] * num_logprobs, - seq_id=seq_ids[i]) - # Terminal chunk, has token. - if sg.do_sample: - seq_kwargs.update( - dict( - token_id=accepted_token_ids[i][0].item(), - token_id_logprob_rank=accepted_token_id_ranks_by_step[ - 0][i], - token_id_logprob=accepted_token_id_logprobs_by_step[0] - [i], - topk_token_ids=topk_indices_by_step[0][i] - [:num_logprobs], - # output only so step is 0 - topk_logprobs=topk_logprobs_by_step[0][i] - [:num_logprobs], - )) - needs_plogs = (sg.sampling_params.prompt_logprobs - and sg.sampling_params.prompt_logprobs > 0) - plogs = None - if prompt_logprobs is not None: - # Even non-terminal prompt chunks can have logprobs here. - plogs = prompt_logprobs[i] - elif needs_plogs: - # Prompt logprobs are requested but `_disable_logprobs` is set. - seq_data = next(iter(sg.seq_data.values())) - # Get only the tokens in this chunk! - prompt_token_ids = seq_data.get_prompt_token_ids() - prompt_token_ids = prompt_token_ids[ - seq_data. - _num_computed_tokens:seq_data._num_computed_tokens + - sg.token_chunk_size] - - is_first_chunk = seq_data._num_computed_tokens == 0 - # There's no prob generated for the first token in a sequence. - if is_first_chunk: - prompt_token_ids = prompt_token_ids[1:] - plogs = [ - create_logprobs_output( - token_id=p_token_id, - token_id_logprob_rank=-1, - token_id_logprob=0.0, - topk_token_ids=[], - topk_logprobs=[], - ) for p_token_id in prompt_token_ids - ] - seq_kwargs.update(dict(prompt_logprobs=plogs)) - - sampler_output_list.append( - SamplerOutput( - outputs=[create_sequence_group_output( - **seq_kwargs)])) # type: ignore - - # Decodes, create one SamplerOutput per-step (at most K+1). 
- for step_index in range(num_steps): - if all(token_id == -1 for sg, token_id in zip( - seq_group_metadata_list, - accepted_token_ids_by_step[step_index]) - if not sg.is_prompt): - break - - step_output_token_ids: List[CompletionSequenceGroupOutput] = [] - for sequence_index in range(batch_size): - seq_meta = seq_group_metadata_list[sequence_index] - # Prompts already processed above. - if seq_meta.is_prompt: - continue - - # Each sequence may have a different num_logprobs; retrieve it. - num_logprobs = num_logprobs_per_seq[sequence_index] - step_output_token_ids.append( - create_sequence_group_output( - token_id=accepted_token_ids_by_step[step_index] - [sequence_index], - token_id_logprob_rank=accepted_token_id_ranks_by_step[ - step_index][sequence_index], - token_id_logprob=accepted_token_id_logprobs_by_step[ - step_index][sequence_index], - seq_id=seq_ids[sequence_index], - topk_token_ids=topk_indices_by_step[step_index] - [sequence_index][:num_logprobs], - topk_logprobs=topk_logprobs_by_step[step_index] - [sequence_index][:num_logprobs], - step_index=step_index)) - sampler_output_list.append( - SamplerOutput(outputs=step_output_token_ids)) - - # Populate the data structures needed to keep track of sequences with - # bonus tokens. - self._track_sequences_with_bonus_tokens(seq_ids, - request_ids_seq_ids_mapping, - accepted_token_ids_by_step) - maybe_rejsample_metrics = ( - self._metrics.maybe_collect_rejsample_metrics(k)) - if maybe_rejsample_metrics is not None: - sampler_output_list[ - 0].spec_decode_worker_metrics = maybe_rejsample_metrics - - # Log time spent in each stage periodically. - # This is periodic because the rejection sampler emits metrics - # periodically. - self._maybe_log_stage_times(*stage_times) - # First `n_prefills` entries will contain prefills SamplerOutput when - # chunked prefill is enabled, the rest is decodes in multi-step format. - return sampler_output_list - - def _maybe_log_stage_times(self, average_time_per_proposal_tok_ms: float, - scoring_time_ms: float, - verification_time_ms: float) -> None: - """Log the speculative stage times. If stat logging is disabled, do - nothing. - """ - if self._disable_log_stats: - return - - logger.info( - "SpecDecodeWorker stage times: " - "average_time_per_proposal_tok_ms=%.02f " - "scoring_time_ms=%.02f verification_time_ms=%.02f", - average_time_per_proposal_tok_ms, scoring_time_ms, - verification_time_ms) - - def _create_dummy_logprob_lists( - self, - batch_size: int, - num_steps: int, - num_top_k: int, - ) -> Tuple[List[List[int]], List[List[float]], - List[List[List[Optional[float]]]], - List[List[List[Optional[int]]]]]: - """ - Creates and returns four dummy lists representing token probabilities - and their ranks. - - This method initializes and returns: - - The ranks of the accepted tokens, shaped (num_steps, batch_size) - - The log probabilities of the accepted tokens, - shaped (num_steps, batch_size) - - The log probabilities of the top k tokens, - shaped (num_steps, batch_size, num_top_k) - - The token IDs of the top k tokens, - shaped (num_steps, batch_size, num_top_k) - - Args: - batch_size (int): The size of the batch. - num_steps (int): The number of steps in the sequence. - num_top_k (int): The number of top-k token log probabilities to - return. - - Returns: - A tuple containing four dummy lists as described above. 
- """ - accepted_token_id_ranks_by_step = [[-1] * batch_size - for _ in range(num_steps)] - accepted_token_id_logprobs_by_step = [[0.0] * batch_size - for _ in range(num_steps)] - topk_logprobs_by_step: List[List[List[Optional[float]]]] = [[ - [None] * num_top_k for _ in range(batch_size) - ] for _ in range(num_steps)] - topk_indices_by_step: List[List[List[Optional[int]]]] = [[ - [None] * num_top_k for _ in range(batch_size) - ] for _ in range(num_steps)] - return (accepted_token_id_ranks_by_step, - accepted_token_id_logprobs_by_step, topk_logprobs_by_step, - topk_indices_by_step) - - def _create_logprob_lists_from_tensors( - self, - target_logprobs_by_step: torch.Tensor, - accepted_token_ids_by_step: torch.Tensor, - num_top_k: int, - ) -> Tuple[List[List[int]], List[List[float]], - List[List[List[Optional[float]]]], - List[List[List[Optional[int]]]]]: - """ - Creates and returns four lists representing token probabilities and - their ranks. - - This method initializes and returns four lists containing: - - The ranks of the accepted tokens, shaped (num_steps, batch_size) - - The log probabilities of the accepted tokens, - shaped (num_steps, batch_size) - - The log probabilities of the top k tokens, - shaped (num_steps, batch_size, num_top_k) - - The token IDs of the top k tokens, - shaped (num_steps, batch_size, num_top_k) - - Args: - target_logprobs_by_step (torch.Tensor): Tensor representing the - log probabilities of the target model, - shaped (num_steps, batch_size, vocab_size) - accepted_token_ids_by_step (torch.Tensor): Tensor representing - the accepted token_ids, shaped (num_steps, batch_size) - num_top_k (int): The number of top-k token log probabilities to - return. - - Returns: - A tuple containing the lists as described above. - """ - # Serialize all tensors to CPU Python lists. - # Get the logprobs/rank of the accepted tokens. - (accepted_token_id_ranks_by_step_tensor, - accepted_token_id_logprobs_by_step_tensor - ) = get_sampled_token_logprobs( - logprob_tensor=target_logprobs_by_step, - sampled_token_ids=accepted_token_ids_by_step, - ) - # Get the top-k logprobs (which may or may not include the - # logprob of the accepted token). - (topk_logprobs_by_step_tensor, - topk_indices_by_step_tensor) = target_logprobs_by_step.topk( - k=num_top_k, - dim=-1, - ) - accepted_token_id_ranks_by_step = ( - accepted_token_id_ranks_by_step_tensor.tolist()) - accepted_token_id_logprobs_by_step = ( - accepted_token_id_logprobs_by_step_tensor.tolist()) - topk_logprobs_by_step = topk_logprobs_by_step_tensor.tolist() - topk_indices_by_step = topk_indices_by_step_tensor.tolist() - return (accepted_token_id_ranks_by_step, - accepted_token_id_logprobs_by_step, topk_logprobs_by_step, - topk_indices_by_step) - - def _track_finished_requests(self, execute_model_req: ExecuteModelRequest): - """ - Removes the finished requests and their associated sequence ids from - internal book keeping data structures. - """ - for finished_request in execute_model_req.finished_requests_ids: - for seq_id in self._request_id_seq_id_mapping[finished_request]: - self._seq_with_bonus_token_in_last_step.discard(seq_id) - del self._request_id_seq_id_mapping[finished_request] - - def _track_sequences_with_bonus_tokens( - self, seq_ids: List[int], - request_ids_seq_ids_mapping: Dict[str, Set[int]], - accepted_token_ids_by_step: List[List[int]]): - """ - Updates the internal data structures which keep track of sequences - which have been assigned bonus tokens in their last forward pass. 
- """ - for seq_index, seq_id in enumerate(seq_ids): - last_token_id = accepted_token_ids_by_step[-1][seq_index] - if last_token_id == -1: - self._seq_with_bonus_token_in_last_step.discard(seq_id) - else: - self._seq_with_bonus_token_in_last_step.add(seq_id) - for request_id, sequences in request_ids_seq_ids_mapping.items(): - self._request_id_seq_id_mapping[request_id].update(sequences) - - @cached_property - def _vocab_size(self) -> int: - """Get the vocab size of the model and make sure it's consistent between - draft and target workers. - """ - vocab_sizes = [ - worker.vocab_size - for worker in [self.proposer_worker, self.scorer_worker] - ] - assert all(vocab_sizes[0] == vocab_size for vocab_size in vocab_sizes) - return vocab_sizes[0] - - @property - def rank(self): - return self.scorer_worker.rank - - @property - def device(self): - return self.scorer_worker.device - - @property - def _driver_rank(self) -> int: - return 0 - - def get_cache_block_size_bytes(self): - """Return the size of a cache block in bytes. - - This function is only used to compose workers within a SpecDecodeWorker. - We leave composing a SpecDecodeWorker within a SpecDecodeWorker - undefined for now, although it could be implemented in the future. - See https://arxiv.org/abs/2308.04623. - """ - raise NotImplementedError - - def start_profile(self): - if isinstance(self.scorer_worker, WorkerBase): - self.scorer_worker.start_profile() - - def stop_profile(self): - if isinstance(self.scorer_worker, WorkerBase): - self.scorer_worker.stop_profile() - - -def split_num_cache_blocks_evenly(scorer_cache_block_size_bytes: int, - proposer_cache_block_size_bytes: int, - total_num_gpu_blocks: int) -> int: - """Given total_num_gpu_blocks, the number of GPU blocks that could be - allocate to the target model, this function calculates how many blocks - should be given to the draft and target model. - - Note that usually the block size, in bytes, of each model is different, - as it's a function of number of KV/layer, number of heads, and hidden - dimension size. - - Since the target and draft models allocate the same number of blocks, we - simply calculate the number of blocks where if allocated by both models, - the total memory usage from KV cache is no larger than the number of - blocks allocatable by the target model alone. - """ - new_num_gpu_blocks = int( - total_num_gpu_blocks * scorer_cache_block_size_bytes / - (proposer_cache_block_size_bytes + scorer_cache_block_size_bytes)) - - return new_num_gpu_blocks - - -def prepare_prefill_hidden_states( - prefill_hidden_states: torch.Tensor) -> HiddenStates: - # For prefill step in proposer, we run the model for N-1 tokens - # because Nth token will be processed in the first decode step. For - # N-1 tokens, the input should be 0:N-1 hidden states which should - # be concatanated with 1:N token (since output of scorer has to be - # the input for proposer). Therefore, we shift the hidden states to - # align n-1th hidden state with nth token. 
- return HiddenStates(prefill_hidden_states.roll( - shifts=1, dims=0)) if prefill_hidden_states is not None else None diff --git a/vllm/spec_decode/target_model_runner.py b/vllm/spec_decode/target_model_runner.py deleted file mode 100644 index ca89eb60ac5..00000000000 --- a/vllm/spec_decode/target_model_runner.py +++ /dev/null @@ -1,45 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from typing import List, Optional - -from vllm.sequence import SequenceGroupMetadata -from vllm.worker.model_runner_base import (ModelRunnerBase, - ModelRunnerInputBase, - ModelRunnerWrapperBase) - - -class TargetModelRunner(ModelRunnerWrapperBase): - """Specialized model runner for speculative decoding target model. - In speculative decoding, the log probabilities selected finally may not - be the same ones as selected by the target model sampling. This means - that the time spent in the log probability calculation of the target model - is time wasted, since we calculate log probabilities after deciding which - tokens are accepted. For this reason disabling log probabilities in the - target model will make decode faster. The model runner sets the - SamplingMetadata parameters according to whether log probabilities are - requested or not. - """ - - def __init__(self, model_runner: ModelRunnerBase): - # An internal boolean member variable to indicate if token log - # probabilities are needed or not. - super().__init__(model_runner) - self.disable_logprobs = True - - def prepare_model_input( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - virtual_engine: int = 0, - finished_requests_ids: Optional[List[str]] = None, - ) -> ModelRunnerInputBase: - model_input: ModelRunnerInputBase =\ - self.model_runner.prepare_model_input( - seq_group_metadata_list, virtual_engine, finished_requests_ids) - # If token log probabilities is disabled then skip generating sampler - # CPU output. We directly serialize the GPU sampled_token_id tensors - # as needed. If log probabilities is enabled then synchronize all the - # sampling related tensors which includes the logprobs tensors. - model_input.sampling_metadata.skip_sampler_cpu_output = ( - self.disable_logprobs) - return model_input diff --git a/vllm/spec_decode/top1_proposer.py b/vllm/spec_decode/top1_proposer.py deleted file mode 100644 index afd91b42b94..00000000000 --- a/vllm/spec_decode/top1_proposer.py +++ /dev/null @@ -1,275 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from typing import List, Optional, Set, Tuple - -import torch - -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.sequence import ExecuteModelRequest, SequenceGroupMetadata -from vllm.spec_decode.interfaces import (SpeculativeProposals, - SpeculativeProposer) -from vllm.spec_decode.proposer_worker_base import ProposerWorkerBase -from vllm.spec_decode.util import sampler_output_to_torch - - -class Top1Proposer(SpeculativeProposer): - """Helper class which separates out sequences which would exceed the max - model length when speculated upon. - - This allows combinations of models such as JackFram/llama-68m draft with - meta-llama/Llama2-13b-chat-hf, as llama-68m has max_position_embeddings of - 2048 while Llama2-13b has max_position_embeddings of 4096. - - We treat the sequences which exceed the proposal draft model length as - "non-spec sequences". 
Essentially they skip the draft model and go through - normal decoding in the target model. - - Currently, only proposal_lens of 0 and k are supported, where k is a global - batch proposal length. In the future vLLM should support per-sequence - proposal lengths. - """ - - def __init__( - self, - worker: ProposerWorkerBase, - device: str, - vocab_size: int, - max_proposal_len: Optional[int] = None, - ): - self._worker = worker - self._device = device - self.max_proposal_len = max_proposal_len - self._vocab_size = vocab_size - - def get_spec_proposals( - self, - execute_model_req: ExecuteModelRequest, - seq_ids_with_bonus_token_in_last_step: Set[int], - ) -> SpeculativeProposals: - """Get speculative proposals given the input batch. - - Sequences which would exceed the max model length are skipped during - speculation. - """ - proposal_len = execute_model_req.num_lookahead_slots - seq_group_metadata_list = execute_model_req.seq_group_metadata_list - - # Split speculative- and non-speculative- sequences. - ( - proposal_lens, - nonzero_proposal_len_seqs, - nonzero_proposal_len_indices, - ) = self._split_by_proposal_len(seq_group_metadata_list, proposal_len) - - if nonzero_proposal_len_seqs: - # Speculate tokens using the draft worker for the speculative - # sequences. - # If sampler_transposed is true, then maybe_sampler_output's - # token_ids is like [batch] format in proposal_len size list, - # while if it is false, the format would be [proposal_len] - # in batch size list - hidden_states = execute_model_req.previous_hidden_states - if hidden_states is not None: - hidden_states.prune(nonzero_proposal_len_seqs) - nonzero_execute_model_req = ExecuteModelRequest( - seq_group_metadata_list=nonzero_proposal_len_seqs, - num_lookahead_slots=proposal_len, - previous_hidden_states=hidden_states, - ) - maybe_sampler_output, transposed = self._worker.sampler_output( - execute_model_req=nonzero_execute_model_req, - sample_len=proposal_len, - seq_ids_with_bonus_token_in_last_step=\ - seq_ids_with_bonus_token_in_last_step, - ) - ( - proposal_lens, - maybe_sampler_output, - nonzero_proposal_len_indices, - ) = self._remove_no_proposal_seqs(proposal_lens, - maybe_sampler_output, - nonzero_proposal_len_indices, - transposed) - else: - # If no sequences can be speculated, set sampler output to None. - maybe_sampler_output = None - transposed = False - - # Combine speculative- and non-speculative sequences into the same - # representation. - proposal_tokens, proposal_probs, proposal_lens = self._merge_outputs( - batch_size=len(seq_group_metadata_list), - proposal_len=proposal_len, - maybe_sampler_output=maybe_sampler_output, - proposal_lens=proposal_lens, - nonzero_proposal_len_indices=nonzero_proposal_len_indices, - sampler_transposed=transposed, - ) - - proposals = SpeculativeProposals(proposal_token_ids=proposal_tokens, - proposal_probs=proposal_probs, - proposal_lens=proposal_lens, - no_proposals=maybe_sampler_output - is None) - return proposals - - def _split_by_proposal_len( - self, - seq_group_metadata_list: List[SequenceGroupMetadata], - proposal_len: int, - ) -> Tuple[List[int], List[SequenceGroupMetadata], List[int]]: - """Split sequences by two groups: - 1. Sequences with non-zero proposal length. - 2. Sequences with zero proposal length (due to disabled speculation - or exceed the maximum model length). 
- """ - - proposal_lens: List[int] = [] - nonzero_proposal_len_seqs: List[SequenceGroupMetadata] = [] - nonzero_proposal_len_indices: List[int] = [] - for i, seq_group_metadata in enumerate(seq_group_metadata_list): - # The speculative decoding for this request has either been disabled - # (e.g. due to high traffic) or this is a prompt request. - if (seq_group_metadata.is_prompt - or seq_group_metadata.num_speculative_tokens == 0): - proposal_lens.append(0) - continue - - seq_data = next(iter(seq_group_metadata.seq_data.values())) - seq_len = seq_data.get_len() - - # Currently only proposal lens of 0 or the global batch proposal len - # are supported. - # If max_proposal_len is defined, then we shall not exceed this - # quota for nonzero_proposal - new_k = 0 - if (self.max_proposal_len is None - or seq_len + proposal_len < self.max_proposal_len): - new_k = proposal_len - nonzero_proposal_len_seqs.append(seq_group_metadata) - nonzero_proposal_len_indices.append(i) - proposal_lens.append(new_k) - seq_group_metadata.num_speculative_tokens = new_k - - return ( - proposal_lens, - nonzero_proposal_len_seqs, - nonzero_proposal_len_indices, - ) - - @staticmethod - def _remove_no_proposal_seqs(proposal_lens, maybe_sampler_output, - nonzero_proposal_len_indices, transposed): - """Remove sequences from nonzero_proposal_len_indices and reset - their proposal_len to 0 the draft worker does not provide a proposal - (maybe_sampler_output=None). This can avoid scoring overheads. - """ - - # If maybe_sampler_output is None, then the draft worker did not - # provide a proposal for any sequence and thus no action needed. - # Also we do not support transposed maybe_sampler_output for now - # because it seems not straightforward for draft workers outputting - # transposed sampler outputs to handle the case of no proposal. - if maybe_sampler_output is None or transposed: - return (proposal_lens, maybe_sampler_output, - nonzero_proposal_len_indices) - - new_proposal_lens: List[int] = [] - new_nonzero_proposal_len_indices: List[int] = [] - new_maybe_sampler_output: List[SamplerOutput] = [] - nonzero_proposal_len_idx_ptr = 0 - seq_idx = 0 - while seq_idx < len( - proposal_lens) and nonzero_proposal_len_idx_ptr < len( - nonzero_proposal_len_indices): - if seq_idx < nonzero_proposal_len_indices[ - nonzero_proposal_len_idx_ptr]: - # Sequence is not in the original nonzero_proposal_len_indices, - # meaning that it has a proposal length of 0 before sending to - # the draft worker. - assert proposal_lens[seq_idx] == 0 - new_proposal_lens.append(0) - else: - # Sequence is in the original nonzero_proposal_len_indices - if maybe_sampler_output[nonzero_proposal_len_idx_ptr] is None: - # but does not have a proposal from the draft worker. - new_proposal_lens.append(0) - else: - # and has a proposal from the draft worker. Add it to the - # new nonzero proposal list and keep the sampler output. - new_proposal_lens.append(proposal_lens[seq_idx]) - new_nonzero_proposal_len_indices.append(seq_idx) - new_maybe_sampler_output.append( - maybe_sampler_output[nonzero_proposal_len_idx_ptr]) - nonzero_proposal_len_idx_ptr += 1 - seq_idx += 1 - - # The remaining sequences should have proposal length of 0. - new_proposal_lens.extend(proposal_lens[seq_idx:]) - - # We assume sampler_output will not be a list of all Nones. - # In this case this function should not be called. 
- assert new_maybe_sampler_output - return (new_proposal_lens, new_maybe_sampler_output, - new_nonzero_proposal_len_indices) - - def _merge_outputs( - self, - batch_size: int, - proposal_len: int, - maybe_sampler_output: Optional[List[SamplerOutput]], - proposal_lens: List[int], - nonzero_proposal_len_indices: List[int], - sampler_transposed: bool, - ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: - """After speculations are produced, merge the speculation results with - the skipped sequences. - """ - if maybe_sampler_output is None: - # If no speculative tokens, the sampler output will be None. - # In this case we return empty proposals. - proposal_tokens = torch.tensor(-1, - dtype=torch.long, - device=self._device).expand( - batch_size, proposal_len) - proposal_probs = torch.tensor(0, - dtype=torch.float32, - device=self._device).expand( - batch_size, proposal_len, - self._vocab_size) - proposal_lens_tensor = torch.tensor(0, - dtype=torch.long, - device=self._device).expand( - len(proposal_lens)) - return proposal_tokens, proposal_probs, proposal_lens_tensor - - sampler_output = maybe_sampler_output - proposal_tokens, proposal_probs, *_ = sampler_output_to_torch( - sampler_output, sampler_transposed) - - # Now, reformat the output GPU tensors such that each sequence has - # a proposal. the proposal can be empty, e.g. [-1, -1, -1] - - entire_proposal_tokens = proposal_tokens.new_full( - size=(batch_size, *proposal_tokens.shape[1:]), - fill_value=-1, - ) - entire_proposal_tokens[nonzero_proposal_len_indices] = proposal_tokens - entire_proposal_probs = proposal_probs.new_zeros( - batch_size, - *proposal_probs.shape[1:], - ) - entire_proposal_probs[nonzero_proposal_len_indices] = proposal_probs - - proposal_tokens, proposal_probs = ( - entire_proposal_tokens, - entire_proposal_probs, - ) - - proposal_lens_tensor = torch.zeros(batch_size, - dtype=torch.long, - device=self._device) - proposal_lens_tensor[nonzero_proposal_len_indices] = proposal_len - - return proposal_tokens, proposal_probs, proposal_lens_tensor diff --git a/vllm/spec_decode/util.py b/vllm/spec_decode/util.py deleted file mode 100644 index 22d2a4833ac..00000000000 --- a/vllm/spec_decode/util.py +++ /dev/null @@ -1,277 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import time -from contextlib import contextmanager -from typing import Dict, List, Optional, Sequence, Tuple - -import torch - -from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.platforms import current_platform -from vllm.sequence import (CompletionSequenceGroupOutput, Logprob, - PromptLogprobs, SequenceGroupMetadata, - SequenceOutput) - -SeqId = int - - -def get_all_num_logprobs( - seq_group_metadata_list: List[SequenceGroupMetadata]) -> List[int]: - """Given a list of SequenceGroupMetadata, create a list of all num_logprobs. - - If the sampling params do not call for any logprobs, return 0 for that - sequence. - """ - - all_num_logprobs: List[int] = [] - for seq_group_metadata in seq_group_metadata_list: - num_logprobs = seq_group_metadata.sampling_params.logprobs - if num_logprobs is None: - num_logprobs = 0 - all_num_logprobs.append(num_logprobs) - - return all_num_logprobs - - -def get_sampled_token_logprobs( - # shape [num_steps, batch_size, vocab_size] - logprob_tensor: torch.Tensor, - sampled_token_ids: torch.Tensor, # shape [num_steps, batch_size] -) -> Tuple[torch.Tensor, torch.Tensor]: - """Get the logprobs for the sampled tokens. 
Returns the ranks and logprobs. - """ - num_steps, batch_size, vocab_size = logprob_tensor.shape - - selected_logprobs = logprob_tensor[ - torch.arange(num_steps).unsqueeze(1), - torch.arange(batch_size), - sampled_token_ids, - ] - expanded_selected_logprobs = selected_logprobs.unsqueeze(-1).expand( - -1, -1, vocab_size) - sampled_token_ids_ranks = (logprob_tensor - > expanded_selected_logprobs).sum(-1).add_(1) - - return sampled_token_ids_ranks, selected_logprobs - - -def create_logprobs_output( - token_id: int, - token_id_logprob_rank: int, - token_id_logprob: float, - topk_token_ids: List[Optional[int]], - topk_logprobs: List[Optional[float]], -) -> Dict[int, Logprob]: - """Create a Logprob Dict for a token given the sampling results. - - Args: - token_id (int): The sampled token for the sequence. - token_id_logprob_rank (int): The logprob rank of the sampled token. - token_id_logprob (float): The logprob value of the sampled token. - topk_token_ids (List[Optional[int]]): The list of top-k token ids. - topk_logprobs (List[Optional[float]]): The list of top-k logprobs. - """ - # vLLM logprobs always include the sampled token. In addition, the user may - # request topk-logprobs (where top-k varies per user up to max_logprobs). - logprobs: Dict[int, Logprob] = { - token_id: Logprob( - logprob=token_id_logprob, - rank=token_id_logprob_rank, - ), - } - logprobs.update({ - topk_token_id: Logprob( - logprob=topk_logprob if topk_logprob is not None else 0.0, - rank=topk_index + 1, - ) - for topk_index, (topk_token_id, topk_logprob) \ - in enumerate(zip(topk_token_ids, topk_logprobs)) \ - if topk_token_id is not None - }) - - return logprobs - - -def create_sequence_group_output( - token_id: int, - token_id_logprob_rank: int, - token_id_logprob: float, - seq_id: SeqId, - topk_token_ids: List[Optional[int]], - topk_logprobs: List[Optional[float]], - prompt_logprobs: Optional[PromptLogprobs] = None, - step_index: Optional[int] = 0) -> CompletionSequenceGroupOutput: - """Create a SequenceGroupOutput given the sampling results. - - Args: - token_id (int): The sampled token for the sequence. - token_id_logprob_rank (int): The logprob rank of the sampled token. - token_id_logprob (float): The logprob value of the sampled token. - seq_id (int): The sequence id. - topk_token_ids (List[Optional[int]]): The list of top-k token ids. - topk_logprobs (List[Optional[float]]): The list of top-k logprobs. - step_index: (Optional[int]): The index of the speculative token. - """ - - logprobs = create_logprobs_output( - token_id, - token_id_logprob_rank, - token_id_logprob, - topk_token_ids, - topk_logprobs, - ) - - return CompletionSequenceGroupOutput(samples=[ - SequenceOutput(parent_seq_id=seq_id, - output_token=token_id, - logprobs=logprobs) - ], - prompt_logprobs=prompt_logprobs, - step_index=step_index) - - -def split_batch_by_proposal_len( - seq_group_metadata_list: List[SequenceGroupMetadata], - proposal_lens: List[int], -) -> Tuple[Tuple[List[SequenceGroupMetadata], List[int]], Tuple[ - List[SequenceGroupMetadata], List[int]]]: - """Utility function that splits a batch based on whether the proposal len is - zero or not. We should remove this once vLLM supports per-sequence proposal - lens in a batch. 
- """ - - nonzero_lists: Tuple[List[SequenceGroupMetadata], List[int]] = ([], []) - zero_lists: Tuple[List[SequenceGroupMetadata], List[int]] = ([], []) - for i, (seq_group, proposal_len) in enumerate( - zip(seq_group_metadata_list, proposal_lens)): - seq_groups, indices = nonzero_lists if proposal_len else zero_lists - seq_groups.append(seq_group) - indices.append(i) - return nonzero_lists, zero_lists - - -def sampler_output_to_torch( - sampler_output_list: Sequence[SamplerOutput], sampler_transposed: bool -) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, Optional[torch.Tensor]]: - """Utility function which converts a list of SamplerOutput to tensors. - - sampler_transposed here is used as the indicator for whether - we need do additional tensor transpose logic here. - - Returns: - sampled_token_ids: torch.Tensor - shape: [batch_size, len(sampler_output_list)] - - sampled_token_probs: torch.Tensor - shape: [batch_size, len(sampler_output_list), vocab_size] - """ - - # shape: [batch_size, num_sampler_output, vocab_size] - sampled_token_probs = torch.stack( - [ - sampler_output.sampled_token_probs - for sampler_output in sampler_output_list - ], - dim=0, - ) - - # shape: [batch_size, num_sampler_output, vocab_size] - sampled_token_logprobs = torch.stack( - [sampler_output.logprobs for sampler_output in sampler_output_list], - dim=0, - ) - - # shape: [batch_size, num_sampler_output] - sampled_token_ids = torch.stack( - [ - sampler_output.sampled_token_ids.flatten() - for sampler_output in sampler_output_list - ], - dim=0, - ) - - if sampler_transposed: - sampled_token_probs = sampled_token_probs.transpose(0, 1) - sampled_token_logprobs = sampled_token_logprobs.transpose(0, 1) - sampled_token_ids = sampled_token_ids.transpose(0, 1) - - if sampler_output_list[0].hidden_states is not None: - # shape: [batch_size, num_sampler_output, hidden_dim] - sampled_hidden_states = torch.stack( - [ - sampler_output.hidden_states - for sampler_output in sampler_output_list - ], - dim=0, - ) - - if sampler_transposed: - sampled_hidden_states = sampled_hidden_states.transpose(0, 1) - else: - sampled_hidden_states = None - - return (sampled_token_ids, sampled_token_probs, sampled_token_logprobs, - sampled_hidden_states) - - -def maybe_mock_device_tensors(sampler_output: SamplerOutput, batch_size: int, - vocab_size: int, device: str) -> None: - """Helper method which mocks out the GPU tensors in SamplerOutput with dummy - values. This will be removed in PR 7/9. - https://docs.google.com/document/d/1rE4pr3IdspRw97XbImY4fS9IWYuJJ3HGtL7AdIKGrw8/edit#heading=h.qijw1sdidrer - """ - values = [ - sampler_output.sampled_token_probs, sampler_output.sampled_token_ids - ] - assert all(v is None for v in values) or not any(v is None for v in values) - if not any(v is None for v in values): - # Do nothing if the tensors are already created (usually in unit tests). - return - - # Softmax to ensure valid probs. - sampler_output.sampled_token_probs = torch.nn.functional.softmax( - torch.rand(batch_size, vocab_size, dtype=torch.float32, device=device), - dim=-1) - - sampler_output.sampled_token_ids = torch.randint(low=10, - high=100, - size=(batch_size, ), - dtype=torch.long, - device=device) - - -@contextmanager -def nvtx_range(msg, *args, **kwargs): - """ - Context manager / decorator that pushes an NVTX range at the beginning - of its scope, and pops it at the end. If extra arguments are given, - they are passed as arguments to msg.format(). - - If running with cuda graphs, you must enable nsys cuda graph profiling. 
- - Arguments: - msg (string): message to associate with the range - """ - if current_platform.is_cuda_alike(): - torch.cuda.nvtx.range_push(msg.format(*args, **kwargs)) - try: - yield - finally: - torch.cuda.nvtx.range_pop() - else: - yield - - -class Timer: - """Basic timer context manager for measuring CPU time. - """ - - def __enter__(self): - self.start_time = time.time() - return self - - def __exit__(self, exc_type, exc_value, traceback): - self.end_time = time.time() - self.elapsed_time_s = self.end_time - self.start_time - self.elapsed_time_ms = self.elapsed_time_s * 1000 diff --git a/vllm/transformers_utils/configs/eagle.py b/vllm/transformers_utils/configs/eagle.py index fb2e8a1df70..5445a333c49 100644 --- a/vllm/transformers_utils/configs/eagle.py +++ b/vllm/transformers_utils/configs/eagle.py @@ -6,7 +6,6 @@ from transformers import AutoConfig, PretrainedConfig -import vllm.envs as envs from vllm.transformers_utils.configs.deepseek_vl2 import DeepseekV2Config @@ -44,28 +43,25 @@ def __init__(self, self.truncated_vocab_size = self.model.vocab_size if \ truncated_vocab_size is None else truncated_vocab_size - if not envs.VLLM_USE_V1: - kwargs["architectures"] = ["EAGLEModel"] + # Eagle model name should follow naming convention of + # LlamaForCausalLM -> EagleLlamaForCausalLM + if method == "eagle": + assert self.model is not None, \ + "model should not be None when method is eagle" + kwargs["architectures"] = [ + f"Eagle{arch}" if not arch.startswith("Eagle") \ + else arch for arch in self.model.architectures + ] + elif method == "eagle3": + assert self.model is not None, \ + "model should not be None when method is eagle3" + kwargs["architectures"] = [ + f"Eagle3{arch}" if not arch.startswith("Eagle3") \ + else arch for arch in self.model.architectures + ] else: - # Eagle model name should follow naming convention of - # LlamaForCausalLM -> EagleLlamaForCausalLM - if method == "eagle": - assert self.model is not None, \ - "model should not be None when method is eagle" - kwargs["architectures"] = [ - f"Eagle{arch}" if not arch.startswith("Eagle") \ - else arch for arch in self.model.architectures - ] - elif method == "eagle3": - assert self.model is not None, \ - "model should not be None when method is eagle3" - kwargs["architectures"] = [ - f"Eagle3{arch}" if not arch.startswith("Eagle3") \ - else arch for arch in self.model.architectures - ] - else: - raise ValueError(f"Invalid method {method}. \ - Supported methods are eagle and eagle3.") + raise ValueError(f"Invalid method {method}. 
\ + Supported methods are eagle and eagle3.") super().__init__(**kwargs) diff --git a/vllm/worker/worker_base.py b/vllm/worker/worker_base.py index c382b29ad19..55705062d39 100644 --- a/vllm/worker/worker_base.py +++ b/vllm/worker/worker_base.py @@ -397,8 +397,6 @@ def execute_model( model_input, worker_input, kwargs = inputs num_steps = worker_input.num_steps - if execute_model_req is not None and execute_model_req.spec_step_idx: - kwargs["spec_step_idx"] = execute_model_req.spec_step_idx self.execute_worker(worker_input) From 507071387285411036238466c0645168b43da639 Mon Sep 17 00:00:00 2001 From: Varun Sundar Rabindranath Date: Sat, 19 Jul 2025 11:39:51 +0530 Subject: [PATCH 191/552] [Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel (#21193) Signed-off-by: Varun Sundar Rabindranath Co-authored-by: Varun Sundar Rabindranath Signed-off-by: x22x22 --- .../layers/fused_moe/batched_deep_gemm_moe.py | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py index 628aa5c7bb0..3ccddb52998 100644 --- a/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py +++ b/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py @@ -55,6 +55,7 @@ def _silu_mul_fp8_quant_deep_gemm( # Meta --------------------------------------------------------------- BLOCK: tl.constexpr, + NUM_STAGES: tl.constexpr, ): G = H // GROUP_SIZE @@ -73,8 +74,7 @@ def _silu_mul_fp8_quant_deep_gemm( cols = cols.to(tl.int64) mask_h = cols < BLOCK - t = tl.zeros([], tl.int64) - while t < n_tokens: + for t in tl.range(0, n_tokens, num_stages=NUM_STAGES): base_i_offset = (e * stride_i_e + t * stride_i_t + g * GROUP_SIZE * stride_i_h) base_yq_offset = (e * stride_yq_e + t * stride_yq_t + @@ -102,8 +102,6 @@ def _silu_mul_fp8_quant_deep_gemm( tl.store(y_q_ptr + base_yq_offset + cols * stride_yq_h, y_q, mask=mask) tl.store(y_s_ptr + base_ys_offset, y_s) - t += 1 - def silu_mul_fp8_quant_deep_gemm( y: torch.Tensor, # (E, T, 2*H) float32 @@ -180,7 +178,8 @@ def silu_mul_fp8_quant_deep_gemm( fp8_max, is_blackwell_deep_gemm_used(), BLOCK=group_size, - num_warps=4, + NUM_STAGES=8, + num_warps=1, ) return y_q, y_s From 4170396bd76100819b8e3ff79a9c9b55508ad308 Mon Sep 17 00:00:00 2001 From: Lucas Wilkinson Date: Sat, 19 Jul 2025 02:18:48 -0400 Subject: [PATCH 192/552] [BugFix][CPU] Fix `TorchSDPABackendImpl` doesn't have `use_irope` (#21200) Signed-off-by: Lucas Wilkinson Signed-off-by: x22x22 --- vllm/v1/worker/gpu_model_runner.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 9620bf6a795..47b14d076ea 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -2668,7 +2668,8 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: # TODO: Support other attention modules, e.g., cross-attention if attn_module.attn_type == AttentionType.DECODER: use_local_attention = (self.attention_chunk_size is not None - and attn_module.impl.use_irope) + and getattr(attn_module.impl, + "use_irope", False)) if attn_module.sliding_window is not None: kv_cache_spec[layer_name] = SlidingWindowSpec( block_size=block_size, From 01db8d6de079b24f3c386edb9e2e9331033950a7 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Sat, 19 Jul 2025 02:25:22 -0400 Subject: [PATCH 193/552] [Bug] DeepGemm: Fix TypeError: 
per_block_cast_to_fp8() missing 1 required positional argument: 'use_ue8m0' for SM100 (#21187) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- vllm/utils/deep_gemm.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/utils/deep_gemm.py b/vllm/utils/deep_gemm.py index 56326c9315b..8b5713e02c9 100644 --- a/vllm/utils/deep_gemm.py +++ b/vllm/utils/deep_gemm.py @@ -99,7 +99,7 @@ def fp8_m_grouped_gemm_nt_masked(*args, **kwargs): def per_block_cast_to_fp8(x, *args, **kwargs): if _per_block_cast_impl is not None and is_blackwell_deep_gemm_used(): - return _per_block_cast_impl(x) + return _per_block_cast_impl(x, use_ue8m0=True) # TODO: refactor the `per_block_cast_to_fp8` from tests to vllm utils from tests.kernels.quant_utils import per_block_cast_to_fp8 as _pbcf return _pbcf(x, *args, **kwargs) From d7d64b8c513f1aeb0c716fab225cd48a5dd52267 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=EA=B9=80=EC=A2=85=EA=B3=A4?= <149566442+Deepfocused@users.noreply.github.com> Date: Sat, 19 Jul 2025 15:25:44 +0900 Subject: [PATCH 194/552] [Model] EXAONE 4.0 model support (#21060) Signed-off-by: Deepfocused Signed-off-by: woongsik Signed-off-by: x22x22 --- docs/models/supported_models.md | 1 + tests/models/registry.py | 1 + vllm/model_executor/models/exaone4.py | 547 ++++++++++++++++++++ vllm/model_executor/models/registry.py | 1 + vllm/transformers_utils/config.py | 8 +- vllm/transformers_utils/configs/__init__.py | 2 + vllm/transformers_utils/configs/exaone4.py | 252 +++++++++ 7 files changed, 809 insertions(+), 3 deletions(-) create mode 100644 vllm/model_executor/models/exaone4.py create mode 100644 vllm/transformers_utils/configs/exaone4.py diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 11a7f2440a4..3731c676f5e 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -331,6 +331,7 @@ Specified using `--task generate`. | `Ernie4_5_ForCausalLM` | Ernie4.5 | `baidu/ERNIE-4.5-0.3B-PT`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Ernie4_5_MoeForCausalLM` | Ernie4.5MoE | `baidu/ERNIE-4.5-21B-A3B-PT`, `baidu/ERNIE-4.5-300B-A47B-PT`, etc. |✅︎| ✅︎ | ✅︎ | | `ExaoneForCausalLM` | EXAONE-3 | `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `Exaone4ForCausalLM` | EXAONE-4 | `LGAI-EXAONE/EXAONE-4.0-32B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Fairseq2LlamaForCausalLM` | Llama (fairseq2 format) | `mgleize/fairseq2-dummy-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `FalconForCausalLM` | Falcon | `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc. | | ✅︎ | ✅︎ | | `FalconMambaForCausalLM` | FalconMamba | `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc. 
| | ✅︎ | ✅︎ | diff --git a/tests/models/registry.py b/tests/models/registry.py index 3ffa7f81a1a..095e6f59011 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -169,6 +169,7 @@ def check_available_online( "Ernie4_5_MoeForCausalLM": _HfExamplesInfo("baidu/ERNIE-4.5-21B-A3B-PT", trust_remote_code=True), "ExaoneForCausalLM": _HfExamplesInfo("LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct"), # noqa: E501 + "Exaone4ForCausalLM": _HfExamplesInfo("LGAI-EXAONE/EXAONE-4.0-32B"), # noqa: E501 "Fairseq2LlamaForCausalLM": _HfExamplesInfo("mgleize/fairseq2-dummy-Llama-3.2-1B"), # noqa: E501 "FalconForCausalLM": _HfExamplesInfo("tiiuae/falcon-7b"), "FalconH1ForCausalLM":_HfExamplesInfo("tiiuae/Falcon-H1-0.5B-Base", diff --git a/vllm/model_executor/models/exaone4.py b/vllm/model_executor/models/exaone4.py new file mode 100644 index 00000000000..97aeb6fd7b1 --- /dev/null +++ b/vllm/model_executor/models/exaone4.py @@ -0,0 +1,547 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# ruff: noqa: E501 + +# Adapted from +# https://github.com/lgai-exaone/transformers/blob/add-exaone4/src/transformers/models/exaone4/modeling_exaone4.py +# Copyright 2025 The LG CNS Gen AI Solution Delivery Team. +# Copyright 2025 The LG AI Research and HuggingFace Inc. team. All rights reserved. +# +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Inference-only Exaone model compatible with HuggingFace weights.""" + +from collections.abc import Iterable +from typing import Any, Optional, Union + +import torch +from torch import nn + +from vllm.attention import Attention +from vllm.compilation.decorators import support_torch_compile +from vllm.config import CacheConfig, VllmConfig +from vllm.distributed import get_pp_group, get_tensor_model_parallel_world_size +from vllm.model_executor.layers.activation import SiluAndMul +from vllm.model_executor.layers.layernorm import RMSNorm +from vllm.model_executor.layers.linear import (MergedColumnParallelLinear, + QKVParallelLinear, + RowParallelLinear) +from vllm.model_executor.layers.logits_processor import LogitsProcessor +from vllm.model_executor.layers.quantization import QuantizationConfig +from vllm.model_executor.layers.rotary_embedding import get_rope +from vllm.model_executor.layers.vocab_parallel_embedding import ( + DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) +from vllm.model_executor.model_loader.weight_utils import ( + default_weight_loader, maybe_remap_kv_scale_name) +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.sequence import IntermediateTensors +from vllm.transformers_utils.configs.exaone4 import Exaone4Config + +from .interfaces import SupportsLoRA, SupportsPP +from .utils import (AutoWeightsLoader, PPMissingLayer, extract_layer_index, + is_pp_missing_parameter, + make_empty_intermediate_tensors_factory, make_layers, + maybe_prefix) + + +class Exaone4GatedMLP(nn.Module): + + def __init__( + self, + hidden_size: int, + intermediate_size: int, + hidden_act: str, + quant_config: Optional[QuantizationConfig] = None, + bias: bool = False, + prefix: str = "", + ) -> None: + super().__init__() + self.gate_up_proj = MergedColumnParallelLinear( + input_size=hidden_size, + output_sizes=[intermediate_size] * 2, + bias=bias, + quant_config=quant_config, + prefix=f"{prefix}.gate_up_proj", + ) + self.down_proj = RowParallelLinear( + input_size=intermediate_size, + output_size=hidden_size, + bias=bias, + quant_config=quant_config, + prefix=f"{prefix}.down_proj", + ) + if hidden_act != "silu": + raise ValueError(f"Unsupported activation: {hidden_act}. " + "Only silu is supported for now.") + self.act_fn = SiluAndMul() + + def forward(self, x): + gate_up, _ = self.gate_up_proj(x) + x = self.act_fn(gate_up) + x, _ = self.down_proj(x) + return x + + +class Exaone4Attention(nn.Module): + + def __init__( + self, + config: Exaone4Config, + hidden_size: int, + num_heads: int, + num_kv_heads: int, + rope_theta: float = 1000000, + rope_scaling: Optional[dict[str, Any]] = None, + max_position_embeddings: int = 8192, + quant_config: Optional[QuantizationConfig] = None, + bias: bool = False, + cache_config: Optional[CacheConfig] = None, + prefix: str = "", + ) -> None: + super().__init__() + self.hidden_size = hidden_size + tp_size = get_tensor_model_parallel_world_size() + self.total_num_heads = num_heads + assert self.total_num_heads % tp_size == 0 + self.num_heads = self.total_num_heads // tp_size + self.total_num_kv_heads = num_kv_heads + if self.total_num_kv_heads >= tp_size: + # Number of KV heads is greater than TP size, so we partition + # the KV heads across multiple tensor parallel GPUs. + assert self.total_num_kv_heads % tp_size == 0 + else: + # Number of KV heads is less than TP size, so we replicate + # the KV heads across multiple tensor parallel GPUs. 
+ assert tp_size % self.total_num_kv_heads == 0 + self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size) + # MistralConfig has an optional head_dim introduced by Mistral-Nemo + self.head_dim = getattr(config, "head_dim", None) + if self.head_dim is None: + self.head_dim = self.hidden_size // self.total_num_heads + self.q_size = self.num_heads * self.head_dim + self.kv_size = self.num_kv_heads * self.head_dim + self.scaling = self.head_dim**-0.5 + self.rope_theta = rope_theta + self.max_position_embeddings = max_position_embeddings + + self.qkv_proj = QKVParallelLinear( + hidden_size=hidden_size, + head_size=self.head_dim, + total_num_heads=self.total_num_heads, + total_num_kv_heads=self.total_num_kv_heads, + bias=bias, + quant_config=quant_config, + prefix=f"{prefix}.qkv_proj", + ) + + self.o_proj = RowParallelLinear( + input_size=self.total_num_heads * self.head_dim, + output_size=hidden_size, + bias=bias, + quant_config=quant_config, + prefix=f"{prefix}.o_proj", + ) + + self.q_norm = RMSNorm(self.head_dim, eps=config.rms_norm_eps) + self.k_norm = RMSNorm(self.head_dim, eps=config.rms_norm_eps) + + is_neox_style = True + if quant_config is not None and quant_config.get_name() == "gguf": + is_neox_style = False + + self.apply_all_layers = False # apply rotary embeddings to every layer. + layer_idx = extract_layer_index(prefix) + interleaved_sliding_window = getattr(config, + "interleaved_sliding_window", + 4096) + sliding_window_pattern = getattr(config, "sliding_window_pattern", + "LLLG") + + if sliding_window_pattern: + layer_has_sliding_window = ( + layer_idx + 1) % sliding_window_pattern.__len__() != 0 + else: + layer_has_sliding_window = False + self.apply_all_layers = True + + if layer_has_sliding_window: + self.sliding_window = interleaved_sliding_window + else: + self.sliding_window = None + + self.rotary_emb = get_rope( + self.head_dim, + rotary_dim=self.head_dim, + max_position=max_position_embeddings, + base=rope_theta, + rope_scaling=rope_scaling, + is_neox_style=is_neox_style, + ) + self.attn = Attention( + self.num_heads, + self.head_dim, + self.scaling, + num_kv_heads=self.num_kv_heads, + cache_config=cache_config, + quant_config=quant_config, + per_layer_sliding_window=self.sliding_window, + prefix=f"{prefix}.attn", + ) + + def forward( + self, + positions: torch.Tensor, + hidden_states: torch.Tensor, + ) -> torch.Tensor: + qkv, _ = self.qkv_proj(hidden_states) + q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1) + + q = q.unflatten(-1, (self.num_heads, self.head_dim)) + q = self.q_norm(q) + q = q.flatten(-2, -1) + k = k.unflatten(-1, (self.num_kv_heads, self.head_dim)) + k = self.k_norm(k) + k = k.flatten(-2, -1) + + if self.sliding_window or self.apply_all_layers: + q, k = self.rotary_emb(positions, q, k) + attn_output = self.attn(q, k, v) + output, _ = self.o_proj(attn_output) + return output + + +class Exaone4DecoderLayer(nn.Module): + + def __init__( + self, + config: Exaone4Config, + cache_config: Optional[CacheConfig] = None, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + ) -> None: + super().__init__() + self.hidden_size = config.hidden_size + rope_theta = getattr(config, "rope_theta", 1000000) + rope_scaling = getattr(config, "rope_scaling", None) + if rope_scaling is not None and getattr( + config, "original_max_position_embeddings", None): + rope_scaling["original_max_position_embeddings"] = ( + config.original_max_position_embeddings) + max_position_embeddings = getattr(config, "max_position_embeddings", + 
8192) + # Support abacusai/Smaug-72B-v0.1 with attention_bias + # Support internlm/internlm-7b with bias + attention_bias = getattr(config, "attention_bias", False) or getattr( + config, "bias", False) + + self.self_attn = Exaone4Attention( + config=config, + hidden_size=self.hidden_size, + num_heads=config.num_attention_heads, + num_kv_heads=getattr(config, "num_key_value_heads", + config.num_attention_heads), + rope_theta=rope_theta, + rope_scaling=rope_scaling, + max_position_embeddings=max_position_embeddings, + quant_config=quant_config, + bias=attention_bias, + cache_config=cache_config, + prefix=f"{prefix}.self_attn", + ) + self.mlp = Exaone4GatedMLP( + hidden_size=self.hidden_size, + intermediate_size=config.intermediate_size, + hidden_act=config.hidden_act, + quant_config=quant_config, + bias=getattr(config, "mlp_bias", False), + prefix=f"{prefix}.mlp", + ) + self.post_attention_layernorm = RMSNorm(config.hidden_size, + eps=config.rms_norm_eps) + self.post_feedforward_layernorm = RMSNorm(config.hidden_size, + eps=config.rms_norm_eps) + + def forward( + self, + positions: torch.Tensor, + hidden_states: torch.Tensor, + residual: Optional[torch.Tensor], + ) -> tuple[torch.Tensor, torch.Tensor]: + residual = hidden_states + + # Self Attention + hidden_states = self.self_attn( + positions=positions, + hidden_states=hidden_states, + ) + + # Use post-LN + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = residual + hidden_states + + residual = hidden_states + + # Fully Connected + hidden_states = self.mlp(hidden_states) + + # Use post-LN + hidden_states = self.post_feedforward_layernorm(hidden_states) + hidden_states = residual + hidden_states + + return hidden_states, residual + + +@support_torch_compile +class Exaone4Model(nn.Module): + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + + config = vllm_config.model_config.hf_config + cache_config = vllm_config.cache_config + quant_config = vllm_config.quant_config + lora_config = vllm_config.lora_config + + self.config = config + self.quant_config = quant_config + lora_vocab = ((lora_config.lora_extra_vocab_size * + (lora_config.max_loras or 1)) if lora_config else 0) + self.vocab_size = config.vocab_size + lora_vocab + if get_pp_group().is_first_rank or (config.tie_word_embeddings + and get_pp_group().is_last_rank): + self.embed_tokens = VocabParallelEmbedding( + self.vocab_size, + config.hidden_size, + org_num_embeddings=config.vocab_size, + quant_config=quant_config, + ) + else: + self.embed_tokens = PPMissingLayer() + self.start_layer, self.end_layer, self.layers = make_layers( + config.num_hidden_layers, + lambda prefix: Exaone4DecoderLayer( + config=config, + cache_config=cache_config, + quant_config=quant_config, + prefix=prefix, + ), + prefix=f"{prefix}.layers", + ) + if get_pp_group().is_last_rank: + self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + else: + self.norm = PPMissingLayer() + + self.make_empty_intermediate_tensors = ( + make_empty_intermediate_tensors_factory( + ["hidden_states", "residual"], config.hidden_size)) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.embed_tokens(input_ids) + + def forward( + self, + input_ids: Optional[torch.Tensor], + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors], + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + if get_pp_group().is_first_rank: + if inputs_embeds is not None: + 
hidden_states = inputs_embeds + else: + hidden_states = self.get_input_embeddings(input_ids) + residual = None + else: + assert intermediate_tensors is not None + hidden_states = intermediate_tensors["hidden_states"] + residual = intermediate_tensors["residual"] + + for layer in self.layers[self.start_layer:self.end_layer]: + hidden_states, residual = layer( + positions, + hidden_states, + residual, + ) + + if not get_pp_group().is_last_rank: + return IntermediateTensors({ + "hidden_states": hidden_states, + "residual": residual + }) + + hidden_states = self.norm(hidden_states) + return hidden_states + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + (".qkv_proj", ".q_proj", "q"), + (".qkv_proj", ".k_proj", "k"), + (".qkv_proj", ".v_proj", "v"), + (".gate_up_proj", ".gate_proj", 0), + (".gate_up_proj", ".up_proj", 1), + ] + params_dict = dict(self.named_parameters()) + loaded_params: set[str] = set() + for name, loaded_weight in weights: + if "rotary_emb.inv_freq" in name: + continue + if ("rotary_emb.cos_cached" in name + or "rotary_emb.sin_cached" in name): + # Models trained using ColossalAI may include these tensors in + # the checkpoint. Skip them. + continue + if (self.quant_config is not None and + (scale_name := self.quant_config.get_cache_scale(name))): + # Loading kv cache quantization scales + param = params_dict[scale_name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + loaded_weight = (loaded_weight if loaded_weight.dim() == 0 else + loaded_weight[0]) + weight_loader(param, loaded_weight) + loaded_params.add(scale_name) + continue + for param_name, weight_name, shard_id in stacked_params_mapping: + if weight_name not in name: + continue + name = name.replace(weight_name, param_name) + # Skip loading extra bias for GPTQ models. + if name.endswith(".bias") and name not in params_dict: + continue + + if is_pp_missing_parameter(name, self): + continue + + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, loaded_weight, shard_id) + + break + else: + # Skip loading extra bias for GPTQ models. + if name.endswith(".bias") and name not in params_dict: + continue + # Remapping the name of FP8 kv-scale. 
+ name = maybe_remap_kv_scale_name(name, params_dict) + if name is None: + continue + + if is_pp_missing_parameter(name, self): + continue + + param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + return loaded_params + + +class Exaone4ForCausalLM(nn.Module, SupportsLoRA, SupportsPP): + packed_modules_mapping = { + "qkv_proj": [ + "q_proj", + "k_proj", + "v_proj", + ], + "gate_up_proj": [ + "gate_proj", + "up_proj", + ], + } + + # LoRA specific attributes + embedding_modules = { + "embed_tokens": "input_embeddings", + "lm_head": "output_embeddings", + } + embedding_padding_modules = ["lm_head"] + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + config = vllm_config.model_config.hf_config + quant_config = vllm_config.quant_config + lora_config = vllm_config.lora_config + + self.config = config + self.lora_config = lora_config + self.quant_config = quant_config + + self.model = Exaone4Model( + vllm_config=vllm_config, + prefix=maybe_prefix(prefix, "model"), + ) + if get_pp_group().is_last_rank: + self.unpadded_vocab_size = config.vocab_size + if lora_config: + self.unpadded_vocab_size += lora_config.lora_extra_vocab_size + self.lm_head = ParallelLMHead( + self.unpadded_vocab_size, + config.hidden_size, + org_num_embeddings=config.vocab_size, + padding_size=DEFAULT_VOCAB_PADDING_SIZE + # We need bigger padding if using lora for kernel + # compatibility + if not lora_config else lora_config.lora_vocab_padding_size, + quant_config=quant_config, + ) + if config.tie_word_embeddings: + self.lm_head.weight = self.model.embed_tokens.weight + + logit_scale = getattr(config, "logit_scale", 1.0) + self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, + config.vocab_size, + logit_scale) + else: + self.lm_head = PPMissingLayer() + + self.make_empty_intermediate_tensors = ( + self.model.make_empty_intermediate_tensors) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.model.get_input_embeddings(input_ids) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + model_output = self.model(input_ids, positions, intermediate_tensors, + inputs_embeds) + return model_output + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + logits = self.logits_processor(self.lm_head, hidden_states, + sampling_metadata) + return logits + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + loader = AutoWeightsLoader( + self, + # With tie_word_embeddings, we can skip lm_head.weight + # The weight might appear unnecessarily in the files if the model is + # processed with quantization, LoRA, fine-tuning, etc. 
+ skip_prefixes=(["lm_head."] + if self.config.tie_word_embeddings else None), + ) + return loader.load_weights(weights) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index d5233c28b19..2ca37867b88 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -57,6 +57,7 @@ "Ernie4_5_ForCausalLM": ("ernie45", "Ernie4_5_ForCausalLM"), "Ernie4_5_MoeForCausalLM": ("ernie45_moe", "Ernie4_5_MoeForCausalLM"), "ExaoneForCausalLM": ("exaone", "ExaoneForCausalLM"), + "Exaone4ForCausalLM": ("exaone4", "Exaone4ForCausalLM"), "FalconForCausalLM": ("falcon", "FalconForCausalLM"), "Fairseq2LlamaForCausalLM": ("fairseq2_llama", "Fairseq2LlamaForCausalLM"), "GemmaForCausalLM": ("gemma", "GemmaForCausalLM"), diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py index dc35d212766..2e66dc16b47 100644 --- a/vllm/transformers_utils/config.py +++ b/vllm/transformers_utils/config.py @@ -31,9 +31,10 @@ # yapf: disable from vllm.transformers_utils.configs import (ChatGLMConfig, Cohere2Config, DbrxConfig, DeepseekVLV2Config, - EAGLEConfig, ExaoneConfig, - JAISConfig, KimiVLConfig, - MedusaConfig, MiniMaxText01Config, + EAGLEConfig, Exaone4Config, + ExaoneConfig, JAISConfig, + KimiVLConfig, MedusaConfig, + MiniMaxText01Config, MiniMaxVL01Config, MllamaConfig, MLPSpeculatorConfig, MPTConfig, NemotronConfig, NVLM_D_Config, @@ -87,6 +88,7 @@ def _get_hf_token() -> Optional[str]: "medusa": MedusaConfig, "eagle": EAGLEConfig, "exaone": ExaoneConfig, + "exaone4": Exaone4Config, "minimax_text_01": MiniMaxText01Config, "minimax_vl_01": MiniMaxVL01Config, "nemotron": NemotronConfig, diff --git a/vllm/transformers_utils/configs/__init__.py b/vllm/transformers_utils/configs/__init__.py index 734f1e09d0f..5d84d648f1c 100644 --- a/vllm/transformers_utils/configs/__init__.py +++ b/vllm/transformers_utils/configs/__init__.py @@ -7,6 +7,7 @@ from vllm.transformers_utils.configs.deepseek_vl2 import DeepseekVLV2Config from vllm.transformers_utils.configs.eagle import EAGLEConfig from vllm.transformers_utils.configs.exaone import ExaoneConfig +from vllm.transformers_utils.configs.exaone4 import Exaone4Config # RWConfig is for the original tiiuae/falcon-40b(-instruct) and # tiiuae/falcon-7b(-instruct) models. Newer Falcon models will use the # `FalconConfig` class from the official HuggingFace transformers library. @@ -40,6 +41,7 @@ "MedusaConfig", "EAGLEConfig", "ExaoneConfig", + "Exaone4Config", "MiniMaxText01Config", "MiniMaxVL01Config", "MllamaConfig", diff --git a/vllm/transformers_utils/configs/exaone4.py b/vllm/transformers_utils/configs/exaone4.py new file mode 100644 index 00000000000..a22ebaa6bd6 --- /dev/null +++ b/vllm/transformers_utils/configs/exaone4.py @@ -0,0 +1,252 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# ruff: noqa: E501 + +# Copied from +# https://github.com/lgai-exaone/transformers/blob/add-exaone4/src/transformers/models/exaone4/configuration_exaone4.py +# Copyright 2025 The LG CNS Gen AI Solution Delivery Team. +# Copyright 2025 The LG AI Research and HuggingFace Inc. team. All rights reserved. +# +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from transformers.configuration_utils import (PretrainedConfig, + layer_type_validation) +from transformers.utils import logging + +logger = logging.get_logger(__name__) + + +def check_is_sliding(config, layer_idx): + """ + Check if the current layer is a sliding window attention (local attention) layer. + """ + if config.sliding_window is None: + return False + if config.layer_types is not None: + return config.layer_types[layer_idx] == "sliding_attention" + if isinstance(config.sliding_window_pattern, int): + return ((layer_idx + 1) % config.sliding_window_pattern) != 0 + elif isinstance(config.sliding_window_pattern, str): + assert isinstance(config.sliding_window, int), ( + f"Sliding window must be positive integer, but got {config.sliding_window}" + ) + return (layer_idx != config.num_hidden_layers - 1 + and config.sliding_window_pattern[layer_idx % len( + config.sliding_window_pattern)] == "L") + else: + logger.warning_once( + "Sliding window is set, but none of `sliding_window_pattern` or `layer_types` is set. " + "Defaulting to use 'full_attention' for all layers.") + return False + + +class Exaone4Config(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`Exaone4Model`]. It is used to + instantiate a EXAONE 4.0 model according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the EXAONE-4.0-Instruct [LGAI-EXAONE/EXAONE-4.0-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-Instruct) + NOTE: `EXAONE-4.0-Instruct` is a placeholder model ID. The exact model ID will be updated in the future. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model + outputs. Read the documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 102400): + Vocabulary size of the EXAONE 4.0 model. Defines the number of different tokens that can be represented by the + `inputs_ids` passed when calling [`Exaone4Model`]. + hidden_size (`int`, *optional*, defaults to 4096): + Dimension of the hidden representations. + intermediate_size (`int`, *optional*, defaults to `hidden_size * 4`): + Dimensionality of the MLP representations. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 32): + Number of attention heads for each attention layer in the Transformer decoder. + num_key_value_heads (`int`, *optional*): + This is the number of key_value heads that should be used to implement Grouped Query Attention. If + `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if + `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When + converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed + by meanpooling all the original heads within that group. For more details checkout [this + paper](https://arxiv.org/pdf/2305.13245.pdf). 
If it is not specified, will default to + `num_attention_heads`. + hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): + The non-linear activation function (function or string) in the decoder. + max_position_embeddings (`int`, *optional*, defaults to 2048): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 32768 for EXAONE 3.5). + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + rms_norm_eps (`float`, *optional*, defaults to 1e-05): + The epsilon used by the layer normalization layers. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if ``config.is_decoder=True``. + bos_token_id (`int`, *optional*, defaults to 0): + Beginning of stream token id. + eos_token_id (`int`, *optional*, defaults to 2): + End of stream token id. + tie_word_embeddings (`bool`, *optional*, defaults to `False`): + Whether to tie weight embeddings + rope_theta (`float`, *optional*, defaults to 10000.0): + The base period of the RoPE embeddings. + rope_scaling (`Dict`, *optional*): + Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type + and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value + accordingly. + Expected contents: + `rope_type` (`str`): + The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', + 'llama3'], with 'default' being the original RoPE implementation. + `factor` (`float`, *optional*): + Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In + most scaling types, a `factor` of x will enable the model to handle sequences of length x * + original maximum pre-trained length. + `original_max_position_embeddings` (`int`, *optional*): + Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during + pretraining. + `attention_factor` (`float`, *optional*): + Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention + computation. If unspecified, it defaults to value recommended by the implementation, using the + `factor` field to infer the suggested value. + `beta_fast` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear + ramp function. If unspecified, it defaults to 32. + `beta_slow` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear + ramp function. If unspecified, it defaults to 1. + `short_factor` (`List[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to short contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `long_factor` (`List[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to long contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `low_freq_factor` (`float`, *optional*): + Only used with 'llama3'. 
Scaling factor applied to low frequency components of the RoPE + `high_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + sliding_window (`int`, *optional*): + The size of the sliding window for the sliding window attention. + sliding_window_pattern (`str`, *optional*): + The pattern to use for sliding window attention. Can be one of: + - `None`: No sliding window attention is used + - `int`: Every `sliding_window` layers, use global attention, else use local attention. + - `str`: A sequence of "L" (local attention) and "G" (global attention) characters that defines the + attention pattern. The pattern starts from layer 0 and repeats every `sliding_window` layers. The + final layer always uses global attention regardless of the pattern. + For instance, sliding_window_pattern="LLLG" same as sliding_window=4, which means: + - Layer 0, 1, 2: local attention, + - Layer 3: global attention, + ...(repeated) + layer_types (`list`, *optional*): + Attention pattern for each layer. Prioritized over `sliding_window_pattern`. + + Example: + + ```python + >>> from transformers import Exaone4Model, Exaone4Config + + >>> # Initializing a EXAONE configuration + >>> configuration = Exaone4Config() + + >>> # Initializing a model from configuration + >>> model = Exaone4Model(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "exaone4" + keys_to_ignore_at_inference = ["past_key_values"] + # Default tensor parallel plan for base model `LlamaModel` + base_model_tp_plan = { + "layers.*.self_attn.q_proj": "colwise", + "layers.*.self_attn.k_proj": "colwise", + "layers.*.self_attn.v_proj": "colwise", + "layers.*.self_attn.o_proj": "rowwise", + "layers.*.mlp.gate_proj": "colwise", + "layers.*.mlp.up_proj": "colwise", + "layers.*.mlp.down_proj": "rowwise", + } + base_model_pp_plan = { + "embed_tokens": (["input_ids"], ["inputs_embeds"]), + "layers": (["hidden_states", "attention_mask"], ["hidden_states"]), + "norm": (["hidden_states"], ["hidden_states"]), + } + + def __init__( + self, + vocab_size=102400, + hidden_size=4096, + intermediate_size=None, + num_hidden_layers=32, + num_attention_heads=32, + num_key_value_heads=None, + hidden_act="silu", + max_position_embeddings=2048, + initializer_range=0.02, + rms_norm_eps=1e-5, + use_cache=True, + bos_token_id=0, + eos_token_id=2, + tie_word_embeddings=False, + rope_theta=10000.0, + rope_scaling=None, + attention_dropout=0.0, + sliding_window=None, + sliding_window_pattern=None, + layer_types=None, + **kwargs, + ): + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + if num_key_value_heads is None: + num_key_value_heads = num_attention_heads + self.num_key_value_heads = num_key_value_heads + if intermediate_size: + self.intermediate_size = intermediate_size + else: + self.intermediate_size = hidden_size * 4 + self.hidden_act = hidden_act + self.max_position_embeddings = max_position_embeddings + self.initializer_range = initializer_range + self.rms_norm_eps = rms_norm_eps + self.use_cache = use_cache + self.attention_dropout = attention_dropout + self.rope_theta = rope_theta + self.rope_scaling = rope_scaling + self.sliding_window = sliding_window + self.sliding_window_pattern = sliding_window_pattern + + 
self.layer_types = layer_types + if self.layer_types is None: + self.layer_types = [ + "sliding_attention" + if check_is_sliding(self, i) else "full_attention" + for i in range(self.num_hidden_layers) + ] + layer_type_validation(self.layer_types) + + super().__init__(bos_token_id=bos_token_id, + eos_token_id=eos_token_id, + tie_word_embeddings=tie_word_embeddings, + **kwargs) + + +__all__ = ["Exaone4Config"] From fc8d0e1c61d7279b64db8affff2012493790c4fb Mon Sep 17 00:00:00 2001 From: Chenyaaang <42742451+Chenyaaang@users.noreply.github.com> Date: Sat, 19 Jul 2025 02:06:59 -0700 Subject: [PATCH 195/552] [Misc][Tools][Benchmark] Add readme file for auto_tune script (#20779) Signed-off-by: Chenyaaang Signed-off-by: x22x22 --- benchmarks/auto_tune/README.md | 137 ++++++++++++++++++++++++ benchmarks/{ => auto_tune}/auto_tune.sh | 31 +----- 2 files changed, 138 insertions(+), 30 deletions(-) create mode 100644 benchmarks/auto_tune/README.md rename benchmarks/{ => auto_tune}/auto_tune.sh (81%) diff --git a/benchmarks/auto_tune/README.md b/benchmarks/auto_tune/README.md new file mode 100644 index 00000000000..7732f50b1d2 --- /dev/null +++ b/benchmarks/auto_tune/README.md @@ -0,0 +1,137 @@ +# Automated vLLM Server Parameter Tuning + +This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate. + +## Table of Contents +- [Prerequisites](#prerequisites) +- [Configuration](#configuration) +- [How to Run](#how-to-run) +- [Example Use Cases](#example-use-cases) +- [Output](#output) +- [How It Works](#how-it-works) + +## Prerequisites + +Before running the script, please ensure the following steps are completed: + +1. **Clone vLLM & Set Up Branch**: Clone the vLLM repository and check out to your desired branch. + +```bash +git clone https://github.com/vllm-project/vllm.git +cd vllm +# git checkout +``` + +1. **Install Environment**: Install or update the correct running environment. For TPU usage, activate your `conda` environment and install the corresponding `torch` and `torch_xla` versions. + +2. **Model Configuration**: If you are using a customized model, ensure its configuration files are correctly placed and accessible. + +## Configuration + +You must set the following variables at the top of the script before execution. + +| Variable | Description | Example Value | +| --- | --- | --- | +| `BASE` | **Required.** The absolute path to the parent directory of your vLLM repository directory. | `"$HOME"` | +| `MODEL` | **Required.** The Hugging Face model identifier to be served by vllm. | `"meta-llama/Llama-3.1-8B-Instruct"` | +| `SYSTEM`| **Required.** The hardware you are running on. Choices: `TPU` or `GPU`. (For other systems, it might not support saving profiles) | `"TPU"` | +| `TP` | **Required.** The tensor-parallelism size. | `1` | +| `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) | +| `INPUT_LEN` | **Required.** Request input length. | `4000` | +| `OUTPUT_LEN` | **Required.** Request output length. | `16` | +| `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` | +| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. 
| `500` | +| `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. | `"128 256"` | +| `NUM_BATCHED_TOKENS_LIST` | A space-separated string of `max-num-batched-tokens` values to test. | `"1024 2048 4096"` | + +**Note**: The default `NUM_SEQS_LIST` and `NUM_BATCHED_TOKENS_LIST` are set for medium-sized inputs/outputs. For very short contexts (e.g., 20 input, 20 output tokens), you may need to test larger values for `max-num-seqs`. + +## How to Run + +1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section. +2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost. + +``` +cd +bash auto_tune.sh +``` + + Please note that the `bash auto_tune.sh` command cannot contain full or partial path with keyword `vllm`, otherwise `pkill -f vllm` command will also kill this script itself. + +## Example Use Cases + +Here are a few examples of how to configure the script for different goals: + +### 1. Maximize Throughput (No Latency Constraint) +- **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens. +- **Configuration**: + +```bash +INPUT_LEN=1800 +OUTPUT_LEN=20 +MIN_CACHE_HIT_PCT=0 +MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number +``` + +#### 2. Maximize Throughput with a Latency Requirement +- **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms. +- **Configuration**: + +```bash +INPUT_LEN=1800 +OUTPUT_LEN=20 +MIN_CACHE_HIT_PCT=0 +MAX_LATENCY_ALLOWED_MS=500 +``` + +#### 3. Maximize Throughput with Prefix Caching and Latency Requirements +- **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms. +- **Configuration**: + +```bash +INPUT_LEN=1800 +OUTPUT_LEN=20 +MIN_CACHE_HIT_PCT=60 +MAX_LATENCY_ALLOWED_MS=500 +``` + +## Output + +After the script finishes, you will find the results in a new, timestamped directory created inside `$BASE/auto-benchmark/`. + +- **Log Files**: The directory (`$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/`) contains detailed logs for each run: + - `vllm_log_...txt`: The log output from the vLLM server for each parameter combination. + - `bm_log_...txt`: The log output from the `benchmark_serving.py` script for each benchmark run. + +- **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found. + +``` +# Example result.txt content +hash:a1b2c3d4... +max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8 +max_num_seqs: 128, max_num_batched_tokens: 4096 does not meet latency requirement 500 +... +best_max_num_seqs: 256, best_num_batched_tokens: 2048, best_throughput: 12.5, profile saved in: /home/user/vllm/auto-benchmark/2024_08_01_10_30/profile +``` + + If it cannot find the best parameters, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`. This can be due to either the server not starting properly, or the latency requirement being too strict. + +- **Profiler Trace**: A directory named `profile` is created inside the log directory. 
It contains the profiler trace file (e.g., `.xplane.pb` for TPU or a `.json` trace for GPU) from the single best-performing run. + +## How It Works + +The script follows a systematic process to find the optimal parameters: + +1. **Find Max GPU Memory Utilization**: The script first determines the highest safe `gpu-memory-utilization` (starting from 0.98 and decreasing) that does not cause an Out-Of-Memory (OOM) error when launching the server. This ensures the benchmark runs use the maximum available memory without crashing. + +2. **Iterate and Benchmark**: It then enters a nested loop, iterating through every combination of `max-num-seqs` and `max-num-batched-tokens` provided in the configuration lists. + +3. **Latency-Aware Throughput Search**: For each parameter combination: + - The vLLM server is started. + - A benchmark is first run with an infinite request rate (`--request-rate inf`). + - If the resulting P99 E2E latency is within the `MAX_LATENCY_ALLOWED_MS` limit, this throughput is considered the maximum for this configuration. + - If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement. + +4. **Track Best Result**: Throughout the process, the script tracks the parameter combination that has yielded the highest valid throughput so far. + +5. **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard. diff --git a/benchmarks/auto_tune.sh b/benchmarks/auto_tune/auto_tune.sh similarity index 81% rename from benchmarks/auto_tune.sh rename to benchmarks/auto_tune/auto_tune.sh index b257b57ce06..159ee142147 100644 --- a/benchmarks/auto_tune.sh +++ b/benchmarks/auto_tune/auto_tune.sh @@ -1,36 +1,7 @@ #!/bin/bash # This script aims to tune the best server parameter combinations to maximize throughput for given requirement. -# The current server parameter combination is max_num_seqs and max_num_batched_tokens -# It also supports additional requirement: e2e latency and prefix cache. - -# Pre-requisite: -# 1. Checkout to your branch, install/ update the correct running env. For TPU, activate conda env and install the corresponding torch, xla version. -# 2. If the model is customized, replace the MODEL's config with the customized config. -# 3. Set variables (ALL REQUIRED) -# BASE: your directory for vllm repo -# MODEL: the model served by vllm -# SYSTEM: the hardware, choice TPU or GPU, for other systems, "get best profile" might not support. -# TP: ways of tensor parallelism -# DOWNLOAD_DIR: directory to download and load model weights. -# INPUT_LEN: request input len -# OUTPUT_LEN: request output len -# MIN_CACHE_HIT_PCT: prefix cache rate -# MAX_LATENCY_ALLOWED_MS: (e2e) latency requirement. If there's no latency requirement, set it to a large number like 1000000000 -# NUM_SEQS_LIST: a list of `max-num-seqs` you want to loop with. -# NUM_BATCHED_TOKENS_LIST: a list of `max-num-batched-tokens` you want to loop with. -# Note that the default NUM_SEQS_LIST and NUM_BATCHED_TOKENS_LIST are set for medium size input/output len, for extra short context (such as 20:20), you might need to include larger numbers in NUM_SEQS_LIST. -# 4. Run the script, it might take a long time, you can use tmux to avoid the script stop if disconnection happens. -# 5. The final result will be saved in RESULT file. 
- - -# Example use cases -# 1. Given input_len=1800, output_len=20, what's the best max_num_seqs and max_num_batched_tokens to get highest throughput? -# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=0, MAX_LATENCY_ALLOWED_MS=100000000000 -# 2. If we have latency requirement to be lower than 500ms, what's the best server parameter? -# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=0, MAX_LATENCY_ALLOWED_MS=500 -# 3. If we want to reach 60% prefix cache, what's the best server parameter? -# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=60, MAX_LATENCY_ALLOWED_MS=500 +# See details in README (benchmarks/auto_tune/README.md). TAG=$(date +"%Y_%m_%d_%H_%M") BASE="" From ff307cc7724fee15c83118f9a6c9b37563667a49 Mon Sep 17 00:00:00 2001 From: Huy Do Date: Sat, 19 Jul 2025 02:13:41 -0700 Subject: [PATCH 196/552] Fix a couple of Voxtral tests (#21218) Signed-off-by: Huy Do Signed-off-by: x22x22 --- tests/models/registry.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tests/models/registry.py b/tests/models/registry.py index 095e6f59011..5c546a6c86d 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -449,7 +449,11 @@ def check_available_online( tokenizer="Isotr0py/Florence-2-tokenizer", # noqa: E501 trust_remote_code=True), # noqa: E501 "MllamaForConditionalGeneration": _HfExamplesInfo("meta-llama/Llama-3.2-11B-Vision-Instruct"), # noqa: E501 - "VoxtralForConditionalGeneration": _HfExamplesInfo("mistralai/Voxtral-Mini-3B-2507", tokenizer_mode="mistral"), # noqa: E501 + "VoxtralForConditionalGeneration": _HfExamplesInfo( + "mistralai/Voxtral-Mini-3B-2507", + tokenizer_mode="mistral", + min_transformers_version="4.54" + ), "WhisperForConditionalGeneration": _HfExamplesInfo("openai/whisper-large-v3"), # noqa: E501 # [Cross-encoder] From b675f426facac8002ac7475623d9ea4f9d0b2283 Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Sat, 19 Jul 2025 17:15:41 +0800 Subject: [PATCH 197/552] [V0 deprecation] Remove long context LoRA (#21169) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- tests/lora/conftest.py | 5 -- tests/lora/test_peft_helper.py | 11 ++- vllm/config.py | 14 +--- vllm/engine/arg_utils.py | 5 -- vllm/lora/layers.py | 90 ------------------------- vllm/lora/models.py | 80 +++------------------- vllm/lora/peft_helper.py | 9 --- vllm/lora/punica_wrapper/punica_base.py | 45 +++---------- vllm/lora/punica_wrapper/punica_gpu.py | 21 ++---- vllm/lora/punica_wrapper/punica_tpu.py | 14 ---- vllm/lora/punica_wrapper/utils.py | 38 ++--------- vllm/lora/utils.py | 2 - vllm/lora/worker_manager.py | 2 +- 13 files changed, 35 insertions(+), 301 deletions(-) diff --git a/tests/lora/conftest.py b/tests/lora/conftest.py index 881d5efa691..909b7393313 100644 --- a/tests/lora/conftest.py +++ b/tests/lora/conftest.py @@ -221,11 +221,6 @@ def phi2_lora_files(): return snapshot_download(repo_id="isotr0py/phi-2-test-sql-lora") -@pytest.fixture(scope="session") -def long_context_lora_files_16k_1(): - return snapshot_download(repo_id="SangBinCho/long_context_16k_testing_1") - - @pytest.fixture def llama_2_7b_engine_extra_embeddings(): cleanup_dist_env_and_memory(shutdown_ray=True) diff --git a/tests/lora/test_peft_helper.py b/tests/lora/test_peft_helper.py index f16589e06b2..df8696cf58e 100644 --- a/tests/lora/test_peft_helper.py +++ b/tests/lora/test_peft_helper.py @@ -38,8 +38,8 @@ ] -def test_peft_helper_pass(long_context_lora_files_16k_1, tmp_path): - peft_helper = PEFTHelper.from_local_dir(long_context_lora_files_16k_1, +def 
test_peft_helper_pass(sql_lora_files, tmp_path): + peft_helper = PEFTHelper.from_local_dir(sql_lora_files, max_position_embeddings=4096) lora_config = LoRAConfig(max_lora_rank=16, max_cpu_loras=3, max_loras=2) peft_helper.validate_legal(lora_config) @@ -56,15 +56,12 @@ def test_peft_helper_pass(long_context_lora_files_16k_1, tmp_path): "embed_tokens", "lm_head", ] - assert peft_helper.context_length == 16384 assert peft_helper.vllm_max_position_embeddings == 4096 - assert peft_helper.vllm_long_context_scaling_factor == float( - math.ceil(peft_helper.context_length / - peft_helper.vllm_max_position_embeddings)) + # test RSLoRA rslora_config = dict(use_rslora=True) test_dir = tmp_path / "test_rslora" - shutil.copytree(long_context_lora_files_16k_1, test_dir) + shutil.copytree(sql_lora_files, test_dir) # Load and modify configuration config_path = test_dir / "adapter_config.json" diff --git a/vllm/config.py b/vllm/config.py index c00ca475d8b..5727e97a887 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -3014,12 +3014,7 @@ class LoRAConfig: (added to the base model vocabulary).""" lora_vocab_padding_size: ClassVar[int] = current_platform\ .get_lora_vocab_padding_size() - long_lora_scaling_factors: Optional[tuple[float, ...]] = None - """Specify multiple scaling factors (which can be different from base model - scaling factor - see eg. Long LoRA) to allow for multiple LoRA adapters - trained with those scaling factors to be used at the same time. If not - specified, only adapters trained with the base model scaling factor are - allowed.""" + default_mm_loras: Optional[dict[str, str]] = None """Dictionary mapping specific modalities to LoRA model paths; this field is only applicable to multimodal models and should be leveraged when a @@ -3052,7 +3047,6 @@ def compute_hash(self) -> str: factors.append(self.lora_dtype) factors.append(self.lora_extra_vocab_size) factors.append(self.lora_vocab_padding_size) - factors.append(self.long_lora_scaling_factors) factors.append(self.bias_enabled) hash_str = hashlib.md5(str(factors).encode(), usedforsecurity=False).hexdigest() @@ -3091,11 +3085,6 @@ def verify_with_model_config(self, model_config: ModelConfig): elif isinstance(self.lora_dtype, str): self.lora_dtype = getattr(torch, self.lora_dtype) - def verify_lora_support(self): - if self.long_lora_scaling_factors is not None and envs.VLLM_USE_V1: - raise ValueError( - "V1 LoRA does not support long LoRA, please use V0.") - @config @dataclass(config=ConfigDict(arbitrary_types_allowed=True)) @@ -4593,7 +4582,6 @@ def __post_init__(self): if self.lora_config is not None: self.lora_config.verify_with_cache_config(self.cache_config) self.lora_config.verify_with_model_config(self.model_config) - self.lora_config.verify_lora_support() if self.prompt_adapter_config is not None: self.prompt_adapter_config.verify_with_model_config( self.model_config) diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index a7fcf6c354e..d352a22a6d9 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -358,8 +358,6 @@ class EngineArgs: max_cpu_loras: Optional[int] = LoRAConfig.max_cpu_loras lora_dtype: Optional[Union[str, torch.dtype]] = LoRAConfig.lora_dtype lora_extra_vocab_size: int = LoRAConfig.lora_extra_vocab_size - long_lora_scaling_factors: Optional[tuple[float, ...]] = \ - LoRAConfig.long_lora_scaling_factors # PromptAdapter fields enable_prompt_adapter: bool = False max_prompt_adapters: int = PromptAdapterConfig.max_prompt_adapters @@ -723,8 +721,6 @@ def add_cli_args(parser: 
FlexibleArgumentParser) -> FlexibleArgumentParser: "--lora-dtype", **lora_kwargs["lora_dtype"], ) - lora_group.add_argument("--long-lora-scaling-factors", - **lora_kwargs["long_lora_scaling_factors"]) lora_group.add_argument("--max-cpu-loras", **lora_kwargs["max_cpu_loras"]) lora_group.add_argument("--fully-sharded-loras", @@ -1245,7 +1241,6 @@ def create_engine_config( default_mm_loras=self.default_mm_loras, fully_sharded_loras=self.fully_sharded_loras, lora_extra_vocab_size=self.lora_extra_vocab_size, - long_lora_scaling_factors=self.long_lora_scaling_factors, lora_dtype=self.lora_dtype, max_cpu_loras=self.max_cpu_loras if self.max_cpu_loras and self.max_cpu_loras > 0 else None) if self.enable_lora else None diff --git a/vllm/lora/layers.py b/vllm/lora/layers.py index 779f0264684..c3512ec3dbd 100644 --- a/vllm/lora/layers.py +++ b/vllm/lora/layers.py @@ -28,8 +28,6 @@ RowParallelLinear) # yapf: enable from vllm.model_executor.layers.logits_processor import LogitsProcessor -from vllm.model_executor.layers.rotary_embedding import ( - LinearScalingRotaryEmbedding, RotaryEmbedding) from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) from vllm.platforms import current_platform @@ -1193,91 +1191,3 @@ def can_replace_layer( ) -> bool: # Special handling for the LogitsProcessor. return False - - -class LinearScalingRotaryEmbeddingWithLoRA(BaseLayerWithLoRA): - """Implements RoPE-scaled embeddings with linear scaling for - multiple LoRA adapters with a specialized kernel. - - Replace LinearScalingRotaryEmbedding with MultiLinearScalingRotaryEmbedding - which can handle multi lora adapters in a specialized kernel. - """ - - def __init__(self, base_layer: RotaryEmbedding) -> None: - super().__init__() - self.base_layer = base_layer - - @property - def scaling_factors(self): - return self.base_layer.scaling_factors - - @property - def rotary_dim(self): - return self.base_layer.rotary_dim - - def create_lora_weights( - self, - max_loras: int, - lora_config: LoRAConfig, - model_config: Optional[PretrainedConfig] = None, - ) -> None: - scaling_factors = (list(lora_config.long_lora_scaling_factors) - if lora_config.long_lora_scaling_factors else []) - base_scaling_factor = (self.base_layer.scaling_factor if isinstance( - self.base_layer, LinearScalingRotaryEmbedding) else 1.0) - scaling_factors = sorted( - list(set([base_scaling_factor] + scaling_factors))) - self.base_layer = LinearScalingRotaryEmbedding( - self.base_layer.head_size, - self.base_layer.rotary_dim, - self.base_layer.max_position_embeddings, - self.base_layer.base, - self.base_layer.is_neox_style, - scaling_factors, - self.base_layer.dtype, - ) - - def reset_lora(self, index: int): - ... - - def set_lora( - self, - index: int, - lora_a: torch.Tensor, - lora_b: torch.Tensor, - embeddings_tensor: Optional[torch.Tensor], - bias: Optional[torch.Tensor] = None, - ): - ... 
- - def forward( - self, - positions: torch.Tensor, - query: torch.Tensor, - key: torch.Tensor, - ) -> tuple[torch.Tensor, torch.Tensor]: - return self.base_layer( - positions, - query, - key, - offsets=self.punica_wrapper.long_lora_indices, - ) - - @property - def scaling_factor_to_offset(self) -> dict[float, int]: - return self.base_layer.scaling_factor_to_offset - - @classmethod - def can_replace_layer( - cls, - source_layer: nn.Module, - lora_config: LoRAConfig, - packed_modules_list: list, - model_config: Optional[PretrainedConfig], - ) -> bool: - """Returns True if the layer can be replaced by this LoRA layer.""" - return (type(source_layer) is LinearScalingRotaryEmbedding - or type(source_layer) is RotaryEmbedding) - - def extra_repr(self) -> str: - return self.base_layer.extra_repr() diff --git a/vllm/lora/models.py b/vllm/lora/models.py index 633674d5fb2..e6b19d4748f 100644 --- a/vllm/lora/models.py +++ b/vllm/lora/models.py @@ -4,7 +4,6 @@ import math import os from collections.abc import Sequence -from dataclasses import dataclass, field from typing import Any, Callable, Optional, Union import regex as re @@ -19,9 +18,7 @@ remove_adapter, set_adapter_mapping) from vllm.config import LoRAConfig from vllm.logger import init_logger -from vllm.lora.layers import (BaseLayerWithLoRA, - LinearScalingRotaryEmbeddingWithLoRA, - LoRAMapping) +from vllm.lora.layers import BaseLayerWithLoRA, LoRAMapping from vllm.lora.lora import LoRALayerWeights, PackedLoRALayerWeights from vllm.lora.peft_helper import PEFTHelper from vllm.lora.punica_wrapper import get_punica_wrapper @@ -43,18 +40,6 @@ _GLOBAL_LORA_ID = 0 -@dataclass -class LongContextLoRAContext: - """Context for lora adapters that support long context.""" - # The scaling factors to support long context lora fine tuned models. - scaling_factors: list[float] - # dimension to apply rotary embedding. - rot_dim: int - # offsets to the sin_cos_cache for each lora_id loaded. - # This value is dynamically modified. - offsets_by_lora_id: dict[int, int] = field(default_factory=dict) - - def get_lora_id(): global _GLOBAL_LORA_ID _GLOBAL_LORA_ID += 1 @@ -80,20 +65,16 @@ def __init__( lora_model_id: int, rank: int, loras: dict[str, LoRALayerWeights], - scaling_factor: Optional[float] = None, ) -> None: """ Args: lora_model_id: The integer id for the lora model. rank: lora rank. loras: module name -> weights for lora-replaced layers. - scaling_factor: Scaling factor to support long context lora model. - None if the lora is not tuned for long context support. + """ self.id = lora_model_id - # Scaling factor for long context lora model. None if it is not - # fine tuned for the long context. 
- self.scaling_factor = scaling_factor + assert ( lora_model_id > 0), f"a valid lora id should be greater than 0, got {self.id}" @@ -192,10 +173,7 @@ def from_lora_tensors( for lora in loras.values(): lora.optimize() - return cls(lora_model_id, - peft_helper.r, - loras, - scaling_factor=peft_helper.vllm_long_context_scaling_factor) + return cls(lora_model_id, peft_helper.r, loras) @classmethod def from_local_checkpoint( @@ -360,24 +338,17 @@ def __init__( self.max_num_batched_tokens = math.ceil(max_num_batched_tokens / 8) * 8 self.lora_index_to_id: list[Optional[int]] = [None] * self.lora_slots self.vocab_size = vocab_size - self.long_lora_context: Optional[LongContextLoRAContext] = None self.punica_wrapper = get_punica_wrapper( max_num_batched_tokens, max_batches=self.max_num_seqs, device=self.device, max_loras=self.lora_config.max_loras) - # Scaling factor -> offset to the sin_cos_cache to it. - # Used for long context lora. - self.scaling_factor_to_offset: dict[float, int] = {} + super().__init__(model) self.supported_lora_modules = get_supported_lora_modules(self.model) assert self.supported_lora_modules, "No supported LoRA modules found in" f" {self.model.__class__.__name__}." - if lora_config.long_lora_scaling_factors: - # We need to replace rotary emb layer to do batch computation - # for long lora. - self.supported_lora_modules.append("rotary_emb") self.packed_modules_mapping = get_packed_modules_mapping(self.model) # Used to indicate whether the model is a multimodal model @@ -454,25 +425,9 @@ def _deactivate_adapter(self, lora_id: int): except ValueError: pass - def _set_long_lora_context(self, lora: LoRAModel): - if self.long_lora_context is None: - return - - if lora.scaling_factor is None: - return - - if (lora.scaling_factor not in self.scaling_factor_to_offset): - raise ValueError(f"Long LoRA scaling factor {lora.scaling_factor}" - " has not been initialized.") - - offsets = self.scaling_factor_to_offset.get(lora.scaling_factor) - if offsets: - self.long_lora_context.offsets_by_lora_id[lora.id] = offsets - def _add_adapter(self, lora: LoRAModel): self._create_merged_loras_inplace(lora) self._registered_adapters[lora.id] = lora - self._set_long_lora_context(lora) def pin_adapter(self, lora_id: int) -> bool: """Pin a LoRAModel in the manager cache.""" @@ -488,7 +443,6 @@ def _set_adapter_mapping(self, mapping: LoRAMapping) -> None: self.lora_slots + 1, self.vocab_size, self.lora_config.lora_extra_vocab_size, - self.long_lora_context, ) def remove_all_adapters(self): @@ -528,13 +482,6 @@ def _parent_module(module_name: str) -> str: from_layer(module, self.lora_slots, self.lora_config, packed_moduled_lst, self.model.config)) - # LinearScalingRotaryEmbeddingWithLoRA is used to handle - # long context lora. Register relevant metadata. 
- if isinstance(new_module, LinearScalingRotaryEmbeddingWithLoRA): - self.long_lora_context = LongContextLoRAContext( - new_module.scaling_factors, new_module.rotary_dim) - self.scaling_factor_to_offset = \ - new_module.scaling_factor_to_offset # (yard1): TODO make this more robust if "lm_head" in module_name: logits_processor_module_name = 'logits_processor' @@ -574,15 +521,13 @@ def create_dummy_lora( self, lora_id: int, rank: int, - scaling_factor: Optional[float], embedding_modules: Optional[dict[str, str]] = None) -> LoRAModel: """Create zero-initialized LoRAModel for warmup.""" - model = LoRAModel(lora_id, rank, {}, scaling_factor) + model = LoRAModel(lora_id, rank, {}) for module_name, module in self.model.named_modules(): bias_enabled = self.lora_config.bias_enabled if (not self._match_target_modules(module_name) or not isinstance(module, BaseLayerWithLoRA) - or isinstance(module, LinearScalingRotaryEmbeddingWithLoRA) or self._filter_unsupported_mm_module(module_name)): continue parts = module_name.split(".") @@ -723,11 +668,8 @@ def deactivate_adapter(self, adapter_id: int) -> bool: self._deactivate_adapter) def add_adapter(self, adapter: LoRAModel) -> bool: - logger.debug( - "Adding lora. Model id: %d, " - "int id: %d, " - "scaling factor: %s", adapter.id, adapter.id, - adapter.scaling_factor) + logger.debug("Adding lora. Model id: %d, " + "int id: %d", adapter.id, adapter.id) return add_adapter(adapter, self._registered_adapters, self.capacity, self._add_adapter) @@ -772,10 +714,8 @@ def list_adapters(self) -> dict[int, LoRAModel]: def add_adapter(self, lora: LoRAModel) -> bool: """Add a LoRAModel to the manager.""" - logger.debug( - "Adding lora. Model id: %d, " - "int id: %d, " - "scaling factor: %s", lora.id, lora.id, lora.scaling_factor) + logger.debug("Adding lora. 
Model id: %d, " + "int id: %d", lora.id, lora.id) if lora.id not in self._registered_adapters: self._add_adapter(lora) was_added = True diff --git a/vllm/lora/peft_helper.py b/vllm/lora/peft_helper.py index 24099bf479d..8b8e5cb7d5f 100644 --- a/vllm/lora/peft_helper.py +++ b/vllm/lora/peft_helper.py @@ -35,12 +35,9 @@ class PEFTHelper: use_rslora: bool = field(default=False) # True to use Weight-Decomposed Low-Rank Adaptation (DoRA, see: https://arxiv.org/abs/2402.09353) use_dora: bool = field(default=False) - # long context lora field - context_length: int = field(default=0) # Extra vllm field, start with 'vllm_' to avoid conflict vllm_lora_scaling_factor: float = field(default=1.0) vllm_max_position_embeddings: Optional[int] = field(default=False) - vllm_long_context_scaling_factor: Optional[float] = field(default=None) def _validate_features(self) -> list[str]: """ @@ -59,12 +56,6 @@ def __post_init__(self): self.vllm_lora_scaling_factor = self.lora_alpha / math.sqrt(self.r) else: self.vllm_lora_scaling_factor = self.lora_alpha / self.r - if self.context_length: - if self.vllm_max_position_embeddings is None: - self.vllm_max_position_embeddings = self.context_length - self.vllm_long_context_scaling_factor = float( - math.ceil(self.context_length / - self.vllm_max_position_embeddings)) @classmethod def from_dict(cls, config_dict: dict) -> "PEFTHelper": diff --git a/vllm/lora/punica_wrapper/punica_base.py b/vllm/lora/punica_wrapper/punica_base.py index 5b4902dcbeb..b3413de1c81 100644 --- a/vllm/lora/punica_wrapper/punica_base.py +++ b/vllm/lora/punica_wrapper/punica_base.py @@ -17,7 +17,6 @@ if TYPE_CHECKING: # avoid circuit import from vllm.lora.layers import LoRAMapping - from vllm.lora.models import LongContextLoRAContext class PunicaWrapperABC(ABC): @@ -33,7 +32,6 @@ def update_metadata( max_loras: int, vocab_size: int, extra_vocab_size: int, - long_lora_context: Optional["LongContextLoRAContext"] = None, **kwargs, ) -> None: """ @@ -144,14 +142,11 @@ def __init__(self, max_num_batched_tokens: int, max_batches: int, max_num_batched_tokens, dtype=torch.long, device=device) - self._long_lora_indices = torch.empty(max_num_batched_tokens, - dtype=torch.long, - device=device) - # 5 is the number of indices tensors. + # 4 is the number of indices tensors. # base_indices, sampler_indices, sampler_indices_padded, - # embeddings_indices,long_lora_indices - self.indices_len: list[Optional[int]] = [None] * 5 + # embeddings_indices + self.indices_len: list[Optional[int]] = [None] * 4 # these attributes are the information required for sgmv kernel self._seq_start_locs = torch.empty(max_batches, dtype=torch.long, @@ -176,14 +171,12 @@ def _update_base_metadata( max_loras: int, vocab_size: int, extra_vocab_size: int, - long_lora_context: Optional["LongContextLoRAContext"] = None, ): ( base_indices, sampler_indices, sampler_indices_padded, embeddings_indices, - long_lora_offsets_tensor, indices_len, ) = convert_mapping( mapping, @@ -192,7 +185,6 @@ def _update_base_metadata( vocab_size, extra_vocab_size, self.device, - long_lora_context, ) self._token_lora_indices[:base_indices.shape[0]].copy_(base_indices) self._sampler_indices[:sampler_indices.shape[0]].copy_(sampler_indices) @@ -201,11 +193,7 @@ def _update_base_metadata( self._embeddings_indices[:embeddings_indices. 
shape[0], :embeddings_indices.shape[1]].copy_( embeddings_indices) - if long_lora_offsets_tensor is not None: - self._long_lora_indices[:long_lora_offsets_tensor.shape[0]].copy_( - long_lora_offsets_tensor) - else: - self._long_lora_indices.zero_() + self.indices_len[:] = indices_len def _update_prefill_metadata(self, @@ -312,28 +300,13 @@ def embeddings_indices(self) -> torch.Tensor: embeddings_indices_len = self.indices_len[3] return self._embeddings_indices[:, :embeddings_indices_len] - @property - def long_lora_indices(self) -> torch.Tensor: - """ - This property provides access to the indices used for long context - lora, specifically for LinearScalingRotaryEmbeddingWithLoRA. - """ - long_lora_len = self.indices_len[4] - return self._long_lora_indices[:long_lora_len] - - def update_metadata( - self, - mapping: "LoRAMapping", - lora_index_to_id: list[Optional[int]], - max_loras: int, - vocab_size: int, - extra_vocab_size: int, - long_lora_context: Optional["LongContextLoRAContext"] = None, - **kwargs): + def update_metadata(self, mapping: "LoRAMapping", + lora_index_to_id: list[Optional[int]], max_loras: int, + vocab_size: int, extra_vocab_size: int, **kwargs): self._update_base_metadata(mapping, lora_index_to_id, max_loras, - vocab_size, extra_vocab_size, - long_lora_context) + vocab_size, extra_vocab_size) + if mapping.is_prefill: # Update metadata required for prefill-related operators. self._update_prefill_metadata(self.token_lora_indices) diff --git a/vllm/lora/punica_wrapper/punica_gpu.py b/vllm/lora/punica_wrapper/punica_gpu.py index 6b038309d55..2db0e9fee14 100644 --- a/vllm/lora/punica_wrapper/punica_gpu.py +++ b/vllm/lora/punica_wrapper/punica_gpu.py @@ -7,7 +7,7 @@ https://arxiv.org/abs/2310.18547 """ -from typing import TYPE_CHECKING, Optional, Union, final +from typing import Optional, Union, final import torch @@ -21,10 +21,6 @@ from .punica_base import PunicaWrapperBase -if TYPE_CHECKING: - # avoid circuit import - from vllm.lora.models import LongContextLoRAContext - @final class PunicaWrapperGPU(PunicaWrapperBase): @@ -55,20 +51,13 @@ def __init__(self, max_num_batched_tokens: int, max_batches: int, max_num_prompts, device=device) - def update_metadata( - self, - mapping: LoRAMapping, - lora_index_to_id: list[Optional[int]], - max_loras: int, - vocab_size: int, - extra_vocab_size: int, - long_lora_context: Optional["LongContextLoRAContext"] = None, - **kwargs): + def update_metadata(self, mapping: LoRAMapping, + lora_index_to_id: list[Optional[int]], max_loras: int, + vocab_size: int, extra_vocab_size: int, **kwargs): self.is_prefill = mapping.is_prefill self._update_base_metadata(mapping, lora_index_to_id, max_loras, - vocab_size, extra_vocab_size, - long_lora_context) + vocab_size, extra_vocab_size) # Prepare cuda kernel metadata tensors self.token_mapping_meta.prepare_tensors(self.token_lora_indices) diff --git a/vllm/lora/punica_wrapper/punica_tpu.py b/vllm/lora/punica_wrapper/punica_tpu.py index 6b48268c500..07dc337a1cc 100644 --- a/vllm/lora/punica_wrapper/punica_tpu.py +++ b/vllm/lora/punica_wrapper/punica_tpu.py @@ -14,7 +14,6 @@ if TYPE_CHECKING: # avoid circuit import from vllm.lora.layers import LoRAMapping - from vllm.lora.models import LongContextLoRAContext from .punica_base import PunicaWrapperBase @@ -45,7 +44,6 @@ def __init__(self, max_num_batched_tokens: int, max_batches: int, torch.ops.xla.dynamo_set_buffer_donor_(self._sampler_indices_padded, True) torch.ops.xla.dynamo_set_buffer_donor_(self._embeddings_indices, True) - 
torch.ops.xla.dynamo_set_buffer_donor_(self._long_lora_indices, True) torch.ops.xla.dynamo_set_buffer_donor_(self._lora_indices_per_batch, True) @@ -323,7 +321,6 @@ def _update_base_metadata( max_loras: int, vocab_size: int, extra_vocab_size: int, - long_lora_context: Optional["LongContextLoRAContext"] = None, ): # Make sure we don't accidentally collect outside operations xm.mark_step() @@ -339,7 +336,6 @@ def _update_base_metadata( sampler_indices, sampler_indices_padded, embeddings_indices, - long_lora_offsets_tensor, indices_len, ) = convert_mapping( mapping, @@ -348,7 +344,6 @@ def _update_base_metadata( vocab_size, extra_vocab_size, "cpu", - long_lora_context, ) self._token_lora_indices = self._pad_to_shape( base_indices, self._token_lora_indices.shape, @@ -362,15 +357,6 @@ def _update_base_metadata( self._embeddings_indices = self._pad_to_shape( embeddings_indices, self._embeddings_indices.shape, dims=2).to(self.device) - if long_lora_offsets_tensor is not None: - self._long_lora_indices = self._pad_to_shape( - long_lora_offsets_tensor, - self._long_lora_indices.shape, - dims=1).to(self.device) - else: - zeroed = torch.zeros_like(self._long_lora_indices.cpu(), - dtype=torch.int32) - self._long_lora_indices = zeroed.to(self.device) self.indices_len[:] = indices_len def _update_prefill_metadata(self, diff --git a/vllm/lora/punica_wrapper/utils.py b/vllm/lora/punica_wrapper/utils.py index 8430cb91865..d22c29da1c6 100644 --- a/vllm/lora/punica_wrapper/utils.py +++ b/vllm/lora/punica_wrapper/utils.py @@ -8,7 +8,6 @@ if TYPE_CHECKING: # avoid circuit import from vllm.lora.layers import LoRAMapping - from vllm.lora.models import LongContextLoRAContext def compute_meta( @@ -49,9 +48,7 @@ def convert_mapping( vocab_size: int, extra_vocab_size: int, device: torch.device, - long_lora_context: Optional["LongContextLoRAContext"] = None, -) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, - Optional[torch.Tensor], list[int]]: +) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, list[int]]: """Converts LoRAMapping to index tensors. Args: @@ -60,7 +57,6 @@ def convert_mapping( max_loras: Maximum number of LoRAs. vocab_size: Model vocab size. extra_vocab_size: Extra vocab size each LoRA can have. - long_lora_context: Passed if there are long context lora in a batch. Returns: A tuple of tensors: @@ -78,21 +74,14 @@ def convert_mapping( requests to embedding indices. First row is for embeddings added by the LoRAs, second row is for the LoRA.lora_a embeddings. - long_lora_indices: Tensor of shape [batch_size] mapping - requests to RoPE offsets and rot dims for long LoRAs. - None if long context lora doesn't exist. indices_len: List of lengths of the above tensors. It contains (base_indices, sampler_indices, sampler_indices_padded, - embeddings_indices, long_lora_indices). + embeddings_indices). 
""" index_mapping_indices: list[int] = list(mapping.index_mapping).copy() embedding_indices = index_mapping_indices.copy() lora_indices = index_mapping_indices.copy() - long_lora_offsets: Optional[torch.Tensor] = None - if long_lora_context: - long_lora_offsets = torch.zeros(len(index_mapping_indices), - device=device, - dtype=torch.long) + prompt_mapping: list[int] = [ lora_index_to_id.index(x) if x > 0 else -1 for x in mapping.prompt_mapping @@ -104,20 +93,13 @@ def convert_mapping( if index_mapping_indices[i] > 0 else -1) embedding_indices[i] = lora_idx if index_mapping_indices[i] > 0 else 0 lora_indices[i] = lora_idx - if long_lora_context: - assert long_lora_offsets is not None - lora_offset: int = long_lora_context.offsets_by_lora_id.get( - index_mapping_indices[i], 0) - long_lora_offsets[i] = lora_offset indices_list: list[Union[list[int], torch.Tensor]] = [ index_mapping_indices, lora_indices, embedding_indices, ] - if long_lora_context: - assert long_lora_offsets is not None - indices_list.append(long_lora_offsets) + indices = torch.tensor(indices_list, dtype=torch.long, device=device) prompt_mapping_tensor = torch.tensor(prompt_mapping, dtype=torch.long, @@ -136,11 +118,7 @@ def convert_mapping( sampler_indices_padded = torch.arange( 0, len(sampler_indices_padded), device=device, dtype=torch.long) + ( sampler_indices_padded * len(sampler_indices_padded)) - long_lora_indices = None - long_lora_indices_len: Optional[int] = None - if long_lora_context: - long_lora_indices = indices[3] - long_lora_indices_len = long_lora_indices.shape[-1] + # Contain length of indices tensors. Used to index into each tensor. indices_len = [ base_indices.shape[-1], @@ -148,17 +126,11 @@ def convert_mapping( sampler_indices_padded.shape[-1], embeddings_indices.shape[-1], ] - if long_lora_indices_len is not None: - indices_len.append(long_lora_indices_len) - else: - # If long_lora doesn't exist,append None - indices_len.append(None) return ( base_indices, sampler_indices, sampler_indices_padded, embeddings_indices, - long_lora_indices, indices_len, ) diff --git a/vllm/lora/utils.py b/vllm/lora/utils.py index 7148ffe1494..ab0a9fbd255 100644 --- a/vllm/lora/utils.py +++ b/vllm/lora/utils.py @@ -22,7 +22,6 @@ # yapf conflicts with isort for this block # yapf: disable from vllm.lora.layers import (BaseLayerWithLoRA, ColumnParallelLinearWithLoRA, - LinearScalingRotaryEmbeddingWithLoRA, LogitsProcessorWithLoRA, MergedColumnParallelLinearWithLoRA, MergedQKVParallelLinearWithLoRA, @@ -56,7 +55,6 @@ MergedColumnParallelLinearWithShardedLoRA, MergedQKVParallelLinearWithShardedLoRA, RowParallelLinearWithShardedLoRA, - LinearScalingRotaryEmbeddingWithLoRA, } diff --git a/vllm/lora/worker_manager.py b/vllm/lora/worker_manager.py index 7a4af74cbeb..248d2954f1e 100644 --- a/vllm/lora/worker_manager.py +++ b/vllm/lora/worker_manager.py @@ -154,7 +154,7 @@ def add_dummy_lora(self, lora_request: LoRARequest, rank: int) -> bool: lora_request.lora_int_id) else: dummy_lora = self._adapter_manager.create_dummy_lora( - lora_request.lora_int_id, rank, 1, self.embedding_modules) + lora_request.lora_int_id, rank, self.embedding_modules) if self._cached_dummy_lora is None: self._cached_dummy_lora = dummy_lora return self._adapter_manager.add_adapter(dummy_lora) From eb53c9bf3acfeb82674da1e3298c80eda10ce71e Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Sat, 19 Jul 2025 17:17:16 +0800 Subject: [PATCH 198/552] [Bugfix] Fix ndarray video color from VideoAsset (#21064) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: x22x22 
--- tests/multimodal/test_video.py | 103 +++++++++++++++++++++++++-------- tests/multimodal/utils.py | 46 +++++++++++++++ vllm/assets/video.py | 9 ++- 3 files changed, 130 insertions(+), 28 deletions(-) diff --git a/tests/multimodal/test_video.py b/tests/multimodal/test_video.py index 897c9c33461..05b7b84be7f 100644 --- a/tests/multimodal/test_video.py +++ b/tests/multimodal/test_video.py @@ -1,14 +1,22 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import tempfile +from pathlib import Path + import numpy as np import numpy.typing as npt import pytest +from PIL import Image -from vllm import envs +from vllm.assets.base import get_vllm_public_assets +from vllm.assets.video import video_to_ndarrays, video_to_pil_images_list from vllm.multimodal.image import ImageMediaIO from vllm.multimodal.video import (VIDEO_LOADER_REGISTRY, VideoLoader, VideoMediaIO) +from .utils import cosine_similarity, create_video_from_image, normalize_image + NUM_FRAMES = 10 FAKE_OUTPUT_1 = np.random.rand(NUM_FRAMES, 1280, 720, 3) FAKE_OUTPUT_2 = np.random.rand(NUM_FRAMES, 1280, 720, 3) @@ -59,30 +67,79 @@ def load_bytes(cls, return FAKE_OUTPUT_2 -def test_video_media_io_kwargs(): - envs.VLLM_VIDEO_LOADER_BACKEND = "assert_10_frames_1_fps" - imageio = ImageMediaIO() +def test_video_media_io_kwargs(monkeypatch: pytest.MonkeyPatch): + with monkeypatch.context() as m: + m.setenv("VLLM_VIDEO_LOADER_BACKEND", "assert_10_frames_1_fps") + imageio = ImageMediaIO() - # Verify that different args pass/fail assertions as expected. - videoio = VideoMediaIO(imageio, **{"num_frames": 10, "fps": 1.0}) - _ = videoio.load_bytes(b"test") - - videoio = VideoMediaIO( - imageio, **{ - "num_frames": 10, - "fps": 1.0, - "not_used": "not_used" - }) - _ = videoio.load_bytes(b"test") - - with pytest.raises(AssertionError, match="bad num_frames"): - videoio = VideoMediaIO(imageio, **{}) + # Verify that different args pass/fail assertions as expected. + videoio = VideoMediaIO(imageio, **{"num_frames": 10, "fps": 1.0}) _ = videoio.load_bytes(b"test") - with pytest.raises(AssertionError, match="bad num_frames"): - videoio = VideoMediaIO(imageio, **{"num_frames": 9, "fps": 1.0}) + videoio = VideoMediaIO( + imageio, **{ + "num_frames": 10, + "fps": 1.0, + "not_used": "not_used" + }) _ = videoio.load_bytes(b"test") - with pytest.raises(AssertionError, match="bad fps"): - videoio = VideoMediaIO(imageio, **{"num_frames": 10, "fps": 2.0}) - _ = videoio.load_bytes(b"test") + with pytest.raises(AssertionError, match="bad num_frames"): + videoio = VideoMediaIO(imageio, **{}) + _ = videoio.load_bytes(b"test") + + with pytest.raises(AssertionError, match="bad num_frames"): + videoio = VideoMediaIO(imageio, **{"num_frames": 9, "fps": 1.0}) + _ = videoio.load_bytes(b"test") + + with pytest.raises(AssertionError, match="bad fps"): + videoio = VideoMediaIO(imageio, **{"num_frames": 10, "fps": 2.0}) + _ = videoio.load_bytes(b"test") + + +@pytest.mark.parametrize("is_color", [True, False]) +@pytest.mark.parametrize("fourcc, ext", [("mp4v", "mp4"), ("XVID", "avi")]) +def test_opencv_video_io_colorspace(is_color: bool, fourcc: str, ext: str): + """ + Test all functions that use OpenCV for video I/O return RGB format. + Both RGB and grayscale videos are tested. 
+ """ + image_path = get_vllm_public_assets(filename="stop_sign.jpg", + s3_prefix="vision_model_images") + image = Image.open(image_path) + with tempfile.TemporaryDirectory() as tmpdir: + if not is_color: + image_path = f"{tmpdir}/test_grayscale_image.png" + image = image.convert("L") + image.save(image_path) + # Convert to gray RGB for comparison + image = image.convert("RGB") + video_path = f"{tmpdir}/test_RGB_video.{ext}" + create_video_from_image( + image_path, + video_path, + num_frames=2, + is_color=is_color, + fourcc=fourcc, + ) + + frames = video_to_ndarrays(video_path) + for frame in frames: + sim = cosine_similarity(normalize_image(np.array(frame)), + normalize_image(np.array(image))) + assert np.sum(np.isnan(sim)) / sim.size < 0.001 + assert np.nanmean(sim) > 0.99 + + pil_frames = video_to_pil_images_list(video_path) + for frame in pil_frames: + sim = cosine_similarity(normalize_image(np.array(frame)), + normalize_image(np.array(image))) + assert np.sum(np.isnan(sim)) / sim.size < 0.001 + assert np.nanmean(sim) > 0.99 + + io_frames, _ = VideoMediaIO(ImageMediaIO()).load_file(Path(video_path)) + for frame in io_frames: + sim = cosine_similarity(normalize_image(np.array(frame)), + normalize_image(np.array(image))) + assert np.sum(np.isnan(sim)) / sim.size < 0.001 + assert np.nanmean(sim) > 0.99 diff --git a/tests/multimodal/utils.py b/tests/multimodal/utils.py index 23346509a06..9a58292f9f4 100644 --- a/tests/multimodal/utils.py +++ b/tests/multimodal/utils.py @@ -1,7 +1,9 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import cv2 import numpy as np +import numpy.typing as npt from PIL import Image @@ -31,3 +33,47 @@ def random_audio( ): audio_len = rng.randint(min_len, max_len) return rng.rand(audio_len), sr + + +def create_video_from_image( + image_path: str, + video_path: str, + num_frames: int = 10, + fps: float = 1.0, + is_color: bool = True, + fourcc: str = "mp4v", +): + image = cv2.imread(image_path) + if not is_color: + # Convert to grayscale if is_color is False + image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) + height, width = image.shape + else: + height, width, _ = image.shape + + video_writer = cv2.VideoWriter( + video_path, + cv2.VideoWriter_fourcc(*fourcc), + fps, + (width, height), + isColor=is_color, + ) + + for _ in range(num_frames): + video_writer.write(image) + + video_writer.release() + return video_path + + +def cosine_similarity(A: npt.NDArray, + B: npt.NDArray, + axis: int = -1) -> npt.NDArray: + """Compute cosine similarity between two vectors.""" + return (np.sum(A * B, axis=axis) / + (np.linalg.norm(A, axis=axis) * np.linalg.norm(B, axis=axis))) + + +def normalize_image(image: npt.NDArray) -> npt.NDArray: + """Normalize image to [0, 1] range.""" + return image.astype(np.float32) / 255.0 \ No newline at end of file diff --git a/vllm/assets/video.py b/vllm/assets/video.py index 16412121cf0..8ab0e9760be 100644 --- a/vllm/assets/video.py +++ b/vllm/assets/video.py @@ -59,7 +59,9 @@ def video_to_ndarrays(path: str, num_frames: int = -1) -> npt.NDArray: if idx in frame_indices: # only decompress needed ret, frame = cap.retrieve() if ret: - frames.append(frame) + # OpenCV uses BGR format, we need to convert it to RGB + # for PIL and transformers compatibility + frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) frames = np.stack(frames) if len(frames) < num_frames: @@ -71,10 +73,7 @@ def video_to_ndarrays(path: str, num_frames: int = -1) -> npt.NDArray: def video_to_pil_images_list(path: str, 
num_frames: int = -1) -> list[Image.Image]: frames = video_to_ndarrays(path, num_frames) - return [ - Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) - for frame in frames - ] + return [Image.fromarray(frame) for frame in frames] def video_get_metadata(path: str) -> dict[str, Any]: From e2203a0c4715a83220fefdf719af4065799ac63b Mon Sep 17 00:00:00 2001 From: Lucas Wilkinson Date: Sat, 19 Jul 2025 05:18:47 -0400 Subject: [PATCH 199/552] [BugFix] Fix potential cuda-graph IMA (#21196) Signed-off-by: Lucas Wilkinson Signed-off-by: x22x22 --- vllm/v1/attention/backends/utils.py | 5 ----- vllm/v1/worker/gpu_model_runner.py | 7 ++++++- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/vllm/v1/attention/backends/utils.py b/vllm/v1/attention/backends/utils.py index 65c3baa6784..fc8649d587e 100644 --- a/vllm/v1/attention/backends/utils.py +++ b/vllm/v1/attention/backends/utils.py @@ -59,11 +59,6 @@ class CommonAttentionMetadata: block_table_tensor: torch.Tensor slot_mapping: torch.Tensor - def __post_init__(self): - # Fill unused with -1. Needed for reshape_and_cache in full cuda graph - # mode. - self.slot_mapping[self.num_actual_tokens:].fill_(-1) - M = TypeVar("M") diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 47b14d076ea..a5c44673114 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -684,7 +684,7 @@ def _prepare_inputs( self.seq_lens[:num_reqs].copy_(self.seq_lens_cpu[:num_reqs], non_blocking=True) - # Fill unused with -1. Needed for reshape_and_cache + # Fill unused with 0 for full cuda graph mode. self.seq_lens[num_reqs:].fill_(0) # Note: pad query_start_loc to be non-decreasing, as kernels # like FlashAttention requires that @@ -704,6 +704,11 @@ def _prepare_inputs( blk_table = self.input_batch.block_table[kv_cache_group_id] blk_table_tensor = blk_table.get_device_tensor()[:num_reqs] slot_mapping = blk_table.slot_mapping[:total_num_scheduled_tokens] + + # Fill unused with -1. Needed for reshape_and_cache in full cuda + # graph mode. + blk_table.slot_mapping[total_num_scheduled_tokens:].fill_(-1) + common_attn_metadata = CommonAttentionMetadata( query_start_loc=self.query_start_loc[:num_reqs + 1], query_start_loc_cpu=self.query_start_loc_cpu[:num_reqs + 1], From 10b820eb604cda39e3e85a0f2cbbef7299459252 Mon Sep 17 00:00:00 2001 From: shixianc <49539556+shixianc@users.noreply.github.com> Date: Sat, 19 Jul 2025 02:32:36 -0700 Subject: [PATCH 200/552] Add torch golden impl for moe_align_block_size kernel test (#20653) Signed-off-by: Shixian Cui Co-authored-by: Shixian Cui Signed-off-by: x22x22 --- .../kernels/moe/test_moe_align_block_size.py | 367 ++++++++++++++---- 1 file changed, 296 insertions(+), 71 deletions(-) diff --git a/tests/kernels/moe/test_moe_align_block_size.py b/tests/kernels/moe/test_moe_align_block_size.py index e980422a7b9..12ef9e776c3 100644 --- a/tests/kernels/moe/test_moe_align_block_size.py +++ b/tests/kernels/moe/test_moe_align_block_size.py @@ -1,90 +1,315 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import itertools +"""Tests for the MOE align block size function. + +Run `pytest tests/kernels/moe/test_moe_align_block_size.py`. 
+""" + +from typing import Optional import pytest import torch -from vllm import _custom_ops as ops from vllm.model_executor.layers.fused_moe.moe_align_block_size import ( - moe_align_block_size_triton) - - -@pytest.mark.parametrize( - "block_size,num_tokens,topk,num_experts", - list( - itertools.product( - [32, 64, 128, 256], # block_size - [ - 1, - 3, - 7, - 16, - 256, - 2256, - 4096, - ], # num_tokens - [1, 4, 16, 64], # topk - [64, 160, 256, 257, 260, 264], # num_experts - )), -) -def test_moe_align_block_size_compare_implementations(block_size, num_tokens, - topk, num_experts): - topk_ids = torch.stack([ - torch.randperm(num_experts, dtype=torch.int32, device="cuda")[:topk] - for _ in range(num_tokens) - ]) + moe_align_block_size) +from vllm.platforms import current_platform +from vllm.utils import round_up + +NUM_TOKENS = [1, 3, 7, 16, 256, 2256, 4096] +NUM_EXPERTS = [32, 160, 256, 257, 512] +TOP_KS = [1, 2, 16, 32] +BLOCK_SIZES = [32, 64, 128, 256] +current_platform.seed_everything(0) + + +def _group_tokens_by_expert( + sorted_ids: torch.Tensor, + expert_ids: torch.Tensor, + block_size: int, + valid_length: int, + total_tokens: int, +) -> dict: + num_blocks = valid_length // block_size + expert_tokens: dict[int, list[int]] = {} + + for block_idx in range(num_blocks): + expert_id = expert_ids[block_idx].item() + block_start = block_idx * block_size + block_end = min(block_start + block_size, valid_length) + + block_tokens = sorted_ids[block_start:block_end] + valid_tokens = block_tokens[block_tokens < total_tokens] + + if expert_id not in expert_tokens: + expert_tokens[expert_id] = [] + expert_tokens[expert_id].extend(valid_tokens.tolist()) + return expert_tokens + +def _verify_expert_level_sorting( + actual_sorted_ids: torch.Tensor, + golden_sorted_ids: torch.Tensor, + expert_ids: torch.Tensor, + block_size: int, + valid_length: int, + total_tokens: int, +): + """ + Verify that actual_sorted_ids follows the correct expert-level sorting. + The kerne limplementation may or may not preserve original token order + in topk_ids in the final sorted_ids however this does not impact quality. + """ + # Group tokens by expert from the golden implementation + golden_expert_tokens = _group_tokens_by_expert(golden_sorted_ids, + expert_ids, block_size, + valid_length, total_tokens) + + actual_expert_tokens = _group_tokens_by_expert(actual_sorted_ids, + expert_ids, block_size, + valid_length, total_tokens) + + assert set(golden_expert_tokens.keys()) == set( + actual_expert_tokens.keys()), ( + f"Expert IDs mismatch: golden={set(golden_expert_tokens.keys())}, " + f"actual={set(actual_expert_tokens.keys())}") + + for expert_id in golden_expert_tokens: + golden_tokens = torch.tensor(golden_expert_tokens[expert_id], + device=actual_sorted_ids.device) + actual_tokens = torch.tensor(actual_expert_tokens[expert_id], + device=actual_sorted_ids.device) + assert torch.equal( + torch.sort(golden_tokens)[0], + torch.sort(actual_tokens)[0]), ( + f"Expert {expert_id} token mismatch: " + f"golden={golden_expert_tokens[expert_id]}, " + f"actual={actual_expert_tokens[expert_id]}") + + +def torch_moe_align_block_size( + topk_ids: torch.Tensor, + block_size: int, + num_experts: int, + expert_map: Optional[torch.Tensor] = None, + pad_sorted_ids: bool = False, +) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: + """ + Golden torch implementation of moe_align_block_size. 
+ + This function aligns the token distribution across experts to be compatible + with block size for matrix multiplication by sorting tokens by expert and + padding to block boundaries. + """ max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1) + if pad_sorted_ids: + max_num_tokens_padded = round_up(max_num_tokens_padded, block_size) + + flattened_token_indices = torch.arange(topk_ids.numel(), + device=topk_ids.device, + dtype=torch.int32) + flattened_expert_ids = topk_ids.flatten() + sorted_expert_ids, sort_indices = torch.sort(flattened_expert_ids, + stable=True) + sorted_token_indices = flattened_token_indices[sort_indices] + + expert_token_counts = torch.zeros(num_experts, + dtype=torch.int64, + device=topk_ids.device) + for expert_id in range(num_experts): + mask = sorted_expert_ids == expert_id + expert_token_counts[expert_id] = mask.sum() + + expert_padded_counts = torch.zeros(num_experts, + dtype=torch.int64, + device=topk_ids.device) + for expert_id in range(num_experts): + original_count = expert_token_counts[expert_id] + if original_count > 0: + expert_padded_counts[expert_id] = ( + (original_count + block_size - 1) // block_size) * block_size - sorted_ids_cuda = torch.empty((max_num_tokens_padded, ), - dtype=torch.int32, - device=topk_ids.device) - sorted_ids_cuda.fill_(topk_ids.numel()) - max_num_m_blocks = max_num_tokens_padded // block_size - expert_ids_cuda = torch.zeros((max_num_m_blocks, ), - dtype=torch.int32, - device=topk_ids.device) - num_tokens_post_pad_cuda = torch.empty((1), - dtype=torch.int32, - device=topk_ids.device) - - sorted_ids_triton = torch.empty_like(sorted_ids_cuda) - sorted_ids_triton.fill_(topk_ids.numel()) - expert_ids_triton = torch.zeros_like(expert_ids_cuda) - num_tokens_post_pad_triton = torch.empty_like(num_tokens_post_pad_cuda) - - ops.moe_align_block_size( - topk_ids, - num_experts, + sorted_token_ids = torch.full( + (max_num_tokens_padded, ), + topk_ids.numel(), + dtype=torch.int32, + device=topk_ids.device, + ) + max_num_blocks = (max_num_tokens_padded + block_size - 1) // block_size + expert_ids = torch.zeros(max_num_blocks, + dtype=torch.int32, + device=topk_ids.device) + + current_pos = 0 + current_block = 0 + for expert_id in range(num_experts): + expert_mask = sorted_expert_ids == expert_id + expert_tokens = sorted_token_indices[expert_mask] + num_expert_tokens = expert_tokens.shape[0] + + if num_expert_tokens > 0: + sorted_token_ids[current_pos:current_pos + + num_expert_tokens] = (expert_tokens) + + expert_blocks_needed = expert_padded_counts[expert_id] // block_size + expert_ids[current_block:current_block + + expert_blocks_needed] = (expert_id) + + current_pos += expert_padded_counts[expert_id] + current_block += expert_blocks_needed + + total_padded_tokens = expert_padded_counts.sum() + num_tokens_post_pad = torch.tensor([total_padded_tokens], + dtype=torch.int32, + device=topk_ids.device) + + if expert_map is not None: + expert_ids = expert_map[expert_ids] + return sorted_token_ids, expert_ids, num_tokens_post_pad + + +@pytest.mark.parametrize("m", NUM_TOKENS) +@pytest.mark.parametrize("topk", TOP_KS) +@pytest.mark.parametrize("num_experts", NUM_EXPERTS) +@pytest.mark.parametrize("block_size", BLOCK_SIZES) +@pytest.mark.parametrize("pad_sorted_ids", [False, True]) +@pytest.mark.skipif(current_platform.is_rocm(), reason="Skip for rocm") +def test_moe_align_block_size(m: int, topk: int, num_experts: int, + block_size: int, pad_sorted_ids: bool): + """Test moe_align_block_size without expert mapping""" + 
topk_ids = torch.zeros((m, topk), device="cuda", dtype=torch.int32) + for i in range(m): + experts = torch.randperm(num_experts, device="cuda")[:topk] + topk_ids[i] = experts + + actual_sorted_ids, actual_expert_ids, actual_num_tokens = ( + moe_align_block_size( + topk_ids=topk_ids, + block_size=block_size, + num_experts=num_experts, + pad_sorted_ids=pad_sorted_ids, + )) + golden_sorted_ids, golden_expert_ids, golden_num_tokens = ( + torch_moe_align_block_size( + topk_ids=topk_ids, + block_size=block_size, + num_experts=num_experts, + pad_sorted_ids=pad_sorted_ids, + )) + + torch.testing.assert_close(actual_num_tokens, + golden_num_tokens, + atol=0, + rtol=0) + torch.testing.assert_close(actual_expert_ids, + golden_expert_ids, + atol=0, + rtol=0) + + # For sorted_token_ids, verify block-level correctness rather than exact + # order Tokens within each expert's blocks can be in any order, but expert + # regions must be correct + _verify_expert_level_sorting( + actual_sorted_ids, + golden_sorted_ids, + actual_expert_ids, block_size, - sorted_ids_cuda, - expert_ids_cuda, - num_tokens_post_pad_cuda, + actual_num_tokens.item(), + m * topk, ) - moe_align_block_size_triton( - topk_ids, - num_experts, + total_tokens = m * topk + assert actual_num_tokens.item() % block_size == 0, ( + "num_tokens_post_pad should be divisible by block_size") + assert actual_num_tokens.item() >= total_tokens, ( + "num_tokens_post_pad should be at least total_tokens") + valid_tokens = actual_sorted_ids[actual_sorted_ids < total_tokens] + assert len(valid_tokens) == total_tokens, ( + f"Should have exactly {total_tokens} valid tokens, " + f"got {len(valid_tokens)}") + assert (actual_expert_ids >= 0).all() and ( + actual_expert_ids + < num_experts).all(), "expert_ids should contain valid expert indices" + + +@pytest.mark.parametrize("m", [16, 32]) +@pytest.mark.parametrize("topk", [2, 4]) +@pytest.mark.parametrize("num_experts", [8]) +@pytest.mark.parametrize("block_size", [64]) +@pytest.mark.skipif(current_platform.is_rocm(), reason="Skip for rocm") +def test_moe_align_block_size_with_expert_map(m: int, topk: int, + num_experts: int, + block_size: int): + """Test moe_align_block_size with expert mapping (EP scenario)""" + topk_ids = torch.zeros((m, topk), device="cuda", dtype=torch.int32) + for i in range(m): + experts = torch.randperm(num_experts, device="cuda")[:topk] + topk_ids[i] = experts + + expert_map = torch.full((num_experts, ), + -1, + device="cuda", + dtype=torch.int32) + local_experts = list(range(0, num_experts, 2)) + for i, expert_id in enumerate(local_experts): + expert_map[expert_id] = i + + actual_sorted_ids, actual_expert_ids, actual_num_tokens = ( + moe_align_block_size( + topk_ids=topk_ids, + block_size=block_size, + num_experts=num_experts, + expert_map=expert_map, + )) + golden_sorted_ids, golden_expert_ids, golden_num_tokens = ( + torch_moe_align_block_size( + topk_ids=topk_ids, + block_size=block_size, + num_experts=num_experts, + expert_map=expert_map, + )) + + torch.testing.assert_close(actual_num_tokens, + golden_num_tokens, + atol=0, + rtol=0) + torch.testing.assert_close(actual_expert_ids, + golden_expert_ids, + atol=0, + rtol=0) + _verify_expert_level_sorting( + actual_sorted_ids, + golden_sorted_ids, + actual_expert_ids, block_size, - sorted_ids_triton, - expert_ids_triton, - num_tokens_post_pad_triton, + actual_num_tokens.item(), + m * topk, ) - assert torch.allclose(expert_ids_cuda, expert_ids_triton), ( - f"Expert IDs mismatch for block_size={block_size}, " - f"num_tokens={num_tokens}, 
topk={topk}\n" - f"CUDA expert_ids: {expert_ids_cuda}\n" - f"Triton expert_ids: {expert_ids_triton}") - assert torch.allclose( - num_tokens_post_pad_cuda, num_tokens_post_pad_triton), ( - f"Num tokens post pad mismatch for block_size={block_size}, " - f"num_tokens={num_tokens}, topk={topk}\n" - f"CUDA num_tokens_post_pad: {num_tokens_post_pad_cuda}\n" - f"Triton num_tokens_post_pad: {num_tokens_post_pad_triton}") +def test_moe_align_block_size_deterministic(): + m, topk, num_experts, block_size = 128, 2, 32, 64 + + torch.manual_seed(42) + topk_ids = torch.randint(0, + num_experts, (m, topk), + device="cuda", + dtype=torch.int32) + # expect the results to be reproducible + results = [] + for _ in range(5): + sorted_ids, expert_ids, num_tokens = moe_align_block_size( + topk_ids=topk_ids, block_size=block_size, num_experts=num_experts) + results.append( + (sorted_ids.clone(), expert_ids.clone(), num_tokens.clone())) -if __name__ == "__main__": - pytest.main([__file__]) + for i in range(1, len(results)): + assert torch.equal( + results[0][0], + results[i][0]), ("sorted_ids should be deterministic") + assert torch.equal( + results[0][1], + results[i][1]), ("expert_ids should be deterministic") + assert torch.equal( + results[0][2], + results[i][2]), ("num_tokens should be deterministic") From 71dd173d455ab545092baf0395eb92f894019293 Mon Sep 17 00:00:00 2001 From: Kaixi Hou Date: Sat, 19 Jul 2025 02:33:01 -0700 Subject: [PATCH 201/552] [NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency (#20645) Signed-off-by: kaixih Signed-off-by: mgoin Co-authored-by: mgoin Signed-off-by: x22x22 --- vllm/envs.py | 11 +- .../model_executor/layers/fused_moe/config.py | 2 +- .../layers/fused_moe/fused_moe.py | 100 +++++++++++++++++- .../model_executor/layers/quantization/fp8.py | 82 ++++++++++---- .../layers/quantization/modelopt.py | 9 +- vllm/utils/flashinfer.py | 14 ++- 6 files changed, 187 insertions(+), 31 deletions(-) diff --git a/vllm/envs.py b/vllm/envs.py index 261cc7855b7..0896ae3a96c 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -119,7 +119,8 @@ VLLM_TPU_BUCKET_PADDING_GAP: int = 0 VLLM_TPU_MOST_MODEL_LEN: Optional[int] = None VLLM_USE_DEEP_GEMM: bool = False - VLLM_USE_FLASHINFER_MOE: bool = False + VLLM_USE_FLASHINFER_MOE_FP8: bool = False + VLLM_USE_FLASHINFER_MOE_FP4: bool = False VLLM_XGRAMMAR_CACHE_MB: int = 0 VLLM_MSGPACK_ZERO_COPY_THRESHOLD: int = 256 VLLM_ALLOW_INSECURE_SERIALIZATION: bool = False @@ -854,9 +855,13 @@ def get_vllm_port() -> Optional[int]: "VLLM_USE_DEEP_GEMM": lambda: bool(int(os.getenv("VLLM_USE_DEEP_GEMM", "0"))), + # Allow use of FlashInfer MoE kernels for fused moe ops. + "VLLM_USE_FLASHINFER_MOE_FP8": + lambda: bool(int(os.getenv("VLLM_USE_FLASHINFER_MOE_FP8", "0"))), + # Allow use of FlashInfer CUTLASS kernels for fused moe ops. - "VLLM_USE_FLASHINFER_MOE": - lambda: bool(int(os.getenv("VLLM_USE_FLASHINFER_MOE", "0"))), + "VLLM_USE_FLASHINFER_MOE_FP4": + lambda: bool(int(os.getenv("VLLM_USE_FLASHINFER_MOE_FP4", "0"))), # Control the cache sized used by the xgrammar compiler. The default # of 512 MB should be enough for roughly 1000 JSON schemas. 
diff --git a/vllm/model_executor/layers/fused_moe/config.py b/vllm/model_executor/layers/fused_moe/config.py index 9bebb6a65fc..51c421bd228 100644 --- a/vllm/model_executor/layers/fused_moe/config.py +++ b/vllm/model_executor/layers/fused_moe/config.py @@ -191,7 +191,7 @@ def use_deepep_ll_kernels(self): @property def use_flashinfer_cutlass_kernels(self): - return (envs.VLLM_USE_FLASHINFER_MOE + return (envs.VLLM_USE_FLASHINFER_MOE_FP4 and has_flashinfer_cutlass_fused_moe()) @staticmethod diff --git a/vllm/model_executor/layers/fused_moe/fused_moe.py b/vllm/model_executor/layers/fused_moe/fused_moe.py index aec5d7b252e..c412f695ae7 100644 --- a/vllm/model_executor/layers/fused_moe/fused_moe.py +++ b/vllm/model_executor/layers/fused_moe/fused_moe.py @@ -28,7 +28,7 @@ from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( TopKWeightAndReduceNoOP) from vllm.model_executor.layers.fused_moe.utils import ( - _resize_cache, moe_kernel_quantize_input) + _resize_cache, moe_kernel_quantize_input, per_token_group_quant_fp8) from vllm.model_executor.layers.quantization.utils.mxfp4_utils import ( dequant_mxfp4) from vllm.platforms import current_platform @@ -1061,6 +1061,104 @@ def inplace_fused_experts_fake( ) +def next_positive_power_of_2(x: int) -> int: + if x < 1: + return 1 + return 1 << (x - 1).bit_length() + + +def _get_tile_tokens_dim(num_tokens, top_k, num_experts): + # Guess tokens per expert assuming perfect expert distribution first. + num_tokens_per_expert = (num_tokens * top_k) // num_experts + # And pad the number to the next power of 2. + tile_tokens_dim = next_positive_power_of_2(num_tokens_per_expert) + # Cap to 8-64 tokens per CTA tile as it's the range supported by the kernel. + tile_tokens_dim = min(max(tile_tokens_dim, 8), 64) + return tile_tokens_dim + + +def flashinfer_fused_moe_blockscale_fp8( + routing_logits: torch.Tensor, + routing_bias: torch.Tensor, + x: torch.Tensor, + w13_weight: torch.Tensor, + w13_weight_scale_inv: torch.Tensor, + w2_weight: torch.Tensor, + w2_weight_scale_inv: torch.Tensor, + global_num_experts: int, + top_k: int, + num_expert_group: int, + topk_group: int, + intermediate_size: int, + expert_offset: int, + local_num_experts: int, + block_shape: list[int], + routed_scaling: float = 1.0) -> torch.Tensor: + from vllm.utils.flashinfer import flashinfer_trtllm_fp8_block_scale_moe + assert top_k <= global_num_experts + assert top_k <= 8 + assert topk_group <= 4 + assert global_num_experts > num_expert_group + assert global_num_experts % num_expert_group == 0 + assert global_num_experts % 4 == 0 + assert top_k < (topk_group * global_num_experts / num_expert_group) + assert block_shape == [128, 128] + + a_q, a_sf = per_token_group_quant_fp8(x, block_shape[1]) + # NOTE: scales of hidden states have to be transposed! 
+ a_sf_t = a_sf.t().contiguous() + return flashinfer_trtllm_fp8_block_scale_moe( + routing_logits=routing_logits, + routing_bias=routing_bias, + hidden_states=a_q, + hidden_states_scale=a_sf_t, + gemm1_weights=w13_weight, + gemm1_weights_scale=w13_weight_scale_inv, + gemm2_weights=w2_weight, + gemm2_weights_scale=w2_weight_scale_inv, + num_experts=global_num_experts, + top_k=top_k, + n_group=num_expert_group, + topk_group=topk_group, + intermediate_size=intermediate_size, + local_expert_offset=expert_offset, + local_num_experts=local_num_experts, + routed_scaling_factor=routed_scaling, + tile_tokens_dim=_get_tile_tokens_dim(x.shape[0], top_k, + global_num_experts), + routing_method_type=2, # DeepSeek-styled routing method + ) + + +def flashinfer_fused_moe_blockscale_fp8_fake( + routing_logits: torch.Tensor, + routing_bias: torch.Tensor, + x: torch.Tensor, + w13_weight: torch.Tensor, + w13_weight_scale_inv: torch.Tensor, + w2_weight: torch.Tensor, + w2_weight_scale_inv: torch.Tensor, + global_num_experts: int, + top_k: int, + num_expert_group: int, + topk_group: int, + intermediate_size: int, + expert_offset: int, + local_num_experts: int, + block_shape: list[int], + routed_scaling: float = 1.0) -> torch.Tensor: + return torch.empty_like(x) + + +direct_register_custom_op( + op_name="flashinfer_fused_moe_blockscale_fp8", + op_func=flashinfer_fused_moe_blockscale_fp8, + mutates_args=[], + fake_impl=flashinfer_fused_moe_blockscale_fp8_fake, + tags=(torch.Tag.needs_fixed_stride_order, ), +) + + def outplace_fused_experts( hidden_states: torch.Tensor, w1: torch.Tensor, diff --git a/vllm/model_executor/layers/quantization/fp8.py b/vllm/model_executor/layers/quantization/fp8.py index 824dfe15ae2..35d7545d8c6 100644 --- a/vllm/model_executor/layers/quantization/fp8.py +++ b/vllm/model_executor/layers/quantization/fp8.py @@ -43,6 +43,7 @@ from vllm.scalar_type import scalar_types from vllm.utils import has_deep_gemm from vllm.utils.deep_gemm import is_blackwell_deep_gemm_used +from vllm.utils.flashinfer import has_flashinfer_moe if TYPE_CHECKING: from vllm.model_executor.models.utils import WeightsMapper @@ -52,6 +53,11 @@ logger = init_logger(__name__) +def _swap_w13_to_w31(x: torch.Tensor) -> torch.Tensor: + return x.reshape(-1, 2, x.shape[-2] // 2, + x.shape[-1]).flip(dims=[1]).reshape(x.shape) + + def _is_col_major(x: torch.Tensor) -> bool: assert x.dim() == 3 b, m, n = x.shape @@ -473,6 +479,11 @@ def __init__(self, quant_config: Fp8Config): self.quant_config = quant_config self.block_quant = self.quant_config.weight_block_size is not None + self.flashinfer_moe_enabled = False + if envs.VLLM_USE_FLASHINFER_MOE_FP8 and has_flashinfer_moe(): + logger.info_once( + "Using FlashInfer MoE FP8 kernels for Fp8MoEMethod.") + self.flashinfer_moe_enabled = True # For GPUs that lack FP8 hardware support, we can leverage the Marlin # kernel for fast weight-only FP8 quantization self.use_marlin = (not current_platform.has_device_capability(89) @@ -674,6 +685,14 @@ def process_weights_after_loading(self, layer: Module) -> None: normalize_e4m3fn_to_e4m3fnuz( layer.w2_weight, layer.w2_weight_scale_inv, layer.w2_input_scale) + elif self.flashinfer_moe_enabled: + # NOTE: weights have to be swapped since the activation is + # applied on different half for flashinfer vs vllm + w13_weight = _swap_w13_to_w31(layer.w13_weight.data) + w13_weight_scale_inv = _swap_w13_to_w31( + layer.w13_weight_scale_inv.data) + w2_weight = layer.w2_weight.data + w2_weight_scale_inv = layer.w2_weight_scale_inv.data else: w13_weight = 
layer.w13_weight.data w13_weight_scale_inv = layer.w13_weight_scale_inv.data @@ -915,25 +934,25 @@ def apply( assert logical_to_physical_map is not None assert logical_replica_count is not None assert isinstance(layer, FusedMoE) - - topk_weights, topk_ids = FusedMoE.select_experts( - hidden_states=x, - router_logits=router_logits, - use_grouped_topk=use_grouped_topk, - top_k=top_k, - renormalize=renormalize, - topk_group=topk_group, - num_expert_group=num_expert_group, - custom_routing_function=custom_routing_function, - scoring_func=scoring_func, - e_score_correction_bias=e_score_correction_bias, - indices_type=self.topk_indices_dtype, - enable_eplb=enable_eplb, - expert_map=expert_map, - expert_load_view=expert_load_view, - logical_to_physical_map=logical_to_physical_map, - logical_replica_count=logical_replica_count, - ) + if not self.flashinfer_moe_enabled: + topk_weights, topk_ids = FusedMoE.select_experts( + hidden_states=x, + router_logits=router_logits, + use_grouped_topk=use_grouped_topk, + top_k=top_k, + renormalize=renormalize, + topk_group=topk_group, + num_expert_group=num_expert_group, + custom_routing_function=custom_routing_function, + scoring_func=scoring_func, + e_score_correction_bias=e_score_correction_bias, + indices_type=self.topk_indices_dtype, + enable_eplb=enable_eplb, + expert_map=expert_map, + expert_load_view=expert_load_view, + logical_to_physical_map=logical_to_physical_map, + logical_replica_count=logical_replica_count, + ) if self.rocm_aiter_moe_enabled: from vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe import ( # noqa: E501 @@ -971,6 +990,31 @@ def apply( apply_router_weight_on_input=apply_router_weight_on_input, global_num_experts=global_num_experts, expert_map=expert_map) + elif self.flashinfer_moe_enabled: + # Currently only work with DS models + assert self.block_quant + assert (renormalize and use_grouped_topk + and scoring_func == 'sigmoid' + and custom_routing_function is None) + assert activation == "silu" + return torch.ops.vllm.flashinfer_fused_moe_blockscale_fp8( + routing_logits=router_logits.to(torch.float32), + routing_bias=e_score_correction_bias, + x=x, + w13_weight=layer.w13_weight, + w13_weight_scale_inv=layer.w13_weight_scale_inv, + w2_weight=layer.w2_weight, + w2_weight_scale_inv=layer.w2_weight_scale_inv, + global_num_experts=global_num_experts, + top_k=top_k, + num_expert_group=num_expert_group, + topk_group=topk_group, + intermediate_size=layer.intermediate_size_per_partition, + expert_offset=layer.ep_rank * layer.local_num_experts, + local_num_experts=layer.local_num_experts, + block_shape=self.quant_config.weight_block_size, + routed_scaling=1.0, + ) else: return self.fused_experts( hidden_states=x, diff --git a/vllm/model_executor/layers/quantization/modelopt.py b/vllm/model_executor/layers/quantization/modelopt.py index 3807899fc3e..20def70d197 100644 --- a/vllm/model_executor/layers/quantization/modelopt.py +++ b/vllm/model_executor/layers/quantization/modelopt.py @@ -721,7 +721,7 @@ def __init__(self, quant_config: ModelOptNvFp4Config): self.use_marlin = False self.allow_flashinfer_cutlass = False - if envs.VLLM_USE_FLASHINFER_MOE: + if envs.VLLM_USE_FLASHINFER_MOE_FP4: if self.cutlass_nvfp4_supported and current_platform.is_cuda() \ and current_platform.is_device_capability(100): logger.info_once( @@ -800,10 +800,9 @@ def select_gemm_impl(self, prepare_finalize, assert moe.dp_size > 1 logger.debug_once("Using CutlassExpertsFp4") # Currently CutlassExpertsFp4 doesn't support DP - raise ValueError( - 
"CutlassExpertsFp4 doesn't support DP. " - "Use flashinfer CUTLASS FusedMoE(VLLM_USE_FLASHINFER_MOE)" - " backend instead.") + raise ValueError("CutlassExpertsFp4 doesn't support DP. " + "Use flashinfer CUTLASS FusedMoE backend instead " + "(set VLLM_USE_FLASHINFER_MOE_FP4=1)") return experts diff --git a/vllm/utils/flashinfer.py b/vllm/utils/flashinfer.py index dbd2dc39304..fd8b384a616 100644 --- a/vllm/utils/flashinfer.py +++ b/vllm/utils/flashinfer.py @@ -64,6 +64,8 @@ def wrapper(*args, **kwargs): # Create lazy wrappers for each function +flashinfer_trtllm_fp8_block_scale_moe = _lazy_import_wrapper( + "flashinfer.fused_moe", "trtllm_fp8_block_scale_moe") flashinfer_cutlass_fused_moe = _lazy_import_wrapper("flashinfer.fused_moe", "cutlass_fused_moe") fp4_quantize = _lazy_import_wrapper("flashinfer", "fp4_quantize") @@ -77,10 +79,16 @@ def wrapper(*args, **kwargs): fallback_fn=lambda *args, **kwargs: contextlib.nullcontext()) +@functools.cache +def has_flashinfer_moe() -> bool: + """Return ``True`` if FlashInfer MoE module is available.""" + return importlib.util.find_spec("flashinfer.fused_moe") is not None + + @functools.cache def has_flashinfer_cutlass_fused_moe() -> bool: """Return ``True`` if FlashInfer CUTLASS fused MoE is available.""" - if not has_flashinfer(): + if not has_flashinfer_moe(): return False # Check if all required functions are available @@ -99,9 +107,11 @@ def has_flashinfer_cutlass_fused_moe() -> bool: __all__ = [ "has_flashinfer", - "has_flashinfer_cutlass_fused_moe", + "flashinfer_trtllm_fp8_block_scale_moe", "flashinfer_cutlass_fused_moe", "fp4_quantize", "fp4_swizzle_blockscale", "autotune", + "has_flashinfer_moe", + "has_flashinfer_cutlass_fused_moe", ] From a15984d4c0c65bf44cbff1f4ca07ca0c6b3ea6f1 Mon Sep 17 00:00:00 2001 From: 22quinn <33176974+22quinn@users.noreply.github.com> Date: Sat, 19 Jul 2025 02:40:38 -0700 Subject: [PATCH 202/552] [Bugfix][Frontend] Fix openai CLI arg `middleware` (#21220) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: x22x22 --- tests/entrypoints/openai/test_cli_args.py | 10 ++++++++++ vllm/entrypoints/openai/cli_args.py | 4 ++++ 2 files changed, 14 insertions(+) diff --git a/tests/entrypoints/openai/test_cli_args.py b/tests/entrypoints/openai/test_cli_args.py index 504fd72aa4a..b20838956d7 100644 --- a/tests/entrypoints/openai/test_cli_args.py +++ b/tests/entrypoints/openai/test_cli_args.py @@ -153,3 +153,13 @@ def test_chat_template_validation_for_sad_paths(serve_parser): args = serve_parser.parse_args(args=["--chat-template", "does/not/exist"]) with pytest.raises(ValueError): validate_parsed_serve_args(args) + + +@pytest.mark.parametrize( + "cli_args, expected_middleware", + [(["--middleware", "middleware1", "--middleware", "middleware2" + ], ["middleware1", "middleware2"]), ([], [])]) +def test_middleware(serve_parser, cli_args, expected_middleware): + """Ensure multiple middleware args are parsed properly""" + args = serve_parser.parse_args(args=cli_args) + assert args.middleware == expected_middleware diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index 6456d009b95..28857f8caef 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -215,6 +215,10 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: # Special case: Middleware needs append action frontend_kwargs["middleware"]["action"] = "append" + frontend_kwargs["middleware"]["type"] = str + if "nargs" in frontend_kwargs["middleware"]: + 
del frontend_kwargs["middleware"]["nargs"] + frontend_kwargs["middleware"]["default"] = [] # Special case: Tool call parser shows built-in options. valid_tool_parsers = list(ToolParserManager.tool_parsers.keys()) From 1d0efe4dd579b81d6c0df93dbaa9aa59ef306fda Mon Sep 17 00:00:00 2001 From: "Li, Jiang" Date: Sat, 19 Jul 2025 20:13:55 +0800 Subject: [PATCH 203/552] [bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code (#21032) Signed-off-by: jiang1.li Signed-off-by: x22x22 --- .../scripts/hardware_ci/run-cpu-test.sh | 4 +- docs/getting_started/installation/cpu.md | 10 +- requirements/cpu.txt | 2 - vllm/envs.py | 5 +- vllm/platforms/cpu.py | 64 ++++++ vllm/v1/worker/cpu_model_runner.py | 7 +- vllm/v1/worker/cpu_worker.py | 202 ++++++------------ 7 files changed, 144 insertions(+), 150 deletions(-) diff --git a/.buildkite/scripts/hardware_ci/run-cpu-test.sh b/.buildkite/scripts/hardware_ci/run-cpu-test.sh index afe3e4b7ef6..e3d47a0e6c1 100644 --- a/.buildkite/scripts/hardware_ci/run-cpu-test.sh +++ b/.buildkite/scripts/hardware_ci/run-cpu-test.sh @@ -24,8 +24,8 @@ numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$NUMA_NODE numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$NUMA_NODE"-avx2 --target vllm-test -f docker/Dockerfile.cpu . # Run the image, setting --shm-size=4g for tensor parallel. -docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_OMP_THREADS_BIND="$OMP_CORE_RANGE" --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE" -docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_OMP_THREADS_BIND="$OMP_CORE_RANGE" --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2 +docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE" +docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2 function cpu_tests() { set -e diff --git a/docs/getting_started/installation/cpu.md b/docs/getting_started/installation/cpu.md index 14c9984487f..d77e7383650 100644 --- a/docs/getting_started/installation/cpu.md +++ b/docs/getting_started/installation/cpu.md @@ -94,8 +94,8 @@ Currently, there are no pre-built CPU wheels. ## Related runtime environment variables - `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is `0`. -- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. 
For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. By setting to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. By setting to `all`, the OpenMP threads of each rank uses all CPU cores available on the system. Default value is `auto`.
-- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `0`.
+- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads; it can be set to CPU id lists or `auto` (the default). For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. By setting to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node respectively.
+- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `None`. If the value is not set and `auto` thread binding is used, no CPU will be reserved when `world_size == 1`, and 1 CPU per rank will be reserved when `world_size > 1`.
 - `VLLM_CPU_MOE_PREPACK` (x86 only): whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
 - `VLLM_CPU_SGL_KERNEL` (x86 only, Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False).
 
@@ -123,9 +123,13 @@ export VLLM_CPU_NUM_OF_RESERVED_CPU=1
 vllm serve facebook/opt-125m --dtype=bfloat16
 ```
 
+Note: it is recommended to manually reserve 1 CPU for the vLLM front-end process when `world_size == 1`.
+
 ### How to decide `VLLM_CPU_OMP_THREADS_BIND`?
 
-- Bind each OpenMP thread to a dedicated physical CPU core respectively, or use auto thread binding feature by default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
+- Default `auto` thread-binding is recommended for most cases. Ideally, each OpenMP thread will be bound to a dedicated physical core, the threads of each rank will be bound to the same NUMA node, and 1 CPU per rank will be reserved for other vLLM components when `world_size > 1`. If you hit any performance problems or unexpected binding behaviour, please try to bind threads manually as follows.
+
+- On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
 
 ??? 
console "Commands" diff --git a/requirements/cpu.txt b/requirements/cpu.txt index df3a3393563..d80354342bc 100644 --- a/requirements/cpu.txt +++ b/requirements/cpu.txt @@ -24,6 +24,4 @@ datasets # for benchmark scripts # Intel Extension for PyTorch, only for x86_64 CPUs intel-openmp==2024.2.1; platform_machine == "x86_64" intel_extension_for_pytorch==2.6.0; platform_machine == "x86_64" # torch>2.6.0+cpu has performance regression on x86 platform, see https://github.com/pytorch/pytorch/pull/151218 -py-libnuma; platform_system != "Darwin" -psutil; platform_system != "Darwin" triton==3.2.0; platform_machine == "x86_64" # Triton is required for torch 2.6+cpu, as it is imported in torch.compile. diff --git a/vllm/envs.py b/vllm/envs.py index 0896ae3a96c..c5f97de807a 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -44,7 +44,7 @@ VLLM_PP_LAYER_PARTITION: Optional[str] = None VLLM_CPU_KVCACHE_SPACE: int = 0 VLLM_CPU_OMP_THREADS_BIND: str = "" - VLLM_CPU_NUM_OF_RESERVED_CPU: int = 0 + VLLM_CPU_NUM_OF_RESERVED_CPU: Optional[int] = None VLLM_CPU_MOE_PREPACK: bool = True VLLM_CPU_SGL_KERNEL: bool = False VLLM_XLA_CACHE_PATH: str = os.path.join(VLLM_CACHE_ROOT, "xla_cache") @@ -442,7 +442,8 @@ def get_vllm_port() -> Optional[int]: # (CPU backend only) CPU cores not used by OMP threads . # Those CPU cores will not be used by OMP threads of a rank. "VLLM_CPU_NUM_OF_RESERVED_CPU": - lambda: int(os.getenv("VLLM_CPU_NUM_OF_RESERVED_CPU", "0")), + lambda: int(os.getenv("VLLM_CPU_NUM_OF_RESERVED_CPU", "0")) + if "VLLM_CPU_NUM_OF_RESERVED_CPU" in os.environ else None, # (CPU backend only) whether to use prepack for MoE layer. This will be # passed to ipex.llm.modules.GatedMLPMOE. On unsupported CPUs, you might diff --git a/vllm/platforms/cpu.py b/vllm/platforms/cpu.py index a0aa981f951..70c339c9bc9 100644 --- a/vllm/platforms/cpu.py +++ b/vllm/platforms/cpu.py @@ -1,9 +1,12 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import json import os import platform +import subprocess import sys +from dataclasses import dataclass from importlib.util import find_spec from typing import TYPE_CHECKING, Optional @@ -31,6 +34,35 @@ def get_max_threads(pid=0): raise NotImplementedError("Unsupported OS") +@dataclass +class LogicalCPUInfo: + id: int = -1 + physical_core: int = -1 + numa_node: int = -1 + + @classmethod + def _int(cls, value: str) -> int: + try: + int_value = int(value) + except Exception: + int_value = -1 + return int_value + + @staticmethod + def json_decoder(obj_dict: dict): + id = obj_dict.get("cpu") + physical_core = obj_dict.get("core") + numa_node = obj_dict.get("node") + + if not (id is None or physical_core is None or numa_node is None): + return LogicalCPUInfo( + id=LogicalCPUInfo._int(id), + physical_core=LogicalCPUInfo._int(physical_core), + numa_node=LogicalCPUInfo._int(numa_node)) + else: + return obj_dict + + class CpuPlatform(Platform): _enum = PlatformEnum.CPU device_name: str = "cpu" @@ -240,6 +272,38 @@ def check_and_update_config(cls, vllm_config: VllmConfig) -> None: vllm_config.scheduler_config.max_model_len, DEFAULT_MAX_NUM_BATCHED_TOKENS) + @classmethod + def get_allowed_cpu_memory_node_list( + cls) -> tuple[list[int], list[LogicalCPUInfo]]: + assert platform.system() == "Linux" + + # Init LogicalCPUInfo from lscpu + lscpu_output = subprocess.check_output("lscpu -J -e=CPU,CORE,NODE", + shell=True, + text=True) + logical_cpu_list: list[LogicalCPUInfo] = json.loads( + lscpu_output, 
object_hook=LogicalCPUInfo.json_decoder)['cpus'] + + # Filter CPUs with invalid attributes + logical_cpu_list = [ + x for x in logical_cpu_list + if -1 not in (x.id, x.physical_core, x.numa_node) + ] + + # Filter allowed CPUs + allowed_cpu_id_list = os.sched_getaffinity(0) + logical_cpu_list = [ + x for x in logical_cpu_list if x.id in allowed_cpu_id_list + ] + + # Get allowed NUMA nodes + allowed_numa_nodes = set() + for x in logical_cpu_list: + allowed_numa_nodes.add(x.numa_node) # type: ignore + allowed_numa_nodes_list = sorted(allowed_numa_nodes) + + return allowed_numa_nodes_list, logical_cpu_list + @classmethod def is_pin_memory_available(cls) -> bool: logger.warning("Pin memory is not supported on CPU.") diff --git a/vllm/v1/worker/cpu_model_runner.py b/vllm/v1/worker/cpu_model_runner.py index 136a9f08e82..ca94ac8c605 100644 --- a/vllm/v1/worker/cpu_model_runner.py +++ b/vllm/v1/worker/cpu_model_runner.py @@ -45,9 +45,10 @@ def replace_tensor(obj: Any, cpu_attr_name: str, if k.endswith("_cpu_tensor") and isinstance(v, torch.Tensor): replace_tensor(self.input_batch, k, k[:-11]) - for k, v in vars(self.input_batch.block_table).items(): - if k.endswith("_cpu") and isinstance(v, torch.Tensor): - replace_tensor(self.input_batch.block_table, k, k[:-4]) + for block_table in self.input_batch.block_table.block_tables: + for k, v in vars(block_table).items(): + if k.endswith("_cpu") and isinstance(v, torch.Tensor): + replace_tensor(block_table, k, k[:-4]) def load_model(self, eep_scale_up: bool = False) -> None: logger.info("Starting to load model %s...", self.model_config.model) diff --git a/vllm/v1/worker/cpu_worker.py b/vllm/v1/worker/cpu_worker.py index d31991b5b36..2dc28d93049 100644 --- a/vllm/v1/worker/cpu_worker.py +++ b/vllm/v1/worker/cpu_worker.py @@ -1,8 +1,8 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import os -from importlib import util -from typing import Optional +import platform +from typing import Callable, Optional import torch @@ -12,21 +12,14 @@ from vllm.logger import init_logger from vllm.model_executor.utils import set_random_seed from vllm.platforms import CpuArchEnum, current_platform +from vllm.platforms.cpu import CpuPlatform, LogicalCPUInfo from vllm.sequence import IntermediateTensors -from vllm.utils import PlaceholderModule from vllm.v1.core.sched.output import SchedulerOutput from vllm.v1.outputs import ModelRunnerOutput from vllm.v1.worker.cpu_model_runner import CPUModelRunner from vllm.v1.worker.gpu_worker import (Worker, init_worker_distributed_environment) -try: - import psutil - from numa import info -except ImportError: - psutil = PlaceholderModule("psutil") # type: ignore[assignment] - numa = PlaceholderModule("numa") # type: ignore[assignment] - logger = init_logger(__name__) @@ -45,20 +38,21 @@ def __init__(self, is_driver_worker=is_driver_worker) self.parallel_config.disable_custom_all_reduce = True - self.manually_bind_threads_suggestion = ( - "To get better performance, please try to manually bind threads.") def init_device(self): # Setup OpenMP threads affinity. 
omp_cpuids = envs.VLLM_CPU_OMP_THREADS_BIND - self.local_omp_cpuid = "all" - if omp_cpuids == "auto": + if omp_cpuids == "auto" and platform.system() == "Linux": if current_platform.get_cpu_architecture() == CpuArchEnum.POWERPC: - self.local_omp_cpuid = ( - self.get_cpus_id_binding_based_on_numa_nodes_ppc64le()) + # For POWERPC SMT-8/4/2 + self.local_omp_cpuid = self._get_autobind_cpu_ids( + lambda cpus: [cpu for cpu in cpus if cpu.id % 8 < 4]) + elif current_platform.get_cpu_architecture() == CpuArchEnum.X86: + # For x86 SMT-2, use 1 CPU per core + self.local_omp_cpuid = self._get_autobind_cpu_ids( + lambda cpus: cpus[-1:]) else: - self.local_omp_cpuid = ( - self.get_cpus_id_binding_based_on_numa_nodes()) + self.local_omp_cpuid = "all" else: self.local_omp_cpuid = omp_cpuids.split("|")[self.rank] @@ -122,126 +116,58 @@ def execute_model( assert isinstance(output, ModelRunnerOutput) return output if self.is_driver_worker else None - def warn_inability_to_detect_numa(self) -> None: - logger.warning( - "Auto thread-binding failed due to the " - "inability to detect numa nodes. %s", - self.manually_bind_threads_suggestion) - - def warn_lack_of_numa_and_psutil(self) -> None: - logger.warning( - "Auto thread-binding failed due to " - "the lack of package numa and psutil. %s", - self.manually_bind_threads_suggestion) - - def warn_world_size_too_large(self, world_size: int, - node_to_cpus_len: int) -> None: - logger.warning( - "Auto thread-binding failed due to " - "world size: %d being larger than " - "allowed NUMA nodes number: %d. %s", world_size, node_to_cpus_len, - self.manually_bind_threads_suggestion) - - def get_cpus_allow_list_and_numa_size(self): - cpus_allow_list = psutil.Process().cpu_affinity() - numa_size = info.get_num_configured_nodes() - return cpus_allow_list, numa_size - - def auto_thread_binding_based_on_numa_nodes(self, world_size: int, - rank_to_cpus: str) -> str: - cpu_count = psutil.cpu_count(logical=False) - cpus_allow_list, numa_size = self.get_cpus_allow_list_and_numa_size() - if not numa_size: - self.warn_inability_to_detect_numa() - return rank_to_cpus - - cpu_count_per_numa = cpu_count // numa_size - num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU, - cpu_count_per_numa // 2) - - node_to_cpus = [] - for i in range(numa_size): - node_intersect = set( - info.node_to_cpus(i)).intersection(cpus_allow_list) - if bool(node_intersect): - node_to_cpus.append(list(node_intersect)) - - node_to_cpus_len = len(node_to_cpus) - if world_size > node_to_cpus_len: - self.warn_world_size_too_large(world_size, node_to_cpus_len) - else: - end = cpu_count_per_numa - num_of_reserved_cpu - rank_to_cpus_list = node_to_cpus[self.rank][:end] - rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list) - logger.info("auto thread-binding list: %s", rank_to_cpus) - return rank_to_cpus - - def libnuma_and_psutil_found(self) -> bool: - libnuma_found = util.find_spec("numa") is not None - psutil_found = util.find_spec("psutil") is not None - - return libnuma_found and psutil_found - - def get_cpus_id_binding_based_on_numa_nodes(self) -> str: - """Return CPUs id binding based on NUMA nodes. 
+ def _get_autobind_cpu_ids( + self, cpu_selector: Callable[[list[LogicalCPUInfo]], + list[LogicalCPUInfo]] + ) -> str: """ - rank_to_cpus = self.local_omp_cpuid - # Setup OpenMP thread affinity based on NUMA nodes automatically - world_size = self.vllm_config.parallel_config.world_size - if self.libnuma_and_psutil_found(): - rank_to_cpus = self.auto_thread_binding_based_on_numa_nodes( - world_size, rank_to_cpus) - else: - self.warn_lack_of_numa_and_psutil() - return rank_to_cpus - - def select_threads_per_power_core(self, - node_cpu_ids: list[int]) -> list[int]: - return [cpu for cpu in node_cpu_ids if cpu % 8 < 4] - - def auto_thread_binding_based_on_numa_nodes_ppc64le( - self, world_size: int, rank_to_cpus: str) -> str: - cpus_allow_list, numa_size = self.get_cpus_allow_list_and_numa_size() - if not numa_size: - self.warn_inability_to_detect_numa() - return rank_to_cpus - - node_to_cpus = [] - for i in range(numa_size): - node_intersect = set( - info.node_to_cpus(i)).intersection(cpus_allow_list) - if bool(node_intersect): - node_to_cpus.append(sorted(list(node_intersect))) - - node_to_cpus_len = len(node_to_cpus) - if world_size > node_to_cpus_len: - self.warn_world_size_too_large(world_size, node_to_cpus_len) - else: - node_cpus_this_rank = node_to_cpus[self.rank] - node_cpus_this_rank = self.select_threads_per_power_core( - node_cpus_this_rank) - cpu_count_per_numa = len(node_cpus_this_rank) - num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU, - cpu_count_per_numa // 2) - end = cpu_count_per_numa - num_of_reserved_cpu - rank_to_cpus_list = node_cpus_this_rank[:end] - rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list) - logger.info("ppc64le thread-binding list: %s", rank_to_cpus) - return rank_to_cpus - - def get_cpus_id_binding_based_on_numa_nodes_ppc64le(self) -> str: - """ - Power (ppc64le) specific: Selects a subset of threads per core for - each NUMA node.This is robust to SMT mode (SMT-8, SMT-4, etc) - because the OS only exposes available threads.This maximizes - performance by avoiding oversubscription of logical CPUs on Power. + Return CPU ids to bind based on NUMA nodes. + Currently for rank N, only CPU ids on the N-th node in available NUMA + node list will be selected. + Args: + cpu_selector: a callable object to select CPUs from a CPU list + of a physical core. The input is a LogicalCPUInfo list, sorted by + the LogicalCPUInfo.id. A selected LogicalCPUInfo list should be + returned. """ - rank_to_cpus = self.local_omp_cpuid - world_size = self.vllm_config.parallel_config.world_size - if self.libnuma_and_psutil_found(): - rank_to_cpus = self.auto_thread_binding_based_on_numa_nodes_ppc64le( - world_size, rank_to_cpus) - else: - self.warn_lack_of_numa_and_psutil() - return rank_to_cpus + allowed_numa_nodes, logical_cpu_list = \ + CpuPlatform.get_allowed_cpu_memory_node_list() + assert len(allowed_numa_nodes) >= self.parallel_config.world_size, ( + f"No enough allowed NUMA nodes to bind threads of " + f"{self.parallel_config.world_size} CPUWorkers. " + f"Allowed NUMA nodes are {allowed_numa_nodes}. 
" + "Please try to bind threads manually.") + + # Get CPUs on NUMA node `allowed_numa_nodes[local_rank]`` + selected_numa_node = allowed_numa_nodes[ + self.local_rank] # type: ignore + logical_cpu_list = [ + x for x in logical_cpu_list if x.numa_node == selected_numa_node + ] + + # Select CPUs from each physical core via cpu_selector + core_to_cpus: dict[int, list[LogicalCPUInfo]] = {} + for cpu_info in logical_cpu_list: + if cpu_info.physical_core not in core_to_cpus: + core_to_cpus[cpu_info.physical_core] = [] + core_to_cpus[cpu_info.physical_core].append(cpu_info) + logical_cpu_list = [] + for cpu_list in core_to_cpus.values(): + cpu_list = sorted(cpu_list, key=lambda x: x.id) + logical_cpu_list.extend(cpu_selector(cpu_list)) + logical_cpu_list = sorted(logical_cpu_list, key=lambda x: x.id) + + # Reserve CPUs for other processes + reserve_cpu_num = envs.VLLM_CPU_NUM_OF_RESERVED_CPU + if reserve_cpu_num is None: + reserve_cpu_num = 1 if self.parallel_config.world_size > 1 else 0 + assert len(logical_cpu_list) > reserve_cpu_num, ( + f"VLLM_CPU_NUM_OF_RESERVED_CPU ({reserve_cpu_num}) " + f"should less than {len(logical_cpu_list)}.") + if reserve_cpu_num != 0: + logical_cpu_list = logical_cpu_list[:-reserve_cpu_num] + + logger.info("auto thread-binding list (id, physical core): %s", + [(x.id, x.physical_core) for x in logical_cpu_list]) + return ",".join([str(x.id) for x in logical_cpu_list]) From 804b0ccb36bb8b35c75e356a97caf0d667bd35fd Mon Sep 17 00:00:00 2001 From: Rabi Mishra Date: Sat, 19 Jul 2025 17:45:07 +0530 Subject: [PATCH 204/552] Fix/remove some broken model executor tests (#21224) Signed-off-by: Rabi Mishra Signed-off-by: x22x22 --- tests/model_executor/test_guided_processors.py | 13 ------------- tests/model_executor/test_model_load_with_params.py | 6 +++--- 2 files changed, 3 insertions(+), 16 deletions(-) diff --git a/tests/model_executor/test_guided_processors.py b/tests/model_executor/test_guided_processors.py index f08c7f7efcc..721478f4244 100644 --- a/tests/model_executor/test_guided_processors.py +++ b/tests/model_executor/test_guided_processors.py @@ -189,19 +189,6 @@ def test_multiple_guided_options_not_allowed(sample_json_schema, sample_regex): GuidedDecodingParams(json=sample_json_schema, grammar="test grammar") -def test_guided_decoding_backend_options(): - """Test backend-specific options""" - with pytest.warns(DeprecationWarning): - guided_decoding_params = GuidedDecodingParams( - backend= - "xgrammar:no-fallback,disable-any-whitespace,no-additional-properties" - ) - assert guided_decoding_params.backend == "xgrammar" - assert guided_decoding_params.disable_fallback - assert guided_decoding_params.disable_any_whitespace - assert guided_decoding_params.disable_additional_properties - - def test_pickle_xgrammar_tokenizer_data(): try: import xgrammar as xgr diff --git a/tests/model_executor/test_model_load_with_params.py b/tests/model_executor/test_model_load_with_params.py index 4bdb651e517..1d2d9f9a65b 100644 --- a/tests/model_executor/test_model_load_with_params.py +++ b/tests/model_executor/test_model_load_with_params.py @@ -49,7 +49,7 @@ def test_model_loading_with_params(vllm_runner): def check_model(model): assert isinstance(model, BertEmbeddingModel) - assert isinstance(model._pooler, CLSPool) + assert isinstance(model.pooler.pooling, CLSPool) vllm_model.apply_model(check_model) @@ -87,7 +87,7 @@ def test_roberta_model_loading_with_params(vllm_runner): def check_model(model): assert isinstance(model, RobertaEmbeddingModel) - assert isinstance(model._pooler, 
MeanPool) + assert isinstance(model.pooler.pooling, MeanPool) vllm_model.apply_model(check_model) @@ -114,7 +114,7 @@ def test_facebook_roberta_model_loading_with_params(vllm_runner): def check_model(model): assert isinstance(model, RobertaEmbeddingModel) assert not hasattr(model, "lm_head") - assert isinstance(model._pooler, CLSPool) + assert isinstance(model.pooler.pooling, CLSPool) vllm_model.apply_model(check_model) From f0f36524f8017f8f511bb9f17e2aa5072fbdca3c Mon Sep 17 00:00:00 2001 From: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Date: Sat, 19 Jul 2025 21:16:48 +0900 Subject: [PATCH 205/552] [CI/CD][bugfix]fix: error argument to loads has incompatible type (#21223) Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee Signed-off-by: x22x22 --- vllm/engine/arg_utils.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index d352a22a6d9..1ca4917de26 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -1266,8 +1266,8 @@ def create_engine_config( ) observability_config = ObservabilityConfig( - show_hidden_metrics_for_version=self. - show_hidden_metrics_for_version, + show_hidden_metrics_for_version=( + self.show_hidden_metrics_for_version), otlp_traces_endpoint=self.otlp_traces_endpoint, collect_detailed_traces=self.collect_detailed_traces, ) From f04dea677d044f9818932b3daa2bf1bd63741ddb Mon Sep 17 00:00:00 2001 From: Jiayi Yan <66017932+1195343015@users.noreply.github.com> Date: Sat, 19 Jul 2025 21:58:07 +0800 Subject: [PATCH 206/552] [Docs] Update the link to the 'Prometheus/Grafana' example (#21225) Signed-off-by: x22x22 --- docs/design/v1/metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/v1/metrics.md b/docs/design/v1/metrics.md index 7156ee9dd3e..eec42d79d82 100644 --- a/docs/design/v1/metrics.md +++ b/docs/design/v1/metrics.md @@ -61,7 +61,7 @@ These are documented under [Inferencing and Serving -> Production Metrics](../.. ### Grafana Dashboard -vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard. +vLLM also provides [a reference example](https://docs.vllm.ai/en/stable/examples/online_serving/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard. 
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important: From b6ad5b25a45d6059b7123b05e8686838791063b4 Mon Sep 17 00:00:00 2001 From: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Date: Sat, 19 Jul 2025 08:46:50 -0700 Subject: [PATCH 207/552] [BugFix] Make PD work with Ray (#21072) Signed-off-by: Kourosh Hakhamaneshi Signed-off-by: x22x22 --- .../kv_connector/unit/test_nixl_connector.py | 117 +++++++----------- .../unit/test_output_aggreagator.py} | 37 ++---- .../kv_transfer/kv_connector/utils.py | 90 ++++++++++++++ .../kv_transfer/kv_connector/v1/base.py | 2 +- vllm/mocks/__init__.py | 0 vllm/mocks/mock_nixl_connector.py | 76 ++++++++++++ vllm/sequence.py | 6 + vllm/v1/executor/multiproc_executor.py | 86 ++----------- vllm/v1/executor/ray_distributed_executor.py | 57 +++++++-- vllm/v1/worker/gpu_model_runner.py | 49 +++++++- vllm/v1/worker/gpu_worker.py | 30 ++--- 11 files changed, 329 insertions(+), 221 deletions(-) rename tests/v1/{executor/test_multiproc_executor.py => kv_connector/unit/test_output_aggreagator.py} (72%) create mode 100644 vllm/mocks/__init__.py create mode 100644 vllm/mocks/mock_nixl_connector.py diff --git a/tests/v1/kv_connector/unit/test_nixl_connector.py b/tests/v1/kv_connector/unit/test_nixl_connector.py index c4f558b7acd..a0dfd54fb82 100644 --- a/tests/v1/kv_connector/unit/test_nixl_connector.py +++ b/tests/v1/kv_connector/unit/test_nixl_connector.py @@ -1,13 +1,14 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import os +import tempfile +import textwrap import time -import uuid -from collections import defaultdict -from typing import Optional from unittest.mock import patch import pytest +import ray from vllm import LLM from vllm.config import KVTransferConfig @@ -15,11 +16,32 @@ KVConnectorRole, NixlAgentMetadata, NixlConnector, NixlConnectorMetadata, NixlConnectorWorker) from vllm.forward_context import ForwardContext +from vllm.mocks.mock_nixl_connector import FakeNixlWrapper from vllm.sampling_params import SamplingParams from .utils import create_request, create_scheduler, create_vllm_config +def _make_stub_pkg() -> str: + """Return a directory that makes + `from nixl._api import nixl_agent` resolve to our FakeNixlWrapper.""" + td = tempfile.mkdtemp() + pkg_root = os.path.join(td, "nixl", "_api") + os.makedirs(pkg_root, exist_ok=True) + + stub = textwrap.dedent("""\ + # Forward the real FakeNixlWrapper that the driver already defined. + print("In fake package") + from vllm.mocks.mock_nixl_connector import FakeNixlWrapper as nixl_agent + """) + with open(os.path.join(pkg_root, "__init__.py"), "w") as f: + f.write(stub) + + # touch parent package + open(os.path.join(td, "nixl", "__init__.py"), "w").close() + return td + + def test_basic_interface(): """Unit test for basic NixlConnector interface functionality.""" @@ -87,77 +109,6 @@ def test_prompt_less_than_block_size(): assert len(scheduler_output.scheduled_new_reqs) == 1 -class FakeNixlWrapper: - """Mock implementation of NixlWrapper for testing. - - We don't inherit from nixl._api.nixl_agent because nixl may not be - installed. 
- """ - - AGENT_METADATA = b"fake_agent_metadata" - REMOTE_AGENT_NAME = "remote_agent" - - def __init__(self, agent_name: str, *args, **kwargs): - self._cycles_before_xfer_done = 0 - self._check_xfer_state_cycles: defaultdict[int, int] = defaultdict( - lambda: 0) - - def get_reg_descs(self, caches_data, memory_type: str) -> list: - return [str(uuid.uuid4()) for _ in caches_data] - - def register_memory(self, descs) -> None: - pass - - def get_xfer_descs(self, blocks_data, memory_type: str) -> list: - return [str(uuid.uuid4()) for _ in blocks_data] - - def prep_xfer_dlist(self, agent_name: str, descs: list) -> int: - return uuid.uuid4().int - - def get_agent_metadata(self) -> bytes: - return self.AGENT_METADATA - - def add_remote_agent(self, agent_metadata: bytes) -> str: - return self.REMOTE_AGENT_NAME - - def get_new_notifs(self) -> dict[str, list[bytes]]: - # Used to collect done_sending, which we don't test yet. - return {} - - def check_xfer_state(self, handle: int) -> str: - if self._check_xfer_state_cycles[ - handle] >= self._cycles_before_xfer_done: - return "DONE" - self._check_xfer_state_cycles[handle] += 1 - return "PROC" - - def release_xfer_handle(self, handle: int) -> None: - pass - - def send_notif(self, agent_name: str, notif_msg: bytes) -> None: - pass - - def make_prepped_xfer(self, - xfer_type: str, - local_xfer_side_handle: int, - local_block_descs_ids: list[int], - remote_xfer_side_handle: int, - remote_block_descs_ids: list[int], - notif_msg: Optional[bytes] = None) -> int: - return uuid.uuid4().int - - def transfer(self, handle: int) -> str: - return "PROC" - - ############################################################ - # Follow are for changing the behavior during testing. - ############################################################ - - def set_cycles_before_xfer_done(self, cycles: int): - """Set the number of cycles before a transfer is considered done.""" - self._cycles_before_xfer_done = cycles - - class FakeNixlConnectorWorker(NixlConnectorWorker): REMOTE_ENGINE_ID = "remote_engine" @@ -378,10 +329,14 @@ def test_concurrent_load_kv( raise TimeoutError("Took too long to complete async handshake.") +# NOTE: resource cleanup in mp backend is a bit finicky, so the order in which +# we put here is important. First run ray, it will clean up the resources, then +# the rest of the tests. +@pytest.mark.parametrize("distributed_executor_backend", ["ray", None]) @patch( "vllm.distributed.kv_transfer.kv_connector.v1.nixl_connector.NixlWrapper", FakeNixlWrapper) -def test_abort_timeout_on_prefiller(monkeypatch): +def test_abort_timeout_on_prefiller(monkeypatch, distributed_executor_backend): """ Test lifecycle of an aborted Remote Prefill request hitting the timeout. 
-----> P @@ -399,11 +354,23 @@ def test_abort_timeout_on_prefiller(monkeypatch): timeout = 6 monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0") monkeypatch.setenv("VLLM_NIXL_ABORT_REQUEST_TIMEOUT", str(timeout)) + + # Build runtime_env only if we’re using Ray + if distributed_executor_backend == "ray": + runtime_env = { + "working_dir": _make_stub_pkg(), # ship stub package + "env_vars": { + "VLLM_NIXL_ABORT_REQUEST_TIMEOUT": str(timeout), + }, + } + ray.init(runtime_env=runtime_env) + llm = LLM( model=model_name, enforce_eager=True, gpu_memory_utilization=0.5, kv_transfer_config=kv_transfer_config, + distributed_executor_backend=distributed_executor_backend, ) remote_prefill_opts = { "do_remote_decode": True, diff --git a/tests/v1/executor/test_multiproc_executor.py b/tests/v1/kv_connector/unit/test_output_aggreagator.py similarity index 72% rename from tests/v1/executor/test_multiproc_executor.py rename to tests/v1/kv_connector/unit/test_output_aggreagator.py index c1425d82bec..cad73f68e9f 100644 --- a/tests/v1/executor/test_multiproc_executor.py +++ b/tests/v1/kv_connector/unit/test_output_aggreagator.py @@ -1,28 +1,12 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import threading -from collections import defaultdict from concurrent.futures import Future from typing import Optional -from vllm.v1.executor.multiproc_executor import MultiprocExecutor +from vllm.distributed.kv_transfer.kv_connector.utils import KVOutputAggregator from vllm.v1.outputs import ModelRunnerOutput -class DummyMultiprocExecutor(MultiprocExecutor): - - def __init__(self, output_rank, world_size): - # Manually initialize minimal required fields - self.output_rank = output_rank - self.world_size = world_size - self._send_remaining_count = defaultdict[str, - int](lambda: self.world_size) - self._recv_remaining_count = defaultdict[str, - int](lambda: self.world_size) - self.io_thread_pool = None - self.shutdown_event = threading.Event() - - class DummyModelRunnerOutput(ModelRunnerOutput): def __init__(self, @@ -33,14 +17,14 @@ def __init__(self, def test_aggregate_workers_output(): - executor = DummyMultiprocExecutor(output_rank=0, world_size=2) + aggregator = KVOutputAggregator(world_size=2) output1 = DummyModelRunnerOutput(finished_sending={'req1'}, finished_recving={'req2'}) output2 = DummyModelRunnerOutput(finished_sending=None, finished_recving=None) - aggregated = executor._aggregate_workers_output([output1, output2]) + aggregated = aggregator.aggregate([output1, output2]) assert aggregated is output1 assert aggregated.finished_sending is None @@ -51,7 +35,7 @@ def test_aggregate_workers_output(): output2 = DummyModelRunnerOutput(finished_sending={'req1'}, finished_recving=None) - aggregated = executor._aggregate_workers_output([output1, output2]) + aggregated = aggregator.aggregate([output1, output2]) assert aggregated is output1 assert aggregated.finished_sending == {'req1'} @@ -62,7 +46,7 @@ def test_aggregate_workers_output(): output2 = DummyModelRunnerOutput(finished_sending={'req1'}, finished_recving={'req2'}) - aggregated = executor._aggregate_workers_output([output1, output2]) + aggregated = aggregator.aggregate([output1, output2]) assert aggregated is output1 assert aggregated.finished_sending is None @@ -70,12 +54,11 @@ def test_aggregate_workers_output(): def test_async_aggregate_workers_output(): - executor = DummyMultiprocExecutor(output_rank=0, world_size=2) + aggregator = KVOutputAggregator(world_size=2) future1: 
Future[DummyModelRunnerOutput] = Future() future2: Future[DummyModelRunnerOutput] = Future() - result_future = executor._async_aggregate_workers_output( - [future1, future2]) + result_future = aggregator.async_aggregate([future1, future2]) output1 = DummyModelRunnerOutput(finished_sending={'req1'}, finished_recving={'req2'}) @@ -92,8 +75,7 @@ def test_async_aggregate_workers_output(): future1 = Future() future2 = Future() - result_future = executor._async_aggregate_workers_output( - [future1, future2]) + result_future = aggregator.async_aggregate([future1, future2]) output1 = DummyModelRunnerOutput(finished_sending=None, finished_recving=None) @@ -110,8 +92,7 @@ def test_async_aggregate_workers_output(): future1 = Future() future2 = Future() - result_future = executor._async_aggregate_workers_output( - [future1, future2]) + result_future = aggregator.async_aggregate([future1, future2]) output1 = DummyModelRunnerOutput(finished_sending=None, finished_recving=None) diff --git a/vllm/distributed/kv_transfer/kv_connector/utils.py b/vllm/distributed/kv_transfer/kv_connector/utils.py index 5cbc8ca3175..c179d6cc29b 100644 --- a/vllm/distributed/kv_transfer/kv_connector/utils.py +++ b/vllm/distributed/kv_transfer/kv_connector/utils.py @@ -3,12 +3,18 @@ """ KV cache helper for store. """ +from collections import defaultdict +from collections.abc import Sequence +from concurrent.futures import CancelledError, Future +from typing import Optional, cast + import torch import vllm.envs as envs from vllm import _custom_ops as ops from vllm.config import VllmConfig, get_current_vllm_config from vllm.logger import init_logger +from vllm.v1.outputs import ModelRunnerOutput logger = init_logger(__name__) @@ -107,3 +113,87 @@ def get_kv_connector_cache_layout(): "layout to HND for better xfer performance.") return "HND" return "NHD" + + +class KVOutputAggregator: + """Utility class to aggregate the output of all workers into a single + output corresponding to Rank 0 for scheduler.""" + + def __init__(self, world_size: int): + # Complete transfer tracker. 
Used by to track finished requests + # [req_id -> n_finished_workers] + self._recv_remaining_count = defaultdict[str, int](lambda: world_size) + self._send_remaining_count = defaultdict[str, int](lambda: world_size) + + def aggregate(self, + outputs: list[ModelRunnerOutput], + output_rank: int = 0) -> ModelRunnerOutput: + # aggregate finished_sending, finished_recving from all workers + + def update_finished_set(req_ids: Optional[set[str]], + remaining_count_dict: dict[str, int], + finished_set: set[str]) -> None: + for req_id in req_ids or (): + new_count = remaining_count_dict[req_id] - 1 + if new_count == 0: + finished_set.add(req_id) + del remaining_count_dict[req_id] + else: + remaining_count_dict[req_id] = new_count + + finished_sending = set[str]() + finished_recving = set[str]() + for output in outputs: + update_finished_set(output.finished_sending, + self._send_remaining_count, finished_sending) + update_finished_set(output.finished_recving, + self._recv_remaining_count, finished_recving) + + # select output of the worker specified by output_rank + output = outputs[output_rank] + + # set the aggregated finished_sending / finished_recving + # if output.finished_sending/recving is not empty, but the other ranks + # still have unfinished send/recv, we want to set the aggregated + # finished_sending/recving to None until all ranks have finished + # send/recv + output.finished_sending = finished_sending if finished_sending else None + output.finished_recving = finished_recving if finished_recving else None + + return output + + def async_aggregate(self, + output_futures: Sequence[Future[ModelRunnerOutput]], + output_rank: int = 0) -> Future[ModelRunnerOutput]: + """Takes a list of futures and returns a single future which resolves + to the respective list of outputs.""" + result_future: Future[ModelRunnerOutput] = Future() + + outputs: list[Optional[ModelRunnerOutput]] = [None + ] * len(output_futures) + + def make_callback(idx): + + def callback(fut): + if result_future.done(): + return + + try: + outputs[idx] = fut.result() + except CancelledError: + result_future.cancel() + except Exception as e: + result_future.set_exception(e) + + # this check assumes io_thread_pool uses a single thread + if all(outputs): + result_future.set_result( + self.aggregate(cast(list[ModelRunnerOutput], outputs), + output_rank)) + + return callback + + for i, output_future in enumerate(output_futures): + output_future.add_done_callback(make_callback(i)) + + return result_future diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/base.py b/vllm/distributed/kv_transfer/kv_connector/v1/base.py index 9459ab27aba..e1245775bea 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/base.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/base.py @@ -194,7 +194,7 @@ def get_finished( """ Notifies worker-side connector ids of requests that have finished generating tokens on the worker. - The scheduler process (via the MultiprocExecutor) will use this output + The scheduler process (via the Executors) will use this output to track which workers are done. 
Returns: diff --git a/vllm/mocks/__init__.py b/vllm/mocks/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/vllm/mocks/mock_nixl_connector.py b/vllm/mocks/mock_nixl_connector.py new file mode 100644 index 00000000000..54e2c5ee3b0 --- /dev/null +++ b/vllm/mocks/mock_nixl_connector.py @@ -0,0 +1,76 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import uuid +from collections import defaultdict +from typing import Optional + + +class FakeNixlWrapper: + """Mock implementation of NixlWrapper for testing. + + We don't inherit from nixl._api.nixl_agent because nixl may not be + installed. + """ + + AGENT_METADATA = b"fake_agent_metadata" + REMOTE_AGENT_NAME = "remote_agent" + + def __init__(self, agent_name: str, *args, **kwargs): + self._cycles_before_xfer_done = 0 + self._check_xfer_state_cycles: defaultdict[int, int] = defaultdict( + lambda: 0) + + def get_reg_descs(self, caches_data, memory_type: str) -> list: + return [str(uuid.uuid4()) for _ in caches_data] + + def register_memory(self, descs) -> None: + pass + + def get_xfer_descs(self, blocks_data, memory_type: str) -> list: + return [str(uuid.uuid4()) for _ in blocks_data] + + def prep_xfer_dlist(self, agent_name: str, descs: list) -> int: + return uuid.uuid4().int + + def get_agent_metadata(self) -> bytes: + return self.AGENT_METADATA + + def add_remote_agent(self, agent_metadata: bytes) -> str: + return self.REMOTE_AGENT_NAME + + def get_new_notifs(self) -> dict[str, list[bytes]]: + # Used to collect done_sending, which we don't test yet. + return {} + + def check_xfer_state(self, handle: int) -> str: + if self._check_xfer_state_cycles[ + handle] >= self._cycles_before_xfer_done: + return "DONE" + self._check_xfer_state_cycles[handle] += 1 + return "PROC" + + def release_xfer_handle(self, handle: int) -> None: + pass + + def send_notif(self, agent_name: str, notif_msg: bytes) -> None: + pass + + def make_prepped_xfer(self, + xfer_type: str, + local_xfer_side_handle: int, + local_block_descs_ids: list[int], + remote_xfer_side_handle: int, + remote_block_descs_ids: list[int], + notif_msg: Optional[bytes] = None) -> int: + return uuid.uuid4().int + + def transfer(self, handle: int) -> str: + return "PROC" + + ############################################################ + # Follow are for changing the behavior during testing. + ############################################################ + + def set_cycles_before_xfer_done(self, cycles: int): + """Set the number of cycles before a transfer is considered done.""" + self._cycles_before_xfer_done = cycles diff --git a/vllm/sequence.py b/vllm/sequence.py index 87ba74c6853..99208fbad65 100644 --- a/vllm/sequence.py +++ b/vllm/sequence.py @@ -1188,9 +1188,15 @@ class IntermediateTensors: """For all pipeline stages except the last, we need to return the hidden states and residuals to be sent to the next stage. This data structure contains the hidden states and residuals for a request. + + Each stage also needs to handle its own finished_sending and + finished_recving in case of kv transfer. 
""" tensors: dict[str, torch.Tensor] + # [req_ids] + finished_sending: Optional[set[str]] = None + finished_recving: Optional[set[str]] = None def __init__(self, tensors): # manually define this function, so that diff --git a/vllm/v1/executor/multiproc_executor.py b/vllm/v1/executor/multiproc_executor.py index 4a4144c4860..11ddade3eb7 100644 --- a/vllm/v1/executor/multiproc_executor.py +++ b/vllm/v1/executor/multiproc_executor.py @@ -9,8 +9,7 @@ import time import traceback import weakref -from collections import defaultdict -from concurrent.futures import CancelledError, Future, ThreadPoolExecutor +from concurrent.futures import Future, ThreadPoolExecutor from dataclasses import dataclass from enum import Enum, auto from functools import partial @@ -27,6 +26,7 @@ destroy_model_parallel) from vllm.distributed.device_communicators.shm_broadcast import (Handle, MessageQueue) +from vllm.distributed.kv_transfer.kv_connector.utils import KVOutputAggregator from vllm.executor.multiproc_worker_utils import ( _add_prefix, set_multiprocessing_worker_envs) from vllm.logger import init_logger @@ -118,13 +118,8 @@ def _init_executor(self) -> None: self.output_rank = self._get_output_rank() self.has_connector = self.vllm_config.kv_transfer_config is not None - - # Complete transfer tracker. Used by to track finished requests - # [req_id -> n_finished_workers] - self._recv_remaining_count = defaultdict[str, - int](lambda: self.world_size) - self._send_remaining_count = defaultdict[str, - int](lambda: self.world_size) + self.kv_output_aggregator = KVOutputAggregator( + self.parallel_config.world_size) def start_worker_monitor(self): workers = self.workers @@ -186,8 +181,9 @@ def execute_model( # aggregate all workers output to a single output if non_block: - return self._async_aggregate_workers_output(outputs) - return self._aggregate_workers_output(outputs) + return self.kv_output_aggregator.async_aggregate( + outputs, self.output_rank) + return self.kv_output_aggregator.aggregate(outputs, self.output_rank) def collective_rpc(self, method: Union[str, Callable], @@ -246,74 +242,6 @@ def get_response(w: WorkerProcHandle, except TimeoutError as e: raise TimeoutError(f"RPC call to {method} timed out.") from e - def _aggregate_workers_output( - self, outputs: list[ModelRunnerOutput]) -> ModelRunnerOutput: - # aggregate finished_sending, finished_recving from all workers - - def update_finished_set(req_ids: Optional[set[str]], - remaining_count_dict: dict[str, int], - finished_set: set[str]) -> None: - for req_id in req_ids or (): - new_count = remaining_count_dict[req_id] - 1 - if new_count == 0: - finished_set.add(req_id) - del remaining_count_dict[req_id] - else: - remaining_count_dict[req_id] = new_count - - finished_sending = set[str]() - finished_recving = set[str]() - for output in outputs: - update_finished_set(output.finished_sending, - self._send_remaining_count, finished_sending) - update_finished_set(output.finished_recving, - self._recv_remaining_count, finished_recving) - - # select output of the worker specified by output_rank - output = outputs[self.output_rank] - - # set the aggregated finished_sending / finished_recving - output.finished_sending = finished_sending if finished_sending else None - output.finished_recving = finished_recving if finished_recving else None - - return output - - def _async_aggregate_workers_output( - self, output_futures: list[Future[ModelRunnerOutput]] - ) -> (Future[ModelRunnerOutput]): - """Takes a list of futures and returns a single future which resolves - to 
the respective list of outputs.""" - result_future: Future[ModelRunnerOutput] = Future() - - outputs: list[Optional[ModelRunnerOutput]] = [None - ] * len(output_futures) - - def make_callback(idx): - - def callback(fut): - if result_future.done(): - return - - try: - outputs[idx] = fut.result() - except CancelledError: - result_future.cancel() - except Exception as e: - result_future.set_exception(e) - - # this check assumes io_thread_pool uses a single thread - if all(outputs): - result_future.set_result( - self._aggregate_workers_output( - cast(list[ModelRunnerOutput], outputs))) - - return callback - - for i, output_future in enumerate(output_futures): - output_future.add_done_callback(make_callback(i)) - - return result_future - @staticmethod def _ensure_worker_termination(worker_procs: list[BaseProcess]): """Ensure that all worker processes are terminated. Assumes workers have diff --git a/vllm/v1/executor/ray_distributed_executor.py b/vllm/v1/executor/ray_distributed_executor.py index eb659e4f9e4..b86ac048f52 100644 --- a/vllm/v1/executor/ray_distributed_executor.py +++ b/vllm/v1/executor/ray_distributed_executor.py @@ -2,33 +2,55 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from concurrent.futures import Future -from typing import Union +from typing import Optional, Union +from vllm.distributed.kv_transfer.kv_connector.utils import KVOutputAggregator from vllm.executor.ray_distributed_executor import ( # noqa RayDistributedExecutor as RayDistributedExecutorV0) +from vllm.logger import init_logger from vllm.v1.engine import ReconfigureDistributedRequest, ReconfigureRankType from vllm.v1.executor.abstract import Executor from vllm.v1.outputs import ModelRunnerOutput +logger = init_logger(__name__) + class FutureWrapper(Future): - """A wrapper around a Ray output reference to meet the interface - of .execute_model(). + """A wrapper around Ray output reference to meet the interface + of .execute_model(): The top level (core busy loop) expects .result() api + to block and return a single output. + + If aggregator is provided, the outputs from all workers are aggregated upon + the result() call. If not only the first worker's output is returned. """ - def __init__(self, ref): + def __init__(self, refs, aggregator: Optional[KVOutputAggregator] = None): super().__init__() - self.ref = ref + self.refs = refs + self.aggregator = aggregator def result(self, timeout=None): if timeout is not None: raise NotImplementedError("timeout is not supported") - return self.ref.get() + + if self.aggregator is None: + return self.refs[0].get() + + outputs = [ref.get() for ref in self.refs] + return self.aggregator.aggregate(outputs, output_rank=0) class RayDistributedExecutor(RayDistributedExecutorV0, Executor): """Ray distributed executor using Ray Compiled Graphs.""" + def _init_executor(self) -> None: + super()._init_executor() + + # KV connector setup + self.has_connector = self.vllm_config.kv_transfer_config is not None + self.kv_output_aggregator = KVOutputAggregator( + self.parallel_config.world_size) + @property def max_concurrent_batches(self) -> int: """Ray distributed executor supports pipeline parallelism, @@ -56,13 +78,24 @@ def execute_model( refs = self.forward_dag.execute(scheduler_output) # type: ignore - # When PP is not used, we block here until the result is available. + if not self.has_connector: + # Get output only from a single worker (output_rank) + # When PP is not used, we block here until the result is available. 
+ if self.max_concurrent_batches == 1: + return refs[0].get() + + # When PP is used, we return a FutureWrapper immediately so that + # the scheduler can yield to the next batch. + return FutureWrapper(refs) + + # Get output from all workers when connector is present if self.max_concurrent_batches == 1: - return refs[0].get() + # Block and get results from all workers + outputs = [ref.get() for ref in refs] + return self.kv_output_aggregator.aggregate(outputs) - # When PP is used, we return a FutureWrapper immediately so that - # the scheduler can yield to the next batch. - return FutureWrapper(refs[0]) + # Return a future that will aggregate outputs from all workers + return FutureWrapper(refs, self.kv_output_aggregator) def reinitialize_distributed( self, reconfig_request: ReconfigureDistributedRequest) -> None: @@ -70,4 +103,4 @@ def reinitialize_distributed( if reconfig_request.new_data_parallel_rank == \ ReconfigureRankType.SHUTDOWN_CURRENT_RANK: self.shutdown() - return + return \ No newline at end of file diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index a5c44673114..d5449a68bc2 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -1,6 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import copy import gc import time from contextlib import contextmanager @@ -1270,6 +1271,8 @@ def _pool( hidden_states: torch.Tensor, num_scheduled_tokens: int, num_scheduled_tokens_np: np.ndarray, + finished_sending: Optional[set[str]], + finished_recving: Optional[set[str]], ) -> ModelRunnerOutput: assert self.input_batch.num_reqs ==\ len(self.input_batch.pooling_params), \ @@ -1304,6 +1307,8 @@ def _pool( logprobs=None, prompt_logprobs_dict={}, pooler_output=pooler_output, + finished_sending=finished_sending, + finished_recving=finished_recving, ) @torch.inference_mode() @@ -1314,12 +1319,11 @@ def execute_model( ) -> Union[ModelRunnerOutput, IntermediateTensors]: self._update_states(scheduler_output) if not scheduler_output.total_num_scheduled_tokens: - if has_kv_transfer_group(): - with set_forward_context(None, self.vllm_config): - self.maybe_setup_kv_connector(scheduler_output) + if not has_kv_transfer_group(): + # Return empty ModelRunnerOutput if there's no work to do. + return EMPTY_MODEL_RUNNER_OUTPUT - # Return empty ModelRunnerOutput if there's no work to do. - return EMPTY_MODEL_RUNNER_OUTPUT + return self.kv_connector_no_forward(scheduler_output) # Prepare the decoder inputs. (attn_metadata, attention_cuda_graphs, logits_indices, @@ -1412,6 +1416,8 @@ def execute_model( ) self.maybe_wait_for_kv_save() + finished_sending, finished_recving = ( + self.get_finished_kv_transfers(scheduler_output)) if self.use_aux_hidden_state_outputs: hidden_states, aux_hidden_states = model_output @@ -1429,6 +1435,9 @@ def execute_model( if not get_pp_group().is_last_rank: # For mid-pipeline stages, return the hidden states. 
if not broadcast_pp_output: + if finished_sending or finished_recving: + hidden_states.finished_sending = finished_sending + hidden_states.finished_recving = finished_recving return hidden_states assert isinstance(hidden_states, IntermediateTensors) get_pp_group().send_tensor_dict(hidden_states.tensors, @@ -1437,7 +1446,8 @@ def execute_model( else: if self.input_batch.pooling_params: return self._pool(hidden_states, num_scheduled_tokens, - num_scheduled_tokens_np) + num_scheduled_tokens_np, finished_sending, + finished_recving) sample_hidden_states = hidden_states[logits_indices] logits = self.model.compute_logits(sample_hidden_states, None) @@ -1587,6 +1597,8 @@ def execute_model( logprobs=logprobs_lists, prompt_logprobs_dict=prompt_logprobs_dict, pooler_output=[], + finished_sending=finished_sending, + finished_recving=finished_recving, num_nans_in_logits=num_nans_in_logits, ) @@ -1711,6 +1723,31 @@ def maybe_wait_for_kv_save() -> None: if has_kv_transfer_group(): get_kv_transfer_group().wait_for_save() + @staticmethod + def get_finished_kv_transfers( + scheduler_output: "SchedulerOutput", + ) -> tuple[Optional[set[str]], Optional[set[str]]]: + if has_kv_transfer_group(): + return get_kv_transfer_group().get_finished( + scheduler_output.finished_req_ids) + return None, None + + def kv_connector_no_forward( + self, scheduler_output: "SchedulerOutput") -> ModelRunnerOutput: + # KV send/recv even if no work to do. + with set_forward_context(None, self.vllm_config): + self.maybe_setup_kv_connector(scheduler_output) + finished_sending, finished_recving = ( + self.get_finished_kv_transfers(scheduler_output)) + + if not finished_sending and not finished_recving: + return EMPTY_MODEL_RUNNER_OUTPUT + + output = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT) + output.finished_sending = finished_sending + output.finished_recving = finished_recving + return output + def propose_ngram_draft_token_ids( self, sampled_token_ids: list[list[int]], diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py index 2201481fa5b..6411874883e 100644 --- a/vllm/v1/worker/gpu_worker.py +++ b/vllm/v1/worker/gpu_worker.py @@ -15,9 +15,7 @@ from vllm.distributed import (ensure_model_parallel_initialized, init_distributed_environment, set_custom_all_reduce) -from vllm.distributed.kv_transfer import (ensure_kv_transfer_initialized, - get_kv_transfer_group, - has_kv_transfer_group) +from vllm.distributed.kv_transfer import ensure_kv_transfer_initialized from vllm.distributed.parallel_state import get_pp_group, get_tp_group from vllm.logger import init_logger from vllm.lora.request import LoRARequest @@ -335,25 +333,17 @@ def execute_model( assert isinstance(output, IntermediateTensors) get_pp_group().send_tensor_dict(output.tensors, all_gather_group=get_tp_group()) - output = EMPTY_MODEL_RUNNER_OUTPUT - assert isinstance(output, ModelRunnerOutput) - if has_kv_transfer_group(): - finished_sending, finished_recving = ( - get_kv_transfer_group().get_finished( - scheduler_output.finished_req_ids)) - if finished_sending or finished_recving: - if output is EMPTY_MODEL_RUNNER_OUTPUT: - output = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT) - output.finished_sending = finished_sending - output.finished_recving = finished_recving - - # Clear KVConnector state for this step. - get_kv_transfer_group().clear_connector_metadata() - - # with a connector, the scheduler expects output from all workers - return output + # In case of PP with kv transfer, we need to pass through the + # finished_sending and finished_recving buffers. 
+ empty_output = EMPTY_MODEL_RUNNER_OUTPUT + if output.finished_sending or output.finished_recving: + empty_output = copy.copy(empty_output) + empty_output.finished_sending = output.finished_sending + empty_output.finished_recving = output.finished_recving + output = empty_output + assert isinstance(output, ModelRunnerOutput) # return output only from the driver worker return output if self.is_driver_worker else None From 856ba1a84fe960ef7bf62855e67a226f2ab8c94c Mon Sep 17 00:00:00 2001 From: Thomas Parnell Date: Sat, 19 Jul 2025 21:27:21 +0200 Subject: [PATCH 208/552] [V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers (#21194) Signed-off-by: Thomas Parnell Signed-off-by: x22x22 --- .../models/language/generation/test_hybrid.py | 1 - vllm/config.py | 1 + .../layers/mamba/mamba_mixer2.py | 75 ++++++++++++++++--- vllm/model_executor/models/bamba.py | 11 +-- vllm/model_executor/models/falcon_h1.py | 8 +- .../model_executor/models/granitemoehybrid.py | 8 +- vllm/model_executor/models/mamba2.py | 8 +- vllm/model_executor/models/nemotron_h.py | 8 +- vllm/model_executor/models/zamba2.py | 8 +- vllm/v1/worker/gpu_model_runner.py | 3 - 10 files changed, 100 insertions(+), 31 deletions(-) diff --git a/tests/models/language/generation/test_hybrid.py b/tests/models/language/generation/test_hybrid.py index eba14e64553..e4294512338 100644 --- a/tests/models/language/generation/test_hybrid.py +++ b/tests/models/language/generation/test_hybrid.py @@ -104,7 +104,6 @@ def test_models( m.setenv("VLLM_ATTENTION_BACKEND", "FLASHINFER") with vllm_runner(model, max_num_seqs=MAX_NUM_SEQS, - enforce_eager=True, enable_prefix_caching=False) as vllm_model: vllm_v1_outputs = vllm_model.generate_greedy_logprobs( example_prompts, max_tokens, num_logprobs) diff --git a/vllm/config.py b/vllm/config.py index 5727e97a887..adf3fd701a9 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -4341,6 +4341,7 @@ def set_splitting_ops_for_v1(self): self.splitting_ops = [] if self.full_cuda_graph else [ "vllm.unified_attention", "vllm.unified_attention_with_output", + "vllm.mamba_mixer2", ] diff --git a/vllm/model_executor/layers/mamba/mamba_mixer2.py b/vllm/model_executor/layers/mamba/mamba_mixer2.py index f3850d31c82..e32b2be4d40 100644 --- a/vllm/model_executor/layers/mamba/mamba_mixer2.py +++ b/vllm/model_executor/layers/mamba/mamba_mixer2.py @@ -13,7 +13,7 @@ get_tensor_model_parallel_world_size, tensor_model_parallel_all_gather, tensor_model_parallel_all_reduce) -from vllm.forward_context import get_forward_context +from vllm.forward_context import ForwardContext, get_forward_context from vllm.model_executor.custom_op import CustomOp from vllm.model_executor.layers.linear import (ColumnParallelLinear, RowParallelLinear) @@ -33,6 +33,8 @@ LoaderFunction, composed_weight_loader, sharded_weight_loader) from vllm.model_executor.models.mamba_cache import MambaCacheParams from vllm.model_executor.utils import set_weight_attrs +from vllm.platforms import current_platform +from vllm.utils import direct_register_custom_op from vllm.v1.attention.backends.mamba_attn import Mamba2AttentionMetadata # Added by the IBM Team, 2024 @@ -424,14 +426,36 @@ def __init__( def forward_native( self, hidden_states: torch.Tensor, - conv_state: torch.Tensor, - ssm_state: torch.Tensor, + output: torch.Tensor, + mamba_cache_params: MambaCacheParams, + mamba2_metadata: Mamba2Metadata, + mup_vector: Optional[torch.Tensor] = None, ): pass + def forward( + self, + hidden_states: torch.Tensor, + output: torch.Tensor, + mamba_cache_params: 
MambaCacheParams, + mamba2_metadata: Mamba2Metadata, + mup_vector: Optional[torch.Tensor] = None, + ): + if not envs.VLLM_USE_V1: + CustomOp.forward(self, hidden_states, output, mamba_cache_params, + mamba2_metadata, mup_vector) + else: + torch.ops.vllm.mamba_mixer2( + hidden_states, + output, + self.prefix, + mup_vector, + ) + def forward_cuda( self, hidden_states: torch.Tensor, + output: torch.Tensor, mamba_cache_params: MambaCacheParams, mamba2_metadata: Mamba2Metadata, mup_vector: Optional[torch.Tensor] = None, @@ -517,6 +541,7 @@ def forward_cuda( num_prefill_tokens = attn_metadata.num_prefill_tokens # token count has_prefill = num_prefills > 0 has_decode = num_decodes > 0 + num_actual_tokens = num_prefill_tokens + num_decodes # NOTE: V0 put prefill before decode, v1 puts decode before prefill # Separate prefill and decode by splitting varlen input @@ -524,18 +549,18 @@ def forward_cuda( # NOTE: V0 put prefill before decode, v1 puts decode before prefill if envs.VLLM_USE_V1: hidden_states_B_C_d, hidden_states_B_C_p = torch.split( - hidden_states_B_C, + hidden_states_B_C[:num_actual_tokens], [num_decodes, num_prefill_tokens], dim=0, ) dt_d, dt_p = torch.split( - dt, + dt[:num_actual_tokens], [num_decodes, num_prefill_tokens], dim=0, ) # Split along batch dimension state_indices_tensor_d, state_indices_tensor_p = torch.split( - state_indices_tensor, + state_indices_tensor[:num_actual_tokens], [num_decodes, num_prefills], dim=0, ) @@ -696,11 +721,10 @@ def forward_cuda( # GatedRMSNorm internally applying SiLU to the gate # SiLU is applied internally before normalization, unlike standard # norm usage - hidden_states = self.norm(hidden_states, gate) + hidden_states = self.norm(hidden_states, gate[:num_actual_tokens]) # 5. Final linear projection - out, _ = self.out_proj(hidden_states) - return out + output[:num_actual_tokens], _ = self.out_proj(hidden_states) def get_state_shape(self) -> tuple[tuple[int, ...], tuple[int, ...]]: return get_mamba_state_shape( @@ -712,3 +736,36 @@ def get_state_shape(self) -> tuple[tuple[int, ...], tuple[int, ...]]: state_size=self.ssm_state_size, conv_kernel=self.conv_kernel_size, ) + + +def mamba_mixer2( + hidden_states: torch.Tensor, + output: torch.Tensor, + layer_name: str, + mup_vector: Optional[torch.Tensor] = None, +) -> None: + forward_context: ForwardContext = get_forward_context() + self = forward_context.no_compile_layers[layer_name] + self.forward_cuda(hidden_states=hidden_states, + output=output, + mamba_cache_params=None, + mamba2_metadata=None, + mup_vector=mup_vector) + + +def mamba_mixer2_fake( + hidden_states: torch.Tensor, + output: torch.Tensor, + layer_name: str, + mup_vector: Optional[torch.Tensor] = None, +) -> None: + return + + +direct_register_custom_op( + op_name="mamba_mixer2", + op_func=mamba_mixer2, + mutates_args=["output"], + fake_impl=mamba_mixer2_fake, + dispatch_key=current_platform.dispatch_key, +) diff --git a/vllm/model_executor/models/bamba.py b/vllm/model_executor/models/bamba.py index e93d4294a62..0f549442763 100644 --- a/vllm/model_executor/models/bamba.py +++ b/vllm/model_executor/models/bamba.py @@ -11,6 +11,7 @@ from vllm import envs from vllm.attention.layer import Attention +from vllm.compilation.decorators import support_torch_compile from vllm.config import CacheConfig, VllmConfig from vllm.distributed import get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group @@ -122,11 +123,10 @@ def forward( hidden_states, residual = self.input_layernorm( hidden_states, residual) - 
hidden_states = self.mamba(hidden_states, mamba_cache_params, - mamba2_metadata) + output = torch.empty_like(hidden_states) + self.mamba(hidden_states, output, mamba_cache_params, mamba2_metadata) # Fully Connected - hidden_states, residual = self.pre_ff_layernorm( - hidden_states, residual) + hidden_states, residual = self.pre_ff_layernorm(output, residual) hidden_states = self.feed_forward(hidden_states) return hidden_states, residual @@ -169,7 +169,7 @@ def __init__( self.max_position_embeddings = max_position_embeddings if hasattr(config, "partial_rotary_factor"): - rotary_dim = self.head_dim * config.partial_rotary_factor + rotary_dim = int(self.head_dim * config.partial_rotary_factor) elif hasattr(config, "attn_rotary_emb"): rotary_dim = config.attn_rotary_emb # for backward compatibility else: @@ -258,6 +258,7 @@ def forward( } +@support_torch_compile class BambaModel(nn.Module): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): diff --git a/vllm/model_executor/models/falcon_h1.py b/vllm/model_executor/models/falcon_h1.py index 7761de224c9..6a58b1501fe 100644 --- a/vllm/model_executor/models/falcon_h1.py +++ b/vllm/model_executor/models/falcon_h1.py @@ -10,6 +10,7 @@ from vllm import envs from vllm.attention.layer import Attention +from vllm.compilation.decorators import support_torch_compile from vllm.config import CacheConfig, VllmConfig from vllm.distributed import get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group @@ -179,13 +180,15 @@ def forward( mamba2_metadata: Mamba2Metadata, **kwargs, ): - hidden_states = self.mamba( + output = torch.empty_like(hidden_states) + self.mamba( hidden_states, + output, mamba_cache_params, mamba2_metadata=mamba2_metadata, mup_vector=self.mup_vector, ) - return hidden_states, residual + return output, residual class FalconH1AttentionDecoderLayer(nn.Module): @@ -398,6 +401,7 @@ def forward( return hidden_states +@support_torch_compile class FalconH1Model(nn.Module): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): diff --git a/vllm/model_executor/models/granitemoehybrid.py b/vllm/model_executor/models/granitemoehybrid.py index 1c93e90737a..59c1dce48ee 100644 --- a/vllm/model_executor/models/granitemoehybrid.py +++ b/vllm/model_executor/models/granitemoehybrid.py @@ -11,6 +11,7 @@ from vllm import envs from vllm.attention.layer import Attention +from vllm.compilation.decorators import support_torch_compile from vllm.config import CacheConfig, VllmConfig from vllm.distributed import get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group @@ -104,9 +105,9 @@ def forward( ): residual = hidden_states hidden_states = self.input_layernorm(hidden_states) - hidden_states = self.mamba(hidden_states, mamba_cache_params, - mamba2_metadata) - hidden_states = residual + hidden_states * self.residual_multiplier + output = torch.empty_like(hidden_states) + self.mamba(hidden_states, output, mamba_cache_params, mamba2_metadata) + hidden_states = residual + output * self.residual_multiplier residual = hidden_states hidden_states = self.post_attention_layernorm(hidden_states) @@ -307,6 +308,7 @@ def forward( } +@support_torch_compile class GraniteMoeHybridModel(nn.Module): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): diff --git a/vllm/model_executor/models/mamba2.py b/vllm/model_executor/models/mamba2.py index d812d8cc0a3..adad181617e 100644 --- a/vllm/model_executor/models/mamba2.py +++ b/vllm/model_executor/models/mamba2.py @@ -10,6 
+10,7 @@ from vllm import envs from vllm.attention.backends.abstract import AttentionMetadata +from vllm.compilation.decorators import support_torch_compile from vllm.config import VllmConfig from vllm.distributed.parallel_state import get_pp_group from vllm.forward_context import get_forward_context @@ -79,11 +80,12 @@ def forward( else: hidden_states, residual = self.norm(hidden_states, residual) - hidden_states = self.mixer(hidden_states, mamba_cache_params, - mamba2_metadata) - return hidden_states, residual + output = torch.empty_like(hidden_states) + self.mixer(hidden_states, output, mamba_cache_params, mamba2_metadata) + return output, residual +@support_torch_compile class Mamba2Model(nn.Module): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): diff --git a/vllm/model_executor/models/nemotron_h.py b/vllm/model_executor/models/nemotron_h.py index cf7b39db1fe..6a999e2254e 100644 --- a/vllm/model_executor/models/nemotron_h.py +++ b/vllm/model_executor/models/nemotron_h.py @@ -25,6 +25,7 @@ from vllm import envs from vllm.attention.layer import Attention +from vllm.compilation.decorators import support_torch_compile from vllm.config import CacheConfig, VllmConfig from vllm.distributed import get_tensor_model_parallel_world_size from vllm.distributed.parallel_state import get_pp_group @@ -172,9 +173,9 @@ def forward( else: hidden_states, residual = self.norm(hidden_states, residual) - hidden_states = self.mixer(hidden_states, mamba_cache_params, - mamba2_metadata) - return hidden_states, residual + output = torch.empty_like(hidden_states) + self.mixer(hidden_states, output, mamba_cache_params, mamba2_metadata) + return output, residual class NemotronHAttention(nn.Module): @@ -292,6 +293,7 @@ def forward( } +@support_torch_compile class NemotronHModel(nn.Module): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): diff --git a/vllm/model_executor/models/zamba2.py b/vllm/model_executor/models/zamba2.py index ebf8dd497f6..7764fd9b9e0 100644 --- a/vllm/model_executor/models/zamba2.py +++ b/vllm/model_executor/models/zamba2.py @@ -17,6 +17,7 @@ from vllm import envs from vllm.attention.layer import Attention +from vllm.compilation.decorators import support_torch_compile from vllm.config import CacheConfig, VllmConfig from vllm.distributed import get_tensor_model_parallel_world_size from vllm.forward_context import get_forward_context @@ -548,14 +549,16 @@ def forward( hidden_states = self.input_layernorm(hidden_states) # Process through Mamba mixer - hidden_states = self.mamba( + output = torch.empty_like(hidden_states) + self.mamba( hidden_states, + output, mamba_cache_params=mamba_cache_params, mamba2_metadata=mamba2_metadata, ) # residual connection after mamba - hidden_states = residual + hidden_states + hidden_states = residual + output return hidden_states @@ -646,6 +649,7 @@ def forward( return layer_outputs +@support_torch_compile class Zamba2Model(nn.Module): """Core Zamba2 model combining transformer and Mamba architectures. 
diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index d5449a68bc2..1ee9c070226 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -2753,9 +2753,6 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: if self.vllm_config.speculative_config is not None: raise NotImplementedError( "Mamba with speculative decoding is not supported yet.") - if not self.vllm_config.model_config.enforce_eager: - raise NotImplementedError( - "Mamba with cuda graph is not supported yet.") if self.vllm_config.cache_config.enable_prefix_caching: raise NotImplementedError( "Prefix caching is not supported for Mamba yet.") From 85a1995ee0db9d948cf8a3b5b8ab81f10fa54bc6 Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Sat, 19 Jul 2025 13:53:17 -0700 Subject: [PATCH 209/552] [V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217) Signed-off-by: Woosuk Kwon Signed-off-by: x22x22 --- .../scripts/hardware_ci/run-amd-test.sh | 1 - docs/models/supported_models.md | 1 - .../attention/test_blocksparse_attention.py | 441 ----------------- .../attention/test_rocm_attention_selector.py | 32 +- tests/models/registry.py | 4 - vllm/attention/backends/abstract.py | 1 - vllm/attention/backends/blocksparse_attn.py | 466 ------------------ .../backends/differential_flash_attn.py | 4 - .../backends/dual_chunk_flash_attn.py | 1 - vllm/attention/backends/flash_attn.py | 6 +- vllm/attention/backends/flashinfer.py | 1 - vllm/attention/backends/flashmla.py | 12 +- vllm/attention/backends/mla/common.py | 1 - vllm/attention/backends/rocm_aiter_mla.py | 12 +- vllm/attention/backends/rocm_flash_attn.py | 6 +- vllm/attention/backends/triton_mla.py | 12 +- vllm/attention/backends/xformers.py | 6 +- vllm/attention/layer.py | 6 +- .../ops/blocksparse_attention/__init__.py | 0 .../blocksparse_attention_kernel.py | 433 ---------------- .../ops/blocksparse_attention/interface.py | 239 --------- .../ops/blocksparse_attention/utils.py | 246 --------- vllm/attention/selector.py | 9 - vllm/model_executor/models/phi3_small.py | 465 ----------------- vllm/model_executor/models/registry.py | 1 - vllm/platforms/interface.py | 1 - vllm/v1/attention/backends/cpu_attn.py | 6 +- vllm/v1/attention/backends/flash_attn.py | 6 +- vllm/v1/attention/backends/flashinfer.py | 3 +- vllm/v1/attention/backends/flex_attention.py | 7 +- vllm/v1/attention/backends/mla/common.py | 3 +- vllm/v1/attention/backends/mla/cutlass_mla.py | 12 +- vllm/v1/attention/backends/mla/flashmla.py | 12 +- .../attention/backends/mla/rocm_aiter_mla.py | 12 +- vllm/v1/attention/backends/mla/triton_mla.py | 12 +- vllm/v1/attention/backends/pallas.py | 8 +- vllm/v1/attention/backends/rocm_aiter_fa.py | 6 +- vllm/v1/attention/backends/triton_attn.py | 6 +- 38 files changed, 65 insertions(+), 2435 deletions(-) delete mode 100644 tests/kernels/attention/test_blocksparse_attention.py delete mode 100644 vllm/attention/backends/blocksparse_attn.py delete mode 100644 vllm/attention/ops/blocksparse_attention/__init__.py delete mode 100644 vllm/attention/ops/blocksparse_attention/blocksparse_attention_kernel.py delete mode 100644 vllm/attention/ops/blocksparse_attention/interface.py delete mode 100644 vllm/attention/ops/blocksparse_attention/utils.py delete mode 100644 vllm/model_executor/models/phi3_small.py diff --git a/.buildkite/scripts/hardware_ci/run-amd-test.sh b/.buildkite/scripts/hardware_ci/run-amd-test.sh index 156456c92e6..5e5a532cb57 100755 --- a/.buildkite/scripts/hardware_ci/run-amd-test.sh +++ 
b/.buildkite/scripts/hardware_ci/run-amd-test.sh @@ -108,7 +108,6 @@ fi if [[ $commands == *" kernels/attention"* ]]; then commands="${commands} \ --ignore=kernels/attention/test_attention_selector.py \ - --ignore=kernels/attention/test_blocksparse_attention.py \ --ignore=kernels/attention/test_encoder_decoder_attn.py \ --ignore=kernels/attention/test_flash_attn.py \ --ignore=kernels/attention/test_flashinfer.py \ diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 3731c676f5e..250ce53fec3 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -376,7 +376,6 @@ Specified using `--task generate`. | `OrionForCausalLM` | Orion | `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc. | | ✅︎ | ✅︎ | | `PhiForCausalLM` | Phi | `microsoft/phi-1_5`, `microsoft/phi-2`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Phi3ForCausalLM` | Phi-4, Phi-3 | `microsoft/Phi-4-mini-instruct`, `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | -| `Phi3SmallForCausalLM` | Phi-3-Small | `microsoft/Phi-3-small-8k-instruct`, `microsoft/Phi-3-small-128k-instruct`, etc. | | ✅︎ | ✅︎ | | `PhiMoEForCausalLM` | Phi-3.5-MoE | `microsoft/Phi-3.5-MoE-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Phi4FlashForCausalLM` | Phi-4-mini-flash-reasoning | `microsoft/microsoft/Phi-4-mini-instruct`, etc. | | | | | `PersimmonForCausalLM` | Persimmon | `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc. | | ✅︎ | ✅︎ | diff --git a/tests/kernels/attention/test_blocksparse_attention.py b/tests/kernels/attention/test_blocksparse_attention.py deleted file mode 100644 index 9aee818c995..00000000000 --- a/tests/kernels/attention/test_blocksparse_attention.py +++ /dev/null @@ -1,441 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import random -from typing import Optional - -import pytest -import torch - -from tests.kernels.allclose_default import get_default_atol, get_default_rtol -from vllm import _custom_ops as ops -from vllm.attention.ops.blocksparse_attention.interface import ( - LocalStridedBlockSparseAttn) -from vllm.platforms import current_platform -from vllm.utils import get_max_shared_memory_bytes - -FLOAT32_BYTES = torch.finfo(torch.float).bits // 8 -# This will change depending on the compute capability. -# - 512 as a buffer -MAX_SEQ_LEN = get_max_shared_memory_bytes() // FLOAT32_BYTES - 512 -# MAX_SEQ_LEN = 2771 - -# There may not be enough gpu memory due to large NUM_BLOCKS. -# Reduce NUM_BLOCKS when it happens. 
-NUM_BLOCKS = 4321 # Arbitrary values for testing -PARTITION_SIZE = 512 -DTYPES = [torch.half, torch.bfloat16] -NUM_GEN_SEQS = [3] # Arbitrary values for testing -NUM_PREFILL_SEQS = [3] # Arbitrary values for testing -NUM_HEADS = [(40, 40)] # Arbitrary values for testing - -HEAD_SIZES = [64, 112] -BLOCK_SIZES = [16] -USE_ALIBI = [False, True] -KV_CACHE_DTYPE = ["auto", "fp8"] -SEEDS = [0] -CUDA_DEVICES = ['cuda:0'] -BLOCKSPARSE_LOCAL_BLOCKS = [16] -BLOCKSPARSE_VERT_STRIDES = [8] - -BLOCKSPARSE_BLOCK_SIZES = [64] -BLOCKSPARSE_HEADS_SLIDINGS = [2, -1] -BLOCKSPARSE_HOMO_HEADS = [True, False] - - -def ref_masked_attention( - query: torch.Tensor, - key: torch.Tensor, - value: torch.Tensor, - scale: float, - attn_mask: Optional[torch.Tensor] = None, -) -> torch.Tensor: - attn_weights = scale * torch.einsum("qhd,khd->hqk", query, key).float() - if attn_mask is not None: - attn_weights = attn_weights + attn_mask.float() - attn_weights = torch.softmax(attn_weights, dim=-1).to(value.dtype) - out = torch.einsum("hqk,khd->qhd", attn_weights, value) - return out - - -def ref_single_query_cached_kv_attention( - output: torch.Tensor, - query: torch.Tensor, - num_queries_per_kv: int, - key_cache: torch.Tensor, - value_cache: torch.Tensor, - block_tables: torch.Tensor, - seq_lens: torch.Tensor, - scale: float, - alibi_slopes: Optional[torch.Tensor], - tp_rank: int = 0, - blocksparse_local_blocks: int = 0, - blocksparse_vert_stride: int = 1, - blocksparse_block_size: int = 64, - blocksparse_head_sliding_step: int = 0, -) -> None: - num_query_heads = query.shape[1] - num_kv_heads = value_cache.shape[1] - head_size = value_cache.shape[2] - block_size = value_cache.shape[3] - num_seqs = query.shape[0] - - block_tables_lst = block_tables.cpu().tolist() - seq_lens_lst = seq_lens.cpu().tolist() - for i in range(num_seqs): - q = query[i].unsqueeze(0) - block_table = block_tables_lst[i] - seq_len = int(seq_lens_lst[i]) - - keys_lst: list[torch.Tensor] = [] - values_lst: list[torch.Tensor] = [] - for j in range(seq_len): - block_number = int(block_table[j // block_size]) - block_offset = j % block_size - - k = key_cache[block_number, :, :, block_offset, :] - k = k.reshape(num_kv_heads, head_size) - keys_lst.append(k) - - v = value_cache[block_number, :, :, block_offset] - values_lst.append(v) - keys = torch.stack(keys_lst, dim=0) - values = torch.stack(values_lst, dim=0) - if num_queries_per_kv > 1: - # Handle MQA and GQA - keys = torch.repeat_interleave(keys, num_queries_per_kv, dim=1) - values = torch.repeat_interleave(values, num_queries_per_kv, dim=1) - - alibi_bias = None - if alibi_slopes is not None: - # Create the ALiBi bias used in the paged attention kernel. 
- position_ids = torch.arange(seq_len).int() - alibi_bias = (position_ids - seq_len + 1).float() - alibi_bias = alibi_slopes.view(-1, 1, 1) * alibi_bias.view( - 1, 1, -1) - - if blocksparse_vert_stride >= 1: - bsize = blocksparse_block_size - hsliding = blocksparse_head_sliding_step - vert = blocksparse_vert_stride - locals = blocksparse_local_blocks - qb = (seq_len - 1) // bsize - attn_mask = q.new_zeros( - (num_query_heads, 1, seq_len)).float() - torch.inf - for h in range(num_query_heads): - if hsliding >= 0: # slide with q heads - bs_offset = (tp_rank * num_query_heads + h) * hsliding + 1 - else: # slide with kv heads - bs_offset = (tp_rank * num_kv_heads + - h // num_queries_per_kv) * (-hsliding) + 1 - for kb in range(qb + 1): - kj = kb * bsize - if (qb - kb) < locals or \ - (kb + bs_offset) % vert == 0: - attn_mask[h, 0, kj:min(kj + bsize, seq_len)] = 0 - if alibi_bias is not None: - attn_mask += alibi_bias - else: - attn_mask = alibi_bias - - out = ref_masked_attention(q, keys, values, scale, attn_mask=attn_mask) - out = out.view(num_query_heads, head_size) - output[i].copy_(out, non_blocking=True) - - -@pytest.mark.parametrize("version", ["v1", "v2"]) -@pytest.mark.parametrize("num_seqs", NUM_GEN_SEQS) -@pytest.mark.parametrize("num_heads", NUM_HEADS) -@pytest.mark.parametrize("head_size", HEAD_SIZES) -@pytest.mark.parametrize("use_alibi", USE_ALIBI) -@pytest.mark.parametrize("block_size", BLOCK_SIZES) -@pytest.mark.parametrize("dtype", DTYPES) -@pytest.mark.parametrize("kv_cache_dtype", KV_CACHE_DTYPE) -@pytest.mark.parametrize("seed", SEEDS) -@pytest.mark.parametrize("device", CUDA_DEVICES) -@pytest.mark.parametrize("blocksparse_local_blocks", BLOCKSPARSE_LOCAL_BLOCKS) -@pytest.mark.parametrize("blocksparse_vert_stride", BLOCKSPARSE_VERT_STRIDES) -@pytest.mark.parametrize("blocksparse_block_size", BLOCKSPARSE_BLOCK_SIZES) -@pytest.mark.parametrize("blocksparse_head_sliding_step", - BLOCKSPARSE_HEADS_SLIDINGS) -def test_paged_attention( - kv_cache_factory, - version: str, - num_seqs: int, - num_heads: tuple[int, int], - head_size: int, - use_alibi: bool, - block_size: int, - dtype: torch.dtype, - kv_cache_dtype: str, - seed: int, - device: str, - blocksparse_local_blocks: int, - blocksparse_vert_stride: int, - blocksparse_block_size: int, - blocksparse_head_sliding_step: int, -) -> None: - current_platform.seed_everything(seed) - torch.set_default_device(device) - scale = float(1.0 / (head_size**0.5)) - num_query_heads, num_kv_heads = num_heads - query = torch.empty(num_seqs, num_query_heads, head_size, dtype=dtype) - query.uniform_(-scale, scale) - - assert num_query_heads % num_kv_heads == 0 - num_queries_per_kv = num_query_heads // num_kv_heads - alibi_slopes = None - if use_alibi: - alibi_slopes = torch.rand(num_query_heads, dtype=torch.float) - - seq_lens = [random.randint(1, MAX_SEQ_LEN) for _ in range(num_seqs)] - seq_lens[-1] = MAX_SEQ_LEN - max_seq_len = max(seq_lens) - seq_lens = torch.tensor(seq_lens, dtype=torch.int) - - # Create the block tables. - max_num_blocks_per_seq = (max_seq_len + block_size - 1) // block_size - block_tables = [] - for _ in range(num_seqs): - block_table = [ - random.randint(0, NUM_BLOCKS - 1) - for _ in range(max_num_blocks_per_seq) - ] - block_tables.append(block_table) - block_tables = torch.tensor(block_tables, dtype=torch.int) - - # Create the KV caches. 
- key_caches, value_caches = kv_cache_factory(NUM_BLOCKS, block_size, 1, - num_kv_heads, head_size, - kv_cache_dtype, dtype, seed, - device) - key_cache, value_cache = key_caches[0], value_caches[0] - - # Using default kv_scale - k_scale = v_scale = torch.tensor(1.0, dtype=torch.float32, device=device) - tp_rank = 0 - - # Call the paged attention kernel. - output = torch.empty_like(query) - if version == "v1": - ops.paged_attention_v1( - output, - query, - key_cache, - value_cache, - num_kv_heads, - scale, - block_tables, - seq_lens, - block_size, - max_seq_len, - alibi_slopes, - kv_cache_dtype, - k_scale, - v_scale, - tp_rank=tp_rank, - blocksparse_local_blocks=blocksparse_local_blocks, - blocksparse_vert_stride=blocksparse_vert_stride, - blocksparse_block_size=blocksparse_block_size, - blocksparse_head_sliding_step=blocksparse_head_sliding_step, - ) - elif version == "v2": - num_partitions = ((max_seq_len + PARTITION_SIZE - 1) // PARTITION_SIZE) - assert PARTITION_SIZE % block_size == 0 - num_seqs, num_heads, head_size = output.shape - tmp_output = torch.empty( - size=(num_seqs, num_heads, num_partitions, head_size), - dtype=output.dtype, - ) - exp_sums = torch.empty( - size=(num_seqs, num_heads, num_partitions), - dtype=torch.float32, - ) - max_logits = torch.empty_like(exp_sums) - ops.paged_attention_v2( - output, - exp_sums, - max_logits, - tmp_output, - query, - key_cache, - value_cache, - num_kv_heads, - scale, - block_tables, - seq_lens, - block_size, - max_seq_len, - alibi_slopes, - kv_cache_dtype, - k_scale, - v_scale, - tp_rank=tp_rank, - blocksparse_local_blocks=blocksparse_local_blocks, - blocksparse_vert_stride=blocksparse_vert_stride, - blocksparse_block_size=blocksparse_block_size, - blocksparse_head_sliding_step=blocksparse_head_sliding_step, - ) - else: - raise AssertionError(f"Unknown version: {version}") - - # Run the reference implementation. - if kv_cache_dtype == "fp8": - # Convert cache data back to dtype. - x = 16 // torch.tensor([], dtype=dtype).element_size() - key_cache_shape = (NUM_BLOCKS, num_kv_heads, head_size // x, - block_size, x) - dequantized_key_cache = torch.empty(size=key_cache_shape, - dtype=dtype, - device=device) - ops.convert_fp8(dequantized_key_cache, key_cache) - key_cache = dequantized_key_cache - - value_cache_shape = value_cache.shape - dequantized_value_cache = torch.empty(size=value_cache_shape, - dtype=dtype, - device=device) - ops.convert_fp8(dequantized_value_cache, value_cache) - value_cache = dequantized_value_cache - - ref_output = torch.empty_like(query) - ref_single_query_cached_kv_attention( - ref_output, - query, - num_queries_per_kv, - key_cache, - value_cache, - block_tables, - seq_lens, - scale, - alibi_slopes, - tp_rank, - blocksparse_local_blocks, - blocksparse_vert_stride, - blocksparse_block_size, - blocksparse_head_sliding_step, - ) - - # NOTE(woosuk): Due to the kernel-level differences in the two - # implementations, there is a small numerical difference in the two - # outputs. Thus, we use a relaxed tolerance for the test. - atol = get_default_atol(output) if current_platform.is_rocm() else 1e-3 - rtol = get_default_rtol(output) if current_platform.is_rocm() else 1e-5 - - # NOTE(zhaoyang): FP8 KV Cache will introduce quantization error, - # so we use a relaxed tolerance for the test. 
-    atol, rtol = 1e-3, 1e-5
-    if kv_cache_dtype == "fp8":
-        atol, rtol = 1e-2, 1e-5
-    torch.testing.assert_close(output, ref_output, atol=atol, rtol=rtol)
-
-
-def ref_multi_query_kv_attention(
-    cu_seq_lens: list[int],
-    query: torch.Tensor,
-    key: torch.Tensor,
-    value: torch.Tensor,
-    scale: float,
-    dtype: torch.dtype,
-) -> torch.Tensor:
-    num_seqs = len(cu_seq_lens) - 1
-    ref_outputs = []
-    for i in range(num_seqs):
-        start_idx = cu_seq_lens[i]
-        end_idx = cu_seq_lens[i + 1]
-        seq_len = end_idx - start_idx
-
-        # Create attention mask.
-        attn_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=dtype),
-                               diagonal=1)
-        attn_mask = attn_mask * torch.finfo(dtype).min
-        attn_mask = attn_mask.to(dtype=dtype)
-
-        ref_output = ref_masked_attention(
-            query[start_idx:end_idx],
-            key[start_idx:end_idx],
-            value[start_idx:end_idx],
-            scale,
-            attn_mask=attn_mask,
-        )
-        ref_outputs.append(ref_output)
-    ref_output = torch.cat(ref_outputs, dim=0)
-    return ref_output
-
-
-@pytest.mark.parametrize("num_seqs", NUM_PREFILL_SEQS)
-@pytest.mark.parametrize("num_heads", NUM_HEADS)
-@pytest.mark.parametrize("head_size", HEAD_SIZES)
-@pytest.mark.parametrize("blocksparse_local_blocks", BLOCKSPARSE_LOCAL_BLOCKS)
-@pytest.mark.parametrize("blocksparse_vert_stride", BLOCKSPARSE_VERT_STRIDES)
-@pytest.mark.parametrize("blocksparse_block_size", BLOCKSPARSE_BLOCK_SIZES)
-@pytest.mark.parametrize("blocksparse_homo_heads", BLOCKSPARSE_HOMO_HEADS)
-@pytest.mark.parametrize("dtype", DTYPES)
-@pytest.mark.parametrize("seed", SEEDS)
-@pytest.mark.parametrize("device", CUDA_DEVICES)
-@torch.inference_mode()
-def test_varlen_blocksparse_attention_prefill(
-    num_seqs: int,
-    num_heads: tuple[int, int],
-    head_size: int,
-    blocksparse_local_blocks: int,
-    blocksparse_vert_stride: int,
-    blocksparse_block_size: int,
-    blocksparse_homo_heads: bool,
-    dtype: torch.dtype,
-    seed: int,
-    device: str,
-) -> None:
-    current_platform.seed_everything(seed)
-    torch.set_default_device(device)
-    # MAX_SEQ_LEN sometimes causes OOM in the reference implementation.
-    # As the xformers library is already tested with its own tests, we can use
-    # a smaller MAX_SEQ_LEN here.
-    max_len = min(MAX_SEQ_LEN, 4096)
-    seq_lens = random.sample(range(1, max_len), num_seqs)
-    cu_seq_lens = torch.cumsum(torch.tensor([0] + seq_lens), dim=0)
-    num_tokens = sum(seq_lens)
-
-    scale = float(1.0 / (head_size**0.5))
-    num_query_heads, num_kv_heads = num_heads
-    assert num_query_heads % num_kv_heads == 0
-    num_queries_per_kv = num_query_heads // num_kv_heads
-
-    qkv = torch.empty(num_tokens,
-                      num_query_heads + 2 * num_kv_heads,
-                      head_size,
-                      dtype=dtype)
-    qkv.uniform_(-scale, scale)
-    query, key, value = qkv.split(
-        [num_query_heads, num_kv_heads, num_kv_heads], dim=1)
-
-    bs_attn_op = LocalStridedBlockSparseAttn(
-        num_query_heads,
-        max_len,
-        local_blocks=blocksparse_local_blocks,
-        vert_stride=blocksparse_vert_stride,
-        block_size=blocksparse_block_size,
-        device=device,
-        dtype=dtype,
-        homo_head=blocksparse_homo_heads)
-
-    output = bs_attn_op(query,
-                        key,
-                        value,
-                        cu_seq_lens.to(device),
-                        sm_scale=scale)
-
-    if num_queries_per_kv > 1:
-        # Handle MQA and GQA
-        key = torch.repeat_interleave(key, num_queries_per_kv, dim=1)
-        value = torch.repeat_interleave(value, num_queries_per_kv, dim=1)
-
-    ref_output = ref_multi_query_kv_attention(
-        cu_seq_lens.tolist(),
-        query,
-        key,
-        value,
-        scale,
-        dtype,
-    )
-    torch.testing.assert_close(output, ref_output, atol=1e-2, rtol=1e-2)
diff --git a/tests/kernels/attention/test_rocm_attention_selector.py b/tests/kernels/attention/test_rocm_attention_selector.py
index 34311b9ccd7..d56d3f4638f 100644
--- a/tests/kernels/attention/test_rocm_attention_selector.py
+++ b/tests/kernels/attention/test_rocm_attention_selector.py
@@ -33,8 +33,12 @@ def test_selector(monkeypatch: pytest.MonkeyPatch):
 
         # change the attention backend to triton MLA
         m.setenv(STR_BACKEND_ENV_VAR, "TRITON_MLA")
-        backend = get_attn_backend(576, torch.bfloat16, "auto", 16, False,
-                                   False, True)
+        backend = get_attn_backend(576,
+                                   torch.bfloat16,
+                                   "auto",
+                                   16,
+                                   False,
+                                   use_mla=True)
         assert (backend.get_name() == "TRITON_MLA"
                 or backend.get_name() == "TRITON_MLA_VLLM_V1")
 
@@ -42,15 +46,23 @@ def test_selector(monkeypatch: pytest.MonkeyPatch):
         # If use_mla is true
         # The selected backend is triton MLA
         m.setenv(STR_BACKEND_ENV_VAR, None)
-        backend = get_attn_backend(576, torch.bfloat16, "auto", 16, False,
-                                   False, True)
+        backend = get_attn_backend(576,
+                                   torch.bfloat16,
+                                   "auto",
+                                   16,
+                                   False,
+                                   use_mla=True)
         assert (backend.get_name() == "TRITON_MLA"
                 or backend.get_name() == "TRITON_MLA_VLLM_V1")
 
         # change the attention backend to AITER MLA
         m.setenv(STR_BACKEND_ENV_VAR, "ROCM_AITER_MLA")
-        backend = get_attn_backend(576, torch.bfloat16, "auto", 1, False,
-                                   False, True)
+        backend = get_attn_backend(576,
+                                   torch.bfloat16,
+                                   "auto",
+                                   1,
+                                   False,
+                                   use_mla=True)
         assert (backend.get_name() == "ROCM_AITER_MLA"
                 or backend.get_name() == "ROCM_AITER_MLA_VLLM_V1")
 
@@ -60,7 +72,11 @@ def test_selector(monkeypatch: pytest.MonkeyPatch):
         # The selected backend is ROCM_AITER_MLA
         m.setenv(STR_BACKEND_ENV_VAR, None)
         m.setenv("VLLM_ROCM_USE_AITER", "1")
-        backend = get_attn_backend(576, torch.bfloat16, "auto", 1, False,
-                                   False, True)
+        backend = get_attn_backend(576,
+                                   torch.bfloat16,
+                                   "auto",
+                                   1,
+                                   False,
+                                   use_mla=True)
         assert (backend.get_name() == "ROCM_AITER_MLA"
                 or backend.get_name() == "ROCM_AITER_MLA_VLLM_V1")
diff --git a/tests/models/registry.py b/tests/models/registry.py
index 5c546a6c86d..8afac32e1cf 100644
--- a/tests/models/registry.py
+++ b/tests/models/registry.py
@@ -247,10 +247,6 @@ def check_available_online(
     "PersimmonForCausalLM": _HfExamplesInfo("adept/persimmon-8b-chat"),
     "PhiForCausalLM": _HfExamplesInfo("microsoft/phi-2"),
     "Phi3ForCausalLM": _HfExamplesInfo("microsoft/Phi-3-mini-4k-instruct"),
-    # Blocksparse attention not supported in V1 yet
-    "Phi3SmallForCausalLM": _HfExamplesInfo("microsoft/Phi-3-small-8k-instruct",
-                                            trust_remote_code=True,
-                                            v0_only=True),
     "Phi4FlashForCausalLM": _HfExamplesInfo("microsoft/Phi-4-mini-flash-reasoning",  # noqa: E501
                                             trust_remote_code=True,
                                             v0_only=True,
diff --git a/vllm/attention/backends/abstract.py b/vllm/attention/backends/abstract.py
index 05c098a58a0..ba20da4fd75 100644
--- a/vllm/attention/backends/abstract.py
+++ b/vllm/attention/backends/abstract.py
@@ -269,7 +269,6 @@ def __init__(
         alibi_slopes: Optional[List[float]] = None,
         sliding_window: Optional[int] = None,
         kv_cache_dtype: str = "auto",
-        blocksparse_params: Optional[Dict[str, Any]] = None,
         logits_soft_cap: Optional[float] = None,
         attn_type: str = AttentionType.DECODER,
         kv_sharing_target_layer_name: Optional[str] = None,
diff --git a/vllm/attention/backends/blocksparse_attn.py b/vllm/attention/backends/blocksparse_attn.py
deleted file mode 100644
index e4338805f56..00000000000
--- a/vllm/attention/backends/blocksparse_attn.py
+++ /dev/null
@@ -1,466 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-from dataclasses import dataclass, field
-from typing import Any, Dict, List, Optional, Tuple, Type
-
-import torch
-
-from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl,
-                                              AttentionLayer,
-                                              AttentionMetadata, AttentionType)
-from vllm.attention.backends.utils import (CommonAttentionState,
-                                           CommonMetadataBuilder)
-from vllm.attention.ops.blocksparse_attention.interface import (
-    LocalStridedBlockSparseAttn, get_head_sliding_step)
-from vllm.attention.ops.paged_attn import PagedAttention
-from vllm.distributed import (get_tensor_model_parallel_rank,
-                              get_tensor_model_parallel_world_size)
-
-
-@dataclass
-class BlocksparseParams:
-    max_seqlen: int
-
-    # Num q heads per tensor-parallel rank/partition
-    num_heads: int  # per TP partition
-    # Num kv heads per tensor-parallel rank/partition
-    num_kv_heads: int
-
-    # block size used for blocksparse attention.
-    # This is the block_size used in `local_blocks`, `vert_stride`.
-    block_size: int
-
-    # Number of blocks for local attention, i.e., number of
-    # local attended tokens / `sparse_block_size`
-    local_blocks: int
-
-    # Attend to one block per every `vert_stride` blocks.
-    # Controlling the sparsity
-    vert_stride: int
-    """
-    If to use the same vertical stride offset for all heads,
-    i.e., attend to the same block of tokens on all heads.
-    By default, it is False, i.e., attention on the non-local
-    blocks depends on the `head_idx`, that is on
-    blocks satisfying
-    `(block_idx + head_idx * head_sliding_step + 1) % vert_stride == 0`
-    where `head_sliding_step=max(1, int(vert_stride / num_total_heads))`,
-    `block_idx = position_id // sparse_block_size`.
-    See `..ops.blocksparse_attention.utils:get_sparse_attn_mask`
-    for more detail.
-    """
-    homo_head: bool = False
-
-    # If within a group, the kv offsets that each q attends is the same or no.
- homo_head_group: bool = False - - # Decided by homo_head and homo_head group - head_sliding_step: int = field(init=False) - - # range of q heads to for a TP rank - active_head_range: Tuple = field(init=False) - - def __post_init__(self): - assert self.block_size > 0 - assert self.local_blocks >= 0 - assert self.vert_stride >= 1 - - tp_size = get_tensor_model_parallel_world_size() - tp_rank = get_tensor_model_parallel_rank() - total_heads = tp_size * self.num_heads - total_kv_heads = tp_size * self.num_kv_heads - - if self.homo_head: - self.head_sliding_step = 0 - elif self.homo_head_group: - head_sliding_step = get_head_sliding_step(total_kv_heads, - self.vert_stride) - # negative indicates sliding along kv heads, i.e., homo q group - self.head_sliding_step = -head_sliding_step - else: - self.head_sliding_step = get_head_sliding_step( - total_heads, self.vert_stride) - - self.active_head_range = ( - tp_rank * self.num_heads, - (tp_rank + 1) * self.num_heads, - ) - - -class BlocksparseFlashAttentionBackend(AttentionBackend): - - @staticmethod - def get_name() -> str: - return "BLOCK_SPARSE_FLASH_ATTN" - - @staticmethod - def get_impl_cls() -> Type["BlocksparseFlashAttentionImpl"]: - return BlocksparseFlashAttentionImpl - - @staticmethod - def get_metadata_cls() -> Type["AttentionMetadata"]: - return BlocksparseFlashAttentionMetadata - - @staticmethod - def get_builder_cls() -> Type["BlocksparseFlashAttentionMetadataBuilder"]: - return BlocksparseFlashAttentionMetadataBuilder - - @staticmethod - def get_state_cls() -> Type["CommonAttentionState"]: - return CommonAttentionState - - @staticmethod - def get_kv_cache_shape( - num_blocks: int, - block_size: int, - num_kv_heads: int, - head_size: int, - ) -> Tuple[int, ...]: - return PagedAttention.get_kv_cache_shape(num_blocks, block_size, - num_kv_heads, head_size) - - @staticmethod - def swap_blocks( - src_kv_cache: torch.Tensor, - dst_kv_cache: torch.Tensor, - src_to_dst: Dict[int, int], - ) -> None: - PagedAttention.swap_blocks(src_kv_cache, dst_kv_cache, src_to_dst) - - @staticmethod - def copy_blocks( - kv_caches: List[torch.Tensor], - src_to_dists: Dict[int, List[int]], - ) -> None: - PagedAttention.copy_blocks(kv_caches, src_to_dists) - - -@dataclass -class BlocksparseFlashAttentionMetadata(AttentionMetadata): - """A copy of Metadata for FlashAttentionBackend, - to avoid having to install flash_attn. - - NOTE: Any python object stored here is not updated when it is - cuda-graph replayed. If you have values that need to be changed - dynamically, it should be stored in tensor. The tensor has to be - updated from `CUDAGraphRunner.forward` API. - """ - # (batch_size,). The sequence length per sequence. Sequence length means - # the computed tokens + new tokens None if it is a decoding. - seq_lens: Optional[List[int]] - # seq_lens stored as a tensor. - seq_lens_tensor: Optional[torch.Tensor] - - # NOTE(sang): Definition of context_len, query_len, and seq_len. - # |---------- N-1 iteration --------| - # |---------------- N iteration ---------------------| - # |- tokenA -|......................|-- newTokens ---| - # |---------- context_len ----------| - # |-------------------- seq_len ----------------------| - # |-- query_len ---| - - # Maximum query length in the batch. None for decoding. - max_query_len: Optional[int] - # Maximum sequence length among prefill batch. 0 if there are decoding - # requests only. - max_prefill_seq_len: int - # Maximum sequence length among decode batch. 0 if there are prefill - # requests only. 
- max_decode_seq_len: int - # (batch_size + 1,). The cumulative subquery lengths of the sequences in - # the batch, used to index into subquery. E.g., if the subquery length - # is [4, 6], it is [0, 4, 10]. - query_start_loc: Optional[torch.Tensor] - # (batch_size + 1,). The cumulative sequence lengths of the sequences in - # the batch, used to index into sequence. E.g., if the sequence length is - # [4, 6], it is [0, 4, 10]. - seq_start_loc: Optional[torch.Tensor] - # (batch_size,) A tensor of context lengths (tokens that are computed - # so far). - context_lens_tensor: Optional[torch.Tensor] - - # (batch_size, max_blocks_per_seq). - # Block addresses per sequence. (Seq id -> list of physical block) - # E.g., [0, 1, 2] means tokens are stored in 0th, 1st, and 2nd blocks - # in the kv cache. Each block can contain up to block_size tokens. - # 2nd dimensions are padded up to max_blocks_per_seq if it is cuda-graph - # captured. - block_tables: Optional[torch.Tensor] - - # Whether or not if cuda graph is enabled. - # Cuda-graph is currently enabled for decoding only. - # TODO(woosuk): Move `use_cuda_graph` out since it's unrelated to attention. - use_cuda_graph: bool - - # Max number of query tokens for among request in the batch. - max_decode_query_len: Optional[int] = None - - _cached_prefill_metadata: Optional[ - "BlocksparseFlashAttentionMetadata"] = None - _cached_decode_metadata: Optional[ - "BlocksparseFlashAttentionMetadata"] = None - - @property - def prefill_metadata( - self) -> Optional["BlocksparseFlashAttentionMetadata"]: - if self.num_prefills == 0: - return None - - if self._cached_prefill_metadata is not None: - return self._cached_prefill_metadata - - assert self.seq_lens is not None - assert self.seq_lens_tensor is not None - assert self.query_start_loc is not None - assert self.context_lens_tensor is not None - assert self.block_tables is not None - assert self.seq_start_loc is not None - - self._cached_prefill_metadata = BlocksparseFlashAttentionMetadata( - num_prefills=self.num_prefills, - num_prefill_tokens=self.num_prefill_tokens, - num_decode_tokens=0, - slot_mapping=self.slot_mapping[:self.num_prefill_tokens], - multi_modal_placeholder_index_maps=self. 
- multi_modal_placeholder_index_maps, - enable_kv_scales_calculation=self.enable_kv_scales_calculation, - seq_lens=self.seq_lens[:self.num_prefills], - seq_lens_tensor=self.seq_lens_tensor[:self.num_prefills], - max_query_len=self.max_query_len, - max_prefill_seq_len=self.max_prefill_seq_len, - max_decode_seq_len=0, - query_start_loc=self.query_start_loc[:self.num_prefills + 1], - seq_start_loc=self.seq_start_loc[:self.num_prefills + 1], - context_lens_tensor=self.context_lens_tensor[:self.num_prefills], - block_tables=self.block_tables[:self.num_prefills], - use_cuda_graph=False, - ) - return self._cached_prefill_metadata - - @property - def decode_metadata(self) -> Optional["BlocksparseFlashAttentionMetadata"]: - if self.num_decode_tokens == 0: - return None - - if self._cached_decode_metadata is not None: - return self._cached_decode_metadata - assert self.block_tables is not None - assert self.seq_lens_tensor is not None - - self._cached_decode_metadata = BlocksparseFlashAttentionMetadata( - num_prefills=0, - num_prefill_tokens=0, - num_decode_tokens=self.num_decode_tokens, - slot_mapping=self.slot_mapping[self.num_prefill_tokens:], - multi_modal_placeholder_index_maps=None, - enable_kv_scales_calculation=False, - seq_lens=None, - seq_lens_tensor=self.seq_lens_tensor[self.num_prefills:], - max_query_len=None, - max_prefill_seq_len=0, - max_decode_seq_len=self.max_decode_seq_len, - query_start_loc=None, - seq_start_loc=None, - context_lens_tensor=None, - block_tables=self.block_tables[self.num_prefills:], - use_cuda_graph=self.use_cuda_graph, - ) - return self._cached_decode_metadata - - -class BlocksparseFlashAttentionMetadataBuilder( - CommonMetadataBuilder[BlocksparseFlashAttentionMetadata]): - - _metadata_cls = BlocksparseFlashAttentionMetadata - - -class BlocksparseFlashAttentionImpl(AttentionImpl): - """ - If the input tensors contain prompt tokens, the layout is as follows: - |<--------------- num_prompt_tokens -------------->| - |<--prompt_0-->|<--prompt_1-->|...|<--prompt_N-1-->| - - Otherwise, the layout is as follows: - |<------------------ num_generation_tokens (M) ----------------->| - |<--generation_0-->|..........|<--generation_M-1-->|<--padding-->| - - Generation tokens can contain padding when cuda-graph is used. - Currently, prompt tokens don't contain any padding. - - The prompts might have different lengths, while the generation tokens - always have length 1. 
- - """ - - def __init__( - self, - num_heads: int, - head_size: int, - scale: float, - num_kv_heads: int, - alibi_slopes: Optional[List[float]], - sliding_window: Optional[int], - kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]] = None, - logits_soft_cap: Optional[float] = None, - attn_type: str = AttentionType.DECODER, - kv_sharing_target_layer_name: Optional[str] = None, - ) -> None: - if kv_sharing_target_layer_name is not None: - raise NotImplementedError("KV sharing is not supported in V0 " - "BLOCK_SPARSE_FLASH_ATTN Backend.") - assert blocksparse_params is not None - assert alibi_slopes is None, ValueError( - "Alibi not support for blocksparse flash attention.") - assert sliding_window is None, ValueError( - "sliding_window is invalid for blocksparse attention.") - assert logits_soft_cap is None, ValueError( - "logits_soft_cap is invalid for blocksparse attention.") - - if "num_heads" not in blocksparse_params: - blocksparse_params["num_heads"] = num_heads - if "num_kv_heads" not in blocksparse_params: - blocksparse_params["num_kv_heads"] = num_kv_heads or num_heads - self.blocksparse_params = BlocksparseParams(**blocksparse_params) - self.kv_cache_dtype = kv_cache_dtype - - self.num_heads = num_heads - self.head_size = head_size - self.scale = float(scale) - self.alibi_slopes = alibi_slopes - self.num_kv_heads = num_kv_heads - - self.num_queries_per_kv = self.num_heads // self.num_kv_heads - - self.local_blocks = self.blocksparse_params.local_blocks - self.vert_stride = self.blocksparse_params.vert_stride - self.sparse_block_size = self.blocksparse_params.block_size - self.head_sliding_step = self.blocksparse_params.head_sliding_step - - supported_head_sizes = PagedAttention.get_supported_head_sizes() - if head_size not in supported_head_sizes: - raise ValueError( - f"Head size {head_size} is not supported by PagedAttention. " - f"Supported head sizes are: {supported_head_sizes}.") - - self.tp_size = get_tensor_model_parallel_world_size() - self.tp_rank = get_tensor_model_parallel_rank() - - total_num_heads = num_heads * self.tp_size - self.bs_attn = LocalStridedBlockSparseAttn( - total_num_heads, - self.blocksparse_params.max_seqlen, - self.blocksparse_params.local_blocks, - self.blocksparse_params.vert_stride, - self.blocksparse_params.block_size, - homo_head=self.blocksparse_params.homo_head, - active_head_range=self.blocksparse_params.active_head_range, - ) - - if attn_type != AttentionType.DECODER: - raise NotImplementedError("Encoder self-attention and " - "encoder/decoder cross-attention " - "are not implemented for " - "BlocksparseFlashAttentionImpl") - - def forward( - self, - layer: AttentionLayer, - query: torch.Tensor, - key: torch.Tensor, - value: torch.Tensor, - kv_cache: torch.Tensor, - attn_metadata: BlocksparseFlashAttentionMetadata, - output: Optional[torch.Tensor] = None, - output_scale: Optional[torch.Tensor] = None, - ) -> torch.Tensor: - """Forward pass with FlashAttention and PagedAttention. - - Args: - query: shape = [num_tokens, num_heads * head_size] - key: shape = [num_tokens, num_kv_heads * head_size] - value: shape = [num_tokens, num_kv_heads * head_size] - kv_cache = [2, num_blocks, block_size * num_kv_heads * head_size] - NOTE: kv_cache will be an empty tensor with shape [0] - for profiling run. - attn_metadata: Metadata for attention. 
- Returns: - shape = [num_tokens, num_heads * head_size] - """ - if output_scale is not None: - raise NotImplementedError( - "fused output quantization is not yet supported" - " for BlocksparseFlashAttentionImpl") - - num_tokens, hidden_size = query.shape - # Reshape the query, key, and value tensors. - query = query.view(-1, self.num_heads, self.head_size) - key = key.view(-1, self.num_kv_heads, self.head_size) - value = value.view(-1, self.num_kv_heads, self.head_size) - - if kv_cache.numel() > 0: - key_cache, value_cache = PagedAttention.split_kv_cache( - kv_cache, self.num_kv_heads, self.head_size) - - # Reshape the input keys and values and store them in the cache. - # If kv_cache is not provided, the new key and value tensors are - # not cached. This happens during the initial memory profiling run. - - PagedAttention.write_to_paged_cache( - key, - value, - key_cache, - value_cache, - attn_metadata.slot_mapping, - self.kv_cache_dtype, - layer._k_scale, - layer._v_scale, - ) - - if prefill_meta := attn_metadata.prefill_metadata: - - # Prompt run. - # normal attention - # When block_tables are not filled, it means q and k are the - # prompt, and they have the same length. - - assert kv_cache.numel() == 0 \ - or prefill_meta.block_tables is None \ - or prefill_meta.block_tables.numel() == 0, \ - "Does not support prefix-enabled attention." - - output = self.bs_attn( - q=query, - k=key, - v=value, - cu_seqlens_q=prefill_meta.seq_start_loc, - cu_seqlens_k=prefill_meta.seq_start_loc, - sm_scale=self.scale, - ) - - if decode_meta := attn_metadata.decode_metadata: - # Decoding run. - output = PagedAttention.forward_decode( - query, - key_cache, - value_cache, - decode_meta.block_tables, - decode_meta.seq_lens_tensor, - self.blocksparse_params.max_seqlen, - self.kv_cache_dtype, - self.num_kv_heads, - self.scale, - self.alibi_slopes, - layer._k_scale, - layer._v_scale, - tp_rank=self.tp_rank, - blocksparse_local_blocks=self.local_blocks, - blocksparse_vert_stride=self.vert_stride, - blocksparse_block_size=self.sparse_block_size, - blocksparse_head_sliding_step=self.head_sliding_step, - ) - - assert output is not None - # Reshape the output tensor. 
- return output.view(num_tokens, hidden_size) diff --git a/vllm/attention/backends/differential_flash_attn.py b/vllm/attention/backends/differential_flash_attn.py index 1c139952371..bd9bc427728 100644 --- a/vllm/attention/backends/differential_flash_attn.py +++ b/vllm/attention/backends/differential_flash_attn.py @@ -667,7 +667,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, @@ -680,9 +679,6 @@ def __init__( differential_flash_attention_config self.used_shared_kv_cache = kv_sharing_target_layer_name is not None self.kv_sharing_target_layer_name = kv_sharing_target_layer_name - if blocksparse_params is not None: - raise ValueError( - "FlashAttention does not support block-sparse attention.") if use_irope: logger.warning( "Using irope in V0 is not supported yet, it will fall back " diff --git a/vllm/attention/backends/dual_chunk_flash_attn.py b/vllm/attention/backends/dual_chunk_flash_attn.py index 40557a4e8f8..e108646e7ff 100644 --- a/vllm/attention/backends/dual_chunk_flash_attn.py +++ b/vllm/attention/backends/dual_chunk_flash_attn.py @@ -287,7 +287,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, diff --git a/vllm/attention/backends/flash_attn.py b/vllm/attention/backends/flash_attn.py index 20e67eb9b40..ee36fd19e01 100755 --- a/vllm/attention/backends/flash_attn.py +++ b/vllm/attention/backends/flash_attn.py @@ -4,7 +4,7 @@ from collections import defaultdict from dataclasses import dataclass from itertools import accumulate -from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Type +from typing import TYPE_CHECKING, Dict, List, Optional, Tuple, Type import torch @@ -615,7 +615,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, @@ -624,9 +623,6 @@ def __init__( if kv_sharing_target_layer_name is not None: raise NotImplementedError("KV sharing is not supported in V0 " "FLASH_ATTN backend.") - if blocksparse_params is not None: - raise ValueError( - "FlashAttention does not support block-sparse attention.") if use_irope: logger.warning( "Using irope in V0 is not supported yet, it will fall back " diff --git a/vllm/attention/backends/flashinfer.py b/vllm/attention/backends/flashinfer.py index 1f913ad8952..56d3da699f4 100644 --- a/vllm/attention/backends/flashinfer.py +++ b/vllm/attention/backends/flashinfer.py @@ -999,7 +999,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, diff --git a/vllm/attention/backends/flashmla.py b/vllm/attention/backends/flashmla.py index e185d0260d0..a242ac9bbe0 100644 --- a/vllm/attention/backends/flashmla.py +++ b/vllm/attention/backends/flashmla.py @@ -3,7 +3,7 @@ from 
contextlib import contextmanager from dataclasses import dataclass -from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Type +from typing import TYPE_CHECKING, List, Optional, Tuple, Type import torch @@ -181,7 +181,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str] = None, @@ -189,20 +188,17 @@ def __init__( **mla_args) -> None: super().__init__(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, - blocksparse_params, logits_soft_cap, attn_type, + logits_soft_cap, attn_type, kv_sharing_target_layer_name, **mla_args) assert is_flashmla_supported(), \ "FlashMLA is not supported on this device" - unsupported_features = [ - alibi_slopes, sliding_window, blocksparse_params, logits_soft_cap - ] + unsupported_features = [alibi_slopes, sliding_window, logits_soft_cap] if any(unsupported_features): raise NotImplementedError( "FlashMLAImpl does not support one of the following: " - "alibi_slopes, sliding_window, blocksparse_params, " - "logits_soft_cap") + "alibi_slopes, sliding_window, logits_soft_cap") if attn_type != AttentionType.DECODER: raise NotImplementedError("Encoder self-attention and " diff --git a/vllm/attention/backends/mla/common.py b/vllm/attention/backends/mla/common.py index 0c3ff26d04c..52c4a9e7da3 100644 --- a/vllm/attention/backends/mla/common.py +++ b/vllm/attention/backends/mla/common.py @@ -997,7 +997,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str], diff --git a/vllm/attention/backends/rocm_aiter_mla.py b/vllm/attention/backends/rocm_aiter_mla.py index 1edf34351db..a165a786d63 100644 --- a/vllm/attention/backends/rocm_aiter_mla.py +++ b/vllm/attention/backends/rocm_aiter_mla.py @@ -3,7 +3,7 @@ from contextlib import contextmanager from dataclasses import dataclass -from typing import TYPE_CHECKING, Any, Optional, Type, Union +from typing import TYPE_CHECKING, Optional, Type, Union import torch @@ -367,7 +367,6 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str], @@ -375,17 +374,14 @@ def __init__( **mla_args) -> None: super().__init__(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, - blocksparse_params, logits_soft_cap, attn_type, + logits_soft_cap, attn_type, kv_sharing_target_layer_name, **mla_args) - unsupported_features = [ - alibi_slopes, sliding_window, blocksparse_params, logits_soft_cap - ] + unsupported_features = [alibi_slopes, sliding_window, logits_soft_cap] if any(unsupported_features): raise NotImplementedError( "Aiter MLA does not support one of the following: " - "alibi_slopes, sliding_window, blocksparse_params, " - "logits_soft_cap") + "alibi_slopes, sliding_window, logits_soft_cap") from aiter import flash_attn_varlen_func self.flash_attn_varlen_func = flash_attn_varlen_func diff --git a/vllm/attention/backends/rocm_flash_attn.py b/vllm/attention/backends/rocm_flash_attn.py index 4653d5267e1..1ee1dea729d 100644 --- a/vllm/attention/backends/rocm_flash_attn.py +++ 
b/vllm/attention/backends/rocm_flash_attn.py @@ -4,7 +4,7 @@ import itertools from dataclasses import dataclass from functools import cache -from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Type +from typing import TYPE_CHECKING, List, Optional, Tuple, Type import torch @@ -494,7 +494,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, @@ -507,9 +506,6 @@ def __init__( logger.warning_once( "Using irope in ROCm Flash Attention is not supported yet, it " "will fail back to global attention for long context.") - if blocksparse_params is not None: - raise ValueError( - "ROCmFlashAttention does not support blocksparse attention.") if use_irope: logger.warning( "Using irope in V0 is not supported yet, it will fall back " diff --git a/vllm/attention/backends/triton_mla.py b/vllm/attention/backends/triton_mla.py index e06f7d54e34..fba5b5f6bca 100644 --- a/vllm/attention/backends/triton_mla.py +++ b/vllm/attention/backends/triton_mla.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Any, Dict, List, Optional, Type +from typing import List, Optional, Type import torch @@ -35,7 +35,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str], @@ -43,17 +42,14 @@ def __init__( **mla_args) -> None: super().__init__(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, - blocksparse_params, logits_soft_cap, attn_type, + logits_soft_cap, attn_type, kv_sharing_target_layer_name, **mla_args) - unsupported_features = [ - alibi_slopes, sliding_window, blocksparse_params, logits_soft_cap - ] + unsupported_features = [alibi_slopes, sliding_window, logits_soft_cap] if any(unsupported_features): raise NotImplementedError( "TritonMLAImpl does not support one of the following: " - "alibi_slopes, sliding_window, blocksparse_params, " - "logits_soft_cap") + "alibi_slopes, sliding_window, logits_soft_cap") if attn_type != AttentionType.DECODER: raise NotImplementedError("Encoder self-attention and " diff --git a/vllm/attention/backends/xformers.py b/vllm/attention/backends/xformers.py index 3ef79bb6212..0bc38b41429 100644 --- a/vllm/attention/backends/xformers.py +++ b/vllm/attention/backends/xformers.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer with xFormers and PagedAttention.""" from dataclasses import dataclass -from typing import Any, Dict, List, Optional, Tuple, Type +from typing import Dict, List, Optional, Tuple, Type import torch from xformers import ops as xops @@ -387,7 +387,6 @@ def __init__( alibi_slopes: Optional[List[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[Dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, @@ -396,9 +395,6 @@ def __init__( if kv_sharing_target_layer_name is not None: raise NotImplementedError("KV sharing is not supported in V0 " "XFORMERS backend.") - if blocksparse_params is not None: - raise 
ValueError( - "XFormers does not support block-sparse attention.") if logits_soft_cap is not None: logger.warning_once("XFormers does not support logits soft cap. " "Outputs may be slightly off.") diff --git a/vllm/attention/layer.py b/vllm/attention/layer.py index d0677525d31..5d8ffb8e82d 100644 --- a/vllm/attention/layer.py +++ b/vllm/attention/layer.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer.""" -from typing import Any, Dict, List, Optional +from typing import List, Optional import torch import torch.nn as nn @@ -74,7 +74,6 @@ def __init__( alibi_slopes: Optional[List[float]] = None, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, - blocksparse_params: Optional[Dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, per_layer_sliding_window: Optional[int] = None, use_mla: bool = False, @@ -163,12 +162,11 @@ def __init__( kv_cache_dtype, block_size, is_attention_free, - blocksparse_params is not None, use_mla=use_mla) impl_cls = attn_backend.get_impl_cls() self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, - blocksparse_params, logits_soft_cap, attn_type, + logits_soft_cap, attn_type, kv_sharing_target_layer_name, **extra_impl_args) self.backend = backend_name_to_enum(attn_backend.get_name()) self.dtype = dtype diff --git a/vllm/attention/ops/blocksparse_attention/__init__.py b/vllm/attention/ops/blocksparse_attention/__init__.py deleted file mode 100644 index e69de29bb2d..00000000000 diff --git a/vllm/attention/ops/blocksparse_attention/blocksparse_attention_kernel.py b/vllm/attention/ops/blocksparse_attention/blocksparse_attention_kernel.py deleted file mode 100644 index 05fa9d11f22..00000000000 --- a/vllm/attention/ops/blocksparse_attention/blocksparse_attention_kernel.py +++ /dev/null @@ -1,433 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import torch - -from vllm.triton_utils import tl, triton - - -def blocksparse_flash_attn_varlen_fwd( - q, - k, - v, # (#tokens, n_heads, head_size) - cu_seqlens_k, - cu_seqlens_q, - sm_scale, - sparse_layout, - *, - block_size=64, - q_block_size=None, - max_seqlen=None): - # split q to blocks - - assert isinstance(sparse_layout, (list, tuple)) - - _, n_heads, head_size = q.shape - batch_size = cu_seqlens_k.size(0) - 1 - q_block_size = q_block_size or block_size - - assert q.dim() == k.dim() == v.dim() == 3 - assert q.size(1) % k.size(1) == 0 - assert q.size(2) == k.size(2) - # TODO(linxihui): allow k, v to have different head_size - assert k.shape == v.shape - assert cu_seqlens_k.dim() == 1 - - q_k_ratio = q.size(1) // k.size(1) - - if cu_seqlens_q is None: - if q.size(0) == batch_size: # decoding only - cu_seqlens_q = torch.arange( - 0, - batch_size + 1, - dtype=cu_seqlens_k.dtype, - device=cu_seqlens_k.device, - ) - elif q.size(0) == k.size(0): - cu_seqlens_q = cu_seqlens_k - else: - raise ValueError("cu_seqlens_q must be specified\ - if it mix of prefilling and decoding.") - else: - assert cu_seqlens_k.size(0) == cu_seqlens_q.size(0) - - # switch to use cpu to avoid too many kernel launches when iterated over - q_lens = (cu_seqlens_q[1:] - cu_seqlens_q[:-1]).cpu() - k_lens = (cu_seqlens_k[1:] - cu_seqlens_k[:-1]).cpu() - - assert torch.logical_or(q_lens == 1, k_lens == q_lens).all(), ( - "length of q should either be 1 (decoding) or same as k (prefilling).") - 
- if max_seqlen: - assert k_lens.max() <= max_seqlen - - n_blocks = (q_lens + q_block_size - 1) // q_block_size - - q_batch_ids = torch.tensor( - [i for i, n in enumerate(n_blocks) for _ in range(n)], - dtype=cu_seqlens_q.dtype, - device=cu_seqlens_q.device, - ) - q_start_sids = torch.tensor( - [i * q_block_size for n in n_blocks for i in range(n)], - dtype=cu_seqlens_q.dtype, - device=cu_seqlens_q.device, - ) - - out = q.new_empty(q.shape) - cu_seqlens_q = cu_seqlens_q.contiguous() - cu_seqlens_k = cu_seqlens_k.contiguous() - - layout_crow_indices, layout_col_indices = sparse_layout - block_d = triton.next_power_of_2(head_size) - - decoding_only = (q_lens == 1).all().item() - grid = (len(q_start_sids), n_heads, 1) - - _fwd_kernel_batch_inference[grid]( - q, - k, - v, - out, - sm_scale, - cu_seqlens_q[:-1], - cu_seqlens_q[1:], - cu_seqlens_k[:-1], - cu_seqlens_k[1:], - q_batch_ids, - q_start_sids, - 0, - *q.stride(), - 0, - *k.stride(), - 0, - *v.stride(), - 0, - *out.stride(), - layout_crow_indices, - layout_col_indices, - *layout_crow_indices.stride(), - *layout_col_indices.stride(), - q_k_ratio, - HAS_BATCH_DIM=False, - D_HEAD=head_size, - BLOCK_M=q_block_size, - BLOCK_N=block_size, - BLOCK_D=block_d, - BLOCK_M_LOADING=(16 if decoding_only else - q_block_size), # smaller for decoding - EVEN_D=block_d == head_size, - num_warps=1 if decoding_only else 4, - num_stages=3) - - return out - - -@triton.jit -def _fwd_kernel_inner( - acc, - l_i, - m_i, - q, - Q, - k_block_col_idx, - layout_col_ptr, - layout_col_stride_h, - layout_col_stride_m, - k_ptrs, - v_ptrs, - off_h, - offs_m, - offs_n, - offs_d, - stride_kt, - stride_vt, - sm_scale, - k_seqlen, - past_len, - LAST_K_BLOCK: tl.constexpr, - BLOCK_M_LOADING: tl.constexpr, - BLOCK_N: tl.constexpr, - D_HEAD: tl.constexpr, - EVEN_D: tl.constexpr, - M_LT_N: tl.constexpr, -): - k_block_id = tl.load(layout_col_ptr + off_h * layout_col_stride_h + - k_block_col_idx * layout_col_stride_m).to(tl.int32) - start_n = k_block_id * BLOCK_N - if LAST_K_BLOCK: - if EVEN_D: - k = tl.load( - k_ptrs + start_n * stride_kt, - mask=offs_n[None, :] + start_n < k_seqlen, - other=0.0, - ) - else: - k = tl.load( - k_ptrs + start_n * stride_kt, - mask=(offs_n[None, :] + start_n < k_seqlen) & - (offs_d[:, None] < D_HEAD), - other=0.0, - ) - else: - if EVEN_D: - k = tl.load(k_ptrs + start_n * stride_kt) - else: - k = tl.load(k_ptrs + start_n * stride_kt, - mask=offs_d[:, None] < D_HEAD, - other=0.0) - - qk = tl.zeros([BLOCK_M_LOADING, BLOCK_N], dtype=tl.float32) - qk += tl.dot(q, k) - qk *= sm_scale - - # the following is needed only when LAST_K_BLOCK or BLOCK_M < BLOCK_N - if LAST_K_BLOCK | M_LT_N: - qk += tl.where( - offs_m[:, None] + past_len >= (start_n + offs_n[None, :]), - 0, - float("-inf"), - ) - - # flash-attn2 - m_ij = tl.maximum(m_i, tl.max(qk, 1)) - p = tl.math.exp2(qk - m_ij[:, None]) - l_ij = tl.sum(p, 1) - alpha = tl.math.exp2(m_i - m_ij) - acc = acc * alpha[:, None] - # update m_i - m_i = m_ij - l_i = l_i * alpha + l_ij - - p = p.to(Q.dtype.element_ty) - # update acc - if LAST_K_BLOCK: - if EVEN_D: - v = tl.load( - v_ptrs + start_n * stride_vt, - mask=offs_n[:, None] + start_n < k_seqlen, - other=0.0, - ) - else: - v = tl.load( - v_ptrs + start_n * stride_vt, - mask=(offs_n[:, None] + start_n < k_seqlen) & - (offs_d[None, :] < D_HEAD), - other=0.0, - ) - else: - if EVEN_D: - v = tl.load(v_ptrs + start_n * stride_vt) - else: - v = tl.load(v_ptrs + start_n * stride_vt, - mask=offs_d[None, :] < D_HEAD, - other=0.0) - - acc += tl.dot(p, v) - - return acc, l_i, 
m_i - - -@triton.heuristics({ - "M_LT_N": - lambda kwargs: kwargs["BLOCK_M"] < kwargs["BLOCK_N"], -}) -@triton.jit -def _fwd_kernel_batch_inference( - Q, - K, - V, - Out, - sm_scale, - q_batch_starts, - q_batch_ends, - k_batch_starts, - k_batch_ends, - q_batch_ids, - q_start_sids, - stride_qb, - stride_qt, - stride_qh, - stride_qd, - stride_kb, - stride_kt, - stride_kh, - stride_kd, - stride_vb, - stride_vt, - stride_vh, - stride_vd, - stride_ob, - stride_ot, - stride_oh, - stride_od, - layout_crow_ptr, - layout_col_ptr, - layout_crow_stride_h, - layout_crow_stride_m, - layout_col_stride_h, - layout_col_stride_m, - q_k_ratio, - HAS_BATCH_DIM: tl.constexpr, - D_HEAD: tl.constexpr, - BLOCK_M: tl.constexpr, - BLOCK_N: tl.constexpr, - BLOCK_D: tl.constexpr, - BLOCK_M_LOADING: tl.constexpr, - EVEN_D: tl.constexpr, - M_LT_N: tl.constexpr, -): - """ - NOTATION: - pid: position id - sid: storage id - sbid: storage block id - pbid: position block id - offs_m, offs_n: storage offsets of m-dim(q, row) and n-dim(k, col) - - TODO(linxihui): - Optimize grouped-attn - """ - off_zm = tl.program_id(0) - off_h = tl.program_id(1) - - off_h_for_kv = off_h // q_k_ratio - - if HAS_BATCH_DIM: - off_z = tl.program_id(2) - Q += off_z * stride_qb - K += off_z * stride_kb - V += off_z * stride_vb - Out += off_z * stride_ob - start_m = off_zm - q_start_sid = start_m * BLOCK_M # always 0 for decoding - else: - off_z = tl.load(q_batch_ids + off_zm).to(tl.int32) # [0, 0, 0, 1] - q_start_sid = tl.load(q_start_sids + off_zm) - start_m = q_start_sid // BLOCK_M # q_sbid - - offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M_LOADING) - offs_n = tl.arange(0, BLOCK_N) - offs_d = tl.arange(0, BLOCK_D) - - q_cu_start = tl.load(q_batch_starts + off_z).to(tl.int32) - q_seqlen = tl.load(q_batch_ends + off_z).to(tl.int32) - q_cu_start - k_cu_start = tl.load(k_batch_starts + off_z).to(tl.int32) - k_seqlen = tl.load(k_batch_ends + off_z).to(tl.int32) - k_cu_start - past_len = k_seqlen - q_seqlen - - Q += q_cu_start * stride_qt + off_h * stride_qh - K += k_cu_start * stride_kt + off_h_for_kv * stride_kh - V += k_cu_start * stride_vt + off_h_for_kv * stride_vh - Out += q_cu_start * stride_ot + off_h * stride_oh - - q_pbid = (past_len + q_start_sid) // BLOCK_M - - if EVEN_D: - q = tl.load( - Q + offs_m[:, None] * stride_qt + offs_d[None, :] * stride_qd, - mask=offs_m[:, None] < q_seqlen, - other=0.0, - ) - else: - q = tl.load( - Q + offs_m[:, None] * stride_qt + offs_d[None, :] * stride_qd, - mask=(offs_m[:, None] < q_seqlen) & (offs_d[None, :] < D_HEAD), - other=0.0, - ) - - sparse_crow_ptr = (layout_crow_ptr + off_h * layout_crow_stride_h + - q_pbid * layout_crow_stride_m) - - # TODO(linxihui): load at once, with any Triton version - # that supports `tl.split`, e.g., Triton 3.0 - k_block_start = tl.load(sparse_crow_ptr).to(tl.int32) - k_block_end = tl.load(sparse_crow_ptr + 1).to(tl.int32) - - m_i = tl.zeros([BLOCK_M_LOADING], dtype=tl.float32) - float("inf") - l_i = tl.zeros([BLOCK_M_LOADING], dtype=tl.float32) - acc = tl.zeros([BLOCK_M_LOADING, BLOCK_D], dtype=tl.float32) - - k_ptrs = K + offs_n[None, :] * stride_kt + offs_d[:, None] * stride_kd - v_ptrs = V + offs_n[:, None] * stride_vt + offs_d[None, :] * stride_vd - - sm_scale *= ( - 1.44269504 # 1/log2 as we use base2 for exponential and logarithm - ) - - for k_block_col_idx in range(k_block_start, k_block_end - 1): - acc, l_i, m_i = _fwd_kernel_inner( - acc, - l_i, - m_i, - q, - Q, - k_block_col_idx, - layout_col_ptr, - layout_col_stride_h, - layout_col_stride_m, - k_ptrs, - 
v_ptrs, - off_h, - offs_m, - offs_n, - offs_d, - stride_kt, - stride_vt, - sm_scale, - k_seqlen, - past_len, - False, - BLOCK_M_LOADING, - BLOCK_N, - D_HEAD, - EVEN_D, - M_LT_N, - ) - - acc, l_i, m_i = _fwd_kernel_inner( - acc, - l_i, - m_i, - q, - Q, - k_block_end - 1, - layout_col_ptr, - layout_col_stride_h, - layout_col_stride_m, - k_ptrs, - v_ptrs, - off_h, - offs_m, - offs_n, - offs_d, - stride_kt, - stride_vt, - sm_scale, - k_seqlen, - past_len, - True, - BLOCK_M_LOADING, - BLOCK_N, - D_HEAD, - EVEN_D, - M_LT_N, - ) - - # flash-attn 2 - m_i += tl.math.log2(l_i) - acc = acc / l_i[:, None] - - # write output - if EVEN_D: - tl.store( - Out + offs_m[:, None] * stride_ot + offs_d[None, :] * stride_od, - acc, - mask=offs_m[:, None] < q_seqlen, - ) - else: - tl.store( - Out + offs_m[:, None] * stride_ot + offs_d[None, :] * stride_od, - acc, - mask=(offs_m[:, None] < q_seqlen) & (offs_d[None, :] < D_HEAD), - ) diff --git a/vllm/attention/ops/blocksparse_attention/interface.py b/vllm/attention/ops/blocksparse_attention/interface.py deleted file mode 100644 index c6f6cc29793..00000000000 --- a/vllm/attention/ops/blocksparse_attention/interface.py +++ /dev/null @@ -1,239 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import math - -import torch - -from vllm.platforms import current_platform - -from .utils import (dense_to_crow_col, get_head_sliding_step, - get_sparse_attn_mask) - -IS_COMPUTE_8_OR_ABOVE = current_platform.has_device_capability(80) - -if IS_COMPUTE_8_OR_ABOVE: - from .blocksparse_attention_kernel import blocksparse_flash_attn_varlen_fwd - - -class LocalStridedBlockSparseAttn(torch.nn.Module): - - def __init__( - self, - n_heads, - max_seqlen, - local_blocks, - vert_stride, - block_size, - device=None, - dtype=None, - homo_head=False, - active_head_range=None, - q_block_size=None, - use_spda=None, - ): - super().__init__() - if use_spda is None: - use_spda = current_platform.is_rocm() or \ - current_platform.is_cpu() or not \ - IS_COMPUTE_8_OR_ABOVE - device = device or (torch.cuda.current_device() - if current_platform.is_cuda_alike() else "cpu") - device = torch.device(device) - # NOTE: vllm CPU backend support BF16 instead of FP16. - dtype = dtype or (torch.bfloat16 if IS_COMPUTE_8_OR_ABOVE - or device.type == "cpu" else torch.half) - - self.n_heads = n_heads - self.max_seqlen = max_seqlen - self.local_blocks = local_blocks - self.vert_stride = vert_stride - self.use_spda = use_spda - self.dtype = dtype - self.device = device - self.block_size = block_size - self.q_block_size = q_block_size - self.homo_head = homo_head - self.active_head_range = active_head_range - self.head_sliding_step = get_head_sliding_step(n_heads, vert_stride, - homo_head) - - sparse_layout, sparse_pattern, self.dense_attn_mask = ( - self.get_attn_pattern(dtype, device)) - - if q_block_size is not None and q_block_size != block_size: - if q_block_size > block_size: - assert q_block_size % block_size == 0 - blocks_to_merge = q_block_size // block_size - shape = sparse_pattern.shape - sparse_pattern = sparse_pattern.view(shape[0], -1, - blocks_to_merge, - shape[-1]) - sparse_pattern = sparse_pattern.sum(2) - sparse_layout = dense_to_crow_col(sparse_pattern) - else: - raise ValueError( - "Does not support smaller q_block_size. It will be slower." 
- ) - - self.sparse_layout = sparse_layout - - def get_attn_pattern(self, dtype, device): - sparse_layout, sparse_pattern, dense_attn_mask = get_sparse_attn_mask( - self.n_heads, - self.max_seqlen, - self.max_seqlen, - dtype, - device, - block_size=self.block_size, - local_blocks=self.local_blocks, - vert_stride=self.vert_stride, - homo_head=self.homo_head, - return_dense=self.use_spda, - dense_mask_type="bias", - ) - if (not self.homo_head) and (self.active_head_range is not None): - assert isinstance(self.active_head_range, tuple) - assert (len(self.active_head_range) == 2) - h_start, h_end = self.active_head_range - sparse_layout = tuple(x[h_start:h_end] for x in sparse_layout) - if self.use_spda: - dense_attn_mask = dense_attn_mask[h_start:h_end] - return sparse_layout, sparse_pattern, dense_attn_mask - - def varlen_attn(self, - q, - k, - v, - cu_seqlens_k, - cu_seqlens_q=None, - sm_scale=None): - """ - q, k, v: shape = (num_tokens, num_heads_q/kv, head_size). - Support grouped attention, with `q[:, i*r:(i*r + r)]` - is correspondent to `k[:, i]`, where `r` is the q/k ratio. - cu_seqlens_k: shape=(batch_size + 1,), - indicating segment of samples, - e.g., `k[cu_seqlen[i]:cu_seqlne[i+1]]` is q of sample i - cu_seqlens_q: shape=(batch_size + 1, ). - Default None: same as cu_seqlens_k for prefilling or - [0, 1, .., batch_size] for decoding. - The only case you need to specify is when q is a mix of - prefilling and decoding. - sm_scale: softmax scale, default to 1/sqrt(head_size). - - return: tensor of shape as q. - """ - assert ( - IS_COMPUTE_8_OR_ABOVE - ), "Requires compute capability of 8 or above (Ampere or newer) to use \ - Triton kernel." - - sm_scale = sm_scale or 1.0 / math.sqrt(q.size(-1)) - - return blocksparse_flash_attn_varlen_fwd( - q, - k, - v, - cu_seqlens_k, - cu_seqlens_q, - sm_scale, - self.sparse_layout, - block_size=self.block_size, - q_block_size=self.q_block_size, - max_seqlen=self.max_seqlen, - ) - - @staticmethod - def transpose_and_pad(x, cu_seqlens, maxlen, head_repeats=1): - """ - :param x: (total_tokens, n_heads, head_size) - :return: (batch, n_heads, length, head_size) - """ - x_padded = x.new_empty( - len(cu_seqlens) - 1, x.size(1), head_repeats, maxlen, x.size(2)) - cu_seqlens = cu_seqlens.cpu() - for i, (s, e) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])): - x_padded[i, :, :, :e - s].copy_(x[s:e].transpose(0, - 1).unsqueeze(1)) - return x_padded.flatten(1, 2) - - @staticmethod - def transpose_and_unpad(x_padded, cu_seqlens): - """ - :param x_padded: (batch, n_heads, length, head_size) - :return: (total_tokens, n_heads, head_size) - """ - cu_seqlens = cu_seqlens.cpu() - total_n_tokens = cu_seqlens[-1] - x = x_padded.new_empty(total_n_tokens, x_padded.size(1), - x_padded.size(3)) - for i, (s, e) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])): - x[s:e].copy_(x_padded[i, :, :e - s].transpose(0, 1)) - return x - - def spda(self, q, k, v, cu_seqlens_k, cu_seqlens_q=None, sm_scale=None): - """For CPU, V100 or other older GPUs. - NOTE: torch SPDA supports nested tensor, - but seems extremely slow. Choose to pad instead. - """ - assert (cu_seqlens_q is None or - (cu_seqlens_q - == cu_seqlens_k).all()), "Can only handle prompt with SPDA." - assert q.size(0) == k.size(0), "can only handle prompt with SPDA." 
- - assert q.size(1) % k.size(1) == 0 - q_k_ratio = q.size(1) // k.size(1) - sm_scale = sm_scale or 1.0 / math.sqrt(q.size(-1)) - cu_seqlens = cu_seqlens_k.cpu() - maxlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max() - - if (self.dense_attn_mask.dtype != q.dtype - or self.dense_attn_mask.device != q.device): - _, _, self.dense_attn_mask = self.get_attn_pattern( - q.dtype, q.device) - attn_mask = self.dense_attn_mask[None, :, :maxlen, :maxlen] - - q2 = self.transpose_and_pad(q, cu_seqlens, maxlen, 1) - k2, v2 = (self.transpose_and_pad(x, cu_seqlens, maxlen, q_k_ratio) - for x in [k, v]) - spda_output = torch.nn.functional.scaled_dot_product_attention( - q2, k2, v2, attn_mask=attn_mask, scale=sm_scale) - return self.transpose_and_unpad(spda_output, cu_seqlens) - - def forward(self, q, k, v, cu_seqlens_k, cu_seqlens_q=None, sm_scale=None): - """Dispatch to `varlen_attn` (Ampere or newer) or - `self.spda`(cpu, Volta, Turing or older)based on - the type of device used and cuda compute capability. - - q, k, v: shape = (num_tokens, num_heads_q/kv, head_size). - Support grouped attention, with `q[:, i*r:(i*r + r)]` - is correspondent to `k[:, i]`, where `r` is the q/k ratio. - cu_seqlens_k: shape=(batch_size + 1,), indicating segment of samples, - e.g., `k[cu_seqlen[i]:cu_seqlne[i+1]]` is q of sample i - cu_seqlens_q: shape=(batch_size + 1, ). - Default None: same as cu_seqlens_k for prefilling or - [0, 1, .., batch_size] for decoding. - The only case you need to specify - is when q is a mix of prefilling - and decoding. - sm_scale: softmax scale, default to 1/sqrt(head_size). - - return: tensor of shape as q. - """ - assert k.dim() == 3 - if self.use_spda: - return self.spda( - q, - k, - v, - cu_seqlens_k, - cu_seqlens_q=cu_seqlens_q, - sm_scale=sm_scale, - ) - return self.varlen_attn(q, - k, - v, - cu_seqlens_k, - cu_seqlens_q=cu_seqlens_q, - sm_scale=sm_scale) diff --git a/vllm/attention/ops/blocksparse_attention/utils.py b/vllm/attention/ops/blocksparse_attention/utils.py deleted file mode 100644 index 445720c709c..00000000000 --- a/vllm/attention/ops/blocksparse_attention/utils.py +++ /dev/null @@ -1,246 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# Helper functions for 3D sparse pattern -# These function are not optimized and very inefficient. -# Avoid calling them too frequent or use a cache mechanism. - -from functools import lru_cache - -import numpy as np -import torch - -from vllm.triton_utils import triton - - -class csr_matrix: - """Simple implementation of CSR matrix conversion without scipy. - This replaced scipy.sparse.csr_matrix() previously used.""" - - def __init__(self, input_array): - if not isinstance(input_array, np.ndarray): - raise ValueError("Input must be a NumPy array") - - self.shape = input_array.shape - rows, cols = self.shape - data = [] - indices = [] - indptr = [0] - - for i in range(rows): - for j in range(cols): - if input_array[i, j]: - data.append(input_array[i, j]) - indices.append(j) - indptr.append(len(indices)) - - self.data = np.array(data) - self.indices = np.array(indices) - self.indptr = np.array(indptr) - - -def dense_to_crow_col(x: torch.Tensor): - """Turning a 2D/3D torch tensor (x) to CSR rows/cols indexing. 
- NOTE: col_indices padded -1 - """ - device = x.device - pad = -1 - dim = x.dim() - assert x.dim() in (2, 3) - if x.dim() == 2: - x = x[None] - x = [csr_matrix(xi.bool().cpu().numpy()) for xi in x] - crows = torch.vstack([torch.from_numpy(xi.indptr) for xi in x]) - cols = [torch.from_numpy(xi.indices) for xi in x] - max_cols = max(len(xi) for xi in cols) - cols = [ - torch.cat([xi, pad + xi.new_zeros(max_cols - xi.shape[0])]) - for xi in cols - ] - cols = torch.vstack(cols) - if dim == 2: - crows = crows[0] - cols = cols[0] - return crows.to(device), cols.to(device) - - -def crow_col_to_dense(crows: torch.Tensor, - cols: torch.Tensor, - dtype: torch.dtype = torch.float16): - dim = crows.dim() - if dim == 1: - crows = crows[None] - cols = cols[None] - device = crows.device - crows, cols = crows.cpu(), cols.cpu() # faster in cpu - shape = (crows.shape[0], crows.shape[1] - 1, cols.max() + 1) - x = torch.zeros(shape, dtype=dtype) - for i in range(shape[0]): - for j in range(shape[1]): - x[i, j, cols[i, crows[i, j]:crows[i, j + 1]]] = 1 - if dim == 1: - x = x[0] - return x.to(device) - - -def dense_to_ccol_row(x: torch.Tensor): - """Similar, but to CSC format""" - x = x.transpose(-2, -1) - return dense_to_crow_col(x) - - -def ccol_row_to_dense(ccol: torch.Tensor, - rows: torch.Tensor, - dtype: torch.dtype = torch.float16): - return crow_col_to_dense(ccol, rows, dtype).permute(0, 2, 1).contiguous() - - -def _get_sparse_attn_mask_homo_head( - q_len: int, - max_seqlen: int, - dtype: torch.dtype, - device: torch.device, - block_size: int = 128, - local_blocks: int = 4, - vert_stride: int = 4, - return_dense: bool = False, -): - """ - :return: a tuple of 3: - - tuple of crow_indices, col_indices representation - of CSR format. - - block dense mask - - all token dense mask (be aware that it can be - OOM if it is too big) if `return_dense==True`, - otherwise, None - """ - with torch.no_grad(): - num_blocks = triton.cdiv(max_seqlen, block_size) - q_pos = torch.arange(num_blocks)[:, None] - k_pos = torch.arange(num_blocks)[None] - mask_vert_strided = (torch.arange(num_blocks) + 1) % vert_stride == 0 - block_mask_dense = (((q_pos >= k_pos) - & ((q_pos - k_pos < local_blocks) - | mask_vert_strided)).to(device).to(dtype)) - num_blocks_q = triton.cdiv(q_len, block_size) - block_mask_dense_output = (dense_to_crow_col( - block_mask_dense[-num_blocks_q:].contiguous())) - if return_dense: - mask_dense = torch.kron( - block_mask_dense, - block_mask_dense.new_ones((block_size, block_size)), - ) - causal_mask = torch.tril(torch.ones( - max_seqlen, max_seqlen)).type_as(mask_dense)[-q_len:] - mask_dense = mask_dense[-q_len:, :max_seqlen] * causal_mask - return ( - block_mask_dense_output, - block_mask_dense, - mask_dense, - ) - else: - return ( - block_mask_dense_output, - block_mask_dense, - None, - ) - - -def binary_mask_to_bias(mask_dense: torch.Tensor): - mask_dense = 1 - mask_dense - mask_dense.masked_fill_(mask_dense.bool(), -torch.inf) - return mask_dense - - -def get_head_sliding_step(n_heads: int, - vert_stride: int, - homo_head: bool = False): - if homo_head: - return 0 - return max(1, int(vert_stride / n_heads)) - - -@lru_cache -def get_sparse_attn_mask( - n_heads: int, - q_len: int, - max_seqlen: int, - dtype: torch.dtype, - device: torch.device, - block_size: int = 64, - local_blocks: int = 4, - vert_stride: int = 4, - homo_head: bool = True, - return_dense: bool = False, - dense_mask_type: str = "binary", -): - """ - :param dense_mask_type: "binary" (0 for skip token, 1 for others) - or "bias" (-inf 
for skip token, 0 or others) - :return: a tuple of 3: - - tuple of crow_indices, col_indices representation - of CSR format. - - block dense mask - - all token dense mask (be aware that it can be OOM if it - is too big) if `return_dense==True`, otherwise, None - """ - assert dense_mask_type in ("binary", "bias") - if homo_head: - with torch.no_grad(): - (crow, col), block_mask_dense, mask_dense = ( - _get_sparse_attn_mask_homo_head( - q_len, - max_seqlen, - dtype, - device, - block_size, - local_blocks, - vert_stride, - return_dense, - )) - crow = crow[None].expand(n_heads, crow.shape[0]) - col = col[None].expand(n_heads, col.shape[0]) - if return_dense: - mask_dense = mask_dense[None].expand(n_heads, - *mask_dense.shape) - if dense_mask_type == "bias": - mask_dense = binary_mask_to_bias(mask_dense) - return (crow, col), block_mask_dense, mask_dense - - with torch.no_grad(): - num_blocks = triton.cdiv(max_seqlen, block_size) - q_pos = torch.arange(num_blocks)[None, :, None] - k_pos = torch.arange(num_blocks)[None, None] - head_sliding_step = get_head_sliding_step(n_heads, vert_stride) - mask_vert_strided = [ - (torch.arange(num_blocks) + h * head_sliding_step + 1) % - vert_stride == 0 for h in range(n_heads) - ] - mask_vert_strided = torch.vstack(mask_vert_strided).unsqueeze(1) - block_mask_dense = (((q_pos >= k_pos) - & ((q_pos - k_pos < local_blocks) - | mask_vert_strided)).to(device).to(dtype)) - num_blocks_q = triton.cdiv(q_len, block_size) - block_mask_dense_output = block_mask_dense[:, -num_blocks_q:] - if return_dense: - mask_dense = torch.kron( - block_mask_dense, - block_mask_dense.new_ones((block_size, block_size)), - ) - causal_mask = torch.tril(torch.ones( - max_seqlen, max_seqlen)).type_as(mask_dense)[-q_len:] - mask_dense = mask_dense[..., -q_len:, :max_seqlen] * causal_mask[None] - if dense_mask_type == "bias": - mask_dense = binary_mask_to_bias(mask_dense) - - return ( - dense_to_crow_col(block_mask_dense_output), - block_mask_dense, - mask_dense, - ) - else: - return ( - dense_to_crow_col(block_mask_dense_output), - block_mask_dense, - None, - ) diff --git a/vllm/attention/selector.py b/vllm/attention/selector.py index 4d4886d02b7..2e3c8638125 100644 --- a/vllm/attention/selector.py +++ b/vllm/attention/selector.py @@ -143,7 +143,6 @@ def get_attn_backend( kv_cache_dtype: Optional[str], block_size: int, is_attention_free: bool, - is_blocksparse: bool = False, use_mla: bool = False, ) -> type[AttentionBackend]: """Selects which attention backend to use and lazily imports it.""" @@ -157,7 +156,6 @@ def get_attn_backend( kv_cache_dtype=kv_cache_dtype, block_size=block_size, is_attention_free=is_attention_free, - is_blocksparse=is_blocksparse, use_v1=envs.VLLM_USE_V1, use_mla=use_mla, ) @@ -170,16 +168,9 @@ def _cached_get_attn_backend( kv_cache_dtype: Optional[str], block_size: int, is_attention_free: bool, - is_blocksparse: bool = False, use_v1: bool = False, use_mla: bool = False, ) -> type[AttentionBackend]: - if is_blocksparse: - logger.info("Using BlocksparseFlashAttention backend.") - from vllm.attention.backends.blocksparse_attn import ( - BlocksparseFlashAttentionBackend) - return BlocksparseFlashAttentionBackend - # If there are no attention layers (e.g. 
we are running Mamba), # use the placeholder NO_ATTENTION if is_attention_free: diff --git a/vllm/model_executor/models/phi3_small.py b/vllm/model_executor/models/phi3_small.py deleted file mode 100644 index 754ddda233f..00000000000 --- a/vllm/model_executor/models/phi3_small.py +++ /dev/null @@ -1,465 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import math -from collections.abc import Iterable -from typing import Optional, Union - -import torch -from torch import nn -from transformers.configuration_utils import PretrainedConfig - -from vllm.attention import Attention -from vllm.config import CacheConfig, VllmConfig -from vllm.distributed import (get_pp_group, get_tensor_model_parallel_rank, - get_tensor_model_parallel_world_size) -from vllm.model_executor.layers.linear import (MergedColumnParallelLinear, - QKVParallelLinear, - RowParallelLinear) -from vllm.model_executor.layers.logits_processor import LogitsProcessor -from vllm.model_executor.layers.quantization import QuantizationConfig -from vllm.model_executor.layers.rotary_embedding import get_rope -from vllm.model_executor.layers.vocab_parallel_embedding import ( - DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) -from vllm.model_executor.model_loader.weight_utils import default_weight_loader -from vllm.model_executor.sampling_metadata import SamplingMetadata -from vllm.platforms import current_platform -from vllm.sequence import IntermediateTensors - -from .interfaces import SupportsPP -from .utils import (AutoWeightsLoader, WeightsMapper, is_pp_missing_parameter, - make_empty_intermediate_tensors_factory, make_layers, - maybe_prefix) - - -def load_column_parallel_weight(param: torch.nn.Parameter, - loaded_weight: torch.Tensor): - tp = get_tensor_model_parallel_world_size() - rk = get_tensor_model_parallel_rank() - assert param.size(0) * tp == loaded_weight.size(0) - s = rk * param.size(0) - e = (rk + 1) * param.size(0) - loaded_weight = loaded_weight[s:e] - assert param.shape == loaded_weight.shape - param.data.copy_(loaded_weight) - - -class HeadMajorQKVParallelLinear(QKVParallelLinear): - - def weight_loader(self, param: torch.nn.Parameter, - loaded_weight: torch.Tensor): - return load_column_parallel_weight(param, loaded_weight) - - -class HeadMajorColumnParallelLinear(MergedColumnParallelLinear): - - def weight_loader(self, param: torch.nn.Parameter, - loaded_weight: torch.Tensor): - return load_column_parallel_weight(param, loaded_weight) - - -@torch.compile(dynamic=True, backend=current_platform.simple_compile_backend) -def quick_gelu(x): - return x * torch.sigmoid(1.702 * x) - - -@torch.compile(dynamic=True, backend=current_platform.simple_compile_backend) -def gegelu(input, limit: Optional[float] = None): - a_gelu, a_linear = input[..., ::2], input[..., 1::2] - if limit is not None: - a_gelu = torch.where(torch.isinf(a_gelu), a_gelu, - a_gelu.clamp(min=None, max=limit)) - a_linear = torch.where( - torch.isinf(a_linear), - a_linear, - a_linear.clamp(min=-limit, max=limit), - ) - out_gelu = quick_gelu(a_gelu) - return out_gelu * (a_linear + 1) - - -class Phi3SmallMLP(nn.Module): - - def __init__( - self, - config: PretrainedConfig, - quant_config: Optional[QuantizationConfig] = None, - ) -> None: - super().__init__() - self.config = config - assert (self.config.hidden_act == "gegelu" - ), "Only `gegelu` is supported for the 4.7 series of models .." 
- self.hidden_size = config.hidden_size - self.gegelu_limit = config.gegelu_limit - self.intermediate_size = config.intermediate_size - - self.up_proj = HeadMajorColumnParallelLinear( - self.hidden_size, - 2 * [self.intermediate_size], - bias=True, - quant_config=quant_config, - ) - self.down_proj = RowParallelLinear( - self.intermediate_size, - self.hidden_size, - bias=True, - quant_config=quant_config, - ) - - def forward(self, x): - gate_up, _ = self.up_proj(x) - x = gegelu(gate_up) - x, _ = self.down_proj(x) - return x - - -class Phi3SmallSelfAttention(nn.Module): - - def __init__( - self, - config: PretrainedConfig, - layer_idx: int, - cache_config: Optional[CacheConfig] = None, - quant_config: Optional[QuantizationConfig] = None, - prefix: str = "", - ) -> None: - super().__init__() - self.layer_idx = layer_idx - self.config = config - self.sparse_block_size = config.blocksparse_block_size - self.homo_heads = config.blocksparse_homo_head_pattern - self.local_blocks = config.blocksparse_num_local_blocks - self.vert_stride = config.blocksparse_vert_stride - - assert (config.blocksparse_block_size == - config.blocksparse_triton_kernel_block_size) - - self.hidden_size = config.hidden_size - # Number of Query Heads - self.num_heads = config.num_attention_heads - - self.head_dim = self.hidden_size // self.num_heads - self.tp_size = get_tensor_model_parallel_world_size() - # Number of total Key Value Heads before tensor parallel - self.num_key_value_heads = config.num_key_value_heads - self.num_q_per_kv = self.num_heads // self.num_key_value_heads - if self.tp_size > 1: - assert self.num_key_value_heads % self.tp_size == 0 - self.num_kv_heads_per_partition = max( - 1, self.num_key_value_heads // self.tp_size) - self.num_heads_per_partition = self.num_heads // self.tp_size - - self.max_position_embeddings = config.max_position_embeddings - self.rope_embedding_base = config.rope_embedding_base - self.rope_position_scale = config.rope_position_scale - self.is_causal = True - - norm_factor = None - if config.mup_use_scaling: - norm_factor = self.head_dim / config.mup_attn_multiplier - else: - norm_factor = math.sqrt(self.head_dim) - self.scale = 1 / norm_factor - - self.query_key_value = HeadMajorQKVParallelLinear( - self.hidden_size, - self.head_dim, - self.num_heads, - self.num_key_value_heads, - bias=True, - quant_config=quant_config, - ) - - self.dense = RowParallelLinear(self.hidden_size, - self.hidden_size, - bias=True, - quant_config=quant_config) - - if getattr(self.config, "rope_scaling", None) is not None: - rope_scaling = self.config.rope_scaling - for key in rope_scaling: - if isinstance(rope_scaling[key], list): - rope_scaling[key] = tuple(rope_scaling[key]) - - if "factor" not in rope_scaling: - rope_scaling["factor"] = self.rope_position_scale - else: - rope_scaling = { - "rope_type": "linear", - "factor": self.rope_position_scale, - } - - self.rotary_emb = get_rope( - self.head_dim, - rotary_dim=self.head_dim, - max_position=self.max_position_embeddings, - base=self.rope_embedding_base, - rope_scaling=rope_scaling, - ) - - # blocksparse params - self.blocksparse_block_size = config.blocksparse_block_size - self.blocksparse_num_local_blocks = config.blocksparse_num_local_blocks - self.blocksparse_vert_stride = config.blocksparse_vert_stride - - use_dense_attn = (getattr(self.config, - "dense_attention_every_n_layers", None) - and (self.layer_idx + 1) % - self.config.dense_attention_every_n_layers == 0) - - bs_params = None - if not use_dense_attn: - bs_params = { - 'max_seqlen': 
self.max_position_embeddings, - 'num_heads': self.num_heads_per_partition, - "num_kv_heads": self.num_kv_heads_per_partition, - "block_size": self.sparse_block_size, - "local_blocks": self.local_blocks, - "vert_stride": self.vert_stride, - "homo_head": self.homo_heads - } - - self.attn = Attention(self.num_heads_per_partition, - self.head_dim, - self.scale, - num_kv_heads=self.num_kv_heads_per_partition, - cache_config=cache_config, - quant_config=quant_config, - blocksparse_params=bs_params, - prefix=f"{prefix}.attn") - - def forward( - self, - positions: torch.Tensor, - hidden_states: torch.Tensor, - ) -> tuple[torch.Tensor, Optional[torch.Tensor], - Optional[tuple[torch.Tensor]]]: - qkv, _ = self.query_key_value(hidden_states) - - qkv = qkv.view(qkv.shape[:-1] + - (-1, (self.num_q_per_kv + 2), self.head_dim)) - q, k, v = qkv.split([self.num_q_per_kv, 1, 1], dim=-2) - - # NOTE: this is required by RotaryEmbed, which indeed does not have to - # TODO: allow 3D QK for rotary forward - q = q.reshape(-1, self.head_dim * self.num_heads_per_partition) - k = k.reshape(-1, self.head_dim * self.num_kv_heads_per_partition) - v = v.reshape(-1, self.head_dim * self.num_kv_heads_per_partition) - - q, k = self.rotary_emb(positions, q, k) - attn_output = self.attn(q, k, v) - output, _ = self.dense(attn_output) - - return output - - -class Phi3SmallDecoderLayer(nn.Module): - - def __init__( - self, - config: PretrainedConfig, - layer_idx: int, - cache_config: Optional[CacheConfig] = None, - quant_config: Optional[QuantizationConfig] = None, - prefix: str = "", - ): - super().__init__() - self.hidden_size = config.hidden_size - self.self_attn = Phi3SmallSelfAttention(config, - layer_idx, - cache_config=cache_config, - quant_config=quant_config, - prefix=f"{prefix}.self_attn") - self.mlp = Phi3SmallMLP(config, quant_config) - - self.input_layernorm = nn.LayerNorm(config.hidden_size, - eps=config.layer_norm_epsilon) - self.post_attention_layernorm = nn.LayerNorm( - config.hidden_size, eps=config.layer_norm_epsilon) - - def forward( - self, - positions: torch.Tensor, - hidden_states: torch.Tensor, - ) -> torch.Tensor: - residual = hidden_states - hidden_states = self.input_layernorm(hidden_states) - - hidden_states = self.self_attn( - positions=positions, - hidden_states=hidden_states, - ) - hidden_states = residual + hidden_states - - residual = hidden_states - hidden_states = self.post_attention_layernorm(hidden_states) - hidden_states = self.mlp(hidden_states) - hidden_states = residual + hidden_states - return hidden_states - - -class Phi3SmallModel(nn.Module): - - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - - config = vllm_config.model_config.hf_config - cache_config = vllm_config.cache_config - quant_config = vllm_config.quant_config - - self.config = config - self.embed_tokens = VocabParallelEmbedding(config.vocab_size, - config.hidden_size) - self.mup_embedding_multiplier = config.mup_embedding_multiplier - self.start_layer, self.end_layer, self.layers = make_layers( - config.num_hidden_layers, - lambda prefix: Phi3SmallDecoderLayer(config, - int(prefix.split('.')[-1]), - cache_config, - quant_config, - prefix=prefix), - prefix=f"{prefix}.layers") - - self.final_layernorm = nn.LayerNorm(config.hidden_size, - eps=config.layer_norm_epsilon) - self.make_empty_intermediate_tensors = ( - make_empty_intermediate_tensors_factory(["hidden_states"], - config.hidden_size)) - - def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: - return 
self.embed_tokens(input_ids) - - def forward( - self, - input_ids: torch.LongTensor, - positions: Optional[torch.LongTensor], - intermediate_tensors: Optional[IntermediateTensors], - inputs_embeds: Optional[torch.Tensor], - ) -> Union[torch.Tensor, IntermediateTensors]: - if get_pp_group().is_first_rank: - if inputs_embeds is not None: - hidden_states = inputs_embeds - else: - hidden_states = self.get_input_embeddings(input_ids) - if (self.mup_embedding_multiplier is not None - and self.mup_embedding_multiplier > 0.0): - hidden_states = hidden_states * self.mup_embedding_multiplier - else: - assert intermediate_tensors - hidden_states = intermediate_tensors["hidden_states"] - for layer in self.layers[self.start_layer:self.end_layer]: - hidden_states = layer(positions, hidden_states) - if not get_pp_group().is_last_rank: - return IntermediateTensors({"hidden_states": hidden_states}) - hidden_states = self.final_layernorm(hidden_states) - return hidden_states - - def load_weights(self, weights: Iterable[tuple[str, - torch.Tensor]]) -> set[str]: - params_dict = dict(self.named_parameters()) - loaded_params: set[str] = set() - for name, loaded_weight in weights: - if name.endswith(".bias") and name not in params_dict: - continue - if is_pp_missing_parameter(name, self): - continue - param = params_dict[name] - weight_loader = getattr(param, "weight_loader", - default_weight_loader) - weight_loader(param, loaded_weight) - loaded_params.add(name) - return loaded_params - - -class Phi3SmallForCausalLM(nn.Module, SupportsPP): - _tied_weights_keys = ["lm_head.weight"] - - hf_to_vllm_mapper = WeightsMapper( - orig_to_new_suffix={"rotary_emb.inv_freq": None}) - - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - config = vllm_config.model_config.hf_config - quant_config = vllm_config.quant_config - self.config = config - self.quant_config = quant_config - self.model = Phi3SmallModel(vllm_config=vllm_config, - prefix=maybe_prefix(prefix, "model")) - self.vocab_size = config.vocab_size - self.mup_width_multiplier = config.mup_width_multiplier - self.lm_head = ParallelLMHead( - self.vocab_size, - config.hidden_size, - org_num_embeddings=config.vocab_size, - padding_size=DEFAULT_VOCAB_PADDING_SIZE, - quant_config=quant_config, - ) - if self.config.tie_word_embeddings: - self.lm_head.weight = self.model.embed_tokens.weight - self.logits_processor = LogitsProcessor(config.vocab_size) - self.make_empty_intermediate_tensors = ( - self.model.make_empty_intermediate_tensors) - - # tokens in tiktoken but not used - if hasattr(config, 'dummy_token_indices'): - device = self.lm_head.weight.device - self.register_buffer('dummy_token_indices', - torch.LongTensor( - config.dummy_token_indices).to(device), - persistent=False) - else: - self.dummy_token_indices = None - - def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: - return self.model.get_input_embeddings(input_ids) - - def set_input_embeddings(self, value): - self.model.embed_tokens = value - - def get_output_embeddings(self): - return self.lm_head - - def set_output_embeddings(self, value): - self.lm_head = value - - def set_decoder(self, decoder): - self.model = decoder - - def get_decoder(self): - return self.model - - def compute_logits( - self, - hidden_states: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[torch.Tensor]: - logits = self.logits_processor(self.lm_head, hidden_states, - sampling_metadata) - if self.dummy_token_indices is not None and logits is not None: - 
logits.index_fill_(-1, self.dummy_token_indices, -torch.inf) - logits = logits / self.mup_width_multiplier - return logits - - def forward( - self, - input_ids: torch.LongTensor, - positions: Optional[torch.LongTensor], - intermediate_tensors: Optional[IntermediateTensors] = None, - inputs_embeds: Optional[torch.Tensor] = None, - ) -> Union[torch.Tensor, IntermediateTensors]: - output_hidden_states = self.model( - input_ids=input_ids, - positions=positions, - intermediate_tensors=intermediate_tensors, - inputs_embeds=inputs_embeds, - ) - output_hidden_states = output_hidden_states - return output_hidden_states - - def load_weights(self, weights: Iterable[tuple[str, - torch.Tensor]]) -> set[str]: - loader = AutoWeightsLoader( - self, - skip_prefixes=(["lm_head.weight"] - if self.config.tie_word_embeddings else None)) - return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 2ca37867b88..3440dd656c5 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -110,7 +110,6 @@ "PersimmonForCausalLM": ("persimmon", "PersimmonForCausalLM"), "PhiForCausalLM": ("phi", "PhiForCausalLM"), "Phi3ForCausalLM": ("phi3", "Phi3ForCausalLM"), - "Phi3SmallForCausalLM": ("phi3_small", "Phi3SmallForCausalLM"), "PhiMoEForCausalLM": ("phimoe", "PhiMoEForCausalLM"), "Phi4FlashForCausalLM": ("phi4flash", "Phi4FlashForCausalLM"), "Plamo2ForCausalLM": ("plamo2", "Plamo2ForCausalLM"), diff --git a/vllm/platforms/interface.py b/vllm/platforms/interface.py index b8e788de11c..1cd5cb5e83d 100644 --- a/vllm/platforms/interface.py +++ b/vllm/platforms/interface.py @@ -57,7 +57,6 @@ class _Backend(enum.Enum): PALLAS = enum.auto() PALLAS_VLLM_V1 = enum.auto() IPEX = enum.auto() - BLOCK_SPARSE_FLASH_ATTN = enum.auto() DUAL_CHUNK_FLASH_ATTN = enum.auto() DIFFERENTIAL_FLASH_ATTN = enum.auto() NO_ATTENTION = enum.auto() diff --git a/vllm/v1/attention/backends/cpu_attn.py b/vllm/v1/attention/backends/cpu_attn.py index d63b82012a5..2efbe0de272 100644 --- a/vllm/v1/attention/backends/cpu_attn.py +++ b/vllm/v1/attention/backends/cpu_attn.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from dataclasses import dataclass -from typing import Any, Optional +from typing import Optional import numpy as np import torch @@ -443,7 +443,6 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, @@ -451,9 +450,6 @@ def __init__( ) -> None: if kv_sharing_target_layer_name is not None: raise NotImplementedError("KV sharing is not supported in V0.") - if blocksparse_params is not None: - raise ValueError( - "Torch SPDA does not support block-sparse attention.") if logits_soft_cap is not None: logger.warning_once("Torch SPDA does not support logits soft cap. 
" "Outputs may be slightly off.") diff --git a/vllm/v1/attention/backends/flash_attn.py b/vllm/v1/attention/backends/flash_attn.py index a37bf2a7115..ad414ee0a1f 100755 --- a/vllm/v1/attention/backends/flash_attn.py +++ b/vllm/v1/attention/backends/flash_attn.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer with FlashAttention.""" from dataclasses import dataclass -from typing import Any, ClassVar, Optional +from typing import ClassVar, Optional import numpy as np import torch @@ -349,15 +349,11 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, use_irope: bool = False, ) -> None: - if blocksparse_params is not None: - raise ValueError( - "FlashAttention does not support block-sparse attention.") self.num_heads = num_heads self.head_size = head_size self.scale = float(scale) diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index 7f3c4ed129c..e1ffa61a600 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -4,7 +4,7 @@ from __future__ import annotations from dataclasses import dataclass -from typing import TYPE_CHECKING, Any, Optional +from typing import TYPE_CHECKING, Optional import torch from flashinfer import (BatchDecodeWithPagedKVCacheWrapper, @@ -490,7 +490,6 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[int] = None, diff --git a/vllm/v1/attention/backends/flex_attention.py b/vllm/v1/attention/backends/flex_attention.py index c229ec12fd1..ad63f92cd88 100644 --- a/vllm/v1/attention/backends/flex_attention.py +++ b/vllm/v1/attention/backends/flex_attention.py @@ -3,7 +3,7 @@ """Attention layer with FlashAttention.""" from collections import defaultdict from dataclasses import dataclass -from typing import Any, Optional +from typing import Optional import torch from torch.nn.attention.flex_attention import (BlockMask, _mask_mod_signature, @@ -342,15 +342,10 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, ) -> None: - if blocksparse_params is not None: - # TODO we should support this :think - raise ValueError( - "FlashAttention does not support block-sparse attention.") self.num_heads = num_heads self.head_size = head_size self.scale = float(scale) diff --git a/vllm/v1/attention/backends/mla/common.py b/vllm/v1/attention/backends/mla/common.py index 93c8156b16a..cf17d933023 100755 --- a/vllm/v1/attention/backends/mla/common.py +++ b/vllm/v1/attention/backends/mla/common.py @@ -190,7 +190,7 @@ import functools from abc import abstractmethod from dataclasses import dataclass, field -from typing import TYPE_CHECKING, Any, Generic, Optional, TypeVar, Union +from typing import TYPE_CHECKING, Generic, Optional, TypeVar, Union import torch @@ -754,7 +754,6 @@ def __init__( alibi_slopes: 
Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str], diff --git a/vllm/v1/attention/backends/mla/cutlass_mla.py b/vllm/v1/attention/backends/mla/cutlass_mla.py index a0f7c39c004..c787f25cd3a 100644 --- a/vllm/v1/attention/backends/mla/cutlass_mla.py +++ b/vllm/v1/attention/backends/mla/cutlass_mla.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import os -from typing import Any, Optional +from typing import Optional import torch @@ -74,7 +74,6 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str], @@ -82,17 +81,14 @@ def __init__( **mla_args) -> None: super().__init__(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, - blocksparse_params, logits_soft_cap, attn_type, + logits_soft_cap, attn_type, kv_sharing_target_layer_name, **mla_args) - unsupported_features = [ - alibi_slopes, sliding_window, blocksparse_params, logits_soft_cap - ] + unsupported_features = [alibi_slopes, sliding_window, logits_soft_cap] if any(unsupported_features): raise NotImplementedError( "CutlassMLAImpl does not support one of the following: " - "alibi_slopes, sliding_window, blocksparse_params, " - "logits_soft_cap") + "alibi_slopes, sliding_window, logits_soft_cap") if attn_type != AttentionType.DECODER: raise NotImplementedError("Encoder self-attention and " diff --git a/vllm/v1/attention/backends/mla/flashmla.py b/vllm/v1/attention/backends/mla/flashmla.py index 935311aacc3..d3e5300dbbd 100644 --- a/vllm/v1/attention/backends/mla/flashmla.py +++ b/vllm/v1/attention/backends/mla/flashmla.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from dataclasses import dataclass -from typing import Any, ClassVar, Optional +from typing import ClassVar, Optional import torch @@ -119,7 +119,6 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str], @@ -127,20 +126,17 @@ def __init__( **mla_args) -> None: super().__init__(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, - blocksparse_params, logits_soft_cap, attn_type, + logits_soft_cap, attn_type, kv_sharing_target_layer_name, **mla_args) assert is_flashmla_supported(), \ "FlashMLA is not supported on this device" - unsupported_features = [ - alibi_slopes, sliding_window, blocksparse_params, logits_soft_cap - ] + unsupported_features = [alibi_slopes, sliding_window, logits_soft_cap] if any(unsupported_features): raise NotImplementedError( "FlashMLAImpl does not support one of the following: " - "alibi_slopes, sliding_window, blocksparse_params, " - "logits_soft_cap") + "alibi_slopes, sliding_window, logits_soft_cap") if attn_type != AttentionType.DECODER: raise NotImplementedError("Encoder self-attention and " diff --git a/vllm/v1/attention/backends/mla/rocm_aiter_mla.py b/vllm/v1/attention/backends/mla/rocm_aiter_mla.py index 42a04258361..834c2345583 100644 --- a/vllm/v1/attention/backends/mla/rocm_aiter_mla.py +++ b/vllm/v1/attention/backends/mla/rocm_aiter_mla.py @@ -2,7 +2,7 
@@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from dataclasses import dataclass -from typing import Any, ClassVar, Optional +from typing import ClassVar, Optional import torch @@ -167,7 +167,6 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str], @@ -175,20 +174,17 @@ def __init__( **mla_args) -> None: super().__init__(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, - blocksparse_params, logits_soft_cap, attn_type, + logits_soft_cap, attn_type, kv_sharing_target_layer_name, **mla_args) assert (num_heads == 16 or num_heads == 128), ( f"Aiter MLA only supports 16 or 128 number of heads.\n" f"Provided {num_heads} number of heads.\n" "Try adjusting tensor_parallel_size value.") - unsupported_features = [ - alibi_slopes, sliding_window, blocksparse_params, logits_soft_cap - ] + unsupported_features = [alibi_slopes, sliding_window, logits_soft_cap] if any(unsupported_features): raise NotImplementedError( "Aiter MLA does not support one of the following: " - "alibi_slopes, sliding_window, blocksparse_params, " - "logits_soft_cap") + "alibi_slopes, sliding_window, logits_soft_cap") from aiter import flash_attn_varlen_func self.flash_attn_varlen_func = flash_attn_varlen_func diff --git a/vllm/v1/attention/backends/mla/triton_mla.py b/vllm/v1/attention/backends/mla/triton_mla.py index 99938f22f10..700fce68953 100644 --- a/vllm/v1/attention/backends/mla/triton_mla.py +++ b/vllm/v1/attention/backends/mla/triton_mla.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Any, Optional +from typing import Optional import torch @@ -42,7 +42,6 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]], logits_soft_cap: Optional[float], attn_type: str, kv_sharing_target_layer_name: Optional[str], @@ -50,17 +49,14 @@ def __init__( **mla_args) -> None: super().__init__(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, - blocksparse_params, logits_soft_cap, attn_type, + logits_soft_cap, attn_type, kv_sharing_target_layer_name, **mla_args) - unsupported_features = [ - alibi_slopes, sliding_window, blocksparse_params, logits_soft_cap - ] + unsupported_features = [alibi_slopes, sliding_window, logits_soft_cap] if any(unsupported_features): raise NotImplementedError( "TritonMLAImpl does not support one of the following: " - "alibi_slopes, sliding_window, blocksparse_params, " - "logits_soft_cap") + "alibi_slopes, sliding_window, logits_soft_cap") if attn_type != AttentionType.DECODER: raise NotImplementedError("Encoder self-attention and " diff --git a/vllm/v1/attention/backends/pallas.py b/vllm/v1/attention/backends/pallas.py index 52e12a1a506..ac7980c79e4 100644 --- a/vllm/v1/attention/backends/pallas.py +++ b/vllm/v1/attention/backends/pallas.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from dataclasses import dataclass -from typing import Any, Optional +from typing import Optional import torch import torch_xla.core.xla_builder as xb @@ -132,7 +132,6 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: 
Optional[dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[int] = None, @@ -142,9 +141,6 @@ def __init__( logger.warning_once( "Using irope in Pallas is not supported yet, it will fall back " "to global attention for long context.") - if blocksparse_params is not None: - raise ValueError("Paged attention Pallas kernel does " - "not support block-sparse attention.") self.num_heads = num_heads self.head_size = head_size self.scale = float(scale) @@ -158,8 +154,6 @@ def __init__( raise NotImplementedError("Alibi slopes is not supported.") if kv_cache_dtype != "auto": raise NotImplementedError("FP8 KV cache dtype is not supported.") - if blocksparse_params is not None: - raise NotImplementedError("Blocksparse is not supported.") if attn_type != AttentionType.DECODER: raise NotImplementedError("Encoder self-attention and " diff --git a/vllm/v1/attention/backends/rocm_aiter_fa.py b/vllm/v1/attention/backends/rocm_aiter_fa.py index 43fe30a9a89..8f756763944 100644 --- a/vllm/v1/attention/backends/rocm_aiter_fa.py +++ b/vllm/v1/attention/backends/rocm_aiter_fa.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer with AiterFlashAttention.""" from dataclasses import dataclass -from typing import Any, Optional +from typing import Optional import torch @@ -334,15 +334,11 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[int] = None, use_irope: bool = False, ) -> None: - if blocksparse_params is not None: - raise ValueError( - "AiterFlashAttention does not support block-sparse attention.") self.num_heads = num_heads self.head_size = head_size self.scale = float(scale) diff --git a/vllm/v1/attention/backends/triton_attn.py b/vllm/v1/attention/backends/triton_attn.py index 79796ac1492..d65ff5ff74e 100644 --- a/vllm/v1/attention/backends/triton_attn.py +++ b/vllm/v1/attention/backends/triton_attn.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer with PagedAttention and Triton prefix prefill.""" from dataclasses import dataclass -from typing import Any, ClassVar, Optional +from typing import ClassVar, Optional import torch @@ -205,15 +205,11 @@ def __init__( alibi_slopes: Optional[list[float]], sliding_window: Optional[int], kv_cache_dtype: str, - blocksparse_params: Optional[dict[str, Any]] = None, logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[int] = None, use_irope: bool = False, ) -> None: - if blocksparse_params is not None: - raise ValueError( - "TritonAttention does not support block-sparse attention.") self.num_heads = num_heads self.head_size = head_size self.scale = float(scale) From cf0595d2c715505f54dfadd6112eaa5e9209a09e Mon Sep 17 00:00:00 2001 From: fhl2000 <63384265+fhl2000@users.noreply.github.com> Date: Sun, 20 Jul 2025 05:13:18 +0800 Subject: [PATCH 210/552] [BugFix] Fix full cuda graph slot_mapping (#21228) Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/v1/worker/gpu_model_runner.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py 
index 1ee9c070226..670e653929c 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -2079,7 +2079,7 @@ def _dummy_run( block_table_tensor=self.input_batch.block_table[ kv_cache_group_id].get_device_tensor()[:num_reqs], slot_mapping=self.input_batch. - block_table[kv_cache_group_id].slot_mapping[:num_reqs]) + block_table[kv_cache_group_id].slot_mapping[:num_tokens]) attn_metadata_i = self.attn_metadata_builders[ kv_cache_group_id].build_for_cudagraph_capture( From e9d85d113850ca236cc695f5082a3ef5e6b2bafb Mon Sep 17 00:00:00 2001 From: Yuxuan Zhang <2448370773@qq.com> Date: Sun, 20 Jul 2025 06:40:31 +0800 Subject: [PATCH 211/552] GLM-4 Update (#20736) Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> Signed-off-by: Isotr0py Signed-off-by: Lu Fang Co-authored-by: Isotr0py Co-authored-by: Lu Fang Signed-off-by: x22x22 --- benchmarks/kernels/benchmark_moe.py | 6 +- .../benchmark_moe_permute_unpermute.py | 1 + docs/models/supported_models.md | 1 + tests/models/registry.py | 7 + tests/tool_use/test_glm4_moe_tool_parser.py | 410 +++++++++++ vllm/config.py | 15 +- .../openai/tool_parsers/__init__.py | 25 +- .../tool_parsers/glm4_moe_tool_parser.py | 402 ++++++++++ vllm/model_executor/models/glm4_moe.py | 685 ++++++++++++++++++ vllm/model_executor/models/glm4_moe_mtp.py | 307 ++++++++ vllm/model_executor/models/registry.py | 2 + vllm/reasoning/__init__.py | 2 + vllm/reasoning/glm4_moe_reasoning_parser.py | 151 ++++ vllm/worker/worker.py | 3 +- 14 files changed, 2006 insertions(+), 11 deletions(-) create mode 100644 tests/tool_use/test_glm4_moe_tool_parser.py create mode 100644 vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py create mode 100644 vllm/model_executor/models/glm4_moe.py create mode 100644 vllm/model_executor/models/glm4_moe_mtp.py create mode 100644 vllm/reasoning/glm4_moe_reasoning_parser.py diff --git a/benchmarks/kernels/benchmark_moe.py b/benchmarks/kernels/benchmark_moe.py index 132c325ce59..c350aaf5d3a 100644 --- a/benchmarks/kernels/benchmark_moe.py +++ b/benchmarks/kernels/benchmark_moe.py @@ -576,7 +576,11 @@ def main(args: argparse.Namespace): topk = config.num_experts_per_tok intermediate_size = config.intermediate_size shard_intermediate_size = 2 * intermediate_size // args.tp_size - elif config.architectures[0] in ("DeepseekV3ForCausalLM", "DeepseekV2ForCausalLM"): + elif config.architectures[0] in ( + "DeepseekV3ForCausalLM", + "DeepseekV2ForCausalLM", + "Glm4MoeForCausalLM", + ): E = config.n_routed_experts topk = config.num_experts_per_tok intermediate_size = config.moe_intermediate_size diff --git a/benchmarks/kernels/benchmark_moe_permute_unpermute.py b/benchmarks/kernels/benchmark_moe_permute_unpermute.py index dba1f3943b9..4ed69009014 100644 --- a/benchmarks/kernels/benchmark_moe_permute_unpermute.py +++ b/benchmarks/kernels/benchmark_moe_permute_unpermute.py @@ -318,6 +318,7 @@ def main(args: argparse.Namespace): elif ( config.architectures[0] == "DeepseekV3ForCausalLM" or config.architectures[0] == "DeepseekV2ForCausalLM" + or config.architectures[0] == "Glm4MoeForCausalLM" ): E = config.n_routed_experts topk = config.num_experts_per_tok diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 250ce53fec3..b3201ce32f7 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -579,6 +579,7 @@ Specified using `--task generate`. | `Gemma3ForConditionalGeneration` | Gemma 3 | T + I+ | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. 
| ✅︎ | ✅︎ | ⚠️ | | `GLM4VForCausalLM`^ | GLM-4V | T + I | `THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Glm4vForConditionalGeneration` | GLM-4.1V-Thinking | T + IE+ + VE+ | `THUDM/GLM-4.1V-9B-Thinking`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `Glm4MoeForCausalLM` | GLM-4.5 | T + IE+ + VE+ | `THUDM/GLM-4.5`, etc. | ✅︎ | ✅︎ | ✅︎ | | `GraniteSpeechForConditionalGeneration` | Granite Speech | T + A | `ibm-granite/granite-speech-3.3-8b` | ✅︎ | ✅︎ | ✅︎ | | `H2OVLChatModel` | H2OVL | T + IE+ | `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc. | | ✅︎ | ✅︎ | | `Idefics3ForConditionalGeneration` | Idefics3 | T + I | `HuggingFaceM4/Idefics3-8B-Llama3`, etc. | ✅︎ | | ✅︎ | diff --git a/tests/models/registry.py b/tests/models/registry.py index 8afac32e1cf..c2f1089af2a 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -360,6 +360,9 @@ def check_available_online( trust_remote_code=True, hf_overrides={"architectures": ["GLM4VForCausalLM"]}), # noqa: E501 "Glm4vForConditionalGeneration": _HfExamplesInfo("THUDM/GLM-4.1V-9B-Thinking", min_transformers_version="4.53"), # noqa: E501 + "Glm4MoeForCausalLM": _HfExamplesInfo("THUDM/GLM-4.5", + min_transformers_version="4.54", + is_available_online=False), # noqa: E501 "H2OVLChatModel": _HfExamplesInfo("h2oai/h2ovl-mississippi-800m", extras={"2b": "h2oai/h2ovl-mississippi-2b"}, # noqa: E501 max_transformers_version="4.48", # noqa: E501 @@ -485,6 +488,10 @@ def check_available_online( is_available_online=False, speculative_model="openbmb/MiniCPM-2B-sft-bf16", tokenizer="openbmb/MiniCPM-2B-sft-bf16"), + "Glm4MoeMTPModel": _HfExamplesInfo("THUDM/GLM-4.5", + speculative_model="THUDM/GLM-4.5", + min_transformers_version="4.54", + is_available_online=False), "MiMoMTPModel": _HfExamplesInfo("XiaomiMiMo/MiMo-7B-RL", trust_remote_code=True, speculative_model="XiaomiMiMo/MiMo-7B-RL") diff --git a/tests/tool_use/test_glm4_moe_tool_parser.py b/tests/tool_use/test_glm4_moe_tool_parser.py new file mode 100644 index 00000000000..478f4b91667 --- /dev/null +++ b/tests/tool_use/test_glm4_moe_tool_parser.py @@ -0,0 +1,410 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# ruff: noqa: E501 + +import json + +import pytest + +from vllm.entrypoints.openai.protocol import FunctionCall, ToolCall +from vllm.entrypoints.openai.tool_parsers import Glm4MoeModelToolParser +from vllm.transformers_utils.tokenizer import get_tokenizer + +pytest.skip("skip glm4_moe parser test", allow_module_level=True) +# Use a common model that is likely to be available +MODEL = "THUDM/GLM-4.5" + + +@pytest.fixture(scope="module") +def glm4_moe_tokenizer(): + return get_tokenizer(tokenizer_name=MODEL) + + +@pytest.fixture +def glm4_moe_tool_parser(glm4_moe_tokenizer): + return Glm4MoeModelToolParser(glm4_moe_tokenizer) + + +def assert_tool_calls(actual_tool_calls: list[ToolCall], + expected_tool_calls: list[ToolCall]): + assert len(actual_tool_calls) == len(expected_tool_calls) + + for actual_tool_call, expected_tool_call in zip(actual_tool_calls, + expected_tool_calls): + assert isinstance(actual_tool_call.id, str) + assert len(actual_tool_call.id) > 0 + + assert actual_tool_call.type == "function" + assert actual_tool_call.function.name == expected_tool_call.function.name + # Compare arguments as JSON objects to handle formatting differences + actual_args = json.loads(actual_tool_call.function.arguments) + expected_args = json.loads(expected_tool_call.function.arguments) + assert actual_args == 
expected_args + + +def test_extract_tool_calls_no_tools(glm4_moe_tool_parser): + model_output = "This is a test" + extracted_tool_calls = glm4_moe_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + assert not extracted_tool_calls.tools_called + assert extracted_tool_calls.tool_calls == [] + assert extracted_tool_calls.content == model_output + + +@pytest.mark.parametrize( + ids=[ + "single_tool_call", + "multiple_tool_calls", + "tool_call_with_content_before", + "tool_call_with_mixed_args", + "tool_call_with_chinese_content", + ], + argnames=["model_output", "expected_tool_calls", "expected_content"], + argvalues=[ + ( + """get_current_weather + city + Dallas + state + TX + unit + fahrenheit + """, + [ + ToolCall(function=FunctionCall( + name="get_current_weather", + arguments=json.dumps({ + "city": "Dallas", + "state": "TX", + "unit": "fahrenheit", + }), + )) + ], + None, + ), + ( + """get_current_weather + city + Dallas + state + TX + unit + fahrenheit + + get_current_weather + city + Orlando + state + FL + unit + fahrenheit + """, + [ + ToolCall(function=FunctionCall( + name="get_current_weather", + arguments=json.dumps({ + "city": "Dallas", + "state": "TX", + "unit": "fahrenheit", + }), + )), + ToolCall(function=FunctionCall( + name="get_current_weather", + arguments=json.dumps({ + "city": "Orlando", + "state": "FL", + "unit": "fahrenheit", + }), + )), + ], + None, + ), + ( + """I'll help you check the weather. get_current_weather + city + Seattle + state + WA + unit + celsius + """, + [ + ToolCall(function=FunctionCall( + name="get_current_weather", + arguments=json.dumps({ + "city": "Seattle", + "state": "WA", + "unit": "celsius", + }), + )) + ], + "I'll help you check the weather.", + ), + ( + """get_current_weather + city + New York + state + NY + unit + celsius + """, + [ + ToolCall(function=FunctionCall( + name="get_current_weather", + arguments=json.dumps({ + "city": "New York", + "state": "NY", + "unit": "celsius", + }), + )) + ], + None, + ), + ("""I will help you get the weather.get_weather + city + Beijing + date + 2025-08-01 + """, [ + ToolCall(function=FunctionCall( + name="get_weather", + arguments=json.dumps({ + "city": "Beijing", + "date": "2025-08-01", + }), + )) + ], "I will help you get the weather."), + ], +) +def test_extract_tool_calls(glm4_moe_tool_parser, model_output, + expected_tool_calls, expected_content): + extracted_tool_calls = glm4_moe_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + assert extracted_tool_calls.tools_called + assert_tool_calls(extracted_tool_calls.tool_calls, expected_tool_calls) + + assert extracted_tool_calls.content == expected_content + + +def test_extract_tool_calls_with_thinking_tags(glm4_moe_tool_parser): + """Test tool extraction when thinking tags are present.""" + model_output = """I want to get the weather. + +I will help you get the weather. +get_weather +city +Beijing +date +2025-08-01 +""" + + extracted_tool_calls = glm4_moe_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + + assert extracted_tool_calls.tools_called + assert len(extracted_tool_calls.tool_calls) == 1 + assert extracted_tool_calls.tool_calls[0].function.name == "get_weather" + + expected_content = """I want to get the weather. 
+ +I will help you get the weather.""" + assert extracted_tool_calls.content == expected_content + + +def test_extract_tool_calls_malformed_xml(glm4_moe_tool_parser): + """Test that malformed XML is handled gracefully.""" + model_output = """get_weather +city +Seattle +incomplete_arg +value +""" + + extracted_tool_calls = glm4_moe_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + + # Should handle malformed XML gracefully + # The parser should either extract what it can or return no tool calls + # depending on how robust we want the parsing to be + assert isinstance(extracted_tool_calls.tools_called, bool) + assert isinstance(extracted_tool_calls.tool_calls, list) + + +def test_extract_tool_calls_empty_arguments(glm4_moe_tool_parser): + """Test tool calls with no arguments.""" + model_output = """get_current_time +""" + + extracted_tool_calls = glm4_moe_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + + assert extracted_tool_calls.tools_called + assert len(extracted_tool_calls.tool_calls) == 1 + assert extracted_tool_calls.tool_calls[ + 0].function.name == "get_current_time" + # Empty arguments should result in empty JSON object + assert extracted_tool_calls.tool_calls[0].function.arguments == "{}" + + +def test_extract_tool_calls_mixed_content(glm4_moe_tool_parser): + """Test extraction with mixed content and multiple tool calls.""" + model_output = """I will help you get the weather info. + +get_weather +city +Beijing +date +2025-08-01 + + +meaningwhile, I will also check the weather in Shanghai. + +get_weather +city +Shanghai +date +2025-08-01 +""" + + extracted_tool_calls = glm4_moe_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + + assert extracted_tool_calls.tools_called + assert len(extracted_tool_calls.tool_calls) == 2 + + # Check first tool call + assert extracted_tool_calls.tool_calls[0].function.name == "get_weather" + args1 = json.loads(extracted_tool_calls.tool_calls[0].function.arguments) + assert args1["city"] == "Beijing" + assert args1["date"] == "2025-08-01" + + # Check second tool call + assert extracted_tool_calls.tool_calls[1].function.name == "get_weather" + args2 = json.loads(extracted_tool_calls.tool_calls[1].function.arguments) + assert args2["city"] == "Shanghai" + assert args2["date"] == "2025-08-01" + + # Content should be everything before the first tool call + assert extracted_tool_calls.content == "I will help you get the weather info." 
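# --- Illustrative sketch (not part of the original patch) ---
# The extraction tests above all boil down to turning GLM-4.5-style
# key/value argument tags into a flat JSON object. The tag names
# (<arg_key>/<arg_value>) and the helper name parse_args_sketch are
# assumptions for illustration; the production logic lives in
# Glm4MoeModelToolParser._parse_arguments.
import json
import re


def parse_args_sketch(args_text: str) -> str:
    # Collect (key, value) pairs and serialize them, mirroring what the
    # parser is expected to return for a single tool call.
    pairs = re.findall(
        r"<arg_key>([^<]+)</arg_key>\s*<arg_value>([^<]*)</arg_value>",
        args_text,
        re.DOTALL,
    )
    return json.dumps({k.strip(): v.strip() for k, v in pairs},
                      ensure_ascii=False)


# Two key/value pairs become one flat JSON object, as the asserts above expect.
assert parse_args_sketch(
    "<arg_key>city</arg_key><arg_value>Dallas</arg_value>"
    "<arg_key>state</arg_key><arg_value>TX</arg_value>"
) == '{"city": "Dallas", "state": "TX"}'
# --- end sketch ---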
+ + +def test_streaming_basic_functionality(glm4_moe_tool_parser): + """Test basic streaming functionality.""" + # Reset streaming state + glm4_moe_tool_parser.current_tool_name_sent = False + glm4_moe_tool_parser.prev_tool_call_arr = [] + glm4_moe_tool_parser.current_tool_id = -1 + glm4_moe_tool_parser.streamed_args_for_tool = [] + + # Test with a simple tool call + current_text = """get_weather +city +Beijing +""" + + # Mock token IDs for testing + tool_call_start_id = glm4_moe_tool_parser.tool_call_start_token_id or 12345 + tool_call_end_id = glm4_moe_tool_parser.tool_call_end_token_id or 12346 + + result = glm4_moe_tool_parser.extract_tool_calls_streaming( + previous_text="", + current_text=current_text, + delta_text="", + previous_token_ids=[], + current_token_ids=[tool_call_start_id, tool_call_end_id], + delta_token_ids=[tool_call_end_id], + request=None, + ) + + # The result behavior depends on the streaming state + # This test mainly ensures no exceptions are thrown + assert result is None or hasattr(result, 'tool_calls') or hasattr( + result, 'content') + + +def test_streaming_no_tool_calls(glm4_moe_tool_parser): + """Test streaming when there are no tool calls.""" + current_text = "This is just regular text without any tool calls." + + result = glm4_moe_tool_parser.extract_tool_calls_streaming( + previous_text="This is just regular text", + current_text=current_text, + delta_text=" without any tool calls.", + previous_token_ids=[], + current_token_ids=[], + delta_token_ids=[], + request=None, + ) + + # Should return the delta text as content + assert result is not None + assert hasattr(result, 'content') + assert result.content == " without any tool calls." + + +def test_streaming_with_content_before_tool_calls(glm4_moe_tool_parser): + """Test streaming when there's content before tool calls.""" + # Reset streaming state + glm4_moe_tool_parser.current_tool_name_sent = False + glm4_moe_tool_parser.prev_tool_call_arr = [] + glm4_moe_tool_parser.current_tool_id = -1 + glm4_moe_tool_parser.streamed_args_for_tool = [] + + current_text = "I will help you get the weather" + + result = glm4_moe_tool_parser.extract_tool_calls_streaming( + previous_text="I will help you", + current_text=current_text, + delta_text="get the weather.", + previous_token_ids=[], + current_token_ids=[], + delta_token_ids=[], + request=None, + ) + + # Should return content when no tool call tokens are detected + assert result is not None + assert hasattr(result, 'content') + assert result.content == "get the weather." 
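# --- Illustrative sketch (not part of the original patch) ---
# A hedged sketch of how a caller could drive the streaming interface the
# tests above exercise: feed cumulative text plus the latest delta and
# collect whatever plain content comes back. Passing empty token-id lists is
# a simplification; the OpenAI-compatible server supplies real token ids.
def stream_deltas_sketch(parser, deltas):
    previous = ""
    collected = []
    for delta in deltas:
        current = previous + delta
        result = parser.extract_tool_calls_streaming(
            previous_text=previous,
            current_text=current,
            delta_text=delta,
            previous_token_ids=[],
            current_token_ids=[],
            delta_token_ids=[],
            request=None,
        )
        # Only non-tool-call content is accumulated in this simplified driver.
        if result is not None and getattr(result, "content", None):
            collected.append(result.content)
        previous = current
    return "".join(collected)
# --- end sketch ---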
+ + +def test_extract_tool_calls_special_characters(glm4_moe_tool_parser): + """Test tool calls with special characters and unicode.""" + model_output = """send_message +recipient +Amy +message +It is a nice day +priority +high +""" + + extracted_tool_calls = glm4_moe_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + + assert extracted_tool_calls.tools_called + assert len(extracted_tool_calls.tool_calls) == 1 + assert extracted_tool_calls.tool_calls[0].function.name == "send_message" + + args = json.loads(extracted_tool_calls.tool_calls[0].function.arguments) + assert args["recipient"] == "Amy" + assert args["message"] == "It is a nice day" + assert args["priority"] == "high" + + +def test_extract_tool_calls_incomplete_tool_call(glm4_moe_tool_parser): + """Test incomplete tool calls (missing closing tag).""" + model_output = """get_weather +city +Beijing +date +2025-08-01""" + + extracted_tool_calls = glm4_moe_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + + # Incomplete tool calls should not be extracted + assert not extracted_tool_calls.tools_called + assert extracted_tool_calls.tool_calls == [] + assert extracted_tool_calls.content == model_output diff --git a/vllm/config.py b/vllm/config.py index adf3fd701a9..f9f8eb38c66 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -1333,7 +1333,8 @@ def get_layers_start_end_indices( self, parallel_config: "ParallelConfig") -> tuple[int, int]: from vllm.distributed.utils import get_pp_indices if (self.hf_text_config.model_type == "deepseek_mtp" - or self.hf_config.model_type == "mimo_mtp"): + or self.hf_config.model_type == "mimo_mtp" + or self.hf_config.model_type == "glm4_moe_mtp"): total_num_hidden_layers = getattr(self.hf_text_config, "num_nextn_predict_layers", 0) else: @@ -2663,7 +2664,15 @@ def hf_config_override(hf_config: PretrainedConfig) -> PretrainedConfig: "n_predict": n_predict, "architectures": ["MiMoMTPModel"] }) - return hf_config + + if hf_config.architectures[0] == "Glm4MoeForCausalLM": + hf_config.model_type = "glm4_moe_mtp" + n_predict = getattr(hf_config, "num_nextn_predict_layers", None) + hf_config.update({ + "num_hidden_layers": 0, + "n_predict": n_predict, + "architectures": ["Glm4MoeMTPModel"] + }) return hf_config @@ -2774,7 +2783,7 @@ def __post_init__(self): "mlp_speculator"): self.method = "mlp_speculator" elif (self.draft_model_config.hf_config.model_type - in ("deepseek_mtp", "mimo_mtp")): + in ("deepseek_mtp", "mimo_mtp", "glm4_moe_mtp")): self.method = "deepseek_mtp" if self.num_speculative_tokens > 1: logger.warning( diff --git a/vllm/entrypoints/openai/tool_parsers/__init__.py b/vllm/entrypoints/openai/tool_parsers/__init__.py index 137375b9707..9eda7155f01 100644 --- a/vllm/entrypoints/openai/tool_parsers/__init__.py +++ b/vllm/entrypoints/openai/tool_parsers/__init__.py @@ -3,6 +3,7 @@ from .abstract_tool_parser import ToolParser, ToolParserManager from .deepseekv3_tool_parser import DeepSeekV3ToolParser +from .glm4_moe_tool_parser import Glm4MoeModelToolParser from .granite_20b_fc_tool_parser import Granite20bFCToolParser from .granite_tool_parser import GraniteToolParser from .hermes_tool_parser import Hermes2ProToolParser @@ -19,10 +20,22 @@ from .xlam_tool_parser import xLAMToolParser __all__ = [ - "ToolParser", "ToolParserManager", "Granite20bFCToolParser", - "GraniteToolParser", "Hermes2ProToolParser", "MistralToolParser", - "Internlm2ToolParser", "Llama3JsonToolParser", "JambaToolParser", - "Llama4PythonicToolParser", 
"PythonicToolParser", "Phi4MiniJsonToolParser", - "DeepSeekV3ToolParser", "xLAMToolParser", "MinimaxToolParser", - "KimiK2ToolParser", "HunyuanA13BToolParser" + "ToolParser", + "ToolParserManager", + "Granite20bFCToolParser", + "GraniteToolParser", + "Hermes2ProToolParser", + "MistralToolParser", + "Internlm2ToolParser", + "Llama3JsonToolParser", + "JambaToolParser", + "Llama4PythonicToolParser", + "PythonicToolParser", + "Phi4MiniJsonToolParser", + "DeepSeekV3ToolParser", + "xLAMToolParser", + "MinimaxToolParser", + "KimiK2ToolParser", + "HunyuanA13BToolParser", + "Glm4MoeModelToolParser", ] diff --git a/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py b/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py new file mode 100644 index 00000000000..c3f9d792357 --- /dev/null +++ b/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py @@ -0,0 +1,402 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# code modified from deepseekv3_tool_parser.py + +from collections.abc import Sequence +from typing import Union + +import regex as re + +from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, + DeltaFunctionCall, DeltaMessage, + DeltaToolCall, + ExtractedToolCallInformation, + FunctionCall, ToolCall) +from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import ( + ToolParser, ToolParserManager) +from vllm.logger import init_logger +from vllm.transformers_utils.tokenizer import AnyTokenizer + +logger = init_logger(__name__) + + +@ToolParserManager.register_module("glm4_moe") +class Glm4MoeModelToolParser(ToolParser): + + def __init__(self, tokenizer: AnyTokenizer): + super().__init__(tokenizer) + self.current_tool_name_sent = False + self.prev_tool_call_arr: list[dict] = [] + self.current_tool_id = -1 + self.streamed_args_for_tool: list[str] = [] + self.tool_call_start_token = "" + self.tool_call_end_token = "" + + self.tool_calls_start_token = self.tool_call_start_token + + # Updated regex for the XML-based format + self.tool_call_regex = re.compile( + r"\s*" + r"(?P[^\n<]+)\s*" # 函数名(到换行或 <) + r"(?P(?:\s*[^<]+\s*" + r"[^<]*\s*)*)\s*" + r"", + re.DOTALL, + ) + + # Regex for parsing individual arguments + self.arg_regex = re.compile( + r"(?P[^<]+)\s*(?P[^<]*)", + re.DOTALL, + ) + + # Streaming regex + self.stream_tool_call_portion_regex = re.compile( + r"(?P[^\n<]+)\s*" + r"(?P(?:\s*[^<]+\s*" + r"[^<]*\s*)*)", + re.DOTALL, + ) + + # For streaming, we also need a regex to match just the function name + self.stream_tool_call_name_regex = re.compile( + r"(?P[^\n<]+)", + re.DOTALL, + ) + + if not self.model_tokenizer: + raise ValueError( + "The model tokenizer must be passed to the ToolParser " + "constructor during construction.") + + self.tool_call_start_token_id = self.vocab.get( + self.tool_call_start_token) + self.tool_call_end_token_id = self.vocab.get(self.tool_call_end_token) + + def _parse_arguments(self, args_text: str) -> str: + """Parse XML-based arguments into JSON format.""" + if not args_text or not args_text.strip(): + return "{}" + + args_dict = {} + matches = self.arg_regex.findall(args_text) + + for key, value in matches: + args_dict[key.strip()] = value.strip() + + import json + return json.dumps(args_dict, ensure_ascii=False) + + def extract_tool_calls( + self, + model_output: str, + request: ChatCompletionRequest, + ) -> ExtractedToolCallInformation: + + # sanity check; avoid unnecessary processing + if self.tool_calls_start_token not in model_output: + return 
ExtractedToolCallInformation(tools_called=False, + tool_calls=[], + content=model_output) + + try: + # Find all tool calls in the output + function_call_matches = self.tool_call_regex.findall(model_output) + + logger.debug("function_call_matches: %s", function_call_matches) + + if not function_call_matches: + return ExtractedToolCallInformation( + tools_called=False, + tool_calls=[], + content=model_output, + ) + + tool_calls = [] + for i, match in enumerate(function_call_matches): + function_name, function_args_xml = match + function_name = function_name.strip() + + # Parse XML arguments to JSON + function_args_json = self._parse_arguments(function_args_xml) + + tool_calls.append( + ToolCall( + id=f"call_{i}", + type='function', + function=FunctionCall(name=function_name, + arguments=function_args_json), + )) + + # Extract content before the first tool call + content = model_output[:model_output.find(self. + tool_calls_start_token)] + return ExtractedToolCallInformation( + tools_called=bool(tool_calls), + tool_calls=tool_calls, + content=content.strip() if content.strip() else None, + ) + + except Exception: + logger.exception("Error in extracting tool call from response.") + return ExtractedToolCallInformation(tools_called=False, + tool_calls=[], + content=model_output) + + def extract_tool_calls_streaming( + self, + previous_text: str, + current_text: str, + delta_text: str, + previous_token_ids: Sequence[int], + current_token_ids: Sequence[int], + delta_token_ids: Sequence[int], + request: ChatCompletionRequest, + ) -> Union[DeltaMessage, None]: + + logger.debug("delta_text: %s", delta_text) + logger.debug("delta_token_ids: %s", delta_token_ids) + # check to see if we should be streaming a tool call - is there a + if self.tool_call_start_token_id not in current_token_ids: + logger.debug("No tool call tokens found!") + return DeltaMessage(content=delta_text) + delta_text = delta_text.replace(self.tool_calls_start_token, + "").replace(self.tool_call_end_token, + "") + try: + + # figure out where we are in the parsing by counting tool call + # start & end tags + prev_tool_start_count = previous_token_ids.count( + self.tool_call_start_token_id) + prev_tool_end_count = previous_token_ids.count( + self.tool_call_end_token_id) + cur_tool_start_count = current_token_ids.count( + self.tool_call_start_token_id) + cur_tool_end_count = current_token_ids.count( + self.tool_call_end_token_id) + tool_call_portion = None + text_portion = None + + # case: if we're generating text, OR rounding out a tool call + if (cur_tool_start_count == cur_tool_end_count + and prev_tool_end_count == cur_tool_end_count + and self.tool_call_end_token not in delta_text): + logger.debug("Generating text content! 
skipping tool parsing.") + return DeltaMessage(content=delta_text) + + if self.tool_call_end_token in delta_text: + logger.debug("tool_call_end_token in delta_text") + full_text = current_text + delta_text + tool_call_portion = full_text.split( + self.tool_call_start_token)[-1].split( + self.tool_call_end_token)[0].rstrip() + delta_text = delta_text.split( + self.tool_call_end_token)[0].rstrip() + text_portion = delta_text.split( + self.tool_call_end_token)[-1].lstrip() + + # case -- we're starting a new tool call + if (cur_tool_start_count > cur_tool_end_count + and cur_tool_start_count > prev_tool_start_count): + if len(delta_token_ids) > 1: + tool_call_portion = current_text.split( + self.tool_call_start_token)[-1] + else: + tool_call_portion = None + delta = None + + text_portion = None + + # set cursors and state appropriately + self.current_tool_id += 1 + self.current_tool_name_sent = False + self.streamed_args_for_tool.append("") + logger.debug("Starting on a new tool %s", self.current_tool_id) + + # case -- we're updating an existing tool call + elif (cur_tool_start_count > cur_tool_end_count + and cur_tool_start_count == prev_tool_start_count): + + # get the portion of the text that's the tool call + tool_call_portion = current_text.split( + self.tool_call_start_token)[-1] + text_portion = None + + # case -- the current tool call is being closed. + elif (cur_tool_start_count == cur_tool_end_count + and cur_tool_end_count >= prev_tool_end_count): + if self.prev_tool_call_arr is None or len( + self.prev_tool_call_arr) == 0: + logger.debug( + "attempting to close tool call, but no tool call") + return None + diff = self.prev_tool_call_arr[self.current_tool_id].get( + "arguments") + if diff: + diff = (diff.encode("utf-8").decode("unicode_escape") + if diff is str else diff) + if '"}' not in delta_text: + return None + end_loc = delta_text.rindex('"}') + diff = delta_text[:end_loc] + '"}' + logger.debug( + "Finishing tool and found diff that had not " + "been streamed yet: %s", + diff, + ) + self.streamed_args_for_tool[self.current_tool_id] += diff + return DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_id, + function=DeltaFunctionCall( + arguments=diff).model_dump(exclude_none=True), + ) + ]) + + # case -- otherwise we're just generating text + else: + text = delta_text.replace(self.tool_call_start_token, "") + text = text.replace(self.tool_call_end_token, "") + delta = DeltaMessage(tool_calls=[], content=text) + return delta + + current_tool_call = dict() + if tool_call_portion: + current_tool_call_matches = ( + self.stream_tool_call_portion_regex.match( + tool_call_portion)) + if current_tool_call_matches: + tool_id, tool_args = (current_tool_call_matches.groups()) + tool_name = tool_id.split('.')[1].split(':')[0] + current_tool_call['id'] = tool_id + current_tool_call["name"] = tool_name + current_tool_call["arguments"] = tool_args + else: + current_tool_call_name_matches = ( + self.stream_tool_call_name_regex.match( + tool_call_portion)) + if current_tool_call_name_matches: + tool_id_str, = current_tool_call_name_matches.groups() + tool_name = tool_id_str.split('.')[1].split(':')[0] + current_tool_call['id'] = tool_id_str + current_tool_call["name"] = tool_name + current_tool_call["arguments"] = "" + else: + logger.debug("Not enough token") + return None + + # case - we haven't sent the tool name yet. If it's available, send + # it. otherwise, wait until it's available. 
+ if not self.current_tool_name_sent: + if current_tool_call is None: + return None + function_name: Union[str, None] = current_tool_call.get("name") + tool_id = current_tool_call.get("id") + if function_name: + self.current_tool_name_sent = True + return DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_id, + type="function", + id=tool_id, + function=DeltaFunctionCall( + name=function_name).model_dump( + exclude_none=True), + ) + ]) + else: + return None + + # case -- otherwise, send the tool call delta + + # if the tool call portion is None, send the delta as text + if tool_call_portion is None: + # if there's text but not tool calls, send that - + # otherwise None to skip chunk + delta = (DeltaMessage( + content=delta_text) if text_portion is not None else None) + return delta + + # now, the nitty-gritty of tool calls + # now we have the portion to parse as tool call. + + logger.debug("Trying to parse current tool call with ID %s", + self.current_tool_id) + + # if we're starting a new tool call, push an empty object in as + # a placeholder for the arguments + if len(self.prev_tool_call_arr) <= self.current_tool_id: + self.prev_tool_call_arr.append({}) + + # main logic for tool parsing here - compare prev. partially-parsed + # JSON to the current partially-parsed JSON + prev_arguments = self.prev_tool_call_arr[self.current_tool_id].get( + "arguments") + cur_arguments = current_tool_call.get("arguments") + + logger.debug("diffing old arguments: %s", prev_arguments) + logger.debug("against new ones: %s", cur_arguments) + + # case -- no arguments have been created yet. skip sending a delta. + if not cur_arguments and not prev_arguments: + logger.debug("Skipping text %s - no arguments", delta_text) + delta = None + + # case -- prev arguments are defined, but non are now. + # probably impossible, but not a fatal error - just keep going + elif not cur_arguments and prev_arguments: + logger.error("should be impossible to have arguments reset " + "mid-call. skipping streaming anything.") + delta = None + + # case -- we now have the first info about arguments available from + # autocompleting the JSON + elif cur_arguments and not prev_arguments: + + delta = DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_id, + function=DeltaFunctionCall( + arguments=cur_arguments).model_dump( + exclude_none=True), + ) + ]) + self.streamed_args_for_tool[ + self.current_tool_id] = cur_arguments + + # last case -- we have an update to existing arguments. 
+ elif cur_arguments and prev_arguments: + if (isinstance(delta_text, str) + and cur_arguments != prev_arguments + and len(cur_arguments) > len(prev_arguments) + and cur_arguments.startswith(prev_arguments)): + delta_arguments = cur_arguments[len(prev_arguments):] + logger.debug("got diff %s", delta_text) + + delta = DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_id, + function=DeltaFunctionCall( + arguments=delta_arguments).model_dump( + exclude_none=True), + ) + ]) + self.streamed_args_for_tool[ + self.current_tool_id] = cur_arguments + else: + delta = None + + # handle saving the state for the current tool into + # the "prev" list for use in diffing for the next iteration + if self.current_tool_id == len(self.prev_tool_call_arr) - 1: + self.prev_tool_call_arr[ + self.current_tool_id] = current_tool_call + else: + self.prev_tool_call_arr.append(current_tool_call) + + return delta + + except Exception: + logger.exception("Error trying to handle streaming tool call.") + return None # do not stream a delta. skip this token ID. diff --git a/vllm/model_executor/models/glm4_moe.py b/vllm/model_executor/models/glm4_moe.py new file mode 100644 index 00000000000..bdca293d21d --- /dev/null +++ b/vllm/model_executor/models/glm4_moe.py @@ -0,0 +1,685 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +# Copyright 2025 The ZhipuAI Team. +# Copyright 2023 The vLLM team. +# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. +# +# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX +# and OPT implementations in this library. It has been modified from its +# original forms to accommodate minor architectural differences compared +# to GPT-NeoX and OPT used by the Meta AI team that trained the model. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Inference-only GLM-4.5 model compatible with HuggingFace weights.""" +import typing +from collections.abc import Callable, Iterable +from typing import Any, Optional, Union + +import torch +from torch import nn +from transformers import PretrainedConfig + +from vllm.attention import Attention +from vllm.compilation.decorators import support_torch_compile +from vllm.config import CacheConfig, VllmConfig, get_current_vllm_config +from vllm.distributed import (get_ep_group, get_pp_group, + get_tensor_model_parallel_world_size) +from vllm.logger import init_logger +from vllm.model_executor.layers.activation import SiluAndMul +from vllm.model_executor.layers.fused_moe import FusedMoE +from vllm.model_executor.layers.layernorm import RMSNorm +from vllm.model_executor.layers.linear import (MergedColumnParallelLinear, + QKVParallelLinear, + ReplicatedLinear, + RowParallelLinear) +from vllm.model_executor.layers.logits_processor import LogitsProcessor +from vllm.model_executor.layers.quantization import QuantizationConfig +from vllm.model_executor.layers.rotary_embedding import get_rope +from vllm.model_executor.layers.vocab_parallel_embedding import ( + ParallelLMHead, VocabParallelEmbedding) +from vllm.model_executor.model_loader.weight_utils import ( + default_weight_loader, maybe_remap_kv_scale_name) +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.sequence import IntermediateTensors + +from .interfaces import SupportsPP +from .utils import (AutoWeightsLoader, PPMissingLayer, is_pp_missing_parameter, + make_empty_intermediate_tensors_factory, make_layers, + maybe_prefix) + +logger = init_logger(__name__) + + +class Glm4MoeMLP(nn.Module): + + def __init__( + self, + hidden_size: int, + intermediate_size: int, + hidden_act: str, + quant_config: Optional[QuantizationConfig] = None, + reduce_results: bool = True, + prefix: str = "", + ) -> None: + super().__init__() + self.gate_up_proj = MergedColumnParallelLinear( + hidden_size, [intermediate_size] * 2, + bias=False, + quant_config=quant_config, + prefix=f"{prefix}.gate_up_proj") + self.down_proj = RowParallelLinear(intermediate_size, + hidden_size, + bias=False, + quant_config=quant_config, + reduce_results=reduce_results, + prefix=f"{prefix}.down_proj") + if hidden_act != "silu": + raise ValueError(f"Unsupported activation: {hidden_act}. " + "Only silu is supported for now.") + self.act_fn = SiluAndMul() + + def forward(self, x): + gate_up, _ = self.gate_up_proj(x) + x = self.act_fn(gate_up) + x, _ = self.down_proj(x) + return x + + +class Glm4MoE(nn.Module): + + def __init__( + self, + config: PretrainedConfig, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + enable_eplb: bool = False, + ): + super().__init__() + self.tp_size = get_tensor_model_parallel_world_size() + self.routed_scaling_factor = config.routed_scaling_factor + + self.ep_group = get_ep_group().device_group + self.ep_rank = self.ep_group.rank() + self.ep_size = self.ep_group.size() + self.n_routed_experts: int = config.n_routed_experts + self.n_shared_experts: int = config.n_shared_experts + + if config.hidden_act != "silu": + raise ValueError(f"Unsupported activation: {config.hidden_act}. 
" + "Only silu is supported for now.") + + self.gate = ReplicatedLinear(config.hidden_size, + config.n_routed_experts, + bias=False, + quant_config=None, + prefix=f"{prefix}.gate") + + # noaux_tc is not set in transformers new config now + self.gate.e_score_correction_bias = (nn.Parameter( + torch.empty(config.n_routed_experts))) + + # Load balancing settings. + vllm_config = get_current_vllm_config() + parallel_config = vllm_config.parallel_config + self.enable_eplb = enable_eplb + + self.n_redundant_experts = parallel_config.num_redundant_experts + self.n_logical_experts = self.n_routed_experts + self.n_physical_experts = (self.n_logical_experts + + self.n_redundant_experts) + self.n_local_physical_experts = self.n_physical_experts // self.ep_size + + self.physical_expert_start = (self.ep_rank * + self.n_local_physical_experts) + self.physical_expert_end = (self.physical_expert_start + + self.n_local_physical_experts) + + self.experts = FusedMoE( + num_experts=config.n_routed_experts, + top_k=config.num_experts_per_tok, + hidden_size=config.hidden_size, + intermediate_size=config.moe_intermediate_size, + reduce_results=False, + renormalize=config.norm_topk_prob, + quant_config=quant_config, + use_grouped_topk=True, + num_expert_group=config.n_group, + topk_group=config.topk_group, + prefix=f"{prefix}.experts", + scoring_func="sigmoid", + e_score_correction_bias=self.gate.e_score_correction_bias, + enable_eplb=self.enable_eplb, + num_redundant_experts=self.n_redundant_experts) + + if config.n_shared_experts is not None: + intermediate_size = (config.moe_intermediate_size * + config.n_shared_experts) + self.shared_experts = Glm4MoeMLP( + hidden_size=config.hidden_size, + intermediate_size=intermediate_size, + hidden_act=config.hidden_act, + quant_config=quant_config, + reduce_results=self.experts.must_reduce_shared_expert_outputs( + ), + prefix=f"{prefix}.shared_experts", + ) + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + num_tokens, hidden_dim = hidden_states.shape + hidden_states = hidden_states.view(-1, hidden_dim) + + if self.n_shared_experts is not None: + shared_output = self.shared_experts(hidden_states) + router_logits, _ = self.gate(hidden_states) + final_hidden_states = self.experts( + hidden_states=hidden_states, + router_logits=router_logits) * self.routed_scaling_factor + if shared_output is not None: + final_hidden_states = final_hidden_states + shared_output + if self.tp_size > 1: + final_hidden_states = ( + self.experts.maybe_all_reduce_tensor_model_parallel( + final_hidden_states)) + return final_hidden_states.view(num_tokens, hidden_dim) + + +class Glm4MoeAttention(nn.Module): + + def __init__( + self, + config: PretrainedConfig, + hidden_size: int, + num_heads: int, + num_kv_heads: int, + rope_theta: float = 10000, + rope_scaling: Optional[dict[str, Any]] = None, + max_position_embeddings: int = 131072, + head_dim: Optional[int] = None, + rms_norm_eps: float = 1e-05, + qkv_bias: bool = False, + use_qk_norm: bool = False, + cache_config: Optional[CacheConfig] = None, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + ) -> None: + super().__init__() + self.hidden_size = hidden_size + tp_size = get_tensor_model_parallel_world_size() + self.total_num_heads = num_heads + assert self.total_num_heads % tp_size == 0 + self.num_heads = self.total_num_heads // tp_size + self.total_num_kv_heads = num_kv_heads + if self.total_num_kv_heads >= tp_size: + # Number of KV heads is greater than TP size, so we partition + # the KV heads across 
multiple tensor parallel GPUs. + assert self.total_num_kv_heads % tp_size == 0 + else: + # Number of KV heads is less than TP size, so we replicate + # the KV heads across multiple tensor parallel GPUs. + assert tp_size % self.total_num_kv_heads == 0 + self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size) + self.head_dim = head_dim or (hidden_size // self.total_num_heads) + self.q_size = self.num_heads * self.head_dim + self.kv_size = self.num_kv_heads * self.head_dim + self.scaling = self.head_dim**-0.5 + self.rope_theta = rope_theta + self.max_position_embeddings = max_position_embeddings + self.use_qk_norm = use_qk_norm + + self.qkv_proj = QKVParallelLinear(hidden_size, + self.head_dim, + self.total_num_heads, + self.total_num_kv_heads, + bias=qkv_bias, + quant_config=quant_config, + prefix=f"{prefix}.qkv_proj") + + self.o_proj = RowParallelLinear(self.total_num_heads * self.head_dim, + hidden_size, + bias=False, + quant_config=quant_config, + prefix=f"{prefix}.o_proj") + + partial_rotary_factor = getattr(config, "partial_rotary_factor", 0.5) + self.rotary_emb = get_rope( + self.head_dim, + rotary_dim=self.head_dim, + max_position=max_position_embeddings, + base=rope_theta, + rope_scaling=rope_scaling, + partial_rotary_factor=partial_rotary_factor, + ) + self.attn = Attention( + self.num_heads, + self.head_dim, + self.scaling, + num_kv_heads=self.num_kv_heads, + cache_config=cache_config, + quant_config=quant_config, + prefix=f"{prefix}.attn", + ) + + if self.use_qk_norm: + self.q_norm = RMSNorm(self.head_dim, eps=rms_norm_eps) + self.k_norm = RMSNorm(self.head_dim, eps=rms_norm_eps) + + def forward( + self, + positions: torch.Tensor, + hidden_states: torch.Tensor, + ) -> torch.Tensor: + qkv, _ = self.qkv_proj(hidden_states) + q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1) + if self.use_qk_norm: + q = self.q_norm(q.reshape(-1, self.num_heads, + self.head_dim)).reshape(q.shape) + k = self.k_norm(k.reshape(-1, self.num_kv_heads, + self.head_dim)).reshape(k.shape) + + q, k = self.rotary_emb(positions, q, k) + attn_output = self.attn(q, k, v) + output, _ = self.o_proj(attn_output) + return output + + +class Glm4MoeDecoderLayer(nn.Module): + + def __init__( + self, + config: PretrainedConfig, + cache_config: Optional[CacheConfig] = None, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + enable_eplb: bool = False, + ) -> None: + super().__init__() + self.hidden_size = config.hidden_size + rope_theta = getattr(config, "rope_theta", 10000) + rope_scaling = getattr(config, "rope_scaling", None) + max_position_embeddings = getattr(config, "max_position_embeddings", + 131072) + # DecoderLayers are created with `make_layers` which passes the prefix + # with the layer's index. 
+ layer_idx = int(prefix.split(sep='.')[-1]) + self.layer_idx = layer_idx + + self.self_attn = Glm4MoeAttention( + config=config, + hidden_size=self.hidden_size, + num_heads=config.num_attention_heads, + num_kv_heads=config.num_key_value_heads, + rope_theta=rope_theta, + rope_scaling=rope_scaling, + max_position_embeddings=max_position_embeddings, + head_dim=config.head_dim, + rms_norm_eps=config.rms_norm_eps, + qkv_bias=config.attention_bias, + cache_config=cache_config, + quant_config=quant_config, + prefix=f"{prefix}.self_attn", + use_qk_norm=config.use_qk_norm, + ) + + if (config.n_routed_experts is not None + and layer_idx >= config.first_k_dense_replace): + self.mlp = Glm4MoE( + config=config, + quant_config=quant_config, + prefix=f"{prefix}.mlp", + enable_eplb=enable_eplb, + ) + else: + self.mlp = Glm4MoeMLP(hidden_size=config.hidden_size, + intermediate_size=config.intermediate_size, + hidden_act=config.hidden_act, + quant_config=quant_config, + prefix=f"{prefix}.mlp") + + self.input_layernorm = RMSNorm(config.hidden_size, + eps=config.rms_norm_eps) + self.post_attention_layernorm = RMSNorm(config.hidden_size, + eps=config.rms_norm_eps) + self.routed_scaling_factor = config.routed_scaling_factor + + def forward( + self, + positions: torch.Tensor, + hidden_states: torch.Tensor, + residual: Optional[torch.Tensor], + ) -> tuple[torch.Tensor, torch.Tensor]: + if residual is None: + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + else: + hidden_states, residual = self.input_layernorm( + hidden_states, residual) + hidden_states = self.self_attn(positions=positions, + hidden_states=hidden_states) + hidden_states, residual = self.post_attention_layernorm( + hidden_states, residual) + hidden_states = self.mlp(hidden_states) + return hidden_states, residual + + +@support_torch_compile +class Glm4MoeModel(nn.Module): + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + + config = vllm_config.model_config.hf_config + cache_config = vllm_config.cache_config + quant_config = vllm_config.quant_config + enable_eplb = vllm_config.parallel_config.enable_eplb + self.config = config + + self.vocab_size = config.vocab_size + + if get_pp_group().is_first_rank: + self.embed_tokens = VocabParallelEmbedding( + config.vocab_size, + config.hidden_size, + quant_config=quant_config, + prefix=f"{prefix}.embed_tokens") + else: + self.embed_tokens = PPMissingLayer() + + self.start_layer, self.end_layer, self.layers = make_layers( + config.num_hidden_layers, + lambda prefix: Glm4MoeDecoderLayer( + config=config, + cache_config=cache_config, + quant_config=quant_config, + prefix=prefix, + enable_eplb=enable_eplb, + ), + prefix=f"{prefix}.layers") + + if get_pp_group().is_last_rank: + self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + else: + self.norm = PPMissingLayer() + self.make_empty_intermediate_tensors = ( + make_empty_intermediate_tensors_factory( + ["hidden_states", "residual"], config.hidden_size)) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.embed_tokens(input_ids) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + if get_pp_group().is_first_rank: + if inputs_embeds is not None: + hidden_states = inputs_embeds + else: + hidden_states = self.get_input_embeddings(input_ids) + residual = 
None + else: + assert intermediate_tensors is not None + hidden_states = intermediate_tensors["hidden_states"] + residual = intermediate_tensors["residual"] + + for i in range(self.start_layer, self.end_layer): + layer = self.layers[i] + hidden_states, residual = layer(positions, hidden_states, residual) + + if not get_pp_group().is_last_rank: + return IntermediateTensors({ + "hidden_states": hidden_states, + "residual": residual + }) + + hidden_states, _ = self.norm(hidden_states, residual) + return hidden_states + + def make_empty_intermediate_tensors( + self, batch_size: int, dtype: torch.dtype, + device: torch.device) -> IntermediateTensors: + return IntermediateTensors({ + "hidden_states": + torch.zeros((batch_size, self.config.hidden_size), + dtype=dtype, + device=device), + "residual": + torch.zeros((batch_size, self.config.hidden_size), + dtype=dtype, + device=device), + }) + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + ("qkv_proj", "q_proj", "q"), + ("qkv_proj", "k_proj", "k"), + ("qkv_proj", "v_proj", "v"), + ("gate_up_proj", "gate_proj", 0), + ("gate_up_proj", "up_proj", 1), + ] + + # Params for weights, fp8 weight scales, fp8 activation scales + # (param_name, weight_name, expert_id, shard_id) + expert_params_mapping = FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.n_routed_experts) + + params_dict = dict(self.named_parameters()) + loaded_params: set[str] = set() + for name, loaded_weight in weights: + spec_layer = get_spec_layer_idx_from_weight_name(self.config, name) + if spec_layer is not None: + continue + for (param_name, weight_name, shard_id) in stacked_params_mapping: + # Skip non-stacked layers and experts (experts handled below). + if weight_name not in name: + continue + # We have mlp.experts[0].gate_proj in the checkpoint. + # Since we handle the experts below in expert_params_mapping, + # we need to skip here BEFORE we update the name, otherwise + # name will be updated to mlp.experts[0].gate_up_proj, which + # will then be updated below in expert_params_mapping + # for mlp.experts[0].gate_gate_up_proj, which breaks load. + if (("mlp.experts." in name) and name not in params_dict): + continue + name = name.replace(weight_name, param_name) + # Skip loading extra bias for GPTQ models. + if name.endswith(".bias") and name not in params_dict: + continue + if is_pp_missing_parameter(name, self): + continue + + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, loaded_weight, shard_id) + break + else: + is_expert_weight = False + for mapping in expert_params_mapping: + param_name, weight_name, expert_id, shard_id = mapping + if weight_name not in name: + continue + + # Anyway, this is an expert weight and should not be + # attempted to load as other weights later + is_expert_weight = True + + # Do not modify `name` since the loop may continue here + # Instead, create a new variable + name_mapped = name.replace(weight_name, param_name) + + if is_pp_missing_parameter(name_mapped, self): + continue + + param = params_dict[name_mapped] + # We should ask the weight loader to return success or not + # here since otherwise we may skip experts with other + # available replicas. 
+ weight_loader = typing.cast(Callable[..., bool], + param.weight_loader) + success = weight_loader(param, + loaded_weight, + name_mapped, + shard_id=shard_id, + expert_id=expert_id, + return_success=True) + if success: + name = name_mapped + break + else: + if is_expert_weight: + # We've checked that this is an expert weight + # However it's not mapped locally to this rank + # So we simply skip it + continue + + # Skip loading extra bias for GPTQ models. + if name.endswith(".bias") and name not in params_dict: + continue + + # Remapping the name of FP8 kv-scale. + name = maybe_remap_kv_scale_name(name, params_dict) + if name is None: + continue + + if is_pp_missing_parameter(name, self): + continue + + param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + + return loaded_params + + +class Glm4MoeForCausalLM(nn.Module, SupportsPP): + packed_modules_mapping = { + "qkv_proj": [ + "q_proj", + "k_proj", + "v_proj", + ], + "gate_up_proj": [ + "gate_proj", + "up_proj", + ], + } + + fall_back_to_pt_during_load = False + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + config = vllm_config.model_config.hf_config + quant_config = vllm_config.quant_config + self.config = config + self.quant_config = quant_config + self.model = Glm4MoeModel(vllm_config=vllm_config, + prefix=maybe_prefix(prefix, "model")) + if get_pp_group().is_last_rank: + self.lm_head = ParallelLMHead(config.vocab_size, + config.hidden_size, + quant_config=quant_config) + else: + self.lm_head = PPMissingLayer() + if self.config.tie_word_embeddings: + self.lm_head.weight = self.model.embed_tokens.weight + self.logits_processor = LogitsProcessor(config.vocab_size) + self.make_empty_intermediate_tensors = ( + self.model.make_empty_intermediate_tensors) + self.expert_weights = [] + + # Set MoE hyperparameters + self.num_moe_layers = (config.num_hidden_layers - + config.first_k_dense_replace) + self.num_expert_groups = config.n_group + + self.moe_layers: list[FusedMoE] = [] + for layer in self.model.layers: + assert isinstance(layer, Glm4MoeDecoderLayer) + if isinstance(layer.mlp, Glm4MoE): + self.moe_layers.append(layer.mlp.experts) + + # Pick last one layer since the first ones may be dense layers. + example_moe = typing.cast( + Glm4MoE, self.model.layers[config.num_hidden_layers - 1].mlp) + self.num_logical_experts = example_moe.n_logical_experts + self.num_physical_experts = example_moe.n_physical_experts + self.num_local_physical_experts = example_moe.n_local_physical_experts + self.num_routed_experts = example_moe.n_routed_experts + self.num_shared_experts = example_moe.n_shared_experts + self.num_redundant_experts = example_moe.n_redundant_experts + + def set_eplb_state( + self, + expert_load_view: torch.Tensor, + logical_to_physical_map: torch.Tensor, + logical_replica_count: torch.Tensor, + ) -> None: + for layer_idx, layer in enumerate(self.moe_layers): + # Register the expert weights. 
+ self.expert_weights.append(layer.get_expert_weights()) + layer.set_eplb_state( + moe_layer_idx=layer_idx, + expert_load_view=expert_load_view, + logical_to_physical_map=logical_to_physical_map, + logical_replica_count=logical_replica_count, + ) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.model.get_input_embeddings(input_ids) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + hidden_states = self.model(input_ids, positions, intermediate_tensors, + inputs_embeds) + return hidden_states + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + logits = self.logits_processor(self.lm_head, hidden_states, + sampling_metadata) + return logits + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + loader = AutoWeightsLoader(self) + return loader.load_weights(weights) + + +def get_spec_layer_idx_from_weight_name(config: PretrainedConfig, + weight_name: str) -> Optional[int]: + if hasattr(config, + "num_nextn_predict_layers") and (config.num_nextn_predict_layers + > 0): + layer_idx = config.num_hidden_layers + for i in range(config.num_nextn_predict_layers): + if f"layers.{layer_idx+i}." in weight_name: + return layer_idx + i + return None diff --git a/vllm/model_executor/models/glm4_moe_mtp.py b/vllm/model_executor/models/glm4_moe_mtp.py new file mode 100644 index 00000000000..0624640054d --- /dev/null +++ b/vllm/model_executor/models/glm4_moe_mtp.py @@ -0,0 +1,307 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +# Copyright 2025 The ZhipuAI Team. +# Copyright 2023 The vLLM team. +# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. +# +# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX +# and OPT implementations in this library. It has been modified from its +# original forms to accommodate minor architectural differences compared +# to GPT-NeoX and OPT used by the Meta AI team that trained the model. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Inference-only GLM-4.5 MTP model compatible with HuggingFace weights.""" + +from collections.abc import Iterable +from typing import Optional + +import torch +import torch.nn as nn +from transformers import PretrainedConfig + +from vllm.config import CacheConfig, VllmConfig +from vllm.model_executor.layers.fused_moe import FusedMoE +from vllm.model_executor.layers.layernorm import RMSNorm +from vllm.model_executor.layers.logits_processor import LogitsProcessor +from vllm.model_executor.layers.quantization import QuantizationConfig +from vllm.model_executor.layers.vocab_parallel_embedding import ( + ParallelLMHead, VocabParallelEmbedding) +from vllm.model_executor.model_loader.weight_utils import default_weight_loader +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.sequence import IntermediateTensors + +from .glm4_moe import Glm4MoeDecoderLayer, get_spec_layer_idx_from_weight_name +from .interfaces import SupportsPP +from .utils import maybe_prefix + + +class SharedHead(nn.Module): + + def __init__( + self, + config: PretrainedConfig, + quant_config: Optional[QuantizationConfig] = None, + ) -> None: + super().__init__() + self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + self.head = ParallelLMHead(config.vocab_size, + config.hidden_size, + quant_config=quant_config) + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + return self.norm(hidden_states) + + +class Glm4MoeMultiTokenPredictorLayer(nn.Module): + + def __init__( + self, + config: PretrainedConfig, + prefix: str, + cache_config: Optional[CacheConfig] = None, + quant_config: Optional[QuantizationConfig] = None, + ) -> None: + super().__init__() + self.enorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + self.hnorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + self.eh_proj = nn.Linear(config.hidden_size * 2, + config.hidden_size, + bias=False) + self.shared_head = SharedHead(config=config, quant_config=quant_config) + self.mtp_block = Glm4MoeDecoderLayer(config=config, + cache_config=cache_config, + quant_config=quant_config, + prefix=prefix) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + previous_hidden_states: torch.Tensor, + inputs_embeds: Optional[torch.Tensor] = None, + spec_step_index: int = 0, + ) -> torch.Tensor: + assert inputs_embeds is not None + # masking inputs at position 0, as not needed by MTP + inputs_embeds[positions == 0] = 0 + inputs_embeds = self.enorm(inputs_embeds) + previous_hidden_states = self.hnorm(previous_hidden_states) + + hidden_states = self.eh_proj( + torch.cat([inputs_embeds, previous_hidden_states], dim=-1)) + + hidden_states, residual = self.mtp_block(positions=positions, + hidden_states=hidden_states, + residual=None) + hidden_states = residual + hidden_states + return hidden_states + + +class Glm4MoeMultiTokenPredictor(nn.Module): + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + config = vllm_config.model_config.hf_config + self.mtp_start_layer_idx = config.num_hidden_layers + self.num_mtp_layers = config.num_nextn_predict_layers + # to map the exact layer index from weights + self.layers = torch.nn.ModuleDict({ + str(idx): + Glm4MoeMultiTokenPredictorLayer( + config, + f"{prefix}.layers.{idx}", + cache_config=vllm_config.cache_config, + quant_config=vllm_config.quant_config, + ) + for idx in range(self.mtp_start_layer_idx, + self.mtp_start_layer_idx + self.num_mtp_layers) + }) + self.embed_tokens = VocabParallelEmbedding( + 
config.vocab_size, + config.hidden_size, + ) + self.logits_processor = LogitsProcessor(config.vocab_size) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + previous_hidden_states: torch.Tensor, + inputs_embeds: Optional[torch.Tensor] = None, + spec_step_idx: int = 0, + ) -> torch.Tensor: + if inputs_embeds is None: + inputs_embeds = self.embed_tokens(input_ids) + current_step_idx = (spec_step_idx % self.num_mtp_layers) + return self.layers[str(self.mtp_start_layer_idx + current_step_idx)]( + input_ids, + positions, + previous_hidden_states, + inputs_embeds, + current_step_idx, + ) + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + spec_step_idx: int = 0, + ) -> torch.Tensor: + current_step_idx = (spec_step_idx % self.num_mtp_layers) + mtp_layer = self.layers[str(self.mtp_start_layer_idx + + current_step_idx)] + logits = self.logits_processor(mtp_layer.shared_head.head, + mtp_layer.shared_head(hidden_states), + sampling_metadata) + return logits + + +class Glm4MoeMTP(nn.Module, SupportsPP): + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + self.config = vllm_config.model_config.hf_config + self.model = Glm4MoeMultiTokenPredictor(vllm_config=vllm_config, + prefix=maybe_prefix( + prefix, "model")) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + previous_hidden_states: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + spec_step_idx: int = 0, + ) -> torch.Tensor: + hidden_states = self.model(input_ids, positions, + previous_hidden_states, inputs_embeds, + spec_step_idx) + return hidden_states + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + spec_step_idx: int = 0, + ) -> Optional[torch.Tensor]: + return self.model.compute_logits(hidden_states, sampling_metadata, + spec_step_idx) + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + ("qkv_proj", "q_proj", "q"), + ("qkv_proj", "k_proj", "k"), + ("qkv_proj", "v_proj", "v"), + ("gate_up_proj", "gate_proj", 0), + ("gate_up_proj", "up_proj", 1), + ] + + # Params for weights, fp8 weight scales, fp8 activation scales + # (param_name, weight_name, expert_id, shard_id) + expert_params_mapping = FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.n_routed_experts) + + params_dict = dict(self.named_parameters()) + loaded_params: set[str] = set() + for name, loaded_weight in weights: + spec_layer = get_spec_layer_idx_from_weight_name(self.config, name) + if spec_layer is None: + continue + name = self._rewrite_spec_layer_name(spec_layer, name) + for (param_name, weight_name, shard_id) in stacked_params_mapping: + # Skip non-stacked layers and experts (experts handled below). + if weight_name not in name: + continue + # We have mlp.experts[0].gate_proj in the checkpoint. + # Since we handle the experts below in expert_params_mapping, + # we need to skip here BEFORE we update the name, otherwise + # name will be updated to mlp.experts[0].gate_up_proj, which + # will then be updated below in expert_params_mapping + # for mlp.experts[0].gate_gate_up_proj, which breaks load. + if (("mlp.experts." 
in name) and name not in params_dict): + continue + name = name.replace(weight_name, param_name) + # Skip loading extra bias for GPTQ models. + if name.endswith(".bias") and name not in params_dict: + continue + + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, loaded_weight, shard_id) + break + else: + for mapping in expert_params_mapping: + param_name, weight_name, expert_id, shard_id = mapping + if weight_name not in name: + continue + name = name.replace(weight_name, param_name) + + param = params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, + loaded_weight, + name, + shard_id=shard_id, + expert_id=expert_id) + break + else: + # Skip loading extra bias for GPTQ models. + if name.endswith(".bias") and name not in params_dict: + continue + + # According to DeepSeek-V3 Technical Report, MTP modules + # shares embedding layer. We only load the first weights. + if (spec_layer != self.model.mtp_start_layer_idx + and ".layers" not in name): + continue + + param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + return loaded_params + + def _rewrite_spec_layer_name(self, spec_layer: int, name: str) -> str: + """ + Rewrite the weight name to match the format of the original model. + Add .mtp_block for modules in transformer layer block for spec layer + and rename shared layer weights to be top level. + """ + spec_layer_weight_names = [ + "embed_tokens", "enorm", "hnorm", "eh_proj", "shared_head" + ] + shared_weight_names = ["embed_tokens"] + spec_layer_weight = False + shared_weight = False + for weight_name in spec_layer_weight_names: + if weight_name in name: + spec_layer_weight = True + if weight_name in shared_weight_names: + shared_weight = True + break + if not spec_layer_weight: + # treat rest weights as weights for transformer layer block + name = name.replace(f"model.layers.{spec_layer}.", + f"model.layers.{spec_layer}.mtp_block.") + elif shared_weight: + # treat shared weights as top level weights + name = name.replace(f"model.layers.{spec_layer}.", "model.") + return name diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 3440dd656c5..b57130ec84c 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -67,6 +67,7 @@ "Gemma3nForConditionalGeneration": ("gemma3n", "Gemma3nForConditionalGeneration"), # noqa: E501 "GlmForCausalLM": ("glm", "GlmForCausalLM"), "Glm4ForCausalLM": ("glm4", "Glm4ForCausalLM"), + "Glm4MoeForCausalLM": ("glm4_moe", "Glm4MoeForCausalLM"), "GPT2LMHeadModel": ("gpt2", "GPT2LMHeadModel"), "GPTBigCodeForCausalLM": ("gpt_bigcode", "GPTBigCodeForCausalLM"), "GPTJForCausalLM": ("gpt_j", "GPTJForCausalLM"), @@ -244,6 +245,7 @@ "EagleMiniCPMForCausalLM": ("minicpm_eagle", "EagleMiniCPMForCausalLM"), "Eagle3LlamaForCausalLM": ("llama_eagle3", "Eagle3LlamaForCausalLM"), "DeepSeekMTPModel": ("deepseek_mtp", "DeepSeekMTP"), + "Glm4MoeMTPModel": ("glm4_moe_mtp", "Glm4MoeMTP"), "MedusaModel": ("medusa", "Medusa"), # Temporarily disabled. # # TODO(woosuk): Re-enable this once the MLP Speculator is supported in V1. 
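Note: the registry entries above (`Glm4MoeForCausalLM`, `Glm4MoeMTPModel`) together with the `glm4_moe` reasoning parser added below are what make GLM-4.5 usable end to end through the OpenAI-compatible server. A minimal client-side sketch of consuming the parsed reasoning output follows; the model path and port are placeholders, and it assumes the server was started with `--reasoning-parser glm4_moe` so that text inside the model's think start/end tokens is surfaced as `reasoning_content`:

```python
# Sketch only: assumes `vllm serve <glm-4.5-checkpoint> --reasoning-parser glm4_moe`
# is already running locally; the model name and base_url are placeholders.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="<glm-4.5-checkpoint>",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)

msg = resp.choices[0].message
# With the glm4_moe parser enabled, the reasoning span is returned separately
# from the final answer instead of being inlined in `content`.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("content:", msg.content)
```

The same split applies to streaming responses: `extract_reasoning_content_streaming` in the parser below routes each delta to either `reasoning_content` or `content`, depending on whether the think end token has already been emitted.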
diff --git a/vllm/reasoning/__init__.py b/vllm/reasoning/__init__.py index 3e5485b883f..bae593c1dff 100644 --- a/vllm/reasoning/__init__.py +++ b/vllm/reasoning/__init__.py @@ -3,6 +3,7 @@ from .abs_reasoning_parsers import ReasoningParser, ReasoningParserManager from .deepseek_r1_reasoning_parser import DeepSeekR1ReasoningParser +from .glm4_moe_reasoning_parser import Glm4MoeModelReasoningParser from .granite_reasoning_parser import GraniteReasoningParser from .hunyuan_a13b_reasoning_parser import HunyuanA13BReasoningParser from .qwen3_reasoning_parser import Qwen3ReasoningParser @@ -14,4 +15,5 @@ "GraniteReasoningParser", "HunyuanA13BReasoningParser", "Qwen3ReasoningParser", + "Glm4MoeModelReasoningParser", ] diff --git a/vllm/reasoning/glm4_moe_reasoning_parser.py b/vllm/reasoning/glm4_moe_reasoning_parser.py new file mode 100644 index 00000000000..6511fb49d10 --- /dev/null +++ b/vllm/reasoning/glm4_moe_reasoning_parser.py @@ -0,0 +1,151 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +from collections.abc import Sequence +from typing import Optional, Union + +from transformers import PreTrainedTokenizerBase + +from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, + DeltaMessage) +from vllm.logger import init_logger +from vllm.reasoning import ReasoningParser, ReasoningParserManager + +logger = init_logger(__name__) + + +@ReasoningParserManager.register_module("glm4_moe") +class Glm4MoeModelReasoningParser(ReasoningParser): + """ + Reasoning parser for the Glm4MoeModel model. + + The Glm4MoeModel model uses ... tokens to denote reasoning + text within its output. The model provides a strict switch to disable + reasoning output via the 'enable_thinking=False' parameter. This parser + extracts the reasoning content enclosed by and tokens + from the model's output. + """ + + def __init__(self, tokenizer: PreTrainedTokenizerBase): + super().__init__(tokenizer) + self.think_start_token = "" + self.think_end_token = "" + + if not self.model_tokenizer: + raise ValueError( + "The model tokenizer must be passed to the ReasoningParser " + "constructor during construction.") + + self.think_start_token_id = self.vocab.get(self.think_start_token) + self.think_end_token_id = self.vocab.get(self.think_end_token) + if (self.think_start_token_id is None + or self.think_end_token_id is None): + raise RuntimeError( + "Glm4MoeModel reasoning parser could not locate " + "think start/end tokens in the tokenizer!") + + def is_reasoning_end(self, input_ids: list[int]) -> bool: + return self.think_end_token_id in input_ids + + def extract_content_ids(self, input_ids: list[int]) -> list[int]: + """ + Extract the content after the end tokens + """ + if self.think_end_token_id not in input_ids[:-1]: + return [] + else: + return input_ids[input_ids.index(self.think_end_token_id) + 1:] + + def extract_reasoning_content_streaming( + self, + previous_text: str, + current_text: str, + delta_text: str, + previous_token_ids: Sequence[int], + current_token_ids: Sequence[int], + delta_token_ids: Sequence[int], + ) -> Union[DeltaMessage, None]: + """ + Extract reasoning content from a delta message. + Handles streaming output where previous + delta = current. + Uses token IDs for faster processing. 
+ For text abcxyz: + - 'abc' goes to reasoning_content + - 'xyz' goes to content + """ + # Skip single special tokens + if len(delta_token_ids) == 1 and (delta_token_ids[0] in [ + self.think_start_token_id, self.think_end_token_id + ]): + return None + + if self.think_start_token_id in previous_token_ids: + if self.think_end_token_id in delta_token_ids: + # in previous, in delta, + # extract reasoning content + end_index = delta_text.find(self.think_end_token) + reasoning_content = delta_text[:end_index] + content = delta_text[end_index + len(self.think_end_token):] + return DeltaMessage(reasoning_content=reasoning_content, + content=content if content else None) + elif self.think_end_token_id in previous_token_ids: + # in previous, in previous, + # reasoning content continues + return DeltaMessage(content=delta_text) + else: + # in previous, no in previous or delta, + # reasoning content continues + return DeltaMessage(reasoning_content=delta_text) + elif self.think_start_token_id in delta_token_ids: + if self.think_end_token_id in delta_token_ids: + # in delta, in delta, extract reasoning content + start_index = delta_text.find(self.think_start_token) + end_index = delta_text.find(self.think_end_token) + reasoning_content = delta_text[start_index + + len(self.think_start_token + ):end_index] + content = delta_text[end_index + len(self.think_end_token):] + return DeltaMessage(reasoning_content=reasoning_content, + content=content if content else None) + else: + # in delta, no in delta, + # reasoning content continues + return DeltaMessage(reasoning_content=delta_text) + else: + # thinking is disabled, just content + return DeltaMessage(content=delta_text) + + def extract_reasoning_content( + self, model_output: str, request: ChatCompletionRequest + ) -> tuple[Optional[str], Optional[str]]: + """ + Extract reasoning content from the model output. + + For text abcxyz: + - 'abc' goes to reasoning_content + - 'xyz' goes to content + + Returns: + tuple[Optional[str], Optional[str]]: reasoning content and content + """ + + # Check if the model output contains the and tokens. + if (self.think_start_token not in model_output + or self.think_end_token not in model_output): + return None, model_output + # Check if the is present in the model output, remove it + # if it is present. + model_output_parts = model_output.partition(self.think_start_token) + model_output = model_output_parts[2] if model_output_parts[ + 1] else model_output_parts[0] + # Check if the model output contains the tokens. + # If the end token is not found, return the model output as is. + if self.think_end_token not in model_output: + return None, model_output + + # Extract reasoning content from the model output. + reasoning_content, _, content = model_output.partition( + self.think_end_token) + + final_content = content or None + return reasoning_content, final_content diff --git a/vllm/worker/worker.py b/vllm/worker/worker.py index b2926dbd185..6b6943d7643 100644 --- a/vllm/worker/worker.py +++ b/vllm/worker/worker.py @@ -77,7 +77,8 @@ def __init__( "mlp_speculator", "eagle", "deepseek_mtp", - "mimo_mtp")) \ + "glm4_moe_mtp", + "mimo_mtp")) \ else {"return_hidden_states": True} ModelRunnerClass: Type[GPUModelRunnerBase] = ModelRunner From ccb828bc953eb6b05ee358a23260c41e5b770c4d Mon Sep 17 00:00:00 2001 From: Thomas Parnell Date: Sun, 20 Jul 2025 01:09:58 +0200 Subject: [PATCH 212/552] [Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. 
(#21233) Signed-off-by: Thomas Parnell Signed-off-by: x22x22 --- docs/usage/v1_guide.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/usage/v1_guide.md b/docs/usage/v1_guide.md index 12150cf2a82..498ff3da0ca 100644 --- a/docs/usage/v1_guide.md +++ b/docs/usage/v1_guide.md @@ -107,12 +107,11 @@ to enable simultaneous generation and embedding using the same engine instance i Models using selective state-space mechanisms instead of standard transformer attention are partially supported. Models that use Mamba-2 layers (e.g., `Mamba2ForCausalLM`) are supported, but models that use older Mamba-1 layers (e.g., `MambaForCausalLM`, `JambaForCausalLM`) are not yet supported. Please note that these models currently require -enforcing eager mode and disabling prefix caching in V1. +disabling prefix caching in V1. Models that combine Mamba-2 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`, `Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`). Please note that -these models currently require enforcing eager mode, disabling prefix caching, and using the FlashInfer attention -backend in V1. +these models currently require disabling prefix caching and using the FlashInfer attention backend in V1. #### Encoder-Decoder Models From 0c9680de1d357fb4c01c4e9eacb89d3e7fadbb41 Mon Sep 17 00:00:00 2001 From: Chengji Yao Date: Sat, 19 Jul 2025 20:01:00 -0700 Subject: [PATCH 213/552] [TPU] support fp8 kv cache quantization (#19292) Signed-off-by: Chengji Yao Signed-off-by: x22x22 --- tests/entrypoints/llm/test_accuracy.py | 40 +++++++++++++----- tests/v1/tpu/test_pallas.py | 2 + vllm/engine/arg_utils.py | 8 ++-- vllm/platforms/tpu.py | 4 +- vllm/v1/attention/backends/pallas.py | 58 ++++++++++++++++++++++---- vllm/v1/worker/tpu_model_runner.py | 11 ++--- 6 files changed, 95 insertions(+), 28 deletions(-) diff --git a/tests/entrypoints/llm/test_accuracy.py b/tests/entrypoints/llm/test_accuracy.py index 30a666d4c39..6c5706d1634 100644 --- a/tests/entrypoints/llm/test_accuracy.py +++ b/tests/entrypoints/llm/test_accuracy.py @@ -15,15 +15,18 @@ from vllm.platforms import current_platform MODEL_NAMES = [ - "Qwen/Qwen2-1.5B-Instruct", + "Qwen/Qwen3-1.7B", "google/gemma-3-1b-it", ] +FP8_KV_MODEL_NAMES = [ + "Qwen/Qwen3-1.7B", +] NUM_CONCURRENT = 500 TASK = "gsm8k" FILTER = "exact_match,strict-match" RTOL = 0.03 EXPECTED_VALUES = { - "Qwen/Qwen2-1.5B-Instruct": 0.58, + "Qwen/Qwen3-1.7B": 0.68, "google/gemma-3-1b-it": 0.25, } @@ -70,10 +73,9 @@ def test_lm_eval_accuracy_v1_engine(model, monkeypatch: pytest.MonkeyPatch): if current_platform.is_tpu(): # Limit compilation time for TPU V1 - if model == "google/gemma-3-1b-it": - # TPU + google/gemma-3-1b-it + xet doesn't work well. 
- m.setenv("HF_HUB_DISABLE_XET", "1") - + # xet doesn't work well for both Qwen/Qwen3-1.7B and + # google/gemma-3-1b-it + m.setenv("HF_HUB_DISABLE_XET", "1") more_args = "max_model_len=2048,max_num_seqs=64" # Add TP test (if provided) @@ -83,9 +85,27 @@ def test_lm_eval_accuracy_v1_engine(model, monkeypatch: pytest.MonkeyPatch): run_test(model, more_args) -def test_lm_eval_accuracy_v0_engine(monkeypatch: pytest.MonkeyPatch): - """Run with the V0 Engine.""" +@pytest.mark.skipif(not current_platform.is_cuda() + and not current_platform.is_tpu(), + reason="V1 is currently only supported on CUDA and TPU") +@pytest.mark.parametrize("model", FP8_KV_MODEL_NAMES) +def test_lm_eval_accuracy_v1_engine_fp8_kv_cache( + model, monkeypatch: pytest.MonkeyPatch): + """Run with the V1 Engine.""" with monkeypatch.context() as m: - m.setenv("VLLM_USE_V1", "0") - run_test("Qwen/Qwen2-1.5B-Instruct") + m.setenv("VLLM_USE_V1", "1") + + more_args = None + if current_platform.is_tpu(): + # Limit compilation time for TPU V1 + + # xet doesn't work well for Qwen/Qwen3-1.7B + m.setenv("HF_HUB_DISABLE_XET", "1") + more_args = "max_model_len=2048,max_num_seqs=128,kv_cache_dtype=fp8" + + # Add TP test (if provided) + if TPU_TP_TEST_STR: + more_args += ",{}".format(TPU_TP_TEST_STR) + + run_test(model, more_args) diff --git a/tests/v1/tpu/test_pallas.py b/tests/v1/tpu/test_pallas.py index df89133170b..bfba3af57f7 100644 --- a/tests/v1/tpu/test_pallas.py +++ b/tests/v1/tpu/test_pallas.py @@ -95,4 +95,6 @@ class FakeAttentionLayer: sm_scale=scale, sliding_window=sliding_window, soft_cap=logits_soft_cap, + k_scale=1.0, + v_scale=1.0, ) diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 1ca4917de26..019ff033eda 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -1358,10 +1358,10 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: and not envs.is_set("VLLM_ATTENTION_BACKEND") ) or envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" supported = False - if current_platform.is_rocm() or ( - current_platform.is_cuda() - and current_platform.is_device_capability(100) - ): # handle hpu also for OOT platform + if (current_platform.is_rocm() + or (current_platform.is_cuda() + and current_platform.is_device_capability(100)) + or current_platform.is_tpu()): supported = True elif fp8_attention and will_use_fa: from vllm.attention.utils.fa_utils import ( diff --git a/vllm/platforms/tpu.py b/vllm/platforms/tpu.py index 5ec3be908e7..febc6ae4662 100644 --- a/vllm/platforms/tpu.py +++ b/vllm/platforms/tpu.py @@ -35,7 +35,9 @@ class TpuPlatform(Platform): device_control_env_var: str = "TPU_VISIBLE_CHIPS" simple_compile_backend: str = "openxla" - supported_quantization: list[str] = ["tpu_int8", "compressed-tensors"] + supported_quantization: list[str] = [ + "fp8", "tpu_int8", "compressed-tensors" + ] additional_env_vars: list[str] = [ "TPU_CHIPS_PER_HOST_BOUNDS", "TPU_HOST_BOUNDS" diff --git a/vllm/v1/attention/backends/pallas.py b/vllm/v1/attention/backends/pallas.py index ac7980c79e4..9307cd937d5 100644 --- a/vllm/v1/attention/backends/pallas.py +++ b/vllm/v1/attention/backends/pallas.py @@ -24,6 +24,19 @@ # TPU requires the head size to be a multiple of 128. TPU_HEAD_SIZE_ALIGNMENT = 128 +# Note: TPU can fp8 as storage dtype but doesn't support converting from uint8 +# from to fp32 directly. 
That's why it has a dtype mapping different from GPU +TPU_STR_DTYPE_TO_TORCH_DTYPE = { + "half": torch.half, + "bfloat16": torch.bfloat16, + "float": torch.float, + "fp8": torch.float8_e4m3fn, + "fp8_e4m3": torch.float8_e4m3fn, + "fp8_e5m2": torch.float8_e5m2, + "int8": torch.int8, + "uint8": torch.uint8, +} + class PallasAttentionBackend(AttentionBackend): @@ -152,8 +165,6 @@ def __init__( self.num_queries_per_kv = self.num_heads // self.num_kv_heads if alibi_slopes is not None: raise NotImplementedError("Alibi slopes is not supported.") - if kv_cache_dtype != "auto": - raise NotImplementedError("FP8 KV cache dtype is not supported.") if attn_type != AttentionType.DECODER: raise NotImplementedError("Encoder self-attention and " @@ -161,6 +172,11 @@ def __init__( "are not implemented for " "PallasAttentionBackendImpl") + self.kv_cache_quantized_dtype = None + if kv_cache_dtype != "auto": + self.kv_cache_quantized_dtype = TPU_STR_DTYPE_TO_TORCH_DTYPE.get( + kv_cache_dtype.lower().strip()) + def forward( self, layer: AttentionLayer, @@ -194,7 +210,6 @@ def forward( output = torch.ones_like(query) return output - assert layer._k_scale_float == 1.0 and layer._v_scale_float == 1.0 num_tokens, hidden_size = query.shape query = query.view(num_tokens, self.num_heads, self.head_size) key = key.view(-1, self.num_kv_heads, self.head_size) @@ -215,10 +230,21 @@ def forward( # Skip this if sharing KV cache with an earlier attention layer. slot_mapping = attn_metadata.slot_mapping write_to_kv_cache( - key, value, kv_cache, slot_mapping, + key, + value, + kv_cache, + slot_mapping, attn_metadata.num_slices_per_kv_cache_update_block, - attn_metadata.num_kv_update_slices) - + attn_metadata.num_kv_update_slices, + self.kv_cache_quantized_dtype, + layer._k_scale_float, + layer._v_scale_float, + ) + + if self.kv_cache_quantized_dtype is not None and ( + layer._k_scale_float == 0.0 or layer._v_scale_float == 0.0): + raise ValueError( + "k_scale_float and v_scale_float must be non-zero") output = torch.ops.xla.ragged_paged_attention( query, kv_cache, @@ -236,6 +262,8 @@ def forward( sm_scale=self.scale, sliding_window=self.sliding_window, soft_cap=self.logits_soft_cap, + k_scale=layer._k_scale_float, + v_scale=layer._v_scale_float, ) if self.head_size % TPU_HEAD_SIZE_ALIGNMENT != 0: @@ -251,18 +279,32 @@ def write_to_kv_cache( slot_mapping: torch.Tensor, num_slices_per_kv_cache_update_block: int, num_kv_update_slices: torch.Tensor, + kv_cache_quantized_dtype: Optional[torch.dtype] = None, + k_scale: float = 1.0, + v_scale: float = 1.0, ) -> None: """ Write the key and values to the KV cache. 
Args: - key: shape = [num_tokens, num_kv_heads * head_size] - value: shape = [num_tokens, num_kv_heads * head_size] + key: shape = [num_tokens, num_kv_heads, head_size] + value: shape = [num_tokens, num_kv_heads, head_size] kv_cache = [num_blocks, block_size, num_kv_heads * 2, head_size] num_slices_per_kv_cache_update_block: int """ _, page_size, num_combined_kv_heads, head_size = kv_cache.shape head_size = cdiv(head_size, TPU_HEAD_SIZE_ALIGNMENT) * TPU_HEAD_SIZE_ALIGNMENT + + if kv_cache_quantized_dtype is not None: + dtype_info = torch.finfo(kv_cache_quantized_dtype) + key = key.to(torch.float32) / k_scale + # NOTE: clamp is added here to avoid out of range of quantized dtype + key = torch.clamp(key, dtype_info.min, dtype_info.max) + key = key.to(kv_cache_quantized_dtype) + value = value.to(torch.float32) / v_scale + value = torch.clamp(value, dtype_info.min, dtype_info.max) + value = value.to(kv_cache_quantized_dtype) + kv = torch.cat([key, value], axis=-1).reshape(-1, num_combined_kv_heads, head_size) diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 1b55e5d61aa..7ed1cf41011 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -32,9 +32,10 @@ from vllm.multimodal.utils import group_mm_inputs_by_modality from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors -from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, LayerBlockType, cdiv, - is_pin_memory_available, prev_power_of_2) -from vllm.v1.attention.backends.pallas import (PallasAttentionBackend, +from vllm.utils import (LayerBlockType, cdiv, is_pin_memory_available, + prev_power_of_2) +from vllm.v1.attention.backends.pallas import (TPU_STR_DTYPE_TO_TORCH_DTYPE, + PallasAttentionBackend, PallasMetadata, get_page_size_bytes) from vllm.v1.core.encoder_cache_manager import compute_encoder_budget @@ -142,11 +143,11 @@ def __init__( if cache_config.cache_dtype == "auto": model_dtype = self.dtype if isinstance(model_dtype, str): - self.kv_cache_dtype = STR_DTYPE_TO_TORCH_DTYPE[model_dtype] + self.kv_cache_dtype = TPU_STR_DTYPE_TO_TORCH_DTYPE[model_dtype] else: self.kv_cache_dtype = model_dtype else: - self.kv_cache_dtype = STR_DTYPE_TO_TORCH_DTYPE[ + self.kv_cache_dtype = TPU_STR_DTYPE_TO_TORCH_DTYPE[ cache_config.cache_dtype] self._hidden_states_dtype = self.dtype From 4b1514205afd8121521cf52d1c02536a1f555194 Mon Sep 17 00:00:00 2001 From: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Date: Sat, 19 Jul 2025 20:22:02 -0700 Subject: [PATCH 214/552] Enable v1 metrics tests (#20953) Signed-off-by: Seiji Eicher Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 1 + tests/v1/metrics/test_ray_metrics.py | 18 ++++++++++++------ vllm/v1/metrics/ray_wrappers.py | 8 +++++++- 3 files changed, 20 insertions(+), 7 deletions(-) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index 7f1848b4bfb..114c48dba53 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -264,6 +264,7 @@ steps: - pytest -v -s v1/structured_output - pytest -v -s v1/spec_decode - pytest -v -s v1/kv_connector/unit + - pytest -v -s v1/metrics - pytest -v -s v1/test_serial_utils.py - pytest -v -s v1/test_utils.py - pytest -v -s v1/test_oracle.py diff --git a/tests/v1/metrics/test_ray_metrics.py b/tests/v1/metrics/test_ray_metrics.py index 0898ae65e7c..92f6c6f0e89 100644 --- a/tests/v1/metrics/test_ray_metrics.py +++ b/tests/v1/metrics/test_ray_metrics.py @@ -1,8 +1,11 @@ # SPDX-License-Identifier: 
Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import os + import pytest import ray +from vllm.config import ModelDType from vllm.sampling_params import SamplingParams from vllm.v1.engine.async_llm import AsyncEngineArgs, AsyncLLM from vllm.v1.metrics.ray_wrappers import RayPrometheusStatLogger @@ -27,7 +30,7 @@ def use_v1_only(monkeypatch): def test_engine_log_metrics_ray( example_prompts, model: str, - dtype: str, + dtype: ModelDType, max_tokens: int, ) -> None: """ Simple smoke test, verifying this can be used without exceptions. @@ -37,11 +40,14 @@ def test_engine_log_metrics_ray( class EngineTestActor: async def run(self): - engine_args = AsyncEngineArgs( - model=model, - dtype=dtype, - disable_log_stats=False, - ) + # Set environment variable inside the Ray actor since environment + # variables from pytest fixtures don't propagate to Ray actors + os.environ['VLLM_USE_V1'] = '1' + + engine_args = AsyncEngineArgs(model=model, + dtype=dtype, + disable_log_stats=False, + enforce_eager=True) engine = AsyncLLM.from_engine_args( engine_args, stat_loggers=[RayPrometheusStatLogger]) diff --git a/vllm/v1/metrics/ray_wrappers.py b/vllm/v1/metrics/ray_wrappers.py index cce692d6c09..8384310062d 100644 --- a/vllm/v1/metrics/ray_wrappers.py +++ b/vllm/v1/metrics/ray_wrappers.py @@ -51,7 +51,13 @@ class RayGaugeWrapper(RayPrometheusMetric): def __init__(self, name: str, documentation: Optional[str] = "", - labelnames: Optional[list[str]] = None): + labelnames: Optional[list[str]] = None, + multiprocess_mode: Optional[str] = ""): + + # All Ray metrics are keyed by WorkerId, so multiprocess modes like + # "mostrecent", "all", "sum" do not apply. This logic can be manually + # implemented at the observability layer (Prometheus/Grafana). 
+ del multiprocess_mode labelnames_tuple = tuple(labelnames) if labelnames else None self.metric = ray_metrics.Gauge(name=name, description=documentation, From 1462881533aaac5df5600d616f2c9e87932a2137 Mon Sep 17 00:00:00 2001 From: Calvin Chen Date: Sun, 20 Jul 2025 16:15:50 +0800 Subject: [PATCH 215/552] [Model] use AutoWeightsLoader for bart (#18299) Signed-off-by: calvin chen <120380290@qq.com> Signed-off-by: x22x22 --- vllm/model_executor/models/bart.py | 172 ++++++++++++----------------- 1 file changed, 71 insertions(+), 101 deletions(-) diff --git a/vllm/model_executor/models/bart.py b/vllm/model_executor/models/bart.py index a0ec12674f1..3d328c88ff6 100644 --- a/vllm/model_executor/models/bart.py +++ b/vllm/model_executor/models/bart.py @@ -46,7 +46,7 @@ from vllm.sequence import IntermediateTensors from .interfaces import SupportsQuant, SupportsV0Only -from .utils import maybe_prefix +from .utils import AutoWeightsLoader, WeightsMapper, maybe_prefix logger = logging.get_logger(__name__) @@ -700,7 +700,8 @@ def forward( class BartModel(nn.Module, SupportsQuant): _tied_weights_keys = [ - "encoder.embed_tokens.weight", "decoder.embed_tokens.weight" + "encoder.embed_tokens.weight", + "decoder.embed_tokens.weight", ] def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): @@ -763,10 +764,54 @@ def forward(self, input_ids: torch.Tensor, positions: torch.Tensor, return decoder_outputs + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + stacked_params_mapping = [ + # (param_name, shard_name, shard_id) + ("qkv_proj", "q_proj", "q"), + ("qkv_proj", "k_proj", "k"), + ("qkv_proj", "v_proj", "v"), + ] + + other_weights = [] + loaded_stacked_params = [] + model_params_dict = dict(self.named_parameters()) + + for name, loaded_weight in weights: + for (param_name, weight_name, shard_id) in stacked_params_mapping: + if weight_name not in name: + continue + name = name.replace(weight_name, param_name) + if name not in model_params_dict: + continue + param = model_params_dict[name] + weight_loader = param.weight_loader + weight_loader(param, loaded_weight, shard_id) + loaded_stacked_params.append(name) + break + else: + if name in model_params_dict: + other_weights.append((name, loaded_weight)) + + loader = AutoWeightsLoader(self) + loaded_params = loader.load_weights(other_weights) + loaded_params.update(loaded_stacked_params) + return loaded_params + class BartForConditionalGeneration(nn.Module, SupportsV0Only, SupportsQuant): - packed_modules_mapping = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]} - base_model_prefix = "model" + hf_to_vllm_mapper = WeightsMapper( + orig_to_new_prefix={ + "decoder.": "model.decoder.", + "encoder.": "model.encoder.", + "shared.": "model.shared." 
+ }, + orig_to_new_substr={ + "beta": "bias", + "gamma": "weight", + "LayerNorm": "layernorm", + }, + ) def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): @@ -789,7 +834,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.lm_head = BartParallelLMHead(config.vocab_size, config.d_model, embed_scale=embed_scale) - self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, config.vocab_size) @@ -828,61 +872,12 @@ def compute_logits( sampling_metadata) return logits - stacked_params_mapping = { - "q_proj": { - "param_name": "qkv_proj", - "shard_id": "q", - }, - "k_proj": { - "param_name": "qkv_proj", - "shard_id": "k", - }, - "v_proj": { - "param_name": "qkv_proj", - "shard_id": "v", - }, - } - - params_mapping = { - "beta": "bias", - "gamma": "weight", - "LayerNorm": "layernorm", - } - - def _rename_key(self, key: str): - prefix = f"{self.base_model_prefix}." - key = key[len(prefix):] if key.startswith(prefix) else key - - for src, dst in self.params_mapping.items(): - key = key.replace(src, dst) - - return key - - def _rename_stacked_param( - self, - name: str, - ) -> tuple[str, Optional[str]]: - for key, mapping in self.stacked_params_mapping.items(): - if key in name: - name = name.replace(key, mapping["param_name"]) - return name, mapping["shard_id"] - return name, None - - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): - - model_params_dict = dict(self.model.named_parameters()) - top_params_dict = dict(self.named_parameters()) - + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: weights_tuple_list = list(weights) shared_embedding_weight = None - shared_embedding_shard_id = None - for name, loaded_weight in weights_tuple_list: - - name = self._rename_key(name) - name, shard_id = self._rename_stacked_param(name) - if ('shared.weight' in name or 'encoder.embed_tokens.weight' in name or 'decoder.embed_tokens.weight' in name @@ -890,49 +885,24 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): assert shared_embedding_weight is None, ( "Conflicting embedding weights.") shared_embedding_weight = loaded_weight - shared_embedding_shard_id = shard_id - else: - # Skip the specific downstream task weight. - if name.startswith('cls.'): - continue - # use Pooler instead. - if name.startswith('pooler.'): - continue - # Skip loading extra bias for GPTQ models. 
- if name.endswith(".bias") and name not in model_params_dict: - continue - param = model_params_dict[name] - weight_loader = getattr(param, "weight_loader", - default_weight_loader) - if shard_id: - weight_loader(param, loaded_weight, shard_id) - else: - weight_loader(param, loaded_weight) - - # Assign shared weight values - encoder_in_param = model_params_dict['encoder.embed_tokens.weight'] - encoder_in_weight_loader = getattr(encoder_in_param, "weight_loader", - default_weight_loader) - - decoder_in_param = model_params_dict['decoder.embed_tokens.weight'] - decoder_in_weight_loader = getattr(decoder_in_param, "weight_loader", - default_weight_loader) - - lm_head_in_param = top_params_dict['lm_head.weight'] - lm_head_in_weight_loader = getattr(lm_head_in_param, "weight_loader", - default_weight_loader) - - assert shared_embedding_weight is not None - - if shared_embedding_shard_id: - encoder_in_weight_loader(encoder_in_param, shared_embedding_weight, - shared_embedding_shard_id) - decoder_in_weight_loader(decoder_in_param, shared_embedding_weight, - shared_embedding_shard_id) - lm_head_in_weight_loader(lm_head_in_param, shared_embedding_weight, - shared_embedding_shard_id) - else: - encoder_in_weight_loader(encoder_in_param, shared_embedding_weight) - decoder_in_weight_loader(decoder_in_param, shared_embedding_weight) - lm_head_in_weight_loader(lm_head_in_param, shared_embedding_weight) + loader = AutoWeightsLoader( + self, + skip_prefixes=(["cls.", "pooler."]), + ) + loaded_params = loader.load_weights(weights_tuple_list, + mapper=self.hf_to_vllm_mapper) + + if shared_embedding_weight is not None: + weight_loader = getattr(self.lm_head.weight, "weight_loader", + default_weight_loader) + weight_loader(self.lm_head.weight, shared_embedding_weight) + + self.model.encoder.embed_tokens.weight = self.lm_head.weight + self.model.decoder.embed_tokens.weight = self.lm_head.weight + loaded_params.update({ + 'model.encoder.embed_tokens.weight', 'lm_head.weight', + 'model.decoder.embed_tokens.weight' + }) + + return loaded_params From 2b53bfbce2b9103dcc3b6a17330274a78538e0b8 Mon Sep 17 00:00:00 2001 From: Raushan Turganbay Date: Sun, 20 Jul 2025 15:25:50 +0200 Subject: [PATCH 216/552] [Model] Support VLMs with transformers backend (#20543) Signed-off-by: raushan Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Isotr0py Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py Co-authored-by: Cyrus Leung Signed-off-by: x22x22 --- docs/models/supported_models.md | 9 +- .../multimodal/generation/test_common.py | 75 +++ tests/models/registry.py | 1 + vllm/config.py | 39 +- vllm/model_executor/model_loader/utils.py | 49 +- vllm/model_executor/models/registry.py | 12 +- vllm/model_executor/models/transformers.py | 527 ++++++++++++++++-- 7 files changed, 625 insertions(+), 87 deletions(-) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index b3201ce32f7..57ba132b91d 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -18,7 +18,7 @@ These models are what we list in [supported-text-models][supported-text-models] ### Transformers -vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models are supported, and vision language model support is planned! +vLLM also supports model implementations that are available in Transformers. 
This does not currently work for all models, but most decoder language models and common vision language models are supported! Vision-language models currently accept only image inputs, and require setting `--disable_mm_preprocessor_cache` when running. Support for video inputs and caching of multi-modal preprocessors will be added in future releases. To check if the modeling backend is Transformers, you can simply do this: @@ -28,7 +28,7 @@ llm = LLM(model=..., task="generate") # Name or path of your model llm.apply_model(lambda model: print(type(model))) ``` -If it is `TransformersForCausalLM` then it means it's based on Transformers! +If it is `TransformersForCausalLM` or `TransformersForMultimodalLM` then it means it's based on Transformers! !!! tip You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference](../serving/offline_inference.md) or `--model-impl transformers` for the [openai-compatible-server](../serving/openai_compatible_server.md). @@ -36,6 +36,9 @@ If it is `TransformersForCausalLM` then it means it's based on Transformers! !!! note vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM. +!!! note + In case of vision language models if you are loading with `dtype="auto"`, vLLM loads the whole model with config's `dtype` if it exists. In contrast the native Transformers will respect the `dtype` attribute of each backbone in the model. That might cause a slight difference in performance. + #### Custom models If a model is neither supported natively by vLLM or Transformers, it can still be used in vLLM! @@ -99,7 +102,7 @@ Here is what happens in the background when this model is loaded: 1. The config is loaded. 2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`. -3. `MyModel` is loaded into `TransformersForCausalLM` (see ) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used. +3. `MyModel` is loaded into `TransformersForCausalLM` or `TransformersForMultimodalLM` (see ) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used. That's it! diff --git a/tests/models/multimodal/generation/test_common.py b/tests/models/multimodal/generation/test_common.py index 98461676aa4..9859ac5a89d 100644 --- a/tests/models/multimodal/generation/test_common.py +++ b/tests/models/multimodal/generation/test_common.py @@ -35,6 +35,8 @@ REQUIRES_V0_MODELS = [ # V1 Test: not enough KV cache space in C1. 
"fuyu", + # V1 Test: Deadlock issue when processing mm_inputs + "llava-onevision-transformers", ] # yapf: disable @@ -170,6 +172,79 @@ hf_output_post_proc=model_utils.ultravox_trunc_hf_output, marks=[pytest.mark.core_model, pytest.mark.cpu_model], ), + #### Transformers fallback to test + ## To reduce test burden, we only test batching arbitrary image size + # Dynamic image length and number of patches + "llava-onevision-transformers": VLMTestInfo( + models=["llava-hf/llava-onevision-qwen2-0.5b-ov-hf"], + test_type=VLMTestType.IMAGE, + prompt_formatter=lambda vid_prompt: f"<|im_start|>user\n{vid_prompt}<|im_end|>\n<|im_start|>assistant\n", # noqa: E501 + max_model_len=16384, + hf_model_kwargs=model_utils.llava_onevision_hf_model_kwargs("llava-hf/llava-onevision-qwen2-0.5b-ov-hf"), # noqa: E501 + auto_cls=AutoModelForImageTextToText, + vllm_output_post_proc=model_utils.llava_onevision_vllm_to_hf_output, + image_size_factors=[(0.25, 0.5, 1.0)], + vllm_runner_kwargs={ + "model_impl": "transformers", + "disable_mm_preprocessor_cache": True, + "enable_prefix_caching": False, + }, + marks=[pytest.mark.core_model], + ), + # FIXME(Isotr0py): Enable this test after + # https://github.com/huggingface/transformers/pull/39470 released + # "idefics3-transformers": VLMTestInfo( + # models=["HuggingFaceTB/SmolVLM-256M-Instruct"], + # test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE), + # prompt_formatter=lambda img_prompt:f"<|begin_of_text|>User:{img_prompt}\nAssistant:", # noqa: E501 + # img_idx_to_prompt=lambda idx: "", + # max_model_len=8192, + # max_num_seqs=2, + # auto_cls=AutoModelForImageTextToText, + # hf_output_post_proc=model_utils.idefics3_trunc_hf_output, + # image_size_factors=[(0.25, 0.5, 1.0)], + # vllm_runner_kwargs={ + # "model_impl": "transformers", + # "disable_mm_preprocessor_cache": True, + # "enable_prefix_caching": False, + # }, + # marks=[pytest.mark.core_model], + # ), + # Pixel values from processor are not 4D or 5D arrays + "qwen2_5_vl-transformers": VLMTestInfo( + models=["Qwen/Qwen2.5-VL-3B-Instruct"], + test_type=VLMTestType.IMAGE, + prompt_formatter=lambda img_prompt: f"<|im_start|>User\n{img_prompt}<|im_end|>\n<|im_start|>assistant\n", # noqa: E501 + img_idx_to_prompt=lambda idx: "<|vision_start|><|image_pad|><|vision_end|>", # noqa: E501 + max_model_len=4096, + max_num_seqs=2, + auto_cls=AutoModelForImageTextToText, + vllm_output_post_proc=model_utils.qwen2_vllm_to_hf_output, + image_size_factors=[(0.25, 0.2, 0.15)], + vllm_runner_kwargs={ + "model_impl": "transformers", + "disable_mm_preprocessor_cache": True, + "enable_prefix_caching": False, + }, + marks=[large_gpu_mark(min_gb=32)], + ), + # Check "auto" with fallback to transformers + "internvl-transformers": VLMTestInfo( + models=["OpenGVLab/InternVL3-1B-hf"], + test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE), + prompt_formatter=lambda img_prompt: f"<|im_start|>User\n{img_prompt}<|im_end|>\n<|im_start|>Assistant\n", # noqa: E501 + img_idx_to_prompt=lambda idx: "", + max_model_len=4096, + use_tokenizer_eos=True, + image_size_factors=[(0.25, 0.5, 1.0)], + vllm_runner_kwargs={ + "model_impl": "auto", + "disable_mm_preprocessor_cache": True, + "enable_prefix_caching": False, + }, + auto_cls=AutoModelForImageTextToText, + marks=[pytest.mark.core_model], + ), #### Extended model tests "aria": VLMTestInfo( models=["rhymes-ai/Aria"], diff --git a/tests/models/registry.py b/tests/models/registry.py index c2f1089af2a..19725acd6c4 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ 
-499,6 +499,7 @@ def check_available_online( _TRANSFORMERS_MODELS = { "TransformersForCausalLM": _HfExamplesInfo("ArthurZ/Ilama-3.2-1B", trust_remote_code=True), # noqa: E501 + "TransformersForMultimodalLM": _HfExamplesInfo("OpenGVLab/InternVL3-1B-hf"), } _EXAMPLE_MODELS = { diff --git a/vllm/config.py b/vllm/config.py index f9f8eb38c66..73e88b13bc5 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -562,6 +562,10 @@ def __post_init__(self) -> None: self.task = "embed" + model_info, arch = self.registry.inspect_model_cls(self.architectures) + self._model_info = model_info + self._architecture = arch + all_supported_tasks = self._get_supported_tasks(self.task) logger.debug("Tasks supported by runner type: %s", all_supported_tasks) supported_runner_types = self._get_supported_runner_types( @@ -587,10 +591,6 @@ def __post_init__(self) -> None: else: self.truncation_side = "right" - model_info, arch = self.registry.inspect_model_cls(self.architectures) - self._model_info = model_info - self._architecture = arch - self.pooler_config = self._init_pooler_config() self.dtype = _get_and_verify_dtype( @@ -674,6 +674,16 @@ def validate_model_config_after(self: "ModelConfig") -> "ModelConfig": "max_model_len must be an integer after __post_init__.") return self + def _get_transformers_backend_cls(self) -> str: + """Determine which Transformers backend class will be used if + `model_impl` is set to `transformers` or `auto`.""" + if self.hf_config != self.hf_text_config: + # If 'hf_text_config' is the same as 'hf_config'. If not, it is + # probably a composite config, i.e. multimodal + return "TransformersForMultimodalLM" + else: + return "TransformersForCausalLM" + @property def registry(self): return me_models.ModelRegistry @@ -681,7 +691,19 @@ def registry(self): @property def architectures(self) -> list[str]: # architectures in the model config. - return getattr(self.hf_config, "architectures", []) + architectures = getattr(self.hf_config, "architectures", []) + # The registry assumes that it can always inspect the vLLM model class + # for a given architecture. This assumption breaks down for the + # Transformers backend, which may use a different class depending on + # the model type. To work around this, we add the correct Transformers + # backend class to the architectures list. We must do this here because + # we need access to the `hf_config` to determine the backend class. 
+ transformers_backend_cls = self._get_transformers_backend_cls() + if (self.model_impl != ModelImpl.VLLM.value + and all(arch != transformers_backend_cls + for arch in architectures)): + architectures.append(transformers_backend_cls) + return architectures @property def architecture(self) -> str: @@ -827,10 +849,9 @@ def _get_preferred_pooling_task( ("EmbeddingModel", "embed"), ("RewardModel", "reward"), ] - _, arch = self.registry.inspect_model_cls(architectures) for suffix, pref_task in suffix_to_preferred_task: - if arch.endswith(suffix): + if self.architecture.endswith(suffix): return pref_task return "embed" @@ -944,10 +965,10 @@ def _resolve_runner( ("EmbeddingModel", "pooling"), ("RewardModel", "pooling"), ] - _, arch = self.registry.inspect_model_cls(self.architectures) for suffix, pref_runner in suffix_to_preferred_runner: - if arch.endswith(suffix) and pref_runner in supported_runner_types: + if self.architecture.endswith( + suffix) and pref_runner in supported_runner_types: return pref_runner if "generate" in supported_runner_types: diff --git a/vllm/model_executor/model_loader/utils.py b/vllm/model_executor/model_loader/utils.py index 190d1f006bc..42c5512905f 100644 --- a/vllm/model_executor/model_loader/utils.py +++ b/vllm/model_executor/model_loader/utils.py @@ -25,6 +25,7 @@ as_reward_model, as_seq_cls_model) from vllm.model_executor.models.interfaces import SupportsQuant +from vllm.model_executor.models.registry import _TRANSFORMERS_MODELS from vllm.utils import is_pin_memory_available logger = init_logger(__name__) @@ -169,9 +170,22 @@ def device_loading_context(module: torch.nn.Module, def resolve_transformers_arch(model_config: ModelConfig, architectures: list[str]): + if model_config.model_impl == ModelImpl.VLLM: + raise ValueError( + "Attempting to resolve architecture from the Transformers library " + "but the model implementation is set to vLLM. This should never " + "happen.") + for i, arch in enumerate(architectures): - if arch == "TransformersForCausalLM": + if arch in _TRANSFORMERS_MODELS: continue + + if model_config.model_impl == ModelImpl.AUTO: + logger.warning( + "%s has no vLLM implementation, falling back to Transformers " + "implementation. Some features may not be supported and " + "performance may not be optimal.", arch) + auto_map: dict[str, str] = getattr(model_config.hf_config, "auto_map", None) or dict() # Make sure that config class is always initialized before model class, @@ -199,25 +213,13 @@ def resolve_transformers_arch(model_config: ModelConfig, "not present in the model config's 'auto_map' (relevant " "if the model is custom).") model_module = auto_modules["AutoModel"] - # TODO(Isotr0py): Further clean up these raises. - # perhaps handled them in _ModelRegistry._raise_for_unsupported? - if model_config.model_impl == ModelImpl.TRANSFORMERS: - if not model_module.is_backend_compatible(): - raise ValueError( - f"The Transformers implementation of {arch} is not " - "compatible with vLLM.") - architectures[i] = "TransformersForCausalLM" - if model_config.model_impl == ModelImpl.AUTO: - if not model_module.is_backend_compatible(): - raise ValueError( - f"{arch} has no vLLM implementation and the Transformers " - "implementation is not compatible with vLLM. Try setting " - "VLLM_USE_V1=0.") - logger.warning( - "%s has no vLLM implementation, falling back to Transformers " - "implementation. 
Some features may not be supported and " - "performance may not be optimal.", arch) - architectures[i] = "TransformersForCausalLM" + + if not model_module.is_backend_compatible(): + raise ValueError( + f"The Transformers implementation of '{arch}' is not " + "compatible with vLLM.") + + architectures[i] = model_config._get_transformers_backend_cls() return architectures @@ -237,8 +239,9 @@ def get_model_architecture( ] vllm_supported_archs = ModelRegistry.get_supported_archs() - vllm_not_supported = not any(arch in vllm_supported_archs - for arch in architectures) + is_supported = lambda arch: (arch in vllm_supported_archs and arch not in + _TRANSFORMERS_MODELS) + vllm_not_supported = not any(is_supported(arch) for arch in architectures) if vllm_not_supported: # try automatic conversion in adapters.py @@ -259,7 +262,7 @@ def get_model_architecture( break if (model_config.model_impl == ModelImpl.TRANSFORMERS or - model_config.model_impl != ModelImpl.VLLM and vllm_not_supported): + model_config.model_impl == ModelImpl.AUTO and vllm_not_supported): architectures = resolve_transformers_arch(model_config, architectures) logger.debug_once("Resolve transformers arch %s", str(architectures)) elif (model_config.quantization is not None diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index b57130ec84c..a85e8b0e7b1 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -253,6 +253,7 @@ } _TRANSFORMERS_MODELS = { + "TransformersForMultimodalLM": ("transformers", "TransformersForMultimodalLM"), # noqa: E501 "TransformersForCausalLM": ("transformers", "TransformersForCausalLM"), } # yapf: enable @@ -504,9 +505,14 @@ def _normalize_archs( if causal_lm_arch in self.models: normalized_arch.append(arch) - # make sure Transformers backend is put at the last as a fallback - if len(normalized_arch) != len(architectures): - normalized_arch.append("TransformersForCausalLM") + # NOTE(Isotr0py): Be careful of architectures' order! + # Make sure Transformers backend architecture is at the end of the + # list, otherwise pooling models automatic conversion will fail! + for arch in normalized_arch: + if arch.startswith("TransformersFor"): + normalized_arch.remove(arch) + normalized_arch.append(arch) + return normalized_arch def inspect_model_cls( diff --git a/vllm/model_executor/models/transformers.py b/vllm/model_executor/models/transformers.py index 04ee3a454f9..47cff29caab 100644 --- a/vllm/model_executor/models/transformers.py +++ b/vllm/model_executor/models/transformers.py @@ -15,8 +15,8 @@ # See the License for the specific language governing permissions and # limitations under the License. 
"""Wrapper around `transformers` models""" -from collections.abc import Iterable -from contextlib import nullcontext +from collections.abc import Iterable, Mapping +from contextlib import contextmanager, nullcontext from typing import Literal, Optional, Union import regex as re @@ -41,11 +41,21 @@ ParallelLMHead, VocabParallelEmbedding) from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalKwargs +from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalFieldConfig, + MultiModalInputs, PlaceholderRange) +from vllm.multimodal.parse import ImageProcessorItems, MultiModalDataItems +from vllm.multimodal.processing import (BaseMultiModalProcessor, + BaseProcessingInfo) +from vllm.multimodal.profiling import BaseDummyInputsBuilder from vllm.sequence import IntermediateTensors +from vllm.transformers_utils.processor import cached_get_processor +from vllm.utils import is_list_of -from .interfaces import SupportsLoRA, SupportsPP, SupportsQuant +from .interfaces import (SupportsLoRA, SupportsMultiModal, SupportsPP, + SupportsQuant) from .utils import (AutoWeightsLoader, PPMissingLayer, WeightsMapper, - is_pp_missing_parameter, + flatten_bn, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, maybe_prefix) logger = init_logger(__name__) @@ -112,6 +122,269 @@ def replace_linear_class( ) +# Copied from `accelerate` +@contextmanager +def init_on_device_without_buffers(device: torch.device): + """ + A context manager under which models are initialized with all + parameters on the specified device. However buffers are not + initialized on specified device. + + Args: + device (`torch.device`): + Device to initialize all parameters on. 
+ """ + + old_register_parameter = nn.Module.register_parameter + + def register_empty_parameter(module, name, param): + old_register_parameter(module, name, param) + if param is not None: + param_cls = type(module._parameters[name]) + kwargs = module._parameters[name].__dict__ + kwargs["requires_grad"] = param.requires_grad + module._parameters[name] = param_cls( + module._parameters[name].to(device), **kwargs) + + tensor_constructors_to_patch = {} + + def patch_tensor_constructor(fn): + + def wrapper(*args, **kwargs): + kwargs["device"] = device + return fn(*args, **kwargs) + + return wrapper + + try: + nn.Module.register_parameter = register_empty_parameter + for torch_function_name in tensor_constructors_to_patch: + setattr( + torch, torch_function_name, + patch_tensor_constructor(getattr(torch, torch_function_name))) + yield + finally: + nn.Module.register_parameter = old_register_parameter + for torch_function_name, old_torch_function in ( + tensor_constructors_to_patch.items()): + setattr(torch, torch_function_name, old_torch_function) + + +class MultiModalProcessingInfo(BaseProcessingInfo): + + def get_hf_config(self): + return self.ctx.model_config.hf_config + + def get_supported_mm_limits(self): + return {"image": None} + + def get_mm_max_tokens_per_item(self, seq_len, mm_counts): + return {"image": self.get_max_image_tokens()} + + def get_max_image_tokens(self) -> int: + width, height = self.get_max_image_size() + processor = self.get_hf_processor() + mm_processor_kwargs = self.ctx.model_config.mm_processor_kwargs or {} + mm_tokens = processor._get_num_multimodal_tokens( + image_sizes=([height, width], ), **mm_processor_kwargs) + image_tokens = mm_tokens["num_image_tokens"][0] + return image_tokens + + def get_hf_processor(self): + processor = cached_get_processor(self.ctx.model_config.model) + return processor + + def get_max_image_size(self): + return 10_000, 10_000 # hardcode for arbitrary very large size + + +class MultiModalDummyInputsBuilder( + BaseDummyInputsBuilder[MultiModalProcessingInfo]): + + def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str: + num_images = mm_counts.get("image", 0) + + processor = self.info.get_hf_processor() + if "gemma3" in processor.__class__.__name__.lower(): + image_token = processor.boi_token + else: + image_token = getattr(processor, "image_token", "") + return image_token * num_images + + def get_dummy_mm_data( + self, + seq_len: int, + mm_counts: Mapping[str, int], + ) -> MultiModalDataDict: + num_images = mm_counts.get("image", 0) + + target_width, target_height = self.info.get_max_image_size() + + return { + "image": + self._get_dummy_images(width=target_width, + height=target_height, + num_images=num_images), + } + + +class MultiModalProcessor(BaseMultiModalProcessor[MultiModalProcessingInfo]): + + def _get_prompt_updates( + self, + mm_items: MultiModalDataItems, + hf_processor_mm_kwargs: Mapping[str, object], + out_mm_kwargs: MultiModalKwargs, + ): + """ + Given the original multi-modal items for this modality + and HF-processed data, output the updates to perform. + + The information returned by this method is used to update token inputs + which bypass the HF processor. It is also used to update the output of + HF processor if the HF process does not apply prompt updates to text + inputs. + + Moreover, this information is critical to determine the token positions + in order to construct :class:`~vllm-multimodal.input.PlaceholderRange` + for each multi-modal item. 
+ """ + return None + + def _get_mm_fields_config( + self, + hf_inputs, + hf_processor_mm_kwargs, + num_image_patches: torch.Tensor = None, + ): + # HF Processors always return a mask but vLLM doesn't need it + hf_inputs.pop("attention_mask", None) + mm_fields = { + key: MultiModalFieldConfig.flat_from_sizes("image", + num_image_patches) + for key in hf_inputs + } + mm_fields["image_embeds"] = MultiModalFieldConfig.flat_from_sizes( + "image", num_image_patches) + mm_fields["num_image_patches"] = MultiModalFieldConfig.batched("image") + return mm_fields + + def _apply_hf_processor_text_mm( + self, + prompt_text: str, + mm_items: MultiModalDataItems, + hf_processor_mm_kwargs: Mapping[str, object], + tokenization_kwargs: Mapping[str, object], + ): + """ + Apply the HF processor on the prompt text and multi-modal data + together. + + In addition, return whether prompt replacements have been applied. + """ + processor_data, passthrough_data = self._get_hf_mm_data(mm_items) + processor_data["return_mm_token_type_ids"] = True + + processed_data = self._call_hf_processor( + prompt=prompt_text, + mm_data=processor_data, + mm_kwargs=hf_processor_mm_kwargs, + tok_kwargs=tokenization_kwargs, + ) + processed_data.update(passthrough_data) + + prompt_ids, = processed_data.pop("input_ids").tolist() + mm_token_type_ids = processed_data.pop( + "mm_token_type_ids" + ) if "mm_token_type_ids" in processed_data else processed_data.pop( + "token_type_ids") # for gemma3 only + + return prompt_ids, processed_data, mm_token_type_ids + + def apply( + self, + prompt: Union[str, list[int]], + mm_data: MultiModalDataDict, + hf_processor_mm_kwargs: Mapping[str, object], + tokenization_kwargs: Optional[Mapping[str, object]] = None, + return_mm_hashes: bool = False, + ) -> MultiModalInputs: + """ + Process multi-modal inputs to be used in vLLM. + + Apply HF Processor on prompt text and multi-modal data together, + outputting token IDs and processed tensors. + """ + if return_mm_hashes: + raise ValueError( + "TransformersForMultimodalLM doesn't support mm hashing yet! " + "Probably you didn't set `disable_mm_preprocessor_cache=True`") + + if tokenization_kwargs is None: + tokenization_kwargs = {} + + mm_items = self._to_mm_items(mm_data) + hf_processor = self.info.get_hf_processor(**hf_processor_mm_kwargs) + + (prompt_ids, processed_data, + mm_token_type_ids) = self._apply_hf_processor_text_mm( + prompt_text=prompt, + mm_items=mm_items, + hf_processor_mm_kwargs=hf_processor_mm_kwargs, + tokenization_kwargs=tokenization_kwargs, + ) + + # HF processor will return `mm_token_type_ids` from which + # we can infer mm_placeholders. Until then hardcode to make code run + # Below tested on Llava. 
Prompts and `mm_token_type_ids` are always bs=1 + mm_positions = torch.where(mm_token_type_ids == 1)[1] + images = mm_items.get_items("image", ImageProcessorItems) + mm_processor_kwargs = (self.info.ctx.model_config.mm_processor_kwargs + or {}) + image_sizes = [] + for item_idx in range(len(images)): + image_size = images.get_image_size(item_idx) + image_sizes.append((image_size.height, image_size.width)) + + mm_tokens_per_modality = hf_processor._get_num_multimodal_tokens( + image_sizes=image_sizes, **mm_processor_kwargs) + + mm_placeholders = {} + split_sizes = mm_tokens_per_modality["num_image_tokens"] + if split_sizes: + chunked_mm_positions = torch.split(mm_positions, split_sizes) + mm_tokens = torch.tensor(prompt_ids)[mm_token_type_ids[0].bool()] + chunked_mm_tokens = torch.split(mm_tokens, split_sizes) + ranges = [ + PlaceholderRange( + offset=positions[0].item(), + length=positions.shape[0], + is_embed=(mm_tokens == hf_processor.image_token_id).bool()) + for positions, mm_tokens in zip(chunked_mm_positions, + chunked_mm_tokens) + ] + mm_placeholders = {"image": ranges} + + num_image_patches = torch.tensor( + mm_tokens_per_modality["num_image_patches"] + ) if "num_image_patches" in mm_tokens_per_modality else None + processed_data['num_image_patches'] = num_image_patches + mm_kwargs = MultiModalKwargs.from_hf_inputs( + processed_data, + self._get_mm_fields_config(processed_data, hf_processor_mm_kwargs, + num_image_patches), + ) + + return MultiModalInputs( + type="multimodal", + prompt=prompt, + prompt_token_ids=prompt_ids, + mm_kwargs=mm_kwargs, + mm_hashes=None, + mm_placeholders=mm_placeholders, + ) + + class ConfigOverride: """Context manager to temporarily override config attributes.""" @@ -153,6 +426,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): quant_config: QuantizationConfig = vllm_config.quant_config self.config = config + self.text_config = config.get_text_config() self.cache_config = cache_config self.device_config = device_config self.model_config = model_config @@ -173,14 +447,16 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): config_override = ConfigOverride( config, sliding_window=config.interleaved_sliding_window) - # Use meta device to delay allocating GPU tensors - with torch.device("meta"), config_override: + # Set correct attn and init on "meta" to delay allocating GPU tensors + # TODO: @raushan, use the public `model.set_attn_implementation()` + # method after v4.54.0 is released + self.text_config._attn_implementation = "vllm" + with init_on_device_without_buffers("meta"), config_override: # FIXME(Isotr0py): We need to refactor this part in the future to # avoid registering an extra model layer, otherwise we will need a # weights mapper to rename weights. 
self.model: PreTrainedModel = AutoModel.from_config( config, - attn_implementation="vllm", torch_dtype=model_config.dtype, trust_remote_code=model_config.trust_remote_code, ) @@ -189,27 +465,25 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.tensor_parallel() # Input embeddings + text_config = config.get_text_config() if not isinstance(self.model.get_input_embeddings(), PPMissingLayer): self.model.set_input_embeddings( VocabParallelEmbedding( - config.vocab_size, - config.hidden_size, - org_num_embeddings=config.vocab_size, + text_config.vocab_size, + text_config.hidden_size, + org_num_embeddings=text_config.vocab_size, quant_config=quant_config, )) # Attention layers self.attention_instances = self.create_attention_instances() - # Initialize buffers (e.g. rotary embedding inverse frequency) - self.init_buffers(self.model) - # Initialize any parameters that have not had their modules replaced self.init_parameters(self.model) self.make_empty_intermediate_tensors = ( make_empty_intermediate_tensors_factory(["hidden_states"], - config.hidden_size)) + text_config.hidden_size)) def pipeline_parallel(self): """ @@ -240,14 +514,15 @@ def pipeline_parallel(self): # Layers before module list for name in pp_plan[:module_list_idx]: - if self.pp_group.is_first_rank or (self.config.tie_word_embeddings - and self.pp_group.is_last_rank): + if self.pp_group.is_first_rank or ( + self.text_config.tie_word_embeddings + and self.pp_group.is_last_rank): continue setattr(self.model, name, PPMissingLayer()) # Module list - start_layer, end_layer = get_pp_indices(self.config.num_hidden_layers, - self.pp_rank, self.pp_size) + start_layer, end_layer = get_pp_indices( + self.text_config.num_hidden_layers, self.pp_rank, self.pp_size) layers_name = pp_plan[module_list_idx] layers = getattr(self.model, layers_name) for i in range(len(layers)): @@ -298,7 +573,7 @@ def create_attention_instances(self) -> dict[int, Attention]: self.parallel_config) head_size = self.model_config.get_head_size() num_kv_heads = self.model_config.get_num_kv_heads(self.parallel_config) - start, end = get_pp_indices(self.config.num_hidden_layers, + start, end = get_pp_indices(self.text_config.num_hidden_layers, self.pp_rank, self.pp_size) attention_instances = {} @@ -323,35 +598,6 @@ def create_attention_instances(self) -> dict[int, Attention]: prefix=f"{i}.attn") return attention_instances - def init_buffers(self, module: nn.Module): - """ - If a `buffer` is on the `meta` device, then its parent - `module` is the original module created by: - - ```python - with torch.device("meta"): - self.model: PreTrainedModel = AutoModel.from_config(...) - ``` - - This means that: - - `type(module)` is a class from `transformers` - - This class is constructed using a `PretrainedConfig` - """ - for name, buffer in module.named_buffers(recurse=False): - if buffer.device == torch.device("meta"): - if module == self.model: - logger.warning( - "To initialize buffers correctly, we instantiate the " - "parent module and and extract the value of the " - "buffer from it. In this case, the parent module is " - "the base model. Instantiating the entire model here " - "risks GPU OOM. 
Could this buffer be moved to a child " - "module?") - new_buffer = getattr(type(module)(self.config), name) - setattr(module, name, new_buffer) - for child in module.children(): - self.init_buffers(child) - def init_parameters(self, module: nn.Module): """ If a `parameter` is on the `meta` device, then its parent @@ -366,6 +612,7 @@ def init_parameters(self, module: nn.Module): if param.device == torch.device("meta"): new_param = nn.Parameter( torch.empty_like(param.data, + dtype=self.model_config.dtype, device=self.device_config.device)) setattr(module, name, new_param) for child in module.children(): @@ -391,11 +638,16 @@ def forward( if inputs_embeds is not None: inputs_embeds = inputs_embeds[None, ...] + if self.model_config.uses_mrope: + position_ids = positions[:, None] + else: + position_ids = positions[None, ...] + hidden_states = self.model( input_ids=input_ids, inputs_embeds=inputs_embeds, use_cache=False, - position_ids=positions[None, ...], + position_ids=position_ids, attention_instances=self.attention_instances, return_dict=False)[0][0, ...] # we remove batch dimension for now @@ -507,3 +759,180 @@ def load_weights(self, weights: Iterable[tuple[str, if self.config.tie_word_embeddings else None), ) return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper) + + +@MULTIMODAL_REGISTRY.register_processor( + MultiModalProcessor, + info=MultiModalProcessingInfo, + dummy_inputs=MultiModalDummyInputsBuilder) +class TransformersForMultimodalLM(nn.Module, SupportsQuant, SupportsLoRA, + SupportsPP, SupportsMultiModal): + embedding_padding_modules = ["lm_head"] + embedding_modules = ["embed_tokens"] + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + config: PretrainedConfig = vllm_config.model_config.hf_config + quant_config: QuantizationConfig = vllm_config.quant_config + + self.config = config + self.dtype = vllm_config.model_config.dtype + + self.model = TransformersModel(vllm_config=vllm_config, prefix=prefix) + text_config = config.get_text_config() + + if get_pp_group().is_last_rank: + self.unpadded_vocab_size = text_config.vocab_size + self.lm_head = ParallelLMHead( + text_config.vocab_size, + text_config.hidden_size, + quant_config=quant_config, + prefix=maybe_prefix(prefix, "lm_head"), + ) + if text_config.tie_word_embeddings: + self.lm_head = self.lm_head.tie_weights( + self.model.get_input_embeddings()) + + logit_scale = getattr(config, "logit_scale", 1.0) + self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, + text_config.vocab_size, + logit_scale) + else: + self.lm_head = PPMissingLayer() + + self.make_empty_intermediate_tensors = ( + self.model.make_empty_intermediate_tensors) + + @property + def hf_to_vllm_mapper(self): + # Backwards compatibility for prev released models + # State dicts back then had different formats + # and cannot be loaded with `AutoModel` mapping + # as is + prefix_mapper = { + "language_model.model": "model.language_model", + "text_model.model": "model.text_model", + "vision_tower": "model.vision_tower", + "vqmodel": "model.vqmodel", + "vision_model": "model.vision_model", + "vision_embed_tokens": "model.vision_embed_tokens", + "image_newline": "model.image_newline", + "multi_modal_projector": "model.multi_modal_projector", + "text_model.lm_head": "lm_head", + "language_model.lm_head": "lm_head", + } + # Don't change the order for QwenVL + if 'Qwen2' in self.config.__class__.__name__: + prefix_mapper["model"] = "model.language_model" + prefix_mapper["visual"] = "model.visual" + + return 
WeightsMapper(orig_to_new_prefix=prefix_mapper, ) + + def forward( + self, + input_ids: Optional[torch.Tensor], + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + **kwargs: object, + ) -> Union[torch.Tensor, IntermediateTensors]: + # NOTE: In v1, inputs_embeds is always generated at model runner from + # `get_multimodal_embeddings` and `get_input_embeddings`, this + # condition is only for v0 compatibility. + if inputs_embeds is None: + multimodal_embeds = self.get_multimodal_embeddings(**kwargs) + if multimodal_embeds is not None: + inputs_embeds = self.get_input_embeddings( + input_ids, multimodal_embeds) + input_ids = None + + model_output = self.model(input_ids, positions, intermediate_tensors, + inputs_embeds) + return model_output + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + logits = self.logits_processor(self.lm_head, hidden_states, + sampling_metadata) + return logits + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + loader = AutoWeightsLoader( + self, + skip_prefixes=([ + "lm_head." + ] if self.config.get_text_config().tie_word_embeddings else None), + ) + return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper) + + def get_multimodal_embeddings(self, **kwargs): + pixel_values = kwargs.pop("pixel_values", None) + pixel_values = pixel_values if pixel_values is not None else kwargs.pop( + "image_patches", None) + image_embeds = kwargs.pop("image_embeds", None) + + if image_embeds is not None: + return image_embeds + + if pixel_values is None and image_embeds is None: + return None + + num_image_patches = kwargs.pop("num_image_patches") + if pixel_values is not None: + if isinstance(pixel_values, torch.Tensor): + pixel_values = flatten_bn(pixel_values).to(self.dtype) + elif is_list_of(pixel_values, torch.Tensor): + pixel_values = flatten_bn(flatten_bn(pixel_values), + concat=True).to(self.dtype) + else: + raise ValueError( + f"Unsupported pixel_values type {type(pixel_values)}. " + "Expected `torch.Tensor` or list of `torch.Tensor`.") + + if isinstance(num_image_patches, list): + num_image_patches = torch.cat(num_image_patches) + + vision_embeddings = self.model.model.get_image_features( + pixel_values, + **{ + k: v.flatten(0, 1) + for k, v in kwargs.items() + }, + ) + + if isinstance(vision_embeddings, torch.Tensor): + if vision_embeddings.ndim == 2: + vision_embeddings = vision_embeddings.unsqueeze(0) + + # Embeddings have to be 2D tensors of length `num_images` + # but transformers returns concat tensors if each patch + # is of different size. 
We split it back to make vLLM happy + vision_embeddings = torch.split( + vision_embeddings, + num_image_patches.flatten().tolist()) + vision_embeddings = [ + embed.flatten(start_dim=0, end_dim=-2) + for embed in vision_embeddings + ] + + return vision_embeddings + + def get_input_embeddings( + self, + input_ids: torch.Tensor, + multimodal_embeddings=None, + ) -> torch.Tensor: + inputs_embeds = self.model.model.get_input_embeddings()(input_ids) + if (multimodal_embeddings is not None + and len(multimodal_embeddings) != 0): + mask = (input_ids == self.config.image_token_id) + mask = mask.unsqueeze(-1).expand_as(inputs_embeds) + multimodal_embeddings = torch.cat(multimodal_embeddings) + + inputs_embeds = inputs_embeds.masked_scatter( + mask, multimodal_embeddings) + return inputs_embeds From 6737434c98d17b8f3ca770920a2050be32b2c675 Mon Sep 17 00:00:00 2001 From: Jiayi Yan <66017932+1195343015@users.noreply.github.com> Date: Mon, 21 Jul 2025 01:12:10 +0800 Subject: [PATCH 217/552] [bugfix] fix syntax warning caused by backslash (#21251) Signed-off-by: x22x22 --- examples/offline_inference/neuron_eagle.py | 2 +- tests/v1/kv_connector/unit/test_nixl_connector.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/offline_inference/neuron_eagle.py b/examples/offline_inference/neuron_eagle.py index 0b2070c8e25..8b1d235ff97 100644 --- a/examples/offline_inference/neuron_eagle.py +++ b/examples/offline_inference/neuron_eagle.py @@ -54,7 +54,7 @@ def main(): for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, \n\n\n\ Generated text: {generated_text!r}") + print(f"Prompt: {prompt!r}, \n\n\n Generated text: {generated_text!r}") if __name__ == "__main__": diff --git a/tests/v1/kv_connector/unit/test_nixl_connector.py b/tests/v1/kv_connector/unit/test_nixl_connector.py index a0dfd54fb82..99bde919c72 100644 --- a/tests/v1/kv_connector/unit/test_nixl_connector.py +++ b/tests/v1/kv_connector/unit/test_nixl_connector.py @@ -341,7 +341,7 @@ def test_abort_timeout_on_prefiller(monkeypatch, distributed_executor_backend): Test lifecycle of an aborted Remote Prefill request hitting the timeout. 
-----> P | {process request} - <-\--- | {result is NOT delivered, eg proxy is down} + <-/--- | {result is NOT delivered, eg proxy is down} | | | {eventually free blocks} From 799d11ff706b19a43f8f20b913b1432b9e6e4855 Mon Sep 17 00:00:00 2001 From: Kay Yan Date: Mon, 21 Jul 2025 11:13:02 +0800 Subject: [PATCH 218/552] [CI] Cleanup modelscope version constraint in Dockerfile (#21243) Signed-off-by: Kay Yan Signed-off-by: x22x22 --- docker/Dockerfile | 2 +- docker/Dockerfile.xpu | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index b06c4d33626..d1fa92ce6d1 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -510,7 +510,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \ else \ BITSANDBYTES_VERSION="0.46.1"; \ fi; \ - uv pip install --system accelerate hf_transfer 'modelscope!=1.15.0' "bitsandbytes>=${BITSANDBYTES_VERSION}" 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3] + uv pip install --system accelerate hf_transfer modelscope "bitsandbytes>=${BITSANDBYTES_VERSION}" 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3] ENV VLLM_USAGE_SOURCE production-docker-image diff --git a/docker/Dockerfile.xpu b/docker/Dockerfile.xpu index 41b4c42e4c4..3130435ca72 100644 --- a/docker/Dockerfile.xpu +++ b/docker/Dockerfile.xpu @@ -47,7 +47,7 @@ FROM vllm-base AS vllm-openai # install additional dependencies for openai api server RUN --mount=type=cache,target=/root/.cache/pip \ - pip install accelerate hf_transfer pytest 'modelscope!=1.15.0' + pip install accelerate hf_transfer pytest modelscope ENV VLLM_USAGE_SOURCE production-docker-image \ TRITON_XPU_PROFILE 1 From b4619ffd7f745a8b2ccf3f6be7e4b4ae6ca51723 Mon Sep 17 00:00:00 2001 From: Simon Mo Date: Sun, 20 Jul 2025 21:58:07 -0700 Subject: [PATCH 219/552] [Docs] Add RFC Meeting to Issue Template (#21279) Signed-off-by: simon-mo Signed-off-by: x22x22 --- .github/ISSUE_TEMPLATE/750-RFC.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/ISSUE_TEMPLATE/750-RFC.yml b/.github/ISSUE_TEMPLATE/750-RFC.yml index e447c077473..7ee57c42895 100644 --- a/.github/ISSUE_TEMPLATE/750-RFC.yml +++ b/.github/ISSUE_TEMPLATE/750-RFC.yml @@ -46,7 +46,7 @@ body: - type: markdown attributes: value: > - Thanks for contributing 🎉! + Thanks for contributing 🎉! The vLLM core team hosts a biweekly RFC review session at 9:30AM Pacific Time, while most RFCs can be discussed online, you can optionally sign up for a slot to discuss your RFC online [here](https://docs.google.com/document/d/1CiLVBZeIVfR7_PNAKVSusxpceywkoOOB78qoWqHvSZc/edit). - type: checkboxes id: askllm attributes: From b55a51a68aa57ac71b55aa80196b1bd49a186c4b Mon Sep 17 00:00:00 2001 From: Huy Do Date: Sun, 20 Jul 2025 22:29:18 -0700 Subject: [PATCH 220/552] Add the instruction to run e2e validation manually before release (#21023) Signed-off-by: Huy Do Signed-off-by: x22x22 --- RELEASE.md | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/RELEASE.md b/RELEASE.md index 7f527071521..9352e7ef706 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -52,3 +52,36 @@ After branch cut, we approach finalizing the release branch with clear criteria * Release branch specific changes (e.g. change version identifiers or CI fixes) Please note: **No feature work allowed for cherry picks**. All PRs that are considered for cherry-picks need to be merged on trunk, the only exception are Release branch specific changes. 
+ +## Manual validations + +### E2E Performance Validation + +Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI. + +**Current Coverage:** +* Models: Llama3, Llama4, and Mixtral +* Hardware: NVIDIA H100 and AMD MI300x +* *Note: Coverage may change based on new model releases and hardware availability* + +**Performance Validation Process:** + +**Step 1: Get Access** +Request write access to the [pytorch/pytorch-integration-testing](https://github.com/pytorch/pytorch-integration-testing) repository to run the benchmark workflow. + +**Step 2: Review Benchmark Setup** +Familiarize yourself with the benchmark configurations: +* [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda) +* [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm) + +**Step 3: Run the Benchmark** +Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure: +* **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`) +* **vLLM commit**: Set to the RC commit hash + +**Step 4: Review Results** +Once the workflow completes, benchmark results will be available on the [vLLM benchmark dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) under the corresponding branch and commit. + +**Step 5: Performance Comparison** +Compare the current results against the previous release to verify no performance regressions have occurred. Here is an +example of [v0.9.1 vs v0.9.2](https://hud.pytorch.org/benchmark/llms?startTime=Thu%2C%2017%20Apr%202025%2021%3A43%3A50%20GMT&stopTime=Wed%2C%2016%20Jul%202025%2021%3A43%3A50%20GMT&granularity=week&lBranch=releases/v0.9.1&lCommit=b6553be1bc75f046b00046a4ad7576364d03c835&rBranch=releases/v0.9.2&rCommit=a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f&repoName=vllm-project%2Fvllm&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=All%20Devices&archName=All%20Platforms). 
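As a companion to Step 5 above, here is a minimal comparison sketch. It assumes the per-model throughput numbers for the two releases have been exported from the dashboard into JSON files shaped like `{model_name: tokens_per_second}`; the file names and layout are only an assumption for illustration, not part of the benchmark workflow itself.

```python
import json

# Flag throughput regressions between a baseline and a candidate release.
# The JSON layout {model_name: tokens_per_second} is assumed for illustration.
THRESHOLD = 0.05  # flag drops larger than 5%

with open("v0.9.1.json") as f:
    baseline = json.load(f)
with open("v0.9.2.json") as f:
    candidate = json.load(f)

for model, base_tps in baseline.items():
    new_tps = candidate.get(model)
    if new_tps is None:
        print(f"{model}: missing from candidate results")
        continue
    delta = (new_tps - base_tps) / base_tps
    status = "REGRESSION" if delta < -THRESHOLD else "ok"
    print(f"{model}: {base_tps:.1f} -> {new_tps:.1f} tok/s ({delta:+.1%}) {status}")
```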
From c3bdf2768649e2f923bf2399f643f576806985a5 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Mon, 21 Jul 2025 13:50:06 +0800 Subject: [PATCH 221/552] [Bugfix] Fix missing placeholder in logger debug (#21280) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/transformers_utils/configs/mistral.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/transformers_utils/configs/mistral.py b/vllm/transformers_utils/configs/mistral.py index e66f762eb80..8a9c660b882 100644 --- a/vllm/transformers_utils/configs/mistral.py +++ b/vllm/transformers_utils/configs/mistral.py @@ -42,7 +42,7 @@ def adapt_config_dict(config_dict: dict[str, Any], config = PretrainedConfig.from_dict(config_dict) - logger.debug("Initialized config", config) + logger.debug("Initialized config %s", config) return config From d954ee4a5f060519660c28fd1e7edbeabcdba64f Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Mon, 21 Jul 2025 17:22:21 +0800 Subject: [PATCH 222/552] [Model][1/N] Support multiple poolers at model level (#21227) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- docs/models/pooling_models.md | 53 ++- tests/models/test_transformers.py | 2 +- .../my_gemma_embedding.py | 15 +- vllm/config.py | 8 +- vllm/entrypoints/openai/api_server.py | 2 +- vllm/model_executor/layers/pooler.py | 346 +++++++++--------- vllm/model_executor/models/adapters.py | 108 +++--- vllm/model_executor/models/bert.py | 132 +++++-- vllm/model_executor/models/gpt2.py | 16 +- vllm/model_executor/models/gritlm.py | 39 +- vllm/model_executor/models/internlm2.py | 12 +- vllm/model_executor/models/jamba.py | 29 +- vllm/model_executor/models/jina_vl.py | 18 +- vllm/model_executor/models/modernbert.py | 50 ++- vllm/model_executor/models/qwen2_rm.py | 35 +- vllm/model_executor/models/roberta.py | 44 ++- vllm/model_executor/pooling_metadata.py | 7 + vllm/v1/pool/metadata.py | 8 + vllm/v1/worker/gpu_model_runner.py | 16 +- vllm/v1/worker/tpu_model_runner.py | 7 +- vllm/worker/model_runner_base.py | 7 +- vllm/worker/pooling_model_runner.py | 10 +- 22 files changed, 550 insertions(+), 414 deletions(-) diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md index f9ebac8ed27..4f347d165ee 100644 --- a/docs/models/pooling_models.md +++ b/docs/models/pooling_models.md @@ -11,26 +11,51 @@ before returning them. As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to pooling models as they only work on the generation or decode stage, so performance may not improve as much. -For pooling models, we support the following `--task` options. -The selected option sets the default pooler used to extract the final hidden states: +If the model doesn't implement this interface, you can set `--task` which tells vLLM +to convert the model into a pooling model. -| Task | Pooling Type | Normalization | Softmax | -|---------------------------------|----------------|-----------------|-----------| -| Embedding (`embed`) | `LAST` | ✅︎ | ❌ | -| Classification (`classify`) | `LAST` | ❌ | ✅︎ | -| Sentence Pair Scoring (`score`) | \* | \* | \* | +| `--task` | Model type | Supported pooling tasks | +|------------|----------------------|-------------------------------| +| `embed` | Embedding model | `encode`, `embed` | +| `classify` | Classification model | `encode`, `classify`, `score` | +| `reward` | Reward model | `encode` | -\*The default pooler is always defined by the model. +## Pooling Tasks -!!! 
note - If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table. +In vLLM, we define the following pooling tasks and corresponding APIs: + +| Task | APIs | +|------------|--------------------| +| `encode` | `encode` | +| `embed` | `embed`, `score`\* | +| `classify` | `classify` | +| `score` | `score` | + +\*The `score` API falls back to `embed` task if the model does not support `score` task. + +Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks]. + +By default, the pooler assigned to each task has the following attributes: + +| Task | Pooling Type | Normalization | Softmax | +|------------|----------------|---------------|---------| +| `encode` | `ALL` | ❌ | ❌ | +| `embed` | `LAST` | ✅︎ | ❌ | +| `classify` | `LAST` | ❌ | ✅︎ | + +These defaults may be overridden by the model's implementation in vLLM. When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models, -we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`). +we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`), +which takes priority over the model's defaults. + +You can further customize this via the `--override-pooler-config` option, +which takes priority over both the model's and Sentence Transformers's defaults. + +!!! note -!!! tip - You can customize the model's pooling method via the `--override-pooler-config` option, - which takes priority over both the model's and Sentence Transformers's defaults. + The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler + that is not based on [PoolerConfig][vllm.config.PoolerConfig]. 
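For reference, a minimal offline sketch of the task/API mapping above (the model name is reused from examples elsewhere in this document; exact values will vary, and `score` here relies on the fallback to embedding similarity):

```python
from vllm import LLM

# "embed" task: exposes the `embed` API; `score` falls back to
# embedding similarity when the model has no dedicated scoring head.
llm = LLM(model="intfloat/multilingual-e5-large", task="embed")

(embed_out,) = llm.embed("vLLM pooling models produce hidden-state vectors.")
print(len(embed_out.outputs.embedding))  # vector dimensionality

(score_out,) = llm.score("What does a pooling model return?",
                         "Pooling models return hidden-state vectors.")
print(score_out.outputs.score)  # similarity-based score
```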
## Chunked Processing for Long Text diff --git a/tests/models/test_transformers.py b/tests/models/test_transformers.py index b87290e96a2..16b9bcffd26 100644 --- a/tests/models/test_transformers.py +++ b/tests/models/test_transformers.py @@ -144,7 +144,7 @@ def test_quantization( "model", ["jason9693/Qwen2.5-1.5B-apeach"], ) -@pytest.mark.parametrize("dtype", ["half"]) +@pytest.mark.parametrize("dtype", ["float"]) def test_classify( hf_runner, vllm_runner, diff --git a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py index 797353e4f7a..fc654f20fff 100644 --- a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py +++ b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py @@ -8,7 +8,7 @@ import torch.nn as nn from vllm.config import VllmConfig -from vllm.model_executor.layers.pooler import Pooler, PoolingType +from vllm.model_executor.layers.pooler import DispatchPooler, Pooler from vllm.model_executor.models.gemma2 import Gemma2Model from vllm.model_executor.models.utils import WeightsMapper, maybe_prefix from vllm.sequence import IntermediateTensors @@ -26,12 +26,13 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.model = Gemma2Model(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model")) - self.pooler = Pooler.from_config_with_defaults( - vllm_config.model_config.pooler_config, - pooling_type=PoolingType.LAST, - normalize=True, - softmax=False, - ) + pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + + self.pooler = DispatchPooler({ + "encode": Pooler.for_encode(pooler_config), + "embed": Pooler.for_embed(pooler_config), + }) self.make_empty_intermediate_tensors = ( self.model.make_empty_intermediate_tensors) diff --git a/vllm/config.py b/vllm/config.py index 73e88b13bc5..a6134c85b2e 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -94,7 +94,7 @@ TaskOption = Literal["auto", "generate", "embedding", "embed", "classify", "score", "reward", "transcription", "draft"] -_ResolvedTask = Literal["generate", "transcription", "pooling", "embed", +_ResolvedTask = Literal["generate", "transcription", "encode", "embed", "classify", "reward", "draft"] RunnerOption = Literal["auto", "generate", "pooling", "draft"] @@ -103,7 +103,7 @@ _RUNNER_TASKS: dict[RunnerType, list[_ResolvedTask]] = { "generate": ["generate", "transcription"], - "pooling": ["pooling", "embed", "classify", "reward"], + "pooling": ["encode", "embed", "classify", "reward"], "draft": [], } @@ -579,7 +579,7 @@ def __post_init__(self) -> None: # user-selected task if runner_type == "pooling" and self.task == "auto": selected_task = all_supported_tasks[runner_type][-1] - assert selected_task != "pooling" + assert selected_task != "encode" self.task = selected_task self.supported_runner_types = supported_runner_types self.runner_type = runner_type @@ -884,7 +884,7 @@ def _get_supported_pooling_tasks( supported_tasks = list[_ResolvedTask]() if registry.is_pooling_model(architectures): - supported_tasks.append("pooling") + supported_tasks.append("encode") # For now, users must specify the task (other than "pooling") # to use for pooling models diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 3f0c1c85dee..57240bb4f33 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1668,7 +1668,7 @@ async def init_app_state( 
request_logger=request_logger, chat_template=resolved_chat_template, chat_template_content_format=args.chat_template_content_format, - ) if "pooling" in model_config.supported_tasks else None + ) if "encode" in model_config.supported_tasks else None state.openai_serving_embedding = OpenAIServingEmbedding( engine_client, model_config, diff --git a/vllm/model_executor/layers/pooler.py b/vllm/model_executor/layers/pooler.py index 6a474b8e73a..c06cca08022 100644 --- a/vllm/model_executor/layers/pooler.py +++ b/vllm/model_executor/layers/pooler.py @@ -1,15 +1,16 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from abc import ABC, abstractmethod +from collections.abc import Mapping, Set from dataclasses import dataclass from enum import IntEnum +from itertools import groupby from typing import Callable, Optional, TypeVar, Union import torch import torch.nn as nn import torch.nn.functional as F from transformers import PretrainedConfig -from typing_extensions import assert_never from vllm.config import ModelConfig, PoolerConfig from vllm.model_executor.pooling_metadata import ( # noqa: E501 @@ -21,6 +22,10 @@ from vllm.v1.pool.metadata import PoolingMetadata as V1PoolingMetadata PoolingMetadata = Union[V0PoolingMetadata, V1PoolingMetadata] +PoolingFn = Callable[ + [Union[torch.Tensor, list[torch.Tensor]], PoolingMetadata], + Union[torch.Tensor, list[torch.Tensor]]] +ClassifierFn = Callable[[torch.Tensor], torch.Tensor] class PoolingType(IntEnum): @@ -79,37 +84,81 @@ class Pooler(nn.Module, ABC): """The interface required for all poolers used in pooling models in vLLM.""" @staticmethod - def from_config_with_defaults( + def for_encode( pooler_config: PoolerConfig, - pooling_type: PoolingType, - normalize: bool, - softmax: bool, - step_tag_id: Optional[int] = None, - returned_token_ids: Optional[list[int]] = None, - ) -> "Pooler": + *, + default_pooling_type: PoolingType = PoolingType.ALL, + default_normalize: bool = False, + default_softmax: bool = False, + default_step_tag_id: Optional[int] = None, + default_returned_token_ids: Optional[list[int]] = None, + ): resolved_config = ResolvedPoolingConfig.from_config_with_defaults( pooler_config=pooler_config, - pooling_type=pooling_type, - normalize=normalize, - softmax=softmax, - step_tag_id=step_tag_id, - returned_token_ids=returned_token_ids, + pooling_type=default_pooling_type, + normalize=default_normalize, + softmax=default_softmax, + step_tag_id=default_step_tag_id, + returned_token_ids=default_returned_token_ids, ) - if pooling_type == PoolingType.STEP: + if resolved_config.pooling_type == PoolingType.STEP: return StepPooler.from_config(resolved_config) return SimplePooler.from_config(resolved_config) - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: + @staticmethod + def for_embed( + pooler_config: PoolerConfig, + *, + default_pooling_type: PoolingType = PoolingType.LAST, + default_normalize: bool = True, + default_softmax: bool = False, + ): + resolved_config = ResolvedPoolingConfig.from_config_with_defaults( + pooler_config=pooler_config, + pooling_type=default_pooling_type, + normalize=default_normalize, + softmax=default_softmax, + ) + + return SimplePooler.from_config(resolved_config) + + @staticmethod + def for_classify( + pooler_config: PoolerConfig, + classifier: Optional[ClassifierFn], + *, + default_pooling_type: PoolingType = PoolingType.LAST, + default_normalize: bool = False, + default_softmax: bool = True, + ): + 
resolved_config = ResolvedPoolingConfig.from_config_with_defaults( + pooler_config=pooler_config, + pooling_type=default_pooling_type, + normalize=default_normalize, + softmax=default_softmax, + ) + base_pooler = SimplePooler.from_config(resolved_config) + if classifier is None: + return base_pooler + + return ClassifierPooler( + pooling=base_pooler.pooling, + classifier=classifier, + act_fn=base_pooler.head.activation, + ) + + @abstractmethod + def get_supported_tasks(self) -> Set[PoolingTask]: + """Determine which pooling tasks are supported.""" + raise NotImplementedError + + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: """ - Construct the pooling parameters to use for a task, - or `None` if the task is not supported. + Construct the updated pooling parameters to use for a supported task. """ - return None + return PoolingParamsUpdate() @abstractmethod def forward( @@ -127,9 +176,8 @@ def get_prompt_lens( if isinstance(pooling_metadata, V1PoolingMetadata): return pooling_metadata.prompt_lens - assert isinstance(hidden_states, torch.Tensor) return PoolingTensors.from_pooling_metadata( - pooling_metadata, hidden_states.device).prompt_lens + pooling_metadata, hidden_states[0].device).prompt_lens def get_prompt_token_ids( @@ -149,6 +197,21 @@ def get_prompt_token_ids( ] +def get_tasks(pooling_metadata: PoolingMetadata) -> list[PoolingTask]: + if isinstance(pooling_metadata, V0PoolingMetadata): + pooling_params = [p for _, p in pooling_metadata.seq_groups] + else: + pooling_params = pooling_metadata.pooling_params + + tasks: list[PoolingTask] = [ + task for pooling_param in pooling_params + if (task := pooling_param.task) is not None + ] + assert len(pooling_params) == len(tasks) + + return tasks + + def get_classification_activation_function(config: PretrainedConfig): return PoolerClassify() @@ -172,7 +235,8 @@ def get_cross_encoder_activation_function(config: PretrainedConfig): return PoolerScore() -def build_output(all_data: torch.Tensor) -> PoolerOutput: +def build_output( + all_data: Union[torch.Tensor, list[torch.Tensor]], ) -> PoolerOutput: all_outputs = [PoolingSequenceGroupOutput(data) for data in all_data] return PoolerOutput(outputs=all_outputs) @@ -193,12 +257,12 @@ def from_pooling_type(pooling_type: PoolingType) -> "PoolingMethod": raise NotImplementedError(f"Unsupported method: {pooling_type}") @abstractmethod - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: + def get_supported_tasks(self) -> Set[PoolingTask]: raise NotImplementedError + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: + return PoolingParamsUpdate() + @abstractmethod def forward_one( self, @@ -237,16 +301,8 @@ def forward( class CLSPool(PoolingMethod): - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: - # The equalities are split up to keep mypy happy - if (task == "encode" or task == "embed" or task == "classify" - or task == "score"): - return PoolingParamsUpdate() - - assert_never(task) + def get_supported_tasks(self) -> Set[PoolingTask]: + return {"encode", "embed", "classify", "score"} def forward_one( self, @@ -270,16 +326,8 @@ def forward_all( class LastPool(PoolingMethod): - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: - # The equalities are split up to keep mypy happy - if (task == "encode" or task == "embed" or task == "classify" - or task == "score"): - return PoolingParamsUpdate() - - assert_never(task) + def 
get_supported_tasks(self) -> Set[PoolingTask]: + return {"encode", "embed", "classify", "score"} def forward_one( self, @@ -299,18 +347,8 @@ def forward_all( class AllPool(PoolingMethod): - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: - if task == "encode": - return PoolingParamsUpdate() - - # The equalities are split up to keep mypy happy - if task == "embed" or task == "classify" or task == "score": - return None - - assert_never(task) + def get_supported_tasks(self) -> Set[PoolingTask]: + return {"encode"} def forward_one( self, @@ -327,28 +365,13 @@ def forward_all( hidden_states: torch.Tensor, prompt_lens: torch.Tensor, ) -> Union[list[torch.Tensor], torch.Tensor]: - offset = 0 - pooled_data = list[torch.Tensor]() - - for prompt_len in prompt_lens: - pooled_data.append(hidden_states[offset:offset + prompt_len]) - offset += prompt_len - - return pooled_data + return list(hidden_states.split_with_sizes(prompt_lens.tolist())) class MeanPool(PoolingMethod): - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: - # The equalities are split up to keep mypy happy - if (task == "encode" or task == "embed" or task == "classify" - or task == "score"): - return PoolingParamsUpdate() - - assert_never(task) + def get_supported_tasks(self) -> Set[PoolingTask]: + return {"encode", "embed", "classify", "score"} def forward_one( self, @@ -529,24 +552,6 @@ class SimplePooler(Pooler): 3. Returns structured results as `PoolerOutput`. """ - @classmethod - def from_config_with_defaults( # type: ignore[override] - cls, - pooler_config: PoolerConfig, - pooling_type: PoolingType, - normalize: bool, - softmax: bool, - ) -> "SimplePooler": - resolved_config = ResolvedPoolingConfig.from_config_with_defaults( - pooler_config=pooler_config, - pooling_type=pooling_type, - normalize=normalize, - softmax=softmax, - ) - assert resolved_config.pooling_type != PoolingType.STEP - - return cls.from_config(resolved_config) - @classmethod def from_config( cls, @@ -563,10 +568,10 @@ def __init__(self, pooling: PoolingMethod, head: PoolerHead) -> None: self.pooling = pooling self.head = head - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: + def get_supported_tasks(self) -> Set[PoolingTask]: + return self.pooling.get_supported_tasks() + + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: return self.pooling.get_pooling_updates(task) def forward( @@ -627,18 +632,11 @@ def extract_states( return pooled_data - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: - if task == "encode": - return PoolingParamsUpdate(requires_token_ids=True) + def get_supported_tasks(self) -> Set[PoolingTask]: + return {"encode"} - # The equalities are split up to keep mypy happy - if task == "embed" or task == "classify" or task == "score": - return None - - assert_never(task) + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: + return PoolingParamsUpdate(requires_token_ids=True) def forward( self, @@ -650,68 +648,43 @@ def forward( return build_output(pooled_data) -PoolingFn = Callable[ - [Union[torch.Tensor, list[torch.Tensor]], PoolingMetadata], - Union[torch.Tensor, list[torch.Tensor]]] -ClassifierFn = Callable[[torch.Tensor], torch.Tensor] - - -class ClassifierPooler(nn.Module): +class ClassifierPooler(Pooler): """A pooling layer for classification tasks. This layer does the following: 1. 
Applies a classification layer to the hidden states. 2. Optionally applies a pooler layer. - 3. Applies an activation function to the output. In the case of - classification models it is either sigmoid or softmax. In the - case of scoring models, the same behavior is configuration - dependent, as in the sentence-transformers library. + 3. Applies an activation function to the output. """ + @staticmethod + def act_fn_for_seq_cls(config: ModelConfig): + return get_classification_activation_function(config.hf_config) + + @staticmethod + def act_fn_for_cross_encoder(config: ModelConfig): + return get_cross_encoder_activation_function(config.hf_config) + def __init__( self, - config: ModelConfig, pooling: PoolingFn, classifier: ClassifierFn, - act_fn: Optional[PoolerActivation] = None, + act_fn: PoolerActivation, ) -> None: super().__init__() self.pooling = pooling self.classifier = classifier + self.act_fn = act_fn - self.classification_act_fn = get_classification_activation_function( - config.hf_config) if act_fn is None else act_fn - self.cross_encoder_act_fn = get_cross_encoder_activation_function( - config.hf_config) if act_fn is None else act_fn - - def _get_act_fn(self, task: PoolingTask): - if task == "encode" or task == "classify": - return self.classification_act_fn - if task == "score": - return self.cross_encoder_act_fn - - raise ValueError(f"Unsupported task: {task!r}") - - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: - # The equalities are split up to keep mypy happy - if task == "encode" or task == "classify" or task == "score": - return PoolingParamsUpdate() - - if task == "embed": - return None - - assert_never(task) + def get_supported_tasks(self) -> Set[PoolingTask]: + return {"classify", "score"} def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: PoolingMetadata, ) -> PoolerOutput: - """Pools sentence pair scores from the hidden_states.""" pooled_data = self.pooling(hidden_states, pooling_metadata) # apply classifier once on the full batch if possible @@ -722,28 +695,59 @@ def forward( else: pooled_output = [self.classifier(data) for data in pooled_data] - task_list: list[PoolingTask] - if isinstance(pooling_metadata, V0PoolingMetadata): - task_list = [ - task for _, pooling_param in pooling_metadata.seq_groups - if (task := pooling_param.task) is not None - ] - else: - task_list = [ - task for pooling_param in pooling_metadata.pooling_params - if (task := pooling_param.task) is not None - ] + scores = self.act_fn(pooled_output) + + return build_output(scores) + + +class DispatchPooler(Pooler): + """Dispatches calls to a sub-pooler based on the pooling task.""" + + def __init__(self, poolers_by_task: Mapping[PoolingTask, Pooler]) -> None: + super().__init__() + + for task, pooler in poolers_by_task.items(): + if task not in pooler.get_supported_tasks(): + raise ValueError( + f"{pooler=} does not support {task=}. 
" + f"Supported tasks: {pooler.get_supported_tasks()}") + + self.poolers_by_task = poolers_by_task + + def get_supported_tasks(self) -> Set[PoolingTask]: + return set(self.poolers_by_task) - assert len(task_list) == len(pooled_output) + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: + return self.poolers_by_task[task].get_pooling_updates(task) - # shape of scores: (batch_size, num_labels) - if len(set(task_list)) <= 1: - act_fn = self._get_act_fn(task_list[0]) - scores = act_fn(pooled_output) + def forward( + self, + hidden_states: Union[torch.Tensor, list[torch.Tensor]], + pooling_metadata: PoolingMetadata, + ) -> PoolerOutput: + poolers_by_task = self.poolers_by_task + + if isinstance(hidden_states, list): + hidden_states_lst = hidden_states else: - scores = torch.stack([ - self._get_act_fn(task)(vecs) - for task, vecs in zip(task_list, pooled_output) - ]) + prompt_lens = get_prompt_lens(hidden_states, pooling_metadata) + hidden_states_lst = list(hidden_states.split(prompt_lens.tolist())) - return build_output(scores) + outputs = list[PoolingSequenceGroupOutput]() + offset = 0 + for task, group in groupby(get_tasks(pooling_metadata)): + if not (pooler := poolers_by_task.get(task)): + raise ValueError( + f"Unsupported task: {task} " + f"Supported tasks: {self.get_supported_tasks()}") + + num_items = len(list(group)) + group_output: PoolerOutput = pooler( + hidden_states_lst[offset:offset + num_items], + pooling_metadata[offset:offset + num_items], + ) + + outputs.extend(group_output.outputs) + offset += num_items + + return PoolerOutput(outputs) diff --git a/vllm/model_executor/models/adapters.py b/vllm/model_executor/models/adapters.py index 31b1d9a8b3c..867de2c68b4 100644 --- a/vllm/model_executor/models/adapters.py +++ b/vllm/model_executor/models/adapters.py @@ -13,7 +13,6 @@ if TYPE_CHECKING: from vllm.config import VllmConfig - from vllm.model_executor.layers.pooler import PoolingType _T = TypeVar("_T", bound=type[nn.Module]) @@ -34,16 +33,8 @@ def _get_pooling_model_name(orig_model_name: str, pooling_suffix: str) -> str: return model_name + pooling_suffix -def _create_pooling_model_cls( - orig_cls: _T, - *, - default_pooling_type: "PoolingType", - default_normalize: bool, - default_softmax: bool, -) -> _T: +def _create_pooling_model_cls(orig_cls: _T) -> _T: # Lazy import - from vllm.model_executor.layers.pooler import Pooler - from .utils import AutoWeightsLoader, WeightsMapper class ModelForPooling(orig_cls, VllmModelForPooling): @@ -71,15 +62,7 @@ def __init__( self._init_pooler(vllm_config, prefix=prefix) def _init_pooler(self, vllm_config: "VllmConfig", prefix: str = ""): - pooler_config = vllm_config.model_config.pooler_config - assert pooler_config is not None - - self.pooler = Pooler.from_config_with_defaults( - pooler_config, - pooling_type=default_pooling_type, - normalize=default_normalize, - softmax=default_softmax, - ) + raise NotImplementedError def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): # TODO: Support uninitialized params tracking @@ -132,14 +115,20 @@ def as_embedding_model(cls: _T) -> _T: return cls # Lazy import - from vllm.model_executor.layers.pooler import PoolingType - - ModelForEmbedding = _create_pooling_model_cls( - cls, - default_pooling_type=PoolingType.LAST, - default_normalize=True, - default_softmax=False, - ) + from vllm.model_executor.layers.pooler import DispatchPooler, Pooler + + class ModelForEmbedding(_create_pooling_model_cls(cls)): + + def _init_pooler(self, vllm_config: "VllmConfig", prefix: 
str = ""): + pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + + self.pooler = DispatchPooler( + { + "encode": Pooler.for_encode(pooler_config), + "embed": Pooler.for_embed(pooler_config), + }, ) + ModelForEmbedding.__name__ = \ _get_pooling_model_name(cls.__name__, "ForEmbedding") @@ -165,20 +154,14 @@ def as_seq_cls_model(cls: _T) -> _T: # Lazy import from vllm.model_executor.layers.linear import RowParallelLinear from vllm.model_executor.layers.pooler import (ClassifierPooler, - PoolingType, SimplePooler) + DispatchPooler, Pooler, + PoolingMethod, PoolingType) from vllm.model_executor.models.interfaces import SupportsCrossEncoding from vllm.sequence import IntermediateTensors from .utils import maybe_prefix - ModelForPooling = _create_pooling_model_cls( - cls, - default_pooling_type=PoolingType.LAST, - default_normalize=False, - default_softmax=True, - ) - - class ModelForSequenceClassification(ModelForPooling, + class ModelForSequenceClassification(_create_pooling_model_cls(cls), SupportsCrossEncoding): def _init_pooler(self, vllm_config: "VllmConfig", prefix: str = ""): @@ -198,19 +181,28 @@ def _init_pooler(self, vllm_config: "VllmConfig", prefix: str = ""): pooler_config = vllm_config.model_config.pooler_config assert pooler_config is not None - pooler = SimplePooler.from_config_with_defaults( - pooler_config, - pooling_type=PoolingType.LAST, - normalize=False, - softmax=True, - ) - - self.pooler = ClassifierPooler( - vllm_config.model_config, - pooling=pooler.pooling, - classifier=self._classifier, - act_fn=pooler.head.activation, - ) + pooling_type_str = pooler_config.pooling_type + pooling_type = (PoolingType.LAST if pooling_type_str is None else + PoolingType[pooling_type_str]) + + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "classify": + ClassifierPooler( + pooling=PoolingMethod.from_pooling_type(pooling_type), + classifier=self._classifier, + act_fn=ClassifierPooler.act_fn_for_seq_cls( + vllm_config.model_config), + ), + "score": + ClassifierPooler( + pooling=PoolingMethod.from_pooling_type(pooling_type), + classifier=self._classifier, + act_fn=ClassifierPooler.act_fn_for_cross_encoder( + vllm_config.model_config), + ), + }) def _classifier(self, x: torch.Tensor): x, _ = self.score(x.float()) @@ -259,14 +251,16 @@ def as_reward_model(cls: _T) -> _T: return cls # Lazy import - from vllm.model_executor.layers.pooler import PoolingType - - ModelForReward = _create_pooling_model_cls( - cls, - default_pooling_type=PoolingType.ALL, - default_normalize=False, - default_softmax=False, - ) + from vllm.model_executor.layers.pooler import DispatchPooler, Pooler + + class ModelForReward(_create_pooling_model_cls(cls)): + + def _init_pooler(self, vllm_config: "VllmConfig", prefix: str = ""): + pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + + self.pooler = DispatchPooler( + {"encode": Pooler.for_encode(pooler_config)}, ) ModelForReward.__name__ = \ _get_pooling_model_name(cls.__name__, "ForReward") diff --git a/vllm/model_executor/models/bert.py b/vllm/model_executor/models/bert.py index 006f547bb46..9dc6115f850 100644 --- a/vllm/model_executor/models/bert.py +++ b/vllm/model_executor/models/bert.py @@ -1,7 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from collections.abc import Iterable +from collections.abc import Iterable, Set from typing import Optional, Union import torch @@ -17,7 +17,8 @@ from 
vllm.model_executor.layers.linear import (ColumnParallelLinear, QKVParallelLinear, RowParallelLinear) -from vllm.model_executor.layers.pooler import (ClassifierPooler, Pooler, +from vllm.model_executor.layers.pooler import (ClassifierPooler, + DispatchPooler, Pooler, PoolingMethod, PoolingParamsUpdate, PoolingType) @@ -92,20 +93,29 @@ def __init__(self, config: BertConfig): self.dense = nn.Linear(config.hidden_size, config.hidden_size) self.activation = nn.Tanh() - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: + def get_supported_tasks(self) -> Set[PoolingTask]: + return self.pooling.get_supported_tasks() + + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: return self.pooling.get_pooling_updates(task) + def _head(self, pooled_output: torch.Tensor): + pooled_output = self.dense(pooled_output) + pooled_output = self.activation(pooled_output) + return pooled_output + def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: PoolingMetadata, ) -> Union[torch.Tensor, list[torch.Tensor]]: pooled_output = self.pooling(hidden_states, pooling_metadata) - pooled_output = self.dense(pooled_output) - pooled_output = self.activation(pooled_output) + + if isinstance(pooled_output, list): + pooled_output = [self._head(output) for output in pooled_output] + else: + pooled_output = self._head(pooled_output) + return pooled_output @@ -333,18 +343,19 @@ class BertModel(nn.Module, SupportsQuant): packed_modules_mapping = {"qkv_proj": ["query", "key", "value"]} - def __init__(self, - *, - vllm_config: VllmConfig, - prefix: str = "", - embedding_class: type = BertEmbedding, - add_pooling_layer: bool = False): + def __init__( + self, + *, + vllm_config: VllmConfig, + prefix: str = "", + embedding_class: type[nn.Module] = BertEmbedding, + ) -> None: super().__init__() + config = vllm_config.model_config.hf_config self.embeddings = embedding_class(config) self.encoder = BertEncoder(vllm_config=vllm_config, prefix=f"{prefix}.encoder") - self.pooler = BertPooler(config) if add_pooling_layer else None def forward( self, @@ -366,8 +377,7 @@ def forward( token_type_ids=token_type_ids) return self.encoder(hidden_states) - def load_weights(self, weights: Iterable[tuple[str, - torch.Tensor]]) -> set[str]: + def _load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): stacked_params_mapping = [ # (param_name, shard_name, shard_id) ("qkv_proj", "query", "q"), @@ -395,10 +405,43 @@ def load_weights(self, weights: Iterable[tuple[str, if name in params_dict: other_weights.append((name, loaded_weight)) - loader = AutoWeightsLoader( - self, - skip_prefixes=(["pooler."] if self.pooler is None else []), + return other_weights, loaded_stacked_params + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + other_weights, loaded_stacked_params = self._load_weights(weights) + + loader = AutoWeightsLoader(self, skip_prefixes=["pooler."]) + loaded_params = loader.load_weights(other_weights) + loaded_params.update(loaded_stacked_params) + return loaded_params + + +class BertPoolingModel(BertModel): + + is_pooling_model = True + + def __init__( + self, + *, + vllm_config: VllmConfig, + prefix: str = "", + embedding_class: type[nn.Module] = BertEmbedding, + ) -> None: + super().__init__( + vllm_config=vllm_config, + prefix=prefix, + embedding_class=embedding_class, ) + + config = vllm_config.model_config.hf_config + self.pooler = BertPooler(config) + + def load_weights(self, weights: 
Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + other_weights, loaded_stacked_params = self._load_weights(weights) + + loader = AutoWeightsLoader(self) loaded_params = loader.load_weights(other_weights) loaded_params.update(loaded_stacked_params) return loaded_params @@ -421,6 +464,8 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + self.model = self._build_model(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model")) self.pooler = self._build_pooler(pooler_config) @@ -456,10 +501,15 @@ def _build_model(self, embedding_class=BertEmbedding) def _build_pooler(self, pooler_config: PoolerConfig) -> Pooler: - return Pooler.from_config_with_defaults(pooler_config, - pooling_type=PoolingType.CLS, - normalize=True, - softmax=False) + return DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "embed": + Pooler.for_embed( + pooler_config, + default_pooling_type=PoolingType.CLS, + ), + }) class BertForSequenceClassification(nn.Module, SupportsV0Only, @@ -481,16 +531,32 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): config = vllm_config.model_config.hf_config self.num_labels = config.num_labels - self.bert = BertModel(vllm_config=vllm_config, - prefix=maybe_prefix(prefix, "bert"), - embedding_class=BertEmbedding, - add_pooling_layer=True) + self.bert = BertPoolingModel(vllm_config=vllm_config, + prefix=maybe_prefix(prefix, "bert"), + embedding_class=BertEmbedding) self.classifier = nn.Linear(config.hidden_size, config.num_labels) - self.pooler = ClassifierPooler( - vllm_config.model_config, - pooling=self.bert.pooler, - classifier=self.classifier, - ) + + pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "classify": + ClassifierPooler( + pooling=self.bert.pooler, + classifier=self.classifier, + act_fn=ClassifierPooler.act_fn_for_seq_cls( + vllm_config.model_config), + ), + "score": + ClassifierPooler( + pooling=self.bert.pooler, + classifier=self.classifier, + act_fn=ClassifierPooler.act_fn_for_cross_encoder( + vllm_config.model_config), + ), + }) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): loader = AutoWeightsLoader(self) diff --git a/vllm/model_executor/models/gpt2.py b/vllm/model_executor/models/gpt2.py index 82883bfa890..98d76337395 100644 --- a/vllm/model_executor/models/gpt2.py +++ b/vllm/model_executor/models/gpt2.py @@ -43,7 +43,7 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from ..layers.pooler import Pooler, PoolingType +from ..layers.pooler import DispatchPooler, Pooler from .interfaces import SupportsPP from .utils import (AutoWeightsLoader, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, @@ -339,12 +339,16 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.transformer = GPT2Model(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "gpt2")) self.score = nn.Linear(config.n_embd, config.num_labels, bias=False) + pooler_config = vllm_config.model_config.pooler_config - self.pooler = Pooler.from_config_with_defaults( - pooler_config, - pooling_type=PoolingType.LAST, - normalize=False, - softmax=True) + assert pooler_config is not None + + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "classify": + 
Pooler.for_classify(pooler_config, classifier=None), + }) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): loader = AutoWeightsLoader(self) diff --git a/vllm/model_executor/models/gritlm.py b/vllm/model_executor/models/gritlm.py index 8443482119b..8a3fbc6a49f 100644 --- a/vllm/model_executor/models/gritlm.py +++ b/vllm/model_executor/models/gritlm.py @@ -1,17 +1,16 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project - +from collections.abc import Set from typing import Optional, Union import numpy as np import torch import torch.nn as nn -from typing_extensions import assert_never from vllm.config import ModelConfig, VllmConfig from vllm.logger import init_logger -from vllm.model_executor.layers.pooler import (Pooler, PoolerHead, - PoolerNormalize, +from vllm.model_executor.layers.pooler import (DispatchPooler, Pooler, + PoolerHead, PoolerNormalize, PoolingParamsUpdate, build_output, get_prompt_lens, get_prompt_token_ids) @@ -135,18 +134,11 @@ def _get_instruction_len(self, prompt_token_ids: np.ndarray) -> int: return instruction_len - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: - # The equalities are split up to keep mypy happy - if task == "encode" or task == "embed": - return PoolingParamsUpdate(requires_token_ids=True) - - if task == "classify" or task == "score": - return None + def get_supported_tasks(self) -> Set[PoolingTask]: + return {"encode", "embed"} - assert_never(task) + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: + return PoolingParamsUpdate(requires_token_ids=True) def forward_one( self, @@ -207,10 +199,10 @@ def __init__(self, model_config: ModelConfig): self.pooling = GritLMMeanPool(model_config) self.head = PoolerHead(PoolerNormalize()) - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: + def get_supported_tasks(self) -> Set[PoolingTask]: + return self.pooling.get_supported_tasks() + + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: return self.pooling.get_pooling_updates(task) def forward( @@ -262,4 +254,11 @@ def __init__( super().__init__(vllm_config=vllm_config, prefix=prefix, **kwargs) - self.pooler = GritLMPooler(vllm_config.model_config) + pooler_config = vllm_config.model_config.pooler_config + if pooler_config is not None: + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "embed": + GritLMPooler(vllm_config.model_config), + }) diff --git a/vllm/model_executor/models/internlm2.py b/vllm/model_executor/models/internlm2.py index d9bbee0a246..d29779a35e5 100644 --- a/vllm/model_executor/models/internlm2.py +++ b/vllm/model_executor/models/internlm2.py @@ -22,7 +22,7 @@ QKVParallelLinear, RowParallelLinear) from vllm.model_executor.layers.logits_processor import LogitsProcessor -from vllm.model_executor.layers.pooler import Pooler, PoolingType +from vllm.model_executor.layers.pooler import DispatchPooler, Pooler from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.rotary_embedding import get_rope from vllm.model_executor.layers.vocab_parallel_embedding import ( @@ -429,12 +429,10 @@ def __init__( ) pooler_config = vllm_config.model_config.pooler_config - self.pooler = Pooler.from_config_with_defaults( - pooler_config, - pooling_type=PoolingType.ALL, - normalize=False, - softmax=False, - ) + assert pooler_config is not None + + self.pooler = DispatchPooler( + 
{"encode": Pooler.for_encode(pooler_config)}, ) def forward( self, diff --git a/vllm/model_executor/models/jamba.py b/vllm/model_executor/models/jamba.py index e95f3491c6b..34281b2e99e 100644 --- a/vllm/model_executor/models/jamba.py +++ b/vllm/model_executor/models/jamba.py @@ -19,8 +19,8 @@ RowParallelLinear) from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.mamba.mamba_mixer import MambaMixer -from vllm.model_executor.layers.pooler import (ClassifierPooler, PoolingType, - SimplePooler) +from vllm.model_executor.layers.pooler import (DispatchPooler, Pooler, + PoolingType) from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.vocab_parallel_embedding import ( DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) @@ -584,16 +584,15 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): pooler_config = vllm_config.model_config.pooler_config assert pooler_config is not None - pooler = SimplePooler.from_config_with_defaults( - pooler_config, - pooling_type=PoolingType.LAST, - normalize=False, - softmax=False, - ) - - self.pooler = ClassifierPooler( - vllm_config.model_config, - pooling=pooler.pooling, - classifier=self.score, - act_fn=pooler.head.activation, - ) + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "classify": + Pooler.for_classify( + pooler_config, + classifier=self.score, + default_pooling_type=PoolingType.LAST, + default_normalize=False, + default_softmax=False, + ), + }) diff --git a/vllm/model_executor/models/jina_vl.py b/vllm/model_executor/models/jina_vl.py index 6b191b09b4b..0c4284f7daa 100644 --- a/vllm/model_executor/models/jina_vl.py +++ b/vllm/model_executor/models/jina_vl.py @@ -12,7 +12,7 @@ from vllm.logger import init_logger from vllm.model_executor.layers.linear import (ColumnParallelLinear, RowParallelLinear) -from vllm.model_executor.layers.pooler import Pooler, PoolingType +from vllm.model_executor.layers.pooler import DispatchPooler, Pooler from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.sequence import IntermediateTensors @@ -96,11 +96,17 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.score = JinaVLScorer(config) - self.pooler = Pooler.from_config_with_defaults( - pooler_config, - pooling_type=PoolingType.LAST, - normalize=False, - softmax=True) + pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "classify": + Pooler.for_classify(pooler_config, classifier=None), + "score": + Pooler.for_classify(pooler_config, classifier=None), + }) @classmethod def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]: diff --git a/vllm/model_executor/models/modernbert.py b/vllm/model_executor/models/modernbert.py index 74986f9f573..be1c3438d9d 100644 --- a/vllm/model_executor/models/modernbert.py +++ b/vllm/model_executor/models/modernbert.py @@ -1,6 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from collections.abc import Iterable +from collections.abc import Iterable, Set from typing import Optional, Union import torch @@ -13,7 +13,8 @@ from vllm.distributed import get_tensor_model_parallel_world_size from vllm.model_executor.layers.linear import (QKVParallelLinear, RowParallelLinear) -from vllm.model_executor.layers.pooler import (ClassifierPooler, Pooler, +from 
vllm.model_executor.layers.pooler import (ClassifierPooler, + DispatchPooler, Pooler, PoolingMethod, PoolingParamsUpdate, PoolingType) @@ -271,19 +272,27 @@ def __init__(self, config: ModernBertConfig): eps=config.norm_eps, bias=config.norm_bias) - def get_pooling_updates( - self, - task: PoolingTask, - ) -> Optional[PoolingParamsUpdate]: + def get_supported_tasks(self) -> Set[PoolingTask]: + return self.pooling.get_supported_tasks() + + def get_pooling_updates(self, task: PoolingTask) -> PoolingParamsUpdate: return self.pooling.get_pooling_updates(task) + def _head(self, pooled_output: torch.Tensor): + return self.norm(self.act(self.dense(pooled_output))) + def forward( self, hidden_states: Union[torch.Tensor, list[torch.Tensor]], pooling_metadata: PoolingMetadata, ) -> Union[torch.Tensor, list[torch.Tensor]]: pooled_output = self.pooling(hidden_states, pooling_metadata) - pooled_output = self.norm(self.act(self.dense(pooled_output))) + + if isinstance(pooled_output, list): + pooled_output = [self._head(output) for output in pooled_output] + else: + pooled_output = self._head(pooled_output) + return pooled_output @@ -299,11 +308,28 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.model = ModernBertModel(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "modernbert")) self.classifier = nn.Linear(config.hidden_size, config.num_labels) - self.pooler = ClassifierPooler( - vllm_config.model_config, - pooling=ModernBertPooler(config), - classifier=self.classifier, - ) + + pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "classify": + ClassifierPooler( + pooling=ModernBertPooler(config), + classifier=self.classifier, + act_fn=ClassifierPooler.act_fn_for_seq_cls( + vllm_config.model_config), + ), + "score": + ClassifierPooler( + pooling=ModernBertPooler(config), + classifier=self.classifier, + act_fn=ClassifierPooler.act_fn_for_cross_encoder( + vllm_config.model_config), + ), + }) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): diff --git a/vllm/model_executor/models/qwen2_rm.py b/vllm/model_executor/models/qwen2_rm.py index 58f95d6eebf..f12e9a041a9 100644 --- a/vllm/model_executor/models/qwen2_rm.py +++ b/vllm/model_executor/models/qwen2_rm.py @@ -15,7 +15,8 @@ from vllm.config import VllmConfig from vllm.model_executor.layers.linear import (ColumnParallelLinear, RowParallelLinear) -from vllm.model_executor.layers.pooler import Pooler, PoolingType, SimplePooler +from vllm.model_executor.layers.pooler import (DispatchPooler, Pooler, + PoolingType) from vllm.sequence import IntermediateTensors from .interfaces import SupportsLoRA, SupportsPP @@ -26,7 +27,7 @@ class Qwen2RewardBaseModel(nn.Module, SupportsLoRA, SupportsPP): is_pooling_model = True - pooler: SimplePooler + pooler: Pooler packed_modules_mapping = { "qkv_proj": [ @@ -94,12 +95,12 @@ class Qwen2ForRewardModel(Qwen2RewardBaseModel): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): vllm_config.model_config.hf_config.num_labels = 1 super().__init__(vllm_config=vllm_config, prefix=prefix) + pooler_config = vllm_config.model_config.pooler_config - self.pooler = Pooler.from_config_with_defaults( - pooler_config, - pooling_type=PoolingType.ALL, - normalize=False, - softmax=False) + assert pooler_config is not None + + self.pooler = DispatchPooler( + {"encode": Pooler.for_encode(pooler_config)}, ) class Qwen2ForProcessRewardModel(Qwen2RewardBaseModel): @@ 
-107,11 +108,17 @@ class Qwen2ForProcessRewardModel(Qwen2RewardBaseModel): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): vllm_config.model_config.hf_config.num_labels = 2 super().__init__(vllm_config=vllm_config, prefix=prefix) + pooler_config = vllm_config.model_config.pooler_config - self.pooler = Pooler.from_config_with_defaults( - pooler_config, - pooling_type=PoolingType.STEP, - normalize=False, - softmax=True, - step_tag_id=151651, - ) + assert pooler_config is not None + + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode( + pooler_config, + default_pooling_type=PoolingType.STEP, + default_normalize=False, + default_softmax=True, + default_step_tag_id=151651, + ) + }) diff --git a/vllm/model_executor/models/roberta.py b/vllm/model_executor/models/roberta.py index 7d3b56ced5c..c6b41164403 100644 --- a/vllm/model_executor/models/roberta.py +++ b/vllm/model_executor/models/roberta.py @@ -9,7 +9,8 @@ from transformers import RobertaConfig from vllm.config import VllmConfig -from vllm.model_executor.layers.pooler import ClassifierPooler, CLSPool +from vllm.model_executor.layers.pooler import (ClassifierPooler, CLSPool, + DispatchPooler, Pooler) from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) from vllm.model_executor.models.bert import BertEmbeddingModel, BertModel @@ -63,16 +64,10 @@ def forward( # References: # - https://github.com/huggingface/transformers/blob/a3d69a8994d673899608a7c17fbf4f953f50474e/src/transformers/models/roberta/modeling_roberta.py#L133 # - https://github.com/huggingface/transformers/blob/a3d69a8994d673899608a7c17fbf4f953f50474e/src/transformers/models/roberta/modeling_roberta.py#L1669 - pos_list = [] - token_list = [] - offset = 0 - for seq_len in seq_lens: - pos_list.append(position_ids[offset:offset + seq_len]) - token_list.append(input_ids[offset:offset + seq_len]) - offset += seq_len - + seq_lens_list = seq_lens.tolist() new_pos_list = [] - for positions, tokens in zip(pos_list, token_list): + for positions, tokens in zip(position_ids.split(seq_lens_list), + input_ids.split(seq_lens_list)): # Verify assumption that incoming position are # always a sequence from 0 to N. 
expected_pos = torch.arange(positions.size()[0], @@ -184,15 +179,30 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.num_labels = config.num_labels self.roberta = BertModel(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "bert"), - embedding_class=RobertaEmbedding, - add_pooling_layer=False) + embedding_class=RobertaEmbedding) self.classifier = RobertaClassificationHead(config) - self.pooler = ClassifierPooler( - vllm_config.model_config, - pooling=CLSPool(), - classifier=self.classifier, - ) + pooler_config = vllm_config.model_config.pooler_config + assert pooler_config is not None + + self.pooler = DispatchPooler({ + "encode": + Pooler.for_encode(pooler_config), + "classify": + ClassifierPooler( + pooling=CLSPool(), + classifier=self.classifier, + act_fn=ClassifierPooler.act_fn_for_seq_cls( + vllm_config.model_config), + ), + "score": + ClassifierPooler( + pooling=CLSPool(), + classifier=self.classifier, + act_fn=ClassifierPooler.act_fn_for_cross_encoder( + vllm_config.model_config), + ), + }) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): loader = AutoWeightsLoader(self) diff --git a/vllm/model_executor/pooling_metadata.py b/vllm/model_executor/pooling_metadata.py index 4dd443bc26e..e6f1ca61dd2 100644 --- a/vllm/model_executor/pooling_metadata.py +++ b/vllm/model_executor/pooling_metadata.py @@ -38,6 +38,13 @@ def __repr__(self) -> str: f"seq_data={self.seq_data}, " f"prompt_lens={self.prompt_lens})") + def __getitem__(self, indices: slice): + return PoolingMetadata( + seq_groups=self.seq_groups[indices], + seq_data=dict(list(self.seq_data.items())[indices]), + prompt_lens=self.prompt_lens[indices], + ) + @dataclass class PoolingTensors: diff --git a/vllm/v1/pool/metadata.py b/vllm/v1/pool/metadata.py index 5f321cd87c5..28af720d05f 100644 --- a/vllm/v1/pool/metadata.py +++ b/vllm/v1/pool/metadata.py @@ -15,3 +15,11 @@ class PoolingMetadata: prompt_lens: torch.Tensor prompt_token_ids: Optional[torch.Tensor] pooling_params: list[PoolingParams] + + def __getitem__(self, indices: slice): + return PoolingMetadata( + prompt_lens=self.prompt_lens[indices], + prompt_token_ids=None if self.prompt_token_ids is None else + self.prompt_token_ids[indices], + pooling_params=self.pooling_params[indices], + ) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 670e653929c..cd66d8bcd63 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -5,7 +5,7 @@ import gc import time from contextlib import contextmanager -from typing import TYPE_CHECKING, Any, Optional, Union, cast, get_args +from typing import TYPE_CHECKING, Any, Optional, Union, cast import numpy as np import torch @@ -415,15 +415,11 @@ def _update_states(self, scheduler_output: "SchedulerOutput") -> None: generator = None if pooling_params: - assert pooling_params.task is not None, ( + assert (task := pooling_params.task) is not None, ( "You did not set `task` in the API") model = cast(VllmModelForPooling, self.model) - to_update = (model.pooler.get_pooling_updates( - pooling_params.task)) - assert to_update is not None, ( - f"{pooling_params.task=} is not supported by the model") - + to_update = model.pooler.get_pooling_updates(task) to_update.apply(pooling_params) self.requests[req_id] = CachedRequestState( @@ -1122,10 +1118,7 @@ def get_supported_pooling_tasks(self) -> list[PoolingTask]: if not is_pooling_model(model): return [] - return [ - task for task in get_args(PoolingTask) - if 
model.pooler.get_pooling_updates(task) - ] + return list(model.pooler.get_supported_tasks()) def apply_grammar_bitmask( self, @@ -2247,7 +2240,6 @@ def _dummy_pooler_run( dummy_pooling_params = PoolingParams(task=dummy_task) to_update = model.pooler.get_pooling_updates(dummy_task) - assert to_update is not None to_update.apply(dummy_pooling_params) dummy_metadata = PoolingMetadata( diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 7ed1cf41011..aad45b6abd1 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -3,7 +3,7 @@ import bisect import gc import time -from typing import TYPE_CHECKING, Any, Optional, cast, get_args +from typing import TYPE_CHECKING, Any, Optional, cast from unittest.mock import patch import numpy as np @@ -491,10 +491,7 @@ def get_supported_pooling_tasks(self) -> list[PoolingTask]: if not is_pooling_model(model): return [] - return [ - task for task in get_args(PoolingTask) - if model.pooler.get_pooling_updates(task) - ] + return list(model.pooler.get_supported_tasks()) def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: """ diff --git a/vllm/worker/model_runner_base.py b/vllm/worker/model_runner_base.py index b0737dfe319..62f26ac57a9 100644 --- a/vllm/worker/model_runner_base.py +++ b/vllm/worker/model_runner_base.py @@ -4,7 +4,7 @@ import dataclasses from abc import ABC, abstractmethod from typing import (TYPE_CHECKING, Any, Dict, Generic, List, Optional, Type, - TypeVar, get_args) + TypeVar) import torch import torch.nn as nn @@ -230,10 +230,7 @@ def get_supported_pooling_tasks(self) -> list[PoolingTask]: if not is_pooling_model(model): return [] - return [ - task for task in get_args(PoolingTask) - if model.pooler.get_pooling_updates(task) - ] + return list(model.pooler.get_supported_tasks()) def execute_model( self, diff --git a/vllm/worker/pooling_model_runner.py b/vllm/worker/pooling_model_runner.py index 2c3f4eb3ad4..d91b16be83d 100644 --- a/vllm/worker/pooling_model_runner.py +++ b/vllm/worker/pooling_model_runner.py @@ -199,15 +199,11 @@ def _prepare_pooling( pooling_params = seq_group_metadata.pooling_params assert pooling_params is not None - assert pooling_params.task is not None, ( + assert (task := pooling_params.task) is not None, ( "You did not set `task` in the API") - to_update = (cast(VllmModelForPooling, - self.model).pooler.get_pooling_updates( - pooling_params.task)) - assert to_update is not None, ( - f"{pooling_params.task=} is not supported by the model") - + model = cast(VllmModelForPooling, self.model) + to_update = model.pooler.get_pooling_updates(task) to_update.apply(pooling_params) seq_groups.append((seq_ids, pooling_params)) From d985882f71aabbc218336ff9e74c87478c73c18b Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Mon, 21 Jul 2025 10:23:57 +0100 Subject: [PATCH 223/552] [Docs] Fix hardcoded links in docs (#21287) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/design/v1/metrics.md | 5 ++--- docs/features/multimodal_inputs.md | 2 +- docs/features/quantization/bitblas.md | 2 +- docs/features/tool_calling.md | 2 +- docs/models/extensions/tensorizer.md | 2 +- 5 files changed, 6 insertions(+), 7 deletions(-) diff --git a/docs/design/v1/metrics.md b/docs/design/v1/metrics.md index eec42d79d82..e23308f2637 100644 --- a/docs/design/v1/metrics.md +++ b/docs/design/v1/metrics.md @@ -61,7 +61,7 @@ These are documented under [Inferencing and Serving -> Production 
Metrics](../.. ### Grafana Dashboard -vLLM also provides [a reference example](https://docs.vllm.ai/en/stable/examples/online_serving/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard. +vLLM also provides [a reference example](../../examples/online_serving/prometheus_grafana.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard. The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important: @@ -672,8 +672,7 @@ v0 has support for OpenTelemetry tracing: `--collect-detailed-traces` - [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/) -- [User-facing - docs](https://docs.vllm.ai/en/latest/examples/opentelemetry.html) +- [User-facing docs](../../examples/online_serving/opentelemetry.md) - [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f) - [IBM product diff --git a/docs/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md index f9df2c89c60..e820ace4f8f 100644 --- a/docs/features/multimodal_inputs.md +++ b/docs/features/multimodal_inputs.md @@ -98,7 +98,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis Full example: -If using the [LLM.chat](https://docs.vllm.ai/en/stable/models/generative_models.html#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings: +If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings: ```python from vllm import LLM diff --git a/docs/features/quantization/bitblas.md b/docs/features/quantization/bitblas.md index ba014d28cde..6f53a448ee3 100644 --- a/docs/features/quantization/bitblas.md +++ b/docs/features/quantization/bitblas.md @@ -5,7 +5,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic !!! note Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`). Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper. - For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html). + For details see [supported hardware](supported_hardware.md). Below are the steps to utilize BitBLAS with vLLM. diff --git a/docs/features/tool_calling.md b/docs/features/tool_calling.md index 9b9d6e1360e..8d89dc4c8d8 100644 --- a/docs/features/tool_calling.md +++ b/docs/features/tool_calling.md @@ -95,7 +95,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha ## Required Function Calling -vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](https://docs.vllm.ai/en/latest/usage/v1_guide.html#feature-model) for the V1 engine. 
+vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](../usage/v1_guide.md#features) for the V1 engine. When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter. diff --git a/docs/models/extensions/tensorizer.md b/docs/models/extensions/tensorizer.md index 5aa647b1992..6ea61b080cd 100644 --- a/docs/models/extensions/tensorizer.md +++ b/docs/models/extensions/tensorizer.md @@ -7,7 +7,7 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor For more information on CoreWeave's Tensorizer, please refer to [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see -the [vLLM example script](https://docs.vllm.ai/en/latest/examples/others/tensorize_vllm_model.html). +the [vLLM example script](../../examples/others/tensorize_vllm_model.md). !!! note Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`. From 0a85e263d471b42448a26f520f2dfe6aa8848fe7 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Mon, 21 Jul 2025 10:25:02 +0100 Subject: [PATCH 224/552] [Docs] Make tables more space efficient in `supported_models.md` (#21291) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/models/supported_models.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 57ba132b91d..943f8590ac0 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -314,6 +314,13 @@ See [this page](generative_models.md) for more information on how to use generat Specified using `--task generate`. + + | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | |--------------|--------|-------------------|----------------------|---------------------------|---------------------| | `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. 
| ✅︎ | ✅︎ | ✅︎ | From 0a6ac1c6c39145e0e120f260a300d32e6b319bd8 Mon Sep 17 00:00:00 2001 From: Ning Xie Date: Mon, 21 Jul 2025 19:18:33 +0800 Subject: [PATCH 225/552] [Misc] unify variable for LLM instance (#20996) Signed-off-by: Andy Xie Signed-off-by: x22x22 --- docs/configuration/model_resolution.md | 2 +- docs/features/lora.md | 4 +- docs/features/quantization/fp8.md | 10 ++- docs/features/quantization/int4.md | 3 +- docs/features/quantization/int8.md | 3 +- docs/models/pooling_models.md | 10 +-- examples/offline_inference/basic/classify.py | 4 +- examples/offline_inference/basic/embed.py | 4 +- examples/offline_inference/basic/score.py | 4 +- .../embed_jina_embeddings_v3.py | 4 +- .../offline_inference/embed_matryoshka_fy.py | 4 +- .../offline_inference/neuron_speculation.py | 12 +-- .../prithvi_geospatial_mae.py | 4 +- examples/offline_inference/qwen3_reranker.py | 8 +- .../test_basic_correctness.py | 4 +- tests/basic_correctness/test_preemption.py | 10 +-- tests/conftest.py | 32 ++++---- tests/core/test_num_computed_tokens_update.py | 2 +- tests/detokenizer/test_stop_reason.py | 2 +- tests/detokenizer/test_stop_strings.py | 42 +++++------ tests/lora/test_llama_tp.py | 20 ++--- tests/metrics/test_metrics.py | 14 ++-- .../test_model_load_with_params.py | 10 +-- .../models/language/generation/test_hybrid.py | 2 +- .../language/generation/test_mistral.py | 14 ++-- tests/models/language/pooling/mteb_utils.py | 18 ++--- tests/models/language/pooling/test_gritlm.py | 4 +- tests/models/language/pooling/test_jina.py | 4 +- .../pooling/test_nomic_max_model_len.py | 6 +- .../pooling/test_truncation_control.py | 6 +- .../multimodal/generation/test_pixtral.py | 5 +- .../multimodal/generation/test_whisper.py | 2 +- .../multimodal/generation/vlm_utils/core.py | 2 +- .../multimodal/pooling/test_dse_qwen2_vl.py | 2 +- .../pooling/test_jinavl_reranker.py | 2 +- tests/models/quantization/test_modelopt.py | 6 +- tests/models/quantization/test_nvfp4.py | 6 +- .../test_disable_sliding_window.py | 22 +++--- tests/prefix_caching/test_prefix_caching.py | 6 +- tests/quantization/test_gptq_dynamic.py | 2 +- tests/quantization/test_quark.py | 4 +- .../test_register_quantization_config.py | 2 +- tests/samplers/test_ignore_eos.py | 2 +- tests/samplers/test_logits_processor.py | 10 +-- tests/samplers/test_logprobs.py | 4 +- tests/samplers/test_no_bad_words.py | 12 +-- tests/samplers/test_seeded_generate.py | 2 +- tests/tokenization/test_detokenize.py | 2 +- tests/v1/core/test_scheduler_e2e.py | 12 +-- tests/v1/engine/test_llm_engine.py | 14 ++-- tests/v1/sample/test_logprobs.py | 8 +- tests/v1/sample/test_sampling_params_e2e.py | 74 +++++++++---------- tests/v1/test_oracle.py | 6 +- 53 files changed, 237 insertions(+), 236 deletions(-) diff --git a/docs/configuration/model_resolution.md b/docs/configuration/model_resolution.md index d98142a835c..49576a8217d 100644 --- a/docs/configuration/model_resolution.md +++ b/docs/configuration/model_resolution.md @@ -14,7 +14,7 @@ For example: ```python from vllm import LLM -model = LLM( +llm = LLM( model="cerebras/Cerebras-GPT-1.3B", hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2 ) diff --git a/docs/features/lora.md b/docs/features/lora.md index 6acfdcce445..ea1b495138c 100644 --- a/docs/features/lora.md +++ b/docs/features/lora.md @@ -302,7 +302,7 @@ To this end, we allow registration of default multimodal LoRAs to handle this au return tokenizer.apply_chat_template(chat, tokenize=False) - model = LLM( + llm = LLM( model=model_id, enable_lora=True, 
max_lora_rank=64, @@ -329,7 +329,7 @@ To this end, we allow registration of default multimodal LoRAs to handle this au } - outputs = model.generate( + outputs = llm.generate( inputs, sampling_params=SamplingParams( temperature=0.2, diff --git a/docs/features/quantization/fp8.md b/docs/features/quantization/fp8.md index a6c0fd78e76..0661933acd6 100644 --- a/docs/features/quantization/fp8.md +++ b/docs/features/quantization/fp8.md @@ -86,8 +86,9 @@ Load and run the model in `vllm`: ```python from vllm import LLM -model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic") -result = model.generate("Hello my name is") + +llm = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic") +result = llm.generate("Hello my name is") print(result[0].outputs[0].text) ``` @@ -125,9 +126,10 @@ In this mode, all Linear modules (except for the final `lm_head`) have their wei ```python from vllm import LLM -model = LLM("facebook/opt-125m", quantization="fp8") + +llm = LLM("facebook/opt-125m", quantization="fp8") # INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB -result = model.generate("Hello, my name is") +result = llm.generate("Hello, my name is") print(result[0].outputs[0].text) ``` diff --git a/docs/features/quantization/int4.md b/docs/features/quantization/int4.md index f26de73c2f0..1df32a11ed9 100644 --- a/docs/features/quantization/int4.md +++ b/docs/features/quantization/int4.md @@ -108,7 +108,8 @@ After quantization, you can load and run the model in vLLM: ```python from vllm import LLM -model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128") + +llm = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128") ``` To evaluate accuracy, you can use `lm_eval`: diff --git a/docs/features/quantization/int8.md b/docs/features/quantization/int8.md index 7e1cb3fee94..45fae58a648 100644 --- a/docs/features/quantization/int8.md +++ b/docs/features/quantization/int8.md @@ -114,7 +114,8 @@ After quantization, you can load and run the model in vLLM: ```python from vllm import LLM -model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token") + +llm = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token") ``` To evaluate accuracy, you can use `lm_eval`: diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md index 4f347d165ee..4c1e5c1f3bf 100644 --- a/docs/models/pooling_models.md +++ b/docs/models/pooling_models.md @@ -345,11 +345,11 @@ You can change the output dimensions of embedding models that support Matryoshka ```python from vllm import LLM, PoolingParams -model = LLM(model="jinaai/jina-embeddings-v3", - task="embed", - trust_remote_code=True) -outputs = model.embed(["Follow the white rabbit."], - pooling_params=PoolingParams(dimensions=32)) +llm = LLM(model="jinaai/jina-embeddings-v3", + task="embed", + trust_remote_code=True) +outputs = llm.embed(["Follow the white rabbit."], + pooling_params=PoolingParams(dimensions=32)) print(outputs[0].outputs) ``` diff --git a/examples/offline_inference/basic/classify.py b/examples/offline_inference/basic/classify.py index 219064e9742..aaf0e83c9de 100644 --- a/examples/offline_inference/basic/classify.py +++ b/examples/offline_inference/basic/classify.py @@ -28,10 +28,10 @@ def main(args: Namespace): # Create an LLM. # You should pass task="classify" for classification models - model = LLM(**vars(args)) + llm = LLM(**vars(args)) # Generate logits. The output is a list of ClassificationRequestOutputs. - outputs = model.classify(prompts) + outputs = llm.classify(prompts) # Print the outputs. 
print("\nGenerated Outputs:\n" + "-" * 60) diff --git a/examples/offline_inference/basic/embed.py b/examples/offline_inference/basic/embed.py index 1114033d5ce..7ff9c7f5e0e 100644 --- a/examples/offline_inference/basic/embed.py +++ b/examples/offline_inference/basic/embed.py @@ -31,10 +31,10 @@ def main(args: Namespace): # Create an LLM. # You should pass task="embed" for embedding models - model = LLM(**vars(args)) + llm = LLM(**vars(args)) # Generate embedding. The output is a list of EmbeddingRequestOutputs. - outputs = model.embed(prompts) + outputs = llm.embed(prompts) # Print the outputs. print("\nGenerated Outputs:\n" + "-" * 60) diff --git a/examples/offline_inference/basic/score.py b/examples/offline_inference/basic/score.py index 6a08de2d2c3..d37527b0a13 100644 --- a/examples/offline_inference/basic/score.py +++ b/examples/offline_inference/basic/score.py @@ -27,10 +27,10 @@ def main(args: Namespace): # Create an LLM. # You should pass task="score" for cross-encoder models - model = LLM(**vars(args)) + llm = LLM(**vars(args)) # Generate scores. The output is a list of ScoringRequestOutputs. - outputs = model.score(text_1, texts_2) + outputs = llm.score(text_1, texts_2) # Print the outputs. print("\nGenerated Outputs:\n" + "-" * 60) diff --git a/examples/offline_inference/embed_jina_embeddings_v3.py b/examples/offline_inference/embed_jina_embeddings_v3.py index e68128399ba..7d78b8c63c6 100644 --- a/examples/offline_inference/embed_jina_embeddings_v3.py +++ b/examples/offline_inference/embed_jina_embeddings_v3.py @@ -30,11 +30,11 @@ def main(args: Namespace): # Create an LLM. # You should pass task="embed" for embedding models - model = LLM(**vars(args)) + llm = LLM(**vars(args)) # Generate embedding. The output is a list of EmbeddingRequestOutputs. # Only text matching task is supported for now. See #16120 - outputs = model.embed(prompts) + outputs = llm.embed(prompts) # Print the outputs. print("\nGenerated Outputs:") diff --git a/examples/offline_inference/embed_matryoshka_fy.py b/examples/offline_inference/embed_matryoshka_fy.py index 7f5d74d9a3a..50a645ba827 100644 --- a/examples/offline_inference/embed_matryoshka_fy.py +++ b/examples/offline_inference/embed_matryoshka_fy.py @@ -30,10 +30,10 @@ def main(args: Namespace): # Create an LLM. # You should pass task="embed" for embedding models - model = LLM(**vars(args)) + llm = LLM(**vars(args)) # Generate embedding. The output is a list of EmbeddingRequestOutputs. - outputs = model.embed(prompts, pooling_params=PoolingParams(dimensions=32)) + outputs = llm.embed(prompts, pooling_params=PoolingParams(dimensions=32)) # Print the outputs. 
print("\nGenerated Outputs:") diff --git a/examples/offline_inference/neuron_speculation.py b/examples/offline_inference/neuron_speculation.py index 2ef69f29863..26276cba202 100644 --- a/examples/offline_inference/neuron_speculation.py +++ b/examples/offline_inference/neuron_speculation.py @@ -25,7 +25,7 @@ def config_buckets(): os.environ["NEURON_TOKEN_GEN_BUCKETS"] = "128,512,1024,2048" -def initialize_model(): +def initialize_llm(): """Create an LLM with speculative decoding.""" return LLM( model="openlm-research/open_llama_7b", @@ -43,9 +43,9 @@ def initialize_model(): ) -def process_requests(model: LLM, sampling_params: SamplingParams): +def process_requests(llm: LLM, sampling_params: SamplingParams): """Generate texts from prompts and print them.""" - outputs = model.generate(prompts, sampling_params) + outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text @@ -53,12 +53,12 @@ def process_requests(model: LLM, sampling_params: SamplingParams): def main(): - """Main function that sets up the model and processes prompts.""" + """Main function that sets up the llm and processes prompts.""" config_buckets() - model = initialize_model() + llm = initialize_llm() # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, top_k=1) - process_requests(model, sampling_params) + process_requests(llm, sampling_params) if __name__ == "__main__": diff --git a/examples/offline_inference/prithvi_geospatial_mae.py b/examples/offline_inference/prithvi_geospatial_mae.py index 567c448a8c9..6dc03e85baa 100644 --- a/examples/offline_inference/prithvi_geospatial_mae.py +++ b/examples/offline_inference/prithvi_geospatial_mae.py @@ -140,7 +140,7 @@ class PrithviMAE: def __init__(self): print("Initializing PrithviMAE model") - self.model = LLM( + self.llm = LLM( model=os.path.join(os.path.dirname(__file__), "./model"), skip_tokenizer_init=True, dtype="float32", @@ -158,7 +158,7 @@ def run(self, input_data, location_coords): prompt = {"prompt_token_ids": [1], "multi_modal_data": mm_data} - outputs = self.model.encode(prompt, use_tqdm=False) + outputs = self.llm.encode(prompt, use_tqdm=False) print("################ Inference done (it took seconds) ##############") return outputs[0].outputs.data diff --git a/examples/offline_inference/qwen3_reranker.py b/examples/offline_inference/qwen3_reranker.py index fe3cebc348f..b0fd57237d4 100644 --- a/examples/offline_inference/qwen3_reranker.py +++ b/examples/offline_inference/qwen3_reranker.py @@ -17,13 +17,13 @@ # Models converted offline using this method can not only be more efficient # and support the vllm score API, but also make the init parameters more # concise, for example. -# model = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", task="score") +# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", task="score") # If you want to load the official original version, the init parameters are # as follows. 
-def get_model() -> LLM: +def get_llm() -> LLM: """Initializes and returns the LLM model for Qwen3-Reranker.""" return LLM( model=model_name, @@ -77,8 +77,8 @@ def main() -> None: ] documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents] - model = get_model() - outputs = model.score(queries, documents) + llm = get_llm() + outputs = llm.score(queries, documents) print("-" * 30) print([output.outputs.score for output in outputs]) diff --git a/tests/basic_correctness/test_basic_correctness.py b/tests/basic_correctness/test_basic_correctness.py index 2e103019f7a..13ddf035a55 100644 --- a/tests/basic_correctness/test_basic_correctness.py +++ b/tests/basic_correctness/test_basic_correctness.py @@ -236,13 +236,13 @@ def test_failed_model_execution(vllm_runner, monkeypatch) -> None: monkeypatch.setenv('VLLM_ENABLE_V1_MULTIPROCESSING', '0') with vllm_runner('facebook/opt-125m', enforce_eager=True) as vllm_model: - if isinstance(vllm_model.model.llm_engine, LLMEngineV1): + if isinstance(vllm_model.llm.llm_engine, LLMEngineV1): v1_test_failed_model_execution(vllm_model) def v1_test_failed_model_execution(vllm_model): - engine = vllm_model.model.llm_engine + engine = vllm_model.llm.llm_engine mocked_execute_model = Mock( side_effect=RuntimeError("Mocked Critical Error")) engine.engine_core.engine_core.model_executor.execute_model =\ diff --git a/tests/basic_correctness/test_preemption.py b/tests/basic_correctness/test_preemption.py index 341a39a42b8..db2fa2f6bef 100644 --- a/tests/basic_correctness/test_preemption.py +++ b/tests/basic_correctness/test_preemption.py @@ -81,7 +81,7 @@ def test_chunked_prefill_recompute( disable_log_stats=False, ) as vllm_model: vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens) - assert (vllm_model.model.llm_engine.scheduler[0].artificial_preempt_cnt + assert (vllm_model.llm.llm_engine.scheduler[0].artificial_preempt_cnt < ARTIFICIAL_PREEMPTION_MAX_CNT) for i in range(len(example_prompts)): @@ -118,10 +118,10 @@ def test_preemption( distributed_executor_backend=distributed_executor_backend, ) as vllm_model: vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens) - assert (vllm_model.model.llm_engine.scheduler[0].artificial_preempt_cnt + assert (vllm_model.llm.llm_engine.scheduler[0].artificial_preempt_cnt < ARTIFICIAL_PREEMPTION_MAX_CNT) total_preemption = ( - vllm_model.model.llm_engine.scheduler[0].num_cumulative_preemption) + vllm_model.llm.llm_engine.scheduler[0].num_cumulative_preemption) check_outputs_equal( outputs_0_lst=hf_outputs, @@ -174,12 +174,12 @@ def test_preemption_infeasible( ) as vllm_model: sampling_params = SamplingParams(max_tokens=max_tokens, ignore_eos=True) - req_outputs = vllm_model.model.generate( + req_outputs = vllm_model.llm.generate( example_prompts, sampling_params=sampling_params, ) - assert (vllm_model.model.llm_engine.scheduler[0].artificial_preempt_cnt + assert (vllm_model.llm.llm_engine.scheduler[0].artificial_preempt_cnt < ARTIFICIAL_PREEMPTION_MAX_CNT) # Verify the request is ignored and not hang. 
diff --git a/tests/conftest.py b/tests/conftest.py index f3524d1fe2a..a18dbf58c80 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -784,7 +784,7 @@ def __init__( enforce_eager: Optional[bool] = False, **kwargs, ) -> None: - self.model = LLM( + self.llm = LLM( model=model_name, task=task, tokenizer=tokenizer_name, @@ -854,9 +854,9 @@ def generate( videos=videos, audios=audios) - req_outputs = self.model.generate(inputs, - sampling_params=sampling_params, - **kwargs) + req_outputs = self.llm.generate(inputs, + sampling_params=sampling_params, + **kwargs) outputs: list[tuple[list[list[int]], list[str]]] = [] for req_output in req_outputs: @@ -902,9 +902,9 @@ def generate_w_logprobs( videos=videos, audios=audios) - req_outputs = self.model.generate(inputs, - sampling_params=sampling_params, - **kwargs) + req_outputs = self.llm.generate(inputs, + sampling_params=sampling_params, + **kwargs) toks_str_logsprobs_prompt_logprobs = ( self._final_steps_generate_w_logprobs(req_outputs)) @@ -924,8 +924,8 @@ def generate_encoder_decoder_w_logprobs( ''' assert sampling_params.logprobs is not None - req_outputs = self.model.generate(encoder_decoder_prompts, - sampling_params=sampling_params) + req_outputs = self.llm.generate(encoder_decoder_prompts, + sampling_params=sampling_params) toks_str_logsprobs_prompt_logprobs = ( self._final_steps_generate_w_logprobs(req_outputs)) # Omit prompt logprobs if not required by sampling params @@ -1018,7 +1018,7 @@ def generate_beam_search( videos=videos, audios=audios) - outputs = self.model.beam_search( + outputs = self.llm.beam_search( inputs, BeamSearchParams(beam_width=beam_width, max_tokens=max_tokens)) returned_outputs = [] @@ -1029,7 +1029,7 @@ def generate_beam_search( return returned_outputs def classify(self, prompts: list[str]) -> list[list[float]]: - req_outputs = self.model.classify(prompts) + req_outputs = self.llm.classify(prompts) return [req_output.outputs.probs for req_output in req_outputs] def embed(self, @@ -1044,11 +1044,11 @@ def embed(self, videos=videos, audios=audios) - req_outputs = self.model.embed(inputs, *args, **kwargs) + req_outputs = self.llm.embed(inputs, *args, **kwargs) return [req_output.outputs.embedding for req_output in req_outputs] def encode(self, prompts: list[str]) -> list[list[float]]: - req_outputs = self.model.encode(prompts) + req_outputs = self.llm.encode(prompts) return [req_output.outputs.data for req_output in req_outputs] def score( @@ -1058,18 +1058,18 @@ def score( *args, **kwargs, ) -> list[float]: - req_outputs = self.model.score(text_1, text_2, *args, **kwargs) + req_outputs = self.llm.score(text_1, text_2, *args, **kwargs) return [req_output.outputs.score for req_output in req_outputs] def apply_model(self, func: Callable[[nn.Module], _R]) -> list[_R]: - executor = self.model.llm_engine.model_executor + executor = self.llm.llm_engine.model_executor return executor.apply_model(func) def __enter__(self): return self def __exit__(self, exc_type, exc_value, traceback): - del self.model + del self.llm cleanup_dist_env_and_memory() diff --git a/tests/core/test_num_computed_tokens_update.py b/tests/core/test_num_computed_tokens_update.py index 1b958e34df8..9e1b7913dfb 100644 --- a/tests/core/test_num_computed_tokens_update.py +++ b/tests/core/test_num_computed_tokens_update.py @@ -37,7 +37,7 @@ def test_num_computed_tokens_update(num_scheduler_steps: int, num_scheduler_steps=num_scheduler_steps, enable_chunked_prefill=enable_chunked_prefill, enforce_eager=enforce_eager) - engine: LLMEngine = 
runner.model.llm_engine + engine: LLMEngine = runner.llm.llm_engine # In multi-step + chunked-prefill there is no separate single prompt step. # What is scheduled will run for num_scheduler_steps always. diff --git a/tests/detokenizer/test_stop_reason.py b/tests/detokenizer/test_stop_reason.py index 9716f7d72a5..1ff679789c9 100644 --- a/tests/detokenizer/test_stop_reason.py +++ b/tests/detokenizer/test_stop_reason.py @@ -28,7 +28,7 @@ def vllm_model(vllm_runner): def test_stop_reason(vllm_model, example_prompts): tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL) stop_token_id = tokenizer.convert_tokens_to_ids(STOP_STR) - llm = vllm_model.model + llm = vllm_model.llm # test stop token outputs = llm.generate(example_prompts, diff --git a/tests/detokenizer/test_stop_strings.py b/tests/detokenizer/test_stop_strings.py index efe938a20c4..cb87c44cc39 100644 --- a/tests/detokenizer/test_stop_strings.py +++ b/tests/detokenizer/test_stop_strings.py @@ -101,42 +101,42 @@ def _stop_token_id(llm): def test_stop_strings(): # If V0, must set enforce_eager=False since we use # async output processing below. - vllm_model = LLM(MODEL, enforce_eager=envs.VLLM_USE_V1) + llm = LLM(MODEL, enforce_eager=envs.VLLM_USE_V1) if envs.VLLM_USE_V1: - _stop_basic(vllm_model) + _stop_basic(llm) else: - _set_async_mode(vllm_model, True) - _stop_basic(vllm_model) + _set_async_mode(llm, True) + _stop_basic(llm) - _set_async_mode(vllm_model, False) - _stop_basic(vllm_model) + _set_async_mode(llm, False) + _stop_basic(llm) if envs.VLLM_USE_V1: - _stop_multi_tokens(vllm_model) + _stop_multi_tokens(llm) else: - _set_async_mode(vllm_model, True) - _stop_multi_tokens(vllm_model) + _set_async_mode(llm, True) + _stop_multi_tokens(llm) - _set_async_mode(vllm_model, False) - _stop_multi_tokens(vllm_model) + _set_async_mode(llm, False) + _stop_multi_tokens(llm) if envs.VLLM_USE_V1: - _stop_partial_token(vllm_model) + _stop_partial_token(llm) else: - _set_async_mode(vllm_model, True) - _stop_partial_token(vllm_model) + _set_async_mode(llm, True) + _stop_partial_token(llm) - _set_async_mode(vllm_model, False) - _stop_partial_token(vllm_model) + _set_async_mode(llm, False) + _stop_partial_token(llm) if envs.VLLM_USE_V1: # FIXME: this does not respect include_in_output=False - # _stop_token_id(vllm_model) + # _stop_token_id(llm) pass else: - _set_async_mode(vllm_model, True) - _stop_token_id(vllm_model) + _set_async_mode(llm, True) + _stop_token_id(llm) - _set_async_mode(vllm_model, False) - _stop_token_id(vllm_model) + _set_async_mode(llm, False) + _stop_token_id(llm) diff --git a/tests/lora/test_llama_tp.py b/tests/lora/test_llama_tp.py index bebf44b6dfd..b1ad1fdd060 100644 --- a/tests/lora/test_llama_tp.py +++ b/tests/lora/test_llama_tp.py @@ -186,25 +186,25 @@ def test_tp2_serialize_and_deserialize_lora(tmp_path, sql_lora_files, model_uri = tmp_path / "vllm" / model_ref / suffix / model_name tensorizer_config = TensorizerConfig(tensorizer_uri=str(model_uri)) - loaded_vllm_model = LLM(model=model_ref, - load_format="tensorizer", - enable_lora=True, - enforce_eager=True, - model_loader_extra_config=tensorizer_config, - max_num_seqs=13, - tensor_parallel_size=2, - max_loras=2) + loaded_llm = LLM(model=model_ref, + load_format="tensorizer", + enable_lora=True, + enforce_eager=True, + model_loader_extra_config=tensorizer_config, + max_num_seqs=13, + tensor_parallel_size=2, + max_loras=2) tc_as_dict = tensorizer_config.to_serializable() print("lora adapter created") - assert do_sample(loaded_vllm_model, + assert 
do_sample(loaded_llm, sql_lora_files, tensorizer_config_dict=tc_as_dict, lora_id=0) == EXPECTED_NO_LORA_OUTPUT print("lora 1") - assert do_sample(loaded_vllm_model, + assert do_sample(loaded_llm, sql_lora_files, tensorizer_config_dict=tc_as_dict, lora_id=1) == EXPECTED_LORA_OUTPUT diff --git a/tests/metrics/test_metrics.py b/tests/metrics/test_metrics.py index 54dbb747de0..8cae8a80d38 100644 --- a/tests/metrics/test_metrics.py +++ b/tests/metrics/test_metrics.py @@ -41,7 +41,7 @@ def test_metric_counter_prompt_tokens( dtype=dtype, disable_log_stats=False, gpu_memory_utilization=0.4) as vllm_model: - tokenizer = vllm_model.model.get_tokenizer() + tokenizer = vllm_model.llm.get_tokenizer() prompt_token_counts = [ len(tokenizer.encode(p)) for p in example_prompts ] @@ -53,7 +53,7 @@ def test_metric_counter_prompt_tokens( vllm_prompt_token_count = sum(prompt_token_counts) _ = vllm_model.generate_greedy(example_prompts, max_tokens) - stat_logger = vllm_model.model.llm_engine.stat_loggers['prometheus'] + stat_logger = vllm_model.llm.llm_engine.stat_loggers['prometheus'] metric_count = stat_logger.metrics.counter_prompt_tokens.labels( **stat_logger.labels)._value.get() @@ -77,8 +77,8 @@ def test_metric_counter_generation_tokens( disable_log_stats=False, gpu_memory_utilization=0.4) as vllm_model: vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens) - tokenizer = vllm_model.model.get_tokenizer() - stat_logger = vllm_model.model.llm_engine.stat_loggers['prometheus'] + tokenizer = vllm_model.llm.get_tokenizer() + stat_logger = vllm_model.llm.llm_engine.stat_loggers['prometheus'] metric_count = stat_logger.metrics.counter_generation_tokens.labels( **stat_logger.labels)._value.get() vllm_generation_count = 0 @@ -113,8 +113,8 @@ def test_metric_counter_generation_tokens_multi_step( disable_async_output_proc=disable_async_output_proc, ) as vllm_model: vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens) - tokenizer = vllm_model.model.get_tokenizer() - stat_logger = vllm_model.model.llm_engine.stat_loggers['prometheus'] + tokenizer = vllm_model.llm.get_tokenizer() + stat_logger = vllm_model.llm.llm_engine.stat_loggers['prometheus'] metric_count = stat_logger.metrics.counter_generation_tokens.labels( **stat_logger.labels)._value.get() vllm_generation_count = 0 @@ -145,7 +145,7 @@ def test_metric_set_tag_model_name(vllm_runner, model: str, dtype: str, disable_log_stats=False, gpu_memory_utilization=0.3, served_model_name=served_model_name) as vllm_model: - stat_logger = vllm_model.model.llm_engine.stat_loggers['prometheus'] + stat_logger = vllm_model.llm.llm_engine.stat_loggers['prometheus'] metrics_tag_content = stat_logger.labels["model_name"] if envs.VLLM_CI_USE_S3: diff --git a/tests/model_executor/test_model_load_with_params.py b/tests/model_executor/test_model_load_with_params.py index 1d2d9f9a65b..27374763021 100644 --- a/tests/model_executor/test_model_load_with_params.py +++ b/tests/model_executor/test_model_load_with_params.py @@ -32,8 +32,8 @@ def test_model_loading_with_params(vllm_runner): output = vllm_model.embed("Write a short story about a robot that" " dreams for the first time.\n") - model_config = vllm_model.model.llm_engine.model_config - model_tokenizer = vllm_model.model.llm_engine.tokenizer + model_config = vllm_model.llm.llm_engine.model_config + model_tokenizer = vllm_model.llm.llm_engine.tokenizer # asserts on the bert model config file assert model_config.encoder_config["max_seq_length"] == 512 @@ -70,8 +70,8 @@ def 
test_roberta_model_loading_with_params(vllm_runner): output = vllm_model.embed("Write a short story about a robot that" " dreams for the first time.\n") - model_config = vllm_model.model.llm_engine.model_config - model_tokenizer = vllm_model.model.llm_engine.tokenizer + model_config = vllm_model.llm.llm_engine.model_config + model_tokenizer = vllm_model.llm.llm_engine.tokenizer # asserts on the bert model config file assert model_config.encoder_config["max_seq_length"] == 512 @@ -108,7 +108,7 @@ def test_facebook_roberta_model_loading_with_params(vllm_runner): output = vllm_model.embed("Write a short story about a robot that" " dreams for the first time.\n") - model_tokenizer = vllm_model.model.llm_engine.tokenizer + model_tokenizer = vllm_model.llm.llm_engine.tokenizer assert model_tokenizer.tokenizer_id == model_name def check_model(model): diff --git a/tests/models/language/generation/test_hybrid.py b/tests/models/language/generation/test_hybrid.py index e4294512338..2238924c1b5 100644 --- a/tests/models/language/generation/test_hybrid.py +++ b/tests/models/language/generation/test_hybrid.py @@ -274,7 +274,7 @@ def test_models_preemption_recompute( Tests that outputs are identical with and w/o preemptions (recompute). """ with vllm_runner(model, max_num_seqs=MAX_NUM_SEQS) as vllm_model: - scheduler = vllm_model.model.llm_engine.scheduler[0] + scheduler = vllm_model.llm.llm_engine.scheduler[0] scheduler.ENABLE_ARTIFICIAL_PREEMPT = True preempt_vllm_outputs = vllm_model.generate_greedy( example_prompts, max_tokens) diff --git a/tests/models/language/generation/test_mistral.py b/tests/models/language/generation/test_mistral.py index c70698ede37..81a88f2d485 100644 --- a/tests/models/language/generation/test_mistral.py +++ b/tests/models/language/generation/test_mistral.py @@ -238,8 +238,8 @@ def test_mistral_symbolic_languages(vllm_runner, model: str, load_format="mistral") as vllm_model: for prompt in SYMBOLIC_LANG_PROMPTS: msg = {"role": "user", "content": prompt} - outputs = vllm_model.model.chat([msg], - sampling_params=SAMPLING_PARAMS) + outputs = vllm_model.llm.chat([msg], + sampling_params=SAMPLING_PARAMS) assert "�" not in outputs[0].outputs[0].text.strip() @@ -253,11 +253,11 @@ def test_mistral_function_calling(vllm_runner, model: str, dtype: str) -> None: load_format="mistral") as vllm_model: msgs = copy.deepcopy(MSGS) - outputs = vllm_model.model.chat(msgs, - tools=TOOLS, - sampling_params=SAMPLING_PARAMS) + outputs = vllm_model.llm.chat(msgs, + tools=TOOLS, + sampling_params=SAMPLING_PARAMS) - tokenizer = vllm_model.model.get_tokenizer() + tokenizer = vllm_model.llm.get_tokenizer() tool_parser = MistralToolParser(tokenizer) model_output = outputs[0].outputs[0].text.strip() @@ -308,7 +308,7 @@ def test_mistral_guided_decoding( f"Give an example JSON for an employee profile that " f"fits this schema: {SAMPLE_JSON_SCHEMA}" }] - outputs = vllm_model.model.chat(messages, sampling_params=params) + outputs = vllm_model.llm.chat(messages, sampling_params=params) generated_text = outputs[0].outputs[0].text json_response = json.loads(generated_text) diff --git a/tests/models/language/pooling/mteb_utils.py b/tests/models/language/pooling/mteb_utils.py index 6c4fde5fdfa..97362f64166 100644 --- a/tests/models/language/pooling/mteb_utils.py +++ b/tests/models/language/pooling/mteb_utils.py @@ -30,7 +30,7 @@ class VllmMtebEncoder(mteb.Encoder): def __init__(self, vllm_model): super().__init__() - self.model = vllm_model + self.llm = vllm_model self.rng = np.random.default_rng(seed=42) def 
encode( @@ -43,7 +43,7 @@ def encode( # issues by randomizing the order. r = self.rng.permutation(len(sentences)) sentences = [sentences[i] for i in r] - outputs = self.model.embed(sentences, use_tqdm=False) + outputs = self.llm.embed(sentences, use_tqdm=False) embeds = np.array(outputs) embeds = embeds[np.argsort(r)] return embeds @@ -61,10 +61,10 @@ def predict( queries = [s[0] for s in sentences] corpus = [s[1] for s in sentences] - outputs = self.model.score(queries, - corpus, - truncate_prompt_tokens=-1, - use_tqdm=False) + outputs = self.llm.score(queries, + corpus, + truncate_prompt_tokens=-1, + use_tqdm=False) scores = np.array(outputs) scores = scores[np.argsort(r)] return scores @@ -178,11 +178,11 @@ def mteb_test_embed_models(hf_runner, if model_info.architecture: assert (model_info.architecture - in vllm_model.model.llm_engine.model_config.architectures) + in vllm_model.llm.llm_engine.model_config.architectures) vllm_main_score = run_mteb_embed_task(VllmMtebEncoder(vllm_model), MTEB_EMBED_TASKS) - vllm_dtype = vllm_model.model.llm_engine.model_config.dtype + vllm_dtype = vllm_model.llm.llm_engine.model_config.dtype with hf_runner(model_info.name, is_sentence_transformer=True, @@ -284,7 +284,7 @@ def mteb_test_rerank_models(hf_runner, max_num_seqs=8, **vllm_extra_kwargs) as vllm_model: - model_config = vllm_model.model.llm_engine.model_config + model_config = vllm_model.llm.llm_engine.model_config if model_info.architecture: assert (model_info.architecture in model_config.architectures) diff --git a/tests/models/language/pooling/test_gritlm.py b/tests/models/language/pooling/test_gritlm.py index 1274657991b..efa119bb765 100644 --- a/tests/models/language/pooling/test_gritlm.py +++ b/tests/models/language/pooling/test_gritlm.py @@ -120,7 +120,7 @@ def test_gritlm_offline_embedding(vllm_runner): task="embed", max_model_len=MAX_MODEL_LEN, ) as vllm_model: - llm = vllm_model.model + llm = vllm_model.llm d_rep = run_llm_encode( llm, @@ -167,7 +167,7 @@ def test_gritlm_offline_generate(monkeypatch: pytest.MonkeyPatch, vllm_runner): task="generate", max_model_len=MAX_MODEL_LEN, ) as vllm_model: - llm = vllm_model.model + llm = vllm_model.llm sampling_params = SamplingParams(temperature=0.0, max_tokens=256) outputs = llm.generate(input, sampling_params=sampling_params) diff --git a/tests/models/language/pooling/test_jina.py b/tests/models/language/pooling/test_jina.py index 9bfe7411e16..16c711407ae 100644 --- a/tests/models/language/pooling/test_jina.py +++ b/tests/models/language/pooling/test_jina.py @@ -87,10 +87,10 @@ def test_matryoshka( task="embed", dtype=dtype, max_model_len=None) as vllm_model: - assert vllm_model.model.llm_engine.model_config.is_matryoshka + assert vllm_model.llm.llm_engine.model_config.is_matryoshka matryoshka_dimensions = ( - vllm_model.model.llm_engine.model_config.matryoshka_dimensions) + vllm_model.llm.llm_engine.model_config.matryoshka_dimensions) assert matryoshka_dimensions is not None if dimensions not in matryoshka_dimensions: diff --git a/tests/models/language/pooling/test_nomic_max_model_len.py b/tests/models/language/pooling/test_nomic_max_model_len.py index 250b3a52835..7413ef578e3 100644 --- a/tests/models/language/pooling/test_nomic_max_model_len.py +++ b/tests/models/language/pooling/test_nomic_max_model_len.py @@ -23,7 +23,7 @@ def test_default(model_info, vllm_runner): with vllm_runner(model_info.name, task="embed", max_model_len=None) as vllm_model: - model_config = vllm_model.model.llm_engine.model_config + model_config = 
vllm_model.llm.llm_engine.model_config if model_info.name == "nomic-ai/nomic-embed-text-v2-moe": # For nomic-embed-text-v2-moe the length is set to 512 # by sentence_bert_config.json. @@ -38,7 +38,7 @@ def test_set_max_model_len_legal(model_info, vllm_runner): # set max_model_len <= 512 with vllm_runner(model_info.name, task="embed", max_model_len=256) as vllm_model: - model_config = vllm_model.model.llm_engine.model_config + model_config = vllm_model.llm.llm_engine.model_config assert model_config.max_model_len == 256 # set 512 < max_model_len <= 2048 @@ -52,7 +52,7 @@ def test_set_max_model_len_legal(model_info, vllm_runner): else: with vllm_runner(model_info.name, task="embed", max_model_len=1024) as vllm_model: - model_config = vllm_model.model.llm_engine.model_config + model_config = vllm_model.llm.llm_engine.model_config assert model_config.max_model_len == 1024 diff --git a/tests/models/language/pooling/test_truncation_control.py b/tests/models/language/pooling/test_truncation_control.py index 33aff1c873f..c7399e01c73 100644 --- a/tests/models/language/pooling/test_truncation_control.py +++ b/tests/models/language/pooling/test_truncation_control.py @@ -28,7 +28,7 @@ def test_smaller_truncation_size(vllm_runner, with vllm_runner(model_name, task="embed", max_model_len=max_model_len) as vllm_model: - vllm_output = vllm_model.model.encode( + vllm_output = vllm_model.llm.encode( input_str, truncate_prompt_tokens=truncate_prompt_tokens) prompt_tokens = vllm_output[0].prompt_token_ids @@ -43,7 +43,7 @@ def test_max_truncation_size(vllm_runner, with vllm_runner(model_name, task="embed", max_model_len=max_model_len) as vllm_model: - vllm_output = vllm_model.model.encode( + vllm_output = vllm_model.llm.encode( input_str, truncate_prompt_tokens=truncate_prompt_tokens) prompt_tokens = vllm_output[0].prompt_token_ids @@ -61,7 +61,7 @@ def test_bigger_truncation_size(vllm_runner, model_name, task="embed", max_model_len=max_model_len) as vllm_model: - llm_output = vllm_model.model.encode( + llm_output = vllm_model.llm.encode( input_str, truncate_prompt_tokens=truncate_prompt_tokens) assert llm_output == f"""truncate_prompt_tokens value diff --git a/tests/models/multimodal/generation/test_pixtral.py b/tests/models/multimodal/generation/test_pixtral.py index 1def825ab08..e157d6f4a79 100644 --- a/tests/models/multimodal/generation/test_pixtral.py +++ b/tests/models/multimodal/generation/test_pixtral.py @@ -180,8 +180,7 @@ def test_chat( ) as vllm_model: outputs = [] for msg in MSGS: - output = vllm_model.model.chat(msg, - sampling_params=SAMPLING_PARAMS) + output = vllm_model.llm.chat(msg, sampling_params=SAMPLING_PARAMS) outputs.extend(output) @@ -217,7 +216,7 @@ def test_multi_modal_placeholders(vllm_runner, prompt, max_model_len=8192, limit_mm_per_prompt=LIMIT_MM_PER_PROMPT, ) as vllm_model: - outputs = vllm_model.model.generate(prompt) + outputs = vllm_model.llm.generate(prompt) assert len(outputs) == 1, f"{len(outputs)=}" output: RequestOutput = outputs[0] diff --git a/tests/models/multimodal/generation/test_whisper.py b/tests/models/multimodal/generation/test_whisper.py index 363d55153aa..4a65e8c9520 100644 --- a/tests/models/multimodal/generation/test_whisper.py +++ b/tests/models/multimodal/generation/test_whisper.py @@ -106,7 +106,7 @@ def run_test( tensor_parallel_size=tensor_parallel_size, distributed_executor_backend=distributed_executor_backend, ) as vllm_model: - llm = vllm_model.model + llm = vllm_model.llm sampling_params = SamplingParams( temperature=0, diff --git 
a/tests/models/multimodal/generation/vlm_utils/core.py b/tests/models/multimodal/generation/vlm_utils/core.py index 8c83d8f8a8a..cf8962ce497 100644 --- a/tests/models/multimodal/generation/vlm_utils/core.py +++ b/tests/models/multimodal/generation/vlm_utils/core.py @@ -85,7 +85,7 @@ def run_test( enforce_eager=enforce_eager, task=task, **vllm_runner_kwargs_) as vllm_model: - tokenizer = vllm_model.model.get_tokenizer() + tokenizer = vllm_model.llm.get_tokenizer() vllm_kwargs: dict[str, Any] = {} if get_stop_token_ids is not None: diff --git a/tests/models/multimodal/pooling/test_dse_qwen2_vl.py b/tests/models/multimodal/pooling/test_dse_qwen2_vl.py index f889eea5e83..a6f5aeccf94 100644 --- a/tests/models/multimodal/pooling/test_dse_qwen2_vl.py +++ b/tests/models/multimodal/pooling/test_dse_qwen2_vl.py @@ -96,7 +96,7 @@ def _run_test( dtype=dtype, enforce_eager=True, max_model_len=8192) as vllm_model: - tokenizer = vllm_model.model.get_tokenizer() + tokenizer = vllm_model.llm.get_tokenizer() texts = [ # this is necessary because vllm_model.embed will not apply any # templating to the prompt, and therefore lacks an image_pad diff --git a/tests/models/multimodal/pooling/test_jinavl_reranker.py b/tests/models/multimodal/pooling/test_jinavl_reranker.py index 50c91f1f81c..712b6801de4 100644 --- a/tests/models/multimodal/pooling/test_jinavl_reranker.py +++ b/tests/models/multimodal/pooling/test_jinavl_reranker.py @@ -56,7 +56,7 @@ def create_image_param(url: str) -> ChatCompletionContentPartImageParam: mm_processor_kwargs=mm_processor_kwargs, limit_mm_per_prompt=limit_mm_per_prompt, ) as vllm_model: - outputs = vllm_model.model.score(query, documents) + outputs = vllm_model.llm.score(query, documents) return [output.outputs.score for output in outputs] diff --git a/tests/models/quantization/test_modelopt.py b/tests/models/quantization/test_modelopt.py index 6ad526cc893..e23d4d9d211 100644 --- a/tests/models/quantization/test_modelopt.py +++ b/tests/models/quantization/test_modelopt.py @@ -45,7 +45,7 @@ reason="fp8 is not supported on this GPU type.") @pytest.mark.parametrize("model_name", MODELS) def test_models(example_prompts, model_name) -> None: - model = LLM( + llm = LLM( model=model_name, max_model_len=MAX_MODEL_LEN, trust_remote_code=True, @@ -68,9 +68,9 @@ def test_models(example_prompts, model_name) -> None: # Note: these need to be run 1 at a time due to numerical precision, # since the expected strs were generated this way. for prompt in formatted_prompts: - outputs = model.generate(prompt, params) + outputs = llm.generate(prompt, params) generations.append(outputs[0].outputs[0].text) - del model + del llm print(model_name, generations) expected_strs = EXPECTED_STRS_MAP[model_name] diff --git a/tests/models/quantization/test_nvfp4.py b/tests/models/quantization/test_nvfp4.py index b95dad9a4ef..b3c217e729e 100644 --- a/tests/models/quantization/test_nvfp4.py +++ b/tests/models/quantization/test_nvfp4.py @@ -46,7 +46,7 @@ reason="modelopt_fp4 is not supported on this GPU type.") @pytest.mark.parametrize("model_name", MODELS) def test_models(example_prompts, model_name) -> None: - model = LLM( + llm = LLM( model=model_name, max_model_len=MAX_MODEL_LEN, trust_remote_code=True, @@ -69,9 +69,9 @@ def test_models(example_prompts, model_name) -> None: # Note: these need to be run 1 at a time due to numerical precision, # since the expected strs were generated this way. 
for prompt in formatted_prompts: - outputs = model.generate(prompt, params) + outputs = llm.generate(prompt, params) generations.append(outputs[0].outputs[0].text) - del model + del llm print(model_name, generations) expected_strs = EXPECTED_STRS_MAP[model_name] diff --git a/tests/prefix_caching/test_disable_sliding_window.py b/tests/prefix_caching/test_disable_sliding_window.py index f00a8f6998c..b940ab416e6 100644 --- a/tests/prefix_caching/test_disable_sliding_window.py +++ b/tests/prefix_caching/test_disable_sliding_window.py @@ -25,25 +25,25 @@ @pytest.mark.parametrize("model_len_len", MODEL_LEN_LEN) def test_disable_sliding_window(model_len_len, ): model, sliding_len, full_len = model_len_len - vllm_disabled_model = LLM(model, disable_sliding_window=True) - vllm_disabled_model.generate("Hi my name is") - model_config = vllm_disabled_model.llm_engine.model_config + disabled_llm = LLM(model, disable_sliding_window=True) + disabled_llm.generate("Hi my name is") + model_config = disabled_llm.llm_engine.model_config assert model_config.max_model_len == sliding_len, ( "Max len expected to equal sliding_len of %s, but got %s", sliding_len, model_config.max_model_len) - del vllm_disabled_model + del disabled_llm cleanup_dist_env_and_memory() - vllm_enabled_model = LLM(model, - enforce_eager=True, - disable_sliding_window=False, - enable_prefix_caching=False) - vllm_enabled_model.generate("Hi my name is") - model_config = vllm_enabled_model.llm_engine.model_config + enabled_llm = LLM(model, + enforce_eager=True, + disable_sliding_window=False, + enable_prefix_caching=False) + enabled_llm.generate("Hi my name is") + model_config = enabled_llm.llm_engine.model_config assert model_config.max_model_len == full_len, ( "Max len expected to equal full_len of %s, but got %s", full_len, model_config.max_model_len) - del vllm_enabled_model + del enabled_llm cleanup_dist_env_and_memory() diff --git a/tests/prefix_caching/test_prefix_caching.py b/tests/prefix_caching/test_prefix_caching.py index a65fc934b16..5bf6ed957c7 100644 --- a/tests/prefix_caching/test_prefix_caching.py +++ b/tests/prefix_caching/test_prefix_caching.py @@ -93,8 +93,8 @@ def test_mixed_requests( # Run all the promopts greedy_params = SamplingParams(temperature=0.0, max_tokens=max_tokens) - req_outputs = vllm_model.model.generate(example_prompts, - greedy_params) + req_outputs = vllm_model.llm.generate(example_prompts, + greedy_params) # Verify number of cached tokens for i in range(len(req_outputs)): @@ -161,7 +161,7 @@ def test_fully_cached_prefill_needs_uncached_token(model): max_num_batched_tokens=max_num_batched_tokens, max_num_seqs=max_num_batched_tokens, ) - engine: LLMEngine = runner.model.llm_engine + engine: LLMEngine = runner.llm.llm_engine scheduler: Scheduler = SchedulerProxy(engine.scheduler[0]) # type: ignore engine.scheduler[0] = scheduler diff --git a/tests/quantization/test_gptq_dynamic.py b/tests/quantization/test_gptq_dynamic.py index 23b999e7c67..aea50e99c1d 100644 --- a/tests/quantization/test_gptq_dynamic.py +++ b/tests/quantization/test_gptq_dynamic.py @@ -39,7 +39,7 @@ def test_gptq_with_dynamic(vllm_runner, model_id: str, use_marlin_kernel: bool, linear_method_cls = GPTQMarlinLinearMethod if use_marlin_kernel else ( GPTQLinearMethod) - for name, submodule in (vllm_model.model.llm_engine.model_executor. + for name, submodule in (vllm_model.llm.llm_engine.model_executor. 
driver_worker.model_runner.model.named_modules()): if name == "lm_head": assert isinstance(submodule.quant_method, linear_method_cls) diff --git a/tests/quantization/test_quark.py b/tests/quantization/test_quark.py index 2db11cb997d..4a0c8ba4d8a 100644 --- a/tests/quantization/test_quark.py +++ b/tests/quantization/test_quark.py @@ -107,11 +107,11 @@ def test_quark_fp8_parity(vllm_runner): } with (vllm_runner(quark_model_id, **llm_kwargs) as quark_handle, vllm_runner(fp8_model_id, **llm_kwargs) as fp8_handle): - quark_model = (quark_handle.model.llm_engine.model_executor. + quark_model = (quark_handle.llm.llm_engine.model_executor. driver_worker.model_runner.model) quark_state_dict = quark_model.state_dict() - fp8_model = (fp8_handle.model.llm_engine.model_executor.driver_worker. + fp8_model = (fp8_handle.llm.llm_engine.model_executor.driver_worker. model_runner.model) fp8_state_dict = fp8_model.state_dict() diff --git a/tests/quantization/test_register_quantization_config.py b/tests/quantization/test_register_quantization_config.py index 6c541fdbeea..84705e92c85 100644 --- a/tests/quantization/test_register_quantization_config.py +++ b/tests/quantization/test_register_quantization_config.py @@ -111,7 +111,7 @@ def test_custom_quant(vllm_runner, model, monkeypatch): quantization="custom_quant", enforce_eager=True) as llm: - model = llm.model.llm_engine.model_executor.driver_worker.model_runner.model # noqa: E501 + model = llm.llm.llm_engine.model_executor.driver_worker.model_runner.model # noqa: E501 layer = model.model.layers[0] qkv_proj = layer.self_attn.qkv_proj diff --git a/tests/samplers/test_ignore_eos.py b/tests/samplers/test_ignore_eos.py index 7eb9c0b5fb8..ea4a17dd230 100644 --- a/tests/samplers/test_ignore_eos.py +++ b/tests/samplers/test_ignore_eos.py @@ -36,7 +36,7 @@ def test_ignore_eos( ignore_eos=True) for prompt in example_prompts: - ignore_eos_output = vllm_model.model.generate( + ignore_eos_output = vllm_model.llm.generate( prompt, sampling_params=sampling_params) output_length = len(ignore_eos_output[0].outputs[0].token_ids) assert output_length == max_tokens diff --git a/tests/samplers/test_logits_processor.py b/tests/samplers/test_logits_processor.py index 901c8759126..123f9595e97 100644 --- a/tests/samplers/test_logits_processor.py +++ b/tests/samplers/test_logits_processor.py @@ -26,7 +26,7 @@ def test_logits_processor_force_generate( dtype: str, ) -> None: with vllm_runner(model, dtype=dtype) as vllm_model: - tokenizer = vllm_model.model.get_tokenizer() + tokenizer = vllm_model.llm.get_tokenizer() repeat_times = 2 enforced_answers = " vLLM" vllm_token_ids = tokenizer.encode(enforced_answers, @@ -45,13 +45,13 @@ def pick_vllm(token_ids, logits): ) # test logits_processors when prompt_logprobs is not None - vllm_model.model._add_request( + vllm_model.llm._add_request( example_prompts[0], params=params_with_logprobs, ) # test prompt_logprobs is not None - vllm_model.model._add_request( + vllm_model.llm._add_request( example_prompts[1], params=SamplingParams( prompt_logprobs=3, @@ -60,11 +60,11 @@ def pick_vllm(token_ids, logits): ) # test grouped requests - vllm_model.model._add_request( + vllm_model.llm._add_request( example_prompts[2], params=SamplingParams(max_tokens=max_tokens), ) - outputs = vllm_model.model._run_engine(use_tqdm=False) + outputs = vllm_model.llm._run_engine(use_tqdm=False) assert outputs[0].outputs[0].text == enforced_answers * repeat_times diff --git a/tests/samplers/test_logprobs.py b/tests/samplers/test_logprobs.py index 
86c8a03eee1..87f40b10053 100644 --- a/tests/samplers/test_logprobs.py +++ b/tests/samplers/test_logprobs.py @@ -64,7 +64,7 @@ def test_get_prompt_logprobs( prompt_logprobs=num_top_logprobs, temperature=0.0, detokenize=detokenize) - vllm_results = vllm_model.model.generate( + vllm_results = vllm_model.llm.generate( example_prompts, sampling_params=vllm_sampling_params) # Test whether logprobs are included in the results. @@ -174,7 +174,7 @@ def test_none_logprobs(vllm_runner, model, chunked_prefill_token_size: int, logprobs=None, temperature=0.0, detokenize=detokenize) - results_logprobs_none = vllm_model.model.generate( + results_logprobs_none = vllm_model.llm.generate( example_prompts, sampling_params=sampling_params_logprobs_none) for i in range(len(results_logprobs_none)): diff --git a/tests/samplers/test_no_bad_words.py b/tests/samplers/test_no_bad_words.py index 42b529ae169..11803b8d7a5 100644 --- a/tests/samplers/test_no_bad_words.py +++ b/tests/samplers/test_no_bad_words.py @@ -20,7 +20,7 @@ def v1(run_with_both_engines): def _generate( - model: LLM, + llm: LLM, prompt: str, num_prompt_tokens: int, temperature: float = 0, @@ -32,7 +32,7 @@ def _generate( ) # [([output_token_ids, ], [output_text, ]), ] - output = model.generate([prompt], sampling_params=sampling_params) + output = llm.generate([prompt], sampling_params=sampling_params) output_token_ids = output[0][0][0][num_prompt_tokens:] # [0] first (and only) request output @@ -66,10 +66,10 @@ def test_one_token_bad_word(self, vllm_runner): assert self.target_token_id not in output_token_ids def _generate(self, - model: LLM, + llm: LLM, bad_words: Optional[list[str]] = None) -> list[int]: return _generate( - model=model, + llm=llm, prompt=self.PROMPT, num_prompt_tokens=self.num_prompt_tokens, bad_words=bad_words, @@ -156,10 +156,10 @@ def test_two_token_bad_word(self, vllm_runner): or (self.neighbour_token_id2 in output_token_ids)) def _generate(self, - model: LLM, + llm: LLM, bad_words: Optional[list[str]] = None) -> list[int]: return _generate( - model=model, + llm=llm, prompt=self.PROMPT, num_prompt_tokens=self.num_prompt_tokens, bad_words=bad_words, diff --git a/tests/samplers/test_seeded_generate.py b/tests/samplers/test_seeded_generate.py index b339b4b2ddf..5a0efd98acc 100644 --- a/tests/samplers/test_seeded_generate.py +++ b/tests/samplers/test_seeded_generate.py @@ -49,7 +49,7 @@ def test_random_sample_with_seed( sampling_params_seed_2 = copy.deepcopy(sampling_params) sampling_params_seed_2.seed = 200 - llm = vllm_model.model + llm = vllm_model.llm for prompt in example_prompts: for params in ( diff --git a/tests/tokenization/test_detokenize.py b/tests/tokenization/test_detokenize.py index f8aeba8301b..ccafc884612 100644 --- a/tests/tokenization/test_detokenize.py +++ b/tests/tokenization/test_detokenize.py @@ -393,7 +393,7 @@ def test_decode_prompt_logprobs_chunked_prefill( logprobs=5, prompt_logprobs=5, temperature=0.0) - vllm_results = vllm_model.model.generate( + vllm_results = vllm_model.llm.generate( example_prompts, sampling_params=vllm_sampling_params) for idx, result in enumerate(vllm_results): diff --git a/tests/v1/core/test_scheduler_e2e.py b/tests/v1/core/test_scheduler_e2e.py index 85415f6ad4b..bd0320baef8 100644 --- a/tests/v1/core/test_scheduler_e2e.py +++ b/tests/v1/core/test_scheduler_e2e.py @@ -14,7 +14,7 @@ @pytest.fixture(scope="module") -def model() -> LLM: +def llm() -> LLM: return LLM(MODEL, enforce_eager=True, enable_prefix_caching=True, @@ -24,16 +24,16 @@ def model() -> LLM: block_size=16) -def 
test_concurrent_partial_prefill(model): - outputs = model.generate([PROMPT] * 3) +def test_concurrent_partial_prefill(llm): + outputs = llm.generate([PROMPT] * 3) assert len(outputs) == 3 for output in outputs: assert len(output.outputs) == 1 -def test_prefix_cache_stats_is_recorded(model): +def test_prefix_cache_stats_is_recorded(llm): # 17 tokens will make sure first 16 tokens are cached in a block input_tokens = {"prompt_token_ids": [101] * 17} - _ = model.generate([input_tokens]) - outputs = model.generate([input_tokens]) + _ = llm.generate([input_tokens]) + outputs = llm.generate([input_tokens]) assert outputs[0].num_cached_tokens == 16 diff --git a/tests/v1/engine/test_llm_engine.py b/tests/v1/engine/test_llm_engine.py index 059106c62a2..f37686317fd 100644 --- a/tests/v1/engine/test_llm_engine.py +++ b/tests/v1/engine/test_llm_engine.py @@ -112,9 +112,9 @@ def test_compatibility_with_skip_tokenizer_init( example_prompts, structured_outputs=True, ) - model: LLM = vllm_model_skip_tokenizer_init.model + llm: LLM = vllm_model_skip_tokenizer_init.llm with pytest.raises(ValueError): - _ = model.generate(example_prompts, sampling_params_list) + _ = llm.generate(example_prompts, sampling_params_list) def test_parallel_sampling(vllm_model, example_prompts) -> None: @@ -125,8 +125,8 @@ def test_parallel_sampling(vllm_model, example_prompts) -> None: example_prompt: test fixture providing prompts for testing. """ sampling_params_list, n_list = _get_test_sampling_params(example_prompts) - model: LLM = vllm_model.model - outputs = model.generate(example_prompts, sampling_params_list) + llm: LLM = vllm_model.llm + outputs = llm.generate(example_prompts, sampling_params_list) # Validate each request response for out, n in zip(outputs, n_list): @@ -166,10 +166,10 @@ def test_engine_metrics(vllm_runner, monkeypatch, example_prompts): speculative_config=speculative_config, disable_log_stats=False, ) as vllm_model: - model: LLM = vllm_model.model + llm: LLM = vllm_model.llm sampling_params = SamplingParams(temperature=0.0, max_tokens=max_tokens) - outputs = model.generate(example_prompts, sampling_params) + outputs = llm.generate(example_prompts, sampling_params) n_prompts = len(example_prompts) assert len(outputs) == n_prompts @@ -180,7 +180,7 @@ def test_engine_metrics(vllm_runner, monkeypatch, example_prompts): total_tokens += len(out.outputs[0].token_ids) assert total_tokens == max_tokens * n_prompts - metrics = model.get_metrics() + metrics = llm.get_metrics() def find_metric(name) -> list[Metric]: found = [] diff --git a/tests/v1/sample/test_logprobs.py b/tests/v1/sample/test_logprobs.py index 69180e6e5db..4f1f340a4cc 100644 --- a/tests/v1/sample/test_logprobs.py +++ b/tests/v1/sample/test_logprobs.py @@ -112,7 +112,7 @@ def _run_and_validate( max_tokens: int, do_apc: bool, ) -> None: - vllm_results = vllm_model.model.generate( + vllm_results = vllm_model.llm.generate( test_prompts, sampling_params=vllm_sampling_params) for vllm_result, hf_logprob, hf_output, logprob_prompt_logprob in zip( @@ -288,7 +288,7 @@ def test_get_logprobs_and_prompt_logprobs( """ with monkeypatch.context() as m: m.setenv("VLLM_USE_V1", "1") - do_apc = vllm_model.model.llm_engine.cache_config.enable_prefix_caching + do_apc = vllm_model.llm.llm_engine.cache_config.enable_prefix_caching if do_apc and (temperature < 2.0 or batch_logprobs_composition != SAMPLE_PROMPT): # Skip some test-cases to save time. 
@@ -378,7 +378,7 @@ def test_none_logprobs(vllm_model, example_prompts, prompt_logprobs=None, temperature=0.0, ) - results_logprobs_none = vllm_model.model.generate( + results_logprobs_none = vllm_model.llm.generate( example_prompts, sampling_params=sampling_params_logprobs_none, ) @@ -408,7 +408,7 @@ def test_zero_logprobs(vllm_model, example_prompts, logprobs=0, prompt_logprobs=0, temperature=0.0) - results_logprobs_zero = vllm_model.model.generate( + results_logprobs_zero = vllm_model.llm.generate( example_prompts, sampling_params=sampling_params_logprobs_zero) for i in range(len(results_logprobs_zero)): diff --git a/tests/v1/sample/test_sampling_params_e2e.py b/tests/v1/sample/test_sampling_params_e2e.py index ac0f3eb5883..f53e1e1c485 100644 --- a/tests/v1/sample/test_sampling_params_e2e.py +++ b/tests/v1/sample/test_sampling_params_e2e.py @@ -14,30 +14,30 @@ @pytest.fixture(scope="module") -def model() -> LLM: +def llm() -> LLM: # Disable prefix caching so that we can test prompt logprobs. # TODO remove this after https://github.com/vllm-project/vllm/pull/13949 # is merged return LLM(MODEL, enforce_eager=True, enable_prefix_caching=False) -def test_n_gt_1(model): +def test_n_gt_1(llm): """ParallelSampling is supported.""" params = SamplingParams(n=3) - outputs = model.generate(PROMPT, params) + outputs = llm.generate(PROMPT, params) assert len(outputs[0].outputs) == 3 -def test_best_of(model): +def test_best_of(llm): """Raise a ValueError since best_of is deprecated.""" params = SamplingParams(n=2, best_of=3) with pytest.raises(ValueError): - _ = model.generate(PROMPT, params) + _ = llm.generate(PROMPT, params) -def test_penalties(model): +def test_penalties(llm): """Check that we do not get errors if applied.""" params = SamplingParams( @@ -49,18 +49,18 @@ def test_penalties(model): top_p=0.5, top_k=3, ) - _ = model.generate(PROMPT, params) + _ = llm.generate(PROMPT, params) -def test_stop(model): +def test_stop(llm): """Check that we respect the stop words.""" - output = model.generate(PROMPT, SamplingParams(temperature=0)) + output = llm.generate(PROMPT, SamplingParams(temperature=0)) split_text = output[0].outputs[0].text.split() STOP_IDX = 5 params = SamplingParams(temperature=0, stop=split_text[STOP_IDX]) - output = model.generate(PROMPT, params) + output = llm.generate(PROMPT, params) new_split_text = output[0].outputs[0].text.split() # Output should not contain the stop word. @@ -69,40 +69,40 @@ def test_stop(model): params = SamplingParams(temperature=0, stop=split_text[STOP_IDX], include_stop_str_in_output=True) - output = model.generate(PROMPT, params) + output = llm.generate(PROMPT, params) new_split_text = output[0].outputs[0].text.split() # Output should contain the stop word. 
assert len(new_split_text) == STOP_IDX + 1 -def test_stop_token_ids(model): +def test_stop_token_ids(llm): """Check that we respect the stop token ids.""" - output = model.generate(PROMPT, SamplingParams(temperature=0)) + output = llm.generate(PROMPT, SamplingParams(temperature=0)) stop_token_id_0 = output[0].outputs[0].token_ids[5] stop_token_id_1 = output[0].outputs[0].token_ids[6] stop_token_ids = [stop_token_id_1, stop_token_id_0] params = SamplingParams(temperature=0, stop_token_ids=stop_token_ids) - output = model.generate(PROMPT, params) + output = llm.generate(PROMPT, params) assert output[0].outputs[0].token_ids[-1] == stop_token_id_0 stop_token_ids = [stop_token_id_0, stop_token_id_1] params = SamplingParams(temperature=0, stop_token_ids=stop_token_ids) - output = model.generate(PROMPT, params) + output = llm.generate(PROMPT, params) assert output[0].outputs[0].token_ids[-1] == stop_token_id_0 -def test_detokenize_false(model): +def test_detokenize_false(llm): """Check that detokenize=False option works.""" - output = model.generate(PROMPT, SamplingParams(detokenize=False)) + output = llm.generate(PROMPT, SamplingParams(detokenize=False)) assert len(output[0].outputs[0].token_ids) > 0 assert len(output[0].outputs[0].text) == 0 - output = model.generate( + output = llm.generate( PROMPT, SamplingParams(detokenize=False, logprobs=3, prompt_logprobs=3)) assert len(output[0].outputs[0].token_ids) > 0 @@ -118,28 +118,28 @@ def test_detokenize_false(model): assert all(lp.decoded_token is None for lp in logprobs.values()) -def test_bad_words(model): +def test_bad_words(llm): """Check that we respect bad words.""" - output = model.generate(PROMPT, SamplingParams(temperature=0)) + output = llm.generate(PROMPT, SamplingParams(temperature=0)) split_text = output[0].outputs[0].text.split() bad_words_1 = " ".join(split_text[:2]) params = SamplingParams(temperature=0, bad_words=[bad_words_1]) - output = model.generate(PROMPT, params) + output = llm.generate(PROMPT, params) new_text = output[0].outputs[0].text assert bad_words_1 not in new_text bad_words_2 = new_text.split()[-1] params = SamplingParams(temperature=0, bad_words=[bad_words_1, bad_words_2]) - output = model.generate(PROMPT, params) + output = llm.generate(PROMPT, params) new_text = output[0].outputs[0].text assert bad_words_1 not in new_text assert bad_words_2 not in new_text -def test_logits_processor(model): +def test_logits_processor(llm): """Check that we reject logits processor.""" # This sample logits processor gives infinite score to the i-th token, @@ -150,47 +150,45 @@ def pick_ith(token_ids, logits): return logits with pytest.raises(ValueError): - _ = model.generate(PROMPT, - SamplingParams(logits_processors=[pick_ith])) + _ = llm.generate(PROMPT, SamplingParams(logits_processors=[pick_ith])) -def test_allowed_token_ids(model): +def test_allowed_token_ids(llm): """Check that we can use allowed_token_ids.""" TOKEN_ID = 10 allowed_token_ids = [TOKEN_ID] - output = model.generate( - PROMPT, SamplingParams(allowed_token_ids=allowed_token_ids)) + output = llm.generate(PROMPT, + SamplingParams(allowed_token_ids=allowed_token_ids)) assert output[0].outputs[0].token_ids[-1] == TOKEN_ID # Reject empty allowed_token_ids. with pytest.raises(ValueError): - _ = model.generate(PROMPT, SamplingParams(allowed_token_ids=[])) + _ = llm.generate(PROMPT, SamplingParams(allowed_token_ids=[])) # Reject negative token id. 
with pytest.raises(ValueError): - _ = model.generate(PROMPT, SamplingParams(allowed_token_ids=[-1])) + _ = llm.generate(PROMPT, SamplingParams(allowed_token_ids=[-1])) # Reject out of vocabulary. with pytest.raises(ValueError): - _ = model.generate(PROMPT, - SamplingParams(allowed_token_ids=[10000000])) + _ = llm.generate(PROMPT, SamplingParams(allowed_token_ids=[10000000])) -def test_priority(model): +def test_priority(llm): """Check that we reject requests with priority.""" # Reject all allowed token ids with pytest.raises(ValueError): - _ = model.generate(PROMPT, priority=[1]) + _ = llm.generate(PROMPT, priority=[1]) -def test_seed(model): +def test_seed(llm): """Check that seed impacts randomness.""" - out_1 = model.generate(PROMPT, SamplingParams(seed=42)) - out_2 = model.generate(PROMPT, SamplingParams(seed=42)) - out_3 = model.generate(PROMPT, SamplingParams(seed=43)) + out_1 = llm.generate(PROMPT, SamplingParams(seed=42)) + out_2 = llm.generate(PROMPT, SamplingParams(seed=42)) + out_3 = llm.generate(PROMPT, SamplingParams(seed=43)) assert out_1[0].outputs[0].text == out_2[0].outputs[0].text assert out_1[0].outputs[0].text != out_3[0].outputs[0].text diff --git a/tests/v1/test_oracle.py b/tests/v1/test_oracle.py index 39515d710e8..b4d4348c7fd 100644 --- a/tests/v1/test_oracle.py +++ b/tests/v1/test_oracle.py @@ -106,9 +106,9 @@ def test_v1_llm_by_default(monkeypatch): m.delenv("VLLM_USE_V1") # Should default to V1 for supported config. - model = LLM(MODEL, enforce_eager=True, enable_lora=True) - print(model.generate("Hello my name is")) - assert hasattr(model.llm_engine, "engine_core") + llm = LLM(MODEL, enforce_eager=True, enable_lora=True) + print(llm.generate("Hello my name is")) + assert hasattr(llm.llm_engine, "engine_core") m.delenv("VLLM_USE_V1") From 43726a41e0520d0e96a4d4a08dc23000a5d6a2da Mon Sep 17 00:00:00 2001 From: Zhiyu Date: Mon, 21 Jul 2025 07:02:58 -0700 Subject: [PATCH 226/552] Add Nvidia ModelOpt config adaptation (#19815) Signed-off-by: Zhiyu Cheng Signed-off-by: x22x22 --- tests/quantization/test_modelopt.py | 91 ++++++++ vllm/config.py | 20 +- .../layers/quantization/modelopt.py | 208 +++++++++++++++--- 3 files changed, 287 insertions(+), 32 deletions(-) create mode 100644 tests/quantization/test_modelopt.py diff --git a/tests/quantization/test_modelopt.py b/tests/quantization/test_modelopt.py new file mode 100644 index 00000000000..fcbfa681d75 --- /dev/null +++ b/tests/quantization/test_modelopt.py @@ -0,0 +1,91 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Test ModelOpt quantization method setup and weight loading. + +Run `pytest tests/quantization/test_modelopt.py`. +""" + +import os + +import pytest +import torch + +from tests.quantization.utils import is_quant_method_supported +from vllm.platforms import current_platform + + +@pytest.fixture(scope="function", autouse=True) +def use_v0_only(monkeypatch): + """ + This module relies on V0 internals, so set VLLM_USE_V1=0. 
+ """ + if not current_platform.is_cpu(): + monkeypatch.setenv('VLLM_USE_V1', '0') + + +@pytest.mark.skipif(not is_quant_method_supported("modelopt"), + reason="ModelOpt FP8 is not supported on this GPU type.") +def test_modelopt_fp8_checkpoint_setup(vllm_runner): + """Test ModelOpt FP8 checkpoint loading and structure validation.""" + # TODO: provide a small publically available test checkpoint + model_path = ("/home/scratch.omniml_data_1/zhiyu/ckpts/test_ckpts/" + "TinyLlama-1.1B-Chat-v1.0-fp8-0710") + + # Skip test if checkpoint doesn't exist + if not os.path.exists(model_path): + pytest.skip(f"Test checkpoint not found at {model_path}. " + "This test requires a local ModelOpt FP8 checkpoint.") + + with vllm_runner(model_path, quantization="modelopt", + enforce_eager=True) as llm: + + def check_model(model): + layer = model.model.layers[0] + + qkv_proj = layer.self_attn.qkv_proj + o_proj = layer.self_attn.o_proj + gate_up_proj = layer.mlp.gate_up_proj + down_proj = layer.mlp.down_proj + + # Check that ModelOpt quantization method is properly applied + from vllm.model_executor.layers.quantization.modelopt import ( + ModelOptFp8LinearMethod) + assert isinstance(qkv_proj.quant_method, ModelOptFp8LinearMethod) + assert isinstance(o_proj.quant_method, ModelOptFp8LinearMethod) + assert isinstance(gate_up_proj.quant_method, + ModelOptFp8LinearMethod) + assert isinstance(down_proj.quant_method, ModelOptFp8LinearMethod) + + # Check weight dtype is FP8 + assert qkv_proj.weight.dtype == torch.float8_e4m3fn + assert o_proj.weight.dtype == torch.float8_e4m3fn + assert gate_up_proj.weight.dtype == torch.float8_e4m3fn + assert down_proj.weight.dtype == torch.float8_e4m3fn + + # Check scales are present and have correct dtype + assert hasattr(qkv_proj, 'weight_scale') + assert hasattr(qkv_proj, 'input_scale') + assert qkv_proj.weight_scale.dtype == torch.float32 + assert qkv_proj.input_scale.dtype == torch.float32 + + assert hasattr(o_proj, 'weight_scale') + assert hasattr(o_proj, 'input_scale') + assert o_proj.weight_scale.dtype == torch.float32 + assert o_proj.input_scale.dtype == torch.float32 + + assert hasattr(gate_up_proj, 'weight_scale') + assert hasattr(gate_up_proj, 'input_scale') + assert gate_up_proj.weight_scale.dtype == torch.float32 + assert gate_up_proj.input_scale.dtype == torch.float32 + + assert hasattr(down_proj, 'weight_scale') + assert hasattr(down_proj, 'input_scale') + assert down_proj.weight_scale.dtype == torch.float32 + assert down_proj.input_scale.dtype == torch.float32 + + llm.apply_model(check_model) + + # Run a simple generation test to ensure the model works + output = llm.generate_greedy(["Hello my name is"], max_tokens=20) + assert output + print(f"ModelOpt FP8 output: {output}") diff --git a/vllm/config.py b/vllm/config.py index a6134c85b2e..1089e7ccd50 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -346,11 +346,11 @@ class ModelConfig: """Maximum number of data items per modality per prompt. Only applicable for multimodal models.""" interleave_mm_strings: bool = False - """Enable fully interleaved support for multimodal prompts, while using + """Enable fully interleaved support for multimodal prompts, while using --chat-template-content-format=string. Defaults to False.""" media_io_kwargs: dict[str, dict[str, Any]] = field(default_factory=dict) - """Additional args passed to process media inputs, keyed by modalities. - For example, to set num_frames for video, set + """Additional args passed to process media inputs, keyed by modalities. 
+ For example, to set num_frames for video, set `--media-io-kwargs '{"video": {"num_frames": 40} }'` """ use_async_output_proc: bool = True """Whether to use async output processor.""" @@ -1000,9 +1000,13 @@ def _verify_quantization(self) -> None: quant_cfg = self._parse_quant_hf_config() if quant_cfg is not None: + # Use the community standard 'quant_method' quant_method = quant_cfg.get("quant_method", "").lower() + + # Normalize library names quant_method = quant_method.replace("compressed_tensors", "compressed-tensors") + quant_cfg["quant_method"] = quant_method # Quantization methods which are overrides (i.e. they have a @@ -1017,6 +1021,8 @@ def _verify_quantization(self) -> None: "awq_marlin", "ipex", "moe_wna16", + "modelopt", + "modelopt_fp4", ] quantization_methods = [ q for q in supported_quantization if q not in overrides @@ -3185,8 +3191,8 @@ class MultiModalConfig: """ media_io_kwargs: dict[str, dict[str, Any]] = field(default_factory=dict) - """Additional args passed to process media inputs, keyed by modalities. - For example, to set num_frames for video, set + """Additional args passed to process media inputs, keyed by modalities. + For example, to set num_frames for video, set `--media-io-kwargs '{"video": {"num_frames": 40} }'` """ mm_processor_kwargs: Optional[dict[str, object]] = None @@ -4115,7 +4121,7 @@ class CompilationConfig: - True: inductor compilation is used (custom_ops disabled by default). One graph for symbolic shape and one graph per size in compile_sizes are compiled using configurations in inductor_compile_config. - + This setting is ignored if level` can be used to directly specify the compilation level `n`: `-O3` is equivalent to `-O.level=3` (same as `-O='{"level":3}'`). - Currently, -O and -O= are supported as well but this will likely be + Currently, -O and -O= are supported as well but this will likely be removed in favor of clearer -O syntax in the future. NOTE: level 0 is the default level without any optimization. 
level 1 and 2 diff --git a/vllm/model_executor/layers/quantization/modelopt.py b/vllm/model_executor/layers/quantization/modelopt.py index 20def70d197..460334d77f0 100644 --- a/vllm/model_executor/layers/quantization/modelopt.py +++ b/vllm/model_executor/layers/quantization/modelopt.py @@ -75,20 +75,64 @@ def get_min_capability(cls) -> int: def get_config_filenames(cls) -> list[str]: return ["hf_quant_config.json"] + @classmethod + def override_quantization_method( + cls, hf_quant_cfg, user_quant) -> Optional[QuantizationMethods]: + """Detect if this ModelOpt config should be used based on + quantization config.""" + + if hf_quant_cfg is None: + return None + + # Use the community standard 'quant_method' + quant_method = hf_quant_cfg.get("quant_method", "").lower() + + # Only proceed if the method is explicitly "modelopt" + if quant_method != "modelopt": + return None + + # Look for ModelOpt-specific config structure + if "quantization" in hf_quant_cfg: + quant_config = hf_quant_cfg["quantization"] + if isinstance(quant_config, dict): + quant_algo = quant_config.get("quant_algo", "") + if "FP8" in quant_algo: + return "modelopt" + else: + # Check for compressed-tensors style config with specific quant_algo + quant_algo = hf_quant_cfg.get("quant_algo", "") + if isinstance(quant_algo, str) and "FP8" in quant_algo: + return "modelopt" + + return None + @classmethod def from_config(cls, config: dict[str, Any]) -> "ModelOptFp8Config": - quant_config = cls.get_from_keys(config, ["quantization"]) - quant_method = quant_config["quant_algo"] - kv_cache_quant_method = cls.get_from_keys( - config, ["quantization"]).get("kv_cache_quant_algo") - exclude_modules = cls.get_from_keys( - config, ["quantization"]).get("exclude_modules") + # Handle both ModelOpt format and compressed-tensors style format + if "quantization" in config: + # ModelOpt format: {"quantization": {"quant_algo": "..."}} + quant_config = cls.get_from_keys(config, ["quantization"]) + if not isinstance(quant_config, dict): + raise ValueError( + "Expected 'quantization' to be a dictionary in config") + quant_method = quant_config.get("quant_algo", "") + if not quant_method: + raise ValueError("Missing 'quant_algo' in quantization config") + kv_cache_quant_method = quant_config.get("kv_cache_quant_algo") + exclude_modules = quant_config.get("exclude_modules") + else: + # Compressed-tensors style format: + # {"quant_algo": "...", "quant_method": "modelopt"} + quant_method = config.get("quant_algo", "") + kv_cache_quant_method = config.get("kv_cache_quant_algo") + exclude_modules = config.get("exclude_modules") if quant_method not in QUANT_ALGOS: - raise ValueError(f"ModelOpt currently only supports: {QUANT_ALGOS}" - " quantizations in vLLM. Please check the " - "`hf_quant_config.json` file for your model's " - "quant configuration.") + raise ValueError( + f"ModelOpt currently only supports: {QUANT_ALGOS} " + "quantizations in vLLM. 
Please check the " + "`hf_quant_config.json` file for your model's " + "quant configuration.") is_checkpoint_fp8_serialized = ("FP8" in quant_method) return cls(is_checkpoint_fp8_serialized, kv_cache_quant_method, @@ -434,7 +478,7 @@ class ModelOptNvFp4Config(QuantizationConfig): def __init__( self, is_checkpoint_nvfp4_serialized: bool, - kv_cache_quant_algo: str, + kv_cache_quant_algo: Optional[str], exclude_modules: list[str], group_size: int = 16, ) -> None: @@ -465,24 +509,138 @@ def get_min_capability(cls) -> int: def get_config_filenames(cls) -> list[str]: return ["hf_quant_config.json"] + @classmethod + def override_quantization_method( + cls, hf_quant_cfg, user_quant) -> Optional[QuantizationMethods]: + """Detect if this ModelOpt FP4 config should be used based on + quantization config.""" + if hf_quant_cfg is None: + return None + + # Use the community standard 'quant_method' + quant_method = hf_quant_cfg.get("quant_method", "").lower() + + # Only proceed if the method is explicitly "modelopt" + if quant_method != "modelopt": + return None + + # Look for ModelOpt-specific config structure + if "quantization" in hf_quant_cfg: + quant_config = hf_quant_cfg["quantization"] + if isinstance(quant_config, dict): + quant_algo = quant_config.get("quant_algo", "") + if "NVFP4" in quant_algo: + return "modelopt_fp4" + else: + # Check for compressed-tensors style config with specific + # quant_algo field + quant_algo = hf_quant_cfg.get("quant_algo", "") + if isinstance(quant_algo, str) and "FP4" in quant_algo.upper(): + return "modelopt_fp4" + + return None + @classmethod def from_config(cls, config: dict[str, Any]) -> "ModelOptNvFp4Config": - quant_config = cls.get_from_keys(config, ["quantization"]) - quant_method = quant_config["quant_algo"] + # Handle both traditional ModelOpt format and compressed-tensors + # style format + if "quantization" in config: + # Traditional ModelOpt format: + # {"quantization": {"quant_algo": "..."}} + quant_config = cls.get_from_keys(config, ["quantization"]) + if not isinstance(quant_config, dict): + raise ValueError( + "Expected 'quantization' to be a dictionary in config") + + quant_method = quant_config.get("quant_algo", "") + if not quant_method: + raise ValueError("Missing 'quant_algo' in quantization config") + + # Handle kv_cache_quant_algo with proper type validation + kv_cache_quant_algo_raw = quant_config.get("kv_cache_quant_algo") + if kv_cache_quant_algo_raw is None: + # No KV cache quantization by default + kv_cache_quant_algo = None + elif isinstance(kv_cache_quant_algo_raw, str): + kv_cache_quant_algo = kv_cache_quant_algo_raw + else: + raise ValueError(f"kv_cache_quant_algo must be a string, got " + f"{type(kv_cache_quant_algo_raw)}") + + # Handle group_size with proper type validation + group_size_raw = quant_config.get("group_size") + if group_size_raw is None: + group_size = 16 # Default value + elif isinstance(group_size_raw, int): + group_size = group_size_raw + else: + try: + group_size = int(group_size_raw) + except (ValueError, TypeError): + raise ValueError(f"group_size must be an integer, got " + f"{type(group_size_raw)}") from None + + exclude_modules = quant_config.get("exclude_modules", []) + if not isinstance(exclude_modules, list): + raise ValueError(f"exclude_modules must be a list, got " + f"{type(exclude_modules)}") + else: + # Compressed-tensors style format: + # {"quant_algo": "...", "quant_method": "modelopt"} + quant_method = config.get("quant_algo", "") + + # Handle kv_cache_quant_algo with proper type validation + 
kv_cache_quant_algo_raw = config.get("kv_cache_quant_algo") + if kv_cache_quant_algo_raw is None: + # No KV cache quantization by default + kv_cache_quant_algo = None + elif isinstance(kv_cache_quant_algo_raw, str): + kv_cache_quant_algo = kv_cache_quant_algo_raw + else: + raise ValueError(f"kv_cache_quant_algo must be a string, got " + f"{type(kv_cache_quant_algo_raw)}") + + # Handle group_size with proper type validation + group_size_raw = config.get("group_size") + if group_size_raw is None: + group_size = 16 # Default value + elif isinstance(group_size_raw, int): + group_size = group_size_raw + else: + try: + group_size = int(group_size_raw) + except (ValueError, TypeError): + raise ValueError(f"group_size must be an integer, got " + f"{type(group_size_raw)}") from None + + exclude_modules = config.get("exclude_modules", []) + if not isinstance(exclude_modules, list): + raise ValueError(f"exclude_modules must be a list, got " + f"{type(exclude_modules)}") + if quant_method not in QUANT_ALGOS: - raise ValueError(f"ModelOpt currently only supports: {QUANT_ALGOS}" - " quantizations in vLLM. Please check the " - "`hf_quant_config.json` file for your model's " - "quant configuration.") + raise ValueError( + f"ModelOpt currently only supports: {QUANT_ALGOS} " + "quantizations in vLLM. Please check the " + "`hf_quant_config.json` file for your model's " + "quant configuration.") is_checkpoint_nvfp4_serialized = ("NVFP4" in quant_method) - if ("group_size" and "kv_cache_quant_algo" - and "exclude_modules") not in quant_config: - raise ValueError("NVFP4 quantization requires group size and " - "kv_cache_quant_algo specified in " - "hf_quant_config.json") - kv_cache_quant_algo = quant_config["kv_cache_quant_algo"] - group_size = quant_config["group_size"] - exclude_modules = quant_config["exclude_modules"] + + # For FP4, these fields are required + if is_checkpoint_nvfp4_serialized and "quantization" in config: + # Check if required fields are present in the quantization config + quant_config = config["quantization"] + required_fields = [ + "group_size", "kv_cache_quant_algo", "exclude_modules" + ] + missing_fields = [ + field for field in required_fields if field not in quant_config + ] + if missing_fields: + raise ValueError( + f"NVFP4 quantization requires the following fields in " + f"hf_quant_config.json: {missing_fields}") + return cls(is_checkpoint_nvfp4_serialized, kv_cache_quant_algo, exclude_modules, group_size) From 52f22e5113c847e4d4ba3ac2aeeb4c1db1f3af9a Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Mon, 21 Jul 2025 08:37:49 -0700 Subject: [PATCH 227/552] [Misc] Add sliding window to flashinfer test (#21282) Signed-off-by: Woosuk Kwon Signed-off-by: x22x22 --- tests/kernels/attention/test_flashinfer.py | 49 ++++++++++++++-------- 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/tests/kernels/attention/test_flashinfer.py b/tests/kernels/attention/test_flashinfer.py index 3ad6e1d3291..8f9b4eceaa7 100644 --- a/tests/kernels/attention/test_flashinfer.py +++ b/tests/kernels/attention/test_flashinfer.py @@ -77,6 +77,7 @@ def ref_paged_attn( @pytest.mark.parametrize("block_size", BLOCK_SIZES) @pytest.mark.parametrize("dtype", DTYPES) @pytest.mark.parametrize("soft_cap", [None, 30.0, 50.0]) +@pytest.mark.parametrize("sliding_window", [None, 64]) @torch.inference_mode def test_flashinfer_decode_with_paged_kv( kv_lens: list[int], @@ -85,6 +86,7 @@ def test_flashinfer_decode_with_paged_kv( dtype: torch.dtype, block_size: int, soft_cap: Optional[float], + sliding_window: 
Optional[int], ) -> None: torch.set_default_device("cuda") current_platform.seed_everything(0) @@ -136,17 +138,20 @@ def test_flashinfer_decode_with_paged_kv( use_tensor_cores=( (num_query_heads//num_kv_heads) > 4) ) - wrapper.plan(kv_indptr, - kv_indices, - kv_last_page_lens, - num_query_heads, - num_kv_heads, - head_size, - block_size, - "NONE", - q_data_type=dtype, - kv_data_type=dtype, - logits_soft_cap=soft_cap) + wrapper.plan( + kv_indptr, + kv_indices, + kv_last_page_lens, + num_query_heads, + num_kv_heads, + head_size, + block_size, + "NONE", + window_left=sliding_window - 1 if sliding_window is not None else -1, + q_data_type=dtype, + kv_data_type=dtype, + logits_soft_cap=soft_cap, + ) output = wrapper.run(query, key_value_cache) @@ -157,7 +162,8 @@ def test_flashinfer_decode_with_paged_kv( kv_lens=kv_lens, block_tables=block_tables, scale=scale, - soft_cap=soft_cap) + soft_cap=soft_cap, + sliding_window=sliding_window) torch.testing.assert_close(output, ref_output, atol=1e-2, rtol=1e-2), \ f"{torch.max(torch.abs(output - ref_output))}" @@ -168,12 +174,17 @@ def test_flashinfer_decode_with_paged_kv( @pytest.mark.parametrize("block_size", BLOCK_SIZES) @pytest.mark.parametrize("dtype", DTYPES) @pytest.mark.parametrize("soft_cap", [None, 30.0, 50.0]) +@pytest.mark.parametrize("sliding_window", [None, 64]) @torch.inference_mode -def test_flashinfer_prefill_with_paged_kv(seq_lens: list[tuple[int, int]], - num_heads: tuple[int, int], - head_size: int, dtype: torch.dtype, - block_size: int, - soft_cap: Optional[float]) -> None: +def test_flashinfer_prefill_with_paged_kv( + seq_lens: list[tuple[int, int]], + num_heads: tuple[int, int], + head_size: int, + dtype: torch.dtype, + block_size: int, + soft_cap: Optional[float], + sliding_window: Optional[int], +) -> None: torch.set_default_device("cuda") current_platform.seed_everything(0) num_seqs = len(seq_lens) @@ -242,6 +253,7 @@ def test_flashinfer_prefill_with_paged_kv(seq_lens: list[tuple[int, int]], num_kv_heads, head_size, block_size, + window_left=sliding_window - 1 if sliding_window is not None else -1, q_data_type=dtype, kv_data_type=dtype, logits_soft_cap=soft_cap, @@ -259,7 +271,8 @@ def test_flashinfer_prefill_with_paged_kv(seq_lens: list[tuple[int, int]], kv_lens=kv_lens, block_tables=block_tables, scale=scale, - soft_cap=soft_cap) + soft_cap=soft_cap, + sliding_window=sliding_window) torch.testing.assert_close(output, ref_output, atol=5e-2, rtol=1e-2), \ f"{torch.max(torch.abs(output - ref_output))}" From 99a8655866d0d4897f9c1657e773283a1750ee68 Mon Sep 17 00:00:00 2001 From: "Li, Jiang" Date: Tue, 22 Jul 2025 00:07:08 +0800 Subject: [PATCH 228/552] [CPU] Enable shared-memory based pipeline parallel for CPU backend (#21289) Signed-off-by: jiang1.li Signed-off-by: x22x22 --- .../scripts/hardware_ci/run-cpu-test.sh | 18 ++--- csrc/cpu/shm.cpp | 69 +++++++++++++------ docs/getting_started/installation/cpu.md | 14 ++++ .../device_communicators/cpu_communicator.py | 60 +++++++++++++++- vllm/distributed/parallel_state.py | 12 ++++ vllm/engine/arg_utils.py | 9 +-- vllm/envs.py | 7 +- vllm/platforms/cpu.py | 35 ++++------ 8 files changed, 165 insertions(+), 59 deletions(-) diff --git a/.buildkite/scripts/hardware_ci/run-cpu-test.sh b/.buildkite/scripts/hardware_ci/run-cpu-test.sh index e3d47a0e6c1..90cc9c84462 100644 --- a/.buildkite/scripts/hardware_ci/run-cpu-test.sh +++ b/.buildkite/scripts/hardware_ci/run-cpu-test.sh @@ -6,6 +6,7 @@ set -ex # allow to bind to different cores CORE_RANGE=${CORE_RANGE:-48-95} +# used for TP/PP E2E 
test OMP_CORE_RANGE=${OMP_CORE_RANGE:-48-95} NUMA_NODE=${NUMA_NODE:-1} @@ -24,8 +25,8 @@ numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$NUMA_NODE numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$NUMA_NODE"-avx2 --target vllm-test -f docker/Dockerfile.cpu . # Run the image, setting --shm-size=4g for tensor parallel. -docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE" -docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2 +docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE" +docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2 function cpu_tests() { set -e @@ -78,17 +79,16 @@ function cpu_tests() { # tests/quantization/test_ipex_quant.py" # online serving - docker exec cpu-test-"$NUMA_NODE" bash -c " + docker exec cpu-test-"$NUMA_NODE" bash -c ' set -e - python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --dtype half & - timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1 - VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py \ + VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 & + timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1 + python3 benchmarks/benchmark_serving.py \ --backend vllm \ --dataset-name random \ - --model facebook/opt-125m \ + --model meta-llama/Llama-3.2-3B-Instruct \ --num-prompts 20 \ - --endpoint /v1/completions \ - --tokenizer facebook/opt-125m" + --endpoint /v1/completions' # Run multi-lora tests docker exec cpu-test-"$NUMA_NODE" bash -c " diff --git a/csrc/cpu/shm.cpp b/csrc/cpu/shm.cpp index 9adb6f27ec4..7e64e1c5219 100644 --- a/csrc/cpu/shm.cpp +++ b/csrc/cpu/shm.cpp @@ -7,7 +7,7 @@ namespace { #define MAX_SHM_RANK_NUM 8 -#define PER_THREAD_SHM_BUFFER_BYTES (2 * 1024 * 1024) +#define PER_THREAD_SHM_BUFFER_BYTES (4 * 1024 * 1024) static_assert(PER_THREAD_SHM_BUFFER_BYTES % 2 == 0); #define PER_THREAD_SHM_BUFFER_OFFSET (PER_THREAD_SHM_BUFFER_BYTES >> 1) #define MIN_THREAD_PROCESS_SIZE (256) @@ -34,9 +34,10 @@ struct KernelVecType { }; struct ThreadSHMContext { - volatile char _curr_thread_stamp; - volatile char _ready_thread_stamp; - char _padding1[6]; + volatile char _curr_thread_stamp[2]; + volatile char _ready_thread_stamp[2]; + int local_stamp_buffer_idx; + int remote_stamp_buffer_idx; int thread_id; int thread_num; int rank; @@ -45,23 +46,28 @@ struct ThreadSHMContext { int 
swizzled_ranks[MAX_SHM_RANK_NUM]; void* thread_shm_ptrs[MAX_SHM_RANK_NUM]; ThreadSHMContext* shm_contexts[MAX_SHM_RANK_NUM]; - size_t _thread_buffer_mask; - char _padding2[56]; + size_t _thread_buffer_mask[2]; + char _padding2[40]; ThreadSHMContext(const int thread_id, const int thread_num, const int rank, const int group_size, void* thread_shm_ptr) - : _curr_thread_stamp(1), - _ready_thread_stamp(0), + : local_stamp_buffer_idx(0), + remote_stamp_buffer_idx(0), thread_id(thread_id), thread_num(thread_num), rank(rank), group_size(group_size), - _spinning_count(0), - _thread_buffer_mask(0) { + _spinning_count(0) { static_assert(sizeof(ThreadSHMContext) % 64 == 0); TORCH_CHECK(group_size <= MAX_SHM_RANK_NUM); TORCH_CHECK((size_t)this % 64 == 0); TORCH_CHECK((size_t)thread_shm_ptr % 64 == 0); + _curr_thread_stamp[0] = 1; + _curr_thread_stamp[1] = 1; + _ready_thread_stamp[0] = 0; + _ready_thread_stamp[1] = 0; + _thread_buffer_mask[0] = 0; + _thread_buffer_mask[1] = 0; for (int i = 0; i < MAX_SHM_RANK_NUM; ++i) { shm_contexts[i] = nullptr; thread_shm_ptrs[i] = nullptr; @@ -70,6 +76,11 @@ struct ThreadSHMContext { set_context(rank, this, thread_shm_ptr); } + void set_stamp_buffer_idx(int local, int remote) { + local_stamp_buffer_idx = local; + remote_stamp_buffer_idx = remote; + } + void set_context(int rank, ThreadSHMContext* ptr, void* thread_shm_ptr) { TORCH_CHECK(rank < MAX_SHM_RANK_NUM); TORCH_CHECK(ptr); @@ -84,23 +95,27 @@ struct ThreadSHMContext { T* get_thread_shm_ptr(int rank) { return reinterpret_cast( reinterpret_cast(thread_shm_ptrs[rank]) + - (PER_THREAD_SHM_BUFFER_OFFSET & _thread_buffer_mask)); + (PER_THREAD_SHM_BUFFER_OFFSET & + _thread_buffer_mask[local_stamp_buffer_idx])); } - void next_buffer() { _thread_buffer_mask ^= 0xFFFFFFFFFFFFFFFF; } + void next_buffer() { + _thread_buffer_mask[local_stamp_buffer_idx] ^= 0xFFFFFFFFFFFFFFFF; + } - char get_curr_stamp() const { return _curr_thread_stamp; } + char get_curr_stamp(int idx) const { return _curr_thread_stamp[idx]; } - char get_ready_stamp() const { return _ready_thread_stamp; } + char get_ready_stamp(int idx) const { return _ready_thread_stamp[idx]; } void next_stamp() { _mm_mfence(); - _curr_thread_stamp += 1; + _curr_thread_stamp[local_stamp_buffer_idx] += 1; } void commit_ready_stamp() { _mm_mfence(); - _ready_thread_stamp = _curr_thread_stamp; + _ready_thread_stamp[local_stamp_buffer_idx] = + _curr_thread_stamp[local_stamp_buffer_idx]; } int get_swizzled_rank(int idx) { return swizzled_ranks[idx]; } @@ -117,10 +132,11 @@ struct ThreadSHMContext { void wait_for_one(int rank, Cond&& cond) { ThreadSHMContext* rank_ctx = shm_contexts[rank]; for (;;) { - char local_curr_stamp = get_curr_stamp(); - char local_ready_stamp = get_ready_stamp(); - char rank_curr_stamp = rank_ctx->get_curr_stamp(); - char rank_ready_stamp = rank_ctx->get_ready_stamp(); + char local_curr_stamp = get_curr_stamp(local_stamp_buffer_idx); + char local_ready_stamp = get_ready_stamp(local_stamp_buffer_idx); + char rank_curr_stamp = rank_ctx->get_curr_stamp(remote_stamp_buffer_idx); + char rank_ready_stamp = + rank_ctx->get_ready_stamp(remote_stamp_buffer_idx); if (cond(local_curr_stamp, local_ready_stamp, rank_curr_stamp, rank_ready_stamp)) { break; @@ -361,6 +377,15 @@ void shm_cc_loop(ThreadSHMContext* ctx, int64_t elem_num, F&& inner_func) { } } } + +void reset_threads_stamp_buffer_idx(ThreadSHMContext* ctx, int local, + int remote) { + int thread_num = ctx->thread_num; + for (int i = 0; i < thread_num; ++i) { + ThreadSHMContext* thread_ctx = ctx + i; + 
thread_ctx->set_stamp_buffer_idx(local, remote);
+  }
+}
 };  // namespace shm_cc_ops
 
 namespace shm_cc_ops {
@@ -632,6 +657,7 @@ void shm_send_tensor_list_impl(ThreadSHMContext* ctx, int64_t dst,
   TensorListMeta* metadata = new (metadata_tensor.data_ptr()) TensorListMeta();
   metadata->bind_tensor_list(tensor_list_with_metadata);
 
+  shm_cc_ops::reset_threads_stamp_buffer_idx(ctx, 0, 1);
   shm_cc_ops::shm_cc_loop(
       ctx, metadata->total_bytes,
       [&](ThreadSHMContext* thread_ctx, int64_t data_offset,
@@ -659,6 +685,7 @@ std::vector shm_recv_tensor_list_impl(ThreadSHMContext* ctx,
   torch::Tensor metadata_tensor =
       torch::empty({sizeof(TensorListMeta)}, options);
 
+  shm_cc_ops::reset_threads_stamp_buffer_idx(ctx, 1, 0);
   ctx->wait_for_one(src, ThreadSHMContext::check_stamp_ready);
   shm_cc_ops::memcpy(metadata_tensor.data_ptr(),
                      ctx->get_thread_shm_ptr(src),
@@ -677,7 +704,7 @@ std::vector shm_recv_tensor_list_impl(ThreadSHMContext* ctx,
       ctx, metadata.total_bytes,
       [&](ThreadSHMContext* thread_ctx, int64_t data_offset,
           int64_t data_elem_num, bool fast_mode) {
-        ctx->wait_for_one(src, ThreadSHMContext::check_stamp_ready);
+        thread_ctx->wait_for_one(src, ThreadSHMContext::check_stamp_ready);
         int64_t curr_shm_offset = 0;
         while (curr_shm_offset < data_elem_num) {
           MemPiece frag = metadata.get_data(data_offset + curr_shm_offset);
diff --git a/docs/getting_started/installation/cpu.md b/docs/getting_started/installation/cpu.md
index d77e7383650..5721195172d 100644
--- a/docs/getting_started/installation/cpu.md
+++ b/docs/getting_started/installation/cpu.md
@@ -166,6 +166,20 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe
 
 - This value is 4GB by default. Larger space can support more concurrent requests, longer context length. However, users should take care of memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`, if it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.
 
+### How to do performance tuning for vLLM CPU?
+
+  - First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU core usage via `htop`.
+
+  - Inference batch size is an important parameter for performance. A larger batch usually provides higher throughput, while a smaller batch provides lower latency. Tuning the max batch size, starting from the default value, to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM:
+    - `--max-num-batched-tokens`, defines the limit of token numbers in a single batch, has more impact on the first token performance. The default value is set as:
+      - Offline Inference: `4096 * world_size`
+      - Online Serving: `2048 * world_size`
+    - `--max-num-seqs`, defines the limit of sequence numbers in a single batch, has more impact on the output token performance.
+      - Offline Inference: `256 * world_size`
+      - Online Serving: `128 * world_size`
+
+  - vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more details on tuning TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommended to use TP and PP together if there are enough CPU sockets and memory nodes.
+
 ### Which quantization configs does vLLM CPU support?
- vLLM CPU supports quantizations: diff --git a/vllm/distributed/device_communicators/cpu_communicator.py b/vllm/distributed/device_communicators/cpu_communicator.py index 94effa0b2ca..bda567f8489 100644 --- a/vllm/distributed/device_communicators/cpu_communicator.py +++ b/vllm/distributed/device_communicators/cpu_communicator.py @@ -2,11 +2,12 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import os -from typing import Optional +from typing import Any, Optional, Union import torch from torch.distributed import ProcessGroup +from vllm.distributed.utils import pickle from vllm.platforms import current_platform from vllm.platforms.interface import CpuArchEnum @@ -26,7 +27,8 @@ def __init__(self, if (current_platform.get_cpu_architecture() == CpuArchEnum.X86) and hasattr( torch.ops._C, - "init_shm_manager") and unique_name.startswith("tp"): + "init_shm_manager") and (unique_name.startswith("tp") + or unique_name.startswith("pp")): self.dist_module = _CPUSHMDistributed(self) def all_reduce(self, input_): @@ -94,6 +96,19 @@ def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor: input_size[dim + 1:]) return output_tensor + def send_tensor_dict( + self, + tensor_dict: dict[str, Union[torch.Tensor, Any]], + dst: int, + ) -> None: + return self.dist_module.send_tensor_dict(tensor_dict, dst) + + def recv_tensor_dict( + self, + src: int, + ) -> dict[str, Union[torch.Tensor, Any]]: + return self.dist_module.recv_tensor_dict(src) + class _CPUSHMDistributed: @@ -143,3 +158,44 @@ def all_gather_into_tensor(self, input: torch.Tensor, group: Optional[ProcessGroup] = None) -> None: torch.ops._C.shm_all_gather(self.handle, input, output) + + def send_tensor_dict( + self, + tensor_dict: dict[str, Union[torch.Tensor, Any]], + dst: int, + ) -> None: + key_list = list(tensor_dict.keys()) + value_list = list(tensor_dict.values()) + size_list = [] + for v in value_list: + if not isinstance(v, torch.Tensor): + raise RuntimeError( + "CpuCommunicator only supports sending tensors.") + size_list.append(v.size()) + key_size_tensor = torch.frombuffer(pickle.dumps([key_list, size_list]), + dtype=torch.uint8) + value_list.append(key_size_tensor) + + torch.ops._C.shm_send_tensor_list(self.handle, value_list, dst) + + return None + + def recv_tensor_dict( + self, + src: int, + ) -> dict[str, Union[torch.Tensor, Any]]: + tensor_list = torch.ops._C.shm_recv_tensor_list(self.handle, src) + + value_list: list[torch.Tensor] = tensor_list[:-1] + key_size_tensor = tensor_list[-1] + + key_size = pickle.loads(key_size_tensor.numpy().tobytes()) + key_list = key_size[0] + size_list = key_size[1] + assert len(key_list) == len(size_list) + assert len(key_list) == len(value_list) + + tensor_dict: dict[str, torch.Tensor] = {} + for key, size, t in zip(key_list, size_list, value_list): + tensor_dict[key] = t.view(size) + return tensor_dict diff --git a/vllm/distributed/parallel_state.py b/vllm/distributed/parallel_state.py index 1bb0ca79cc1..1f7a14920c4 100644 --- a/vllm/distributed/parallel_state.py +++ b/vllm/distributed/parallel_state.py @@ -272,6 +272,9 @@ def __init__( self.use_custom_op_call = (current_platform.is_cuda_alike() or current_platform.is_tpu()) + self.use_cpu_custom_send_recv = (current_platform.is_cpu() and hasattr( + torch.ops._C, "init_shm_manager")) + @property def first_rank(self): """Return the global rank of the first process in the group""" @@ -663,6 +666,11 @@ def send_tensor_dict( dst = (self.rank_in_group + 1) % self.world_size assert dst < self.world_size, f"Invalid 
dst rank ({dst})" + if self.use_cpu_custom_send_recv: + self.device_communicator.send_tensor_dict( # type: ignore + tensor_dict, dst) + return None + metadata_list: list[tuple[Any, Any]] = [] assert isinstance( tensor_dict, @@ -718,6 +726,10 @@ def recv_tensor_dict( src = (self.rank_in_group - 1) % self.world_size assert src < self.world_size, f"Invalid src rank ({src})" + if self.use_cpu_custom_send_recv: + return self.device_communicator.recv_tensor_dict( # type: ignore + src) + recv_metadata_list = self.recv_object(src=src) tensor_dict: dict[str, Any] = {} for key, value in recv_metadata_list: diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 019ff033eda..28b1c1c363a 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -1639,13 +1639,14 @@ def _set_default_args_v1(self, usage_context: UsageContext, # cpu specific default values. if current_platform.is_cpu(): + world_size = self.pipeline_parallel_size * self.tensor_parallel_size default_max_num_batched_tokens = { - UsageContext.LLM_CLASS: 4096, - UsageContext.OPENAI_API_SERVER: 2048, + UsageContext.LLM_CLASS: 4096 * world_size, + UsageContext.OPENAI_API_SERVER: 2048 * world_size, } default_max_num_seqs = { - UsageContext.LLM_CLASS: 128, - UsageContext.OPENAI_API_SERVER: 32, + UsageContext.LLM_CLASS: 256 * world_size, + UsageContext.OPENAI_API_SERVER: 128 * world_size, } use_context_value = usage_context.value if usage_context else None diff --git a/vllm/envs.py b/vllm/envs.py index c5f97de807a..16f635b3ac4 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -42,7 +42,7 @@ VLLM_USE_FLASHINFER_SAMPLER: Optional[bool] = None VLLM_FLASHINFER_FORCE_TENSOR_CORES: bool = False VLLM_PP_LAYER_PARTITION: Optional[str] = None - VLLM_CPU_KVCACHE_SPACE: int = 0 + VLLM_CPU_KVCACHE_SPACE: Optional[int] = 0 VLLM_CPU_OMP_THREADS_BIND: str = "" VLLM_CPU_NUM_OF_RESERVED_CPU: Optional[int] = None VLLM_CPU_MOE_PREPACK: bool = True @@ -430,9 +430,10 @@ def get_vllm_port() -> Optional[int]: lambda: os.getenv("VLLM_PP_LAYER_PARTITION", None), # (CPU backend only) CPU key-value cache space. - # default is 4 GiB + # default is None and will be set as 4 GB "VLLM_CPU_KVCACHE_SPACE": - lambda: int(os.getenv("VLLM_CPU_KVCACHE_SPACE", "0")), + lambda: int(os.getenv("VLLM_CPU_KVCACHE_SPACE", "0")) + if "VLLM_CPU_KVCACHE_SPACE" in os.environ else None, # (CPU backend only) CPU core ids bound by OpenMP threads, e.g., "0-31", # "0,1,2", "0-31,33". CPU cores of different ranks are separated by '|'. 
diff --git a/vllm/platforms/cpu.py b/vllm/platforms/cpu.py index 70c339c9bc9..31a67183ff1 100644 --- a/vllm/platforms/cpu.py +++ b/vllm/platforms/cpu.py @@ -104,8 +104,19 @@ def get_attn_backend_cls(cls, selected_backend: _Backend, head_size: int, @classmethod def get_device_total_memory(cls, device_id: int = 0) -> int: - import psutil - return psutil.virtual_memory().total + import vllm.envs as envs + from vllm.utils import GiB_bytes + + kv_cache_space = envs.VLLM_CPU_KVCACHE_SPACE + if kv_cache_space is None: + kv_cache_space = 4 * GiB_bytes # type: ignore + logger.warning_once( + "Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) " + "for CPU backend is not set, using 4 by default.") + else: + kv_cache_space *= GiB_bytes + + return kv_cache_space @classmethod def set_device(cls, device: torch.device) -> None: @@ -124,8 +135,6 @@ def inference_mode(cls): @classmethod def check_and_update_config(cls, vllm_config: VllmConfig) -> None: - import vllm.envs as envs - from vllm.utils import GiB_bytes model_config = vllm_config.model_config if model_config is not None: @@ -162,20 +171,8 @@ def check_and_update_config(cls, vllm_config: VllmConfig) -> None: " support fp16 for now, cast to bf16.") model_config.dtype = torch.bfloat16 - kv_cache_space = envs.VLLM_CPU_KVCACHE_SPACE - - if kv_cache_space >= 0: - if kv_cache_space == 0: - cache_config.cpu_kvcache_space_bytes = 4 * GiB_bytes # type: ignore - logger.warning( - "Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) " - "for CPU backend is not set, using 4 by default.") - else: - cache_config.cpu_kvcache_space_bytes = kv_cache_space * GiB_bytes # type: ignore # noqa - else: - raise RuntimeError( - "Invalid environment variable VLLM_CPU_KVCACHE_SPACE" - f" {kv_cache_space}, expect a positive integer value.") + cache_config.cpu_kvcache_space_bytes = \ + CpuPlatform.get_device_total_memory() parallel_config = vllm_config.parallel_config if (parallel_config.world_size > 1 @@ -216,8 +213,6 @@ def check_and_update_config(cls, vllm_config: VllmConfig) -> None: False, "nan_asserts": False, - "memory_planning": - True, "epilogue_fusion": True, }) From 2ccd24377d3ddb6f8176feb46f8a528605be7d4f Mon Sep 17 00:00:00 2001 From: simpx Date: Tue, 22 Jul 2025 00:07:36 +0800 Subject: [PATCH 229/552] [BugFix] make utils.current_stream thread-safety (#21252) (#21253) Signed-off-by: simpx Signed-off-by: x22x22 --- tests/test_utils.py | 44 +++++++++++++++++++++++++++++++++++++++--- vllm/utils/__init__.py | 15 +++++++------- 2 files changed, 48 insertions(+), 11 deletions(-) diff --git a/tests/test_utils.py b/tests/test_utils.py index 28acacd2519..53a34642e5b 100644 --- a/tests/test_utils.py +++ b/tests/test_utils.py @@ -23,9 +23,9 @@ from vllm.utils import (CacheInfo, FlexibleArgumentParser, LRUCache, MemorySnapshot, PlaceholderModule, StoreBoolean, bind_kv_cache, common_broadcastable_dtype, - deprecate_kwargs, get_open_port, get_tcp_uri, - is_lossless_cast, join_host_port, make_zmq_path, - make_zmq_socket, memory_profiling, + current_stream, deprecate_kwargs, get_open_port, + get_tcp_uri, is_lossless_cast, join_host_port, + make_zmq_path, make_zmq_socket, memory_profiling, merge_async_iterators, sha256, split_host_port, split_zmq_path, supports_kw, swap_dict_values) @@ -957,3 +957,41 @@ def test_convert_ids_list_to_tokens(): ] tokens = convert_ids_list_to_tokens(tokenizer, token_ids) assert tokens == ['Hello', ',', ' world', '!'] + + +def test_current_stream_multithread(): + import threading + if not torch.cuda.is_available(): + pytest.skip("CUDA not available") + + 
main_default_stream = torch.cuda.current_stream() + child_stream = torch.cuda.Stream() + + thread_stream_ready = threading.Event() + thread_can_exit = threading.Event() + + def child_thread_func(): + with torch.cuda.stream(child_stream): + thread_stream_ready.set() + thread_can_exit.wait(timeout=10) + + child_thread = threading.Thread(target=child_thread_func) + child_thread.start() + + try: + assert thread_stream_ready.wait( + timeout=5), "Child thread failed to enter stream context in time" + + main_current_stream = current_stream() + + assert main_current_stream != child_stream, "Main thread's current_stream was contaminated by child thread" + assert main_current_stream == main_default_stream, "Main thread's current_stream is not the default stream" + + # Notify child thread it can exit + thread_can_exit.set() + + finally: + # Ensure child thread exits properly + child_thread.join(timeout=5) + if child_thread.is_alive(): + pytest.fail("Child thread failed to exit properly") diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index bbcc2a523dc..e4f495e22e2 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -1383,12 +1383,11 @@ def find_nccl_library() -> str: prev_set_stream = torch.cuda.set_stream -_current_stream = None +_current_stream_tls = threading.local() def _patched_set_stream(stream: torch.cuda.Stream) -> None: - global _current_stream - _current_stream = stream + _current_stream_tls.value = stream prev_set_stream(stream) @@ -1407,16 +1406,16 @@ def current_stream() -> torch.cuda.Stream: from C/C++ code. """ from vllm.platforms import current_platform - global _current_stream - if _current_stream is None: + if not hasattr(_current_stream_tls, + "value") or _current_stream_tls.value is None: # when this function is called before any stream is set, # we return the default stream. # On ROCm using the default 0 stream in combination with RCCL # is hurting performance. Therefore creating a dedicated stream # per process - _current_stream = torch.cuda.Stream() if current_platform.is_rocm( - ) else torch.cuda.current_stream() - return _current_stream + _current_stream_tls.value = torch.cuda.Stream( + ) if current_platform.is_rocm() else torch.cuda.current_stream() + return _current_stream_tls.value def enable_trace_function_call_for_thread(vllm_config: VllmConfig) -> None: From c65e3982d30c0e7b66b72b53fe6e7bd69f13e5d6 Mon Sep 17 00:00:00 2001 From: Ming Yang Date: Mon, 21 Jul 2025 09:08:09 -0700 Subject: [PATCH 230/552] [Misc] Add dummy maverick test (#21199) Signed-off-by: Ming Yang Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 --- .../multimodal/generation/test_maverick.py | 649 ++++++++++++++++++ 1 file changed, 649 insertions(+) create mode 100644 tests/models/multimodal/generation/test_maverick.py diff --git a/tests/models/multimodal/generation/test_maverick.py b/tests/models/multimodal/generation/test_maverick.py new file mode 100644 index 00000000000..083dc66148e --- /dev/null +++ b/tests/models/multimodal/generation/test_maverick.py @@ -0,0 +1,649 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +Create a reduced-layer version of the Maverick model for testing purposes. + +This script creates a new model with fewer layers by: +1. Loading the original Maverick model configuration +2. Creating a reduced configuration +3. Generating compatible safetensors files with appropriate weights +4. 
Creating the necessary index files for vLLM compatibility +""" + +import json +import shutil +from pathlib import Path +from typing import Any + +import pytest +import torch +from safetensors.torch import save_file +from transformers import (AutoConfig, AutoProcessor, AutoTokenizer, + GenerationConfig) + +from vllm import LLM, SamplingParams + +# Sample prompts for testing +PROMPTS: list[str] = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", +] + + +def run_maverick_serving(model: str): + """Test Llama-4-Maverick model with vLLM LLM class using CLI equivalent + options with reduced layers. + """ + + try: + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + llm = LLM( + model=model, + max_model_len=2048, + enforce_eager=True, + tensor_parallel_size=8, + enable_expert_parallel=True, + trust_remote_code=True, + gpu_memory_utilization=0.4, + kv_cache_dtype="fp8", + ) + + outputs = llm.generate(PROMPTS, sampling_params) + + # Print the outputs + print("\nGenerated Outputs:\n" + "-" * 60) + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}") + print(f"Output: {generated_text!r}") + print("-" * 60) + + except Exception as e: + print(f"Error initializing or running model: {e}") + raise + + +def create_reduced_maverick_model( + original_model_name: + str = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", + output_dir: str = "/tmp/reduced_maverick", + text_layers: int = 4, + num_experts: int = 4, + vision_layers: int = 2, + force_recreate: bool = False, +) -> str: + """ + Create a reduced-layer version of the Maverick model. + + Args: + original_model_name: Name of the original Maverick model + output_dir: Directory to save the reduced model + text_layers: Number of text transformer layers + num_experts: Number of experts per layer + vision_layers: Number of vision transformer layers + force_recreate: Whether to recreate if output_dir already exists + + Returns: + Path to the created reduced model directory + """ + + print( + f"Creating reduced Maverick model with {text_layers} text layers and " + f"{vision_layers} vision layers...") + + # Create output directory + output_path = Path(output_dir) + if output_path.exists(): + if force_recreate: + shutil.rmtree(output_path) + else: + print(f"Output directory {output_dir} already exists. 
" + "Use --force-recreate to overwrite.") + return str(output_path) + + output_path.mkdir(parents=True, exist_ok=True) + + try: + print("Loading original model configuration...") + original_config = AutoConfig.from_pretrained(original_model_name, + trust_remote_code=True) + + print("Creating reduced configuration...") + reduced_config = create_reduced_config(original_config, text_layers, + num_experts, vision_layers) + + config_path = output_path / "config.json" + with open(config_path, "w") as f: + json.dump(reduced_config, f, indent=2) + print(f"Saved reduced config to {config_path}") + + print("Copying tokenizer files...") + copy_tokenizer_files(original_model_name, output_path) + + print("Creating reduced safetensors files...") + create_reduced_safetensors(original_config, reduced_config, + output_path) + + print("Creating preprocessor config...") + create_preprocessor_config(original_config, output_path) + + try: + gen_config = GenerationConfig.from_pretrained(original_model_name) + gen_config.save_pretrained(output_path) + print("Copied generation config") + except Exception as e: + print(f"Could not copy generation config: {e}") + + print(f"Successfully created reduced Maverick model at {output_path}") + return str(output_path) + + except Exception as e: + print(f"Error creating reduced model: {e}") + # Clean up on failure + if output_path.exists(): + shutil.rmtree(output_path) + raise + + +def create_reduced_config(original_config: Any, text_layers: int, + num_experts: int, + vision_layers: int) -> dict[str, Any]: + """Create a reduced configuration based on the original.""" + + # Convert config to dictionary + config_dict = original_config.to_dict() + + # Reduce text layers + if "text_config" in config_dict: + original_text_layers = config_dict["text_config"]["num_hidden_layers"] + config_dict["text_config"]["num_hidden_layers"] = text_layers + print( + f"Reduced text layers from {original_text_layers} to {text_layers}" + ) + + original_num_experts = config_dict["text_config"]["num_local_experts"] + config_dict["text_config"]["num_local_experts"] = num_experts + print( + f"Reduced num experts from {original_num_experts} to {num_experts}" + ) + + hidden_dim_divisor = 4 + + original_hidden_size = config_dict["text_config"]["hidden_size"] + new_hidden_size = original_hidden_size // hidden_dim_divisor + config_dict["text_config"]["hidden_size"] = new_hidden_size + print(f"Reduced hidden size from {original_hidden_size} to " + f"{new_hidden_size}") + + original_head_dim = config_dict["text_config"]["head_dim"] + new_head_dim = original_head_dim // hidden_dim_divisor + config_dict["text_config"]["head_dim"] = new_head_dim + print(f"Reduced head dim from {original_head_dim} to {new_head_dim}") + + # Reduce vision layers + if "vision_config" in config_dict: + original_vision_layers = config_dict["vision_config"][ + "num_hidden_layers"] + config_dict["vision_config"]["num_hidden_layers"] = vision_layers + print(f"Reduced vision layers from {original_vision_layers} " + f"to {vision_layers}") + + # Update model name to indicate it's a reduced version + config_dict["_name_or_path"] = ( + f"reduced_maverick_{text_layers}t_{vision_layers}v") + + return config_dict + + +def copy_tokenizer_files(original_model_name: str, output_path: Path) -> None: + """Copy tokenizer files from the original model.""" + + try: + tokenizer = AutoTokenizer.from_pretrained(original_model_name, + trust_remote_code=True) + tokenizer.save_pretrained(output_path) + print("Tokenizer files copied successfully") + except 
Exception as e: + print(f"Warning: Could not copy tokenizer files: {e}") + + +def create_preprocessor_config(original_config: Any, + output_path: Path) -> None: + """Create preprocessor_config.json for multimodal model.""" + + # Try to load the original preprocessor config + try: + processor = AutoProcessor.from_pretrained( + original_config._name_or_path + or "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", + trust_remote_code=True, + ) + processor.save_pretrained(output_path) + print("Copied original preprocessor config") + return + except Exception as e: + print(f"Could not copy original preprocessor config: {e}") + raise + + +def create_reduced_safetensors(original_config: Any, reduced_config: dict[str, + Any], + output_path: Path) -> None: + """Create safetensors files with weights for the reduced model.""" + + print("Generating synthetic weights for reduced model...") + + text_config = reduced_config["text_config"] + vision_config = reduced_config["vision_config"] + + weights = {} + + print("Creating text model weights...") + weights.update(create_text_model_weights(text_config)) + + print("Creating vision model weights...") + weights.update(create_vision_model_weights(vision_config)) + + print("Creating shared model weights...") + weights.update(create_shared_weights(text_config, vision_config)) + + print("Saving weights to safetensors files...") + save_weights_to_safetensors(weights, output_path) + + +def create_text_model_weights( + text_config: dict[str, Any]) -> dict[str, torch.Tensor]: + """Create synthetic weights for the text model with MoE structure.""" + + weights = {} + + vocab_size = text_config["vocab_size"] + hidden_size = text_config["hidden_size"] + intermediate_size = text_config["intermediate_size"] + intermediate_size_mlp = text_config["intermediate_size_mlp"] + num_layers = text_config["num_hidden_layers"] + num_attention_heads = text_config["num_attention_heads"] + num_key_value_heads = text_config.get("num_key_value_heads", + num_attention_heads) + + # MoE specific parameters + num_experts = text_config.get("num_local_experts") + assert (num_experts + is not None), "num_local_experts must be specified for MoE" + + head_dim = hidden_size // num_attention_heads + + # Embedding layers + weights["language_model.model.embed_tokens.weight"] = torch.randn( + vocab_size, hidden_size, dtype=torch.float16) + + # Transformer layers + for layer_idx in range(num_layers): + layer_prefix = f"language_model.model.layers.{layer_idx}" + print(f"Creating weights for layer {layer_prefix}...") + + # Self-attention weights (separate q, k, v projections) + weights[f"{layer_prefix}.self_attn.q_proj.weight"] = torch.randn( + hidden_size, num_attention_heads * head_dim, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.k_proj.weight"] = torch.randn( + hidden_size, num_key_value_heads * head_dim, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.v_proj.weight"] = torch.randn( + num_key_value_heads * head_dim, hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.o_proj.weight"] = torch.randn( + hidden_size, num_attention_heads * head_dim, dtype=torch.bfloat16) + print("Self-attention weights created.") + + # Feed-forward weights - MoE pattern based on interleave_moe_layer_step + # For interleave_moe_layer_step=2: layers 1,3,5,... are MoE, layers + # 0,2,4,... 
are dense + interleave_step = text_config.get("interleave_moe_layer_step", 1) + is_moe_layer = (interleave_step > 0 + and (layer_idx + 1) % interleave_step == 0) + + if is_moe_layer: + # MoE layer structure + # 1. Router weights + weights[ + f"{layer_prefix}.feed_forward.router.weight"] = torch.randn( + num_experts, hidden_size, dtype=torch.float16) + + # 2. Individual expert weights (not fused) + for expert_idx in range(num_experts): + expert_prefix = ( + f"{layer_prefix}.feed_forward.experts.{expert_idx}") + + weights[f"{expert_prefix}.gate_proj.weight"] = torch.randn( + intermediate_size, hidden_size, dtype=torch.bfloat16) + weights[f"{expert_prefix}.up_proj.weight"] = torch.randn( + intermediate_size, hidden_size, dtype=torch.bfloat16) + weights[f"{expert_prefix}.down_proj.weight"] = torch.randn( + hidden_size, intermediate_size, dtype=torch.bfloat16) + + # Expert weight scales (FP8 quantization) + weights[ + f"{expert_prefix}.gate_proj.weight_scale"] = torch.ones( + intermediate_size, 1, dtype=torch.bfloat16) + weights[f"{expert_prefix}.up_proj.weight_scale"] = torch.ones( + intermediate_size, 1, dtype=torch.bfloat16) + weights[ + f"{expert_prefix}.down_proj.weight_scale"] = torch.ones( + hidden_size, 1, dtype=torch.bfloat16) + + # 3. Shared expert weights + shared_expert_prefix = f"{layer_prefix}.feed_forward.shared_expert" + weights[f"{shared_expert_prefix}.gate_proj.weight"] = torch.randn( + intermediate_size, hidden_size, dtype=torch.bfloat16) + weights[f"{shared_expert_prefix}.up_proj.weight"] = torch.randn( + intermediate_size, hidden_size, dtype=torch.bfloat16) + weights[f"{shared_expert_prefix}.down_proj.weight"] = torch.randn( + hidden_size, intermediate_size, dtype=torch.bfloat16) + print(f"MoE feed-forward weights created for layer {layer_idx}.") + else: + # Dense layer structure + weights[f"{layer_prefix}.feed_forward.gate_proj.weight"] = ( + torch.randn(intermediate_size_mlp, + hidden_size, + dtype=torch.bfloat16)) + weights[f"{layer_prefix}.feed_forward.up_proj.weight"] = ( + torch.randn(intermediate_size_mlp, + hidden_size, + dtype=torch.bfloat16)) + weights[f"{layer_prefix}.feed_forward.down_proj.weight"] = ( + torch.randn(hidden_size, + intermediate_size_mlp, + dtype=torch.bfloat16)) + print(f"Dense feed-forward weights created for layer {layer_idx}.") + + # Layer norms + weights[f"{layer_prefix}.input_layernorm.weight"] = torch.ones( + hidden_size, dtype=torch.bfloat16) + weights[ + f"{layer_prefix}.post_attention_layernorm.weight"] = torch.ones( + hidden_size, dtype=torch.bfloat16) + print("Layer norms created.") + + # Final layer norm and output projection + weights["language_model.model.norm.weight"] = torch.ones( + hidden_size, dtype=torch.bfloat16) + weights["language_model.lm_head.weight"] = torch.randn( + vocab_size, hidden_size, dtype=torch.bfloat16) + + return weights + + +def create_vision_model_weights( + vision_config: dict[str, Any]) -> dict[str, torch.Tensor]: + """Create synthetic weights for the vision model.""" + + weights = {} + + hidden_size = vision_config["hidden_size"] + intermediate_size = vision_config["intermediate_size"] + num_layers = vision_config["num_hidden_layers"] + + # Vision transformer layers + for layer_idx in range(num_layers): + layer_prefix = f"vision_model.model.layers.{layer_idx}" + + weights[f"{layer_prefix}.self_attn.q_proj.weight"] = torch.randn( + hidden_size, hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.q_proj.bias"] = torch.zeros( + hidden_size, dtype=torch.bfloat16) + 
weights[f"{layer_prefix}.self_attn.k_proj.weight"] = torch.randn( + hidden_size, hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.k_proj.bias"] = torch.zeros( + hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.v_proj.weight"] = torch.randn( + hidden_size, hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.v_proj.bias"] = torch.zeros( + hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.o_proj.weight"] = torch.randn( + hidden_size, hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.self_attn.o_proj.bias"] = torch.zeros( + hidden_size, dtype=torch.bfloat16) + + weights[f"{layer_prefix}.mlp.fc1.weight"] = torch.randn( + intermediate_size, hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.mlp.fc1.bias"] = torch.zeros( + intermediate_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.mlp.fc2.weight"] = torch.randn( + hidden_size, intermediate_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.mlp.fc2.bias"] = torch.zeros( + hidden_size, dtype=torch.bfloat16) + + weights[f"{layer_prefix}.input_layernorm.weight"] = torch.ones( + hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.input_layernorm.bias"] = torch.zeros( + hidden_size, dtype=torch.bfloat16) + weights[ + f"{layer_prefix}.post_attention_layernorm.weight"] = torch.ones( + hidden_size, dtype=torch.bfloat16) + weights[f"{layer_prefix}.post_attention_layernorm.bias"] = torch.zeros( + hidden_size, dtype=torch.bfloat16) + + return weights + + +def create_shared_weights( + text_config: dict[str, Any], + vision_config: dict[str, Any]) -> dict[str, torch.Tensor]: + """Create weights for shared components (vision-language connector)""" + + weights = {} + + text_hidden_size = text_config["hidden_size"] + projector_input_dim = vision_config["projector_input_dim"] + + # Vision-language connector (projects vision features to text space) + weights["multi_modal_projector.linear_1.weight"] = torch.randn( + text_hidden_size, projector_input_dim, dtype=torch.bfloat16) + + return weights + + +def save_weights_to_safetensors(weights: dict[str, torch.Tensor], + output_path: Path) -> None: + """Save weights to safetensors files and create index.""" + + # Determine how to shard the weights + max_shard_size = 5 * 1024 * 1024 * 1024 # 5GB per shard + + # Calculate sizes and create shards + shards = [] + current_shard: dict[str, torch.Tensor] = {} + current_size = 0 + + for name, tensor in weights.items(): + tensor_size = tensor.numel() * tensor.element_size() + + if current_size + tensor_size > max_shard_size and current_shard: + shards.append(current_shard) + current_shard = {} + current_size = 0 + + current_shard[name] = tensor + current_size += tensor_size + + if current_shard: + shards.append(current_shard) + + # Save shards and create index + weight_map = {} + + if len(shards) == 1: + # Single file + filename = "model.safetensors" + save_file(shards[0], output_path / filename) + weight_map = {name: filename for name in shards[0]} + print(f"Saved weights to single file: {filename}") + else: + # Multiple shards + for i, shard in enumerate(shards): + filename = f"model-{i+1:05d}-of-{len(shards):05d}.safetensors" + save_file(shard, output_path / filename) + for name in shard: + weight_map[name] = filename + print(f"Saved shard {i+1}/{len(shards)}: {filename}") + + # Create index file + index_data = { + "metadata": { + "total_size": + sum(tensor.numel() * tensor.element_size() + for tensor in weights.values()) + }, + "weight_map": 
weight_map, + } + + index_path = output_path / "model.safetensors.index.json" + with open(index_path, "w") as f: + json.dump(index_data, f, indent=2) + + print(f"Created index file: {index_path}") + print(f"Total model size: " + f"{index_data['metadata']['total_size'] / (1024**3):.2f} GB") + + +def run_reduced_model(model_path: str, + should_profile: bool = False, + **kwargs) -> None: + """Test the created reduced model with vLLM.""" + + print(f"\nTesting reduced model at {model_path}...") + + llm = LLM( + model=model_path, + trust_remote_code=True, + max_model_len=512, # Small context for testing + gpu_memory_utilization=0.3, # Conservative memory usage + **kwargs, + ) + + sampling_params = SamplingParams(temperature=0.8, + top_p=0.95, + max_tokens=50) + + if should_profile: + llm.start_profile() + outputs = llm.generate(PROMPTS, sampling_params) + if should_profile: + llm.stop_profile() + + print("Test generation successful!") + for output in outputs: + print(f"Prompt: {output.prompt}") + print(f"Output: " + f"{output.outputs[0].text}") + print("-" * 40) + + +@pytest.mark.parametrize( + "original_model_name,text_layers,num_experts,vision_layers,", + [("meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", 4, 4, 2)]) +@pytest.mark.parametrize("enforce_eager", [True, False]) +@pytest.mark.parametrize("tp,ep", [(2, True)]) +@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available") +def test_dummy_maverick( + original_model_name: str, + text_layers: int, + num_experts: int, + vision_layers: int, + enforce_eager: bool, + tp: int, + ep: bool, + output_dir: str = "/tmp/reduced_maverick", + force_recreate: bool = True, + profile: bool = False, +) -> None: + model_path = create_reduced_maverick_model( + original_model_name=original_model_name, + output_dir=output_dir, + text_layers=text_layers, + num_experts=num_experts, + vision_layers=vision_layers, + force_recreate=force_recreate, + ) + + print(f"\nReduced model created successfully at: {model_path}") + + run_reduced_model(model_path=model_path, + should_profile=profile, + enforce_eager=enforce_eager, + tensor_parallel_size=tp, + enable_expert_parallel=ep) + + +def main(): + """Main function to create and test the reduced model.""" + + import argparse + + parser = argparse.ArgumentParser( + description="Create a reduced-layer Maverick model") + parser.add_argument( + "--output-dir", + default="/tmp/reduced_maverick", + help="Output directory for the reduced model", + ) + parser.add_argument( + "--text-layers", + type=int, + default=4, + help="Number of text transformer layers", + ) + parser.add_argument("--num-experts", + type=int, + default=4, + help="Number of experts") + parser.add_argument( + "--vision-layers", + type=int, + default=2, + help="Number of vision transformer layers", + ) + parser.add_argument( + "--force-recreate", + action="store_true", + help="Force recreation if output directory exists", + ) + parser.add_argument("--test", + action="store_true", + help="Test the created model with vLLM") + parser.add_argument("--profile", + action="store_true", + help="Profile the created model with vLLM") + parser.add_argument( + "--test-original", + action="store_true", + help="Test the original model with vLLM", + ) + parser.add_argument( + "--original-model", + default="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", + help="Original model name to base the reduction on", + ) + + args = parser.parse_args() + + if args.test: + test_dummy_maverick(original_model_name=args.original_model, + 
output_dir=args.output_dir, + text_layers=args.text_layers, + num_experts=args.num_experts, + vision_layers=args.vision_layers, + force_recreate=args.force_recreate, + tp=2, + ep=True, + enforce_eager=True, + profile=args.profile) + + if args.test_original: + run_maverick_serving(args.original_model) + + +if __name__ == "__main__": + exit(main()) From cf48d01fbbc51e372806c66d77a7d00eb74495f8 Mon Sep 17 00:00:00 2001 From: Lucas Wilkinson Date: Mon, 21 Jul 2025 12:10:30 -0400 Subject: [PATCH 231/552] [Attention] Clean up iRoPE in V1 (#21188) Signed-off-by: Lucas Wilkinson Co-authored-by: Michael Goin Signed-off-by: x22x22 --- vllm/attention/layer.py | 7 +++++++ vllm/v1/attention/backends/cpu_attn.py | 5 ----- vllm/v1/attention/backends/flash_attn.py | 2 -- vllm/v1/attention/backends/flashinfer.py | 2 -- vllm/v1/attention/backends/pallas.py | 5 ----- vllm/v1/attention/backends/rocm_aiter_fa.py | 2 -- vllm/v1/attention/backends/triton_attn.py | 6 ------ vllm/v1/worker/gpu_model_runner.py | 7 +++---- vllm/v1/worker/tpu_model_runner.py | 4 ++++ 9 files changed, 14 insertions(+), 26 deletions(-) diff --git a/vllm/attention/layer.py b/vllm/attention/layer.py index 5d8ffb8e82d..1b80fa19d54 100644 --- a/vllm/attention/layer.py +++ b/vllm/attention/layer.py @@ -137,6 +137,13 @@ def __init__( self.num_kv_heads = num_kv_heads self.sliding_window = sliding_window + # For v1 we have backend agnostic iRoPE (local chunked attention) + # we have to store the flag on the layer so gpu model runner can + # set KVSpec appropriately (and pop it so it doesnt get passed to + # the backends) + if envs.VLLM_USE_V1: + self.use_irope = extra_impl_args.pop("use_irope", False) + quant_method = quant_config.get_quant_method( self, prefix=prefix) if quant_config else None if quant_method is not None and not isinstance( diff --git a/vllm/v1/attention/backends/cpu_attn.py b/vllm/v1/attention/backends/cpu_attn.py index 2efbe0de272..3b6d753863d 100644 --- a/vllm/v1/attention/backends/cpu_attn.py +++ b/vllm/v1/attention/backends/cpu_attn.py @@ -446,17 +446,12 @@ def __init__( logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, - use_irope: bool = False, ) -> None: if kv_sharing_target_layer_name is not None: raise NotImplementedError("KV sharing is not supported in V0.") if logits_soft_cap is not None: logger.warning_once("Torch SPDA does not support logits soft cap. 
" "Outputs may be slightly off.") - if use_irope: - logger.warning_once( - "Using irope in Torch SPDA is not supported yet, it will fall" - " back to global attention for long context.") self.paged_attn_impl = _get_paged_attn_impl() self.num_heads = num_heads self.head_size = head_size diff --git a/vllm/v1/attention/backends/flash_attn.py b/vllm/v1/attention/backends/flash_attn.py index ad414ee0a1f..5fe274f2c65 100755 --- a/vllm/v1/attention/backends/flash_attn.py +++ b/vllm/v1/attention/backends/flash_attn.py @@ -352,7 +352,6 @@ def __init__( logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[str] = None, - use_irope: bool = False, ) -> None: self.num_heads = num_heads self.head_size = head_size @@ -381,7 +380,6 @@ def __init__( "encoder/decoder cross-attention " "are not implemented for " "FlashAttentionImpl") - self.use_irope = use_irope self.vllm_flash_attn_version = get_flash_attn_version() if is_quantized_kv_cache(self.kv_cache_dtype) \ and not flash_attn_supports_fp8(): diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index e1ffa61a600..953ef26c814 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -493,7 +493,6 @@ def __init__( logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[int] = None, - use_irope: bool = False, ) -> None: self.num_heads = num_heads self.head_size = head_size @@ -509,7 +508,6 @@ def __init__( self.kv_cache_dtype = kv_cache_dtype self.logits_soft_cap = logits_soft_cap self.kv_sharing_target_layer_name = kv_sharing_target_layer_name - self.use_irope = use_irope self.num_queries_per_kv = self.num_heads // self.num_kv_heads diff --git a/vllm/v1/attention/backends/pallas.py b/vllm/v1/attention/backends/pallas.py index 9307cd937d5..9b122136afb 100644 --- a/vllm/v1/attention/backends/pallas.py +++ b/vllm/v1/attention/backends/pallas.py @@ -148,12 +148,7 @@ def __init__( logits_soft_cap: Optional[float] = None, attn_type: str = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[int] = None, - use_irope: bool = False, ) -> None: - if use_irope: - logger.warning_once( - "Using irope in Pallas is not supported yet, it will fall back " - "to global attention for long context.") self.num_heads = num_heads self.head_size = head_size self.scale = float(scale) diff --git a/vllm/v1/attention/backends/rocm_aiter_fa.py b/vllm/v1/attention/backends/rocm_aiter_fa.py index 8f756763944..0739d259667 100644 --- a/vllm/v1/attention/backends/rocm_aiter_fa.py +++ b/vllm/v1/attention/backends/rocm_aiter_fa.py @@ -337,7 +337,6 @@ def __init__( logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[int] = None, - use_irope: bool = False, ) -> None: self.num_heads = num_heads self.head_size = head_size @@ -367,7 +366,6 @@ def __init__( "encoder/decoder cross-attention " "are not implemented for " "FlashAttentionImpl") - self.use_irope = use_irope if is_quantized_kv_cache(self.kv_cache_dtype): raise NotImplementedError( "AiterFlashAttention does not support fp8 kv-cache on this " diff --git a/vllm/v1/attention/backends/triton_attn.py b/vllm/v1/attention/backends/triton_attn.py index d65ff5ff74e..83471ca51b7 100644 --- a/vllm/v1/attention/backends/triton_attn.py +++ b/vllm/v1/attention/backends/triton_attn.py @@ -72,9 +72,6 @@ def __init__(self, 
kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, vllm_config.parallel_config) self.headdim = model_config.get_head_size() - self.attention_chunk_size = getattr(vllm_config.scheduler_config, - 'attention_chunk_size', None) - def build_for_cudagraph_capture( self, common_attn_metadata: CommonAttentionMetadata ) -> TritonAttentionMetadata: @@ -208,7 +205,6 @@ def __init__( logits_soft_cap: Optional[float] = None, attn_type: AttentionType = AttentionType.DECODER, kv_sharing_target_layer_name: Optional[int] = None, - use_irope: bool = False, ) -> None: self.num_heads = num_heads self.head_size = head_size @@ -228,8 +224,6 @@ def __init__( self.logits_soft_cap = logits_soft_cap self.kv_sharing_target_layer_name = kv_sharing_target_layer_name - self.use_irope = use_irope - self.num_queries_per_kv = self.num_heads // self.num_kv_heads TritonAttentionBackend.validate_head_size(head_size) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index cd66d8bcd63..4c14ac3be3c 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -2702,8 +2702,7 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: # TODO: Support other attention modules, e.g., cross-attention if attn_module.attn_type == AttentionType.DECODER: use_local_attention = (self.attention_chunk_size is not None - and getattr(attn_module.impl, - "use_irope", False)) + and attn_module.use_irope) if attn_module.sliding_window is not None: kv_cache_spec[layer_name] = SlidingWindowSpec( block_size=block_size, @@ -2716,13 +2715,13 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: "attention module can not be with ", "both local attention and sliding window") elif use_local_attention: - kv_cache_spec[layer_name] = (ChunkedLocalAttentionSpec( + kv_cache_spec[layer_name] = ChunkedLocalAttentionSpec( block_size=block_size, num_kv_heads=attn_module.num_kv_heads, head_size=attn_module.head_size, dtype=self.kv_cache_dtype, attention_chunk_size=self.attention_chunk_size, - use_mla=use_mla)) + use_mla=use_mla) else: kv_cache_spec[layer_name] = FullAttentionSpec( block_size=block_size, diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index aad45b6abd1..31e9cff9124 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -519,6 +519,10 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: continue if attn_module.attn_type == AttentionType.DECODER: + if attn_module.use_irope: + logger.warning_once( + "Using irope in Pallas is not supported yet, it " + "will fall back to global attention for long context.") if attn_module.sliding_window is not None: kv_cache_spec[layer_name] = SlidingWindowSpec( block_size=block_size, From 83b9362c264482d6eeecacac982e705b6779296d Mon Sep 17 00:00:00 2001 From: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Date: Mon, 21 Jul 2025 12:11:35 -0400 Subject: [PATCH 232/552] [DP] Fix Prometheus Logging (#21257) Signed-off-by: Robert Shaw Co-authored-by: Robert Shaw Signed-off-by: x22x22 --- tests/v1/engine/test_async_llm.py | 7 +- tests/v1/test_async_llm_dp.py | 6 +- vllm/v1/engine/async_llm.py | 69 ++-- vllm/v1/engine/core_client.py | 9 +- vllm/v1/metrics/loggers.py | 541 +++++++++++++++++++----------- vllm/v1/metrics/ray_wrappers.py | 4 - 6 files changed, 378 insertions(+), 258 deletions(-) diff --git a/tests/v1/engine/test_async_llm.py b/tests/v1/engine/test_async_llm.py index e137452f262..412df3acff1 100644 --- a/tests/v1/engine/test_async_llm.py +++ 
b/tests/v1/engine/test_async_llm.py @@ -336,9 +336,10 @@ async def test_customize_loggers(monkeypatch): await engine.do_log_stats() - assert len(engine.stat_loggers) == 1 - assert len(engine.stat_loggers[0]) == 1 - engine.stat_loggers[0][0].log.assert_called_once() + stat_loggers = engine.logger_manager.per_engine_logger_dict + assert len(stat_loggers) == 1 + assert len(stat_loggers[0]) == 1 + stat_loggers[0][0].log.assert_called_once() @pytest.mark.asyncio(scope="module") diff --git a/tests/v1/test_async_llm_dp.py b/tests/v1/test_async_llm_dp.py index 64a41bec379..6716d27f571 100644 --- a/tests/v1/test_async_llm_dp.py +++ b/tests/v1/test_async_llm_dp.py @@ -90,8 +90,10 @@ class SimpleStatsLogger(StatLoggerBase): def __init__(self, vllm_config: VllmConfig, engine_index: int = 0): stats_loggers[engine_index] = self - def record(self, scheduler_stats: Optional[SchedulerStats], - iteration_stats: Optional[IterationStats]): + def record(self, + scheduler_stats: Optional[SchedulerStats], + iteration_stats: Optional[IterationStats], + engine_idx: int = 0): if iteration_stats: self.finished_req_count += len( iteration_stats.finished_requests) diff --git a/vllm/v1/engine/async_llm.py b/vllm/v1/engine/async_llm.py index 6395d2c1875..b8ba36f3502 100644 --- a/vllm/v1/engine/async_llm.py +++ b/vllm/v1/engine/async_llm.py @@ -36,10 +36,9 @@ from vllm.v1.engine.parallel_sampling import ParentRequest from vllm.v1.engine.processor import Processor from vllm.v1.executor.abstract import Executor -from vllm.v1.metrics.loggers import (StatLoggerBase, StatLoggerFactory, - setup_default_loggers) +from vllm.v1.metrics.loggers import StatLoggerFactory, StatLoggerManager from vllm.v1.metrics.prometheus import shutdown_prometheus -from vllm.v1.metrics.stats import IterationStats, SchedulerStats +from vllm.v1.metrics.stats import IterationStats logger = init_logger(__name__) @@ -95,14 +94,6 @@ def __init__( self.log_requests = log_requests self.log_stats = log_stats - # Set up stat loggers; independent set for each DP rank. - self.stat_loggers: list[list[StatLoggerBase]] = setup_default_loggers( - vllm_config=vllm_config, - log_stats=self.log_stats, - engine_num=vllm_config.parallel_config.data_parallel_size, - custom_stat_loggers=stat_loggers, - ) - # Tokenizer (+ ensure liveness if running in another process). self.tokenizer = init_tokenizer_from_configs( model_config=vllm_config.model_config, @@ -121,7 +112,6 @@ def __init__( log_stats=self.log_stats) # EngineCore (starts the engine in background process). - self.engine_core = EngineCoreClient.make_async_mp_client( vllm_config=vllm_config, executor_class=executor_class, @@ -129,9 +119,17 @@ def __init__( client_addresses=client_addresses, client_index=client_index, ) - if self.stat_loggers: - for stat_logger in self.stat_loggers[0]: - stat_logger.log_engine_initialized() + + # Loggers. + self.logger_manager: Optional[StatLoggerManager] = None + if self.log_stats: + self.logger_manager = StatLoggerManager( + vllm_config=vllm_config, + engine_idxs=self.engine_core.engine_ranks, + custom_stat_loggers=stat_loggers, + ) + self.logger_manager.log_engine_initialized() + self.output_handler: Optional[asyncio.Task] = None try: # Start output handler eagerly if we are in the asyncio eventloop. 
@@ -370,7 +368,7 @@ def _run_output_handler(self): engine_core = self.engine_core output_processor = self.output_processor log_stats = self.log_stats - stat_loggers = self.stat_loggers if log_stats else None + logger_manager = self.logger_manager async def output_handler(): try: @@ -410,9 +408,9 @@ async def output_handler(): # 4) Logging. # TODO(rob): make into a coroutine and launch it in # background thread once Prometheus overhead is non-trivial. - if stat_loggers: - AsyncLLM._record_stats( - stat_loggers[outputs.engine_index], + if logger_manager: + logger_manager.record( + engine_idx=outputs.engine_index, scheduler_stats=outputs.scheduler_stats, iteration_stats=iteration_stats, ) @@ -431,18 +429,6 @@ async def abort(self, request_id: str) -> None: if self.log_requests: logger.info("Aborted request %s.", request_id) - @staticmethod - def _record_stats( - stat_loggers: list[StatLoggerBase], - scheduler_stats: Optional[SchedulerStats], - iteration_stats: Optional[IterationStats], - ): - """static so that it can be used from the output_handler task - without a circular ref to AsyncLLM.""" - for stat_logger in stat_loggers: - stat_logger.record(scheduler_stats=scheduler_stats, - iteration_stats=iteration_stats) - async def encode( self, prompt: PromptType, @@ -547,9 +533,8 @@ async def do_log_stats( scheduler_outputs=None, model_output=None, ) -> None: - for loggers in self.stat_loggers: - for stat_logger in loggers: - stat_logger.log() + if self.logger_manager: + self.logger_manager.log() async def check_health(self) -> None: logger.debug("Called check_health.") @@ -653,18 +638,16 @@ async def scale_elastic_ep(self, new_data_parallel_size # recreate stat loggers - if new_data_parallel_size > old_data_parallel_size: - stat_loggers: list[list[StatLoggerBase]] = setup_default_loggers( + if new_data_parallel_size > old_data_parallel_size and self.log_stats: + # TODO(rob): fix this after talking with Ray team. + # This resets all the prometheus metrics since we + # unregister during initialization. Need to understand + # the intended behavior here better. + self.logger_manager = StatLoggerManager( vllm_config=self.vllm_config, - log_stats=self.log_stats, - engine_num=new_data_parallel_size, + engine_idxs=list(range(new_data_parallel_size)), custom_stat_loggers=None, ) - num_new_engines = len(stat_loggers) - len(self.stat_loggers) - self.stat_loggers.extend(stat_loggers[-num_new_engines:]) - else: - for _ in range(old_data_parallel_size - new_data_parallel_size): - self.stat_loggers.pop() @property def is_running(self) -> bool: diff --git a/vllm/v1/engine/core_client.py b/vllm/v1/engine/core_client.py index 82fc1fa9937..2ebb76a97eb 100644 --- a/vllm/v1/engine/core_client.py +++ b/vllm/v1/engine/core_client.py @@ -432,14 +432,15 @@ def __init__( external_dp_lb = parallel_config.data_parallel_external_lb offline_mode = parallel_config.data_parallel_rank_local is not None - engine_ranks = [dp_rank] if (offline_mode - or external_dp_lb) else range(dp_size) + self.engine_ranks = ([dp_rank] if + (offline_mode or external_dp_lb) else list( + range(dp_size))) assert parallel_config.data_parallel_size_local <= len( - engine_ranks) + self.engine_ranks) # ZMQ identity of each engine that this client will talk to. self.core_engines: list[EngineIdentity] = [ - index.to_bytes(2, "little") for index in engine_ranks + index.to_bytes(2, "little") for index in self.engine_ranks ] # Wait for ready messages from each engine on the input socket. 
diff --git a/vllm/v1/metrics/loggers.py b/vllm/v1/metrics/loggers.py index c720ca13e51..7f2556bab5a 100644 --- a/vllm/v1/metrics/loggers.py +++ b/vllm/v1/metrics/loggers.py @@ -4,7 +4,7 @@ import logging import time from abc import ABC, abstractmethod -from typing import Callable, Optional +from typing import Callable, Optional, Union import numpy as np import prometheus_client @@ -35,8 +35,10 @@ def __init__(self, vllm_config: VllmConfig, engine_index: int = 0): ... @abstractmethod - def record(self, scheduler_stats: Optional[SchedulerStats], - iteration_stats: Optional[IterationStats]): + def record(self, + scheduler_stats: Optional[SchedulerStats], + iteration_stats: Optional[IterationStats], + engine_idx: int = 0): ... @abstractmethod @@ -78,8 +80,10 @@ def _get_throughput(self, tracked_stats: list[int], now: float) -> float: # Compute summary metrics for tracked stats return float(np.sum(tracked_stats) / (now - self.last_log_time)) - def record(self, scheduler_stats: Optional[SchedulerStats], - iteration_stats: Optional[IterationStats]): + def record(self, + scheduler_stats: Optional[SchedulerStats], + iteration_stats: Optional[IterationStats], + engine_idx: int = 0): """Log Stats to standard output.""" if iteration_stats: @@ -146,233 +150,290 @@ class PrometheusStatLogger(StatLoggerBase): _histogram_cls = prometheus_client.Histogram _spec_decoding_cls = SpecDecodingProm - def __init__(self, vllm_config: VllmConfig, engine_index: int = 0): + def __init__(self, + vllm_config: VllmConfig, + engine_indexes: Optional[list[int]] = None): + if engine_indexes is None: + engine_indexes = [0] + self.engine_indexes = engine_indexes unregister_vllm_metrics() self.vllm_config = vllm_config - self.engine_index = engine_index # Use this flag to hide metrics that were deprecated in # a previous release and which will be removed future self.show_hidden_metrics = \ vllm_config.observability_config.show_hidden_metrics labelnames = ["model_name", "engine"] - labelvalues = [ - vllm_config.model_config.served_model_name, - str(engine_index) - ] - + model_name = vllm_config.model_config.served_model_name max_model_len = vllm_config.model_config.max_model_len + if (len(self.engine_indexes) > 1 + and vllm_config.speculative_config is not None): + raise NotImplementedError("Prometheus metrics with Spec Decoding " + "with >1 EngineCore per AsyncLLM is not " + "supported yet.") + spec_decode_labelvalues = [ + vllm_config.model_config.served_model_name, + str(self.engine_indexes[0]) + ] self.spec_decoding_prom = self._spec_decoding_cls( - vllm_config.speculative_config, labelnames, labelvalues) + vllm_config.speculative_config, labelnames, + spec_decode_labelvalues) # # Scheduler state # - self.gauge_scheduler_running = self._gauge_cls( + gauge_scheduler_running = self._gauge_cls( name="vllm:num_requests_running", documentation="Number of requests in model execution batches.", multiprocess_mode="mostrecent", - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.gauge_scheduler_running = make_per_engine(gauge_scheduler_running, + engine_indexes, + model_name) - self.gauge_scheduler_waiting = self._gauge_cls( + gauge_scheduler_waiting = self._gauge_cls( name="vllm:num_requests_waiting", documentation="Number of requests waiting to be processed.", multiprocess_mode="mostrecent", - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.gauge_scheduler_waiting = make_per_engine(gauge_scheduler_waiting, + engine_indexes, + model_name) # # GPU cache # # Deprecated in 0.9 - 
Renamed as vllm:kv_cache_usage_perc # TODO: in 0.10, only enable if show_hidden_metrics=True - self.gauge_gpu_cache_usage = self._gauge_cls( + gauge_gpu_cache_usage = self._gauge_cls( name="vllm:gpu_cache_usage_perc", documentation=( "GPU KV-cache usage. 1 means 100 percent usage." "DEPRECATED: Use vllm:kv_cache_usage_perc instead."), multiprocess_mode="mostrecent", - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.gauge_gpu_cache_usage = make_per_engine(gauge_gpu_cache_usage, + engine_indexes, + model_name) # Deprecated in 0.9 - Renamed as vllm:prefix_cache_queries # TODO: in 0.10, only enable if show_hidden_metrics=True - self.counter_gpu_prefix_cache_queries = self._counter_cls( + counter_gpu_prefix_cache_queries = self._counter_cls( name="vllm:gpu_prefix_cache_queries", - documentation= - ("GPU prefix cache queries, in terms of number of queried tokens." - "DEPRECATED: Use vllm:prefix_cache_queries instead."), - labelnames=labelnames).labels(*labelvalues) + documentation=( + "GPU prefix cache queries, in terms of number of queried" + "tokens. DEPRECATED: Use vllm:prefix_cache_queries instead."), + labelnames=labelnames) + self.counter_gpu_prefix_cache_queries = make_per_engine( + counter_gpu_prefix_cache_queries, engine_indexes, model_name) # Deprecated in 0.9 - Renamed as vllm:prefix_cache_hits # TODO: in 0.10, only enable if show_hidden_metrics=True - self.counter_gpu_prefix_cache_hits = self._counter_cls( + counter_gpu_prefix_cache_hits = self._counter_cls( name="vllm:gpu_prefix_cache_hits", documentation=( - "GPU prefix cache hits, in terms of number of cached tokens." - "DEPRECATED: Use vllm:prefix_cache_hits instead."), - labelnames=labelnames).labels(*labelvalues) + "GPU prefix cache hits, in terms of number of cached " + "tokens. DEPRECATED: Use vllm:prefix_cache_hits instead."), + labelnames=labelnames) + self.counter_gpu_prefix_cache_hits = make_per_engine( + counter_gpu_prefix_cache_hits, engine_indexes, model_name) - self.gauge_kv_cache_usage = self._gauge_cls( + gauge_kv_cache_usage = self._gauge_cls( name="vllm:kv_cache_usage_perc", documentation="KV-cache usage. 
1 means 100 percent usage.", - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.gauge_kv_cache_usage = make_per_engine(gauge_kv_cache_usage, + engine_indexes, model_name) - self.counter_prefix_cache_queries = self._counter_cls( + counter_prefix_cache_queries = self._counter_cls( name="vllm:prefix_cache_queries", documentation=( "Prefix cache queries, in terms of number of queried tokens."), - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.counter_prefix_cache_queries = make_per_engine( + counter_prefix_cache_queries, engine_indexes, model_name) - self.counter_prefix_cache_hits = self._counter_cls( + counter_prefix_cache_hits = self._counter_cls( name="vllm:prefix_cache_hits", documentation=( "Prefix cache hits, in terms of number of cached tokens."), - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.counter_prefix_cache_hits = make_per_engine( + counter_prefix_cache_hits, engine_indexes, model_name) # # Counters # - self.counter_num_preempted_reqs = self._counter_cls( + counter_num_preempted_reqs = self._counter_cls( name="vllm:num_preemptions", documentation="Cumulative number of preemption from the engine.", - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.counter_num_preempted_reqs = make_per_engine( + counter_num_preempted_reqs, engine_indexes, model_name) - self.counter_prompt_tokens = self._counter_cls( + counter_prompt_tokens = self._counter_cls( name="vllm:prompt_tokens", documentation="Number of prefill tokens processed.", - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.counter_prompt_tokens = make_per_engine(counter_prompt_tokens, + engine_indexes, + model_name) - self.counter_generation_tokens = self._counter_cls( + counter_generation_tokens = self._counter_cls( name="vllm:generation_tokens", documentation="Number of generation tokens processed.", - labelnames=labelnames).labels(*labelvalues) + labelnames=labelnames) + self.counter_generation_tokens = make_per_engine( + counter_generation_tokens, engine_indexes, model_name) - self.counter_request_success: dict[FinishReason, - prometheus_client.Counter] = {} + self.counter_request_success: dict[FinishReason, dict[ + int, prometheus_client.Counter]] = {} counter_request_success_base = self._counter_cls( name="vllm:request_success", documentation="Count of successfully processed requests.", labelnames=labelnames + ["finished_reason"]) for reason in FinishReason: - self.counter_request_success[ - reason] = counter_request_success_base.labels(*(labelvalues + - [str(reason)])) + self.counter_request_success[reason] = { + idx: + counter_request_success_base.labels(model_name, str(idx), + str(reason)) + for idx in engine_indexes + } # # Histograms of counts # - self.histogram_num_prompt_tokens_request = \ - self._histogram_cls( - name="vllm:request_prompt_tokens", - documentation="Number of prefill tokens processed.", - buckets=build_1_2_5_buckets(max_model_len), - labelnames=labelnames).labels(*labelvalues) - - self.histogram_num_generation_tokens_request = \ - self._histogram_cls( - name="vllm:request_generation_tokens", - documentation="Number of generation tokens processed.", - buckets=build_1_2_5_buckets(max_model_len), - labelnames=labelnames).labels(*labelvalues) + histogram_num_prompt_tokens_request = self._histogram_cls( + name="vllm:request_prompt_tokens", + documentation="Number of prefill tokens processed.", + buckets=build_1_2_5_buckets(max_model_len), + labelnames=labelnames) + 
self.histogram_num_prompt_tokens_request = make_per_engine( + histogram_num_prompt_tokens_request, engine_indexes, model_name) + + histogram_num_generation_tokens_request = self._histogram_cls( + name="vllm:request_generation_tokens", + documentation="Number of generation tokens processed.", + buckets=build_1_2_5_buckets(max_model_len), + labelnames=labelnames) + self.histogram_num_generation_tokens_request = make_per_engine( + histogram_num_generation_tokens_request, engine_indexes, + model_name) # TODO: This metric might be incorrect in case of using multiple # api_server counts which uses prometheus mp. # See: https://github.com/vllm-project/vllm/pull/18053 - self.histogram_iteration_tokens = \ - self._histogram_cls( - name="vllm:iteration_tokens_total", - documentation="Histogram of number of tokens per engine_step.", - buckets=[ - 1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, - 16384 - ], - labelnames=labelnames).labels(*labelvalues) - - self.histogram_max_num_generation_tokens_request = \ - self._histogram_cls( - name="vllm:request_max_num_generation_tokens", - documentation= - "Histogram of maximum number of requested generation tokens.", - buckets=build_1_2_5_buckets(max_model_len), - labelnames=labelnames).labels(*labelvalues) - - self.histogram_n_request = \ - self._histogram_cls( - name="vllm:request_params_n", - documentation="Histogram of the n request parameter.", - buckets=[1, 2, 5, 10, 20], - labelnames=labelnames).labels(*labelvalues) - - self.histogram_max_tokens_request = \ - self._histogram_cls( - name="vllm:request_params_max_tokens", - documentation="Histogram of the max_tokens request parameter.", - buckets=build_1_2_5_buckets(max_model_len), - labelnames=labelnames).labels(*labelvalues) + histogram_iteration_tokens = self._histogram_cls( + name="vllm:iteration_tokens_total", + documentation="Histogram of number of tokens per engine_step.", + buckets=[ + 1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 + ], + labelnames=labelnames) + self.histogram_iteration_tokens = make_per_engine( + histogram_iteration_tokens, engine_indexes, model_name) + + histogram_max_num_generation_tokens_request = self._histogram_cls( + name="vllm:request_max_num_generation_tokens", + documentation= + "Histogram of maximum number of requested generation tokens.", + buckets=build_1_2_5_buckets(max_model_len), + labelnames=labelnames) + self.histogram_max_num_generation_tokens_request = make_per_engine( + histogram_max_num_generation_tokens_request, engine_indexes, + model_name) + + histogram_n_request = self._histogram_cls( + name="vllm:request_params_n", + documentation="Histogram of the n request parameter.", + buckets=[1, 2, 5, 10, 20], + labelnames=labelnames) + self.histogram_n_request = make_per_engine(histogram_n_request, + engine_indexes, model_name) + + histogram_max_tokens_request = self._histogram_cls( + name="vllm:request_params_max_tokens", + documentation="Histogram of the max_tokens request parameter.", + buckets=build_1_2_5_buckets(max_model_len), + labelnames=labelnames) + self.histogram_max_tokens_request = make_per_engine( + histogram_max_tokens_request, engine_indexes, model_name) # # Histogram of timing intervals # - self.histogram_time_to_first_token = \ - self._histogram_cls( - name="vllm:time_to_first_token_seconds", - documentation="Histogram of time to first token in seconds.", - buckets=[ - 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, - 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, - 640.0, 2560.0 - ], - 
labelnames=labelnames).labels(*labelvalues) - - self.histogram_time_per_output_token = \ - self._histogram_cls( - name="vllm:time_per_output_token_seconds", - documentation="Histogram of time per output token in seconds.", - buckets=[ - 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, - 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0 - ], - labelnames=labelnames).labels(*labelvalues) + histogram_time_to_first_token = self._histogram_cls( + name="vllm:time_to_first_token_seconds", + documentation="Histogram of time to first token in seconds.", + buckets=[ + 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, + 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, + 2560.0 + ], + labelnames=labelnames) + self.histogram_time_to_first_token = make_per_engine( + histogram_time_to_first_token, engine_indexes, model_name) + + histogram_time_per_output_token = self._histogram_cls( + name="vllm:time_per_output_token_seconds", + documentation="Histogram of time per output token in seconds.", + buckets=[ + 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, + 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0 + ], + labelnames=labelnames) + self.histogram_time_per_output_token = make_per_engine( + histogram_time_per_output_token, engine_indexes, model_name) request_latency_buckets = [ 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0 ] - self.histogram_e2e_time_request = \ - self._histogram_cls( - name="vllm:e2e_request_latency_seconds", - documentation="Histogram of e2e request latency in seconds.", - buckets=request_latency_buckets, - labelnames=labelnames).labels(*labelvalues) - self.histogram_queue_time_request = \ - self._histogram_cls( - name="vllm:request_queue_time_seconds", - documentation= - "Histogram of time spent in WAITING phase for request.", - buckets=request_latency_buckets, - labelnames=labelnames).labels(*labelvalues) - self.histogram_inference_time_request = \ - self._histogram_cls( - name="vllm:request_inference_time_seconds", - documentation= - "Histogram of time spent in RUNNING phase for request.", - buckets=request_latency_buckets, - labelnames=labelnames).labels(*labelvalues) - self.histogram_prefill_time_request = \ - self._histogram_cls( - name="vllm:request_prefill_time_seconds", - documentation= - "Histogram of time spent in PREFILL phase for request.", - buckets=request_latency_buckets, - labelnames=labelnames).labels(*labelvalues) - self.histogram_decode_time_request = \ - self._histogram_cls( - name="vllm:request_decode_time_seconds", - documentation= - "Histogram of time spent in DECODE phase for request.", - buckets=request_latency_buckets, - labelnames=labelnames).labels(*labelvalues) + histogram_e2e_time_request = self._histogram_cls( + name="vllm:e2e_request_latency_seconds", + documentation="Histogram of e2e request latency in seconds.", + buckets=request_latency_buckets, + labelnames=labelnames) + self.histogram_e2e_time_request = make_per_engine( + histogram_e2e_time_request, engine_indexes, model_name) + + histogram_queue_time_request = self._histogram_cls( + name="vllm:request_queue_time_seconds", + documentation= + "Histogram of time spent in WAITING phase for request.", + buckets=request_latency_buckets, + labelnames=labelnames) + self.histogram_queue_time_request = make_per_engine( + histogram_queue_time_request, engine_indexes, model_name) + + histogram_inference_time_request = self._histogram_cls( + name="vllm:request_inference_time_seconds", + 
documentation= + "Histogram of time spent in RUNNING phase for request.", + buckets=request_latency_buckets, + labelnames=labelnames) + self.histogram_inference_time_request = make_per_engine( + histogram_inference_time_request, engine_indexes, model_name) + + histogram_prefill_time_request = self._histogram_cls( + name="vllm:request_prefill_time_seconds", + documentation= + "Histogram of time spent in PREFILL phase for request.", + buckets=request_latency_buckets, + labelnames=labelnames) + self.histogram_prefill_time_request = make_per_engine( + histogram_prefill_time_request, engine_indexes, model_name) + + histogram_decode_time_request = self._histogram_cls( + name="vllm:request_decode_time_seconds", + documentation= + "Histogram of time spent in DECODE phase for request.", + buckets=request_latency_buckets, + labelnames=labelnames) + self.histogram_decode_time_request = make_per_engine( + histogram_decode_time_request, engine_indexes, model_name) # # LoRA metrics @@ -382,6 +443,9 @@ def __init__(self, vllm_config: VllmConfig, engine_index: int = 0): # api_server counts which uses prometheus mp. self.gauge_lora_info: Optional[prometheus_client.Gauge] = None if vllm_config.lora_config is not None: + if len(self.engine_indexes) > 1: + raise NotImplementedError( + "LoRA in DP mode is not supported yet.") self.labelname_max_lora = "max_lora" self.labelname_waiting_lora_adapters = "waiting_lora_adapters" self.labelname_running_lora_adapters = "running_lora_adapters" @@ -399,9 +463,8 @@ def __init__(self, vllm_config: VllmConfig, engine_index: int = 0): ) def log_metrics_info(self, type: str, config_obj: SupportsMetricsInfo): - metrics_info = config_obj.metrics_info() - metrics_info["engine"] = self.engine_index + metrics_info["engine"] = "" name, documentation = None, None if type == "cache_config": @@ -417,27 +480,36 @@ def log_metrics_info(self, type: str, config_obj: SupportsMetricsInfo): documentation=documentation, multiprocess_mode="mostrecent", labelnames=metrics_info.keys(), - ).labels(**metrics_info) - info_gauge.set(1) - - def record(self, scheduler_stats: Optional[SchedulerStats], - iteration_stats: Optional[IterationStats]): + ) + for engine_index in self.engine_indexes: + metrics_info = config_obj.metrics_info() + metrics_info["engine"] = str(engine_index) + info_gauge.labels(**metrics_info).set(1) + + def record(self, + scheduler_stats: Optional[SchedulerStats], + iteration_stats: Optional[IterationStats], + engine_idx: int = 0): """Log to prometheus.""" if scheduler_stats is not None: - self.gauge_scheduler_running.set(scheduler_stats.num_running_reqs) - self.gauge_scheduler_waiting.set(scheduler_stats.num_waiting_reqs) + self.gauge_scheduler_running[engine_idx].set( + scheduler_stats.num_running_reqs) + self.gauge_scheduler_waiting[engine_idx].set( + scheduler_stats.num_waiting_reqs) - self.gauge_gpu_cache_usage.set(scheduler_stats.kv_cache_usage) - self.gauge_kv_cache_usage.set(scheduler_stats.kv_cache_usage) + self.gauge_gpu_cache_usage[engine_idx].set( + scheduler_stats.kv_cache_usage) + self.gauge_kv_cache_usage[engine_idx].set( + scheduler_stats.kv_cache_usage) - self.counter_gpu_prefix_cache_queries.inc( + self.counter_gpu_prefix_cache_queries[engine_idx].inc( scheduler_stats.prefix_cache_stats.queries) - self.counter_gpu_prefix_cache_hits.inc( + self.counter_gpu_prefix_cache_hits[engine_idx].inc( scheduler_stats.prefix_cache_stats.hits) - self.counter_prefix_cache_queries.inc( + self.counter_prefix_cache_queries[engine_idx].inc( 
scheduler_stats.prefix_cache_stats.queries) - self.counter_prefix_cache_hits.inc( + self.counter_prefix_cache_hits[engine_idx].inc( scheduler_stats.prefix_cache_stats.hits) if scheduler_stats.spec_decoding_stats is not None: @@ -447,42 +519,45 @@ def record(self, scheduler_stats: Optional[SchedulerStats], if iteration_stats is None: return - self.counter_num_preempted_reqs.inc(iteration_stats.num_preempted_reqs) - self.counter_prompt_tokens.inc(iteration_stats.num_prompt_tokens) - self.counter_generation_tokens.inc( + self.counter_num_preempted_reqs[engine_idx].inc( + iteration_stats.num_preempted_reqs) + self.counter_prompt_tokens[engine_idx].inc( + iteration_stats.num_prompt_tokens) + self.counter_generation_tokens[engine_idx].inc( iteration_stats.num_generation_tokens) - self.histogram_iteration_tokens.observe( + self.histogram_iteration_tokens[engine_idx].observe( iteration_stats.num_prompt_tokens + \ iteration_stats.num_generation_tokens) for max_gen_tokens in iteration_stats.max_num_generation_tokens_iter: - self.histogram_max_num_generation_tokens_request.observe( - max_gen_tokens) + self.histogram_max_num_generation_tokens_request[ + engine_idx].observe(max_gen_tokens) for n_param in iteration_stats.n_params_iter: - self.histogram_n_request.observe(n_param) + self.histogram_n_request[engine_idx].observe(n_param) for ttft in iteration_stats.time_to_first_tokens_iter: - self.histogram_time_to_first_token.observe(ttft) + self.histogram_time_to_first_token[engine_idx].observe(ttft) for tpot in iteration_stats.time_per_output_tokens_iter: - self.histogram_time_per_output_token.observe(tpot) + self.histogram_time_per_output_token[engine_idx].observe(tpot) for finished_request in iteration_stats.finished_requests: - self.counter_request_success[finished_request.finish_reason].inc() - self.histogram_e2e_time_request.observe( + self.counter_request_success[ + finished_request.finish_reason][engine_idx].inc() + self.histogram_e2e_time_request[engine_idx].observe( finished_request.e2e_latency) - self.histogram_queue_time_request.observe( + self.histogram_queue_time_request[engine_idx].observe( finished_request.queued_time) - self.histogram_prefill_time_request.observe( + self.histogram_prefill_time_request[engine_idx].observe( finished_request.prefill_time) - self.histogram_inference_time_request.observe( + self.histogram_inference_time_request[engine_idx].observe( finished_request.inference_time) - self.histogram_decode_time_request.observe( + self.histogram_decode_time_request[engine_idx].observe( finished_request.decode_time) - self.histogram_num_prompt_tokens_request.observe( + self.histogram_num_prompt_tokens_request[engine_idx].observe( finished_request.num_prompt_tokens) - self.histogram_num_generation_tokens_request.observe( + self.histogram_num_generation_tokens_request[engine_idx].observe( finished_request.num_generation_tokens) if finished_request.max_tokens_param: - self.histogram_max_tokens_request.observe( + self.histogram_max_tokens_request[engine_idx].observe( finished_request.max_tokens_param) if self.gauge_lora_info is not None: @@ -502,6 +577,18 @@ def log_engine_initialized(self): self.log_metrics_info("cache_config", self.vllm_config.cache_config) +PromMetric = Union[ + prometheus_client.Gauge, + prometheus_client.Counter, + prometheus_client.Histogram, +] + + +def make_per_engine(metric: PromMetric, engine_idxs: list[int], + model_name: str) -> dict[int, PromMetric]: + return {idx: metric.labels(model_name, str(idx)) for idx in engine_idxs} + + def 
build_buckets(mantissa_lst: list[int], max_value: int) -> list[int]: """ Builds a list of buckets with increasing powers of 10 multiplied by @@ -529,29 +616,79 @@ def build_1_2_5_buckets(max_value: int) -> list[int]: return build_buckets([1, 2, 5], max_value) -def setup_default_loggers( - vllm_config: VllmConfig, - log_stats: bool, - engine_num: int, - custom_stat_loggers: Optional[list[StatLoggerFactory]] = None, -) -> list[list[StatLoggerBase]]: - """Setup logging and prometheus metrics.""" - if not log_stats: - return [] - - factories: list[StatLoggerFactory] - if custom_stat_loggers is not None: - factories = custom_stat_loggers - else: - factories = [PrometheusStatLogger] - if logger.isEnabledFor(logging.INFO): - factories.append(LoggingStatLogger) - - stat_loggers: list[list[StatLoggerBase]] = [] - for i in range(engine_num): - per_engine_stat_loggers: list[StatLoggerBase] = [] - for logger_factory in factories: - per_engine_stat_loggers.append(logger_factory(vllm_config, i)) - stat_loggers.append(per_engine_stat_loggers) - - return stat_loggers +class StatLoggerManager: + """ + StatLoggerManager: + Logging happens at the level of the EngineCore (per scheduler). + * DP: >1 EngineCore per AsyncLLM - loggers for each EngineCore. + * With Local Logger, just make N copies for N EngineCores. + * With Prometheus, we need a single logger with N "labels" + + This class abstracts away this implementation detail from + the AsyncLLM, allowing the AsyncLLM to just call .record() + and .log() to a simple interface. + """ + + def __init__( + self, + vllm_config: VllmConfig, + engine_idxs: Optional[list[int]] = None, + custom_stat_loggers: Optional[list[StatLoggerFactory]] = None, + ): + self.engine_idxs = engine_idxs if engine_idxs else [0] + + factories: list[StatLoggerFactory] + if custom_stat_loggers is not None: + factories = custom_stat_loggers + else: + factories = [] + if logger.isEnabledFor(logging.INFO): + factories.append(LoggingStatLogger) + + # engine_idx: StatLogger + self.per_engine_logger_dict: dict[int, list[StatLoggerBase]] = {} + prometheus_factory = PrometheusStatLogger + for engine_idx in self.engine_idxs: + loggers: list[StatLoggerBase] = [] + for logger_factory in factories: + # If we get a custom prometheus logger, use that + # instead. This is typically used for the ray case. + if (isinstance(logger_factory, type) + and issubclass(logger_factory, PrometheusStatLogger)): + prometheus_factory = logger_factory + continue + loggers.append(logger_factory(vllm_config, + engine_idx)) # type: ignore + self.per_engine_logger_dict[engine_idx] = loggers + + # For Prometheus, need to share the metrics between EngineCores. + # Each EngineCore's metrics are expressed as a unique label. 
+ self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs) + + def record( + self, + scheduler_stats: Optional[SchedulerStats], + iteration_stats: Optional[IterationStats], + engine_idx: Optional[int] = None, + ): + if engine_idx is None: + engine_idx = 0 + + per_engine_loggers = self.per_engine_logger_dict[engine_idx] + for logger in per_engine_loggers: + logger.record(scheduler_stats, iteration_stats, engine_idx) + + self.prometheus_logger.record(scheduler_stats, iteration_stats, + engine_idx) + + def log(self): + for per_engine_loggers in self.per_engine_logger_dict.values(): + for logger in per_engine_loggers: + logger.log() + + def log_engine_initialized(self): + self.prometheus_logger.log_engine_initialized() + + for per_engine_loggers in self.per_engine_logger_dict.values(): + for logger in per_engine_loggers: + logger.log_engine_initialized() diff --git a/vllm/v1/metrics/ray_wrappers.py b/vllm/v1/metrics/ray_wrappers.py index 8384310062d..ae8f9447e9c 100644 --- a/vllm/v1/metrics/ray_wrappers.py +++ b/vllm/v1/metrics/ray_wrappers.py @@ -3,7 +3,6 @@ import time from typing import Optional, Union -from vllm.config import VllmConfig from vllm.v1.metrics.loggers import PrometheusStatLogger from vllm.v1.spec_decode.metrics import SpecDecodingProm @@ -128,9 +127,6 @@ class RayPrometheusStatLogger(PrometheusStatLogger): _histogram_cls = RayHistogramWrapper _spec_decoding_cls = RaySpecDecodingProm - def __init__(self, vllm_config: VllmConfig, engine_index: int = 0): - super().__init__(vllm_config, engine_index) - @staticmethod def _unregister_vllm_metrics(): # No-op on purpose From 69a9185699b2795bdc117a56fe8b5998f476bc34 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Mon, 21 Jul 2025 13:47:51 -0400 Subject: [PATCH 233/552] Fix bad lm-eval fork (#21318) Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index 114c48dba53..c476f71c663 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -273,7 +273,7 @@ steps: # VLLM_USE_FLASHINFER_SAMPLER or not on H100. - pytest -v -s v1/e2e # Integration test for streaming correctness (requires special branch). 
- - pip install -U git+https://github.com/robertgshaw2-neuralmagic/lm-evaluation-harness.git@streaming-api + - pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api - pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine - label: Examples Test # 25min From 8944e2375be4862e3ef48d39670e0f8c5d9e8bb5 Mon Sep 17 00:00:00 2001 From: Himanshu Jaju Date: Mon, 21 Jul 2025 19:19:23 +0100 Subject: [PATCH 234/552] [perf] Speed up align sum kernels (#21079) Signed-off-by: Himanshu Jaju Signed-off-by: x22x22 --- .../kernels/benchmark_moe_align_block_size.py | 7 +- csrc/moe/moe_align_sum_kernels.cu | 71 ++++++++++++++----- .../layers/fused_moe/moe_align_block_size.py | 7 +- 3 files changed, 60 insertions(+), 25 deletions(-) diff --git a/benchmarks/kernels/benchmark_moe_align_block_size.py b/benchmarks/kernels/benchmark_moe_align_block_size.py index 5170ac09dc4..1af5a21caf4 100644 --- a/benchmarks/kernels/benchmark_moe_align_block_size.py +++ b/benchmarks/kernels/benchmark_moe_align_block_size.py @@ -33,15 +33,13 @@ def check_correctness(num_tokens, num_experts=256, block_size=256, topk=8): sorted_ids_triton = torch.empty( (max_num_tokens_padded,), dtype=torch.int32, device="cuda" ) - sorted_ids_triton.fill_(topk_ids.numel()) # fill with sentinel value - expert_ids_triton = torch.zeros( + expert_ids_triton = torch.empty( (max_num_tokens_padded // block_size,), dtype=torch.int32, device="cuda" ) num_tokens_post_pad_triton = torch.empty((1,), dtype=torch.int32, device="cuda") sorted_ids_vllm = torch.empty_like(sorted_ids_triton) - sorted_ids_vllm.fill_(topk_ids.numel()) - expert_ids_vllm = torch.zeros_like(expert_ids_triton) + expert_ids_vllm = torch.empty_like(expert_ids_triton) num_tokens_post_pad_vllm = torch.empty_like(num_tokens_post_pad_triton) # 2. 
run implementations @@ -102,7 +100,6 @@ def benchmark(num_tokens, num_experts, topk, provider): max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1) sorted_ids = torch.empty((max_num_tokens_padded,), dtype=torch.int32, device="cuda") - sorted_ids.fill_(topk_ids.numel()) max_num_m_blocks = max_num_tokens_padded // block_size expert_ids = torch.empty((max_num_m_blocks,), dtype=torch.int32, device="cuda") num_tokens_post_pad = torch.empty((1,), dtype=torch.int32, device="cuda") diff --git a/csrc/moe/moe_align_sum_kernels.cu b/csrc/moe/moe_align_sum_kernels.cu index 462dbd1f8b3..8bbcf5a673f 100644 --- a/csrc/moe/moe_align_sum_kernels.cu +++ b/csrc/moe/moe_align_sum_kernels.cu @@ -1,6 +1,7 @@ #include #include #include +#include #include #include @@ -19,9 +20,14 @@ __global__ void moe_align_block_size_kernel( int32_t* __restrict__ sorted_token_ids, int32_t* __restrict__ expert_ids, int32_t* __restrict__ total_tokens_post_pad, int32_t num_experts, int32_t padded_num_experts, int32_t experts_per_warp, int32_t block_size, - size_t numel, int32_t* __restrict__ cumsum) { + size_t numel, int32_t* __restrict__ cumsum, int32_t max_num_tokens_padded) { extern __shared__ int32_t shared_counts[]; + // Initialize sorted_token_ids with numel + for (size_t it = threadIdx.x; it < max_num_tokens_padded; it += blockDim.x) { + sorted_token_ids[it] = numel; + } + const int warp_id = threadIdx.x / WARP_SIZE; const int my_expert_start = warp_id * experts_per_warp; @@ -45,18 +51,27 @@ __global__ void moe_align_block_size_kernel( __syncthreads(); - if (threadIdx.x == 0) { - cumsum[0] = 0; - for (int i = 1; i <= num_experts; ++i) { - int expert_count = 0; - int warp_idx = (i - 1) / experts_per_warp; - int expert_offset = (i - 1) % experts_per_warp; - expert_count = shared_counts[warp_idx * experts_per_warp + expert_offset]; + // Compute prefix sum over token counts per expert + using BlockScan = cub::BlockScan; + __shared__ typename BlockScan::TempStorage temp_storage; - cumsum[i] = - cumsum[i - 1] + CEILDIV(expert_count, block_size) * block_size; - } - *total_tokens_post_pad = cumsum[num_experts]; + int expert_count = 0; + int expert_id = threadIdx.x; + if (expert_id < num_experts) { + int warp_idx = expert_id / experts_per_warp; + int expert_offset = expert_id % experts_per_warp; + expert_count = shared_counts[warp_idx * experts_per_warp + expert_offset]; + expert_count = CEILDIV(expert_count, block_size) * block_size; + } + + int cumsum_val; + BlockScan(temp_storage).ExclusiveSum(expert_count, cumsum_val); + if (expert_id <= num_experts) { + cumsum[expert_id] = cumsum_val; + } + + if (expert_id == num_experts) { + *total_tokens_post_pad = cumsum_val; } __syncthreads(); @@ -67,6 +82,13 @@ __global__ void moe_align_block_size_kernel( expert_ids[i / block_size] = threadIdx.x; } } + + // Fill remaining expert_ids with 0 + const size_t fill_start_idx = cumsum[num_experts] / block_size + threadIdx.x; + const size_t expert_ids_size = CEILDIV(max_num_tokens_padded, block_size); + for (size_t i = fill_start_idx; i < expert_ids_size; i += blockDim.x) { + expert_ids[i] = 0; + } } template @@ -105,7 +127,12 @@ __global__ void moe_align_block_size_small_batch_expert_kernel( const scalar_t* __restrict__ topk_ids, int32_t* __restrict__ sorted_token_ids, int32_t* __restrict__ expert_ids, int32_t* __restrict__ total_tokens_post_pad, int32_t num_experts, - int32_t block_size, size_t numel) { + int32_t block_size, size_t numel, int32_t max_num_tokens_padded) { + // Initialize sorted_token_ids with numel + for 
(size_t it = threadIdx.x; it < max_num_tokens_padded; it += blockDim.x) { + sorted_token_ids[it] = numel; + } + const size_t tid = threadIdx.x; const size_t stride = blockDim.x; @@ -153,6 +180,13 @@ __global__ void moe_align_block_size_small_batch_expert_kernel( } } + // Fill remaining expert_ids with 0 + const size_t fill_start_idx = cumsum[num_experts] / block_size + threadIdx.x; + const size_t expert_ids_size = CEILDIV(max_num_tokens_padded, block_size); + for (size_t i = fill_start_idx; i < expert_ids_size; i += blockDim.x) { + expert_ids[i] = 0; + } + for (size_t i = tid; i < numel; i += stride) { int32_t expert_id = topk_ids[i]; int32_t rank_post_pad = @@ -179,13 +213,17 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts, int threads = 1024; threads = ((threads + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE; + // BlockScan uses 1024 threads and assigns one thread per expert. + TORCH_CHECK(padded_num_experts < 1024, + "padded_num_experts must be less than 1024"); + VLLM_DISPATCH_INTEGRAL_AND_UNSIGNED_TYPES( topk_ids.scalar_type(), "moe_align_block_size_kernel", [&] { // calc needed amount of shared mem for `cumsum` tensors auto options_int = torch::TensorOptions().dtype(torch::kInt).device(topk_ids.device()); torch::Tensor cumsum_buffer = - torch::zeros({num_experts + 1}, options_int); + torch::empty({num_experts + 1}, options_int); bool small_batch_expert_mode = (topk_ids.numel() < 1024) && (num_experts <= 64); @@ -203,7 +241,7 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts, sorted_token_ids.data_ptr(), experts_ids.data_ptr(), num_tokens_post_pad.data_ptr(), num_experts, block_size, - topk_ids.numel()); + topk_ids.numel(), sorted_token_ids.size(0)); } else { auto align_kernel = vllm::moe::moe_align_block_size_kernel; @@ -217,7 +255,8 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts, experts_ids.data_ptr(), num_tokens_post_pad.data_ptr(), num_experts, padded_num_experts, experts_per_warp, block_size, - topk_ids.numel(), cumsum_buffer.data_ptr()); + topk_ids.numel(), cumsum_buffer.data_ptr(), + sorted_token_ids.size(0)); const int block_threads = std::min(256, (int)threads); const int num_blocks = diff --git a/vllm/model_executor/layers/fused_moe/moe_align_block_size.py b/vllm/model_executor/layers/fused_moe/moe_align_block_size.py index 3aae183dfa2..2c9ad509fa9 100644 --- a/vllm/model_executor/layers/fused_moe/moe_align_block_size.py +++ b/vllm/model_executor/layers/fused_moe/moe_align_block_size.py @@ -111,6 +111,8 @@ def moe_align_block_size_triton( dtype=torch.int32, device=topk_ids.device) tokens_per_thread = cdiv(numel, num_experts) + sorted_token_ids.fill_(numel) + expert_ids.zero_() moe_align_block_size_stage1[grid]( topk_ids, @@ -205,11 +207,8 @@ def moe_align_block_size( sorted_ids = torch.empty((max_num_tokens_padded, ), dtype=torch.int32, device=topk_ids.device) - sorted_ids.fill_(topk_ids.numel()) max_num_m_blocks = triton.cdiv(max_num_tokens_padded, block_size) - # Expert ids must be zeroed out to prevent index out of bounds error while - # mapping global expert ids to local expert ids in expert parallelism. 
- expert_ids = torch.zeros((max_num_m_blocks, ), + expert_ids = torch.empty((max_num_m_blocks, ), dtype=torch.int32, device=topk_ids.device) num_tokens_post_pad = torch.empty((1), From cbfc3ac110be8fb6f6e9adcda2119dc7bd2cad6e Mon Sep 17 00:00:00 2001 From: Lu Fang <30275821+houseroad@users.noreply.github.com> Date: Mon, 21 Jul 2025 13:47:47 -0700 Subject: [PATCH 235/552] [v1][sampler] Inplace logprobs comparison to get the token rank (#21283) Signed-off-by: Lu Fang Signed-off-by: x22x22 --- vllm/v1/sample/ops/logprobs.py | 24 ++++++++++++++++++++++++ vllm/v1/sample/sampler.py | 3 ++- 2 files changed, 26 insertions(+), 1 deletion(-) create mode 100644 vllm/v1/sample/ops/logprobs.py diff --git a/vllm/v1/sample/ops/logprobs.py b/vllm/v1/sample/ops/logprobs.py new file mode 100644 index 00000000000..a4d65485140 --- /dev/null +++ b/vllm/v1/sample/ops/logprobs.py @@ -0,0 +1,24 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Some utilities for logprobs, including logits.""" + +import torch + + +@torch.compile(dynamic=True) +def batched_count_greater_than(x: torch.Tensor, + values: torch.Tensor) -> torch.Tensor: + """ + Counts elements in each row of x that are greater than the corresponding + value in values. Use torch.compile to generate an optimized kernel for + this function. otherwise, it will create additional copies of the input + tensors and cause memory issues. + + Args: + x (torch.Tensor): A 2D tensor of shape (batch_size, n_elements). + values (torch.Tensor): A 2D tensor of shape (batch_size, 1). + + Returns: + torch.Tensor: A 1D tensor of shape (batch_size,) with the counts. + """ + return (x >= values).sum(-1) diff --git a/vllm/v1/sample/sampler.py b/vllm/v1/sample/sampler.py index e79e4451a3a..fa078e62876 100644 --- a/vllm/v1/sample/sampler.py +++ b/vllm/v1/sample/sampler.py @@ -9,6 +9,7 @@ from vllm.v1.outputs import LogprobsTensors, SamplerOutput from vllm.v1.sample.metadata import SamplingMetadata from vllm.v1.sample.ops.bad_words import apply_bad_words +from vllm.v1.sample.ops.logprobs import batched_count_greater_than from vllm.v1.sample.ops.penalties import apply_all_penalties from vllm.v1.sample.ops.topk_topp_sampler import TopKTopPSampler @@ -174,7 +175,7 @@ def gather_logprobs( token_logprobs = logprobs.gather(-1, token_ids) # Compute the ranks of the actual token. - token_ranks = (logprobs >= token_logprobs).sum(-1) + token_ranks = batched_count_greater_than(logprobs, token_logprobs) # Concatenate together with the topk. 
indices = torch.cat((token_ids, topk_indices), dim=1) From 5c85e90c8fe8c7041a9b41c0a142edccec19ec5b Mon Sep 17 00:00:00 2001 From: Chaojun Zhang Date: Tue, 22 Jul 2025 12:47:35 +0800 Subject: [PATCH 236/552] [XPU] Enable external_launcher to serve as an executor via torchrun (#21021) Signed-off-by: chzhang Signed-off-by: x22x22 --- vllm/v1/worker/xpu_worker.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/vllm/v1/worker/xpu_worker.py b/vllm/v1/worker/xpu_worker.py index da271b2159a..c7885694f7a 100644 --- a/vllm/v1/worker/xpu_worker.py +++ b/vllm/v1/worker/xpu_worker.py @@ -7,6 +7,7 @@ import vllm.envs as envs from vllm.config import VllmConfig +from vllm.distributed import get_world_group from vllm.logger import init_logger from vllm.model_executor import set_random_seed from vllm.platforms import current_platform @@ -155,7 +156,8 @@ def init_device(self): current_platform.dist_backend) # global all_reduce needed for overall oneccl warm up - torch.distributed.all_reduce(torch.zeros(1).xpu()) + torch.distributed.all_reduce(torch.zeros(1).xpu(), + group=get_world_group().device_group) # Set random seed. set_random_seed(self.model_config.seed) From 7ed60ccef36ec2deb63f0069ba851153837f17a6 Mon Sep 17 00:00:00 2001 From: "Li, Jiang" Date: Tue, 22 Jul 2025 12:47:49 +0800 Subject: [PATCH 237/552] [Doc] Fix CPU doc format (#21316) Signed-off-by: jiang1.li Signed-off-by: x22x22 --- docs/getting_started/installation/cpu.md | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/docs/getting_started/installation/cpu.md b/docs/getting_started/installation/cpu.md index 5721195172d..2d2598da943 100644 --- a/docs/getting_started/installation/cpu.md +++ b/docs/getting_started/installation/cpu.md @@ -168,17 +168,18 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe ### How to do performance tuning for vLLM CPU? - - First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU cores usage via `htop`. +First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU cores usage via `htop`. - - Inference batch size is a important parameter for the performance. Larger batch usually provides higher throughput, smaller batch provides lower latency. Tuning max batch size starts from default value to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM: - - `--max-num-batched-tokens`, defines the limit of token numbers in a single batch, has more impacts on the first token performance. The default value is set as: - - Offline Inference: `4096 * world_size` - - Online Serving: `2048 * world_size` - - `--max-num-seqs`, defines the limit of sequence numbers in a single batch, has more impacts on the output token performance. - - Offline Inference: `256 * world_size` - - Online Serving: `128 * world_size` +Inference batch size is a important parameter for the performance. Larger batch usually provides higher throughput, smaller batch provides lower latency. Tuning max batch size starts from default value to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. 
There are two important related parameters in vLLM: - - vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more detials of tuning TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommend to use TP and PP togther if there are enough CPU sockets and memory nodes. +- `--max-num-batched-tokens`, defines the limit of token numbers in a single batch, has more impacts on the first token performance. The default value is set as: + - Offline Inference: `4096 * world_size` + - Online Serving: `2048 * world_size` +- `--max-num-seqs`, defines the limit of sequence numbers in a single batch, has more impacts on the output token performance. + - Offline Inference: `256 * world_size` + - Online Serving: `128 * world_size` + +vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more detials of tuning TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommend to use TP and PP togther if there are enough CPU sockets and memory nodes. ### Which quantization configs does vLLM CPU support? From f173eae11f871aff2573eacc26358850f3a4416c Mon Sep 17 00:00:00 2001 From: Ratnam Parikh <114774508+ratnampa@users.noreply.github.com> Date: Mon, 21 Jul 2025 21:48:27 -0700 Subject: [PATCH 238/552] [Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU (#21338) Signed-off-by: ratnampa Signed-off-by: x22x22 --- vllm/executor/ray_distributed_executor.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vllm/executor/ray_distributed_executor.py b/vllm/executor/ray_distributed_executor.py index dec32f8e50f..417750a08c6 100644 --- a/vllm/executor/ray_distributed_executor.py +++ b/vllm/executor/ray_distributed_executor.py @@ -67,8 +67,8 @@ def _init_executor(self) -> None: os.environ["VLLM_USE_RAY_SPMD_WORKER"] = "1" os.environ["VLLM_USE_RAY_COMPILED_DAG"] = "1" - # For TPU, avoid compiling NVIDIA's NCCL - if current_platform.is_tpu(): + # For TPU or XPU, avoid compiling NVIDIA's NCCL + if current_platform.is_tpu() or current_platform.is_xpu(): os.environ["VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE"] = "shm" # If the env var is set, it uses the Ray's compiled DAG API From 5418f5a722332418d6edffd05d8ca19a3ee62dff Mon Sep 17 00:00:00 2001 From: Ming Yang Date: Mon, 21 Jul 2025 21:49:01 -0700 Subject: [PATCH 239/552] Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) (#21334) Signed-off-by: Ming Yang Signed-off-by: x22x22 --- .../kernels/benchmark_grouped_gemm_cutlass.py | 35 +---------- csrc/moe/moe_permute_unpermute_op.cu | 53 ++++------------ tests/kernels/moe/test_cutlass_moe.py | 14 +---- tests/kernels/moe/test_pplx_cutlass_moe.py | 22 ------- .../layers/fused_moe/cutlass_moe.py | 62 +++++++------------ .../compressed_tensors_moe.py | 26 +------- 6 files changed, 38 insertions(+), 174 deletions(-) diff --git a/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py b/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py index a6b42406b5c..1d4e730f99a 100644 --- a/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py +++ b/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py @@ -80,11 +80,6 @@ def bench_run( a, score, topk, renormalize=False ) - ab_strides1 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64) - ab_strides2 = torch.full((num_experts,), n, device="cuda", dtype=torch.int64) - c_strides1 = 
torch.full((num_experts,), 2 * n, device="cuda", dtype=torch.int64) - c_strides2 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64) - def run_triton_moe( a: torch.Tensor, w1: torch.Tensor, @@ -116,10 +111,6 @@ def run_cutlass_moe( w2: torch.Tensor, w1_scale: torch.Tensor, w2_scale: torch.Tensor, - ab_strides1: torch.Tensor, - ab_strides2: torch.Tensor, - c_strides1: torch.Tensor, - c_strides2: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, per_act_token: bool, @@ -134,10 +125,6 @@ def run_cutlass_moe( topk_ids, w1_scale, w2_scale, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, per_act_token, a1_scale=None, ) @@ -149,10 +136,6 @@ def run_cutlass_from_graph( w2_q: torch.Tensor, w1_scale: torch.Tensor, w2_scale: torch.Tensor, - ab_strides1: torch.Tensor, - ab_strides2: torch.Tensor, - c_strides1: torch.Tensor, - c_strides2: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, ): @@ -167,10 +150,6 @@ def run_cutlass_from_graph( topk_ids, w1_scale, w2_scale, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, per_act_token, a1_scale=None, ) @@ -215,10 +194,6 @@ def replay_graph(graph, num_repeats): w2_q, w1_scale, w2_scale, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, topk_weights, topk_ids, ) @@ -256,10 +231,6 @@ def replay_graph(graph, num_repeats): "w1_scale": w1_scale, "w2_scale": w2_scale, "per_act_token": per_act_token, - "ab_strides1": ab_strides1, - "ab_strides2": ab_strides2, - "c_strides1": c_strides1, - "c_strides2": c_strides2, # cuda graph params "cutlass_graph": cutlass_graph, "triton_graph": triton_graph, @@ -318,10 +289,6 @@ def replay_graph(graph, num_repeats): w2_q, w1_scale, w2_scale, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, topk_weights, topk_ids, per_act_token, @@ -330,7 +297,7 @@ def replay_graph(graph, num_repeats): results.append( benchmark.Timer( - stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, ab_strides1, ab_strides2, c_strides1, c_strides2, topk_weights, topk_ids, per_act_token, num_runs)", # noqa: E501 + stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, topk_weights, topk_ids, per_act_token, num_runs)", # noqa: E501 globals=globals, label=label, sub_label=sub_label, diff --git a/csrc/moe/moe_permute_unpermute_op.cu b/csrc/moe/moe_permute_unpermute_op.cu index 13aecd8007a..a77471a7f20 100644 --- a/csrc/moe/moe_permute_unpermute_op.cu +++ b/csrc/moe/moe_permute_unpermute_op.cu @@ -160,30 +160,6 @@ __global__ void shuffleInputRowsKernel(const T* input, } } -template -__global__ void shuffleInputRowsKernelSlow(const T* input, - const int32_t* dst2src_map, - T* output, int64_t num_src_rows, - int64_t num_dst_rows, - int64_t num_cols) { - int64_t dest_row_idx = blockIdx.x; - int64_t const source_row_idx = dst2src_map[dest_row_idx]; - - if (blockIdx.x < num_dst_rows) { - // Duplicate and permute rows - auto const* source_row_ptr = input + source_row_idx * num_cols; - auto* dest_row_ptr = output + dest_row_idx * num_cols; - - int64_t const start_offset = threadIdx.x; - int64_t const stride = blockDim.x; - - for (int elem_index = start_offset; elem_index < num_cols; - elem_index += stride) { - dest_row_ptr[elem_index] = source_row_ptr[elem_index]; - } - } -} - void shuffle_rows(const torch::Tensor& input_tensor, const torch::Tensor& dst2src_map, torch::Tensor& output_tensor) { @@ -197,24 +173,17 @@ void shuffle_rows(const torch::Tensor& input_tensor, int64_t const num_src_rows = input_tensor.size(0); int64_t const num_cols = input_tensor.size(1); - 
if (num_cols % (128 / sizeof(input_tensor.scalar_type()) / 8)) { - // use slow kernel if num_cols can't be aligned to 128 bits - MOE_DISPATCH(input_tensor.scalar_type(), [&] { - shuffleInputRowsKernelSlow<<>>( - reinterpret_cast(input_tensor.data_ptr()), - dst2src_map.data_ptr(), - reinterpret_cast(output_tensor.data_ptr()), num_src_rows, - num_dest_rows, num_cols); - }); - } else { - MOE_DISPATCH(input_tensor.scalar_type(), [&] { - shuffleInputRowsKernel<<>>( - reinterpret_cast(input_tensor.data_ptr()), - dst2src_map.data_ptr(), - reinterpret_cast(output_tensor.data_ptr()), num_src_rows, - num_dest_rows, num_cols); - }); - } + TORCH_CHECK(!(num_cols % (128 / sizeof(input_tensor.scalar_type()) / 8)), + "num_cols must be divisible by 128 / " + "sizeof(input_tensor.scalar_type()) / 8"); + + MOE_DISPATCH(input_tensor.scalar_type(), [&] { + shuffleInputRowsKernel<<>>( + reinterpret_cast(input_tensor.data_ptr()), + dst2src_map.data_ptr(), + reinterpret_cast(output_tensor.data_ptr()), num_src_rows, + num_dest_rows, num_cols); + }); } #else diff --git a/tests/kernels/moe/test_cutlass_moe.py b/tests/kernels/moe/test_cutlass_moe.py index 37727b75b07..81fb3ec1de1 100644 --- a/tests/kernels/moe/test_cutlass_moe.py +++ b/tests/kernels/moe/test_cutlass_moe.py @@ -207,10 +207,6 @@ def run_8_bit(moe_tensors: MOETensors8Bit, 'topk_ids': topk_ids, 'w1_scale': moe_tensors.w1_scale, 'w2_scale': moe_tensors.w2_scale, - 'ab_strides1': moe_tensors.ab_strides1, - 'ab_strides2': moe_tensors.ab_strides2, - 'c_strides1': moe_tensors.c_strides1, - 'c_strides2': moe_tensors.c_strides2, 'per_act_token': per_act_token, 'a1_scale': None #moe_tensors.a_scale } @@ -444,11 +440,6 @@ def test_run_cutlass_moe_fp8( expert_map[start:end] = list(range(num_local_experts)) expert_map = torch.tensor(expert_map, dtype=torch.int32, device="cuda") - ab_strides1 = torch.full((e, ), k, device="cuda", dtype=torch.int64) - ab_strides2 = torch.full((e, ), n, device="cuda", dtype=torch.int64) - c_strides1 = torch.full((e, ), 2 * n, device="cuda", dtype=torch.int64) - c_strides2 = torch.full((e, ), k, device="cuda", dtype=torch.int64) - activation = lambda o, i: torch.ops._C.silu_and_mul(o, i) a1q, a1q_scale = moe_kernel_quantize_input(mt.a, mt.a_scale, torch.float8_e4m3fn, @@ -457,9 +448,8 @@ def test_run_cutlass_moe_fp8( func = lambda output: run_cutlass_moe_fp8( output, a1q, mt.w1_q, mt.w2_q, topk_ids, activation, global_num_experts, expert_map, mt.w1_scale, mt.w2_scale, - a1q_scale, None, ab_strides1, ab_strides2, c_strides1, c_strides2, - workspace13, workspace2, None, mt.a.dtype, per_act_token, - per_out_channel, False) + a1q_scale, None, workspace13, workspace2, None, mt.a.dtype, + per_act_token, per_out_channel, False) workspace13.random_() output_random_workspace = torch.empty(output_shape, diff --git a/tests/kernels/moe/test_pplx_cutlass_moe.py b/tests/kernels/moe/test_pplx_cutlass_moe.py index 77adc89ea9d..e4f4a393dfd 100644 --- a/tests/kernels/moe/test_pplx_cutlass_moe.py +++ b/tests/kernels/moe/test_pplx_cutlass_moe.py @@ -75,7 +75,6 @@ def pplx_cutlass_moe( assert torch.cuda.current_device() == pgi.local_rank num_tokens, hidden_dim = a.shape - intermediate_dim = w2.shape[2] num_experts = w1.shape[0] block_size = hidden_dim # TODO support more cases device = pgi.device @@ -124,31 +123,10 @@ def pplx_cutlass_moe( num_local_experts=num_local_experts, num_dispatchers=num_dispatchers) - ab_strides1 = torch.full((num_local_experts, ), - hidden_dim, - device="cuda", - dtype=torch.int64) - ab_strides2 = torch.full((num_local_experts, 
), - intermediate_dim, - device="cuda", - dtype=torch.int64) - c_strides1 = torch.full((num_local_experts, ), - 2 * intermediate_dim, - device="cuda", - dtype=torch.int64) - c_strides2 = torch.full((num_local_experts, ), - hidden_dim, - device="cuda", - dtype=torch.int64) - experts = CutlassExpertsFp8(num_local_experts, out_dtype, per_act_token, per_out_ch, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, num_dispatchers=num_dispatchers, use_batched_format=True) diff --git a/vllm/model_executor/layers/fused_moe/cutlass_moe.py b/vllm/model_executor/layers/fused_moe/cutlass_moe.py index ff49d7bb780..2585a2953c9 100644 --- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py +++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py @@ -13,7 +13,8 @@ MoEPrepareAndFinalizeNoEP) from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import ( TopKWeightAndReduceDelegate) -from vllm.model_executor.layers.fused_moe.utils import (_fp8_quantize, +from vllm.model_executor.layers.fused_moe.utils import (_fp8_perm, + _fp8_quantize, _resize_cache, extract_required_args) from vllm.scalar_type import scalar_types @@ -34,10 +35,6 @@ def run_cutlass_moe_fp8( w2_scale: Optional[torch.Tensor], a1q_scale: Optional[torch.Tensor], a2_scale: Optional[torch.Tensor], - ab_strides1: torch.Tensor, - ab_strides2: torch.Tensor, - c_strides1: torch.Tensor, - c_strides2: torch.Tensor, workspace13: torch.Tensor, workspace2: torch.Tensor, expert_num_tokens: Optional[torch.Tensor], @@ -156,11 +153,27 @@ def run_cutlass_moe_fp8( problem_sizes1, problem_sizes2, a_map, c_map, global_num_experts, N, K) - a1q = ops.shuffle_rows(a1q, a_map) - a1q_scale = (ops.shuffle_rows(a1q_scale, a_map) - if per_act_token else a1q_scale) + a1q = _fp8_perm(a1q, a_map) + a1q_scale = a1q_scale[a_map] if per_act_token else a1q_scale expert_offsets = expert_offsets[:-1] + ab_strides1 = torch.full((w1.size(0), ), + K, + device=device, + dtype=torch.int64) + c_strides1 = torch.full((w1.size(0), ), + 2 * N, + device=device, + dtype=torch.int64) + ab_strides2 = torch.full((w1.size(0), ), + N, + device=device, + dtype=torch.int64) + c_strides2 = torch.full((w1.size(0), ), + K, + device=device, + dtype=torch.int64) + if use_batched_format: c1 = _resize_cache(workspace13, (local_E * padded_M, N * 2)) c2 = _resize_cache(workspace2, (local_E * padded_M, N)) @@ -197,8 +210,7 @@ def run_cutlass_moe_fp8( else: # We can't do this inplace because output may point to the same tensor # as c3. - output.copy_(ops.shuffle_rows(c3, c_map).view(M * topk, K), - non_blocking=True) + output.copy_(c3[c_map].view(M * topk, K), non_blocking=True) # TODO (bnell): split class batched vs. non-batched? 
@@ -211,10 +223,6 @@ def __init__( out_dtype: Optional[torch.dtype], per_act_token_quant: bool, per_out_ch_quant: bool, - ab_strides1: torch.Tensor, - ab_strides2: torch.Tensor, - c_strides1: torch.Tensor, - c_strides2: torch.Tensor, block_shape: Optional[list[int]] = None, num_dispatchers: Optional[int] = None, use_batched_format: bool = False, @@ -231,10 +239,6 @@ def __init__( self.max_experts_per_worker = max_experts_per_worker self.num_dispatchers = num_dispatchers self.out_dtype = out_dtype - self.ab_strides1 = ab_strides1 - self.ab_strides2 = ab_strides2 - self.c_strides1 = c_strides1 - self.c_strides2 = c_strides2 self.use_batched_format = use_batched_format @property @@ -314,8 +318,7 @@ def apply(self, output: torch.Tensor, hidden_states: torch.Tensor, run_cutlass_moe_fp8( output, hidden_states, w1, w2, topk_ids, activation_callable, global_num_experts, expert_map, w1_scale, w2_scale, a1q_scale, - a2_scale, self.ab_strides1, self.ab_strides2, self.c_strides1, - self.c_strides2, workspace13, workspace2, expert_num_tokens, + a2_scale, workspace13, workspace2, expert_num_tokens, self.out_dtype if self.out_dtype is not None else in_dtype, self.per_act_token_quant, self.per_out_ch_quant, self.use_batched_format) @@ -329,10 +332,6 @@ def cutlass_moe_fp8( topk_ids: torch.Tensor, w1_scale: torch.Tensor, w2_scale: torch.Tensor, - ab_strides1: torch.Tensor, - ab_strides2: torch.Tensor, - c_strides1: torch.Tensor, - c_strides2: torch.Tensor, per_act_token: Optional[bool] = None, activation: str = "silu", a1_scale: Optional[torch.Tensor] = None, @@ -360,17 +359,6 @@ def cutlass_moe_fp8( Shape: [num_experts] or [num_experts, 2N] - w2_scale (torch.Tensor): The fp32 scale to dequantize w2_q. Shape: [num_experts] or [num_experts, K] - - ab_strides1 (torch.Tensor): The input/weight strides for the first gemm. - Shape: [num_experts] - - ab_strides2 (torch.Tensor): The input/weight strides for the second gemm. - Shape: [num_experts] - - c_strides1 (torch.Tensor): The output strides for the first gemm. - Shape: [num_experts] - - c_strides2 (torch.Tensor): The output strides for the second gemm. - Shape: [num_experts] - - per_act_token (Optional[bool]): Whether the scale is per-token or - per-tensor. - - activation (str): The activation function to use. - a1_scale (Optional[torch.Tensor]): The optional fp32 scale to quantize a. 
Shape: scalar or [M] - a2_scale (Optional[torch.Tensor]): The optional fp32 scale to @@ -403,10 +391,6 @@ def cutlass_moe_fp8( out_dtype=a.dtype, per_act_token_quant=per_act_token, per_out_ch_quant=per_out_ch, - ab_strides1=ab_strides1, - ab_strides2=ab_strides2, - c_strides1=c_strides1, - c_strides2=c_strides2, use_batched_format=False, ), ) diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py index 1a31410c338..2c93977beed 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py @@ -859,21 +859,6 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None: layer.w13_weight_scale = torch.nn.Parameter(max_w13_scales, requires_grad=False) - device = layer.w13_weight.device - # ab_strides1 and c_strides2 are the same - self.ab_strides1_c_strides2 = torch.full((layer.local_num_experts, ), - layer.hidden_size, - device=device, - dtype=torch.int64) - self.ab_strides2 = torch.full((layer.local_num_experts, ), - layer.intermediate_size_per_partition, - device=device, - dtype=torch.int64) - self.c_strides1 = torch.full((layer.local_num_experts, ), - 2 * layer.intermediate_size_per_partition, - device=device, - dtype=torch.int64) - def select_gemm_impl( self, prepare_finalize: FusedMoEPrepareAndFinalize, @@ -896,10 +881,6 @@ def select_gemm_impl( moe.in_dtype, self.input_quant.strategy == QuantizationStrategy.TOKEN, self.weight_quant.strategy == QuantizationStrategy.CHANNEL, - ab_strides1=self.ab_strides1_c_strides2, - ab_strides2=self.ab_strides2, - c_strides1=self.c_strides1, - c_strides2=self.ab_strides1_c_strides2, num_dispatchers=num_dispatchers, use_batched_format=use_batched_format, ) @@ -946,8 +927,7 @@ def apply( num_expert_group=num_expert_group, custom_routing_function=custom_routing_function, scoring_func=scoring_func, - e_score_correction_bias=e_score_correction_bias, - indices_type=self.topk_indices_dtype) + e_score_correction_bias=e_score_correction_bias) per_act_token = ( self.input_quant.strategy == QuantizationStrategy.TOKEN) @@ -968,10 +948,6 @@ def apply( expert_map=None if self.disable_expert_map else expert_map, w1_scale=layer.w13_weight_scale, w2_scale=layer.w2_weight_scale, - ab_strides1=self.ab_strides1_c_strides2, - ab_strides2=self.ab_strides2, - c_strides1=self.c_strides1, - c_strides2=self.ab_strides1_c_strides2, a1_scale=layer.w13_input_scale, a2_scale=layer.w2_input_scale, ) From b6cd9b04b611f0b3d837cbde6c22f9e9a8975b2e Mon Sep 17 00:00:00 2001 From: Jialin Ouyang Date: Mon, 21 Jul 2025 22:37:34 -0700 Subject: [PATCH 240/552] [Core] Minimize number of dict lookup in _maybe_evict_cached_block (#21281) Signed-off-by: Jialin Ouyang Signed-off-by: x22x22 --- vllm/v1/core/block_pool.py | 37 +++++++++++++++++++++---------------- 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/vllm/v1/core/block_pool.py b/vllm/v1/core/block_pool.py index d21f94727cf..0fd6947ae0b 100644 --- a/vllm/v1/core/block_pool.py +++ b/vllm/v1/core/block_pool.py @@ -243,22 +243,27 @@ def _maybe_evict_cached_block(self, block: KVCacheBlock) -> bool: True if the block is evicted, False otherwise. 
""" block_hash = block.block_hash - if block_hash and block_hash in self.cached_block_hash_to_block: - block.reset_hash() - del self.cached_block_hash_to_block[block_hash][block.block_id] - - if len(self.cached_block_hash_to_block[block_hash]) == 0: - del self.cached_block_hash_to_block[block_hash] - - if self.enable_kv_cache_events: - # FIXME (Chen): Not sure whether we should return `hash_value` - # or `(hash_value, group_id)` here. But it's fine now because - # we disable hybrid kv cache manager when kv cache event is - # enabled, so there is only one group. - self.kv_event_queue.append( - BlockRemoved(block_hashes=[block_hash.get_hash_value()])) - return True - return False + if block_hash is None: + # The block doesn't have hash, eviction is not needed + return False + blocks_by_id = self.cached_block_hash_to_block.get(block_hash) + if blocks_by_id is None: + # block_hash not found in cached_block_hash_to_block, + # eviction is not needed + return False + block.reset_hash() + blocks_by_id.pop(block.block_id, None) + if blocks_by_id: + del self.cached_block_hash_to_block[block_hash] + + if self.enable_kv_cache_events: + # FIXME (Chen): Not sure whether we should return `hash_value` + # or `(hash_value, group_id)` here. But it's fine now because + # we disable hybrid kv cache manager when kv cache event is + # enabled, so there is only one group. + self.kv_event_queue.append( + BlockRemoved(block_hashes=[block_hash.get_hash_value()])) + return True def touch(self, blocks: tuple[list[KVCacheBlock], ...]) -> None: """Touch a block increases its reference count by 1, and may remove From 0cf27f50c8e65efac9bbb05b82b7fbd11c397c95 Mon Sep 17 00:00:00 2001 From: Thomas Parnell Date: Tue, 22 Jul 2025 08:31:18 +0200 Subject: [PATCH 241/552] [V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible (#21300) Signed-off-by: Thomas Parnell Signed-off-by: x22x22 --- tests/v1/worker/test_gpu_model_runner.py | 150 ++++++++++++++++++++++- 1 file changed, 149 insertions(+), 1 deletion(-) diff --git a/tests/v1/worker/test_gpu_model_runner.py b/tests/v1/worker/test_gpu_model_runner.py index 0bdf1f9820d..6ddcbfea24a 100644 --- a/tests/v1/worker/test_gpu_model_runner.py +++ b/tests/v1/worker/test_gpu_model_runner.py @@ -3,15 +3,19 @@ import random +import numpy as np import pytest import torch from vllm.attention import Attention from vllm.config import (CacheConfig, ModelConfig, ParallelConfig, SchedulerConfig, VllmConfig, set_current_vllm_config) +from vllm.distributed.parallel_state import (init_distributed_environment, + initialize_model_parallel) +from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2 from vllm.platforms import current_platform from vllm.sampling_params import SamplingParams -from vllm.utils import GiB_bytes +from vllm.utils import GiB_bytes, update_environment_variables from vllm.v1.core.kv_cache_utils import (estimate_max_model_len, get_kv_cache_config) from vllm.v1.core.sched.output import (CachedRequestData, NewRequestData, @@ -686,3 +690,147 @@ def test_init_kv_cache_with_kv_sharing_valid(): assert len(kv_cache_config.kv_cache_groups[0].layer_names) == 2 assert kv_cache_config.kv_cache_groups[0].layer_names[0] == layer_0 assert kv_cache_config.kv_cache_groups[0].layer_names[1] == layer_1 + + +def test_hybrid_attention_mamba_tensor_shapes(monkeypatch): + ''' + The GPU model runner creates different views into the + KVCacheTensors for the attention and mamba layers + (via _reshape_kv_cache_tensors function). 
This test verifies + that the views are compatible: writing a mamba block + will not corrupt an attention block and vice-versa + ''' + + current_platform.seed_everything(42) + + update_environment_variables({ + 'RANK': "0", + 'LOCAL_RANK': "0", + 'WORLD_SIZE': "1", + 'MASTER_ADDR': 'localhost', + 'MASTER_PORT': '12345', + }) + init_distributed_environment() + initialize_model_parallel(tensor_model_parallel_size=1) + torch.set_default_dtype(torch.float16) + + scheduler_config = SchedulerConfig( + max_num_seqs=10, + max_num_batched_tokens=512, + max_model_len=512, + ) + model_config = ModelConfig( + model="ibm-granite/granite-4.0-tiny-preview", + dtype="float16", + ) + cache_config = CacheConfig( + block_size=BLOCK_SIZE, + gpu_memory_utilization=0.9, + swap_space=0, + cache_dtype="auto", + ) + parallel_config = ParallelConfig() + vllm_config = VllmConfig( + model_config=model_config, + cache_config=cache_config, + scheduler_config=scheduler_config, + parallel_config=parallel_config, + ) + + layer_0 = "model.layers.0.self_attn.attn" + layer_1 = "model.layers.1.self_attn.attn" + layer_2 = "model.layers.2.mixer" + layer_3 = "model.layers.3.mixer" + layer_4 = "model.layers.4.mixer" + layer_5 = "model.layers.5.mixer" + + with set_current_vllm_config(vllm_config): + hf_config = vllm_config.model_config.hf_config + fwd_context = {} + for key in [layer_0, layer_1]: + fwd_context[key] = Attention( + num_heads=model_config.get_num_attention_heads( + parallel_config), + num_kv_heads=model_config.get_num_kv_heads(parallel_config), + head_size=model_config.get_head_size(), + scale=1.0, + prefix=key, + ) + for key in [layer_2, layer_3, layer_4, layer_5]: + fwd_context[key] = MambaMixer2( + hidden_size = hf_config.hidden_size, + ssm_state_size = hf_config.mamba_d_state, + conv_kernel_size = hf_config.mamba_d_conv, + intermediate_size = hf_config.mamba_expand *\ + hf_config.hidden_size, + use_conv_bias = hf_config.mamba_conv_bias, + use_bias = hf_config.mamba_proj_bias, + n_groups=hf_config.mamba_n_groups, + num_heads=hf_config.mamba_n_heads, + head_dim=hf_config.mamba_d_head, + rms_norm_eps=hf_config.rms_norm_eps, + activation=hf_config.hidden_act, + prefix=key, + ) + # suppress var not used error + assert fwd_context is not None + vllm_ctx = vllm_config.compilation_config.static_forward_context + + with monkeypatch.context() as m: + + m.setenv("VLLM_ATTENTION_BACKEND", "FLASHINFER") + + runner = GPUModelRunner(vllm_config, DEVICE) + kv_cache_spec = runner.get_kv_cache_spec() + + available_memory = 5 * GiB_bytes + kv_cache_config = get_kv_cache_config(vllm_config, kv_cache_spec, + available_memory) + runner.initialize_kv_cache(kv_cache_config) + + # random partition of blocks + # blocks0 will be assigned to attention layers + # blocks1 will be assigned to mamba layers + num_blocks = kv_cache_config.num_blocks + ind = np.arange(num_blocks) + np.random.shuffle(ind) + blocks0, blocks1 = ind[:(num_blocks // 2)], ind[(num_blocks // 2):] + + attn_shape = vllm_ctx[layer_0].kv_cache[0].shape + conv_shape = vllm_ctx[layer_2].kv_cache[0][0].shape + ssm_shape = vllm_ctx[layer_2].kv_cache[0][1].shape + + # assert we are using FlashInfer + assert attn_shape[0] == num_blocks + + attn_blocks_constant = torch.full((len(blocks0), *attn_shape[1:]), + device=DEVICE, + fill_value=3.33) + conv_blocks_constant = torch.full((len(blocks1), *conv_shape[1:]), + device=DEVICE, + fill_value=6.66) + ssm_blocks_constant = torch.full((len(blocks1), *ssm_shape[1:]), + device=DEVICE, + fill_value=9.99) + + # fill all attention blocks with 
constant + for layer in [layer_0, layer_1]: + vllm_ctx[layer].kv_cache[0][ + blocks0, :] = attn_blocks_constant.detach().clone() + + # fill all mamba blocks with constant + for layer in [layer_2, layer_3, layer_4, layer_5]: + vllm_ctx[layer].kv_cache[0][0][ + blocks1, :] = conv_blocks_constant.detach().clone() + vllm_ctx[layer].kv_cache[0][1][ + blocks1, :] = ssm_blocks_constant.detach().clone() + + # verify attention and mamba contents are correct + for layer in [layer_0, layer_1]: + assert torch.equal(vllm_ctx[layer].kv_cache[0][blocks0, :], + attn_blocks_constant) + for layer in [layer_2, layer_3, layer_4, layer_5]: + assert torch.equal(vllm_ctx[layer].kv_cache[0][0][blocks1, :], + conv_blocks_constant) + assert torch.equal(vllm_ctx[layer].kv_cache[0][1][blocks1, :], + ssm_blocks_constant) From a5d3e849fbb873f9b7445bd1994cff1464f35413 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Tue, 22 Jul 2025 02:33:51 -0400 Subject: [PATCH 242/552] [Refactor] Fix Compile Warning #1444-D (#21208) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- csrc/moe/topk_softmax_kernels.cu | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/csrc/moe/topk_softmax_kernels.cu b/csrc/moe/topk_softmax_kernels.cu index 064b76c9cd4..ea4ff67ef3e 100644 --- a/csrc/moe/topk_softmax_kernels.cu +++ b/csrc/moe/topk_softmax_kernels.cu @@ -20,6 +20,7 @@ #include #include #include "../cuda_compat.h" +#include #ifndef USE_ROCM #include @@ -62,7 +63,7 @@ __launch_bounds__(TPB) __global__ const int thread_row_offset = blockIdx.x * num_cols; - cub::Sum sum; + cuda::std::plus sum; float threadData(-FLT_MAX); // Don't touch finished rows. From 1e0ceb08d284a123c2409a579772db8c2641dd6e Mon Sep 17 00:00:00 2001 From: Konrad Zawora Date: Tue, 22 Jul 2025 08:35:14 +0200 Subject: [PATCH 243/552] Fix kv_cache_dtype handling for out-of-tree HPU plugin (#21302) Signed-off-by: Konrad Zawora Signed-off-by: Chendi.Xue Co-authored-by: Chendi.Xue Signed-off-by: x22x22 --- vllm/engine/arg_utils.py | 18 ++---------------- vllm/platforms/cuda.py | 13 +++++++++++++ vllm/platforms/interface.py | 7 +++++++ vllm/platforms/rocm.py | 4 ++++ vllm/platforms/tpu.py | 4 ++++ 5 files changed, 30 insertions(+), 16 deletions(-) diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 28b1c1c363a..1f74d22d07c 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -1352,22 +1352,8 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: # No Fp8 KV cache so far. 
if self.kv_cache_dtype != "auto": - fp8_attention = self.kv_cache_dtype.startswith("fp8") - will_use_fa = ( - current_platform.is_cuda() - and not envs.is_set("VLLM_ATTENTION_BACKEND") - ) or envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" - supported = False - if (current_platform.is_rocm() - or (current_platform.is_cuda() - and current_platform.is_device_capability(100)) - or current_platform.is_tpu()): - supported = True - elif fp8_attention and will_use_fa: - from vllm.attention.utils.fa_utils import ( - flash_attn_supports_fp8) - supported = flash_attn_supports_fp8() - + supported = current_platform.is_kv_cache_dtype_supported( + self.kv_cache_dtype) if not supported: _raise_or_fallback(feature_name="--kv-cache-dtype", recommend_to_remove=False) diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index 962e2b3aab6..fdf1f46e603 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -586,6 +586,19 @@ def is_fully_connected(cls, physical_device_ids: list[int]) -> bool: " not found. Assuming no NVLink available.") return False + @classmethod + def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool: + fp8_attention = kv_cache_dtype.startswith("fp8") + will_use_fa = (not envs.is_set("VLLM_ATTENTION_BACKEND") + ) or envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" + supported = False + if cls.is_device_capability(100): + supported = True + elif fp8_attention and will_use_fa: + from vllm.attention.utils.fa_utils import flash_attn_supports_fp8 + supported = flash_attn_supports_fp8() + return supported + # Autodetect either NVML-enabled or non-NVML platform # based on whether NVML is available. diff --git a/vllm/platforms/interface.py b/vllm/platforms/interface.py index 1cd5cb5e83d..02cc392244b 100644 --- a/vllm/platforms/interface.py +++ b/vllm/platforms/interface.py @@ -543,6 +543,13 @@ def stateless_init_device_torch_dist_pg( """ raise RuntimeError(f"Unsupported torch distributed backend: {backend}") + @classmethod + def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool: + """ + Returns if the kv_cache_dtype is supported by the current platform. 
+ """ + return False + class UnspecifiedPlatform(Platform): _enum = PlatformEnum.UNSPECIFIED diff --git a/vllm/platforms/rocm.py b/vllm/platforms/rocm.py index 0bf9262776b..b2e69f60343 100644 --- a/vllm/platforms/rocm.py +++ b/vllm/platforms/rocm.py @@ -454,3 +454,7 @@ def stateless_init_device_torch_dist_pg( @classmethod def device_count(cls) -> int: return cuda_device_count_stateless() + + @classmethod + def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool: + return True \ No newline at end of file diff --git a/vllm/platforms/tpu.py b/vllm/platforms/tpu.py index febc6ae4662..146801c9d77 100644 --- a/vllm/platforms/tpu.py +++ b/vllm/platforms/tpu.py @@ -190,6 +190,10 @@ def validate_request( and params.sampling_type == SamplingType.RANDOM_SEED): raise ValueError("Torch XLA does not support per-request seed.") + @classmethod + def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool: + return True + try: from tpu_commons.platforms import TpuPlatform as TpuCommonsPlatform From d0be9443ed055009f0db1753b862fde890f80b9e Mon Sep 17 00:00:00 2001 From: Varun Sundar Rabindranath Date: Tue, 22 Jul 2025 12:05:45 +0530 Subject: [PATCH 244/552] [Misc] DeepEPHighThroughtput - Enable Inductor pass (#21311) Signed-off-by: Varun Sundar Rabindranath Co-authored-by: Varun Sundar Rabindranath Signed-off-by: x22x22 --- vllm/platforms/cuda.py | 3 --- 1 file changed, 3 deletions(-) diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index fdf1f46e603..cc2543538d0 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -182,9 +182,6 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None: compilation_config.use_cudagraph = False if model_config is not None: model_config.enforce_eager = True - # TODO (varun): Turning this ON gives incorrect results for the - # Deepseek-V2-lite model. 
- vllm_config.compilation_config.use_inductor = False @classmethod def get_current_memory_usage(cls, From bc90d5a797f1ee9d45ee4f3b7cc290ef8da50c21 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Tue, 22 Jul 2025 02:36:18 -0400 Subject: [PATCH 245/552] [Bug] DeepGemm: Fix Cuda Init Error (#21312) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- vllm/utils/deep_gemm.py | 54 ++++++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 22 deletions(-) diff --git a/vllm/utils/deep_gemm.py b/vllm/utils/deep_gemm.py index 8b5713e02c9..09a12a8c11c 100644 --- a/vllm/utils/deep_gemm.py +++ b/vllm/utils/deep_gemm.py @@ -45,30 +45,36 @@ def _resolve_symbol(module, new: str, old: str) -> Callable[..., Any] | None: return None -if not has_deep_gemm(): - _fp8_gemm_nt_impl: Callable[..., Any] | None = None - _grouped_impl: Callable[..., Any] | None = None - _grouped_masked_impl: Callable[..., Any] | None = None - _per_block_cast_impl: Callable[..., Any] | None = None -else: - _dg = importlib.import_module("deep_gemm") # type: ignore - - _fp8_gemm_nt_impl = _resolve_symbol( - _dg, - "fp8_gemm_nt", - "gemm_fp8_fp8_bf16_nt", - ) +_fp8_gemm_nt_impl: Callable[..., Any] | None = None +_grouped_impl: Callable[..., Any] | None = None +_grouped_masked_impl: Callable[..., Any] | None = None +_per_block_cast_impl: Callable[..., Any] | None = None + + +def _lazy_init() -> None: + """Import deep_gemm and resolve symbols on first use.""" + global _fp8_gemm_nt_impl, _grouped_impl, _grouped_masked_impl, \ + _per_block_cast_impl + + # fast path + if (_fp8_gemm_nt_impl is not None or _grouped_impl is not None + or _grouped_masked_impl is not None + or _per_block_cast_impl is not None): + return + + if not has_deep_gemm(): + return + + _dg = importlib.import_module("deep_gemm") + + _fp8_gemm_nt_impl = _resolve_symbol(_dg, "fp8_gemm_nt", + "gemm_fp8_fp8_bf16_nt") _grouped_impl = _resolve_symbol( - _dg, - "m_grouped_fp8_gemm_nt_contiguous", - "m_grouped_gemm_fp8_fp8_bf16_nt_contiguous", - ) + _dg, "m_grouped_fp8_gemm_nt_contiguous", + "m_grouped_gemm_fp8_fp8_bf16_nt_contiguous") _grouped_masked_impl = _resolve_symbol( - _dg, - "fp8_m_grouped_gemm_nt_masked", - "m_grouped_gemm_fp8_fp8_bf16_nt_masked", - ) - + _dg, "fp8_m_grouped_gemm_nt_masked", + "m_grouped_gemm_fp8_fp8_bf16_nt_masked") # Try to get per_token_cast_to_fp8 from DeepGEMM math utils. 
try: _math_mod = importlib.import_module( @@ -80,24 +86,28 @@ def _resolve_symbol(module, new: str, old: str) -> Callable[..., Any] | None: def fp8_gemm_nt(*args, **kwargs): + _lazy_init() if _fp8_gemm_nt_impl is None: return _missing(*args, **kwargs) return _fp8_gemm_nt_impl(*args, **kwargs) def m_grouped_fp8_gemm_nt_contiguous(*args, **kwargs): + _lazy_init() if _grouped_impl is None: return _missing(*args, **kwargs) return _grouped_impl(*args, **kwargs) def fp8_m_grouped_gemm_nt_masked(*args, **kwargs): + _lazy_init() if _grouped_masked_impl is None: return _missing(*args, **kwargs) return _grouped_masked_impl(*args, **kwargs) def per_block_cast_to_fp8(x, *args, **kwargs): + _lazy_init() if _per_block_cast_impl is not None and is_blackwell_deep_gemm_used(): return _per_block_cast_impl(x, use_ue8m0=True) # TODO: refactor the `per_block_cast_to_fp8` from tests to vllm utils From 5ae6da4c2d0320a0a6ff5d0ccd09f36f534412ba Mon Sep 17 00:00:00 2001 From: Shu Wang Date: Tue, 22 Jul 2025 01:40:21 -0500 Subject: [PATCH 246/552] Update fp4 quantize API (#21327) Signed-off-by: Shu Wang Signed-off-by: x22x22 --- .../layers/fused_moe/flashinfer_cutlass_moe.py | 10 +++++----- .../fused_moe/flashinfer_cutlass_prepare_finalize.py | 4 ++-- vllm/utils/flashinfer.py | 8 ++++---- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py index 1753c4f6e23..3e79a1a8c24 100644 --- a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py +++ b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py @@ -181,12 +181,12 @@ def apply( g2_alphas, ] _ = flashinfer_cutlass_fused_moe( - hidden_states, - topk_ids.to(torch.int), - topk_weights, + input=hidden_states, + token_selected_experts=topk_ids.to(torch.int), + token_final_scales=topk_weights, # FlashInfer API requires weight to be long for nvfp4 - w1.view(torch.long), - w2.view(torch.long), + fc1_expert_weights=w1.view(torch.long), + fc2_expert_weights=w2.view(torch.long), output_dtype=out_dtype, quant_scales=quant_scales, input_sf=a1q_scale, diff --git a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py index 49819504c8e..e658990e95e 100644 --- a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py @@ -11,7 +11,7 @@ from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig from vllm.model_executor.layers.fused_moe.utils import ( extract_required_args, moe_kernel_quantize_input) -from vllm.utils.flashinfer import fp4_swizzle_blockscale +from vllm.utils.flashinfer import block_scale_interleave def get_local_sizes(local_tokens): @@ -92,7 +92,7 @@ def prepare( dim=0, sizes=get_local_sizes(local_tokens)) a1_m, a1_n = a1q.shape - a1q_scale = fp4_swizzle_blockscale(a1q_scale, a1_m, a1_n * 2) + a1q_scale = block_scale_interleave(a1q_scale) return a1q, a1q_scale, None, topk_ids, topk_weights diff --git a/vllm/utils/flashinfer.py b/vllm/utils/flashinfer.py index fd8b384a616..1ddafbae7fc 100644 --- a/vllm/utils/flashinfer.py +++ b/vllm/utils/flashinfer.py @@ -69,8 +69,8 @@ def wrapper(*args, **kwargs): flashinfer_cutlass_fused_moe = _lazy_import_wrapper("flashinfer.fused_moe", "cutlass_fused_moe") fp4_quantize = _lazy_import_wrapper("flashinfer", "fp4_quantize") -fp4_swizzle_blockscale = 
_lazy_import_wrapper("flashinfer", - "fp4_swizzle_blockscale") +block_scale_interleave = _lazy_import_wrapper("flashinfer", + "block_scale_interleave") # Special case for autotune since it returns a context manager autotune = _lazy_import_wrapper( @@ -95,7 +95,7 @@ def has_flashinfer_cutlass_fused_moe() -> bool: required_functions = [ ("flashinfer.fused_moe", "cutlass_fused_moe"), ("flashinfer", "fp4_quantize"), - ("flashinfer", "fp4_swizzle_blockscale"), + ("flashinfer", "block_scale_interleave"), ] for module_name, attr_name in required_functions: @@ -110,7 +110,7 @@ def has_flashinfer_cutlass_fused_moe() -> bool: "flashinfer_trtllm_fp8_block_scale_moe", "flashinfer_cutlass_fused_moe", "fp4_quantize", - "fp4_swizzle_blockscale", + "block_scale_interleave", "autotune", "has_flashinfer_moe", "has_flashinfer_cutlass_fused_moe", From 7f3d3228f1a179c0fcdfbdffe988a80fc97e4e2c Mon Sep 17 00:00:00 2001 From: "rongfu.leng" Date: Tue, 22 Jul 2025 14:41:14 +0800 Subject: [PATCH 247/552] [Feature][eplb] add verify ep or tp or dp (#21102) Signed-off-by: rongfu.leng Signed-off-by: x22x22 --- vllm/config.py | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/vllm/config.py b/vllm/config.py index 1089e7ccd50..5d7b19f9e9b 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -2108,6 +2108,15 @@ def __post_init__(self) -> None: raise ValueError( "num_redundant_experts must be non-negative, but got " f"{self.num_redundant_experts}.") + if not self.enable_expert_parallel: + raise ValueError( + "enable_expert_parallel must be True to use EPLB.") + if self.tensor_parallel_size * self.data_parallel_size <= 1: + raise ValueError( + "EPLB requires tensor_parallel_size or data_parallel_size " + f"to be greater than 1, but got " + f"TP={self.tensor_parallel_size},DP={self.data_parallel_size}." + ) else: if self.num_redundant_experts != 0: raise ValueError( From 7f0fd26b38e9f9dd814713f82ecd0e7de662fb4c Mon Sep 17 00:00:00 2001 From: Raghav Ravishankar <113712354+alyosha-swamy@users.noreply.github.com> Date: Tue, 22 Jul 2025 13:27:43 +0530 Subject: [PATCH 248/552] Add arcee model (#21296) Signed-off-by: alyosha-swamy Signed-off-by: Jee Jee Li Co-authored-by: Jee Jee Li Signed-off-by: x22x22 --- docs/models/supported_models.md | 1 + tests/models/registry.py | 2 + vllm/model_executor/models/arcee.py | 347 +++++++++++++++++++++++++ vllm/model_executor/models/registry.py | 1 + 4 files changed, 351 insertions(+) create mode 100644 vllm/model_executor/models/arcee.py diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 943f8590ac0..69f6a7aedd2 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -324,6 +324,7 @@ th { | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | |--------------|--------|-------------------|----------------------|---------------------------|---------------------| | `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | +| `ArceeForCausalLM` | Arcee (AFM) | `arcee-ai/AFM-4.5B-Base`, etc. | ✅︎ | ✅︎ | ✅︎ | | `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ | | `BaiChuanForCausalLM` | Baichuan2, Baichuan | `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `BailingMoeForCausalLM` | Ling | `inclusionAI/Ling-lite-1.5`, `inclusionAI/Ling-plus`, etc. 
| ✅︎ | ✅︎ | ✅︎ | diff --git a/tests/models/registry.py b/tests/models/registry.py index 19725acd6c4..8e3285aebbe 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -135,6 +135,8 @@ def check_available_online( trust_remote_code=True), "AquilaForCausalLM": _HfExamplesInfo("BAAI/AquilaChat2-7B", trust_remote_code=True), + "ArceeForCausalLM": _HfExamplesInfo("arcee-ai/AFM-4.5B-Base", + is_available_online=False), "ArcticForCausalLM": _HfExamplesInfo("Snowflake/snowflake-arctic-instruct", trust_remote_code=True), "BaiChuanForCausalLM": _HfExamplesInfo("baichuan-inc/Baichuan-7B", diff --git a/vllm/model_executor/models/arcee.py b/vllm/model_executor/models/arcee.py new file mode 100644 index 00000000000..4e3ba107ba7 --- /dev/null +++ b/vllm/model_executor/models/arcee.py @@ -0,0 +1,347 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# Copyright 2023-2025 vLLM Team +# Licensed under the Apache License, Version 2.0 (the "License"); +# You may not use this file except in compliance with the License. +# You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 +# +# Inference-only Arcee (AFM) model – adds support for ReLU^2 feed-forward +# activation. + +from collections.abc import Iterable +from typing import Any, Optional, Union + +import torch +from torch import nn +from transformers import LlamaConfig + +from vllm.compilation.decorators import support_torch_compile +from vllm.distributed import get_pp_group +from vllm.model_executor.layers.activation import ReLUSquaredActivation +from vllm.model_executor.layers.layernorm import RMSNorm +from vllm.model_executor.layers.linear import (ColumnParallelLinear, + RowParallelLinear) +from vllm.model_executor.layers.logits_processor import LogitsProcessor +from vllm.model_executor.layers.vocab_parallel_embedding import ( + DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding) +from vllm.sequence import IntermediateTensors + +from .interfaces import SupportsLoRA, SupportsPP +from .utils import (AutoWeightsLoader, PPMissingLayer, + make_empty_intermediate_tensors_factory, make_layers) + + +class ArceeMLP(nn.Module): + """Feed-forward layer for Arcee using ReLU^2 activation + (no gating as in LLaMA).""" + + def __init__(self, + hidden_size: int, + intermediate_size: int, + hidden_act: str, + quant_config: Optional[Any] = None, + bias: bool = False, + prefix: str = "", + reduce_results: bool = True) -> None: + super().__init__() + # Single linear projection up to intermediate size + # (no separate gate projection) + self.up_proj = ColumnParallelLinear( + input_size=hidden_size, + output_size=intermediate_size, + bias=bias, + quant_config=quant_config, + prefix=f"{prefix}.up_proj", + ) + # Down projection back to hidden size + self.down_proj = RowParallelLinear( + input_size=intermediate_size, + output_size=hidden_size, + bias=bias, + quant_config=quant_config, + reduce_results=reduce_results, + prefix=f"{prefix}.down_proj", + ) + if hidden_act != "relu2": + raise ValueError(f"Unsupported activation: {hidden_act}. 
" + "Only 'relu2' is supported for AFM.") + # Define ReLU^2 activation: (ReLU(x))^2 elementwise + self.act_fn = ReLUSquaredActivation() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x, _ = self.up_proj(x) # Project to intermediate size + x = self.act_fn(x) # Apply ReLU^2 activation elementwise + x, _ = self.down_proj(x) # Project back down to hidden size + return x + + +class ArceeDecoderLayer(nn.Module): + """Transformer decoder block for Arcee, with self-attention and + ReLU^2 MLP.""" + + def __init__(self, + config: LlamaConfig, + cache_config: Optional[Any] = None, + quant_config: Optional[Any] = None, + prefix: str = "") -> None: + super().__init__() + self.hidden_size = config.hidden_size + # Rotary embedding parameters (reuse LLaMA defaults) + rope_theta = getattr(config, "rope_theta", 10000) + rope_scaling = getattr(config, "rope_scaling", None) + if rope_scaling is not None and getattr( + config, "original_max_position_embeddings", None): + rope_scaling["original_max_position_embeddings"] = ( + config.original_max_position_embeddings) + max_position_embeddings = getattr(config, "max_position_embeddings", + 8192) + # Determine if attention bias is needed (some variants use bias terms) + attention_bias = getattr(config, "attention_bias", False) or getattr( + config, "bias", False) + bias_o_proj = attention_bias + if hasattr(config, "qkv_bias"): + attention_bias = config.qkv_bias + + # Self-Attention (using LLaMA's attention structure) + from vllm.model_executor.models.llama import ( + LlamaAttention) # import here to avoid circular import + self.self_attn = LlamaAttention( + config=config, + hidden_size=self.hidden_size, + num_heads=config.num_attention_heads, + num_kv_heads=getattr(config, "num_key_value_heads", + config.num_attention_heads), + rope_theta=rope_theta, + rope_scaling=rope_scaling, + max_position_embeddings=max_position_embeddings, + quant_config=quant_config, + bias=attention_bias, + bias_o_proj=bias_o_proj, + cache_config=cache_config, + prefix=f"{prefix}.self_attn", + attn_type=getattr( + config, "attn_type", + "decoder"), # assume decoder (causal) unless specified + ) + # MLP with ReLU^2 activation + self.mlp = ArceeMLP( + hidden_size=self.hidden_size, + intermediate_size=config.intermediate_size, + hidden_act=config.hidden_act, + quant_config=quant_config, + bias=getattr(config, "mlp_bias", False), + prefix=f"{prefix}.mlp", + ) + # Layer normalization layers (RMSNorm as in LLaMA) + self.input_layernorm = RMSNorm(config.hidden_size, + eps=config.rms_norm_eps) + self.post_attention_layernorm = RMSNorm(config.hidden_size, + eps=config.rms_norm_eps) + + def forward( + self, positions: torch.Tensor, hidden_states: torch.Tensor, + residual: Optional[torch.Tensor] + ) -> tuple[torch.Tensor, torch.Tensor]: + # Self-Attention block + if residual is None: + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + else: + # Fused residual add + layernorm if supported + hidden_states, residual = self.input_layernorm( + hidden_states, residual) + hidden_states = self.self_attn(positions=positions, + hidden_states=hidden_states) + # Feed-forward block + hidden_states, residual = self.post_attention_layernorm( + hidden_states, residual) + hidden_states = self.mlp(hidden_states) + return hidden_states, residual + + +@support_torch_compile +class ArceeModel(nn.Module): + """The transformer model backbone for Arcee (embedding layer + stacked + decoder blocks + final norm).""" + + def __init__(self, + *, + vllm_config, + prefix: str = "", + 
layer_type: type[nn.Module] = ArceeDecoderLayer) -> None: + super().__init__() + config: LlamaConfig = vllm_config.model_config.hf_config + cache_config = vllm_config.cache_config + quant_config = vllm_config.quant_config + self.quant_config = quant_config + self.config = config + self.vocab_size = config.vocab_size + self.org_vocab_size = config.vocab_size + + # Word embeddings (parallelized if using pipeline parallel) + if get_pp_group().is_first_rank or (config.tie_word_embeddings + and get_pp_group().is_last_rank): + self.embed_tokens = VocabParallelEmbedding( + self.vocab_size, + config.hidden_size, + org_num_embeddings=config.vocab_size, + quant_config=quant_config, + ) + else: + self.embed_tokens = PPMissingLayer( + ) # placeholder on non-embedding ranks + + # Build decoder layers across pipeline ranks + self.start_layer, self.end_layer, self.layers = make_layers( + config.num_hidden_layers, + lambda prefix: layer_type(config=config, + cache_config=cache_config, + quant_config=quant_config, + prefix=prefix), + prefix=f"{prefix}.layers", + ) + # Final RMSNorm on the last pipeline stage + if get_pp_group().is_last_rank: + self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + else: + self.norm = PPMissingLayer() + + # For optional capturing of intermediate hidden states + # (not used by default) + self.aux_hidden_state_layers: tuple[int, ...] = tuple() + + # Prepare factory for empty intermediate tensors + # (for pipeline scheduling) + self.make_empty_intermediate_tensors = ( + make_empty_intermediate_tensors_factory( + ["hidden_states", "residual"], config.hidden_size)) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.embed_tokens(input_ids) + + def forward( + self, + input_ids: Optional[torch.Tensor], + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors], + inputs_embeds: Optional[torch.Tensor] = None + ) -> Union[torch.Tensor, IntermediateTensors, tuple[torch.Tensor, + list[torch.Tensor]]]: + # Embedding lookup (on first pipeline rank) + if get_pp_group().is_first_rank: + hidden_states = (inputs_embeds if inputs_embeds is not None else + self.get_input_embeddings(input_ids)) + residual = None + else: + assert intermediate_tensors is not None, ( + "IntermediateTensors must be provided for non-first " + "pipeline ranks") + hidden_states = intermediate_tensors["hidden_states"] + residual = intermediate_tensors["residual"] + + aux_hidden_states: list[torch.Tensor] = [] + for idx, layer in enumerate( + self.layers[self.start_layer:self.end_layer]): + if idx in self.aux_hidden_state_layers: + aux_hidden_states.append( + hidden_states + + residual) # capture pre-layer hidden state if needed + hidden_states, residual = layer(positions, hidden_states, residual) + + if not get_pp_group().is_last_rank: + # Send intermediate results to the next pipeline stage + return IntermediateTensors({ + "hidden_states": hidden_states, + "residual": residual + }) + # On last rank: apply final layer norm + hidden_states, _ = self.norm(hidden_states, residual) + if len(aux_hidden_states) > 0: + return hidden_states, aux_hidden_states + return hidden_states + + +class ArceeForCausalLM(nn.Module, SupportsLoRA, SupportsPP): + """Arcee Model for causal language modeling, integrated with vLLM + runtime.""" + # Map fused module names to their sub-module components + # (for quantization and LoRA) + packed_modules_mapping = { + "qkv_proj": ["q_proj", "k_proj", "v_proj"], + } + + def __init__(self, *, vllm_config, prefix: str = "") -> 
None: + super().__init__() + config = vllm_config.model_config.hf_config + self.config = config + + # Initialize the inner Transformer model (ArceeModel) + self.model = ArceeModel(vllm_config=vllm_config, + prefix=f"{prefix}.model") + # On the last pipeline stage, set up the LM head and logits processor + if get_pp_group().is_last_rank: + # Determine vocabulary size (including any LoRA extra tokens + # for padded LM head) + self.unpadded_vocab_size = config.vocab_size + + self.lm_head = ParallelLMHead( + self.unpadded_vocab_size, + config.hidden_size, + org_num_embeddings=config.vocab_size, + padding_size=DEFAULT_VOCAB_PADDING_SIZE, + quant_config=vllm_config.quant_config, + bias=getattr(config, "lm_head_bias", False), + prefix=f"{prefix}.lm_head", + ) + if config.tie_word_embeddings: + # Tie output weights with input embedding matrix + self.lm_head = self.lm_head.tie_weights( + self.model.embed_tokens) + logit_scale = getattr(config, "logit_scale", 1.0) + self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, + config.vocab_size, + logit_scale) + else: + # Placeholder for lm_head on non-last ranks + self.lm_head = PPMissingLayer() + # Provide a reference to the model's method for generating empty + # tensors (used in pipeline parallel schedule) + self.make_empty_intermediate_tensors = ( + self.model.make_empty_intermediate_tensors) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None + ) -> Union[torch.Tensor, IntermediateTensors]: + # Forward pass through the Arcee model backbone + model_output = self.model(input_ids=input_ids, + positions=positions, + intermediate_tensors=intermediate_tensors, + inputs_embeds=inputs_embeds) + return model_output + + def compute_logits(self, hidden_states: torch.Tensor, + sampling_metadata) -> Optional[torch.Tensor]: + # Compute final logits from hidden states (last pipeline rank only) + logits = self.logits_processor(self.lm_head, hidden_states, + sampling_metadata) + return logits + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.model.get_input_embeddings(input_ids) + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + """Load weights into the model (delegates to inner model and handles + tied embeddings).""" + loader = AutoWeightsLoader( + self, + skip_prefixes=(["lm_head."] + if self.config.tie_word_embeddings else None), + skip_substrs=["gate_proj"]) + # AutoWeightLoader handles weight name remapping, including fusing + # separate q_proj, k_proj, v_proj into qkv_proj + return loader.load_weights(weights) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index a85e8b0e7b1..9d88b5fe82c 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -33,6 +33,7 @@ # [Decoder-only] "AquilaModel": ("llama", "LlamaForCausalLM"), "AquilaForCausalLM": ("llama", "LlamaForCausalLM"), # AquilaChat2 + "ArceeForCausalLM": ("arcee", "ArceeForCausalLM"), "ArcticForCausalLM": ("arctic", "ArcticForCausalLM"), "MiniMaxForCausalLM": ("minimax_text_01", "MiniMaxText01ForCausalLM"), "MiniMaxText01ForCausalLM": ("minimax_text_01", "MiniMaxText01ForCausalLM"), From 069ec08b569bc3d1821ccbf9cec127e0b11fdbc3 Mon Sep 17 00:00:00 2001 From: Simon Mo Date: Tue, 22 Jul 2025 01:18:40 -0700 Subject: [PATCH 249/552] [Bugfix] Fix eviction cached blocked logic (#21357) Signed-off-by: 
simon-mo Signed-off-by: x22x22 --- vllm/v1/core/block_pool.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/v1/core/block_pool.py b/vllm/v1/core/block_pool.py index 0fd6947ae0b..cbb6bb26822 100644 --- a/vllm/v1/core/block_pool.py +++ b/vllm/v1/core/block_pool.py @@ -253,7 +253,7 @@ def _maybe_evict_cached_block(self, block: KVCacheBlock) -> bool: return False block.reset_hash() blocks_by_id.pop(block.block_id, None) - if blocks_by_id: + if len(blocks_by_id) == 0: del self.cached_block_hash_to_block[block_hash] if self.enable_kv_cache_events: From 1b705a0f5011b7d3cf9339a2dd2345c0315c3e7a Mon Sep 17 00:00:00 2001 From: Kebe Date: Tue, 22 Jul 2025 20:26:39 +0800 Subject: [PATCH 250/552] [Misc] Remove deprecated args in v0.10 (#21349) Signed-off-by: Kebe Signed-off-by: x22x22 --- .../offline_inference/neuron_speculation.py | 1 - tests/neuron/2_core/test_mistral.py | 1 - tests/neuron/2_core/test_multi_lora.py | 2 -- vllm/engine/arg_utils.py | 21 ------------------- 4 files changed, 25 deletions(-) diff --git a/examples/offline_inference/neuron_speculation.py b/examples/offline_inference/neuron_speculation.py index 26276cba202..7fc22caee74 100644 --- a/examples/offline_inference/neuron_speculation.py +++ b/examples/offline_inference/neuron_speculation.py @@ -37,7 +37,6 @@ def initialize_llm(): max_num_seqs=4, max_model_len=2048, block_size=2048, - use_v2_block_manager=True, device="neuron", tensor_parallel_size=32, ) diff --git a/tests/neuron/2_core/test_mistral.py b/tests/neuron/2_core/test_mistral.py index d02fff943e9..ff59be1725b 100644 --- a/tests/neuron/2_core/test_mistral.py +++ b/tests/neuron/2_core/test_mistral.py @@ -9,7 +9,6 @@ def test_mistral(): tensor_parallel_size=2, max_num_seqs=4, max_model_len=128, - use_v2_block_manager=True, override_neuron_config={ "sequence_parallel_enabled": False, "skip_warmup": True diff --git a/tests/neuron/2_core/test_multi_lora.py b/tests/neuron/2_core/test_multi_lora.py index 6b97f47d4db..52ca9fe7b66 100644 --- a/tests/neuron/2_core/test_multi_lora.py +++ b/tests/neuron/2_core/test_multi_lora.py @@ -14,7 +14,6 @@ def test_llama_single_lora(): tensor_parallel_size=2, max_num_seqs=4, max_model_len=512, - use_v2_block_manager=True, override_neuron_config={ "sequence_parallel_enabled": False, "skip_warmup": True, @@ -57,7 +56,6 @@ def test_llama_multiple_lora(): tensor_parallel_size=2, max_num_seqs=4, max_model_len=512, - use_v2_block_manager=True, override_neuron_config={ "sequence_parallel_enabled": False, diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 1f74d22d07c..1e3d46a8d96 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -313,7 +313,6 @@ class EngineArgs: CacheConfig.prefix_caching_hash_algo disable_sliding_window: bool = ModelConfig.disable_sliding_window disable_cascade_attn: bool = ModelConfig.disable_cascade_attn - use_v2_block_manager: bool = True swap_space: float = CacheConfig.swap_space cpu_offload_gb: float = CacheConfig.cpu_offload_gb gpu_memory_utilization: float = CacheConfig.gpu_memory_utilization @@ -364,7 +363,6 @@ class EngineArgs: max_prompt_adapter_token: int = \ PromptAdapterConfig.max_prompt_adapter_token - device: Device = DeviceConfig.device num_scheduler_steps: int = SchedulerConfig.num_scheduler_steps multi_step_stream_outputs: bool = SchedulerConfig.multi_step_stream_outputs ray_workers_use_nsight: bool = ParallelConfig.ray_workers_use_nsight @@ -745,16 +743,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: 
"--max-prompt-adapter-token", **prompt_adapter_kwargs["max_prompt_adapter_token"]) - # Device arguments - device_kwargs = get_kwargs(DeviceConfig) - device_group = parser.add_argument_group( - title="DeviceConfig", - description=DeviceConfig.__doc__, - ) - device_group.add_argument("--device", - **device_kwargs["device"], - deprecated=True) - # Speculative arguments speculative_group = parser.add_argument_group( title="SpeculativeConfig", @@ -856,15 +844,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: **vllm_kwargs["additional_config"]) # Other arguments - parser.add_argument('--use-v2-block-manager', - action='store_true', - default=True, - deprecated=True, - help='[DEPRECATED] block manager v1 has been ' - 'removed and SelfAttnBlockSpaceManager (i.e. ' - 'block manager v2) is now the default. ' - 'Setting this flag to True or False' - ' has no effect on vLLM behavior.') parser.add_argument('--disable-log-stats', action='store_true', help='Disable logging statistics.') From daab1aa858aeb323b08a3137756bca3bdf3b1249 Mon Sep 17 00:00:00 2001 From: Jialin Ouyang Date: Tue, 22 Jul 2025 05:27:18 -0700 Subject: [PATCH 251/552] [Core] Optimize update checks in LogitsProcessor (#21245) Signed-off-by: Jialin Ouyang Signed-off-by: x22x22 --- vllm/v1/sample/logits_processor.py | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/vllm/v1/sample/logits_processor.py b/vllm/v1/sample/logits_processor.py index 3a4c25964e7..3a06e71057c 100644 --- a/vllm/v1/sample/logits_processor.py +++ b/vllm/v1/sample/logits_processor.py @@ -335,14 +335,19 @@ def update_state(self, batch_update: Optional[BatchUpdate]): if not batch_update: return + needs_update: bool = False # Process added requests. - needs_update = bool(batch_update.added) for index, params, _ in batch_update.added: if isinstance(params, SamplingParams) and (lb := params.logit_bias): self.biases[index] = lb + needs_update = True else: - self.biases.pop(index, None) + # Drop biases metadata at batch index + if self.biases.pop(index, None) is not None: + # If a new request replaces an old request which + # specified biases, we should update processor tensors + needs_update = True if self.biases: # Process removed requests. @@ -419,7 +424,6 @@ def update_state(self, batch_update: Optional[BatchUpdate]): if batch_update: # Process added requests. - needs_update |= bool(batch_update.added) for index, params, output_tok_ids in batch_update.added: if (isinstance(params, SamplingParams) and (min_tokens := params.min_tokens) @@ -427,9 +431,13 @@ def update_state(self, batch_update: Optional[BatchUpdate]): # Replace request metadata at batch index self.min_toks[index] = (min_tokens, output_tok_ids, params.all_stop_token_ids) + needs_update = True else: - # Drop request metadata at batch index - self.min_toks.pop(index, None) + # Drop min_toks metadata at batch index + if self.min_toks.pop(index, None) is not None: + # If a new request replaces an old request which + # specified min_toks, we should update processor tensors + needs_update = True if self.min_toks: # Process removed requests. 
From a0f2a45fc360ef12ddaae93e0235a28406cf4b47 Mon Sep 17 00:00:00 2001 From: Jialin Ouyang Date: Tue, 22 Jul 2025 05:28:00 -0700 Subject: [PATCH 252/552] [benchmark] Port benchmark request sent optimization to benchmark_serving (#21209) Signed-off-by: Jialin Ouyang Signed-off-by: x22x22 --- benchmarks/benchmark_serving.py | 98 +-------------------------------- vllm/benchmarks/serve.py | 10 ++-- 2 files changed, 7 insertions(+), 101 deletions(-) diff --git a/benchmarks/benchmark_serving.py b/benchmarks/benchmark_serving.py index f3a20842137..c597fb1068a 100644 --- a/benchmarks/benchmark_serving.py +++ b/benchmarks/benchmark_serving.py @@ -30,7 +30,7 @@ import random import time import warnings -from collections.abc import AsyncGenerator, Iterable +from collections.abc import Iterable from dataclasses import dataclass from datetime import datetime from typing import Any, Literal, Optional @@ -73,6 +73,7 @@ VisionArenaDataset, ) from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json +from vllm.benchmarks.serve import get_request MILLISECONDS_TO_SECONDS_CONVERSION = 1000 @@ -107,101 +108,6 @@ class BenchmarkMetrics: percentiles_e2el_ms: list[tuple[float, float]] -def _get_current_request_rate( - ramp_up_strategy: Optional[Literal["linear", "exponential"]], - ramp_up_start_rps: Optional[int], - ramp_up_end_rps: Optional[int], - request_index: int, - total_requests: int, - request_rate: float, -) -> float: - if ( - ramp_up_strategy - and ramp_up_start_rps is not None - and ramp_up_end_rps is not None - ): - progress = request_index / max(total_requests - 1, 1) - if ramp_up_strategy == "linear": - increase = (ramp_up_end_rps - ramp_up_start_rps) * progress - return ramp_up_start_rps + increase - elif ramp_up_strategy == "exponential": - ratio = ramp_up_end_rps / ramp_up_start_rps - return ramp_up_start_rps * (ratio**progress) - else: - raise ValueError(f"Unknown ramp-up strategy: {ramp_up_strategy}") - return request_rate - - -async def get_request( - input_requests: list[SampleRequest], - request_rate: float, - burstiness: float = 1.0, - ramp_up_strategy: Optional[Literal["linear", "exponential"]] = None, - ramp_up_start_rps: Optional[int] = None, - ramp_up_end_rps: Optional[int] = None, -) -> AsyncGenerator[tuple[SampleRequest, float], None]: - """ - Asynchronously generates requests at a specified rate - with OPTIONAL burstiness and OPTIONAL ramp-up strategy. - - Args: - input_requests: - A list of input requests, each represented as a SampleRequest. - request_rate: - The rate at which requests are generated (requests/s). - burstiness (optional): - The burstiness factor of the request generation. - Only takes effect when request_rate is not inf. - Default value is 1, which follows a Poisson process. - Otherwise, the request intervals follow a gamma distribution. - A lower burstiness value (0 < burstiness < 1) results - in more bursty requests, while a higher burstiness value - (burstiness > 1) results in a more uniform arrival of requests. - ramp_up_strategy (optional): - The ramp-up strategy. Can be "linear" or "exponential". - If None, uses constant request rate (specified by request_rate). - ramp_up_start_rps (optional): - The starting request rate for ramp-up. - ramp_up_end_rps (optional): - The ending request rate for ramp-up. - """ - assert burstiness > 0, ( - f"A positive burstiness factor is expected, but given {burstiness}." 
- ) - # Convert to list to get length for ramp-up calculations - if isinstance(input_requests, Iterable) and not isinstance(input_requests, list): - input_requests = list(input_requests) - - total_requests = len(input_requests) - request_index = 0 - - for request in input_requests: - current_request_rate = _get_current_request_rate( - ramp_up_strategy, - ramp_up_start_rps, - ramp_up_end_rps, - request_index, - total_requests, - request_rate, - ) - - yield request, current_request_rate - - request_index += 1 - - if current_request_rate == float("inf"): - # If the request rate is infinity, then we don't need to wait. - continue - - theta = 1.0 / (current_request_rate * burstiness) - - # Sample the request interval from the gamma distribution. - # If burstiness is 1, it follows exponential distribution. - interval = np.random.gamma(shape=burstiness, scale=theta) - # The next request will be sent after the interval. - await asyncio.sleep(interval) - - def calculate_metrics( input_requests: list[SampleRequest], outputs: list[RequestFuncOutput], diff --git a/vllm/benchmarks/serve.py b/vllm/benchmarks/serve.py index a4d51936320..f4506c9ce6f 100644 --- a/vllm/benchmarks/serve.py +++ b/vllm/benchmarks/serve.py @@ -179,12 +179,12 @@ async def get_request( delay_ts = [delay * normalize_factor for delay in delay_ts] start_ts = time.time() - request_index = 0 for request_index, request in enumerate(input_requests): - current_ts = time.time() - sleep_interval_s = start_ts + delay_ts[request_index] - current_ts - if sleep_interval_s > 0: - await asyncio.sleep(sleep_interval_s) + if delay_ts[request_index] > 0: + current_ts = time.time() + sleep_interval_s = start_ts + delay_ts[request_index] - current_ts + if sleep_interval_s > 0: + await asyncio.sleep(sleep_interval_s) yield request, request_rates[request_index] From a1cdc67dd0433ffe7de42c7f517044719c07de8e Mon Sep 17 00:00:00 2001 From: Jialin Ouyang Date: Tue, 22 Jul 2025 06:17:47 -0700 Subject: [PATCH 253/552] [Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool (#21222) Signed-off-by: Jialin Ouyang Signed-off-by: x22x22 --- tests/v1/core/test_kv_cache_utils.py | 105 +++++++++++++++++++++++++++ vllm/v1/core/block_pool.py | 40 +++++----- vllm/v1/core/kv_cache_utils.py | 58 +++++++++++++++ 3 files changed, 183 insertions(+), 20 deletions(-) diff --git a/tests/v1/core/test_kv_cache_utils.py b/tests/v1/core/test_kv_cache_utils.py index 68b06015690..ccdbe79dfea 100644 --- a/tests/v1/core/test_kv_cache_utils.py +++ b/tests/v1/core/test_kv_cache_utils.py @@ -184,6 +184,111 @@ def test_free_kv_cache_block_queue_operations(): assert str(e.value) == "No free blocks available" +def test_free_kv_cache_block_queue_append_n(): + # Create an empty FreeKVCacheBlockQueue with these blocks + queue = FreeKVCacheBlockQueue([]) + blocks = [KVCacheBlock(block_id=i) for i in range(6)] + # Append 0 block + # fake_head->fake_tail + queue.append_n([]) + assert queue.num_free_blocks == 0 + assert (queue.fake_free_list_head.next_free_block + is queue.fake_free_list_tail) + assert (queue.fake_free_list_tail.prev_free_block + is queue.fake_free_list_head) + # Append 1 block + # fake_head->b0->fake_tail + queue.append_n(blocks[0:1]) + assert queue.num_free_blocks == 1 + assert queue.fake_free_list_head.next_free_block is blocks[0] + assert blocks[0].prev_free_block is queue.fake_free_list_head + assert blocks[0].next_free_block is queue.fake_free_list_tail + assert queue.fake_free_list_tail.prev_free_block is blocks[0] + # Append 2 blocks 
+ # fake_head->b0->b4->b5->fake_tail + queue.append_n(blocks[4:6]) + assert queue.num_free_blocks == 3 + assert queue.fake_free_list_head.next_free_block is blocks[0] + assert blocks[0].prev_free_block is queue.fake_free_list_head + assert blocks[0].next_free_block is blocks[4] + assert blocks[4].prev_free_block is blocks[0] + assert blocks[4].next_free_block is blocks[5] + assert blocks[5].prev_free_block is blocks[4] + assert blocks[5].next_free_block is queue.fake_free_list_tail + assert queue.fake_free_list_tail.prev_free_block is blocks[5] + # Append 3 blocks + # fake_head->b0->b4->b5->b1->b2->b3->fake_tail + queue.append_n(blocks[1:4]) + assert queue.num_free_blocks == 6 + assert queue.fake_free_list_head.next_free_block is blocks[0] + assert blocks[0].prev_free_block is queue.fake_free_list_head + assert blocks[0].next_free_block is blocks[4] + assert blocks[4].prev_free_block is blocks[0] + assert blocks[4].next_free_block is blocks[5] + assert blocks[5].prev_free_block is blocks[4] + assert blocks[5].next_free_block is blocks[1] + assert blocks[1].prev_free_block is blocks[5] + assert blocks[1].next_free_block is blocks[2] + assert blocks[2].prev_free_block is blocks[1] + assert blocks[2].next_free_block is blocks[3] + assert blocks[3].prev_free_block is blocks[2] + assert blocks[3].next_free_block is queue.fake_free_list_tail + assert queue.fake_free_list_tail.prev_free_block is blocks[3] + + +def test_free_kv_cache_block_queue_popleft_n(): + blocks = [KVCacheBlock(block_id=i) for i in range(6)] + # Create a empty FreeKVCacheBlockQueue with these blocks + queue = FreeKVCacheBlockQueue( + [blocks[1], blocks[3], blocks[5], blocks[4], blocks[0], blocks[2]]) + assert queue.num_free_blocks == 6 + assert queue.fake_free_list_head.next_free_block is blocks[1] + assert blocks[1].prev_free_block is queue.fake_free_list_head + assert blocks[1].next_free_block is blocks[3] + assert blocks[3].prev_free_block is blocks[1] + assert blocks[3].next_free_block is blocks[5] + assert blocks[5].prev_free_block is blocks[3] + assert blocks[5].next_free_block is blocks[4] + assert blocks[4].prev_free_block is blocks[5] + assert blocks[4].next_free_block is blocks[0] + assert blocks[0].prev_free_block is blocks[4] + assert blocks[0].next_free_block is blocks[2] + assert blocks[2].prev_free_block is blocks[0] + assert blocks[2].next_free_block is queue.fake_free_list_tail + assert queue.fake_free_list_tail.prev_free_block is blocks[2] + + # Pop 0 block + # fake_head->b1->b3->b5->b4->b0->b2->fake_tail + assert len(queue.popleft_n(0)) == 0 + # Pop 1 block + # fake_head->b3->b5->b4->b0->b2->fake_tail + result_blocks = queue.popleft_n(1) + assert len(result_blocks) == 1 + assert result_blocks[0] is blocks[1] + for block in result_blocks: + assert block.prev_free_block is None + assert block.next_free_block is None + # Pop 2 blocks + # fake_head->b4->b0->b2->fake_tail + result_blocks = queue.popleft_n(2) + assert len(result_blocks) == 2 + assert result_blocks[0] is blocks[3] + assert result_blocks[1] is blocks[5] + for block in result_blocks: + assert block.prev_free_block is None + assert block.next_free_block is None + # Pop 3 blocks + # fake_head->fake_tail + result_blocks = queue.popleft_n(3) + assert len(result_blocks) == 3 + assert result_blocks[0] is blocks[4] + assert result_blocks[1] is blocks[0] + assert result_blocks[2] is blocks[2] + for block in result_blocks: + assert block.prev_free_block is None + assert block.next_free_block is None + + def 
test_free_kv_cache_block_queue_get_all_free_blocks(): # Create a list of KVCacheBlock objects blocks = [KVCacheBlock(block_id=i) for i in range(5)] diff --git a/vllm/v1/core/block_pool.py b/vllm/v1/core/block_pool.py index cbb6bb26822..5bf4d3a2acb 100644 --- a/vllm/v1/core/block_pool.py +++ b/vllm/v1/core/block_pool.py @@ -214,21 +214,18 @@ def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]: raise ValueError( f"Cannot get {num_blocks} free blocks from the pool") - ret: list[KVCacheBlock] = [] - idx = 0 - while idx < num_blocks: - # First allocate blocks. - curr_block = self.free_block_queue.popleft() - assert curr_block.ref_cnt == 0 - - # If the block is cached, evict it. - if self.enable_caching: - self._maybe_evict_cached_block(curr_block) - - curr_block.incr_ref() - ret.append(curr_block) - idx += 1 - + ret: list[KVCacheBlock] = self.free_block_queue.popleft_n(num_blocks) + + # In order to only iterate the list once, we duplicated code a bit + if self.enable_caching: + for block in ret: + self._maybe_evict_cached_block(block) + assert block.ref_cnt == 0 + block.ref_cnt += 1 + else: + for block in ret: + assert block.ref_cnt == 0 + block.ref_cnt += 1 return ret def _maybe_evict_cached_block(self, block: KVCacheBlock) -> bool: @@ -289,11 +286,14 @@ def free_blocks(self, ordered_blocks: Iterable[KVCacheBlock]) -> None: ordered_blocks: A list of blocks to free ordered by their eviction priority. """ - for block in ordered_blocks: - block.decr_ref() - # null_block should not be added to the free list. - if block.ref_cnt == 0 and not block.is_null: - self.free_block_queue.append(block) + # Materialize the iterable to allow multiple passes. + blocks_list = list(ordered_blocks) + for block in blocks_list: + block.ref_cnt -= 1 + self.free_block_queue.append_n([ + block for block in blocks_list + if block.ref_cnt == 0 and not block.is_null + ]) def reset_prefix_cache(self) -> bool: """Reset prefix cache. This function may be used in RLHF diff --git a/vllm/v1/core/kv_cache_utils.py b/vllm/v1/core/kv_cache_utils.py index 457d95cc738..198d79cfb42 100644 --- a/vllm/v1/core/kv_cache_utils.py +++ b/vllm/v1/core/kv_cache_utils.py @@ -154,6 +154,8 @@ class KVCacheBlock: # Whether the block is a null block that should never be cached. is_null: bool = False + # TODO(Jialin): For performance, let callers handle ref_cnt bumps to + # avoid function calls. def incr_ref(self): self.ref_cnt += 1 @@ -273,6 +275,39 @@ def popleft(self) -> KVCacheBlock: self.num_free_blocks -= 1 return first_block + def popleft_n(self, n: int) -> list[KVCacheBlock]: + """Pop the first n free blocks and reduce num_free_blocks by n. + + Args: + n: The number of blocks to pop. + + Returns: + A list of n free blocks. + """ + if n == 0: + return [] + assert self.num_free_blocks >= n + self.num_free_blocks -= n + + curr_block = self.fake_free_list_head.next_free_block + # Pop n blocks from the head of the list + ret = [] + for _ in range(n): + assert curr_block is not None + ret.append(curr_block) + last_block = curr_block + curr_block = curr_block.next_free_block + # Reset prev_free_block and next_free_block of all popped blocks + last_block.prev_free_block = None + last_block.next_free_block = None + + if curr_block is not None: + # The queue is not empty, connect the fake head to + # the new first block. 
+ self.fake_free_list_head.next_free_block = curr_block + curr_block.prev_free_block = self.fake_free_list_head + return ret + def remove(self, block: KVCacheBlock) -> None: """Remove a block in the free list and reduce num_free_blocks by 1. @@ -315,6 +350,29 @@ def append(self, block: KVCacheBlock) -> None: self.num_free_blocks += 1 + def append_n(self, blocks: list[KVCacheBlock]) -> None: + """Put a list of blocks back into the free list + + Args: + blocks: The blocks to append. + """ + if len(blocks) == 0: + return + self.num_free_blocks += len(blocks) + + last_block = self.fake_free_list_tail.prev_free_block + assert last_block is not None, ( + "prev_free_block of fake_free_list_tail should always exist") + # Add inter-connections between consecutive blocks + for block in blocks: + block.prev_free_block = last_block + last_block.next_free_block = block + last_block = block + + # Connect the last block of to the fake tail + last_block.next_free_block = self.fake_free_list_tail + self.fake_free_list_tail.prev_free_block = last_block + def get_all_free_blocks(self) -> list[KVCacheBlock]: """Get all free blocks in the free list. Mainly used for testing. From cf0b9ab62ba843de8b16a8038ce59e54339375bf Mon Sep 17 00:00:00 2001 From: Ning Xie Date: Tue, 22 Jul 2025 21:32:36 +0800 Subject: [PATCH 254/552] [Misc] unify variable for LLM instance v2 (#21356) Signed-off-by: Andy Xie Signed-off-by: x22x22 --- tests/models/language/generation/test_gemma.py | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tests/models/language/generation/test_gemma.py b/tests/models/language/generation/test_gemma.py index 5be4ae874e6..60a4bc14be8 100644 --- a/tests/models/language/generation/test_gemma.py +++ b/tests/models/language/generation/test_gemma.py @@ -15,13 +15,13 @@ def test_dummy_loader(vllm_runner, monkeypatch, model: str) -> None: load_format="dummy", ) as llm: if model == "google/gemma-3-4b-it": - normalizers = llm.model.collective_rpc( + normalizers = llm.llm.collective_rpc( lambda self: self.model_runner.model.language_model.model. normalizer.cpu().item()) - config = llm.model.llm_engine.model_config.hf_config.text_config + config = llm.llm.llm_engine.model_config.hf_config.text_config else: - normalizers = llm.model.collective_rpc( + normalizers = llm.llm.collective_rpc( lambda self: self.model_runner.model.model.normalizer.cpu( ).item()) - config = llm.model.llm_engine.model_config.hf_config + config = llm.llm.llm_engine.model_config.hf_config assert np.allclose(normalizers, config.hidden_size**0.5, rtol=2e-3) From 95d77b59f9058e3f30a13fefda1f65bcb9867774 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micka=C3=ABl=20Seznec?= Date: Tue, 22 Jul 2025 16:07:44 +0200 Subject: [PATCH 255/552] [perf] Add fused MLA QKV + strided layernorm (#21116) Signed-off-by: Mickael Seznec Co-authored-by: mgoin Signed-off-by: x22x22 --- csrc/layernorm_kernels.cu | 63 +++++++++------ csrc/layernorm_quant_kernels.cu | 39 ++++++---- csrc/quantization/fp8/common.cu | 4 + tests/kernels/core/test_layernorm.py | 26 +++++-- vllm/model_executor/layers/linear.py | 78 ++++++++++++++++++- .../model_executor/layers/quantization/fp8.py | 13 +++- vllm/model_executor/models/deepseek_v2.py | 57 +++++++++----- 7 files changed, 214 insertions(+), 66 deletions(-) diff --git a/csrc/layernorm_kernels.cu b/csrc/layernorm_kernels.cu index d073dd6d2de..f051eb07022 100644 --- a/csrc/layernorm_kernels.cu +++ b/csrc/layernorm_kernels.cu @@ -15,15 +15,16 @@ namespace vllm { // TODO(woosuk): Further optimize this kernel. 
template __global__ void rms_norm_kernel( - scalar_t* __restrict__ out, // [..., hidden_size] - const scalar_t* __restrict__ input, // [..., hidden_size] + scalar_t* __restrict__ out, // [..., hidden_size] + const scalar_t* __restrict__ input, // [..., hidden_size] + const int64_t input_stride, const scalar_t* __restrict__ weight, // [hidden_size] const float epsilon, const int num_tokens, const int hidden_size) { __shared__ float s_variance; float variance = 0.0f; for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - const float x = (float)input[blockIdx.x * hidden_size + idx]; + const float x = (float)input[blockIdx.x * input_stride + idx]; variance += x * x; } @@ -37,7 +38,7 @@ __global__ void rms_norm_kernel( __syncthreads(); for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - float x = (float)input[blockIdx.x * hidden_size + idx]; + float x = (float)input[blockIdx.x * input_stride + idx]; out[blockIdx.x * hidden_size + idx] = ((scalar_t)(x * s_variance)) * weight[idx]; } @@ -50,7 +51,8 @@ __global__ void rms_norm_kernel( template __global__ std::enable_if_t<(width > 0) && _typeConvert::exists> fused_add_rms_norm_kernel( - scalar_t* __restrict__ input, // [..., hidden_size] + scalar_t* __restrict__ input, // [..., hidden_size] + const int64_t input_stride, scalar_t* __restrict__ residual, // [..., hidden_size] const scalar_t* __restrict__ weight, // [hidden_size] const float epsilon, const int num_tokens, const int hidden_size) { @@ -59,6 +61,7 @@ fused_add_rms_norm_kernel( static_assert(sizeof(_f16Vec) == sizeof(scalar_t) * width); const int vec_hidden_size = hidden_size / width; + const int64_t vec_input_stride = input_stride / width; __shared__ float s_variance; float variance = 0.0f; /* These and the argument pointers are all declared `restrict` as they are @@ -73,7 +76,8 @@ fused_add_rms_norm_kernel( for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) { int id = blockIdx.x * vec_hidden_size + idx; - _f16Vec temp = input_v[id]; + int64_t strided_id = blockIdx.x * vec_input_stride + idx; + _f16Vec temp = input_v[strided_id]; temp += residual_v[id]; variance += temp.sum_squares(); residual_v[id] = temp; @@ -90,10 +94,11 @@ fused_add_rms_norm_kernel( for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) { int id = blockIdx.x * vec_hidden_size + idx; + int64_t strided_id = blockIdx.x * vec_input_stride + idx; _f16Vec temp = residual_v[id]; temp *= s_variance; temp *= weight_v[idx]; - input_v[id] = temp; + input_v[strided_id] = temp; } } @@ -103,7 +108,8 @@ fused_add_rms_norm_kernel( template __global__ std::enable_if_t<(width == 0) || !_typeConvert::exists> fused_add_rms_norm_kernel( - scalar_t* __restrict__ input, // [..., hidden_size] + scalar_t* __restrict__ input, // [..., hidden_size] + const int64_t input_stride, scalar_t* __restrict__ residual, // [..., hidden_size] const scalar_t* __restrict__ weight, // [hidden_size] const float epsilon, const int num_tokens, const int hidden_size) { @@ -111,7 +117,7 @@ fused_add_rms_norm_kernel( float variance = 0.0f; for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - scalar_t z = input[blockIdx.x * hidden_size + idx]; + scalar_t z = input[blockIdx.x * input_stride + idx]; z += residual[blockIdx.x * hidden_size + idx]; float x = (float)z; variance += x * x; @@ -129,7 +135,7 @@ fused_add_rms_norm_kernel( for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { float x = (float)residual[blockIdx.x * hidden_size + idx]; - input[blockIdx.x * 
hidden_size + idx] = + input[blockIdx.x * input_stride + idx] = ((scalar_t)(x * s_variance)) * weight[idx]; } } @@ -141,11 +147,12 @@ void rms_norm(torch::Tensor& out, // [..., hidden_size] torch::Tensor& weight, // [hidden_size] double epsilon) { TORCH_CHECK(out.is_contiguous()); - TORCH_CHECK(input.is_contiguous()); + TORCH_CHECK(input.stride(-1) == 1); TORCH_CHECK(weight.is_contiguous()); int hidden_size = input.size(-1); int num_tokens = input.numel() / hidden_size; + int64_t input_stride = input.stride(-2); dim3 grid(num_tokens); dim3 block(std::min(hidden_size, 1024)); @@ -153,26 +160,29 @@ void rms_norm(torch::Tensor& out, // [..., hidden_size] const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "rms_norm_kernel", [&] { vllm::rms_norm_kernel<<>>( - out.data_ptr(), input.data_ptr(), + out.data_ptr(), input.data_ptr(), input_stride, weight.data_ptr(), epsilon, num_tokens, hidden_size); }); } -#define LAUNCH_FUSED_ADD_RMS_NORM(width) \ - VLLM_DISPATCH_FLOATING_TYPES( \ - input.scalar_type(), "fused_add_rms_norm_kernel", [&] { \ - vllm::fused_add_rms_norm_kernel \ - <<>>(input.data_ptr(), \ - residual.data_ptr(), \ - weight.data_ptr(), epsilon, \ - num_tokens, hidden_size); \ +#define LAUNCH_FUSED_ADD_RMS_NORM(width) \ + VLLM_DISPATCH_FLOATING_TYPES( \ + input.scalar_type(), "fused_add_rms_norm_kernel", [&] { \ + vllm::fused_add_rms_norm_kernel \ + <<>>( \ + input.data_ptr(), input_stride, \ + residual.data_ptr(), weight.data_ptr(), \ + epsilon, num_tokens, hidden_size); \ }); void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size] torch::Tensor& residual, // [..., hidden_size] torch::Tensor& weight, // [hidden_size] double epsilon) { + TORCH_CHECK(residual.is_contiguous()); + TORCH_CHECK(weight.is_contiguous()); int hidden_size = input.size(-1); + int64_t input_stride = input.stride(-2); int num_tokens = input.numel() / hidden_size; dim3 grid(num_tokens); @@ -194,9 +204,16 @@ void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size] auto inp_ptr = reinterpret_cast(input.data_ptr()); auto res_ptr = reinterpret_cast(residual.data_ptr()); auto wt_ptr = reinterpret_cast(weight.data_ptr()); - bool ptrs_are_aligned = - inp_ptr % 16 == 0 && res_ptr % 16 == 0 && wt_ptr % 16 == 0; - if (ptrs_are_aligned && hidden_size % 8 == 0) { + constexpr int vector_width = 8; + constexpr int req_alignment_bytes = + vector_width * 2; // vector_width * sizeof(bfloat16 or float16) (float32 + // falls back to non-vectorized version anyway) + bool ptrs_are_aligned = inp_ptr % req_alignment_bytes == 0 && + res_ptr % req_alignment_bytes == 0 && + wt_ptr % req_alignment_bytes == 0; + bool offsets_are_multiple_of_vector_width = + hidden_size % vector_width == 0 && input_stride % vector_width == 0; + if (ptrs_are_aligned && offsets_are_multiple_of_vector_width) { LAUNCH_FUSED_ADD_RMS_NORM(8); } else { LAUNCH_FUSED_ADD_RMS_NORM(0); diff --git a/csrc/layernorm_quant_kernels.cu b/csrc/layernorm_quant_kernels.cu index d595b9e889c..0fd5849d962 100644 --- a/csrc/layernorm_quant_kernels.cu +++ b/csrc/layernorm_quant_kernels.cu @@ -23,8 +23,9 @@ namespace vllm { // TODO(woosuk): Further optimize this kernel. 
template __global__ void rms_norm_static_fp8_quant_kernel( - fp8_type* __restrict__ out, // [..., hidden_size] - const scalar_t* __restrict__ input, // [..., hidden_size] + fp8_type* __restrict__ out, // [..., hidden_size] + const scalar_t* __restrict__ input, // [..., hidden_size] + const int input_stride, const scalar_t* __restrict__ weight, // [hidden_size] const float* __restrict__ scale, // [1] const float epsilon, const int num_tokens, const int hidden_size) { @@ -32,7 +33,7 @@ __global__ void rms_norm_static_fp8_quant_kernel( float variance = 0.0f; for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - const float x = (float)input[blockIdx.x * hidden_size + idx]; + const float x = (float)input[blockIdx.x * input_stride + idx]; variance += x * x; } @@ -49,7 +50,7 @@ __global__ void rms_norm_static_fp8_quant_kernel( float const scale_inv = 1.0f / *scale; for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - float x = (float)input[blockIdx.x * hidden_size + idx]; + float x = (float)input[blockIdx.x * input_stride + idx]; float const out_norm = ((scalar_t)(x * s_variance)) * weight[idx]; out[blockIdx.x * hidden_size + idx] = scaled_fp8_conversion(out_norm, scale_inv); @@ -63,8 +64,9 @@ __global__ void rms_norm_static_fp8_quant_kernel( template __global__ std::enable_if_t<(width > 0) && _typeConvert::exists> fused_add_rms_norm_static_fp8_quant_kernel( - fp8_type* __restrict__ out, // [..., hidden_size] - scalar_t* __restrict__ input, // [..., hidden_size] + fp8_type* __restrict__ out, // [..., hidden_size] + scalar_t* __restrict__ input, // [..., hidden_size] + const int input_stride, scalar_t* __restrict__ residual, // [..., hidden_size] const scalar_t* __restrict__ weight, // [hidden_size] const float* __restrict__ scale, // [1] @@ -74,6 +76,7 @@ fused_add_rms_norm_static_fp8_quant_kernel( static_assert(sizeof(_f16Vec) == sizeof(scalar_t) * width); const int vec_hidden_size = hidden_size / width; + const int vec_input_stride = input_stride / width; __shared__ float s_variance; float variance = 0.0f; /* These and the argument pointers are all declared `restrict` as they are @@ -87,8 +90,9 @@ fused_add_rms_norm_static_fp8_quant_kernel( reinterpret_cast*>(weight); for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) { + int stride_id = blockIdx.x * vec_input_stride + idx; int id = blockIdx.x * vec_hidden_size + idx; - _f16Vec temp = input_v[id]; + _f16Vec temp = input_v[stride_id]; temp += residual_v[id]; variance += temp.sum_squares(); residual_v[id] = temp; @@ -125,8 +129,9 @@ fused_add_rms_norm_static_fp8_quant_kernel( template __global__ std::enable_if_t<(width == 0) || !_typeConvert::exists> fused_add_rms_norm_static_fp8_quant_kernel( - fp8_type* __restrict__ out, // [..., hidden_size] - scalar_t* __restrict__ input, // [..., hidden_size] + fp8_type* __restrict__ out, // [..., hidden_size] + scalar_t* __restrict__ input, // [..., hidden_size] + const int input_stride, scalar_t* __restrict__ residual, // [..., hidden_size] const scalar_t* __restrict__ weight, // [hidden_size] const float* __restrict__ scale, // [1] @@ -135,7 +140,7 @@ fused_add_rms_norm_static_fp8_quant_kernel( float variance = 0.0f; for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - scalar_t z = input[blockIdx.x * hidden_size + idx]; + scalar_t z = input[blockIdx.x * input_stride + idx]; z += residual[blockIdx.x * hidden_size + idx]; float x = (float)z; variance += x * x; @@ -169,7 +174,9 @@ void rms_norm_static_fp8_quant(torch::Tensor& out, // [..., 
hidden_size] torch::Tensor& weight, // [hidden_size] torch::Tensor& scale, // [1] double epsilon) { + TORCH_CHECK(out.is_contiguous()); int hidden_size = input.size(-1); + int input_stride = input.stride(-2); int num_tokens = input.numel() / hidden_size; dim3 grid(num_tokens); @@ -183,8 +190,9 @@ void rms_norm_static_fp8_quant(torch::Tensor& out, // [..., hidden_size] vllm::rms_norm_static_fp8_quant_kernel <<>>( out.data_ptr(), input.data_ptr(), - weight.data_ptr(), scale.data_ptr(), - epsilon, num_tokens, hidden_size); + input_stride, weight.data_ptr(), + scale.data_ptr(), epsilon, num_tokens, + hidden_size); }); }); } @@ -198,7 +206,7 @@ void rms_norm_static_fp8_quant(torch::Tensor& out, // [..., hidden_size] width, fp8_t> \ <<>>( \ out.data_ptr(), input.data_ptr(), \ - residual.data_ptr(), \ + input_stride, residual.data_ptr(), \ weight.data_ptr(), scale.data_ptr(), \ epsilon, num_tokens, hidden_size); \ }); \ @@ -210,7 +218,10 @@ void fused_add_rms_norm_static_fp8_quant( torch::Tensor& weight, // [hidden_size] torch::Tensor& scale, // [1] double epsilon) { + TORCH_CHECK(out.is_contiguous()); + TORCH_CHECK(residual.is_contiguous()); int hidden_size = input.size(-1); + int input_stride = input.stride(-2); int num_tokens = input.numel() / hidden_size; dim3 grid(num_tokens); @@ -234,7 +245,7 @@ void fused_add_rms_norm_static_fp8_quant( auto wt_ptr = reinterpret_cast(weight.data_ptr()); bool ptrs_are_aligned = inp_ptr % 16 == 0 && res_ptr % 16 == 0 && wt_ptr % 16 == 0; - if (ptrs_are_aligned && hidden_size % 8 == 0) { + if (ptrs_are_aligned && hidden_size % 8 == 0 && input_stride % 8 == 0) { LAUNCH_FUSED_ADD_RMS_NORM(8); } else { LAUNCH_FUSED_ADD_RMS_NORM(0); diff --git a/csrc/quantization/fp8/common.cu b/csrc/quantization/fp8/common.cu index f3f9f669e00..0e1eab66f0b 100644 --- a/csrc/quantization/fp8/common.cu +++ b/csrc/quantization/fp8/common.cu @@ -88,6 +88,8 @@ void static_scaled_fp8_quant(torch::Tensor& out, // [..., d] torch::Tensor const& input, // [..., d] torch::Tensor const& scale) // [1] { + TORCH_CHECK(input.is_contiguous()); + TORCH_CHECK(out.is_contiguous()); int const block_size = 256; int const num_tokens = input.numel() / input.size(-1); int const num_elems = input.numel(); @@ -111,6 +113,8 @@ void dynamic_scaled_fp8_quant(torch::Tensor& out, // [..., d] torch::Tensor const& input, // [..., d] torch::Tensor& scale) // [1] { + TORCH_CHECK(input.is_contiguous()); + TORCH_CHECK(out.is_contiguous()); int const block_size = 256; int const num_tokens = input.numel() / input.size(-1); int const num_elems = input.numel(); diff --git a/tests/kernels/core/test_layernorm.py b/tests/kernels/core/test_layernorm.py index 3eac062738f..02316ceaac7 100644 --- a/tests/kernels/core/test_layernorm.py +++ b/tests/kernels/core/test_layernorm.py @@ -26,6 +26,7 @@ @pytest.mark.parametrize("dtype", DTYPES) @pytest.mark.parametrize("seed", SEEDS) @pytest.mark.parametrize("device", CUDA_DEVICES) +@pytest.mark.parametrize("strided_input", [False, True]) @torch.inference_mode() def test_rms_norm( num_tokens: int, @@ -34,13 +35,17 @@ def test_rms_norm( dtype: torch.dtype, seed: int, device: str, + strided_input: bool, ) -> None: current_platform.seed_everything(seed) torch.set_default_device(device) layer = RMSNorm(hidden_size).to(dtype=dtype) layer.weight.data.normal_(mean=1.0, std=0.1) scale = 1 / (2 * hidden_size) - x = torch.randn(num_tokens, hidden_size, dtype=dtype) + last_dim = 2 * hidden_size if strided_input else hidden_size + x = torch.randn(num_tokens, last_dim, dtype=dtype) + x = x[..., 
:hidden_size] + assert x.is_contiguous() != strided_input x *= scale residual = torch.randn_like(x) * scale if add_residual else None @@ -72,6 +77,7 @@ def test_rms_norm( @pytest.mark.parametrize("quant_scale", [1.0, 0.01, 10.0]) @pytest.mark.parametrize("seed", SEEDS) @pytest.mark.parametrize("device", CUDA_DEVICES) +@pytest.mark.parametrize("strided_input", [False, True]) def test_fused_rms_norm_quant( num_tokens: int, hidden_size: int, @@ -80,13 +86,18 @@ def test_fused_rms_norm_quant( quant_scale: float, seed: int, device: str, + strided_input: bool, ) -> None: current_platform.seed_everything(seed) torch.set_default_device(device) weight = torch.empty(hidden_size, dtype=dtype).normal_(mean=1.0, std=0.1) scale = 1 / (2 * hidden_size) - x = torch.randn(num_tokens, hidden_size, dtype=dtype) + last_dim = 2 * hidden_size if strided_input else hidden_size + x_base = torch.randn(num_tokens, last_dim, dtype=dtype) + x = x_base[..., :hidden_size] + assert x.is_contiguous() != strided_input + x *= scale if add_residual: residual = torch.randn_like(x) * scale @@ -106,9 +117,11 @@ def test_fused_rms_norm_quant( # Unfused kernel is in-place so it goes second # Also use a separate clone of x to avoid modifying the input - x_unfused = x.clone() + x_unfused_base = x_base.clone() + x_unfused = x_unfused_base[..., :hidden_size] + assert x_unfused.is_contiguous() != strided_input torch.ops._C.fused_add_rms_norm(x_unfused, residual, weight, 1e-6) - torch.ops._C.static_scaled_fp8_quant(out_quant, x_unfused, + torch.ops._C.static_scaled_fp8_quant(out_quant, x_unfused.contiguous(), quant_scale_t) torch.cuda.synchronize() @@ -116,7 +129,6 @@ def test_fused_rms_norm_quant( residual, atol=1e-2, rtol=1e-2) - opcheck( torch.ops._C.fused_add_rms_norm_static_fp8_quant, (out_quant_fused, x, residual_fused, weight, quant_scale_t, 1e-6)) @@ -131,7 +143,7 @@ def test_fused_rms_norm_quant( opcheck(torch.ops._C.rms_norm_static_fp8_quant, (out_quant_fused, x, weight, quant_scale_t, 1e-6)) - torch.testing.assert_close(out_quant_fused.to(dtype=torch.float32), - out_quant.to(dtype=torch.float32), + torch.testing.assert_close(out_quant.to(dtype=torch.float32), + out_quant_fused.to(dtype=torch.float32), atol=1e-3, rtol=1e-3) diff --git a/vllm/model_executor/layers/linear.py b/vllm/model_executor/layers/linear.py index 366dfd97d81..bb81a663d45 100644 --- a/vllm/model_executor/layers/linear.py +++ b/vllm/model_executor/layers/linear.py @@ -259,6 +259,8 @@ def __init__( if params_dtype is None: params_dtype = torch.get_default_dtype() self.params_dtype = params_dtype + self.quant_config = quant_config + self.prefix = prefix if quant_config is None: self.quant_method: Optional[ QuantizeMethodBase] = UnquantizedLinearMethod() @@ -300,6 +302,12 @@ def __init__( *, return_bias: bool = True, ): + # If MergedReplicatedLinear, use output size of each partition. + if hasattr(self, "output_sizes"): + self.output_partition_sizes = self.output_sizes + else: + self.output_partition_sizes = [output_size] + super().__init__(input_size, output_size, skip_bias_add, @@ -311,7 +319,8 @@ def __init__( # All the linear layer supports quant method. assert self.quant_method is not None self.quant_method.create_weights(self, - self.input_size, [self.output_size], + self.input_size, + self.output_partition_sizes, self.input_size, self.output_size, self.params_dtype, @@ -367,6 +376,73 @@ def extra_repr(self) -> str: return s +class MergedReplicatedLinear(ReplicatedLinear): + """Replicated linear layer. 
+ + Args: + input_size: input dimension of the linear layer. + output_size: output dimension of the linear layer. + bias: If true, add bias. + skip_bias_add: If true, skip adding bias but instead return it. + params_dtype: Data type for the parameters. + quant_config: Quantization configure. + prefix: The name of the layer in the state dict, including all parents + (e.g. model.layers.0.qkv_proj) + """ + + def __init__( + self, + input_size: int, + output_sizes: list[int], + bias: bool = True, + skip_bias_add: bool = False, + params_dtype: Optional[torch.dtype] = None, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + *, + return_bias: bool = True, + ): + self.output_sizes = output_sizes + super().__init__(input_size, + sum(output_sizes), + bias, + skip_bias_add, + params_dtype, + quant_config, + prefix=prefix, + return_bias=return_bias) + + def weight_loader(self, + param: Union[Parameter, BasevLLMParameter], + loaded_weight: torch.Tensor, + loaded_shard_id: Optional[int] = None): + assert loaded_shard_id is not None + assert loaded_shard_id < len(self.output_sizes) + + if isinstance(param, BlockQuantScaleParameter): + from vllm.model_executor.layers.quantization.fp8 import ( + Fp8LinearMethod, Fp8MoEMethod) + assert self.quant_method is not None + assert isinstance(self.quant_method, + (Fp8LinearMethod, Fp8MoEMethod)) + weight_block_size = self.quant_method.quant_config.weight_block_size + assert weight_block_size is not None + block_n, _ = weight_block_size[0], weight_block_size[1] + shard_offset = ( + (sum(self.output_sizes[:loaded_shard_id]) + block_n - 1) // + block_n) + shard_size = ((self.output_sizes[loaded_shard_id] + block_n - 1) // + block_n) + elif isinstance(param, PerTensorScaleParameter): + shard_offset = loaded_shard_id + shard_size = 1 + else: + shard_offset = sum(self.output_sizes[:loaded_shard_id]) + shard_size = self.output_sizes[loaded_shard_id] + + param[shard_offset:shard_offset + shard_size] = loaded_weight + + class ColumnParallelLinear(LinearBase): """Linear layer with column parallelism. 
diff --git a/vllm/model_executor/layers/quantization/fp8.py b/vllm/model_executor/layers/quantization/fp8.py index 35d7545d8c6..75f8adf34f7 100644 --- a/vllm/model_executor/layers/quantization/fp8.py +++ b/vllm/model_executor/layers/quantization/fp8.py @@ -257,9 +257,16 @@ def create_weights( f"{input_size_per_partition} is not divisible by " f"weight quantization block_k = {block_k}.") # Required by column parallel or enabling merged weights - if (tp_size > 1 and output_size // output_size_per_partition - == tp_size) or len(output_partition_sizes) > 1: - for output_partition_size in output_partition_sizes: + is_tp_split = (tp_size > 1 and + output_size // output_size_per_partition == tp_size) + is_merged_gemm = len(output_partition_sizes) > 1 + if is_tp_split or is_merged_gemm: + sizes_to_check = output_partition_sizes + if not is_tp_split and is_merged_gemm: + # In case of merged matrices, we allow the last + # matrix to not be a multiple of block size + sizes_to_check = output_partition_sizes[:-1] + for output_partition_size in sizes_to_check: if output_partition_size % block_n != 0: raise ValueError( f"Weight output_partition_size = " diff --git a/vllm/model_executor/models/deepseek_v2.py b/vllm/model_executor/models/deepseek_v2.py index 5106b9914b5..649109777b3 100644 --- a/vllm/model_executor/models/deepseek_v2.py +++ b/vllm/model_executor/models/deepseek_v2.py @@ -42,6 +42,7 @@ from vllm.model_executor.layers.layernorm import RMSNorm from vllm.model_executor.layers.linear import (ColumnParallelLinear, MergedColumnParallelLinear, + MergedReplicatedLinear, ReplicatedLinear, RowParallelLinear) from vllm.model_executor.layers.logits_processor import LogitsProcessor @@ -336,7 +337,7 @@ def forward( kv_a, _ = latent_cache.split( [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1) latent_cache = latent_cache.unsqueeze(1) - kv_a = self.kv_a_layernorm(kv_a.contiguous()) + kv_a = self.kv_a_layernorm(kv_a) kv = self.kv_b_proj(kv_a)[0] kv = kv.view(-1, self.num_local_heads, self.qk_nope_head_dim + self.v_head_dim) @@ -407,14 +408,24 @@ def __init__( self.max_position_embeddings = max_position_embeddings if self.q_lora_rank is not None: - self.q_a_proj = ReplicatedLinear(self.hidden_size, - self.q_lora_rank, - bias=False, - quant_config=quant_config, - prefix=f"{prefix}.q_a_proj") + self.fused_qkv_a_proj = MergedReplicatedLinear( + self.hidden_size, + [self.q_lora_rank, self.kv_lora_rank + self.qk_rope_head_dim], + bias=False, + quant_config=quant_config, + prefix=f"{prefix}.fused_qkv_a_proj") + else: + self.kv_a_proj_with_mqa = ReplicatedLinear( + self.hidden_size, + self.kv_lora_rank + self.qk_rope_head_dim, + bias=False, + quant_config=quant_config, + prefix=f"{prefix}.kv_a_proj_with_mqa") + + if self.q_lora_rank is not None: self.q_a_layernorm = RMSNorm(self.q_lora_rank, eps=config.rms_norm_eps) - self.q_b_proj = ColumnParallelLinear(q_lora_rank, + self.q_b_proj = ColumnParallelLinear(self.q_lora_rank, self.num_heads * self.qk_head_dim, bias=False, @@ -427,13 +438,6 @@ def __init__( bias=False, quant_config=quant_config, prefix=f"{prefix}.q_proj") - - self.kv_a_proj_with_mqa = ReplicatedLinear( - self.hidden_size, - self.kv_lora_rank + self.qk_rope_head_dim, - bias=False, - quant_config=quant_config, - prefix=f"{prefix}.kv_a_proj_with_mqa") self.kv_a_layernorm = RMSNorm(self.kv_lora_rank, eps=config.rms_norm_eps) self.kv_b_proj = ColumnParallelLinear( @@ -495,15 +499,24 @@ def forward( positions: torch.Tensor, hidden_states: torch.Tensor, ) -> torch.Tensor: + q_c = None + kv_lora = None + if 
self.q_lora_rank is not None: - q_c = self.q_a_proj(hidden_states)[0] + qkv_lora = self.fused_qkv_a_proj(hidden_states)[0] + q_c, kv_lora = qkv_lora.split( + [self.q_lora_rank, self.kv_lora_rank + self.qk_rope_head_dim], + dim=-1, + ) q_c = self.q_a_layernorm(q_c) q = self.q_b_proj(q_c)[0] else: + kv_lora = self.kv_a_proj_with_mqa(hidden_states)[0] q = self.q_proj(hidden_states)[0] - kv_c, k_pe = self.kv_a_proj_with_mqa(hidden_states)[0].split( - [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1) - kv_c_normed = self.kv_a_layernorm(kv_c.contiguous()) + + kv_c, k_pe = kv_lora.split([self.kv_lora_rank, self.qk_rope_head_dim], + dim=-1) + kv_c_normed = self.kv_a_layernorm(kv_c) q = q.view(-1, self.num_local_heads, self.qk_head_dim) # Add head dim of 1 to k_pe @@ -837,6 +850,8 @@ def load_weights(self, weights: Iterable[tuple[str, # (param_name, shard_name, shard_id) ("gate_up_proj", "gate_proj", 0), ("gate_up_proj", "up_proj", 1), + ("fused_qkv_a_proj", "q_a_proj", 0), + ("fused_qkv_a_proj", "kv_a_proj_with_mqa", 1), ] # Params for weights, fp8 weight scales, fp8 activation scales @@ -871,6 +886,12 @@ def load_weights(self, weights: Iterable[tuple[str, if (("mlp.experts." in name) and name not in params_dict): continue name = name.replace(weight_name, param_name) + + # QKV fusion is optional, fall back to normal + # weight loading if it's not enabled + if ((param_name == "fused_qkv_a_proj") + and name not in params_dict): + continue # Skip loading extra bias for GPTQ models. if name.endswith(".bias") and name not in params_dict: continue From e217ff6b96dc902a388bfc02e4ec7026e8f44303 Mon Sep 17 00:00:00 2001 From: Duncan Moss Date: Tue, 22 Jul 2025 07:27:12 -0700 Subject: [PATCH 256/552] [feat]: add SM100 support for cutlass FP8 groupGEMM (#20447) Signed-off-by: Duncan Moss Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Co-authored-by: mgoin Signed-off-by: x22x22 --- CMakeLists.txt | 22 ++- .../cutlass_w8a8/moe/grouped_mm_c3x.cuh | 13 +- .../cutlass_w8a8/moe/grouped_mm_c3x_sm100.cu | 140 ++++++++++++++++++ ...ouped_mm_c3x.cu => grouped_mm_c3x_sm90.cu} | 30 ++-- .../quantization/cutlass_w8a8/moe/moe_data.cu | 2 +- .../cutlass_w8a8/scaled_mm_entry.cu | 45 ++++-- .../compressed_tensors/compressed_tensors.py | 6 + .../compressed_tensors_moe.py | 29 +++- 8 files changed, 255 insertions(+), 32 deletions(-) create mode 100644 csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm100.cu rename csrc/quantization/cutlass_w8a8/moe/{grouped_mm_c3x.cu => grouped_mm_c3x_sm90.cu} (88%) diff --git a/CMakeLists.txt b/CMakeLists.txt index edc64f87730..10f8667db64 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -577,7 +577,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") # if it's possible to compile MoE kernels that use its output. 
cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a" "${CUDA_ARCHS}") if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND SCALED_MM_ARCHS) - set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu") + set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm90.cu") set_gencode_flags_for_srcs( SRCS "${SRCS}" CUDA_ARCHS "${SCALED_MM_ARCHS}") @@ -595,6 +595,26 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") endif() endif() + cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}") + if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS) + set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm100.cu") + set_gencode_flags_for_srcs( + SRCS "${SRCS}" + CUDA_ARCHS "${SCALED_MM_ARCHS}") + list(APPEND VLLM_EXT_SRC "${SRCS}") + list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1") + message(STATUS "Building grouped_mm_c3x for archs: ${SCALED_MM_ARCHS}") + else() + if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS) + message(STATUS "Not building grouped_mm_c3x kernels as CUDA Compiler version is " + "not >= 12.8, we recommend upgrading to CUDA 12.8 or later " + "if you intend on running FP8 quantized MoE models on Blackwell.") + else() + message(STATUS "Not building grouped_mm_c3x as no compatible archs found " + "in CUDA target architectures.") + endif() + endif() + # moe_data.cu is used by all CUTLASS MoE kernels. cuda_archs_loose_intersection(CUTLASS_MOE_DATA_ARCHS "9.0a;10.0a" "${CUDA_ARCHS}") if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND CUTLASS_MOE_DATA_ARCHS) diff --git a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cuh b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cuh index 3225378a6ca..659941de182 100644 --- a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cuh +++ b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cuh @@ -18,7 +18,6 @@ using ProblemShape = cutlass::gemm::GroupProblemShape>; using ElementAccumulator = float; -using ArchTag = cutlass::arch::Sm90; using OperatorClass = cutlass::arch::OpClassTensorOp; using LayoutA = cutlass::layout::RowMajor; @@ -33,7 +32,7 @@ using LayoutD_Transpose = using LayoutC = LayoutD; using LayoutC_Transpose = LayoutD_Transpose; -template typename Epilogue_, typename TileShape, typename ClusterShape, typename KernelSchedule, typename EpilogueSchedule, bool swap_ab_ = false> @@ -43,6 +42,7 @@ struct cutlass_3x_group_gemm { using ElementC = void; using ElementD = ElementC_; using ElementAccumulator = float; + using ArchTag = ArchTag_; using Epilogue = Epilogue_; @@ -77,7 +77,7 @@ struct cutlass_3x_group_gemm { LayoutB*, AlignmentAB, ElementAccumulator, TileShape, ClusterShape, Stages, KernelSchedule>::CollectiveOp>; - using KernelType = enable_sm90_only>; struct GemmKernel : public KernelType {}; @@ -156,9 +156,14 @@ void cutlass_group_gemm_caller( static_cast(out_ptrs.data_ptr()), static_cast(c_strides.data_ptr())}; + int device_id = a_tensors.device().index(); + static const cutlass::KernelHardwareInfo hw_info{ + device_id, cutlass::KernelHardwareInfo::query_device_multiprocessor_count( + device_id)}; + typename GemmKernel::Arguments args{ cutlass::gemm::GemmUniversalMode::kGrouped, prob_shape, mainloop_args, - epilogue_args}; + epilogue_args, hw_info}; using GemmOp = cutlass::gemm::device::GemmUniversalAdapter; GemmOp gemm_op; diff --git a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm100.cu b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm100.cu new file mode 100644 index 00000000000..641e5997f0f --- 
/dev/null +++ b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm100.cu @@ -0,0 +1,140 @@ +#include + +#include +#include + +#include "cutlass/cutlass.h" +#include "grouped_mm_c3x.cuh" + +using namespace cute; + +namespace { + +template typename Epilogue> +struct sm100_fp8_config_default { + static_assert(std::is_same()); + using KernelSchedule = + cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100; + using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm; + using TileShape = cute::Shape; + using ClusterShape = cute::Shape; + using ArchTag = cutlass::arch::Sm100; + + using Cutlass3xGemm = + cutlass_3x_group_gemm; +}; + +template typename Epilogue> +struct sm100_fp8_config_M64 { + // M in [1,64] + static_assert(std::is_same()); + using KernelSchedule = + cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100; + using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm; + using TileShape = cute::Shape; + using ClusterShape = cute::Shape; + using ArchTag = cutlass::arch::Sm100; + + using Cutlass3xGemm = + cutlass_3x_group_gemm; +}; + +template typename Epilogue> +struct sm100_fp8_config_N8192 { + // N in [8192, inf) + static_assert(std::is_same()); + using KernelSchedule = + cutlass::gemm::KernelPtrArrayTmaWarpSpecialized2SmSm100; + using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized2Sm; + using TileShape = cute::Shape; + using ClusterShape = cute::Shape; + using ArchTag = cutlass::arch::Sm100; + + using Cutlass3xGemm = + cutlass_3x_group_gemm; +}; + +template +void run_cutlass_moe_mm_sm100( + torch::Tensor& out_tensors, torch::Tensor const& a_tensors, + torch::Tensor const& b_tensors, torch::Tensor const& a_scales, + torch::Tensor const& b_scales, torch::Tensor const& expert_offsets, + torch::Tensor const& problem_sizes, torch::Tensor const& a_strides, + torch::Tensor const& b_strides, torch::Tensor const& c_strides, + bool per_act_token, bool per_out_ch) { + TORCH_CHECK(a_tensors.size(0) > 0, "No input A tensors provided."); + TORCH_CHECK(b_tensors.size(0) > 0, "No input B tensors provided."); + TORCH_CHECK(out_tensors.size(0) > 0, "No output tensors provided."); + + TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn, + "A tensors must be of type float8_e4m3fn."); + TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn, + "B tensors must be of type float8_e4m3fn."); + + using Cutlass3xGemmDefault = typename sm100_fp8_config_default< + InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; + using Cutlass3xGemmN8192 = typename sm100_fp8_config_N8192< + InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; + using Cutlass3xGemmM64 = typename sm100_fp8_config_M64< + InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; + + uint32_t const m = a_tensors.size(0); + uint32_t const n = out_tensors.size(1); + + if (m <= 64) { + cutlass_group_gemm_caller( + out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, + problem_sizes, a_strides, b_strides, c_strides, per_act_token, + per_out_ch); + } else if (n >= 8192) { + cutlass_group_gemm_caller( + out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, + problem_sizes, a_strides, b_strides, c_strides, per_act_token, + per_out_ch); + } else { + cutlass_group_gemm_caller( + out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, + problem_sizes, a_strides, b_strides, c_strides, per_act_token, + per_out_ch); + } +} +} // namespace + +void dispatch_moe_mm_sm100( + torch::Tensor& out_tensors, torch::Tensor const& 
a_tensors, + torch::Tensor const& b_tensors, torch::Tensor const& a_scales, + torch::Tensor const& b_scales, torch::Tensor const& expert_offsets, + torch::Tensor const& problem_sizes, torch::Tensor const& a_strides, + torch::Tensor const& b_strides, torch::Tensor const& c_strides, + bool per_act_token, bool per_out_ch) { + if (out_tensors.dtype() == torch::kBFloat16) { + run_cutlass_moe_mm_sm100( + out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, + problem_sizes, a_strides, b_strides, c_strides, per_act_token, + per_out_ch); + } else { + run_cutlass_moe_mm_sm100( + out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, + problem_sizes, a_strides, b_strides, c_strides, per_act_token, + per_out_ch); + } +} + +void cutlass_moe_mm_sm100( + torch::Tensor& out_tensors, torch::Tensor const& a_tensors, + torch::Tensor const& b_tensors, torch::Tensor const& a_scales, + torch::Tensor const& b_scales, torch::Tensor const& expert_offsets, + torch::Tensor const& problem_sizes, torch::Tensor const& a_strides, + torch::Tensor const& b_strides, torch::Tensor const& c_strides, + bool per_act_token, bool per_out_ch) { + dispatch_moe_mm_sm100(out_tensors, a_tensors, b_tensors, a_scales, b_scales, + expert_offsets, problem_sizes, a_strides, b_strides, + c_strides, per_act_token, per_out_ch); +} diff --git a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm90.cu similarity index 88% rename from csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu rename to csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm90.cu index b024482208d..8f21623b52f 100644 --- a/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu +++ b/csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm90.cu @@ -21,10 +21,11 @@ struct sm90_fp8_config_default { cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; using TileShape = cute::Shape; using ClusterShape = cute::Shape; + using ArchTag = cutlass::arch::Sm90; using Cutlass3xGemm = - cutlass_3x_group_gemm; + cutlass_3x_group_gemm; }; template ; using ClusterShape = cute::Shape; + using ArchTag = cutlass::arch::Sm90; using Cutlass3xGemm = - cutlass_3x_group_gemm; + cutlass_3x_group_gemm; }; template ; using ClusterShape = cute::Shape; + using ArchTag = cutlass::arch::Sm90; using Cutlass3xGemm = - cutlass_3x_group_gemm; + cutlass_3x_group_gemm; }; template ; using ClusterShape = cute::Shape; + using ArchTag = cutlass::arch::Sm90; using Cutlass3xGemm = - cutlass_3x_group_gemm; + cutlass_3x_group_gemm; }; template ; using ClusterShape = cute::Shape; + using ArchTag = cutlass::arch::Sm90; using Cutlass3xGemm = - cutlass_3x_group_gemm; + cutlass_3x_group_gemm; }; template @@ -112,9 +119,6 @@ void run_cutlass_moe_mm_sm90( TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn, "B tensors must be of type float8_e4m3fn."); - TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn); - TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn); - using Cutlass3xGemmN8192 = typename sm90_fp8_config_N8192< InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; using Cutlass3xGemmK8192 = typename sm90_fp8_config_K8192< diff --git a/csrc/quantization/cutlass_w8a8/moe/moe_data.cu b/csrc/quantization/cutlass_w8a8/moe/moe_data.cu index 623c9a2f096..993c30c48c8 100644 --- a/csrc/quantization/cutlass_w8a8/moe/moe_data.cu +++ b/csrc/quantization/cutlass_w8a8/moe/moe_data.cu @@ -190,4 +190,4 @@ void get_cutlass_pplx_moe_mm_data_caller(torch::Tensor& expert_offsets, static_cast(problem_sizes2.data_ptr()), 
static_cast(expert_num_tokens.data_ptr()), padded_m, n, k); -} +} \ No newline at end of file diff --git a/csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu b/csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu index 31b60488dfb..106bacb4883 100644 --- a/csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu +++ b/csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu @@ -41,6 +41,16 @@ void cutlass_moe_mm_sm90( #endif +#if defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100 +void cutlass_moe_mm_sm100( + torch::Tensor& out_tensors, torch::Tensor const& a_tensors, + torch::Tensor const& b_tensors, torch::Tensor const& a_scales, + torch::Tensor const& b_scales, torch::Tensor const& expert_offsets, + torch::Tensor const& problem_sizes, torch::Tensor const& a_strides, + torch::Tensor const& b_strides, torch::Tensor const& c_strides, + bool per_act_token, bool per_out_ch); +#endif + #if defined ENABLE_SCALED_MM_SM120 && ENABLE_SCALED_MM_SM120 void cutlass_scaled_mm_sm120(torch::Tensor& c, torch::Tensor const& a, torch::Tensor const& b, @@ -130,10 +140,10 @@ bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability) { // and at least SM90 (Hopper) #if defined CUDA_VERSION - if (cuda_device_capability >= 90 && cuda_device_capability < 100) { - return CUDA_VERSION >= 12000; - } else if (cuda_device_capability >= 100) { + if (cuda_device_capability >= 100) { return CUDA_VERSION >= 12080; + } else if (cuda_device_capability >= 90) { + return CUDA_VERSION >= 12000; } #endif @@ -141,11 +151,14 @@ bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability) { } bool cutlass_group_gemm_supported(int64_t cuda_device_capability) { - // CUTLASS grouped FP8 kernels need at least CUDA 12.3 - // and SM90 (Hopper) + // CUTLASS grouped FP8 kernels need at least CUDA 12.3 and SM90 (Hopper) + // or CUDA 12.8 and SM100 (Blackwell) #if defined CUDA_VERSION - if (cuda_device_capability == 90) { + if (cuda_device_capability >= 100) { + return CUDA_VERSION >= 12080; + } + if (cuda_device_capability >= 90) { return CUDA_VERSION >= 12030; } #endif @@ -234,16 +247,26 @@ void cutlass_moe_mm( torch::Tensor const& b_strides, torch::Tensor const& c_strides, bool per_act_token, bool per_out_ch) { int32_t version_num = get_sm_version_num(); +#if defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100 + if (version_num >= 100) { + cutlass_moe_mm_sm100(out_tensors, a_tensors, b_tensors, a_scales, b_scales, + expert_offsets, problem_sizes, a_strides, b_strides, + c_strides, per_act_token, per_out_ch); + return; + } +#endif #if defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90 - cutlass_moe_mm_sm90(out_tensors, a_tensors, b_tensors, a_scales, b_scales, - expert_offsets, problem_sizes, a_strides, b_strides, - c_strides, per_act_token, per_out_ch); - return; + if (version_num >= 90) { + cutlass_moe_mm_sm90(out_tensors, a_tensors, b_tensors, a_scales, b_scales, + expert_offsets, problem_sizes, a_strides, b_strides, + c_strides, per_act_token, per_out_ch); + return; + } #endif TORCH_CHECK_NOT_IMPLEMENTED( false, "No compiled cutlass_scaled_mm for CUDA device capability: ", version_num, - ". Required capability: 90"); + ". 
Required capability: 90 or 100"); } void get_cutlass_moe_mm_data( diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py index e7f65d13181..90b45e32a68 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py @@ -332,6 +332,12 @@ def _is_fp8_w8a8_sm90(self, weight_quant: BaseModel, return (self._check_scheme_supported(90, error=False, match_exact=True) and self._is_fp8_w8a8(weight_quant, input_quant)) + def _is_fp8_w8a8_sm100(self, weight_quant: BaseModel, + input_quant: BaseModel) -> bool: + return (self._check_scheme_supported( + 100, error=False, match_exact=True) + and self._is_fp8_w8a8(weight_quant, input_quant)) + def _is_fp8_w8a16(self, weight_quant: BaseModel, input_quant: BaseModel) -> bool: # Confirm weights quantized. diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py index 2c93977beed..7da52ce6ff8 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py @@ -83,7 +83,8 @@ def get_moe_method( return CompressedTensorsWNA16MarlinMoEMethod(quant_config) elif quant_config._is_fp4a4_nvfp4(weight_quant, input_quant): return CompressedTensorsW4A4MoeMethod() - elif quant_config._is_fp8_w8a8_sm90(weight_quant, input_quant): + elif (quant_config._is_fp8_w8a8_sm90(weight_quant, input_quant) + or quant_config._is_fp8_w8a8_sm100(weight_quant, input_quant)): return CompressedTensorsW8A8Fp8MoECutlassMethod(quant_config) elif quant_config._is_fp8_w8a8(weight_quant, input_quant): return CompressedTensorsW8A8Fp8MoEMethod(quant_config) @@ -740,6 +741,8 @@ def __init__( self.topk_indices_dtype = None self.fused_experts = None # type: ignore self.disable_expert_map = False + self.is_fp8_w8a8_sm100 = self.quant_config._is_fp8_w8a8_sm100( + self.weight_quant, self.input_quant) def create_weights(self, layer: torch.nn.Module, num_experts: int, hidden_size: int, intermediate_size_per_partition: int, @@ -931,7 +934,29 @@ def apply( per_act_token = ( self.input_quant.strategy == QuantizationStrategy.TOKEN) - + per_channel_quant = ( + self.weight_quant.strategy == QuantizationStrategy.CHANNEL) + # Triton fused_experts is faster in small batch sizes on SM100. + # Fall back to fused_experts in small batch sizes. 
+ if self.is_fp8_w8a8_sm100 and topk_ids.shape[0] <= 8: + from vllm.model_executor.layers.fused_moe import fused_experts + return fused_experts( + x, + layer.w13_weight, + layer.w2_weight, + topk_weights, + topk_ids, + inplace=True, + activation=activation, + apply_router_weight_on_input=apply_router_weight_on_input, + use_fp8_w8a8=True, + per_channel_quant=per_channel_quant, + global_num_experts=global_num_experts, + expert_map=None if self.disable_expert_map else expert_map, + w1_scale=layer.w13_weight_scale, + w2_scale=layer.w2_weight_scale, + a1_scale=layer.w13_input_scale, + a2_scale=layer.w2_input_scale) if self.fused_experts is None: # If no modular kernel is provided, use cutlass_moe_fp8 from vllm.model_executor.layers.fused_moe.cutlass_moe import ( From e76fbbd87be4d5c9e257b54d83aafc0c1d4efa45 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Tue, 22 Jul 2025 10:27:15 -0400 Subject: [PATCH 257/552] [Perf] Cuda Kernel for Per Token Group Quant (#21083) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- CMakeLists.txt | 1 + csrc/ops.h | 5 + .../quantization/fp8/per_token_group_quant.cu | 213 ++++++++++++++++++ csrc/torch_bindings.cpp | 9 + .../test_per_token_group_quant.py | 44 ++++ .../layers/quantization/utils/fp8_utils.py | 17 +- 6 files changed, 285 insertions(+), 4 deletions(-) create mode 100644 csrc/quantization/fp8/per_token_group_quant.cu create mode 100644 tests/kernels/quantization/test_per_token_group_quant.py diff --git a/CMakeLists.txt b/CMakeLists.txt index 10f8667db64..767e9ad7541 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -245,6 +245,7 @@ set(VLLM_EXT_SRC "csrc/quantization/gptq/q_gemm.cu" "csrc/quantization/compressed_tensors/int8_quant_kernels.cu" "csrc/quantization/fp8/common.cu" + "csrc/quantization/fp8/per_token_group_quant.cu" "csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu" "csrc/quantization/gguf/gguf_kernel.cu" "csrc/quantization/activation_kernels.cu" diff --git a/csrc/ops.h b/csrc/ops.h index 7f3e6b6923a..fdd3071c56e 100644 --- a/csrc/ops.h +++ b/csrc/ops.h @@ -297,6 +297,11 @@ void dynamic_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor& scales, std::optional const& azp); +void per_token_group_quant_fp8(const torch::Tensor& input, + torch::Tensor& output_q, torch::Tensor& output_s, + int64_t group_size, double eps, double fp8_min, + double fp8_max, bool scale_ue8m0); + torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight, torch::Tensor b_gptq_qzeros, torch::Tensor b_gptq_scales, torch::Tensor b_g_idx, diff --git a/csrc/quantization/fp8/per_token_group_quant.cu b/csrc/quantization/fp8/per_token_group_quant.cu new file mode 100644 index 00000000000..afc41faeca9 --- /dev/null +++ b/csrc/quantization/fp8/per_token_group_quant.cu @@ -0,0 +1,213 @@ +#include +#include + +#include + +#include +#include + +#include + +#include "../vectorization.cuh" +#include "../vectorization_utils.cuh" +#include "../../dispatch_utils.h" + +__device__ __forceinline__ float GroupReduceMax(float val, const int tid) { + unsigned mask = 0xffff; + + val = fmaxf(val, __shfl_xor_sync(mask, val, 8)); + val = fmaxf(val, __shfl_xor_sync(mask, val, 4)); + val = fmaxf(val, __shfl_xor_sync(mask, val, 2)); + val = fmaxf(val, __shfl_xor_sync(mask, val, 1)); + return val; +} + +template +__global__ void per_token_group_quant_8bit_kernel( + const T* __restrict__ input, void* __restrict__ output_q, + scale_packed_t* __restrict__ output_s, const int group_size, + const 
int num_groups, const int groups_per_block, const float eps, + const float min_8bit, const float max_8bit, const int scale_num_rows = 0, + const int scale_stride = 0) { + const int threads_per_group = 16; + const int64_t local_group_id = threadIdx.x / threads_per_group; + const int lane_id = threadIdx.x % threads_per_group; + + const int64_t block_group_id = blockIdx.x * groups_per_block; + const int64_t global_group_id = block_group_id + local_group_id; + const int64_t block_group_offset = global_group_id * group_size; + + float local_absmax = eps; + + using scale_element_t = float; + static_assert(sizeof(scale_packed_t) % sizeof(scale_element_t) == 0); + + const T* group_input = input + block_group_offset; + DST_DTYPE* group_output = + static_cast(output_q) + block_group_offset; + scale_element_t* scale_output; + + if constexpr (IS_COLUMN_MAJOR) { + const int num_elems_per_pack = + static_cast(sizeof(scale_packed_t) / sizeof(scale_element_t)); + const int scale_num_rows_element = scale_num_rows * num_elems_per_pack; + const int row_idx = global_group_id / scale_num_rows_element; + const int col_idx_raw = global_group_id % scale_num_rows_element; + const int col_idx = col_idx_raw / num_elems_per_pack; + const int pack_idx = col_idx_raw % num_elems_per_pack; + scale_output = reinterpret_cast(output_s) + + (col_idx * scale_stride * num_elems_per_pack + + row_idx * num_elems_per_pack + pack_idx); + } else { + scale_output = output_s + global_group_id; + } + + // shared memory to cache each group's data to avoid double DRAM reads. + extern __shared__ __align__(16) char smem_raw[]; + T* smem = reinterpret_cast(smem_raw); + T* smem_group = smem + local_group_id * group_size; + + constexpr int vec_size = 16 / sizeof(T); + using vec_t = vllm::vec_n_t; + + // copy global -> shared & compute absmax + auto scalar_op_cache = [&] __device__(T & dst, const T& src) { + float abs_v = fabsf(static_cast(src)); + local_absmax = fmaxf(local_absmax, abs_v); + dst = src; + }; + + vllm::vectorize_with_alignment( + group_input, // in + smem_group, // out (shared) + group_size, // elements per group + lane_id, // thread id + threads_per_group, // stride in group + scalar_op_cache); // scalar handler + + local_absmax = GroupReduceMax(local_absmax, lane_id); + + float y_s = local_absmax / max_8bit; + if constexpr (SCALE_UE8M0) { + y_s = exp2f(ceilf(log2f(fmaxf(fabsf(y_s), 1e-10f)))); + } + + scale_element_t y_s_quant = y_s; + + if (lane_id == 0) { + *scale_output = y_s_quant; + } + + __syncthreads(); + + // quantize shared -> global 8-bit + auto scalar_op_quant = [&] __device__(DST_DTYPE & dst, const T& src) { + float q = fminf(fmaxf(static_cast(src) / y_s, min_8bit), max_8bit); + dst = DST_DTYPE(q); + }; + + vllm::vectorize_with_alignment( + smem_group, // in (shared) + group_output, // out (global quant tensor) + group_size, // elements + lane_id, // tid + threads_per_group, // stride + scalar_op_quant); // scalar handler +} + +void per_token_group_quant_8bit(const torch::Tensor& input, + torch::Tensor& output_q, + torch::Tensor& output_s, int64_t group_size, + double eps, double min_8bit, double max_8bit, + bool scale_ue8m0 = false) { + TORCH_CHECK(input.is_contiguous()); + TORCH_CHECK(output_q.is_contiguous()); + + const int num_groups = input.numel() / group_size; + + TORCH_CHECK(input.numel() % group_size == 0); + TORCH_CHECK(output_s.dim() == 2); + + cudaStream_t stream = at::cuda::getCurrentCUDAStream(); + + constexpr int THREADS_PER_GROUP = 16; + + int groups_per_block = 1; + + if (num_groups % 16 == 0) { 
+ groups_per_block = 16; + } else if (num_groups % 8 == 0) { + groups_per_block = 8; + } else if (num_groups % 4 == 0) { + groups_per_block = 4; + } else if (num_groups % 2 == 0) { + groups_per_block = 2; + } + + auto dst_type = output_q.scalar_type(); + const int num_blocks = num_groups / groups_per_block; + const int num_threads = groups_per_block * THREADS_PER_GROUP; + + const bool is_column_major = output_s.stride(0) < output_s.stride(1); + const int scale_num_rows = output_s.size(1); + const int scale_stride = output_s.stride(1); + +#define LAUNCH_KERNEL(T, DST_DTYPE) \ + do { \ + dim3 grid(num_blocks); \ + dim3 block(num_threads); \ + size_t smem_bytes = \ + static_cast(groups_per_block) * group_size * sizeof(T); \ + if (is_column_major) { \ + if (scale_ue8m0) { \ + per_token_group_quant_8bit_kernel \ + <<>>( \ + static_cast(input.data_ptr()), output_q.data_ptr(), \ + static_cast(output_s.data_ptr()), group_size, \ + num_groups, groups_per_block, (float)eps, (float)min_8bit, \ + (float)max_8bit, scale_num_rows, scale_stride); \ + } else { \ + per_token_group_quant_8bit_kernel \ + <<>>( \ + static_cast(input.data_ptr()), output_q.data_ptr(), \ + static_cast(output_s.data_ptr()), group_size, \ + num_groups, groups_per_block, (float)eps, (float)min_8bit, \ + (float)max_8bit, scale_num_rows, scale_stride); \ + } \ + } else { \ + if (scale_ue8m0) { \ + per_token_group_quant_8bit_kernel \ + <<>>( \ + static_cast(input.data_ptr()), output_q.data_ptr(), \ + static_cast(output_s.data_ptr()), group_size, \ + num_groups, groups_per_block, (float)eps, (float)min_8bit, \ + (float)max_8bit); \ + } else { \ + per_token_group_quant_8bit_kernel \ + <<>>( \ + static_cast(input.data_ptr()), output_q.data_ptr(), \ + static_cast(output_s.data_ptr()), group_size, \ + num_groups, groups_per_block, (float)eps, (float)min_8bit, \ + (float)max_8bit); \ + } \ + } \ + } while (0) + + VLLM_DISPATCH_FLOATING_TYPES( + input.scalar_type(), "per_token_group_quant_8bit", ([&] { + if (dst_type == at::ScalarType::Float8_e4m3fn) { + LAUNCH_KERNEL(scalar_t, c10::Float8_e4m3fn); + } + })); + +#undef LAUNCH_KERNEL +} + +void per_token_group_quant_fp8(const torch::Tensor& input, + torch::Tensor& output_q, torch::Tensor& output_s, + int64_t group_size, double eps, double fp8_min, + double fp8_max, bool scale_ue8m0) { + per_token_group_quant_8bit(input, output_q, output_s, group_size, eps, + fp8_min, fp8_max, scale_ue8m0); +} diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp index 79e2575974b..d310211afe4 100644 --- a/csrc/torch_bindings.cpp +++ b/csrc/torch_bindings.cpp @@ -601,6 +601,15 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { ops.impl("dynamic_scaled_int8_quant", torch::kCUDA, &dynamic_scaled_int8_quant); + // Compute per-token-group FP8 quantized tensor and scaling factor. + ops.def( + "per_token_group_fp8_quant(Tensor input, Tensor! output_q, Tensor! " + "output_s, " + "int group_size, float eps, float fp8_min, float fp8_max, bool " + "scale_ue8m0) -> ()"); + ops.impl("per_token_group_fp8_quant", torch::kCUDA, + &per_token_group_quant_fp8); + // Mamba selective scan kernel ops.def( "selective_scan_fwd(Tensor! u, Tensor! 
delta," diff --git a/tests/kernels/quantization/test_per_token_group_quant.py b/tests/kernels/quantization/test_per_token_group_quant.py new file mode 100644 index 00000000000..f826983fe94 --- /dev/null +++ b/tests/kernels/quantization/test_per_token_group_quant.py @@ -0,0 +1,44 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from unittest.mock import patch + +import pytest +import torch + +from vllm.model_executor.layers.quantization.utils import fp8_utils + + +@pytest.mark.parametrize("shape", [(32, 128), (64, 256), (16, 512)]) +@pytest.mark.parametrize("column_major", [False, True]) +@pytest.mark.parametrize("scale_ue8m0", [False, True]) +@pytest.mark.parametrize("group_size", [64, 128]) +@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available") +def test_per_token_group_quant_fp8(shape, column_major: bool, + scale_ue8m0: bool, group_size: int): + device = "cuda" + + torch.manual_seed(42) + num_tokens, hidden_dim = shape + + x = (torch.randn( + (num_tokens, hidden_dim), device=device, dtype=torch.bfloat16) * 8) + + # cuda path + out_q, scale = fp8_utils.per_token_group_quant_fp8( + x, + group_size, + column_major_scales=column_major, + use_ue8m0=scale_ue8m0, + ) + + # triton ref + with patch("vllm.platforms.current_platform.is_cuda", return_value=False): + ref_q, ref_s = fp8_utils.per_token_group_quant_fp8( + x, + group_size, + column_major_scales=column_major, + use_ue8m0=scale_ue8m0, + ) + + assert torch.allclose(out_q.float(), ref_q.float(), atol=0.15, rtol=0.15) + assert torch.allclose(scale, ref_s, atol=0.01, rtol=0.01) diff --git a/vllm/model_executor/layers/quantization/utils/fp8_utils.py b/vllm/model_executor/layers/quantization/utils/fp8_utils.py index 20e7b444856..ee5f2b51564 100644 --- a/vllm/model_executor/layers/quantization/utils/fp8_utils.py +++ b/vllm/model_executor/layers/quantization/utils/fp8_utils.py @@ -366,6 +366,7 @@ def per_token_group_quant_fp8( dtype: Optional[torch.dtype] = None, column_major_scales: bool = False, out_q: Optional[torch.Tensor] = None, + use_ue8m0: bool = is_blackwell_deep_gemm_used(), ) -> tuple[torch.Tensor, torch.Tensor]: """Function to perform per-token-group quantization on an input tensor `x`. It converts the tensor values into signed float8 values and returns the @@ -397,8 +398,7 @@ def per_token_group_quant_fp8( if x_q is None: x_q = torch.empty_like(x, device=x.device, dtype=dtype) - M = x.numel() // group_size - N = group_size + # Allocate the scale tensor in either row- or column-major format. 
if column_major_scales: shape = (x.shape[-1] // group_size, ) + x.shape[:-1] x_s = torch.empty(shape, device=x.device, @@ -407,6 +407,15 @@ def per_token_group_quant_fp8( shape = x.shape[:-1] + (x.shape[-1] // group_size, ) x_s = torch.empty(shape, device=x.device, dtype=torch.float32) + # prefer CUDA kernel if available + if current_platform.is_cuda() and x.is_contiguous(): + torch.ops._C.per_token_group_fp8_quant(x, x_q, x_s, group_size, eps, + fp8_min, fp8_max, use_ue8m0) + return x_q, x_s + + # TRITON FALLBACK + M = x.numel() // group_size + N = group_size BLOCK = triton.next_power_of_2(N) # heuristics for number of warps num_warps = min(max(BLOCK // 256, 1), 8) @@ -423,7 +432,7 @@ def per_token_group_quant_fp8( eps, fp8_min=fp8_min, fp8_max=fp8_max, - use_ue8m0=is_blackwell_deep_gemm_used(), + use_ue8m0=use_ue8m0, BLOCK=BLOCK, num_warps=num_warps, num_stages=num_stages, @@ -439,7 +448,7 @@ def per_token_group_quant_fp8( eps, fp8_min=fp8_min, fp8_max=fp8_max, - use_ue8m0=is_blackwell_deep_gemm_used(), + use_ue8m0=use_ue8m0, BLOCK=BLOCK, num_warps=num_warps, num_stages=num_stages, From db2c92d3cbe2e75a86d514d99e9e1ae09d8804b1 Mon Sep 17 00:00:00 2001 From: Benjamin Bartels Date: Tue, 22 Jul 2025 16:15:53 +0100 Subject: [PATCH 258/552] Adds parallel model weight loading for runai_streamer (#21330) Signed-off-by: bbartels Co-authored-by: Cyrus Leung Signed-off-by: x22x22 --- setup.py | 3 ++- .../model_loader/weight_utils.py | 22 ++++++++++++------- 2 files changed, 16 insertions(+), 9 deletions(-) diff --git a/setup.py b/setup.py index 9a5ca3456a0..d46e678e7aa 100644 --- a/setup.py +++ b/setup.py @@ -659,7 +659,8 @@ def _read_requirements(filename: str) -> list[str]: "bench": ["pandas", "datasets"], "tensorizer": ["tensorizer==2.10.1"], "fastsafetensors": ["fastsafetensors >= 0.1.10"], - "runai": ["runai-model-streamer", "runai-model-streamer-s3", "boto3"], + "runai": + ["runai-model-streamer >= 0.13.3", "runai-model-streamer-s3", "boto3"], "audio": ["librosa", "soundfile", "mistral_common[audio]"], # Required for audio processing "video": [] # Kept for backwards compatibility diff --git a/vllm/model_executor/model_loader/weight_utils.py b/vllm/model_executor/model_loader/weight_utils.py index 64a2089921e..074126fa669 100644 --- a/vllm/model_executor/model_loader/weight_utils.py +++ b/vllm/model_executor/model_loader/weight_utils.py @@ -482,14 +482,20 @@ def runai_safetensors_weights_iterator( ) -> Generator[tuple[str, torch.Tensor], None, None]: """Iterate over the weights in the model safetensor files.""" with SafetensorsStreamer() as streamer: - for st_file in tqdm( - hf_weights_files, - desc="Loading safetensors using Runai Model Streamer", - disable=not enable_tqdm(use_tqdm_on_load), - bar_format=_BAR_FORMAT, - ): - streamer.stream_file(st_file) - yield from streamer.get_tensors() + streamer.stream_files(hf_weights_files) + total_tensors = sum( + len(tensors_meta) + for tensors_meta in streamer.files_to_tensors_metadata.values()) + + tensor_iter = tqdm( + streamer.get_tensors(), + total=total_tensors, + desc="Loading safetensors using Runai Model Streamer", + bar_format=_BAR_FORMAT, + disable=not enable_tqdm(use_tqdm_on_load), + ) + + yield from tensor_iter def fastsafetensors_weights_iterator( From e805e760a2df3e0680c0edfe490747954c3f65d5 Mon Sep 17 00:00:00 2001 From: Raushan Turganbay Date: Tue, 22 Jul 2025 17:18:46 +0200 Subject: [PATCH 259/552] [feat] Enable mm caching for transformers backend (#21358) Signed-off-by: raushan Signed-off-by: x22x22 --- 
docs/models/supported_models.md | 2 +- tests/models/multimodal/generation/test_common.py | 8 -------- vllm/model_executor/models/transformers.py | 9 +++------ vllm/v1/core/kv_cache_utils.py | 6 +++--- 4 files changed, 7 insertions(+), 18 deletions(-) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 69f6a7aedd2..391e27cc12b 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -18,7 +18,7 @@ These models are what we list in [supported-text-models][supported-text-models] ### Transformers -vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models and common vision language models are supported! Vision-language models currently accept only image inputs, and require setting `--disable_mm_preprocessor_cache` when running. Support for video inputs and caching of multi-modal preprocessors will be added in future releases. +vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models and common vision language models are supported! Vision-language models currently accept only image inputs. Support for video inputs will be added in future releases. To check if the modeling backend is Transformers, you can simply do this: diff --git a/tests/models/multimodal/generation/test_common.py b/tests/models/multimodal/generation/test_common.py index 9859ac5a89d..e2e35e9b272 100644 --- a/tests/models/multimodal/generation/test_common.py +++ b/tests/models/multimodal/generation/test_common.py @@ -186,8 +186,6 @@ image_size_factors=[(0.25, 0.5, 1.0)], vllm_runner_kwargs={ "model_impl": "transformers", - "disable_mm_preprocessor_cache": True, - "enable_prefix_caching": False, }, marks=[pytest.mark.core_model], ), @@ -205,8 +203,6 @@ # image_size_factors=[(0.25, 0.5, 1.0)], # vllm_runner_kwargs={ # "model_impl": "transformers", - # "disable_mm_preprocessor_cache": True, - # "enable_prefix_caching": False, # }, # marks=[pytest.mark.core_model], # ), @@ -223,8 +219,6 @@ image_size_factors=[(0.25, 0.2, 0.15)], vllm_runner_kwargs={ "model_impl": "transformers", - "disable_mm_preprocessor_cache": True, - "enable_prefix_caching": False, }, marks=[large_gpu_mark(min_gb=32)], ), @@ -239,8 +233,6 @@ image_size_factors=[(0.25, 0.5, 1.0)], vllm_runner_kwargs={ "model_impl": "auto", - "disable_mm_preprocessor_cache": True, - "enable_prefix_caching": False, }, auto_cls=AutoModelForImageTextToText, marks=[pytest.mark.core_model], diff --git a/vllm/model_executor/models/transformers.py b/vllm/model_executor/models/transformers.py index 47cff29caab..eea03afcd8a 100644 --- a/vllm/model_executor/models/transformers.py +++ b/vllm/model_executor/models/transformers.py @@ -315,11 +315,6 @@ def apply( Apply HF Processor on prompt text and multi-modal data together, outputting token IDs and processed tensors. """ - if return_mm_hashes: - raise ValueError( - "TransformersForMultimodalLM doesn't support mm hashing yet! 
" - "Probably you didn't set `disable_mm_preprocessor_cache=True`") - if tokenization_kwargs is None: tokenization_kwargs = {} @@ -375,12 +370,14 @@ def apply( num_image_patches), ) + mm_hashes = self._hash_mm_items(mm_items, hf_processor_mm_kwargs, + tokenization_kwargs) return MultiModalInputs( type="multimodal", prompt=prompt, prompt_token_ids=prompt_ids, mm_kwargs=mm_kwargs, - mm_hashes=None, + mm_hashes=mm_hashes, mm_placeholders=mm_placeholders, ) diff --git a/vllm/v1/core/kv_cache_utils.py b/vllm/v1/core/kv_cache_utils.py index 198d79cfb42..5b0218640a8 100644 --- a/vllm/v1/core/kv_cache_utils.py +++ b/vllm/v1/core/kv_cache_utils.py @@ -406,9 +406,9 @@ def need_extra_keys(request: Request) -> bool: # Multimodal requests need to include the MM hash. # LoRA requests need to include the LoRA ID. # Request with provided cache salt need to include the salt. - return bool(request.mm_positions) or (request.lora_request - is not None) or (request.cache_salt - is not None) + return bool(request.mm_hashes) or (request.lora_request + is not None) or (request.cache_salt + is not None) def _gen_mm_extra_hash_keys(request: Request, start_token_idx: int, From 525b7bad4f10b5b5a92a5efdbe50d1a55428a35d Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Tue, 22 Jul 2025 11:22:10 -0400 Subject: [PATCH 260/552] Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" (#21384) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- csrc/moe/topk_softmax_kernels.cu | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/csrc/moe/topk_softmax_kernels.cu b/csrc/moe/topk_softmax_kernels.cu index ea4ff67ef3e..064b76c9cd4 100644 --- a/csrc/moe/topk_softmax_kernels.cu +++ b/csrc/moe/topk_softmax_kernels.cu @@ -20,7 +20,6 @@ #include #include #include "../cuda_compat.h" -#include #ifndef USE_ROCM #include @@ -63,7 +62,7 @@ __launch_bounds__(TPB) __global__ const int thread_row_offset = blockIdx.x * num_cols; - cuda::std::plus sum; + cub::Sum sum; float threadData(-FLT_MAX); // Don't touch finished rows. 
From 7a94e96049a3d0b10bcc421416ec3a77c0bd458e Mon Sep 17 00:00:00 2001 From: Wang Yijun Date: Tue, 22 Jul 2025 23:24:00 +0800 Subject: [PATCH 261/552] Add tokenization_kwargs to encode for embedding model truncation (#21033) Signed-off-by: x22x22 --- vllm/engine/async_llm_engine.py | 6 ++++++ vllm/entrypoints/llm.py | 15 ++++++++++++--- vllm/v1/engine/async_llm.py | 2 ++ 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/vllm/engine/async_llm_engine.py b/vllm/engine/async_llm_engine.py index 3d7d28055dd..06ae2a2f18f 100644 --- a/vllm/engine/async_llm_engine.py +++ b/vllm/engine/async_llm_engine.py @@ -438,6 +438,7 @@ async def add_request_async( prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, data_parallel_rank: Optional[int] = None, + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> None: """ Async version of @@ -468,6 +469,7 @@ async def add_request_async( prompt, lora_request=lora_request, prompt_adapter_request=prompt_adapter_request, + tokenization_kwargs=tokenization_kwargs, ) if isinstance(params, SamplingParams) and \ @@ -862,6 +864,7 @@ async def add_request( prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, data_parallel_rank: Optional[int] = None, + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> AsyncGenerator[Union[RequestOutput, PoolingRequestOutput], None]: if not self.is_running: if self.start_engine_loop: @@ -889,6 +892,7 @@ async def add_request( prompt_adapter_request=prompt_adapter_request, priority=priority, data_parallel_rank=data_parallel_rank, + tokenization_kwargs=tokenization_kwargs, ) return stream.generator() @@ -996,6 +1000,7 @@ async def encode( lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, priority: int = 0, + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> AsyncGenerator[PoolingRequestOutput, None]: """Generate outputs for a request from a pooling model. @@ -1070,6 +1075,7 @@ async def encode( lora_request=lora_request, trace_headers=trace_headers, priority=priority, + tokenization_kwargs=tokenization_kwargs, ): yield LLMEngine.validate_output(output, PoolingRequestOutput) except asyncio.CancelledError: diff --git a/vllm/entrypoints/llm.py b/vllm/entrypoints/llm.py index 78f9d32d811..c4f1b3b8661 100644 --- a/vllm/entrypoints/llm.py +++ b/vllm/entrypoints/llm.py @@ -965,6 +965,7 @@ def encode( lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: ... @@ -981,6 +982,7 @@ def encode( lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: ... @@ -997,6 +999,7 @@ def encode( lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: ... 
@@ -1014,6 +1017,7 @@ def encode( lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: ... @@ -1031,6 +1035,7 @@ def encode( lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: ... @@ -1046,6 +1051,7 @@ def encode( lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: ... @@ -1066,6 +1072,7 @@ def encode( lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: """Apply pooling to the hidden states corresponding to the input prompts. @@ -1131,9 +1138,11 @@ def encode( for pooling_param in pooling_params: pooling_param.verify(pooling_task, model_config) - tokenization_kwargs = dict[str, Any]() - _validate_truncation_size(model_config.max_model_len, - truncate_prompt_tokens, tokenization_kwargs) + if tokenization_kwargs is None: + tokenization_kwargs = dict[str, Any]() + _validate_truncation_size(model_config.max_model_len, + truncate_prompt_tokens, + tokenization_kwargs) self._validate_and_add_requests( prompts=parsed_prompts, diff --git a/vllm/v1/engine/async_llm.py b/vllm/v1/engine/async_llm.py index b8ba36f3502..79b5d5ae4a2 100644 --- a/vllm/v1/engine/async_llm.py +++ b/vllm/v1/engine/async_llm.py @@ -437,6 +437,7 @@ async def encode( lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, priority: int = 0, + tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> AsyncGenerator[PoolingRequestOutput, None]: """ Main function called by the API server to kick off a request @@ -465,6 +466,7 @@ async def encode( lora_request=lora_request, trace_headers=trace_headers, priority=priority, + tokenization_kwargs=tokenization_kwargs, ) # The output_handler task pushes items into the queue. 
From 72061ec30d047261456ad1770a2d8f37a379b0c7 Mon Sep 17 00:00:00 2001 From: Aritra Roy Gosthipaty Date: Tue, 22 Jul 2025 20:57:28 +0530 Subject: [PATCH 262/552] [Bugfix] Decode Tokenized IDs to Strings for `hf_processor` in `llm.chat()` with `model_impl=transformers` (#21353) Signed-off-by: ariG23498 Signed-off-by: x22x22 --- .../processing/test_transformers.py | 40 +++++++++++++++++++ vllm/model_executor/models/transformers.py | 5 +++ 2 files changed, 45 insertions(+) create mode 100644 tests/models/multimodal/processing/test_transformers.py diff --git a/tests/models/multimodal/processing/test_transformers.py b/tests/models/multimodal/processing/test_transformers.py new file mode 100644 index 00000000000..c7d1b5271ff --- /dev/null +++ b/tests/models/multimodal/processing/test_transformers.py @@ -0,0 +1,40 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import pytest + +from vllm.assets.image import ImageAsset +from vllm.config import ModelConfig +from vllm.multimodal import MULTIMODAL_REGISTRY + + +# yapf: disable +@pytest.mark.parametrize("model_id", + ["llava-hf/llava-onevision-qwen2-0.5b-ov-hf"]) +def test_multimodal_processor(model_id): + model_config = ModelConfig( + model=model_id, + model_impl="transformers", + ) + + mm_processor = MULTIMODAL_REGISTRY.create_processor(model_config, ) + + image_pil = ImageAsset('cherry_blossom').pil_image + mm_data = {"image": image_pil} + str_prompt = "<|im_start|>user \nWhat is the content of this image?<|im_end|><|im_start|>assistant\n" # noqa: E501 + str_processed_inputs = mm_processor.apply( + prompt=str_prompt, + mm_data=mm_data, + hf_processor_mm_kwargs={}, + ) + + ids_prompt = [ + 151644, 872, 220, 151646, 198, 3838, 374, 279, 2213, 315, 419, 2168, + 30, 151645, 151644, 77091, 198 + ] + ids_processed_inputs = mm_processor.apply( + prompt=ids_prompt, + mm_data=mm_data, + hf_processor_mm_kwargs={}, + ) + + assert str_processed_inputs["prompt"] == ids_processed_inputs["prompt"] diff --git a/vllm/model_executor/models/transformers.py b/vllm/model_executor/models/transformers.py index eea03afcd8a..cb9d28b1067 100644 --- a/vllm/model_executor/models/transformers.py +++ b/vllm/model_executor/models/transformers.py @@ -320,6 +320,11 @@ def apply( mm_items = self._to_mm_items(mm_data) hf_processor = self.info.get_hf_processor(**hf_processor_mm_kwargs) + if not isinstance(prompt, str): + # the prompt is the tokenized ids which is not supported + # by the hf_processor, which is why we would need to decode the ids + # into string + prompt = hf_processor.decode(prompt) (prompt_ids, processed_data, mm_token_type_ids) = self._apply_hf_processor_text_mm( From a1ebe939b7a3d5db131e459e693da4308c9be2bc Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Tue, 22 Jul 2025 23:39:35 +0800 Subject: [PATCH 263/552] [CI/Build] Fix test failure due to updated model repo (#21375) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- tests/models/registry.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tests/models/registry.py b/tests/models/registry.py index 8e3285aebbe..776b4c03356 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -167,9 +167,9 @@ def check_available_online( "DeepseekV3ForCausalLM": _HfExamplesInfo("deepseek-ai/DeepSeek-V3", # noqa: E501 trust_remote_code=True), "Ernie4_5_ForCausalLM": _HfExamplesInfo("baidu/ERNIE-4.5-0.3B-PT", - trust_remote_code=True), + min_transformers_version="4.54"), "Ernie4_5_MoeForCausalLM": 
_HfExamplesInfo("baidu/ERNIE-4.5-21B-A3B-PT", - trust_remote_code=True), + min_transformers_version="4.54"), "ExaoneForCausalLM": _HfExamplesInfo("LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct"), # noqa: E501 "Exaone4ForCausalLM": _HfExamplesInfo("LGAI-EXAONE/EXAONE-4.0-32B"), # noqa: E501 "Fairseq2LlamaForCausalLM": _HfExamplesInfo("mgleize/fairseq2-dummy-Llama-3.2-1B"), # noqa: E501 From 66d21499370867b3618bc67ef78100f19e902be9 Mon Sep 17 00:00:00 2001 From: Xin Li Date: Tue, 22 Jul 2025 15:42:31 -0400 Subject: [PATCH 264/552] Fix Flashinfer Allreduce+Norm enable disable calculation based on `fi_allreduce_fusion_max_token_num` (#21325) Signed-off-by: XIn Li Signed-off-by: x22x22 --- vllm/compilation/collective_fusion.py | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/vllm/compilation/collective_fusion.py b/vllm/compilation/collective_fusion.py index a8b00aaf084..0e7961841bd 100644 --- a/vllm/compilation/collective_fusion.py +++ b/vllm/compilation/collective_fusion.py @@ -159,6 +159,9 @@ def __call__(self, graph: fx.Graph): 6: MiB // 2, # 512KB 8: MiB // 2, # 512KB } + # opt for a more conservative default value + # when world size is not in _FI_MAX_SIZES + _DEFAULT_FI_MAX_SIZE = MiB // 2 def call_trtllm_fused_allreduce_norm( allreduce_in: torch.Tensor, @@ -173,12 +176,16 @@ def call_trtllm_fused_allreduce_norm( max_token_num: int, norm_out: Optional[torch.Tensor] = None, ) -> None: - use_flashinfer = allreduce_in.shape[0] * allreduce_in.shape[ - 1] * allreduce_in.element_size() <= min( - _FI_MAX_SIZES[world_size], - max_token_num * allreduce_in.shape[0] * - allreduce_in.element_size(), - ) + + num_tokens, hidden_size = allreduce_in.shape + element_size = allreduce_in.element_size() + current_tensor_size = num_tokens * hidden_size * element_size + max_fusion_size = max_token_num * hidden_size * element_size + use_flashinfer = current_tensor_size <= min( + _FI_MAX_SIZES.get(world_size, _DEFAULT_FI_MAX_SIZE), + max_fusion_size, + ) + if use_flashinfer: assert (_FI_WORKSPACE_TENSOR is not None ), "Flashinfer must be enabled when using flashinfer" From 28c50bd7667488f62e026276bbb608ab5f8ea75d Mon Sep 17 00:00:00 2001 From: Yiheng Xu Date: Tue, 22 Jul 2025 15:05:57 -0700 Subject: [PATCH 265/552] [Model] Add Qwen3CoderToolParser (#21396) Signed-off-by: simon-mo Co-authored-by: simon-mo Signed-off-by: x22x22 --- tests/tool_use/test_qwen3coder_tool_parser.py | 618 ++++++++++++++++ .../openai/tool_parsers/__init__.py | 2 + .../tool_parsers/qwen3coder_tool_parser.py | 669 ++++++++++++++++++ 3 files changed, 1289 insertions(+) create mode 100644 tests/tool_use/test_qwen3coder_tool_parser.py create mode 100644 vllm/entrypoints/openai/tool_parsers/qwen3coder_tool_parser.py diff --git a/tests/tool_use/test_qwen3coder_tool_parser.py b/tests/tool_use/test_qwen3coder_tool_parser.py new file mode 100644 index 00000000000..40c3158e9e6 --- /dev/null +++ b/tests/tool_use/test_qwen3coder_tool_parser.py @@ -0,0 +1,618 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import json +from collections.abc import Generator +from typing import Optional + +import pytest + +from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, + ChatCompletionToolsParam, + DeltaMessage, FunctionCall, + ToolCall) +from vllm.entrypoints.openai.tool_parsers.qwen3coder_tool_parser import ( + Qwen3CoderToolParser) +from vllm.transformers_utils.detokenizer import detokenize_incrementally +from vllm.transformers_utils.tokenizer import 
AnyTokenizer, get_tokenizer + +MODEL = "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8" + + +@pytest.fixture(scope="module") +def qwen3_tokenizer(): + return get_tokenizer(tokenizer_name=MODEL) + + +@pytest.fixture +def qwen3_tool_parser(qwen3_tokenizer): + return Qwen3CoderToolParser(qwen3_tokenizer) + + +@pytest.fixture +def sample_tools(): + return [ + ChatCompletionToolsParam(type="function", + function={ + "name": "get_current_weather", + "description": "Get the current weather", + "parameters": { + "type": "object", + "properties": { + "city": { + "type": "string", + "description": "The city name" + }, + "state": { + "type": "string", + "description": + "The state code" + }, + "unit": { + "type": "string", + "enum": + ["fahrenheit", "celsius"] + } + }, + "required": ["city", "state"] + } + }), + ChatCompletionToolsParam(type="function", + function={ + "name": "calculate_area", + "description": + "Calculate area of a shape", + "parameters": { + "type": "object", + "properties": { + "shape": { + "type": "string" + }, + "dimensions": { + "type": "object" + }, + "precision": { + "type": "integer" + } + } + } + }) + ] + + +def assert_tool_calls(actual_tool_calls: list[ToolCall], + expected_tool_calls: list[ToolCall]): + assert len(actual_tool_calls) == len(expected_tool_calls) + + for actual_tool_call, expected_tool_call in zip(actual_tool_calls, + expected_tool_calls): + # Qwen3 parser doesn't generate IDs during extraction + assert actual_tool_call.type == "function" + assert ( + actual_tool_call.function.name == expected_tool_call.function.name) + assert (json.loads(actual_tool_call.function.arguments) == json.loads( + expected_tool_call.function.arguments)) + + +def stream_delta_message_generator( + qwen3_tool_parser: Qwen3CoderToolParser, + qwen3_tokenizer: AnyTokenizer, + model_output: str, + request: Optional[ChatCompletionRequest] = None +) -> Generator[DeltaMessage, None, None]: + all_token_ids = qwen3_tokenizer.encode(model_output, + add_special_tokens=False) + + previous_text = "" + previous_tokens = None + prefix_offset = 0 + read_offset = 0 + for i, delta_token in enumerate(all_token_ids): + delta_token_ids = [delta_token] + previous_token_ids = all_token_ids[:i] + current_token_ids = all_token_ids[:i + 1] + + (new_tokens, delta_text, new_prefix_offset, + new_read_offset) = detokenize_incrementally( + tokenizer=qwen3_tokenizer, + all_input_ids=current_token_ids, + prev_tokens=previous_tokens, + prefix_offset=prefix_offset, + read_offset=read_offset, + skip_special_tokens=False, + spaces_between_special_tokens=True, + ) + + current_text = previous_text + delta_text + + delta_message = qwen3_tool_parser.extract_tool_calls_streaming( + previous_text, + current_text, + delta_text, + previous_token_ids, + current_token_ids, + delta_token_ids, + request=request, + ) + if delta_message: + yield delta_message + + previous_text = current_text + previous_tokens = (previous_tokens + + new_tokens if previous_tokens else new_tokens) + prefix_offset = new_prefix_offset + read_offset = new_read_offset + + +def test_extract_tool_calls_no_tools(qwen3_tool_parser): + model_output = "This is a test response without any tool calls" + extracted_tool_calls = qwen3_tool_parser.extract_tool_calls( + model_output, request=None) # type: ignore[arg-type] + assert not extracted_tool_calls.tools_called + assert extracted_tool_calls.tool_calls == [] + assert extracted_tool_calls.content == model_output + + +@pytest.mark.parametrize( + ids=[ + "single_tool", + "single_tool_with_content", + 
"single_tool_multiline_param", + "parallel_tools", + "tool_with_typed_params", + ], + argnames=["model_output", "expected_tool_calls", "expected_content"], + argvalues=[ + (''' + + +Dallas + + +TX + + +fahrenheit + + +''', [ + ToolCall( + function=FunctionCall(name="get_current_weather", + arguments=json.dumps({ + "city": "Dallas", + "state": "TX", + "unit": "fahrenheit" + }))) + ], None), + ('''Sure! Let me check the weather for you. + + +Dallas + + +TX + + +fahrenheit + + +''', [ + ToolCall( + function=FunctionCall(name="get_current_weather", + arguments=json.dumps({ + "city": "Dallas", + "state": "TX", + "unit": "fahrenheit" + }))) + ], "Sure! Let me check the weather for you."), + (''' + + +rectangle + + +{"width": 10, + "height": 20} + + +2 + + +''', [ + ToolCall(function=FunctionCall(name="calculate_area", + arguments=json.dumps({ + "shape": "rectangle", + "dimensions": { + "width": 10, + "height": 20 + }, + "precision": 2 + }))) + ], None), + (''' + + +Dallas + + +TX + + +fahrenheit + + + + + + +Orlando + + +FL + + +fahrenheit + + +''', [ + ToolCall( + function=FunctionCall(name="get_current_weather", + arguments=json.dumps({ + "city": "Dallas", + "state": "TX", + "unit": "fahrenheit" + }))), + ToolCall( + function=FunctionCall(name="get_current_weather", + arguments=json.dumps({ + "city": "Orlando", + "state": "FL", + "unit": "fahrenheit" + }))) + ], None), + ('''Let me calculate that area for you. + + +circle + + +{"radius": 15.5} + + +3 + + +''', [ + ToolCall(function=FunctionCall(name="calculate_area", + arguments=json.dumps({ + "shape": "circle", + "dimensions": { + "radius": 15.5 + }, + "precision": 3 + }))) + ], "Let me calculate that area for you."), + ], +) +def test_extract_tool_calls(qwen3_tool_parser, sample_tools, model_output, + expected_tool_calls, expected_content): + request = ChatCompletionRequest(model=MODEL, + messages=[], + tools=sample_tools) + extracted_tool_calls = qwen3_tool_parser.extract_tool_calls( + model_output, request=request) + assert extracted_tool_calls.tools_called + + assert_tool_calls(extracted_tool_calls.tool_calls, expected_tool_calls) + + assert extracted_tool_calls.content == expected_content + + +def test_extract_tool_calls_fallback_no_tags(qwen3_tool_parser, sample_tools): + """Test fallback parsing when XML tags are missing""" + model_output = ''' + +Dallas + + +TX + +''' + + request = ChatCompletionRequest(model=MODEL, + messages=[], + tools=sample_tools) + extracted_tool_calls = qwen3_tool_parser.extract_tool_calls( + model_output, request=request) + + assert extracted_tool_calls.tools_called + assert len(extracted_tool_calls.tool_calls) == 1 + assert (extracted_tool_calls.tool_calls[0].function.name == + "get_current_weather") + + +def test_extract_tool_calls_type_conversion(qwen3_tool_parser): + """Test parameter type conversion based on tool schema""" + tools = [ + ChatCompletionToolsParam(type="function", + function={ + "name": "test_types", + "parameters": { + "type": "object", + "properties": { + "int_param": { + "type": "integer" + }, + "float_param": { + "type": "float" + }, + "bool_param": { + "type": "boolean" + }, + "str_param": { + "type": "string" + }, + "obj_param": { + "type": "object" + } + } + } + }) + ] + + model_output = ''' + + +42 + + +3.14 + + +true + + +hello world + + +{"key": "value"} + + +''' + + request = ChatCompletionRequest(model=MODEL, messages=[], tools=tools) + extracted_tool_calls = qwen3_tool_parser.extract_tool_calls( + model_output, request=request) + + args = 
json.loads(extracted_tool_calls.tool_calls[0].function.arguments) + assert args["int_param"] == 42 + assert args["float_param"] == 3.14 + assert args["bool_param"] is True + assert args["str_param"] == "hello world" + assert args["obj_param"] == {"key": "value"} + + +@pytest.mark.parametrize( + ids=[ + "no_tools", + "single_tool", + "single_tool_with_content", + "parallel_tools", + ], + argnames=["model_output", "expected_tool_calls", "expected_content"], + argvalues=[ + ("This is a test without tools", [], "This is a test without tools"), + (''' + + +Dallas + + +TX + + +fahrenheit + + +''', [ + ToolCall( + function=FunctionCall(name="get_current_weather", + arguments=json.dumps({ + "city": "Dallas", + "state": "TX", + "unit": "fahrenheit" + }))) + ], ""), + ('''Sure! Let me check the weather for you. + + +Dallas + + +TX + + +fahrenheit + + +''', [ + ToolCall( + function=FunctionCall(name="get_current_weather", + arguments=json.dumps({ + "city": "Dallas", + "state": "TX", + "unit": "fahrenheit" + }))) + ], "Sure! Let me check the weather for you."), + (''' + + +Dallas + + +TX + + +fahrenheit + + + + + + +Orlando + + +FL + + +celsius + + +''', [ + ToolCall( + function=FunctionCall(name="get_current_weather", + arguments=json.dumps({ + "city": "Dallas", + "state": "TX", + "unit": "fahrenheit" + }))), + ToolCall( + function=FunctionCall(name="get_current_weather", + arguments=json.dumps({ + "city": "Orlando", + "state": "FL", + "unit": "celsius" + }))) + ], ""), + ], +) +def test_extract_tool_calls_streaming(qwen3_tool_parser, qwen3_tokenizer, + sample_tools, model_output, + expected_tool_calls, expected_content): + """Test incremental streaming behavior""" + request = ChatCompletionRequest(model=MODEL, + messages=[], + tools=sample_tools) + + other_content = '' + tool_states = {} # Track state per tool index + + for delta_message in stream_delta_message_generator( + qwen3_tool_parser, qwen3_tokenizer, model_output, request): + # role should never be streamed from tool parser + assert not delta_message.role + + if delta_message.content: + other_content += delta_message.content + + if delta_message.tool_calls: + for tool_call in delta_message.tool_calls: + idx = tool_call.index + + # Initialize state for new tool + if idx not in tool_states: + tool_states[idx] = { + "id": None, + "name": None, + "arguments": "", + "type": None + } + + # First chunk should have id, name, and type + if tool_call.id: + tool_states[idx]["id"] = tool_call.id + + if tool_call.type: + assert tool_call.type == "function" + tool_states[idx]["type"] = tool_call.type + + if tool_call.function: + if tool_call.function.name: + # Should only be set once + assert tool_states[idx]["name"] is None + tool_states[idx]["name"] = tool_call.function.name + + if tool_call.function.arguments is not None: + # Accumulate arguments incrementally + tool_states[idx][ + "arguments"] += tool_call.function.arguments + + # Verify final content + assert other_content == expected_content + + # Verify we got all expected tool calls + assert len(tool_states) == len(expected_tool_calls) + + # Verify each tool call + for idx, expected_tool in enumerate(expected_tool_calls): + state = tool_states[idx] + assert state["id"] is not None + assert state["type"] == "function" + assert state["name"] == expected_tool.function.name + + # Parse accumulated arguments + arguments_str = state["arguments"] + assert arguments_str is not None + actual_args = json.loads(arguments_str) + expected_args = json.loads(expected_tool.function.arguments) + assert 
actual_args == expected_args + + +def test_extract_tool_calls_streaming_incremental(qwen3_tool_parser, + qwen3_tokenizer, + sample_tools): + """Test that streaming is truly incremental""" + model_output = '''I'll check the weather. + + +Dallas + + +TX + + +''' + + request = ChatCompletionRequest(model=MODEL, + messages=[], + tools=sample_tools) + + chunks = [] + for delta_message in stream_delta_message_generator( + qwen3_tool_parser, qwen3_tokenizer, model_output, request): + chunks.append(delta_message) + + # Should have multiple chunks + assert len(chunks) > 3 + + # First chunk(s) should be content + assert chunks[0].content is not None + assert chunks[0].tool_calls is None or chunks[0].tool_calls == [] + + # Should have a chunk with tool header (id, name, type) + header_found = False + for chunk in chunks: + if chunk.tool_calls and chunk.tool_calls[0].id: + header_found = True + assert (chunk.tool_calls[0].function.name == "get_current_weather") + assert chunk.tool_calls[0].type == "function" + # Empty initially + assert chunk.tool_calls[0].function.arguments == "" + break + assert header_found + + # Should have chunks with incremental arguments + arg_chunks = [] + for chunk in chunks: + if chunk.tool_calls and chunk.tool_calls[0].function.arguments: + arg_chunks.append(chunk.tool_calls[0].function.arguments) + + # Arguments should be streamed incrementally + assert len(arg_chunks) > 1 + + # Concatenated arguments should form valid JSON + full_args = "".join(arg_chunks) + parsed_args = json.loads(full_args) + assert parsed_args["city"] == "Dallas" + assert parsed_args["state"] == "TX" diff --git a/vllm/entrypoints/openai/tool_parsers/__init__.py b/vllm/entrypoints/openai/tool_parsers/__init__.py index 9eda7155f01..88c8aa929b7 100644 --- a/vllm/entrypoints/openai/tool_parsers/__init__.py +++ b/vllm/entrypoints/openai/tool_parsers/__init__.py @@ -17,6 +17,7 @@ from .mistral_tool_parser import MistralToolParser from .phi4mini_tool_parser import Phi4MiniJsonToolParser from .pythonic_tool_parser import PythonicToolParser +from .qwen3coder_tool_parser import Qwen3CoderToolParser from .xlam_tool_parser import xLAMToolParser __all__ = [ @@ -38,4 +39,5 @@ "KimiK2ToolParser", "HunyuanA13BToolParser", "Glm4MoeModelToolParser", + "Qwen3CoderToolParser", ] diff --git a/vllm/entrypoints/openai/tool_parsers/qwen3coder_tool_parser.py b/vllm/entrypoints/openai/tool_parsers/qwen3coder_tool_parser.py new file mode 100644 index 00000000000..cf4d0b231ae --- /dev/null +++ b/vllm/entrypoints/openai/tool_parsers/qwen3coder_tool_parser.py @@ -0,0 +1,669 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import json +import uuid +from collections.abc import Sequence +from typing import Any, Optional, Union + +import regex as re + +from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, + ChatCompletionToolsParam, + DeltaFunctionCall, DeltaMessage, + DeltaToolCall, + ExtractedToolCallInformation, + FunctionCall, ToolCall) +from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import ( + ToolParser, ToolParserManager) +from vllm.logger import init_logger +from vllm.transformers_utils.tokenizer import AnyTokenizer + +logger = init_logger(__name__) + + +@ToolParserManager.register_module(["qwen3_coder"]) +class Qwen3CoderToolParser(ToolParser): + + def __init__(self, tokenizer: AnyTokenizer): + super().__init__(tokenizer) + + self.current_tool_name_sent: bool = False + self.prev_tool_call_arr: list[dict] = [] + 
self.streamed_args_for_tool: list[str] = [] + + # Sentinel tokens for streaming mode + self.tool_call_start_token: str = "" + self.tool_call_end_token: str = "" + self.tool_call_prefix: str = "(.*?)", re.DOTALL) + self.tool_call_regex = re.compile( + r"(.*?)|(.*?)$", re.DOTALL) + self.tool_call_function_regex = re.compile( + r"|| str: + """Generate a unique tool call ID.""" + return f"call_{uuid.uuid4().hex[:24]}" + + def _reset_streaming_state(self): + """Reset all streaming state.""" + self.current_tool_index = 0 + self.is_tool_call_started = False + self.header_sent = False + self.current_tool_string_id = None + self.current_function_name = None + self.current_param_name = None + self.current_param_value = "" + self.param_count = 0 + self.in_param = False + self.in_function = False + self.accumulated_text = "" + self.json_started = False + self.json_closed = False + + def _parse_xml_function_call( + self, function_call_str: str, + tools: Optional[list[ChatCompletionToolsParam]] + ) -> Optional[ToolCall]: + + def get_arguments_config(func_name: str) -> dict: + if tools is None: + return {} + for config in tools: + if not hasattr(config, "type") or not ( + hasattr(config, "function") + and hasattr(config.function, "name")): + continue + if (config.type == "function" + and config.function.name == func_name): + if not hasattr(config.function, "parameters"): + return {} + params = config.function.parameters + if isinstance(params, dict) and "properties" in params: + return params["properties"] + elif isinstance(params, dict): + return params + else: + return {} + logger.warning("Tool '%s' is not defined in the tools list.", + func_name) + return {} + + def convert_param_value(param_value: str, param_name: str, + param_config: dict, func_name: str) -> Any: + # Handle null value for any type + if param_value.lower() == "null": + return None + + converted_value: Any + + if param_name not in param_config: + if param_config != {}: + logger.warning( + "Parsed parameter '%s' is not defined in the tool " + "parameters for tool '%s', directly returning the " + "string value.", param_name, func_name) + return param_value + + if (isinstance(param_config[param_name], dict) + and "type" in param_config[param_name]): + param_type = str( + param_config[param_name]["type"]).strip().lower() + else: + param_type = "string" + if param_type in [ + "string", "str", "text", "varchar", "char", "enum" + ]: + return param_value + elif (param_type.startswith("int") or param_type.startswith("uint") + or param_type.startswith("long") + or param_type.startswith("short") + or param_type.startswith("unsigned")): + try: + converted_value = int(param_value) + return converted_value + except ValueError: + logger.warning( + "Parsed value '%s' of parameter '%s' is not an " + "integer in tool '%s', degenerating to string.", + param_value, param_name, func_name) + return param_value + elif (param_type.startswith("num") + or param_type.startswith("float")): + try: + float_param_value = float(param_value) + converted_value = (float_param_value if float_param_value - + int(float_param_value) != 0 else + int(float_param_value)) + return converted_value + except ValueError: + logger.warning( + "Parsed value '%s' of parameter '%s' is not a float " + "in tool '%s', degenerating to string.", param_value, + param_name, func_name) + return param_value + elif param_type in ["boolean", "bool", "binary"]: + param_value = param_value.lower() + if param_value not in ["true", "false"]: + logger.warning( + "Parsed value '%s' of parameter '%s' is 
not a " + "boolean (`true` of `false`) in tool '%s', " + "degenerating to false.", param_value, param_name, + func_name) + return param_value == "true" + else: + if param_type == "object" or param_type.startswith("dict"): + try: + converted_value = json.loads(param_value) + return converted_value + except json.JSONDecodeError: + logger.warning( + "Parsed value '%s' of parameter '%s' is not a " + "valid JSON object in tool '%s', will try other " + "methods to parse it.", param_value, param_name, + func_name) + try: + converted_value = eval(param_value) + return converted_value + except Exception: + logger.warning( + "Parsed value '%s' of parameter '%s' cannot be " + "converted via Python `eval()` in tool '%s', " + "degenerating to string.", param_value, param_name, + func_name) + return param_value + + # Extract function name + end_index = function_call_str.index(">") + function_name = function_call_str[:end_index] + param_config = get_arguments_config(function_name) + parameters = function_call_str[end_index + 1:] + param_dict = {} + for match in self.tool_call_parameter_regex.findall(parameters): + match_text = match[0] if match[0] else match[1] + idx = match_text.index(">") + param_name = match_text[:idx] + param_value = str(match_text[idx + 1:]) + # Remove prefix and trailing \n + if param_value.startswith("\n"): + param_value = param_value[1:] + if param_value.endswith("\n"): + param_value = param_value[:-1] + + param_dict[param_name] = convert_param_value( + param_value, param_name, param_config, function_name) + return ToolCall( + type="function", + function=FunctionCall(name=function_name, + arguments=json.dumps(param_dict, + ensure_ascii=False)), + ) + + def _get_function_calls(self, model_output: str) -> list[str]: + # Find all tool calls + matched_ranges = self.tool_call_regex.findall(model_output) + raw_tool_calls = [ + match[0] if match[0] else match[1] for match in matched_ranges + ] + + # Back-off strategy if no tool_call tags found + if len(raw_tool_calls) == 0: + raw_tool_calls = [model_output] + + raw_function_calls = [] + for tool_call in raw_tool_calls: + raw_function_calls.extend( + self.tool_call_function_regex.findall(tool_call)) + + function_calls = [ + match[0] if match[0] else match[1] for match in raw_function_calls + ] + return function_calls + + def extract_tool_calls( + self, + model_output: str, + request: ChatCompletionRequest, + ) -> ExtractedToolCallInformation: + # Quick check to avoid unnecessary processing + if self.tool_call_prefix not in model_output: + return ExtractedToolCallInformation(tools_called=False, + tool_calls=[], + content=model_output) + + try: + function_calls = self._get_function_calls(model_output) + if len(function_calls) == 0: + return ExtractedToolCallInformation(tools_called=False, + tool_calls=[], + content=model_output) + + tool_calls = [ + self._parse_xml_function_call(function_call_str, request.tools) + for function_call_str in function_calls + ] + + # Populate prev_tool_call_arr for serving layer to set + # finish_reason + self.prev_tool_call_arr.clear() # Clear previous calls + for tool_call in tool_calls: + if tool_call: + self.prev_tool_call_arr.append({ + "name": + tool_call.function.name, + "arguments": + tool_call.function.arguments, + }) + + # Extract content before tool calls + content_index = model_output.find(self.tool_call_start_token) + content_index = (content_index if content_index >= 0 else + model_output.find(self.tool_call_prefix)) + content = model_output[:content_index] # .rstrip() + + return 
ExtractedToolCallInformation( + tools_called=(len(tool_calls) > 0), + tool_calls=tool_calls, + content=content if content else None, + ) + + except Exception: + logger.exception("Error in extracting tool call from response.") + return ExtractedToolCallInformation(tools_called=False, + tool_calls=[], + content=model_output) + + def extract_tool_calls_streaming( + self, + previous_text: str, + current_text: str, + delta_text: str, + previous_token_ids: Sequence[int], + current_token_ids: Sequence[int], + delta_token_ids: Sequence[int], + request: ChatCompletionRequest, + ) -> Union[DeltaMessage, None]: + # If no delta text, return None unless it's an EOS token after tool + # calls + if not delta_text: + # Check if this is an EOS token after all tool calls are complete + # We check for tool calls in the text even if is_tool_call_started + # is False because it might have been reset after processing all + # tools + if (delta_token_ids + and self.tool_call_end_token_id not in delta_token_ids): + # Count complete tool calls + complete_calls = len( + self.tool_call_complete_regex.findall(current_text)) + + # If we have completed tool calls and populated + # prev_tool_call_arr + if (complete_calls > 0 and len(self.prev_tool_call_arr) > 0): + # Check if all tool calls are closed + open_calls = ( + current_text.count(self.tool_call_start_token) - + current_text.count(self.tool_call_end_token)) + if open_calls == 0: + # Return empty delta message to allow finish_reason + # processing + return DeltaMessage(content="") + elif not self.is_tool_call_started and current_text: + # This is a regular content response that's now complete + return DeltaMessage(content="") + return None + + # Check if this is the first call (reset state if needed) + if not previous_text: + self._reset_streaming_state() + + # Update accumulated text + self.accumulated_text = current_text + + # Check if we need to advance to next tool + if self.json_closed and not self.in_function: + # Check if this tool call has ended + tool_ends = current_text.count(self.tool_call_end_token) + if tool_ends > self.current_tool_index: + # This tool has ended, advance to next + self.current_tool_index += 1 + self.header_sent = False + self.param_count = 0 + self.json_started = False + self.json_closed = False + + # Check if there are more tool calls + tool_starts_count = current_text.count( + self.tool_call_start_token) + if self.current_tool_index >= tool_starts_count: + # No more tool calls + self.is_tool_call_started = False + # Continue processing next tool + return None + + # Handle normal content before tool calls + if not self.is_tool_call_started: + # Check if tool call is starting + if (self.tool_call_start_token_id in delta_token_ids + or self.tool_call_start_token in delta_text): + self.is_tool_call_started = True + # Return any content before the tool call + if self.tool_call_start_token in delta_text: + content_before = delta_text[:delta_text.index( + self.tool_call_start_token)] + if content_before: + return DeltaMessage(content=content_before) + return None + else: + # Check if we're between tool calls - skip whitespace + if (current_text.rstrip().endswith(self.tool_call_end_token) + and delta_text.strip() == ""): + # We just ended a tool call, skip whitespace + return None + # Normal content, no tool call + return DeltaMessage(content=delta_text) + + # Check if we're between tool calls (waiting for next one) + # Count tool calls we've seen vs processed + tool_starts_count = current_text.count(self.tool_call_start_token) + if 
self.current_tool_index >= tool_starts_count: + # We're past all tool calls, shouldn't be here + return None + + # We're in a tool call, find the current tool call portion + # Need to find the correct tool call based on current_tool_index + tool_starts: list[int] = [] + idx = 0 + while True: + idx = current_text.find(self.tool_call_start_token, idx) + if idx == -1: + break + tool_starts.append(idx) + idx += len(self.tool_call_start_token) + + if self.current_tool_index >= len(tool_starts): + # No more tool calls to process yet + return None + + tool_start_idx = tool_starts[self.current_tool_index] + # Find where this tool call ends (or current position if not ended yet) + tool_end_idx = current_text.find(self.tool_call_end_token, + tool_start_idx) + if tool_end_idx == -1: + tool_text = current_text[tool_start_idx:] + else: + tool_text = current_text[tool_start_idx:tool_end_idx + + len(self.tool_call_end_token)] + + # Looking for function header + if not self.header_sent: + if self.tool_call_prefix in tool_text: + func_start = (tool_text.find(self.tool_call_prefix) + + len(self.tool_call_prefix)) + func_end = tool_text.find(">", func_start) + + if func_end != -1: + # Found complete function name + self.current_function_name = tool_text[func_start:func_end] + self.current_tool_string_id = self._generate_tool_call_id() + self.header_sent = True + self.in_function = True + + # IMPORTANT: Add to prev_tool_call_arr immediately when we + # detect a tool call. This ensures + # finish_reason="tool_calls" even if parsing isn't complete + already_added = any( + tool.get("name") == self.current_function_name + for tool in self.prev_tool_call_arr) + if not already_added: + self.prev_tool_call_arr.append({ + "name": self.current_function_name, + "arguments": + "{}", # Placeholder, will be updated later + }) + + # Send header with function info + return DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_index, + id=self.current_tool_string_id, + function=DeltaFunctionCall( + name=self.current_function_name, arguments=""), + type="function", + ) + ]) + return None + + # We've sent header, now handle function body + if self.in_function: + # Send opening brace if not sent yet + if (not self.json_started + and self.parameter_prefix not in delta_text): + self.json_started = True + return DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_index, + function=DeltaFunctionCall(arguments="{"), + ) + ]) + + # Make sure json_started is set if we're processing parameters + if not self.json_started: + self.json_started = True + + # Check for function end in accumulated text + if not self.json_closed and self.function_end_token in tool_text: + # Close JSON + self.json_closed = True + + # Extract the complete tool call to update prev_tool_call_arr + # with final arguments. 
Find the function content + func_start = (tool_text.find(self.tool_call_prefix) + + len(self.tool_call_prefix)) + func_content_end = tool_text.find(self.function_end_token, + func_start) + if func_content_end != -1: + func_content = tool_text[func_start:func_content_end] + # Parse to get the complete arguments + try: + parsed_tool = self._parse_xml_function_call( + func_content, request.tools if request else None) + if parsed_tool: + # Update existing entry in prev_tool_call_arr with + # complete arguments + for i, tool in enumerate(self.prev_tool_call_arr): + if (tool.get("name") == + parsed_tool.function.name): + self.prev_tool_call_arr[i]["arguments"] = ( + parsed_tool.function.arguments) + break + except Exception: + pass # Ignore parsing errors during streaming + + result = DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_index, + function=DeltaFunctionCall(arguments="}"), + ) + ]) + + # Reset state for next tool + self.in_function = False + self.json_closed = True + + return result + + # Look for parameters + # Count how many complete parameters we have processed + complete_params = tool_text.count(self.parameter_end_token) + + # Check if we should start a new parameter + if not self.in_param and self.param_count < complete_params: + # Find the unprocessed parameter + # Count parameter starts + param_starts = [] + idx = 0 + while True: + idx = tool_text.find(self.parameter_prefix, idx) + if idx == -1: + break + param_starts.append(idx) + idx += len(self.parameter_prefix) + + if len(param_starts) > self.param_count: + # Process the next parameter + param_idx = param_starts[self.param_count] + param_start = param_idx + len(self.parameter_prefix) + remaining = tool_text[param_start:] + + if ">" in remaining: + # We have the complete parameter name + name_end = remaining.find(">") + self.current_param_name = remaining[:name_end] + + # Find the parameter value + value_start = param_start + name_end + 1 + value_text = tool_text[value_start:] + if value_text.startswith("\n"): + value_text = value_text[1:] + + # Find where this parameter ends + param_end_idx = value_text.find( + self.parameter_end_token) + if param_end_idx != -1: + # Complete parameter found + param_value = value_text[:param_end_idx] + if param_value.endswith("\n"): + param_value = param_value[:-1] + + # Build complete JSON fragment for this parameter + if self.param_count == 0: + json_fragment = ( + '"' + self.current_param_name + '": "' + + json.dumps(param_value)[1:-1] + '"') + else: + json_fragment = ( + ', "' + self.current_param_name + '": "' + + json.dumps(param_value)[1:-1] + '"') + + self.param_count += 1 + + return DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_index, + function=DeltaFunctionCall( + arguments=json_fragment), + ) + ]) + + # Continue parameter value + if self.in_param: + if self.parameter_end_token in delta_text: + # End of parameter + end_idx = delta_text.find(self.parameter_end_token) + value_chunk = delta_text[:end_idx] + + # Skip past > if at start + if not self.current_param_value and ">" in value_chunk: + gt_idx = value_chunk.find(">") + value_chunk = value_chunk[gt_idx + 1:] + + if (not self.current_param_value + and value_chunk.startswith("\n")): + value_chunk = value_chunk[1:] + + # Calculate incremental JSON + full_value = self.current_param_value + value_chunk + prev_escaped = (json.dumps(self.current_param_value)[1:-1] + if self.current_param_value else "") + full_escaped = json.dumps(full_value)[1:-1] + delta_escaped = 
full_escaped[len(prev_escaped):] + + self.in_param = False + self.current_param_value = "" + + return DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_index, + function=DeltaFunctionCall( + arguments=delta_escaped + '"'), + ) + ]) + else: + # Continue accumulating value + value_chunk = delta_text + + # Handle first chunk after param name + if not self.current_param_value and ">" in value_chunk: + gt_idx = value_chunk.find(">") + value_chunk = value_chunk[gt_idx + 1:] + + if (not self.current_param_value + and value_chunk.startswith("\n")): + value_chunk = value_chunk[1:] + + if value_chunk: + # Stream the escaped delta + prev_escaped = (json.dumps( + self.current_param_value)[1:-1] + if self.current_param_value else "") + self.current_param_value += value_chunk + full_escaped = json.dumps( + self.current_param_value)[1:-1] + delta_escaped = full_escaped[len(prev_escaped):] + + if delta_escaped: + return DeltaMessage(tool_calls=[ + DeltaToolCall( + index=self.current_tool_index, + function=DeltaFunctionCall( + arguments=delta_escaped), + ) + ]) + + return None From 3e028497e3078883c126eb59f8b22a7d778fb285 Mon Sep 17 00:00:00 2001 From: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Date: Tue, 22 Jul 2025 16:18:42 -0700 Subject: [PATCH 266/552] [Misc] Copy HF_TOKEN env var to Ray workers (#21406) Signed-off-by: Rui Qiao Signed-off-by: x22x22 --- vllm/executor/ray_distributed_executor.py | 6 +++++- vllm/ray/ray_env.py | 5 +++-- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/vllm/executor/ray_distributed_executor.py b/vllm/executor/ray_distributed_executor.py index 417750a08c6..e9ad62aeb99 100644 --- a/vllm/executor/ray_distributed_executor.py +++ b/vllm/executor/ray_distributed_executor.py @@ -58,6 +58,9 @@ class RayDistributedExecutor(DistributedExecutorBase): "VLLM_HOST_IP", "VLLM_HOST_PORT", "LOCAL_RANK", "CUDA_VISIBLE_DEVICES" } + # These non-vLLM env vars are copied from the driver to workers + ADDITIONAL_ENV_VARS = {"HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"} + uses_ray: bool = True def _init_executor(self) -> None: @@ -326,7 +329,8 @@ def sort_by_driver_then_worker_ip(item: RayWorkerMetaData): # Environment variables to copy from driver to workers env_vars_to_copy = get_env_vars_to_copy( exclude_vars=self.WORKER_SPECIFIC_ENV_VARS, - additional_vars=set(current_platform.additional_env_vars), + additional_vars=set(current_platform.additional_env_vars).union( + self.ADDITIONAL_ENV_VARS), destination="workers") # Copy existing env vars to each worker's args diff --git a/vllm/ray/ray_env.py b/vllm/ray/ray_env.py index 716d0bfafae..f6a994bb3c2 100644 --- a/vllm/ray/ray_env.py +++ b/vllm/ray/ray_env.py @@ -43,6 +43,8 @@ def get_env_vars_to_copy(exclude_vars: Optional[set[str]] = None, exclude_vars: A set of vllm defined environment variables to exclude from copying. additional_vars: A set of additional environment variables to copy. + If a variable is in both exclude_vars and additional_vars, it will + be excluded. destination: The destination of the environment variables. Returns: A set of environment variables to copy. 
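To make the precedence spelled out in the updated docstring concrete, here is a small self-contained sketch of the selection rule; the function name, the `vllm_env_vars` argument and the one-element `RAY_NON_CARRY_OVER_ENV_VARS` set are illustrative stand-ins rather than the actual vLLM/Ray definitions.

```python
# Hedged sketch of the copy rule documented above: the base set is unioned
# with additional_vars first and filtered afterwards, so exclude_vars always
# wins and Ray's non-carry-over variables are never forwarded to workers.
RAY_NON_CARRY_OVER_ENV_VARS = {"CUDA_VISIBLE_DEVICES"}  # illustrative subset


def select_env_vars_to_copy(vllm_env_vars: set[str], exclude_vars: set[str],
                            additional_vars: set[str]) -> set[str]:
    return {
        v
        for v in vllm_env_vars | additional_vars
        if v not in exclude_vars and v not in RAY_NON_CARRY_OVER_ENV_VARS
    }


# HF_TOKEN (an additional var) is copied to workers, while LOCAL_RANK stays
# excluded even though it also appears in additional_vars.
print(select_env_vars_to_copy({"VLLM_HOST_IP", "VLLM_USE_V1"},
                              {"VLLM_HOST_IP", "LOCAL_RANK"},
                              {"HF_TOKEN", "LOCAL_RANK"}))
# expected (order may vary): {'VLLM_USE_V1', 'HF_TOKEN'}
```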
@@ -52,10 +54,9 @@ def get_env_vars_to_copy(exclude_vars: Optional[set[str]] = None, env_vars_to_copy = { v - for v in envs.environment_variables + for v in set(envs.environment_variables).union(additional_vars) if v not in exclude_vars and v not in RAY_NON_CARRY_OVER_ENV_VARS } - env_vars_to_copy.update(additional_vars) to_destination = " to " + destination if destination is not None else "" From 5928d18c535c5d3dd55d49f31770272a61088503 Mon Sep 17 00:00:00 2001 From: Joe Runde Date: Tue, 22 Jul 2025 17:19:55 -0600 Subject: [PATCH 267/552] [BugFix] Fix ray import error mem cleanup bug (#21381) Signed-off-by: Travis Johnson Signed-off-by: Joe Runde Co-authored-by: Travis Johnson Signed-off-by: x22x22 --- vllm/config.py | 5 +++-- vllm/executor/ray_utils.py | 8 +++++--- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/vllm/config.py b/vllm/config.py index 5d7b19f9e9b..a5f67451a77 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -2137,10 +2137,11 @@ def __post_init__(self) -> None: elif (current_platform.is_cuda() and cuda_device_count_stateless() < self.world_size): if not ray_found: - raise ValueError("Unable to load Ray which is " + raise ValueError("Unable to load Ray: " + f"{ray_utils.ray_import_err}. Ray is " "required for multi-node inference, " "please install Ray with `pip install " - "ray`.") from ray_utils.ray_import_err + "ray`.") backend = "ray" elif self.data_parallel_backend == "ray": logger.info("Using ray distributed inference because " diff --git a/vllm/executor/ray_utils.py b/vllm/executor/ray_utils.py index c222f160909..033ecc00853 100644 --- a/vllm/executor/ray_utils.py +++ b/vllm/executor/ray_utils.py @@ -145,7 +145,9 @@ def override_env_vars(self, vars: Dict[str, str]): except ImportError as e: ray = None # type: ignore - ray_import_err = e + # only capture string to avoid variable references in the traceback that can + # prevent garbage collection in some cases + ray_import_err = str(e) RayWorkerWrapper = None # type: ignore @@ -157,8 +159,8 @@ def ray_is_available() -> bool: def assert_ray_available(): """Raise an exception if Ray is not available.""" if ray is None: - raise ValueError("Failed to import Ray, please install Ray with " - "`pip install ray`.") from ray_import_err + raise ValueError(f"Failed to import Ray: {ray_import_err}." 
+ "Please install Ray with `pip install ray`.") def _verify_bundles(placement_group: "PlacementGroup", From cee6b472ebe6efe9f1ec38ee96c847cddb8cd555 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 23 Jul 2025 11:25:37 +0800 Subject: [PATCH 268/552] [CI/Build] Fix model executor tests (#21387) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 1 - tests/model_executor/test_model_load_with_params.py | 13 +++++++++---- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index c476f71c663..f4b69fa21ec 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -434,7 +434,6 @@ steps: - label: Model Executor Test mirror_hardwares: [amdexperimental, amdproduction] - soft_fail: true source_file_dependencies: - vllm/model_executor - tests/model_executor diff --git a/tests/model_executor/test_model_load_with_params.py b/tests/model_executor/test_model_load_with_params.py index 27374763021..aae9a4d1ef1 100644 --- a/tests/model_executor/test_model_load_with_params.py +++ b/tests/model_executor/test_model_load_with_params.py @@ -5,7 +5,8 @@ import pytest -from vllm.model_executor.layers.pooler import CLSPool, MeanPool, PoolingType +from vllm.model_executor.layers.pooler import (CLSPool, DispatchPooler, + MeanPool, PoolingType) from vllm.model_executor.models.bert import BertEmbeddingModel from vllm.model_executor.models.roberta import RobertaEmbeddingModel from vllm.platforms import current_platform @@ -49,7 +50,8 @@ def test_model_loading_with_params(vllm_runner): def check_model(model): assert isinstance(model, BertEmbeddingModel) - assert isinstance(model.pooler.pooling, CLSPool) + assert isinstance(pooler := model.pooler, DispatchPooler) + assert isinstance(pooler.poolers_by_task["embed"].pooling, CLSPool) vllm_model.apply_model(check_model) @@ -87,7 +89,9 @@ def test_roberta_model_loading_with_params(vllm_runner): def check_model(model): assert isinstance(model, RobertaEmbeddingModel) - assert isinstance(model.pooler.pooling, MeanPool) + assert isinstance(pooler := model.pooler, DispatchPooler) + assert isinstance(pooler.poolers_by_task["embed"].pooling, + MeanPool) vllm_model.apply_model(check_model) @@ -114,7 +118,8 @@ def test_facebook_roberta_model_loading_with_params(vllm_runner): def check_model(model): assert isinstance(model, RobertaEmbeddingModel) assert not hasattr(model, "lm_head") - assert isinstance(model.pooler.pooling, CLSPool) + assert isinstance(pooler := model.pooler, DispatchPooler) + assert isinstance(pooler.poolers_by_task["embed"].pooling, CLSPool) vllm_model.apply_model(check_model) From 1cea502807cd8f31485e38e2bfbe5709226c0fde Mon Sep 17 00:00:00 2001 From: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Date: Tue, 22 Jul 2025 23:27:41 -0400 Subject: [PATCH 269/552] [Bugfix][ROCm][Build] Fix build regression on ROCm (#21393) Signed-off-by: Gregory Shtrasberg Signed-off-by: x22x22 --- CMakeLists.txt | 4 ++-- csrc/ops.h | 10 +++++----- csrc/torch_bindings.cpp | 18 +++++++++--------- 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 767e9ad7541..98ed682fee7 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -245,7 +245,6 @@ set(VLLM_EXT_SRC "csrc/quantization/gptq/q_gemm.cu" "csrc/quantization/compressed_tensors/int8_quant_kernels.cu" "csrc/quantization/fp8/common.cu" - "csrc/quantization/fp8/per_token_group_quant.cu" 
"csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu" "csrc/quantization/gguf/gguf_kernel.cu" "csrc/quantization/activation_kernels.cu" @@ -297,7 +296,8 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") "csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu" "csrc/sparse/cutlass/sparse_scaled_mm_entry.cu" "csrc/cutlass_extensions/common.cpp" - "csrc/attention/mla/cutlass_mla_entry.cu") + "csrc/attention/mla/cutlass_mla_entry.cu" + "csrc/quantization/fp8/per_token_group_quant.cu") set_gencode_flags_for_srcs( SRCS "${VLLM_EXT_SRC}" diff --git a/csrc/ops.h b/csrc/ops.h index fdd3071c56e..97a247d9d62 100644 --- a/csrc/ops.h +++ b/csrc/ops.h @@ -287,6 +287,11 @@ void scaled_fp4_experts_quant( torch::Tensor const& input, torch::Tensor const& input_global_scale, torch::Tensor const& input_offset_by_experts, torch::Tensor const& output_scale_offset_by_experts); + +void per_token_group_quant_fp8(const torch::Tensor& input, + torch::Tensor& output_q, torch::Tensor& output_s, + int64_t group_size, double eps, double fp8_min, + double fp8_max, bool scale_ue8m0); #endif void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, @@ -297,11 +302,6 @@ void dynamic_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor& scales, std::optional const& azp); -void per_token_group_quant_fp8(const torch::Tensor& input, - torch::Tensor& output_q, torch::Tensor& output_s, - int64_t group_size, double eps, double fp8_min, - double fp8_max, bool scale_ue8m0); - torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight, torch::Tensor b_gptq_qzeros, torch::Tensor b_gptq_scales, torch::Tensor b_g_idx, diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp index d310211afe4..95f8541bc9e 100644 --- a/csrc/torch_bindings.cpp +++ b/csrc/torch_bindings.cpp @@ -601,15 +601,6 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { ops.impl("dynamic_scaled_int8_quant", torch::kCUDA, &dynamic_scaled_int8_quant); - // Compute per-token-group FP8 quantized tensor and scaling factor. - ops.def( - "per_token_group_fp8_quant(Tensor input, Tensor! output_q, Tensor! " - "output_s, " - "int group_size, float eps, float fp8_min, float fp8_max, bool " - "scale_ue8m0) -> ()"); - ops.impl("per_token_group_fp8_quant", torch::kCUDA, - &per_token_group_quant_fp8); - // Mamba selective scan kernel ops.def( "selective_scan_fwd(Tensor! u, Tensor! delta," @@ -624,6 +615,15 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { ops.impl("selective_scan_fwd", torch::kCUDA, &selective_scan_fwd); #ifndef USE_ROCM + // Compute per-token-group FP8 quantized tensor and scaling factor. + ops.def( + "per_token_group_fp8_quant(Tensor input, Tensor! output_q, Tensor! 
" + "output_s, " + "int group_size, float eps, float fp8_min, float fp8_max, bool " + "scale_ue8m0) -> ()"); + ops.impl("per_token_group_fp8_quant", torch::kCUDA, + &per_token_group_quant_fp8); + // reorder weight for AllSpark Ampere W8A16 Fused Gemm kernel ops.def( "rearrange_kn_weight_as_n32k16_order(Tensor b_qweight, Tensor b_scales, " From 98ba104545a624ffa459ec182d0c4186f4c11af9 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Wed, 23 Jul 2025 04:29:43 +0100 Subject: [PATCH 270/552] Simplify weight loading in Transformers backend (#21382) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- tests/distributed/test_pipeline_parallel.py | 4 +- tests/lora/test_transformers_model.py | 2 +- tests/models/registry.py | 2 +- tests/models/test_transformers.py | 2 +- vllm/model_executor/models/interfaces.py | 10 +- vllm/model_executor/models/transformers.py | 107 ++++++++------------ vllm/test_utils.py | 2 +- 7 files changed, 53 insertions(+), 76 deletions(-) diff --git a/tests/distributed/test_pipeline_parallel.py b/tests/distributed/test_pipeline_parallel.py index 926a33c949e..2391430a083 100644 --- a/tests/distributed/test_pipeline_parallel.py +++ b/tests/distributed/test_pipeline_parallel.py @@ -177,7 +177,7 @@ def iter_params(self, model_id: str): "ai21labs/Jamba-tiny-dev": PPTestSettings.fast(), "meta-llama/Llama-3.2-1B-Instruct": PPTestSettings.detailed(), # Tests TransformersForCausalLM - "ArthurZ/Ilama-3.2-1B": PPTestSettings.fast(), + "hmellor/Ilama-3.2-1B": PPTestSettings.fast(), "openbmb/MiniCPM-2B-sft-bf16": PPTestSettings.fast(), "openbmb/MiniCPM3-4B": PPTestSettings.fast(), # Uses Llama @@ -249,7 +249,7 @@ def iter_params(self, model_id: str): # [LANGUAGE GENERATION] "microsoft/Phi-3.5-MoE-instruct", "meta-llama/Llama-3.2-1B-Instruct", - "ArthurZ/Ilama-3.2-1B", + "hmellor/Ilama-3.2-1B", "ibm/PowerLM-3b", "deepseek-ai/DeepSeek-V2-Lite-Chat", # [LANGUAGE EMBEDDING] diff --git a/tests/lora/test_transformers_model.py b/tests/lora/test_transformers_model.py index 5065a2fb716..723f7a54778 100644 --- a/tests/lora/test_transformers_model.py +++ b/tests/lora/test_transformers_model.py @@ -9,7 +9,7 @@ from ..utils import create_new_process_for_each_test, multi_gpu_test -MODEL_PATH = "ArthurZ/ilama-3.2-1B" +MODEL_PATH = "hmellor/Ilama-3.2-1B" PROMPT_TEMPLATE = """I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n"\n##Instruction:\nconcert_singer contains tables such as stadium, singer, concert, singer_in_concert. Table stadium has columns such as Stadium_ID, Location, Name, Capacity, Highest, Lowest, Average. Stadium_ID is the primary key.\nTable singer has columns such as Singer_ID, Name, Country, Song_Name, Song_release_year, Age, Is_male. Singer_ID is the primary key.\nTable concert has columns such as concert_ID, concert_Name, Theme, Stadium_ID, Year. concert_ID is the primary key.\nTable singer_in_concert has columns such as concert_ID, Singer_ID. 
concert_ID is the primary key.\nThe Stadium_ID of concert is the foreign key of Stadium_ID of stadium.\nThe Singer_ID of singer_in_concert is the foreign key of Singer_ID of singer.\nThe concert_ID of singer_in_concert is the foreign key of concert_ID of concert.\n\n###Input:\n{query}\n\n###Response:""" # noqa: E501 diff --git a/tests/models/registry.py b/tests/models/registry.py index 776b4c03356..257ca36db3a 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -500,7 +500,7 @@ def check_available_online( } _TRANSFORMERS_MODELS = { - "TransformersForCausalLM": _HfExamplesInfo("ArthurZ/Ilama-3.2-1B", trust_remote_code=True), # noqa: E501 + "TransformersForCausalLM": _HfExamplesInfo("hmellor/Ilama-3.2-1B", trust_remote_code=True), # noqa: E501 "TransformersForMultimodalLM": _HfExamplesInfo("OpenGVLab/InternVL3-1B-hf"), } diff --git a/tests/models/test_transformers.py b/tests/models/test_transformers.py index 16b9bcffd26..cd5b6193d00 100644 --- a/tests/models/test_transformers.py +++ b/tests/models/test_transformers.py @@ -56,7 +56,7 @@ def check_implementation( "model,model_impl", [ ("meta-llama/Llama-3.2-1B-Instruct", "transformers"), - ("ArthurZ/Ilama-3.2-1B", "auto"), # CUSTOM CODE + ("hmellor/Ilama-3.2-1B", "auto"), # CUSTOM CODE ]) # trust_remote_code=True by default def test_models( hf_runner: type[HfRunner], diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 7f3efde4347..8f6a7db7aa8 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -624,13 +624,9 @@ def __new__(cls, *args, **kwargs) -> Self: instance.quant_config = quant_config # apply model mappings to config for proper config-model matching - # NOTE: `TransformersForCausalLM` is not supported due to how this - # class defines `hf_to_vllm_mapper` as a post-init `@property`. - # After this is fixed, get `instance.hf_to_vllm_mapper` directly - if getattr(instance, "hf_to_vllm_mapper", None) is not None: - instance.quant_config.apply_vllm_mapper( - instance.hf_to_vllm_mapper) - if getattr(instance, "packed_modules_mapping", None) is not None: + if (hf_to_vllm_mapper := instance.hf_to_vllm_mapper) is not None: + instance.quant_config.apply_vllm_mapper(hf_to_vllm_mapper) + if instance.packed_modules_mapping is not None: instance.quant_config.packed_modules_mapping.update( instance.packed_modules_mapping) diff --git a/vllm/model_executor/models/transformers.py b/vllm/model_executor/models/transformers.py index cb9d28b1067..610f8e752db 100644 --- a/vllm/model_executor/models/transformers.py +++ b/vllm/model_executor/models/transformers.py @@ -414,7 +414,7 @@ def __exit__(self, exc_type, exc_value, traceback): setattr(self.config, key, value) -class TransformersModel(nn.Module): +class TransformersModel: def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() @@ -454,9 +454,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): # method after v4.54.0 is released self.text_config._attn_implementation = "vllm" with init_on_device_without_buffers("meta"), config_override: - # FIXME(Isotr0py): We need to refactor this part in the future to - # avoid registering an extra model layer, otherwise we will need a - # weights mapper to rename weights. 
self.model: PreTrainedModel = AutoModel.from_config( config, torch_dtype=model_config.dtype, @@ -620,9 +617,6 @@ def init_parameters(self, module: nn.Module): for child in module.children(): self.init_parameters(child) - def get_input_embeddings(self) -> nn.Module: - return self.model.get_input_embeddings() - def forward( self, input_ids: Optional[torch.Tensor], @@ -694,7 +688,9 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.config = config - self.model = TransformersModel(vllm_config=vllm_config, prefix=prefix) + self.transformers_model = TransformersModel(vllm_config=vllm_config, + prefix=prefix) + self.model = self.transformers_model.model if get_pp_group().is_last_rank: self.unpadded_vocab_size = config.vocab_size @@ -716,22 +712,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.lm_head = PPMissingLayer() self.make_empty_intermediate_tensors = ( - self.model.make_empty_intermediate_tensors) - - # FIXME(Isotr0py): Don't use any weights mapper for Transformers backend, - # this makes thing complicated. We need to remove this mapper after refactor - # `TransformersModel` in the future. - # NOTE: `SupportsQuant` can be updated after property decorator is removed - @property - def hf_to_vllm_mapper(self): - prefix_mapper = { - name: "model." + name - for name, _ in self.model.model.named_children() - } - return WeightsMapper( - orig_to_new_substr={"model.": "model.model."}, - orig_to_new_prefix=prefix_mapper, - ) + self.transformers_model.make_empty_intermediate_tensors) def forward( self, @@ -740,8 +721,9 @@ def forward( intermediate_tensors: Optional[IntermediateTensors] = None, inputs_embeds: Optional[torch.Tensor] = None, ) -> Union[torch.Tensor, IntermediateTensors]: - model_output = self.model(input_ids, positions, intermediate_tensors, - inputs_embeds) + model_output = self.transformers_model.forward(input_ids, positions, + intermediate_tensors, + inputs_embeds) return model_output def compute_logits( @@ -755,12 +737,10 @@ def compute_logits( def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: - loader = AutoWeightsLoader( - self, - skip_prefixes=(["lm_head."] - if self.config.tie_word_embeddings else None), - ) - return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper) + skip_prefixes = ["lm_head." + ] if self.config.tie_word_embeddings else None + loader = AutoWeightsLoader(self, skip_prefixes=skip_prefixes) + return loader.load_weights(weights) @MULTIMODAL_REGISTRY.register_processor( @@ -772,6 +752,29 @@ class TransformersForMultimodalLM(nn.Module, SupportsQuant, SupportsLoRA, embedding_padding_modules = ["lm_head"] embedding_modules = ["embed_tokens"] + # Backwards compatibility for prev released models. State dicts back then + # had different formats and cannot be loaded with `AutoModel` mapping as is + hf_to_vllm_mapper = WeightsMapper( + orig_to_new_prefix={ + "language_model.model": "model.language_model", + "text_model.model": "model.text_model", + "vision_tower": "model.vision_tower", + "vqmodel": "model.vqmodel", + "visual": "model.visual", + "vision_model": "model.vision_model", + "vision_embed_tokens": "model.vision_embed_tokens", + "image_newline": "model.image_newline", + "multi_modal_projector": "model.multi_modal_projector", + "text_model.lm_head": "lm_head", + "language_model.lm_head": "lm_head", + # Qwen models used "model" as the name for the language model. 
+ # Therefore, we must map each of submodule explicitly to avoid + # conflicts with newer models that use "model.language_model". + "model.embed_tokens": "model.language_model.embed_tokens", + "model.layers": "model.language_model.layers", + "model.norm": "model.language_model.norm", + }) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() config: PretrainedConfig = vllm_config.model_config.hf_config @@ -780,7 +783,9 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.config = config self.dtype = vllm_config.model_config.dtype - self.model = TransformersModel(vllm_config=vllm_config, prefix=prefix) + self.transformers_model = TransformersModel(vllm_config=vllm_config, + prefix=prefix) + self.model = self.transformers_model.model text_config = config.get_text_config() if get_pp_group().is_last_rank: @@ -803,32 +808,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.lm_head = PPMissingLayer() self.make_empty_intermediate_tensors = ( - self.model.make_empty_intermediate_tensors) - - @property - def hf_to_vllm_mapper(self): - # Backwards compatibility for prev released models - # State dicts back then had different formats - # and cannot be loaded with `AutoModel` mapping - # as is - prefix_mapper = { - "language_model.model": "model.language_model", - "text_model.model": "model.text_model", - "vision_tower": "model.vision_tower", - "vqmodel": "model.vqmodel", - "vision_model": "model.vision_model", - "vision_embed_tokens": "model.vision_embed_tokens", - "image_newline": "model.image_newline", - "multi_modal_projector": "model.multi_modal_projector", - "text_model.lm_head": "lm_head", - "language_model.lm_head": "lm_head", - } - # Don't change the order for QwenVL - if 'Qwen2' in self.config.__class__.__name__: - prefix_mapper["model"] = "model.language_model" - prefix_mapper["visual"] = "model.visual" - - return WeightsMapper(orig_to_new_prefix=prefix_mapper, ) + self.transformers_model.make_empty_intermediate_tensors) def forward( self, @@ -848,8 +828,9 @@ def forward( input_ids, multimodal_embeds) input_ids = None - model_output = self.model(input_ids, positions, intermediate_tensors, - inputs_embeds) + model_output = self.transformers_model.forward(input_ids, positions, + intermediate_tensors, + inputs_embeds) return model_output def compute_logits( @@ -898,7 +879,7 @@ def get_multimodal_embeddings(self, **kwargs): if isinstance(num_image_patches, list): num_image_patches = torch.cat(num_image_patches) - vision_embeddings = self.model.model.get_image_features( + vision_embeddings = self.model.get_image_features( pixel_values, **{ k: v.flatten(0, 1) @@ -928,7 +909,7 @@ def get_input_embeddings( input_ids: torch.Tensor, multimodal_embeddings=None, ) -> torch.Tensor: - inputs_embeds = self.model.model.get_input_embeddings()(input_ids) + inputs_embeds = self.model.get_input_embeddings()(input_ids) if (multimodal_embeddings is not None and len(multimodal_embeddings) != 0): mask = (input_ids == self.config.image_token_id) diff --git a/vllm/test_utils.py b/vllm/test_utils.py index c6b126d002b..1e61ca6b3de 100644 --- a/vllm/test_utils.py +++ b/vllm/test_utils.py @@ -10,7 +10,7 @@ "allenai/OLMoE-1B-7B-0924-Instruct", "amd/Llama-3.1-8B-Instruct-FP8-KV-Quark-test", "AMead10/Llama-3.2-1B-Instruct-AWQ", - "ArthurZ/Ilama-3.2-1B", + "hmellor/Ilama-3.2-1B", "BAAI/bge-base-en-v1.5", "BAAI/bge-multilingual-gemma2", "BAAI/bge-reranker-v2-m3", From 8f3cefedc400c370899faf454c3b6dcc266f78f3 Mon Sep 17 00:00:00 2001 From: 
ericehanley Date: Tue, 22 Jul 2025 22:33:00 -0500 Subject: [PATCH 271/552] [BugFix] Update python to python3 calls for image; fix prefix & input calculations. (#21391) Signed-off-by: Eric Hanley Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 --- benchmarks/auto_tune/auto_tune.sh | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/benchmarks/auto_tune/auto_tune.sh b/benchmarks/auto_tune/auto_tune.sh index 159ee142147..eaa28ea5c92 100644 --- a/benchmarks/auto_tune/auto_tune.sh +++ b/benchmarks/auto_tune/auto_tune.sh @@ -126,11 +126,12 @@ run_benchmark() { # get a basic qps by using request-rate inf bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_inf.txt" prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 )) - python benchmarks/benchmark_serving.py \ +adjusted_input_len=$(( INPUT_LEN - prefix_len )) + python3 benchmarks/benchmark_serving.py \ --backend vllm \ --model $MODEL \ --dataset-name random \ - --random-input-len $INPUT_LEN \ + --random-input-len $adjusted_input_len \ --random-output-len $OUTPUT_LEN \ --ignore-eos \ --disable-tqdm \ @@ -159,11 +160,11 @@ run_benchmark() { curl -X POST http://0.0.0.0:8004/reset_prefix_cache sleep 5 bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_${request_rate}.txt" - python benchmarks/benchmark_serving.py \ + python3 benchmarks/benchmark_serving.py \ --backend vllm \ --model $MODEL \ --dataset-name random \ - --random-input-len $INPUT_LEN \ + --random-input-len $adjusted_input_len \ --random-output-len $OUTPUT_LEN \ --ignore-eos \ --disable-tqdm \ From 11518eecab970bc3fe8499ce5983837a3fe0e695 Mon Sep 17 00:00:00 2001 From: "Chendi.Xue" Date: Tue, 22 Jul 2025 22:33:57 -0500 Subject: [PATCH 272/552] [BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update (#21414) Signed-off-by: Chendi.Xue Signed-off-by: x22x22 --- vllm/model_executor/models/deepseek_v2.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/vllm/model_executor/models/deepseek_v2.py b/vllm/model_executor/models/deepseek_v2.py index 649109777b3..79ddd3d0f62 100644 --- a/vllm/model_executor/models/deepseek_v2.py +++ b/vllm/model_executor/models/deepseek_v2.py @@ -885,13 +885,16 @@ def load_weights(self, weights: Iterable[tuple[str, # for mlp.experts[0].gate_gate_up_proj, which breaks load. if (("mlp.experts." in name) and name not in params_dict): continue - name = name.replace(weight_name, param_name) + name_mapped = name.replace(weight_name, param_name) # QKV fusion is optional, fall back to normal # weight loading if it's not enabled + # if go with fusion option, then update name if ((param_name == "fused_qkv_a_proj") - and name not in params_dict): + and name_mapped not in params_dict): continue + else: + name = name_mapped # Skip loading extra bias for GPTQ models. 
if name.endswith(".bias") and name not in params_dict: continue From 7763c2cf379f37586aa8f9890b9820fb622ac4ba Mon Sep 17 00:00:00 2001 From: elvischenv <219235043+elvischenv@users.noreply.github.com> Date: Wed, 23 Jul 2025 11:34:50 +0800 Subject: [PATCH 273/552] [Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported (#21420) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/platforms/cuda.py | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index cc2543538d0..9a8941e3cdd 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -456,6 +456,19 @@ def stateless_init_device_torch_dist_pg( def device_count(cls) -> int: return cuda_device_count_stateless() + @classmethod + def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool: + fp8_attention = kv_cache_dtype.startswith("fp8") + will_use_fa = (not envs.is_set("VLLM_ATTENTION_BACKEND") + ) or envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" + supported = False + if cls.is_device_capability(100): + supported = True + elif fp8_attention and will_use_fa: + from vllm.attention.utils.fa_utils import flash_attn_supports_fp8 + supported = flash_attn_supports_fp8() + return supported + # NVML utils # Note that NVML is not affected by `CUDA_VISIBLE_DEVICES`, @@ -583,19 +596,6 @@ def is_fully_connected(cls, physical_device_ids: list[int]) -> bool: " not found. Assuming no NVLink available.") return False - @classmethod - def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool: - fp8_attention = kv_cache_dtype.startswith("fp8") - will_use_fa = (not envs.is_set("VLLM_ATTENTION_BACKEND") - ) or envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" - supported = False - if cls.is_device_capability(100): - supported = True - elif fp8_attention and will_use_fa: - from vllm.attention.utils.fa_utils import flash_attn_supports_fp8 - supported = flash_attn_supports_fp8() - return supported - # Autodetect either NVML-enabled or non-NVML platform # based on whether NVML is available. From def1dc92e36328d83eac7516f27207ad242d59c2 Mon Sep 17 00:00:00 2001 From: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com> Date: Tue, 22 Jul 2025 22:48:31 -0500 Subject: [PATCH 274/552] Changing "amdproduction" allocation. (#21409) Signed-off-by: Alexei V. 
Ivanov Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index f4b69fa21ec..00608229b95 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -225,7 +225,7 @@ steps: ##### 1 GPU test ##### - label: Regression Test # 5min - mirror_hardwares: [amdexperimental] + mirror_hardwares: [amdexperimental, amdproduction] source_file_dependencies: - vllm/ - tests/test_regression @@ -277,7 +277,7 @@ steps: - pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine - label: Examples Test # 25min - mirror_hardwares: [amdexperimental] + mirror_hardwares: [amdexperimental, amdproduction] working_dir: "/vllm-workspace/examples" source_file_dependencies: - vllm/entrypoints @@ -311,7 +311,7 @@ steps: - label: Platform Tests (CUDA) - mirror_hardwares: [amdexperimental] + mirror_hardwares: [amdexperimental, amdproduction] source_file_dependencies: - vllm/ - tests/cuda @@ -330,7 +330,7 @@ steps: - VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers - label: LoRA Test %N # 15min each - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - vllm/lora - tests/lora @@ -382,7 +382,7 @@ steps: - pytest -v -s kernels/core - label: Kernels Attention Test %N - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - csrc/attention/ - vllm/attention @@ -393,7 +393,7 @@ steps: parallelism: 2 - label: Kernels Quantization Test %N - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - csrc/quantization/ - vllm/model_executor/layers/quantization @@ -412,7 +412,7 @@ steps: - pytest -v -s kernels/moe - label: Kernels Mamba Test - mirror_hardwares: [amdexperimental] + mirror_hardwares: [amdexperimental, amdproduction] source_file_dependencies: - csrc/mamba/ - tests/kernels/mamba @@ -420,7 +420,7 @@ steps: - pytest -v -s kernels/mamba - label: Tensorizer Test # 11min - mirror_hardwares: [amdexperimental] + mirror_hardwares: [amdexperimental, amdproduction] soft_fail: true source_file_dependencies: - vllm/model_executor/model_loader @@ -490,7 +490,7 @@ steps: - pytest -s entrypoints/openai/correctness/ - label: Encoder Decoder tests # 5min - mirror_hardwares: [amdexperimental] + mirror_hardwares: [amdexperimental, amdproduction] source_file_dependencies: - vllm/ - tests/encoder_decoder @@ -498,7 +498,7 @@ steps: - pytest -v -s encoder_decoder - label: OpenAI-Compatible Tool Use # 20 min - mirror_hardwares: [amdexperimental] + mirror_hardwares: [amdexperimental, amdproduction] fast_check: false source_file_dependencies: - vllm/ @@ -610,7 +610,7 @@ steps: - pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model' - label: Quantized Models Test - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - vllm/model_executor/layers/quantization - tests/models/quantization From 04aec7c3d0a34f577b09968ba967fecb8fef5032 Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Wed, 23 Jul 2025 15:01:01 +0800 Subject: [PATCH 275/552] [Bugfix] Fix nightly transformers CI failure (#21427) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: x22x22 --- tests/models/registry.py | 12 ++-- vllm/model_executor/models/tarsier.py | 6 +- 
vllm/transformers_utils/config.py | 2 + vllm/transformers_utils/configs/__init__.py | 2 + .../transformers_utils/configs/nemotron_vl.py | 56 +++++++++++++++++++ 5 files changed, 67 insertions(+), 11 deletions(-) create mode 100644 vllm/transformers_utils/configs/nemotron_vl.py diff --git a/tests/models/registry.py b/tests/models/registry.py index 257ca36db3a..1eb7f7b9d82 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -443,6 +443,12 @@ def check_available_online( hf_overrides={"architectures": ["TarsierForConditionalGeneration"]}), # noqa: E501 "Tarsier2ForConditionalGeneration": _HfExamplesInfo("omni-research/Tarsier2-Recap-7b", # noqa: E501 hf_overrides={"architectures": ["Tarsier2ForConditionalGeneration"]}), # noqa: E501 + "VoxtralForConditionalGeneration": _HfExamplesInfo( + "mistralai/Voxtral-Mini-3B-2507", + min_transformers_version="4.54", + # disable this temporarily until we support HF format + is_available_online=False, + ), # [Encoder-decoder] # Florence-2 uses BartFastTokenizer which can't be loaded from AutoTokenizer # Therefore, we borrow the BartTokenizer from the original Bart model @@ -450,13 +456,7 @@ def check_available_online( tokenizer="Isotr0py/Florence-2-tokenizer", # noqa: E501 trust_remote_code=True), # noqa: E501 "MllamaForConditionalGeneration": _HfExamplesInfo("meta-llama/Llama-3.2-11B-Vision-Instruct"), # noqa: E501 - "VoxtralForConditionalGeneration": _HfExamplesInfo( - "mistralai/Voxtral-Mini-3B-2507", - tokenizer_mode="mistral", - min_transformers_version="4.54" - ), "WhisperForConditionalGeneration": _HfExamplesInfo("openai/whisper-large-v3"), # noqa: E501 - # [Cross-encoder] "JinaVLForRanking": _HfExamplesInfo("jinaai/jina-reranker-m0"), # noqa: E501 } diff --git a/vllm/model_executor/models/tarsier.py b/vllm/model_executor/models/tarsier.py index 25f026e9bef..979d789b330 100644 --- a/vllm/model_executor/models/tarsier.py +++ b/vllm/model_executor/models/tarsier.py @@ -13,8 +13,7 @@ from transformers import PretrainedConfig, SiglipVisionConfig from transformers.image_utils import ImageInput, get_image_size, to_numpy_array from transformers.models.llava import LlavaProcessor -from transformers.processing_utils import (ProcessingKwargs, Unpack, - _validate_images_text_input_order) +from transformers.processing_utils import ProcessingKwargs, Unpack from transformers.tokenization_utils_base import PreTokenizedInput, TextInput from vllm.config import VllmConfig @@ -94,9 +93,6 @@ def __call__( raise ValueError( "You have to specify at least one of `images` or `text`.") - # check if images and text inputs are reversed for BC - images, text = _validate_images_text_input_order(images, text) - output_kwargs = self._merge_kwargs( TarsierProcessorKwargs, tokenizer_init_kwargs=self.tokenizer.init_kwargs, diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py index 2e66dc16b47..8d1f59e6ead 100644 --- a/vllm/transformers_utils/config.py +++ b/vllm/transformers_utils/config.py @@ -37,6 +37,7 @@ MiniMaxText01Config, MiniMaxVL01Config, MllamaConfig, MLPSpeculatorConfig, MPTConfig, + Nemotron_Nano_VL_Config, NemotronConfig, NVLM_D_Config, OvisConfig, RWConfig, SkyworkR1VChatConfig, SolarConfig, @@ -80,6 +81,7 @@ def _get_hf_token() -> Optional[str]: "dbrx": DbrxConfig, "deepseek_vl_v2": DeepseekVLV2Config, "kimi_vl": KimiVLConfig, + "Llama_Nemotron_Nano_VL": Nemotron_Nano_VL_Config, "mpt": MPTConfig, "RefinedWeb": RWConfig, # For tiiuae/falcon-40b(-instruct) "RefinedWebModel": RWConfig, # For tiiuae/falcon-7b(-instruct) 
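The registry entry added above is consumed at config-load time; a minimal sketch of that lookup pattern (the function and variable names here are assumptions, not vLLM's exact internals):

```python
# Hedged sketch of a model_type -> config-class registry; names are assumptions.
from transformers import AutoConfig, PretrainedConfig

_CONFIG_REGISTRY: dict[str, type[PretrainedConfig]] = {
    # "Llama_Nemotron_Nano_VL": Nemotron_Nano_VL_Config,  # entry added above
}


def load_hf_config(model: str, model_type: str, **kwargs) -> PretrainedConfig:
    config_cls = _CONFIG_REGISTRY.get(model_type)
    if config_cls is not None:
        return config_cls.from_pretrained(model, **kwargs)
    return AutoConfig.from_pretrained(model, **kwargs)
```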
diff --git a/vllm/transformers_utils/configs/__init__.py b/vllm/transformers_utils/configs/__init__.py index 5d84d648f1c..89303213a27 100644 --- a/vllm/transformers_utils/configs/__init__.py +++ b/vllm/transformers_utils/configs/__init__.py @@ -23,6 +23,7 @@ from vllm.transformers_utils.configs.mpt import MPTConfig from vllm.transformers_utils.configs.nemotron import NemotronConfig from vllm.transformers_utils.configs.nemotron_h import NemotronHConfig +from vllm.transformers_utils.configs.nemotron_vl import Nemotron_Nano_VL_Config from vllm.transformers_utils.configs.nvlm_d import NVLM_D_Config from vllm.transformers_utils.configs.ovis import OvisConfig from vllm.transformers_utils.configs.skyworkr1v import SkyworkR1VChatConfig @@ -50,6 +51,7 @@ "KimiVLConfig", "NemotronConfig", "NemotronHConfig", + "Nemotron_Nano_VL_Config", "NVLM_D_Config", "OvisConfig", "SkyworkR1VChatConfig", diff --git a/vllm/transformers_utils/configs/nemotron_vl.py b/vllm/transformers_utils/configs/nemotron_vl.py new file mode 100644 index 00000000000..6a642f26b82 --- /dev/null +++ b/vllm/transformers_utils/configs/nemotron_vl.py @@ -0,0 +1,56 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +# yapf: disable +# ruff: noqa: E501 +# Adapted from +# https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1/blob/main/configuration.py +# -------------------------------------------------------- +# Adapted from https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B under MIT License +# LICENSE is in incl_licenses directory. +# -------------------------------------------------------- + +from transformers import LlamaConfig +from transformers.configuration_utils import PretrainedConfig +from transformers.dynamic_module_utils import get_class_from_dynamic_module + + +class Nemotron_Nano_VL_Config(PretrainedConfig): + model_type = 'Llama_Nemotron_Nano_VL' + is_composition = True + + def __init__( + self, + vision_config=None, + llm_config=None, + force_image_size=None, + downsample_ratio=0.5, + template=None, + ps_version='v1', + image_tag_type="internvl", + projector_hidden_size=4096, + vit_hidden_size=1280, + **kwargs + ): + super().__init__(**kwargs) + + if vision_config is not None: + assert "auto_map" in vision_config and "AutoConfig" in vision_config["auto_map"] + vision_auto_config = get_class_from_dynamic_module(*vision_config["auto_map"]["AutoConfig"].split("--")[::-1]) + self.vision_config = vision_auto_config(**vision_config) + else: + self.vision_config = PretrainedConfig() + + if llm_config is None: + self.text_config = LlamaConfig() + else: + self.text_config = LlamaConfig(**llm_config) + + # Assign configuration values + self.force_image_size = force_image_size + self.downsample_ratio = downsample_ratio + self.template = template # TODO move out of here and into the tokenizer + self.ps_version = ps_version # Pixel shuffle version + self.image_tag_type = image_tag_type # TODO: into the tokenizer too? 
+ self.projector_hidden_size = projector_hidden_size + self.vit_hidden_size = vit_hidden_size From eec8b7f69198195cfa66deb47bd9f80b5468e8a9 Mon Sep 17 00:00:00 2001 From: Jialin Ouyang Date: Wed, 23 Jul 2025 00:02:02 -0700 Subject: [PATCH 276/552] [Core] Add basic unit test for maybe_evict_cached_block (#21400) Signed-off-by: Jialin Ouyang Signed-off-by: x22x22 --- tests/v1/core/test_prefix_caching.py | 67 ++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+) diff --git a/tests/v1/core/test_prefix_caching.py b/tests/v1/core/test_prefix_caching.py index b7f583de1f6..085616303d8 100644 --- a/tests/v1/core/test_prefix_caching.py +++ b/tests/v1/core/test_prefix_caching.py @@ -1097,6 +1097,73 @@ def test_prefix_cache_stats_disabled(): assert manager.prefix_cache_stats is None +def test_maybe_evict_cached_block(): + pool = BlockPool(num_gpu_blocks=4, enable_caching=True) + block_hash0 = BlockHashWithGroupId(block_hash=BlockHash(hash_value=10, + token_ids=(100, )), + group_id=1000) + block_hash1 = BlockHashWithGroupId(block_hash=BlockHash(hash_value=20, + token_ids=(200, )), + group_id=2000) + block_hash2 = BlockHashWithGroupId(block_hash=BlockHash(hash_value=30, + token_ids=(300, )), + group_id=3000) + block_hashes = [ + block_hash0, + block_hash1, + block_hash2, + # block3 had the exact same block_hash as the first block + block_hash0, + ] + assert len(pool.blocks) == len(block_hashes) + # Manually add all blocks to cached_blocks + for block, block_hash in zip(pool.blocks, block_hashes): + block.block_hash = block_hash + pool.cached_block_hash_to_block[block_hash][block.block_id] = block + + block0, block1, block2, block3 = pool.blocks + assert pool.cached_block_hash_to_block == { + block_hash0: { + block0.block_id: block0, + block3.block_id: block3 + }, + block_hash1: { + block1.block_id: block1 + }, + block_hash2: { + block2.block_id: block2 + } + } + # Evict block1 + pool._maybe_evict_cached_block(block1) + assert pool.cached_block_hash_to_block == { + block_hash0: { + block0.block_id: block0, + block3.block_id: block3 + }, + block_hash2: { + block2.block_id: block2 + } + } + # Evict block0: block_hash0 entry should NOT be removed, as block3 + # also use the same hash + pool._maybe_evict_cached_block(block0) + assert pool.cached_block_hash_to_block == { + block_hash0: { + block3.block_id: block3 + }, + block_hash2: { + block2.block_id: block2 + } + } + # Evict block2 + pool._maybe_evict_cached_block(block2) + assert pool.cached_block_hash_to_block == {block_hash0: {3: block3}} + # Evict block3 + pool._maybe_evict_cached_block(block3) + assert pool.cached_block_hash_to_block == {} + + @pytest.mark.parametrize("blocks_to_cache", [2, 3, 10]) def test_kv_cache_events(blocks_to_cache: int): block_size = 16 From f9430a7dcf701d07a58a6f3cd88fd771d99b2653 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 23 Jul 2025 03:02:48 -0400 Subject: [PATCH 277/552] [Cleanup] Only log MoE DP setup warning if DP is enabled (#21315) Signed-off-by: mgoin Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/config.py | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/config.py b/vllm/model_executor/layers/fused_moe/config.py index 51c421bd228..f5ed2861b8f 100644 --- a/vllm/model_executor/layers/fused_moe/config.py +++ b/vllm/model_executor/layers/fused_moe/config.py @@ -464,10 +464,11 @@ def make( ) else: _quant_config = FusedMoEQuantConfig() - logger.warning_once("MoE DP setup unable to determine " - "quantization scheme or 
unsupported " - "quantization type. This model will " - "not run with DP enabled.") + if moe_parallel_config.dp_size > 1: + logger.warning_once("MoE DP setup unable to determine " + "quantization scheme or unsupported " + "quantization type. This model will " + "not run with DP enabled.") else: _quant_config = quant_config From 1db31b3fc0bc5d73564061e01b01ae32138b434d Mon Sep 17 00:00:00 2001 From: youkaichao Date: Wed, 23 Jul 2025 15:03:16 +0800 Subject: [PATCH 278/552] add clear messages for deprecated models (#21424) Signed-off-by: youkaichao Signed-off-by: x22x22 --- vllm/model_executor/model_loader/utils.py | 11 ++++++++++- vllm/model_executor/models/registry.py | 2 ++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/vllm/model_executor/model_loader/utils.py b/vllm/model_executor/model_loader/utils.py index 42c5512905f..4b30336f013 100644 --- a/vllm/model_executor/model_loader/utils.py +++ b/vllm/model_executor/model_loader/utils.py @@ -25,7 +25,8 @@ as_reward_model, as_seq_cls_model) from vllm.model_executor.models.interfaces import SupportsQuant -from vllm.model_executor.models.registry import _TRANSFORMERS_MODELS +from vllm.model_executor.models.registry import (_PREVIOUSLY_SUPPORTED_MODELS, + _TRANSFORMERS_MODELS) from vllm.utils import is_pin_memory_available logger = init_logger(__name__) @@ -261,6 +262,14 @@ def get_model_architecture( vllm_not_supported = False break + if any(arch in _PREVIOUSLY_SUPPORTED_MODELS for arch in architectures): + previous_version = _PREVIOUSLY_SUPPORTED_MODELS[architectures[0]] + raise ValueError( + f"Model architecture {architectures[0]} was supported" + f" in vLLM until version {previous_version}, and is " + "not supported anymore. Please use an older version" + " of vLLM if you want to use this model architecture.") + if (model_config.model_impl == ModelImpl.TRANSFORMERS or model_config.model_impl == ModelImpl.AUTO and vllm_not_supported): architectures = resolve_transformers_arch(model_config, architectures) diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 9d88b5fe82c..100532943c2 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -276,6 +276,8 @@ sys.executable, "-m", "vllm.model_executor.models.registry" ] +_PREVIOUSLY_SUPPORTED_MODELS = {"Phi3SmallForCausalLM": "0.9.2"} + @dataclass(frozen=True) class _ModelInfo: From b4a871908e51e110d8bc3e1004cb5a5e1565b1b7 Mon Sep 17 00:00:00 2001 From: Guillaume Calmettes Date: Wed, 23 Jul 2025 09:30:05 +0200 Subject: [PATCH 279/552] [Bugfix] ensure tool_choice is popped when `tool_choice:null` is passed in json payload (#19679) Signed-off-by: Guillaume Calmettes Signed-off-by: x22x22 --- vllm/entrypoints/openai/protocol.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py index 95e5bcd3bae..6c6ec207a3c 100644 --- a/vllm/entrypoints/openai/protocol.py +++ b/vllm/entrypoints/openai/protocol.py @@ -841,7 +841,7 @@ def check_tool_usage(cls, data): return data # if "tool_choice" is specified -- validation - if "tool_choice" in data: + if "tool_choice" in data and data["tool_choice"] is not None: # ensure that if "tool choice" is specified, tools are present if "tools" not in data or data["tools"] is None: @@ -853,7 +853,7 @@ def check_tool_usage(cls, data): if data["tool_choice"] not in [ "auto", "required" ] and not isinstance(data["tool_choice"], dict): - raise NotImplementedError( + raise ValueError( 
f'Invalid value for `tool_choice`: {data["tool_choice"]}! '\ 'Only named tools, "none", "auto" or "required" '\ 'are supported.' From aee6b325a869e4009dad9d33f3c70cfe7b76cb4a Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 23 Jul 2025 10:18:54 +0200 Subject: [PATCH 280/552] Fixed typo in profiling logs (#21441) Signed-off-by: x22x22 --- vllm/multimodal/profiling.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/multimodal/profiling.py b/vllm/multimodal/profiling.py index cdec783ef9c..7f6fb47a21f 100644 --- a/vllm/multimodal/profiling.py +++ b/vllm/multimodal/profiling.py @@ -275,7 +275,7 @@ def get_mm_max_tokens( if total_mm_tokens > seq_len: logger.warning_once( "The sequence length (%d) is smaller than the pre-defined" - " wosrt-case total number of multimodal tokens (%d). " + " worst-case total number of multimodal tokens (%d). " "This may cause certain multi-modal inputs to fail during " "inference. To avoid this, you should increase " "`max_model_len` or reduce `mm_counts`.", From a68bfcc5bde9d520749f5297e5853f5862cc831f Mon Sep 17 00:00:00 2001 From: Michael Yao Date: Wed, 23 Jul 2025 16:23:20 +0800 Subject: [PATCH 281/552] [Docs] Fix bullets and grammars in tool_calling.md (#21440) Signed-off-by: windsonsea Signed-off-by: x22x22 --- docs/features/tool_calling.md | 66 +++++++++++++++++++---------------- 1 file changed, 35 insertions(+), 31 deletions(-) diff --git a/docs/features/tool_calling.md b/docs/features/tool_calling.md index 8d89dc4c8d8..ce74683a162 100644 --- a/docs/features/tool_calling.md +++ b/docs/features/tool_calling.md @@ -1,10 +1,10 @@ # Tool Calling -vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`) and `none` options for the `tool_choice` field in the chat completion API. +vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`), and `none` options for the `tool_choice` field in the chat completion API. ## Quickstart -Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the llama3 tool calling chat template from the vLLM examples directory: +Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the `llama3_json` tool calling chat template from the vLLM examples directory: ```bash vllm serve meta-llama/Llama-3.1-8B-Instruct \ @@ -13,7 +13,7 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \ --chat-template examples/tool_chat_template_llama3.1_json.jinja ``` -Next, make a request to the model that should result in it using the available tools: +Next, make a request that triggers the model to use the available tools: ??? code @@ -73,7 +73,7 @@ This example demonstrates: You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests. -Remember that it's the callers responsibility to: +Remember that it's the caller's responsibility to: 1. Define appropriate tools in the request 2. 
Include relevant context in the chat messages @@ -84,7 +84,7 @@ For more advanced usage, including parallel tool calls and different model-speci ## Named Function Calling vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is -enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a +enabled by default and will work with any supported model. You are guaranteed a validly-parsable function call - not a high-quality one. vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter. @@ -95,7 +95,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha ## Required Function Calling -vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](../usage/v1_guide.md#features) for the V1 engine. +vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The guided decoding features for `tool_choice='required'` (such as JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](../usage/v1_guide.md#features) for the V1 engine. When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter. @@ -109,16 +109,16 @@ However, when `tool_choice='none'` is specified, vLLM includes tool definitions To enable this feature, you should set the following flags: -* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. tells vLLM that you want to enable the model to generate its own tool calls when it +* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. It tells vLLM that you want to enable the model to generate its own tool calls when it deems appropriate. * `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers -will continue to be added in the future, and also can register your own tool parsers in the `--tool-parser-plugin`. +will continue to be added in the future. You can also register your own tool parsers in the `--tool-parser-plugin`. * `--tool-parser-plugin` -- **optional** tool parser plugin used to register user defined tool parsers into vllm, the registered tool parser name can be specified in `--tool-call-parser`. -* `--chat-template` -- **optional** for auto tool choice. the path to the chat template which handles `tool`-role messages and `assistant`-role messages +* `--chat-template` -- **optional** for auto tool choice. It's the path to the chat template which handles `tool`-role messages and `assistant`-role messages that contain previously generated tool calls. 
Hermes, Mistral and Llama models have tool-compatible chat templates in their `tokenizer_config.json` files, but you can specify a custom template. This argument can be set to `tool_use` if your model has a tool use-specific chat template configured in the `tokenizer_config.json`. In this case, it will be used per the `transformers` specification. More on this [here](https://huggingface.co/docs/transformers/en/chat_templating#why-do-some-models-have-multiple-templates) -from HuggingFace; and you can find an example of this in a `tokenizer_config.json` [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json) +from HuggingFace; and you can find an example of this in a `tokenizer_config.json` [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json). If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template! @@ -130,7 +130,7 @@ All Nous Research Hermes-series models newer than Hermes 2 Pro should be support * `NousResearch/Hermes-2-Theta-*` * `NousResearch/Hermes-3-*` -_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge +_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality and capabilities due to the merge step in their creation_. Flags: `--tool-call-parser hermes` @@ -146,13 +146,13 @@ Known issues: 1. Mistral 7B struggles to generate parallel tool calls correctly. 2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is -much shorter than what vLLM generates. Since an exception is thrown when this condition -is not met, the following additional chat templates are provided: + much shorter than what vLLM generates. Since an exception is thrown when this condition + is not met, the following additional chat templates are provided: -* - this is the "official" Mistral chat template, but tweaked so that -it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits) -* - this is a "better" version that adds a tool-use system prompt -when tools are provided, that results in much better reliability when working with parallel tool calling. + * - this is the "official" Mistral chat template, but tweaked so that + it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits) + * - this is a "better" version that adds a tool-use system prompt + when tools are provided, that results in much better reliability when working with parallel tool calling. Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja` @@ -166,17 +166,17 @@ All Llama 3.1, 3.2 and 4 models should be supported. * `meta-llama/Llama-3.2-*` * `meta-llama/Llama-4-*` -The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. As for llama 4 models, it is recommended to use the `llama4_pythonic` tool parser. +The tool calling that is supported is the [JSON-based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). 
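For instance, the auto tool choice flags described above combine into a single launch command for the Hermes parser (a hedged example; the model ID is the Hermes 2 Pro checkpoint linked earlier):

```bash
vllm serve NousResearch/Hermes-2-Pro-Llama-3-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```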
For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. As for Llama 4 models, it is recommended to use the `llama4_pythonic` tool parser. Other tool calling formats like the built in python tool calling or custom tool calling are not supported. Known issues: -1. Parallel tool calls are not supported for llama 3, but it is supported in llama 4 models. -2. The model can generate parameters with a wrong format, such as generating +1. Parallel tool calls are not supported for Llama 3, but it is supported in Llama 4 models. +2. The model can generate parameters in an incorrect format, such as generating an array serialized as string instead of an array. -VLLM provides two JSON based chat templates for Llama 3.1 and 3.2: +VLLM provides two JSON-based chat templates for Llama 3.1 and 3.2: * - this is the "official" chat template for the Llama 3.1 models, but tweaked so that it works better with vLLM. @@ -185,7 +185,8 @@ images. Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}` -VLLM also provides a pythonic and JSON based chat template for Llama 4, but pythonic tool calling is recommended: +VLLM also provides a pythonic and JSON-based chat template for Llama 4, but pythonic tool calling is recommended: + * - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models. For Llama 4 model, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`. @@ -196,21 +197,21 @@ Supported models: * `ibm-granite/granite-3.0-8b-instruct` -Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja` + Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja` -: this is a modified chat template from the original on Huggingface. Parallel function calls are supported. + : this is a modified chat template from the original on Hugging Face. Parallel function calls are supported. * `ibm-granite/granite-3.1-8b-instruct` -Recommended flags: `--tool-call-parser granite` + Recommended flags: `--tool-call-parser granite` -The chat template from Huggingface can be used directly. Parallel function calls are supported. + The chat template from Huggingface can be used directly. Parallel function calls are supported. * `ibm-granite/granite-20b-functioncalling` -Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja` + Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja` -: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported. + : this is a modified chat template from the original on Hugging Face, which is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported. 
### InternLM Models (`internlm`) @@ -246,10 +247,12 @@ The xLAM tool parser is designed to support models that generate tool calls in v Parallel function calls are supported, and the parser can effectively separate text content from tool calls. Supported models: + * Salesforce Llama-xLAM models: `Salesforce/Llama-xLAM-2-8B-fc-r`, `Salesforce/Llama-xLAM-2-70B-fc-r` * Qwen-xLAM models: `Salesforce/xLAM-1B-fc-r`, `Salesforce/xLAM-3B-fc-r`, `Salesforce/Qwen-xLAM-32B-fc-r` Flags: + * For Llama-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_llama.jinja` * For Qwen-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_qwen.jinja` @@ -292,9 +295,10 @@ Flags: `--tool-call-parser kimi_k2` Supported models: -* `tencent/Hunyuan-A13B-Instruct` (chat template already included huggingface model file.) +* `tencent/Hunyuan-A13B-Instruct` (The chat template is already included in the Hugging Face model files.) Flags: + * For non-reasoning: `--tool-call-parser hunyuan_a13b` * For reasoning: `--tool-call-parser hunyuan_a13b --reasoning-parser hunyuan_a13b --enable_reasoning` @@ -325,9 +329,9 @@ Example supported models: Flags: `--tool-call-parser pythonic --chat-template {see_above}` !!! warning - Llama's smaller models frequently fail to emit tool calls in the correct format. Your mileage may vary. + Llama's smaller models frequently fail to emit tool calls in the correct format. Results may vary depending on the model. -## How to write a tool parser plugin +## How to Write a Tool Parser Plugin A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in . From be03462b01538f39f3fac53fd2edaee99c8394dd Mon Sep 17 00:00:00 2001 From: Lu Fang <30275821+houseroad@users.noreply.github.com> Date: Wed, 23 Jul 2025 01:39:25 -0700 Subject: [PATCH 282/552] [Sampler] Introduce logprobs mode for logging (#21398) Signed-off-by: Lu Fang Signed-off-by: x22x22 --- tests/v1/sample/test_logprobs.py | 43 ++++++++++++++++++++++++++++++ vllm/config.py | 9 +++++++ vllm/engine/arg_utils.py | 18 ++++++++----- vllm/v1/sample/sampler.py | 17 ++++++++++-- vllm/v1/sample/tpu/sampler.py | 1 + vllm/v1/worker/gpu_input_batch.py | 4 +-- vllm/v1/worker/gpu_model_runner.py | 4 +-- 7 files changed, 83 insertions(+), 13 deletions(-) diff --git a/tests/v1/sample/test_logprobs.py b/tests/v1/sample/test_logprobs.py index 4f1f340a4cc..680e2ce98bb 100644 --- a/tests/v1/sample/test_logprobs.py +++ b/tests/v1/sample/test_logprobs.py @@ -12,6 +12,7 @@ assert_incr_detok_str_matches_non_incr_detok_str, compute_correct_cumulative_logprob, get_test_batch) from vllm import SamplingParams +from vllm.config import LogprobsMode from ...conftest import HfRunner, VllmRunner @@ -426,3 +427,45 @@ def test_zero_logprobs(vllm_model, example_prompts, # prompt token assert prompt_logprobs is not None assert len(prompt_token_ids) == len(prompt_logprobs) + + +@pytest.mark.parametrize( + "logprobs_mode", + ["raw_logprobs", "raw_logits", "processed_logprobs", "processed_logits"]) +def test_logprobs_mode(logprobs_mode: LogprobsMode, + monkeypatch: pytest.MonkeyPatch): + """Test with LLM engine with different logprobs_mode. + For logprobs, we should have non-positive values. + For logits, we should expect at least one positive values. 
+ """ + from vllm import LLM + with monkeypatch.context() as m: + m.setenv("VLLM_USE_V1", "1") + + llm = LLM( + "facebook/opt-125m", + max_logprobs=5, + enable_prefix_caching=False, + # 2 other llms alive during whole session + gpu_memory_utilization=0.05, + max_model_len=16, + logprobs_mode=logprobs_mode) + vllm_sampling_params = SamplingParams(logprobs=1) + results = llm.generate(["Hello world"], + sampling_params=vllm_sampling_params) + + total_token_with_logprobs = 0 + positive_values = 0 + for output in results[0].outputs: + for logprobs in output.logprobs: + for token_id in logprobs: + logprob = logprobs[token_id] + if "logprobs" in logprobs_mode: + assert logprob.logprob <= 0 + if logprob.logprob > 0: + positive_values = positive_values + 1 + total_token_with_logprobs = total_token_with_logprobs + 1 + assert total_token_with_logprobs >= len(results[0].outputs) + if "logits" in logprobs_mode: + assert positive_values > 0 + del llm diff --git a/vllm/config.py b/vllm/config.py index a5f67451a77..ccc9708a3ab 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -219,6 +219,8 @@ def is_init_field(cls: ConfigType, name: str) -> bool: TokenizerMode = Literal["auto", "slow", "mistral", "custom"] ModelDType = Literal["auto", "half", "float16", "bfloat16", "float", "float32"] +LogprobsMode = Literal["raw_logprobs", "raw_logits", "processed_logprobs", + "processed_logits"] @config @@ -316,6 +318,13 @@ class ModelConfig: """Maximum number of log probabilities to return when `logprobs` is specified in `SamplingParams`. The default value comes the default for the OpenAI Chat Completions API.""" + logprobs_mode: LogprobsMode = "raw_logprobs" + """Indicates the content returned in the logprobs and prompt_logprobs. + Supported mode: + 1) raw_logprobs, 2) processed_logprobs, 3) raw_logits, 4) processed_logits. + Raw means the values before applying logit processors, like bad words. + Processed means the values after applying such processors. + """ disable_sliding_window: bool = False """Whether to disable sliding window. If True, we will disable the sliding window functionality of the model, capping to sliding window size. 
If the diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 1e3d46a8d96..4a5efd40241 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -26,13 +26,13 @@ DetailedTraceModules, Device, DeviceConfig, DistributedExecutorBackend, GuidedDecodingBackend, GuidedDecodingBackendV1, HfOverrides, KVEventsConfig, - KVTransferConfig, LoadConfig, LoadFormat, LoRAConfig, - ModelConfig, ModelDType, ModelImpl, MultiModalConfig, - ObservabilityConfig, ParallelConfig, PoolerConfig, - PrefixCachingHashAlgo, PromptAdapterConfig, - SchedulerConfig, SchedulerPolicy, SpeculativeConfig, - TaskOption, TokenizerMode, VllmConfig, get_attr_docs, - get_field) + KVTransferConfig, LoadConfig, LoadFormat, + LogprobsMode, LoRAConfig, ModelConfig, ModelDType, + ModelImpl, MultiModalConfig, ObservabilityConfig, + ParallelConfig, PoolerConfig, PrefixCachingHashAlgo, + PromptAdapterConfig, SchedulerConfig, SchedulerPolicy, + SpeculativeConfig, TaskOption, TokenizerMode, + VllmConfig, get_attr_docs, get_field) from vllm.logger import init_logger from vllm.platforms import CpuArchEnum, current_platform from vllm.plugins import load_general_plugins @@ -324,6 +324,7 @@ class EngineArgs: SchedulerConfig.long_prefill_token_threshold max_num_seqs: Optional[int] = SchedulerConfig.max_num_seqs max_logprobs: int = ModelConfig.max_logprobs + logprobs_mode: LogprobsMode = ModelConfig.logprobs_mode disable_log_stats: bool = False revision: Optional[str] = ModelConfig.revision code_revision: Optional[str] = ModelConfig.code_revision @@ -490,6 +491,8 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: **model_kwargs["max_seq_len_to_capture"]) model_group.add_argument("--max-logprobs", **model_kwargs["max_logprobs"]) + model_group.add_argument("--logprobs-mode", + **model_kwargs["logprobs_mode"]) model_group.add_argument("--disable-sliding-window", **model_kwargs["disable_sliding_window"]) model_group.add_argument("--disable-cascade-attn", @@ -892,6 +895,7 @@ def create_model_config(self) -> ModelConfig: enforce_eager=self.enforce_eager, max_seq_len_to_capture=self.max_seq_len_to_capture, max_logprobs=self.max_logprobs, + logprobs_mode=self.logprobs_mode, disable_sliding_window=self.disable_sliding_window, disable_cascade_attn=self.disable_cascade_attn, skip_tokenizer_init=self.skip_tokenizer_init, diff --git a/vllm/v1/sample/sampler.py b/vllm/v1/sample/sampler.py index fa078e62876..82f51298f1b 100644 --- a/vllm/v1/sample/sampler.py +++ b/vllm/v1/sample/sampler.py @@ -5,6 +5,7 @@ import torch import torch.nn as nn +from vllm.config import LogprobsMode from vllm.utils import is_pin_memory_available from vllm.v1.outputs import LogprobsTensors, SamplerOutput from vllm.v1.sample.metadata import SamplingMetadata @@ -18,10 +19,11 @@ class Sampler(nn.Module): - def __init__(self): + def __init__(self, logprobs_mode: LogprobsMode = "raw_logprobs"): super().__init__() self.topk_topp_sampler = TopKTopPSampler() self.pin_memory = is_pin_memory_available() + self.logprobs_mode = logprobs_mode def forward( self, @@ -36,7 +38,10 @@ def forward( # See https://vllm-dev.slack.com/archives/C07UUL8E61Z/p1735907856007919 # noqa: E501 num_logprobs = sampling_metadata.max_num_logprobs if num_logprobs is not None: - raw_logprobs = self.compute_logprobs(logits) + if self.logprobs_mode == "raw_logprobs": + raw_logprobs = self.compute_logprobs(logits) + elif self.logprobs_mode == "raw_logits": + raw_logprobs = logits.clone() # Use float32 for the logits. 
logits = logits.to(torch.float32) @@ -51,6 +56,14 @@ def forward( # Apply penalties (e.g., min_tokens, freq_penalties). logits = self.apply_penalties(logits, sampling_metadata) + + # Get the process logprobs or logits. + if num_logprobs is not None: + if self.logprobs_mode == "processed_logprobs": + raw_logprobs = self.compute_logprobs(logits) + elif self.logprobs_mode == "processed_logits": + raw_logprobs = logits.clone() + # Sample the next token. sampled = self.sample(logits, sampling_metadata) # Convert sampled token ids to int64 (long) type to ensure compatibility diff --git a/vllm/v1/sample/tpu/sampler.py b/vllm/v1/sample/tpu/sampler.py index 1056eb1d7b7..2c9f4892bc2 100644 --- a/vllm/v1/sample/tpu/sampler.py +++ b/vllm/v1/sample/tpu/sampler.py @@ -15,6 +15,7 @@ class Sampler(nn.Module): def __init__(self): + # TODO(houseroad): Add support for logprobs_mode. super().__init__() self.topk_topp_sampler = TopKTopPSampler() diff --git a/vllm/v1/worker/gpu_input_batch.py b/vllm/v1/worker/gpu_input_batch.py index a242c7fca5e..c63041600f3 100644 --- a/vllm/v1/worker/gpu_input_batch.py +++ b/vllm/v1/worker/gpu_input_batch.py @@ -389,7 +389,7 @@ def add_request( def remove_request(self, req_id: str) -> Optional[int]: """This method must always be followed by a call to condense(). - + Args: req_id: request to remove @@ -590,7 +590,7 @@ def condense(self) -> None: def refresh_metadata(self): """Apply batch updates, reset input batch at end of step - + * Apply batch add/remove/permute to logits procs' states * If batch state is modified, update sampling metadata """ diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 4c14ac3be3c..6a42e01f14b 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -151,7 +151,7 @@ def __init__( self.encoder_cache_size = encoder_cache_size # Sampler - self.sampler = Sampler() + self.sampler = Sampler(logprobs_mode=self.model_config.logprobs_mode) self.eplb_state: Optional[EplbState] = None """ @@ -1996,7 +1996,7 @@ def maybe_randomize_inputs(self, input_ids: torch.Tensor): Randomize input_ids if VLLM_RANDOMIZE_DP_DUMMY_INPUTS is set. This is to help balance expert-selection - during profile_run - - during DP rank dummy run + - during DP rank dummy run """ dp_size = self.vllm_config.parallel_config.data_parallel_size randomize_inputs = envs.VLLM_RANDOMIZE_DP_DUMMY_INPUTS and dp_size > 1 From 370ff28aa34973d152b430c0955f339ac2230431 Mon Sep 17 00:00:00 2001 From: Yu Chin Fabian Lim Date: Wed, 23 Jul 2025 04:40:27 -0400 Subject: [PATCH 283/552] Mamba V2 Test not Asserting Failures. 
(#21379) Signed-off-by: Yu Chin Fabian Lim Signed-off-by: x22x22 --- tests/kernels/mamba/test_mamba_mixer2.py | 9 ++++---- tests/kernels/mamba/test_mamba_ssm_ssd.py | 26 +++++++++++++++++------ 2 files changed, 25 insertions(+), 10 deletions(-) diff --git a/tests/kernels/mamba/test_mamba_mixer2.py b/tests/kernels/mamba/test_mamba_mixer2.py index f5c6a18614f..16c310726ad 100644 --- a/tests/kernels/mamba/test_mamba_mixer2.py +++ b/tests/kernels/mamba/test_mamba_mixer2.py @@ -119,7 +119,8 @@ def mixer2_gated_norm_tensor_parallel( gate_states[..., local_rank * N:(local_rank + 1) * N], ) ref_output = mixer_single_gpu(hidden_states, gate_states) - torch.allclose(output, - ref_output[..., local_rank * N:(local_rank + 1) * N], - atol=1e-3, - rtol=1e-3) + torch.testing.assert_close(output, + ref_output[..., + local_rank * N:(local_rank + 1) * N], + atol=5e-3, + rtol=1e-3) diff --git a/tests/kernels/mamba/test_mamba_ssm_ssd.py b/tests/kernels/mamba/test_mamba_ssm_ssd.py index 6a3f21ba543..00c1a2911d7 100644 --- a/tests/kernels/mamba/test_mamba_ssm_ssd.py +++ b/tests/kernels/mamba/test_mamba_ssm_ssd.py @@ -193,6 +193,13 @@ def test_mamba_chunk_scan_single_example(d_head, n_heads, seq_len_chunk_size, # this tests the kernels on a single example (no batching) + # TODO: the bfloat16 case requires higher thresholds. To be investigated + + if itype == torch.bfloat16: + atol, rtol = 5e-2, 5e-2 + else: + atol, rtol = 8e-3, 5e-3 + # set seed batch_size = 1 # batch_size # ssd_minimal_discrete requires chunk_size divide seqlen @@ -216,14 +223,14 @@ def test_mamba_chunk_scan_single_example(d_head, n_heads, seq_len_chunk_size, return_final_states=True) # just test the last in sequence - torch.allclose(Y[:, -1], Y_min[:, -1], atol=1e-3, rtol=1e-3) + torch.testing.assert_close(Y[:, -1], Y_min[:, -1], atol=atol, rtol=rtol) # just test the last head # NOTE, in the kernel we always cast states to fp32 - torch.allclose(final_state[:, -1], - final_state_min[:, -1].to(torch.float32), - atol=1e-3, - rtol=1e-3) + torch.testing.assert_close(final_state[:, -1], + final_state_min[:, -1].to(torch.float32), + atol=atol, + rtol=rtol) @pytest.mark.parametrize("itype", [torch.float32, torch.float16]) @@ -263,6 +270,13 @@ def test_mamba_chunk_scan_cont_batch(d_head, n_heads, seq_len_chunk_size_cases, seqlen, chunk_size, num_examples, cases = seq_len_chunk_size_cases + # TODO: the irregular chunk size cases have some issues and require higher + # tolerance. 
This is to be invesigated + if chunk_size not in {8, 256}: + atol, rtol = 5e-1, 5e-1 + else: + atol, rtol = 5e-3, 5e-3 + # hold state during the cutting process so we know if an # example has been exhausted and needs to cycle last_taken: dict = {} # map: eg -> pointer to last taken sample @@ -300,7 +314,7 @@ def test_mamba_chunk_scan_cont_batch(d_head, n_heads, seq_len_chunk_size_cases, # just test one dim and dstate Y_eg = Y[0, cu_seqlens[i]:cu_seqlens[i + 1], 0, 0] Y_min_eg = Y_min[i][:, 0, 0] - torch.allclose(Y_eg, Y_min_eg, atol=1e-3, rtol=1e-3) + torch.testing.assert_close(Y_eg, Y_min_eg, atol=atol, rtol=rtol) # update states states = new_states From c7cdaa3acbcaccd5c007084555f627d3e874c2a5 Mon Sep 17 00:00:00 2001 From: Yang Chen Date: Wed, 23 Jul 2025 01:41:43 -0700 Subject: [PATCH 284/552] [Misc] fixed nvfp4_moe test failures due to invalid kwargs (#21246) Signed-off-by: Yang Chen Signed-off-by: x22x22 --- tests/kernels/moe/test_nvfp4_moe.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tests/kernels/moe/test_nvfp4_moe.py b/tests/kernels/moe/test_nvfp4_moe.py index 3f5412e7582..3ff38536029 100644 --- a/tests/kernels/moe/test_nvfp4_moe.py +++ b/tests/kernels/moe/test_nvfp4_moe.py @@ -93,11 +93,11 @@ def test_cutlass_fp4_moe_no_graph(m: int, n: int, k: int, e: int, topk: int, a1_gscale=a1_gs, w1_fp4=w1_q, w1_blockscale=w1_blockscale, - w1_alphas=(1 / w1_gs), + g1_alphas=(1 / w1_gs), a2_gscale=a2_gs, w2_fp4=w2_q, w2_blockscale=w2_blockscale, - w2_alphas=(1 / w2_gs), + g2_alphas=(1 / w2_gs), topk_weights=topk_weights, topk_ids=topk_ids, m=m, From f2fd29fdea1c0eec1d41c9d575496acfaf906c3c Mon Sep 17 00:00:00 2001 From: Michael Yao Date: Wed, 23 Jul 2025 18:37:25 +0800 Subject: [PATCH 285/552] [Docs] Clean up v1/metrics.md (#21449) Signed-off-by: windsonsea Signed-off-by: x22x22 --- docs/design/v1/metrics.md | 165 +++++++++++++++++--------------------- 1 file changed, 73 insertions(+), 92 deletions(-) diff --git a/docs/design/v1/metrics.md b/docs/design/v1/metrics.md index e23308f2637..52cd320dd4e 100644 --- a/docs/design/v1/metrics.md +++ b/docs/design/v1/metrics.md @@ -5,17 +5,17 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0. ## Objectives - Achieve parity of metrics between v0 and v1. -- The priority use case is accessing these metrics via Prometheus as this is what we expect to be used in production environments. -- Logging support - i.e. printing metrics to the info log - is provided for more ad-hoc testing, debugging, development, and exploratory use cases. +- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments. +- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases. ## Background Metrics in vLLM can be categorized as follows: -1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus. -2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking. +1. Server-level metrics: Global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus. +2. 
Request-level metrics: Metrics that track the characteristics (e.g. size and timing) of individual requests. These are typically exposed as Histograms in Prometheus and are often the SLOs that an SRE monitoring vLLM will be tracking. -The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are. +The mental model is that server-level metrics help explain the values of request-level metrics. ### v0 Metrics @@ -65,20 +65,20 @@ vLLM also provides [a reference example](../../examples/online_serving/prometheu The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important: -- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds -- `vllm:prompt_tokens_total` - Prompt Tokens -- `vllm:generation_tokens_total` - Generation Tokens -- `vllm:time_per_output_token_seconds` - Inter token latency (Time Per Output Token, TPOT) in second. +- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds. +- `vllm:prompt_tokens_total` - Prompt tokens. +- `vllm:generation_tokens_total` - Generation tokens. +- `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds. - `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds. -- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in RUNNING, WAITING, and SWAPPED state +- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states. - `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM. -- `vllm:request_prompt_tokens` - Request prompt length -- `vllm:request_generation_tokens` - request generation length -- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached -- `vllm:request_queue_time_seconds` - Queue Time -- `vllm:request_prefill_time_seconds` - Requests Prefill Time -- `vllm:request_decode_time_seconds` - Requests Decode Time -- `vllm:request_max_num_generation_tokens` - Max Generation Token in Sequence Group +- `vllm:request_prompt_tokens` - Request prompt length. +- `vllm:request_generation_tokens` - Request generation length. +- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached. +- `vllm:request_queue_time_seconds` - Queue time. +- `vllm:request_prefill_time_seconds` - Requests prefill time. +- `vllm:request_decode_time_seconds` - Requests decode time. +- `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group. See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here. @@ -103,7 +103,7 @@ In v0, metrics are collected in the engine core process and we use multi-process ### Built in Python/Process Metrics -The following metrics are supported by default by `prometheus_client`, but the are not exposed with multiprocess mode is used: +The following metrics are supported by default by `prometheus_client`, but they are not exposed when multi-process mode is used: - `python_gc_objects_collected_total` - `python_gc_objects_uncollectable_total` @@ -158,6 +158,7 @@ In v1, we wish to move computation and overhead out of the engine core process to minimize the time between each forward pass. 
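For reference, the multi-process collection mentioned above can be reproduced with `prometheus_client`'s public API alone; a minimal sketch (the directory path and function name are placeholders, not vLLM's actual wiring):

```python
# Minimal sketch of prometheus_client multi-process aggregation; the directory
# and function name are placeholders, not vLLM's actual wiring.
import os

# Must be set before prometheus_client is imported so that the multi-process
# value classes are used by child processes.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/vllm-prometheus")

from prometheus_client import CollectorRegistry, generate_latest, multiprocess


def render_metrics() -> bytes:
    # Merge the per-process metric shards at scrape time.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)
```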
The overall idea of V1 EngineCore design is: + - EngineCore is the inner loop. Performance is most critical here - AsyncLLM is the outer loop. This is overlapped with GPU execution (ideally), so this is where any "overheads" should be if @@ -178,7 +179,7 @@ time" (`time.time()`) to calculate intervals as the former is unaffected by system clock changes (e.g. from NTP). It's also important to note that monotonic clocks differ between -processes - each process has its own reference. point. So it is +processes - each process has its own reference point. So it is meaningless to compare monotonic timestamps from different processes. Therefore, in order to calculate an interval, we must compare two @@ -343,14 +344,15 @@ vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3. vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0 ``` -Note - the choice of histogram buckets to be most useful to users -across a broad set of use cases is not straightforward and will -require refinement over time. +!!! note + The choice of histogram buckets to be most useful to users + across a broad set of use cases is not straightforward and will + require refinement over time. ### Cache Config Info -`prometheus_client` has support for [Info -metrics](https://prometheus.github.io/client_python/instrumenting/info/) +`prometheus_client` has support for +[Info metrics](https://prometheus.github.io/client_python/instrumenting/info/) which are equivalent to a `Gauge` whose value is permanently set to 1, but exposes interesting key/value pair information via labels. This is used for information about an instance that does not change - so it @@ -363,14 +365,11 @@ We use this concept for the `vllm:cache_config_info` metric: # HELP vllm:cache_config_info Information of the LLMEngine CacheConfig # TYPE vllm:cache_config_info gauge vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0 - ``` -However, `prometheus_client` has [never supported Info metrics in -multiprocessing -mode](https://github.com/prometheus/client_python/pull/300) - for -[unclear -reasons](gh-pr:7279#discussion_r1710417152). We +However, `prometheus_client` has +[never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) - +for [unclear reasons](gh-pr:7279#discussion_r1710417152). We simply use a `Gauge` metric set to 1 and `multiprocess_mode="mostrecent"` instead. @@ -395,11 +394,9 @@ distinguish between per-adapter counts. This should be revisited. Note that `multiprocess_mode="livemostrecent"` is used - the most recent metric is used, but only from currently running processes. -This was added in - and there is -[at least one known -user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If -we revisit this design and deprecate the old metric, we should reduce +This was added in and there is +[at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). +If we revisit this design and deprecate the old metric, we should reduce the need for a significant deprecation period by making the change in v0 also and asking this project to move to the new metric. @@ -442,23 +439,20 @@ suddenly (from their perspective) when it is removed, even if there is an equivalent metric for them to use. 
As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was -[deprecated](gh-pr:2764) (with a -comment in the code), -[removed](gh-pr:12383), and then -[noticed by a -user](gh-issue:13218). +[deprecated](gh-pr:2764) (with a comment in the code), +[removed](gh-pr:12383), and then [noticed by a user](gh-issue:13218). In general: -1) We should be cautious about deprecating metrics, especially since +1. We should be cautious about deprecating metrics, especially since it can be hard to predict the user impact. -2) We should include a prominent deprecation notice in the help string +2. We should include a prominent deprecation notice in the help string that is included in the `/metrics' output. -3) We should list deprecated metrics in user-facing documentation and +3. We should list deprecated metrics in user-facing documentation and release notes. -4) We should consider hiding deprecated metrics behind a CLI argument - in order to give administrators [an escape - hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics) +4. We should consider hiding deprecated metrics behind a CLI argument + in order to give administrators + [an escape hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics) for some time before deleting them. See the [deprecation policy](../../contributing/deprecation_policy.md) for @@ -474,7 +468,7 @@ removed. The `vllm:time_in_queue_requests` Histogram metric was added by and its calculation is: -``` +```python self.metrics.first_scheduled_time = now self.metrics.time_in_queue = now - self.metrics.arrival_time ``` @@ -482,7 +476,7 @@ The `vllm:time_in_queue_requests` Histogram metric was added by Two weeks later, added `vllm:request_queue_time_seconds` leaving us with: -``` +```python if seq_group.is_finished(): if (seq_group.metrics.first_scheduled_time is not None and seq_group.metrics.first_token_time is not None): @@ -517,8 +511,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU memory. This is also known as "KV cache offloading" and is configured with `--swap-space` and `--preemption-mode`. -In v0, [vLLM has long supported beam -search](gh-issue:6226). The +In v0, [vLLM has long supported beam search](gh-issue:6226). The SequenceGroup encapsulated the idea of N Sequences which all shared the same prompt kv blocks. This enabled KV cache block sharing between requests, and copy-on-write to do branching. CPU @@ -530,9 +523,8 @@ option than CPU swapping since blocks can be evicted slowly on demand and the part of the prompt that was evicted can be recomputed. SequenceGroup was removed in V1, although a replacement will be -required for "parallel sampling" (`n>1`). [Beam search was moved out of -the core (in -V0)](gh-issue:8306). There was a +required for "parallel sampling" (`n>1`). +[Beam search was moved out of the core (in V0)](gh-issue:8306). There was a lot of complex code for a very uncommon feature. In V1, with prefix caching being better (zero over head) and therefore @@ -547,18 +539,18 @@ Some v0 metrics are only relevant in the context of "parallel sampling". This is where the `n` parameter in a request is used to request multiple completions from the same prompt. -As part of adding parallel sampling support in we should +As part of adding parallel sampling support in , we should also add these metrics. - `vllm:request_params_n` (Histogram) -Observes the value of the 'n' parameter of every finished request. 
+ Observes the value of the 'n' parameter of every finished request. - `vllm:request_max_num_generation_tokens` (Histogram) -Observes the maximum output length of all sequences in every finished -sequence group. In the absence of parallel sampling, this is -equivalent to `vllm:request_generation_tokens`. + Observes the maximum output length of all sequences in every finished + sequence group. In the absence of parallel sampling, this is + equivalent to `vllm:request_generation_tokens`. ### Speculative Decoding @@ -576,26 +568,23 @@ There is a PR under review () to add "prompt lookup (ngram)" seculative decoding to v1. Other techniques will follow. We should revisit the v0 metrics in this context. -Note - we should probably expose acceptance rate as separate accepted -and draft counters, like we do for prefix caching hit rate. Efficiency -likely also needs similar treatment. +!!! note + We should probably expose acceptance rate as separate accepted + and draft counters, like we do for prefix caching hit rate. Efficiency + likely also needs similar treatment. ### Autoscaling and Load-balancing A common use case for our metrics is to support automated scaling of vLLM instances. -For related discussion from the [Kubernetes Serving Working -Group](https://github.com/kubernetes/community/tree/master/wg-serving), +For related discussion from the +[Kubernetes Serving Working Group](https://github.com/kubernetes/community/tree/master/wg-serving), see: -- [Standardizing Large Model Server Metrics in - Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) -- [Benchmarking LLM Workloads for Performance Evaluation and - Autoscaling in - Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ) -- [Inference - Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf) +- [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) +- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ) +- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf) - and . This is a non-trivial topic. Consider this comment from Rob: @@ -619,19 +608,16 @@ should judge an instance as approaching saturation: Our approach to naming metrics probably deserves to be revisited: -1. The use of colons in metric names seems contrary to ["colons are - reserved for user defined recording - rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels) +1. The use of colons in metric names seems contrary to + ["colons are reserved for user defined recording rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels). 2. Most of our metrics follow the convention of ending with units, but not all do. 3. Some of our metric names end with `_total`: -``` -If there is a suffix of `_total` on the metric name, it will be removed. When -exposing the time series for counter, a `_total` suffix will be added. This is -for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics -requires the `_total` suffix. -``` + If there is a suffix of `_total` on the metric name, it will be removed. When + exposing the time series for counter, a `_total` suffix will be added. 
This is + for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics + requires the `_total` suffix. ### Adding More Metrics @@ -642,8 +628,7 @@ There is no shortage of ideas for new metrics: - Proposals arising from specific use cases, like the Kubernetes auto-scaling topic above - Proposals that might arise out of standardisation efforts like - [OpenTelemetry Semantic Conventions for Gen - AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai). + [OpenTelemetry Semantic Conventions for Gen AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai). We should be cautious in our approach to adding new metrics. While metrics are often relatively straightforward to add: @@ -668,18 +653,14 @@ fall under the more general heading of "Observability". v0 has support for OpenTelemetry tracing: - Added by -- Configured with `--oltp-traces-endpoint` and - `--collect-detailed-traces` -- [OpenTelemetry blog - post](https://opentelemetry.io/blog/2024/llm-observability/) +- Configured with `--oltp-traces-endpoint` and `--collect-detailed-traces` +- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/) - [User-facing docs](../../examples/online_serving/opentelemetry.md) -- [Blog - post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f) -- [IBM product - docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview) +- [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f) +- [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview) -OpenTelemetry has a [Gen AI Working -Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md). +OpenTelemetry has a +[Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md). Since metrics is a big enough topic on its own, we are going to tackle the topic of tracing in v1 separately. @@ -698,7 +679,7 @@ These metrics are only enabled when OpenTelemetry tracing is enabled and if `--collect-detailed-traces=all/model/worker` is used. The documentation for this option states: -> collect detailed traces for the specified "modules. This involves +> collect detailed traces for the specified modules. This involves > use of possibly costly and or blocking operations and hence might > have a performance impact. From 474a0bf5b0772629d5bc05e94e0808d5f28a84aa Mon Sep 17 00:00:00 2001 From: Asher Date: Wed, 23 Jul 2025 18:54:08 +0800 Subject: [PATCH 286/552] [Model] add Hunyuan V1 Dense Model support. 
(#21368) Signed-off-by: Asher Zhang Signed-off-by: x22x22 --- docs/models/supported_models.md | 1 + tests/models/registry.py | 2 + .../{hunyuan_v1_moe.py => hunyuan_v1.py} | 70 ++++++++++++++----- vllm/model_executor/models/registry.py | 3 +- 4 files changed, 57 insertions(+), 19 deletions(-) rename vllm/model_executor/models/{hunyuan_v1_moe.py => hunyuan_v1.py} (95%) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 391e27cc12b..4553c46afb0 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -363,6 +363,7 @@ th { | `GraniteMoeSharedForCausalLM` | Granite MoE Shared | `ibm-research/moe-7b-1b-active-shared-experts` (test model) | ✅︎ | ✅︎ | ✅︎ | | `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | | | `Grok1ModelForCausalLM` | Grok1 | `hpcai-tech/grok-1`. | ✅︎ | ✅︎ | ✅︎ | +| `HunYuanDenseV1ForCausalLM` | Hunyuan-7B-Instruct-0124 | `tencent/Hunyuan-7B-Instruct-0124` | ✅︎ | | ✅︎ | | `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | ✅︎ | | ✅︎ | | `InternLMForCausalLM` | InternLM | `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc. | ✅︎ | ✅︎ | ✅︎ | | `InternLM2ForCausalLM` | InternLM2 | `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc. | ✅︎ | ✅︎ | ✅︎ | diff --git a/tests/models/registry.py b/tests/models/registry.py index 1eb7f7b9d82..84ca0bc6000 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -199,6 +199,8 @@ def check_available_online( trust_remote_code=True), "HunYuanMoEV1ForCausalLM": _HfExamplesInfo("tencent/Hunyuan-A13B-Instruct", trust_remote_code=True), + "HunYuanDenseV1ForCausalLM":_HfExamplesInfo("tencent/Hunyuan-7B-Instruct-0124", + trust_remote_code=True), "InternLMForCausalLM": _HfExamplesInfo("internlm/internlm-chat-7b", trust_remote_code=True), "InternLM2ForCausalLM": _HfExamplesInfo("internlm/internlm2-chat-7b", diff --git a/vllm/model_executor/models/hunyuan_v1_moe.py b/vllm/model_executor/models/hunyuan_v1.py similarity index 95% rename from vllm/model_executor/models/hunyuan_v1_moe.py rename to vllm/model_executor/models/hunyuan_v1.py index b3baec98b0f..fbba849a76f 100644 --- a/vllm/model_executor/models/hunyuan_v1_moe.py +++ b/vllm/model_executor/models/hunyuan_v1.py @@ -61,6 +61,19 @@ make_layers) +def _is_moe(config: PretrainedConfig) -> bool: + num_experts = getattr(config, "num_experts", None) + if isinstance(num_experts, int): + return num_experts > 1 + if isinstance(num_experts, list) and num_experts: + # Ensure all elements are integers before calling max. + if all(isinstance(e, int) for e in num_experts): + return max(num_experts) > 1 + else: + return False + return False + + def _get_cla_factor(config: PretrainedConfig) -> int: if not getattr(config, "use_cla", False): return 1 @@ -140,8 +153,8 @@ def __init__( # the KV heads across multiple tensor parallel GPUs. 
assert tp_size % self.total_num_kv_heads == 0 self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size) - # MistralConfig has an optional head_dim introduced by Mistral-Nemo - if hasattr(config, "head_dim"): + + if hasattr(config, "head_dim") and config.head_dim: self.head_dim = config.head_dim elif hasattr(config, "attention_head_dim"): self.head_dim = config.attention_head_dim @@ -490,12 +503,23 @@ def __init__( else: raise RuntimeError(f"Unsupported attention type: {attention_type}") - self.mlp = HunYuanSparseMoeBlock( - config=config, - quant_config=quant_config, - layer_id=layer_id, - prefix=f"{prefix}.mlp", - ) + if _is_moe(config): + self.mlp = HunYuanSparseMoeBlock( + config=config, + quant_config=quant_config, + layer_id=layer_id, + prefix=f"{prefix}.mlp", + ) + else: + self.mlp = HunYuanMLP( + hidden_size=self.hidden_size, + intermediate_size=self.intermediate_size, + hidden_act=config.hidden_act, + quant_config=quant_config, + bias=getattr(config, "mlp_bias", False), + prefix=f"{prefix}.mlp", + ) + self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) self.post_attention_layernorm = RMSNorm(config.hidden_size, @@ -642,15 +666,17 @@ def _split_qkv_weight(self, qkv: torch.Tensor): return torch.concat((q, k, v)) def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: - - # Params for weights, fp8 weight scales, fp8 activation scales - # (param_name, weight_name, expert_id, shard_id) - return FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="gate_proj", - ckpt_down_proj_name="down_proj", - ckpt_up_proj_name="up_proj", - num_experts=self.config.num_experts, - ) + if _is_moe(self.config): + # Params for weights, fp8 weight scales, fp8 activation scales + # (param_name, weight_name, expert_id, shard_id) + return FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.num_experts, + ) + else: + return [] def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): cla_factor = _get_cla_factor(self.config) @@ -815,7 +841,7 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]): return loaded_params -class HunYuanMoEV1ForCausalLM(nn.Module, SupportsLoRA): +class HunYuanV1Base(nn.Module, SupportsLoRA): packed_modules_mapping = { "qkv_proj": [ "q_proj", @@ -901,3 +927,11 @@ def load_weights(self, weights: Iterable[tuple[str, def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: return self.model.get_expert_mapping() + + +class HunYuanDenseV1ForCausalLM(HunYuanV1Base): + pass + + +class HunYuanMoEV1ForCausalLM(HunYuanV1Base): + pass diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 100532943c2..fafb6a70438 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -79,7 +79,8 @@ "GraniteMoeSharedForCausalLM": ("granitemoeshared", "GraniteMoeSharedForCausalLM"), # noqa: E501 "GritLM": ("gritlm", "GritLM"), "Grok1ModelForCausalLM": ("grok1", "Grok1ForCausalLM"), - "HunYuanMoEV1ForCausalLM": ("hunyuan_v1_moe", "HunYuanMoEV1ForCausalLM"), + "HunYuanMoEV1ForCausalLM": ("hunyuan_v1", "HunYuanMoEV1ForCausalLM"), + "HunYuanDenseV1ForCausalLM": ("hunyuan_v1", "HunYuanDenseV1ForCausalLM"), "InternLMForCausalLM": ("llama", "LlamaForCausalLM"), "InternLM2ForCausalLM": ("internlm2", "InternLM2ForCausalLM"), "InternLM2VEForCausalLM": ("internlm2_ve", "InternLM2VEForCausalLM"), From 3ec970663974f6c07025af3e329b8d4a76cf4ee3 Mon Sep 
17 00:00:00 2001 From: Cyrus Leung Date: Wed, 23 Jul 2025 20:53:26 +0800 Subject: [PATCH 287/552] [V1] Check all pooling tasks during profiling (#21299) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/sequence.py | 7 ++++ vllm/v1/worker/gpu_model_runner.py | 63 +++++++++++++++++++----------- 2 files changed, 47 insertions(+), 23 deletions(-) diff --git a/vllm/sequence.py b/vllm/sequence.py index 99208fbad65..1f507add0d9 100644 --- a/vllm/sequence.py +++ b/vllm/sequence.py @@ -1173,6 +1173,10 @@ class PoolingSequenceGroupOutput( # The actual type is in SequenceGroup.pooled_data data: Any + def get_data_nbytes(self) -> int: + data: torch.Tensor = self.data + return data.nbytes + def __repr__(self) -> str: return f"PoolingSequenceGroupOutput(data={self.data}" @@ -1234,6 +1238,9 @@ class PoolerOutput( """The output from a pooling operation in the pooling model.""" outputs: list[PoolingSequenceGroupOutput] + def get_data_nbytes(self) -> int: + return sum(o.get_data_nbytes() for o in self.outputs) + def __getitem__(self, idx: int) -> PoolingSequenceGroupOutput: return self.outputs[idx] diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 6a42e01f14b..2078fedac92 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -41,7 +41,7 @@ from vllm.multimodal.utils import group_mm_inputs_by_modality from vllm.pooling_params import PoolingParams, PoolingTask from vllm.sampling_params import SamplingType -from vllm.sequence import IntermediateTensors +from vllm.sequence import IntermediateTensors, PoolerOutput from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, DeviceMemoryProfiler, GiB_bytes, LazyLoader, check_use_alibi, get_dtype_size, is_pin_memory_available, round_up) @@ -1819,7 +1819,7 @@ def load_model(self, eep_scale_up: bool = False) -> None: old_global_expert_indices = None rank_mapping = None - with DeviceMemoryProfiler() as m: # noqa: SIM117 + with DeviceMemoryProfiler() as m: time_before_load = time.perf_counter() model_loader = get_model_loader(self.load_config) if not hasattr(self, "model"): @@ -2215,12 +2215,11 @@ def _dummy_sampler_run( ) return sampler_output - @torch.inference_mode() - def _dummy_pooler_run( + def _dummy_pooler_run_task( self, hidden_states: torch.Tensor, - ) -> torch.Tensor: - + task: PoolingTask, + ) -> PoolerOutput: num_tokens = hidden_states.shape[0] max_num_reqs = self.scheduler_config.max_num_seqs num_reqs = min(num_tokens, max_num_reqs) @@ -2232,37 +2231,55 @@ def _dummy_pooler_run( hidden_states_list = list( torch.split(hidden_states, num_scheduled_tokens_list)) - req_num_tokens = num_tokens // num_reqs - model = cast(VllmModelForPooling, self.model) - dummy_task = self.get_supported_pooling_tasks()[0] - dummy_pooling_params = PoolingParams(task=dummy_task) + dummy_prompt_lens = torch.tensor( + [h.shape[0] for h in hidden_states_list], + device=self.device, + ) + dummy_token_ids = torch.zeros((num_reqs, req_num_tokens), + dtype=torch.int32, + device=self.device) - to_update = model.pooler.get_pooling_updates(dummy_task) + model = cast(VllmModelForPooling, self.model) + dummy_pooling_params = PoolingParams(task=task) + to_update = model.pooler.get_pooling_updates(task) to_update.apply(dummy_pooling_params) dummy_metadata = PoolingMetadata( - prompt_lens=torch.tensor([h.shape[0] for h in hidden_states_list], - device=self.device), - prompt_token_ids=torch.zeros((num_reqs, req_num_tokens), - dtype=torch.int32, - device=self.device), - pooling_params=[dummy_pooling_params] * 
num_reqs) + prompt_lens=dummy_prompt_lens, + prompt_token_ids=dummy_token_ids, + pooling_params=[dummy_pooling_params] * num_reqs, + ) try: - pooler_output = model.pooler(hidden_states=hidden_states_list, - pooling_metadata=dummy_metadata) + return model.pooler(hidden_states=hidden_states_list, + pooling_metadata=dummy_metadata) except RuntimeError as e: if 'out of memory' in str(e): raise RuntimeError( - "CUDA out of memory occurred when warming up pooler with " - f"{num_reqs} dummy requests. Please try lowering " - "`max_num_seqs` or `gpu_memory_utilization` when " + "CUDA out of memory occurred when warming up pooler " + f"({task=}) with {num_reqs} dummy requests. Please try " + "lowering `max_num_seqs` or `gpu_memory_utilization` when " "initializing the engine.") from e else: raise e - return pooler_output + + @torch.inference_mode() + def _dummy_pooler_run( + self, + hidden_states: torch.Tensor, + ) -> PoolerOutput: + # Find the task that has the largest output for subsequent steps + output_size = dict[PoolingTask, float]() + for task in self.get_supported_pooling_tasks(): + # Run a full batch with each task to ensure none of them OOMs + output = self._dummy_pooler_run_task(hidden_states, task) + output_size[task] = output.get_data_nbytes() + del output # Allow GC + + max_task = max(output_size.items(), key=lambda x: x[1])[0] + return self._dummy_pooler_run_task(hidden_states, max_task) def profile_run(self) -> None: # Profile with multimodal encoder & encoder cache. From d3b738b1e612534bc59411b4679b1b8a208612e9 Mon Sep 17 00:00:00 2001 From: Tao He Date: Wed, 23 Jul 2025 21:34:37 +0800 Subject: [PATCH 288/552] [Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. (#21364) Signed-off-by: Tao He Signed-off-by: x22x22 --- vllm/attention/backends/dual_chunk_flash_attn.py | 8 -------- 1 file changed, 8 deletions(-) diff --git a/vllm/attention/backends/dual_chunk_flash_attn.py b/vllm/attention/backends/dual_chunk_flash_attn.py index e108646e7ff..fa6f3f1b39c 100644 --- a/vllm/attention/backends/dual_chunk_flash_attn.py +++ b/vllm/attention/backends/dual_chunk_flash_attn.py @@ -1055,7 +1055,6 @@ def _dual_chunk_flash_attn_prefill_func( v_states_intra, softmax_scale=softmax_scale, causal=True, - block_table=block_table, stage="intra", vertical_indices=vertical_buffer, slash_indices=slash_buffer, @@ -1070,7 +1069,6 @@ def _dual_chunk_flash_attn_prefill_func( v_states_intra, softmax_scale=softmax_scale, causal=True, - block_table=block_table, stage="intra", vertical_indices=intra_vertical_indices, slash_indices=intra_slash_indices, @@ -1085,7 +1083,6 @@ def _dual_chunk_flash_attn_prefill_func( v_states_succ, softmax_scale=softmax_scale, causal=False, - block_table=block_table, stage="succ", vertical_indices=succ_vertical_buffer, slash_indices=succ_slash_buffer, @@ -1100,7 +1097,6 @@ def _dual_chunk_flash_attn_prefill_func( v_states_succ, softmax_scale=softmax_scale, causal=False, - block_table=block_table, stage="succ", vertical_indices=succ_vertical_indices, slash_indices=succ_slash_indices, @@ -1115,7 +1111,6 @@ def _dual_chunk_flash_attn_prefill_func( v_states_inter, softmax_scale=softmax_scale, causal=False, - block_table=block_table, stage="inter", vertical_indices=inter_vertical_buffer, slash_indices=inter_slash_buffer, @@ -1130,7 +1125,6 @@ def _dual_chunk_flash_attn_prefill_func( v_states_inter, softmax_scale=softmax_scale, causal=False, - block_table=block_table, stage="inter", vertical_indices=inter_vertical_indices, slash_indices=inter_slash_indices, @@ 
-1151,7 +1145,6 @@ def _do_flash_attn( value_states: torch.Tensor, softmax_scale: float, causal: bool = True, - block_table: torch.Tensor = None, max_seqlen_k: Optional[int] = None, stage: str = "intra", vertical_indices: Optional[torch.Tensor] = None, @@ -1230,7 +1223,6 @@ def _do_flash_attn( device=query_states.device), max_seqlen_k=max_seqlen_k, causal=causal, - block_table=block_table.unsqueeze(0), return_softmax_lse=True, ) softmax_lse = softmax_lse.view(q_len, q_heads, 1).transpose(0, From 07bffafabddca980232dc5f483bdada37238dbf9 Mon Sep 17 00:00:00 2001 From: Nick Hill Date: Wed, 23 Jul 2025 15:49:25 +0100 Subject: [PATCH 289/552] [Tests] Add tests for headless internal DP LB (#21450) Signed-off-by: Nick Hill Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 2 + .../openai/test_multi_api_servers.py | 123 +--- tests/v1/test_internal_lb_dp.py | 639 ++++++++++++++++++ tests/v1/test_utils.py | 124 ++++ 4 files changed, 768 insertions(+), 120 deletions(-) create mode 100644 tests/v1/test_internal_lb_dp.py diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index 00608229b95..c7378bf8ba5 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -165,6 +165,7 @@ steps: - tests/examples/offline_inference/data_parallel.py - tests/v1/test_async_llm_dp.py - tests/v1/test_external_lb_dp.py + - tests/v1/test_internal_lb_dp.py - tests/v1/engine/test_engine_core_client.py commands: # test with tp=2 and external_dp=2 @@ -176,6 +177,7 @@ steps: - python3 ../examples/offline_inference/data_parallel.py --enforce-eager - TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_async_llm_dp.py - TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_external_lb_dp.py + - TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/test_internal_lb_dp.py - pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp - pytest -v -s distributed/test_utils.py - pytest -v -s compile/test_basic_correctness.py diff --git a/tests/v1/entrypoints/openai/test_multi_api_servers.py b/tests/v1/entrypoints/openai/test_multi_api_servers.py index e84b5e3095d..f7c31b0c437 100644 --- a/tests/v1/entrypoints/openai/test_multi_api_servers.py +++ b/tests/v1/entrypoints/openai/test_multi_api_servers.py @@ -2,136 +2,19 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import asyncio import os -import re import openai # use the official client for correctness check import pytest import pytest_asyncio -import requests from tests.utils import RemoteOpenAIServer +from tests.v1.test_utils import check_request_balancing MODEL_NAME = "ibm-research/PowerMoE-3b" DP_SIZE = os.getenv("DP_SIZE", "1") -def get_prometheus_metrics( - server: RemoteOpenAIServer) -> dict[str, dict[str, float]]: - """Fetch and parse Prometheus metrics from the /metrics endpoint. - - Returns: - Dict mapping metric names to their values grouped by labels. 
- For example: {"vllm:request_success": { - "engine=0": 5.0, "engine=1": 3.0} - } - """ - try: - response = requests.get(server.url_for("metrics"), timeout=10) - response.raise_for_status() - - metrics: dict[str, dict[str, float]] = {} - - # Regex patterns for Prometheus metrics - metric_with_labels = re.compile( - r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+([\d\.\-\+e]+)$') - metric_simple = re.compile( - r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\s+([\d\.\-\+e]+)$') - - for line in response.text.split('\n'): - line = line.strip() - # Skip comments and empty lines - if not line or line.startswith('#'): - continue - - # Try to match metric with labels first - match = metric_with_labels.match(line) - if match: - metric_name, labels_part, value_str = match.groups() - try: - value = float(value_str) - if metric_name not in metrics: - metrics[metric_name] = {} - metrics[metric_name][f'{{{labels_part}}}'] = value - except ValueError: - continue - else: - # Try simple metric without labels - match = metric_simple.match(line) - if match: - metric_name, value_str = match.groups() - try: - value = float(value_str) - if metric_name not in metrics: - metrics[metric_name] = {} - metrics[metric_name][''] = value - except ValueError: - continue - - return metrics - except Exception as e: - pytest.fail(f"Failed to fetch Prometheus metrics: {e}") - return {} - - -def get_engine_request_counts( - metrics: dict[str, dict[str, float]]) -> dict[str, float]: - """Extract request counts per engine from Prometheus metrics. - - Returns: - Dict mapping engine indices to request counts. - For example: {"0": 15.0, "1": 12.0} - """ - engine_counts = {} - - # Look for request success metrics with engine labels - success_metrics = metrics.get("vllm:request_success_total", {}) - engine_pattern = re.compile(r'engine="([^"]*)"') - - for labels, count in success_metrics.items(): - # Extract engine ID from labels using regex - match = engine_pattern.search(labels) - if match: - engine_id = match.group(1) - if engine_id not in engine_counts: - engine_counts[engine_id] = 0.0 - engine_counts[engine_id] += count - - return engine_counts - - -def check_request_balancing(server: RemoteOpenAIServer): - """Check request balancing via Prometheus metrics if DP_SIZE > 1. - - Args: - server: The RemoteOpenAIServer instance - """ - dp_size = int(DP_SIZE) - if dp_size <= 1: - return - - # Get metrics after all requests are completed - metrics = get_prometheus_metrics(server) - engine_counts = get_engine_request_counts(metrics) - - # Check that multiple engines received requests - engines_with_requests = [ - engine for engine, count in engine_counts.items() if count > 0 - ] - assert len(engines_with_requests) == dp_size, ( - f"Expected requests to be distributed across multiple engines," - f" but only engine(s) {engines_with_requests} received " - f"requests. 
Engine counts: {engine_counts}") - - # Verify that the load is reasonably balanced - # (no engine should handle all requests) - total_requests = sum(engine_counts.values()) - - for count in engine_counts.values(): - assert count > total_requests // (dp_size + 1), ( - f"requests are imbalanced: {engine_counts}") - - @pytest.fixture(scope="module") def default_server_args(): return [ @@ -217,7 +100,7 @@ async def make_request(): assert all(completion is not None for completion in results) # Check request balancing via Prometheus metrics if DP_SIZE > 1 - check_request_balancing(server) + check_request_balancing(server, int(DP_SIZE)) @pytest.mark.asyncio @@ -295,4 +178,4 @@ async def make_streaming_request(): assert all(results), "Not all streaming requests completed successfully." # Check request balancing via Prometheus metrics if DP_SIZE > 1 - check_request_balancing(server) + check_request_balancing(server, int(DP_SIZE)) diff --git a/tests/v1/test_internal_lb_dp.py b/tests/v1/test_internal_lb_dp.py new file mode 100644 index 00000000000..9aef4d5821e --- /dev/null +++ b/tests/v1/test_internal_lb_dp.py @@ -0,0 +1,639 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import asyncio +import os +import threading +import time + +import openai # use the official client for correctness check +import pytest +import pytest_asyncio + +from tests.utils import RemoteOpenAIServer +from tests.v1.test_utils import check_request_balancing +from vllm.platforms import Platform + +MODEL_NAME = "ibm-research/PowerMoE-3b" + +# Number of data parallel ranks for multi-node internal LB testing +DP_SIZE = int(os.getenv("DP_SIZE", "2")) +# Default tensor parallel size to use +TP_SIZE = int(os.getenv("TP_SIZE", "1")) + +# Number of nodes to simulate +NUM_NODES = 2 + + +class MultinodeInternalLBServerManager: + """Manages multi-node data parallel vLLM server instances for internal + load balancer testing using --headless mode.""" + + def __init__(self, + model_name: str, + dp_size: int, + api_server_count: int, + base_server_args: list, + dp_per_node: int = 1, + tp_size: int = TP_SIZE): + self.model_name = model_name + self.dp_size = dp_size + self.dp_per_node = dp_per_node + self.tp_size = tp_size + self.api_server_count = api_server_count + self.base_server_args = base_server_args + self.servers: list[tuple[RemoteOpenAIServer, list[str]]] = [] + self.server_threads: list[threading.Thread] = [] + + def __enter__(self) -> list[tuple[RemoteOpenAIServer, list[str]]]: + """Start all server instances for multi-node internal LB mode.""" + for rank in range(0, self.dp_size, self.dp_per_node): + # Create server args for this specific rank + server_args = self.base_server_args.copy() + + if rank == 0: + # Head node - runs API server and first DP rank + server_args.extend([ + "--data-parallel-size", + str(self.dp_size), + "--data-parallel-size-local", + str(self.dp_per_node), + "--tensor-parallel-size", + str(self.tp_size), + "--port", + "8000", # Single endpoint for all requests + "--api-server-count", + str(self.api_server_count), + "--data-parallel-address", + "127.0.0.1", + "--data-parallel-rpc-port", + "13345", + ]) + else: + # Secondary nodes - run in headless mode + server_args.extend([ + "--headless", + "--data-parallel-size", + str(self.dp_size), + "--data-parallel-size-local", + str(self.dp_per_node), + "--data-parallel-start-rank", + str(rank), + "--tensor-parallel-size", + str(self.tp_size), + "--data-parallel-address", + "127.0.0.1", + 
"--data-parallel-rpc-port", + "13345", + ]) + + # Use a thread to start each server to allow parallel initialization + def start_server(r: int, sargs: list[str]): + gpus_per_node = self.tp_size * self.dp_per_node + try: + # Start the server + server = RemoteOpenAIServer( + self.model_name, + sargs, + auto_port=False, + env_dict={ + "CUDA_VISIBLE_DEVICES": + ",".join( + str(Platform.device_id_to_physical_device_id( + i)) for i in range(r, r + gpus_per_node)) + }) + server.__enter__() + if r == 0: + print( + f"Head node (rank {r}) started successfully with " + f"{self.api_server_count} API servers") + else: + print(f"Headless node (rank {r}) started successfully") + self.servers.append((server, sargs)) + except Exception as e: + print(f"Failed to start server rank {r}: {e}") + raise + + thread = threading.Thread(target=start_server, + args=(rank, server_args)) + thread.start() + + self.server_threads.append(thread) + + # Wait for all servers to start + for thread in self.server_threads: + thread.join() + + # Give servers additional time to fully initialize and coordinate + time.sleep(3) + + if len(self.servers) != self.dp_size // self.dp_per_node: + raise Exception("Servers failed to start") + + return self.servers + + def __exit__(self, exc_type, exc_val, exc_tb): + """Stop all server instances.""" + while self.servers: + try: + self.servers.pop()[0].__exit__(exc_type, exc_val, exc_tb) + except Exception as e: + print(f"Error stopping server: {e}") + + +class APIOnlyServerManager: + """Manages API-only server (Node 0) and headless engines server (Node 1) + for testing separated API server and engine configuration.""" + + def __init__(self, + model_name: str, + dp_size: int, + api_server_count: int, + base_server_args: list, + tp_size: int = TP_SIZE): + self.model_name = model_name + self.dp_size = dp_size + self.tp_size = tp_size + self.api_server_count = api_server_count + self.base_server_args = base_server_args + self.servers: list[tuple[RemoteOpenAIServer, list[str]]] = [] + self.server_threads: list[threading.Thread] = [] + + def __enter__(self) -> list[tuple[RemoteOpenAIServer, list[str]]]: + """Start API-only server and headless engines server.""" + + # Start API-only server (Node 0) - no engines, only API server + api_server_args = self.base_server_args.copy() + api_server_args.extend([ + "--data-parallel-size", + str(self.dp_size), + "--data-parallel-size-local", + "0", # No engines on this node + "--tensor-parallel-size", + str(self.tp_size), + "--port", + "8000", + "--api-server-count", + str(self.api_server_count), + "--data-parallel-address", + "127.0.0.1", + "--data-parallel-rpc-port", + "13345", + ]) + + # Start headless engines server (Node 1) - all engines, no API server + engines_server_args = self.base_server_args.copy() + engines_server_args.extend([ + "--headless", + "--data-parallel-size", + str(self.dp_size), + "--data-parallel-size-local", + str(self.dp_size), # All engines on this node + "--tensor-parallel-size", + str(self.tp_size), + "--data-parallel-address", + "127.0.0.1", + "--data-parallel-rpc-port", + "13345", + ]) + + # Use threads to start both servers in parallel + def start_api_server(): + try: + server = RemoteOpenAIServer( + self.model_name, + api_server_args, + auto_port=False, + env_dict={}) # No GPUs needed for API-only server + server.__enter__() + print(f"API-only server started successfully with " + f"{self.api_server_count} API servers") + self.servers.append((server, api_server_args)) + except Exception as e: + print(f"Failed to start API-only 
server: {e}") + raise + + def start_engines_server(): + try: + server = RemoteOpenAIServer( + self.model_name, + engines_server_args, + auto_port=False, + env_dict={ + "CUDA_VISIBLE_DEVICES": + ",".join( + str(Platform.device_id_to_physical_device_id(i)) + for i in range(self.dp_size * self.tp_size)) + }) + server.__enter__() + print(f"Headless engines server started successfully with " + f"{self.dp_size} engines") + self.servers.append((server, engines_server_args)) + except Exception as e: + print(f"Failed to start headless engines server: {e}") + raise + + # Start API server first + api_thread = threading.Thread(target=start_api_server) + api_thread.start() + self.server_threads.append(api_thread) + + # Start engines server second + engines_thread = threading.Thread(target=start_engines_server) + engines_thread.start() + self.server_threads.append(engines_thread) + + # Wait for both servers to start + for thread in self.server_threads: + thread.join() + + # Give servers additional time to fully initialize and coordinate + time.sleep(3) + + if len(self.servers) != 2: + raise Exception("Both servers failed to start") + + return self.servers + + def __exit__(self, exc_type, exc_val, exc_tb): + """Stop both server instances.""" + while self.servers: + try: + self.servers.pop()[0].__exit__(exc_type, exc_val, exc_tb) + except Exception as e: + print(f"Error stopping server: {e}") + + +@pytest.fixture(scope="module") +def default_server_args(): + return [ + # use half precision for speed and memory savings in CI environment + "--dtype", + "bfloat16", + "--max-model-len", + "2048", + "--max-num-seqs", + "128", + "--enforce-eager", + ] + + +@pytest.fixture(scope="module", params=[1, 4]) +def servers(request, default_server_args): + api_server_count = request.param + with MultinodeInternalLBServerManager(MODEL_NAME, DP_SIZE, + api_server_count, + default_server_args, + DP_SIZE // NUM_NODES, + TP_SIZE) as server_list: + yield server_list + + +@pytest.fixture(scope="module", params=[1, 4]) +def api_only_servers(request, default_server_args): + """Fixture for API-only server + headless engines configuration.""" + api_server_count = request.param + with APIOnlyServerManager(MODEL_NAME, DP_SIZE, api_server_count, + default_server_args, TP_SIZE) as server_list: + yield server_list + + +@pytest_asyncio.fixture +async def client(servers: list[tuple[RemoteOpenAIServer, list[str]]]): + # For internal LB, we only connect to the head node (rank 0) + # which provides the single API endpoint + head_server = servers[0][0] + async with head_server.get_async_client() as client: + yield client + + +@pytest_asyncio.fixture +async def api_only_client(api_only_servers: list[tuple[RemoteOpenAIServer, + list[str]]]): + """Client fixture for API-only server configuration.""" + # Connect to the API-only server (first server in the list) + api_server = api_only_servers[0][0] + async with api_server.get_async_client() as client: + yield client + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "model_name", + [MODEL_NAME], +) +async def test_multinode_dp_completion(client: openai.AsyncOpenAI, + servers: list[tuple[RemoteOpenAIServer, + list[str]]], + model_name: str) -> None: + + async def make_request(): + completion = await client.completions.create( + model=model_name, + prompt="Hello, my name is", + max_tokens=10, + temperature=1.0) + + assert completion.id is not None + assert completion.choices is not None and len(completion.choices) == 1 + + choice = completion.choices[0] + # The exact number of tokens can vary 
slightly with temperature=1.0, + # so we check for a reasonable minimum length. + assert len(choice.text) >= 1 + # Finish reason might not always be 'length' if the model finishes early + # or due to other reasons, especially with high temperature. + # So, we'll accept 'length' or 'stop'. + assert choice.finish_reason in ("length", "stop") + + # Token counts can also vary, so we check they are positive. + assert completion.usage.completion_tokens > 0 + assert completion.usage.prompt_tokens > 0 + assert completion.usage.total_tokens > 0 + return completion + + # Test single request + result = await make_request() + assert result is not None + print( + "Multi-node internal LB handled single completion request successfully" + ) + + await asyncio.sleep(0.5) + + # Send multiple requests - internal LB should distribute across DP ranks + num_requests = 50 + all_tasks = [make_request() for _ in range(num_requests)] + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests + assert all(completion is not None for completion in results) + + await asyncio.sleep(0.5) + + # Second burst of requests + all_tasks = [make_request() for _ in range(num_requests)] + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests + assert all(completion is not None for completion in results) + + _, server_args = servers[0] + api_server_count = ( + server_args.count('--api-server-count') + and server_args[server_args.index('--api-server-count') + 1] or 1) + print(f"Successfully completed multi-node internal LB test with " + f"{len(servers)} DP ranks (API server count: {api_server_count})") + + # Check request balancing via Prometheus metrics + head_server = servers[0][0] + check_request_balancing(head_server, DP_SIZE) + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "model_name", + [MODEL_NAME], +) +async def test_multinode_dp_completion_streaming(client: openai.AsyncOpenAI, + servers: list[ + tuple[RemoteOpenAIServer, + list[str]]], + model_name: str) -> None: + prompt = "What is an LLM?" + + async def make_streaming_request(): + # Perform a non-streaming request to get the expected full output + single_completion = await client.completions.create( + model=model_name, + prompt=prompt, + max_tokens=5, + temperature=0.0, + ) + single_output = single_completion.choices[0].text + + # Perform the streaming request + stream = await client.completions.create(model=model_name, + prompt=prompt, + max_tokens=5, + temperature=0.0, + stream=True) + chunks: list[str] = [] + finish_reason_count = 0 + last_chunk = None + async for chunk in stream: + chunks.append(chunk.choices[0].text) + if chunk.choices[0].finish_reason is not None: + finish_reason_count += 1 + last_chunk = chunk # Keep track of the last chunk + + # finish reason should only return in the last block for OpenAI API + assert finish_reason_count == 1, ( + "Finish reason should appear exactly once.") + assert last_chunk is not None, ( + "Stream should have yielded at least one chunk.") + assert last_chunk.choices[ + 0].finish_reason == "length", "Finish reason should be 'length'." + # Check that the combined text matches the non-streamed version. + assert "".join( + chunks + ) == single_output, "Streamed output should match non-streamed output." 
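+        # (temperature=0.0 makes decoding greedy, so an exact match between
+        # the streamed and non-streamed outputs is expected above)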
+ return True # Indicate success for this request + + # Test single streaming request + result = await make_streaming_request() + assert result is not None + print( + "Multi-node internal LB handled single streaming request successfully") + + await asyncio.sleep(0.5) + + # Send multiple streaming requests - internal LB should distribute across + # DP ranks + num_requests = 50 + all_tasks = [make_streaming_request() for _ in range(num_requests)] + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests + assert all(results), "Not all streaming requests completed successfully." + + await asyncio.sleep(0.5) + + # Second burst of streaming requests + all_tasks = [make_streaming_request() for _ in range(num_requests)] + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests + assert all(results), "Not all streaming requests completed successfully." + + _, server_args = servers[0] + api_server_count = ( + server_args.count('--api-server-count') + and server_args[server_args.index('--api-server-count') + 1] or 1) + print(f"Successfully completed multi-node internal LB streaming test with " + f"{len(servers)} DP ranks (API server count: {api_server_count})") + + # Check request balancing via Prometheus metrics + head_server = servers[0][0] + check_request_balancing(head_server, DP_SIZE) + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "model_name", + [MODEL_NAME], +) +async def test_api_only_multinode_dp_completion( + api_only_client: openai.AsyncOpenAI, + api_only_servers: list[tuple[RemoteOpenAIServer, + list[str]]], model_name: str) -> None: + """Test API-only server with all engines on separate headless server.""" + + async def make_request(): + completion = await api_only_client.completions.create( + model=model_name, + prompt="Hello, my name is", + max_tokens=10, + temperature=1.0) + + assert completion.id is not None + assert completion.choices is not None and len(completion.choices) == 1 + + choice = completion.choices[0] + # The exact number of tokens can vary slightly with temperature=1.0, + # so we check for a reasonable minimum length. + assert len(choice.text) >= 1 + # Finish reason might not always be 'length' if the model finishes + # early or due to other reasons, especially with high temperature. + # So, we'll accept 'length' or 'stop'. + assert choice.finish_reason in ("length", "stop") + + # Token counts can also vary, so we check they are positive. 
+ assert completion.usage.completion_tokens > 0 + assert completion.usage.prompt_tokens > 0 + assert completion.usage.total_tokens > 0 + return completion + + # Test single request + result = await make_request() + assert result is not None + print("API-only server handled single completion request successfully") + + await asyncio.sleep(0.5) + + # Send multiple requests - should be distributed across engines on + # headless server + num_requests = 50 + all_tasks = [make_request() for _ in range(num_requests)] + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests + assert all(completion is not None for completion in results) + + await asyncio.sleep(0.5) + + # Second burst of requests + all_tasks = [make_request() for _ in range(num_requests)] + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests + assert all(completion is not None for completion in results) + + _, api_server_args = api_only_servers[0] + api_server_count = ( + api_server_args.count('--api-server-count') + and api_server_args[api_server_args.index('--api-server-count') + 1] + or 1) + print(f"Successfully completed API-only multi-node test with {DP_SIZE} " + f"engines on headless server (API server count: {api_server_count})") + + # Check request balancing via Prometheus metrics + api_server = api_only_servers[0][0] + check_request_balancing(api_server, DP_SIZE) + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "model_name", + [MODEL_NAME], +) +async def test_api_only_multinode_dp_completion_streaming( + api_only_client: openai.AsyncOpenAI, + api_only_servers: list[tuple[RemoteOpenAIServer, + list[str]]], model_name: str) -> None: + """Test API-only server streaming with all engines on separate + headless server.""" + prompt = "What is an LLM?" + + async def make_streaming_request(): + # Perform a non-streaming request to get the expected full output + single_completion = await api_only_client.completions.create( + model=model_name, + prompt=prompt, + max_tokens=5, + temperature=0.0, + ) + single_output = single_completion.choices[0].text + + # Perform the streaming request + stream = await api_only_client.completions.create(model=model_name, + prompt=prompt, + max_tokens=5, + temperature=0.0, + stream=True) + chunks: list[str] = [] + finish_reason_count = 0 + last_chunk = None + async for chunk in stream: + chunks.append(chunk.choices[0].text) + if chunk.choices[0].finish_reason is not None: + finish_reason_count += 1 + last_chunk = chunk # Keep track of the last chunk + + # finish reason should only return in the last block for OpenAI API + assert finish_reason_count == 1, ( + "Finish reason should appear exactly once.") + assert last_chunk is not None, ( + "Stream should have yielded at least one chunk.") + assert last_chunk.choices[ + 0].finish_reason == "length", "Finish reason should be 'length'." + # Check that the combined text matches the non-streamed version. + assert "".join( + chunks + ) == single_output, "Streamed output should match non-streamed output." 
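+        # (temperature=0.0 makes decoding greedy, so an exact match between
+        # the streamed and non-streamed outputs is expected above)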
+ return True # Indicate success for this request + + # Test single streaming request + result = await make_streaming_request() + assert result is not None + print("API-only server handled single streaming request successfully") + + await asyncio.sleep(0.5) + + # Send multiple streaming requests - should be distributed across engines + num_requests = 50 + all_tasks = [make_streaming_request() for _ in range(num_requests)] + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests + assert all(results), "Not all streaming requests completed successfully." + + await asyncio.sleep(0.5) + + # Second burst of streaming requests + all_tasks = [make_streaming_request() for _ in range(num_requests)] + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests + assert all(results), "Not all streaming requests completed successfully." + + _, api_server_args = api_only_servers[0] + api_server_count = ( + api_server_args.count('--api-server-count') + and api_server_args[api_server_args.index('--api-server-count') + 1] + or 1) + print(f"Successfully completed API-only streaming test with {DP_SIZE} " + f"engines on headless server (API server count: {api_server_count})") + + # Check request balancing via Prometheus metrics + api_server = api_only_servers[0][0] + check_request_balancing(api_server, DP_SIZE) diff --git a/tests/v1/test_utils.py b/tests/v1/test_utils.py index fd0e630ce17..0b892bd9dff 100644 --- a/tests/v1/test_utils.py +++ b/tests/v1/test_utils.py @@ -1,8 +1,13 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import re + +import pytest +import requests import torch +from tests.utils import RemoteOpenAIServer from vllm.v1.worker.utils import bind_kv_cache @@ -61,3 +66,122 @@ def test_bind_kv_cache_non_attention(): assert runner_kv_caches[0] is kv_cache['model.layers.20.attn'] assert runner_kv_caches[1] is kv_cache['model.layers.28.attn'] + + +# Prometheus metrics utilities for testing + + +def get_prometheus_metrics( + server: RemoteOpenAIServer) -> dict[str, dict[str, float]]: + """Fetch and parse Prometheus metrics from the /metrics endpoint. + + Returns: + Dict mapping metric names to their values grouped by labels. 
+ For example: {"vllm:request_success": { + "engine=0": 5.0, "engine=1": 3.0} + } + """ + try: + response = requests.get(server.url_for("metrics"), timeout=10) + response.raise_for_status() + + metrics: dict[str, dict[str, float]] = {} + + # Regex patterns for Prometheus metrics + metric_with_labels = re.compile( + r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+([\d\.\-\+e]+)$') + metric_simple = re.compile( + r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\s+([\d\.\-\+e]+)$') + + for line in response.text.split('\n'): + line = line.strip() + # Skip comments and empty lines + if not line or line.startswith('#'): + continue + + # Try to match metric with labels first + match = metric_with_labels.match(line) + if match: + metric_name, labels_part, value_str = match.groups() + try: + value = float(value_str) + if metric_name not in metrics: + metrics[metric_name] = {} + metrics[metric_name][f'{{{labels_part}}}'] = value + except ValueError: + continue + else: + # Try simple metric without labels + match = metric_simple.match(line) + if match: + metric_name, value_str = match.groups() + try: + value = float(value_str) + if metric_name not in metrics: + metrics[metric_name] = {} + metrics[metric_name][''] = value + except ValueError: + continue + + return metrics + except Exception as e: + pytest.fail(f"Failed to fetch Prometheus metrics: {e}") + return {} + + +def get_engine_request_counts( + metrics: dict[str, dict[str, float]]) -> dict[str, float]: + """Extract request counts per engine from Prometheus metrics. + + Returns: + Dict mapping engine indices to request counts. + For example: {"0": 15.0, "1": 12.0} + """ + engine_counts = {} + + # Look for request success metrics with engine labels + success_metrics = metrics.get("vllm:request_success_total", {}) + engine_pattern = re.compile(r'engine="([^"]*)"') + + for labels, count in success_metrics.items(): + # Extract engine ID from labels using regex + match = engine_pattern.search(labels) + if match: + engine_id = match.group(1) + if engine_id not in engine_counts: + engine_counts[engine_id] = 0.0 + engine_counts[engine_id] += count + + return engine_counts + + +def check_request_balancing(server: RemoteOpenAIServer, dp_size: int): + """Check request balancing via Prometheus metrics if dp_size > 1. + + Args: + server: The RemoteOpenAIServer instance + dp_size: Number of data parallel ranks + """ + if dp_size <= 1: + return + + # Get metrics after all requests are completed + metrics = get_prometheus_metrics(server) + engine_counts = get_engine_request_counts(metrics) + + # Check that multiple engines received requests + engines_with_requests = [ + engine for engine, count in engine_counts.items() if count > 0 + ] + assert len(engines_with_requests) == dp_size, ( + f"Expected requests to be distributed across multiple engines," + f" but only engine(s) {engines_with_requests} received " + f"requests. 
Engine counts: {engine_counts}") + + # Verify that the load is reasonably balanced + # (no engine should handle all requests) + total_requests = sum(engine_counts.values()) + + for count in engine_counts.values(): + assert count > total_requests // (dp_size + 1), ( + f"requests are imbalanced: {engine_counts}") From df27e04b90897bc41391ffc1c4562267fe35c9c6 Mon Sep 17 00:00:00 2001 From: Christian Pinto Date: Wed, 23 Jul 2025 19:00:23 +0100 Subject: [PATCH 290/552] [Core][Model] PrithviMAE Enablement on vLLM v1 engine (#20577) Signed-off-by: Christian Pinto Signed-off-by: x22x22 --- .../prithvi_geospatial_mae.py | 245 ++++-------- requirements/test.in | 1 + requirements/test.txt | 374 +++++++++++++++++- .../multimodal/pooling/test_prithvi_mae.py | 63 +++ vllm/config.py | 6 +- vllm/engine/llm_engine.py | 10 +- vllm/model_executor/models/interfaces.py | 34 ++ .../models/prithvi_geospatial_mae.py | 74 +++- vllm/model_executor/models/registry.py | 13 +- vllm/multimodal/registry.py | 2 +- vllm/v1/engine/async_llm.py | 17 +- vllm/v1/engine/llm_engine.py | 13 +- vllm/v1/engine/output_processor.py | 18 +- vllm/v1/engine/processor.py | 12 +- vllm/v1/worker/gpu_model_runner.py | 60 +++ 15 files changed, 704 insertions(+), 238 deletions(-) create mode 100644 tests/models/multimodal/pooling/test_prithvi_mae.py diff --git a/examples/offline_inference/prithvi_geospatial_mae.py b/examples/offline_inference/prithvi_geospatial_mae.py index 6dc03e85baa..4fdc7a3cf70 100644 --- a/examples/offline_inference/prithvi_geospatial_mae.py +++ b/examples/offline_inference/prithvi_geospatial_mae.py @@ -1,122 +1,27 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -""" -This is a demo script showing how to use the -PrithviGeospatialMAE model with vLLM -This script is based on: https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11/blob/main/inference.py # noqa - -Target model weights: https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11/resolve/main/Prithvi-EO-V2-300M-TL-Sen1Floods11.pt # noqa - -The requirements for running this script are: -- Installing [terratorch, albumentations, rasterio] in your python environment -- downloading the model weights in a 'model' folder local to the script - (temporary measure until the proper config.json file is uploaded to HF) -- download an input example image (India_900498_S2Hand.tif) and place it in - the same folder with the script (or specify with the --data_file argument) - -Run the example: -python prithvi_geospatial_mae.py - -""" # noqa: E501 - import argparse import datetime import os +import re from typing import Union import albumentations import numpy as np import rasterio -import regex as re import torch from einops import rearrange from terratorch.datamodules import Sen1Floods11NonGeoDataModule from vllm import LLM +torch.set_default_dtype(torch.float16) + NO_DATA = -9999 NO_DATA_FLOAT = 0.0001 OFFSET = 0 PERCENTILE = 99 -model_config = """{ - "architectures": ["PrithviGeoSpatialMAE"], - "num_classes": 0, - "pretrained_cfg": { - "task_args": { - "task": "SemanticSegmentationTask", - "model_factory": "EncoderDecoderFactory", - "loss": "ce", - "ignore_index": -1, - "lr": 0.001, - "freeze_backbone": false, - "freeze_decoder": false, - "plot_on_val": 10, - "optimizer": "AdamW", - "scheduler": "CosineAnnealingLR" - }, - "model_args": { - "backbone_pretrained": false, - "backbone": "prithvi_eo_v2_300_tl", - "decoder": "UperNetDecoder", - "decoder_channels": 256, - 
"decoder_scale_modules": true, - "num_classes": 2, - "rescale": true, - "backbone_bands": [ - "BLUE", - "GREEN", - "RED", - "NIR_NARROW", - "SWIR_1", - "SWIR_2" - ], - "head_dropout": 0.1, - "necks": [ - { - "name": "SelectIndices", - "indices": [ - 5, - 11, - 17, - 23 - ] - }, - { - "name": "ReshapeTokensToImage" - } - ] - }, - "optimizer_params" : { - "lr": 5.0e-05, - "betas": [0.9, 0.999], - "eps": [1.0e-08], - "weight_decay": 0.05, - "amsgrad": false, - "maximize": false, - "capturable": false, - "differentiable": false - }, - "scheduler_params" : { - "T_max": 50, - "eta_min": 0, - "last_epoch": -1, - "verbose": "deprecated" - } - }, - - - "torch_dtype": "float32" -} -""" - -# Temporarily creating the "config.json" for the model. -# This is going to disappear once the correct config.json is available on HF -with open( - os.path.join(os.path.dirname(__file__), "./model/config.json"), "w" -) as config_file: - config_file.write(model_config) - datamodule_config = { "bands": ["BLUE", "GREEN", "RED", "NIR_NARROW", "SWIR_1", "SWIR_2"], "batch_size": 16, @@ -138,28 +43,24 @@ class PrithviMAE: - def __init__(self): - print("Initializing PrithviMAE model") - self.llm = LLM( - model=os.path.join(os.path.dirname(__file__), "./model"), - skip_tokenizer_init=True, - dtype="float32", + def __init__(self, model): + self.model = LLM( + model=model, skip_tokenizer_init=True, dtype="float16", enforce_eager=True ) def run(self, input_data, location_coords): - print("################ Running inference on vLLM ##############") # merge the inputs into one data structure + if input_data is not None and input_data.dtype == torch.float32: + input_data = input_data.to(torch.float16) + input_data = input_data[0] + mm_data = { - "pixel_values": torch.empty(0) if input_data is None else input_data, - "location_coords": torch.empty(0) - if location_coords is None - else location_coords, + "pixel_values": input_data, + "location_coords": location_coords, } prompt = {"prompt_token_ids": [1], "multi_modal_data": mm_data} - - outputs = self.llm.encode(prompt, use_tqdm=False) - print("################ Inference done (it took seconds) ##############") + outputs = self.model.encode(prompt, use_tqdm=False) return outputs[0].outputs.data @@ -181,11 +82,12 @@ def process_channel_group(orig_img, channels): """ Args: orig_img: torch.Tensor representing original image (reference) - with shape = (bands, H, W). + with shape = (bands, H, W). channels: list of indices representing RGB channels. Returns: - torch.Tensor with shape (num_channels, height, width) for original image + torch.Tensor with shape (num_channels, height, width) + for original image """ orig_img = orig_img[channels, ...] @@ -260,10 +162,10 @@ def load_example( Args: file_paths: list of file paths . - mean: list containing mean values for each band in the images - in *file_paths*. - std: list containing std values for each band in the images - in *file_paths*. + mean: list containing mean values for each band in the + images in *file_paths*. + std: list containing std values for each band in the + images in *file_paths*. 
Returns: np.array containing created example @@ -308,7 +210,7 @@ def load_example( print(f"Could not extract timestamp for {file} ({e})") imgs = np.stack(imgs, axis=0) # num_frames, H, W, C - imgs = np.moveaxis(imgs, -1, 0).astype("float32") + imgs = np.moveaxis(imgs, -1, 0).astype("float32") # C, num_frames, H, W imgs = np.expand_dims(imgs, axis=0) # add batch di return imgs, temporal_coords, location_coords, metas @@ -332,8 +234,10 @@ def run_model( ) # Build sliding window + batch_size = 1 - batch = torch.tensor(input_data, device="cpu") + # batch = torch.tensor(input_data, device="cpu") + batch = torch.tensor(input_data) windows = batch.unfold(3, img_size, img_size).unfold(4, img_size, img_size) h1, w1 = windows.shape[3:5] windows = rearrange( @@ -344,18 +248,16 @@ def run_model( num_batches = windows.shape[0] // batch_size if windows.shape[0] > batch_size else 1 windows = torch.tensor_split(windows, num_batches, dim=0) - device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") - if temporal_coords: - temporal_coords = torch.tensor(temporal_coords, device=device).unsqueeze(0) + temporal_coords = torch.tensor(temporal_coords).unsqueeze(0) else: temporal_coords = None if location_coords: - location_coords = torch.tensor(location_coords[0], device=device).unsqueeze(0) + location_coords = torch.tensor(location_coords[0]).unsqueeze(0) else: location_coords = None - # Run model + # Run Prithvi-EO-V2-300M-TL-Sen1Floods11 pred_imgs = [] for x in windows: # Apply standardization @@ -363,15 +265,7 @@ def run_model( x = datamodule.aug(x)["image"] with torch.no_grad(): - x = x.to(device) pred = model.run(x, location_coords=location_coords) - if lightning_model: - pred_lightning = lightning_model( - x, temporal_coords=temporal_coords, location_coords=location_coords - ) - pred_lightning = pred_lightning.output.detach().cpu() - if not torch.equal(pred, pred_lightning): - print("Inference output is not equal") y_hat = pred.argmax(dim=1) y_hat = torch.nn.functional.interpolate( @@ -403,52 +297,18 @@ def run_model( return pred_imgs -def parse_args(): - parser = argparse.ArgumentParser("MAE run inference", add_help=False) - - parser.add_argument( - "--data_file", - type=str, - default="./India_900498_S2Hand.tif", - help="Path to the file.", - ) - parser.add_argument( - "--output_dir", - type=str, - default="output", - help="Path to the directory where to save outputs.", - ) - parser.add_argument( - "--input_indices", - default=[1, 2, 3, 8, 11, 12], - type=int, - nargs="+", - help="0-based indices of the six Prithvi channels to be selected from the " - "input. By default selects [1,2,3,8,11,12] for S2L1C data.", - ) - parser.add_argument( - "--rgb_outputs", - action="store_true", - help="If present, output files will only contain RGB channels. 
" - "Otherwise, all bands will be saved.", - ) - - def main( data_file: str, + model: str, output_dir: str, rgb_outputs: bool, input_indices: list[int] = None, ): os.makedirs(output_dir, exist_ok=True) - # Load model --------------------------------------------------------------- - - model_obj = PrithviMAE() + model_obj = PrithviMAE(model=model) datamodule = generate_datamodule() - img_size = 256 # Size of Sen1Floods11 - - # Loading data ------------------------------------------------------------- + img_size = 512 # Size of Sen1Floods11 input_data, temporal_coords, location_coords, meta_data = load_example( file_paths=[data_file], @@ -460,8 +320,6 @@ def main( if input_data.mean() > 1: input_data = input_data / 10000 # Convert to range 0-1 - # Running model ------------------------------------------------------------ - channels = [ datamodule_config["bands"].index(b) for b in ["RED", "GREEN", "BLUE"] ] # BGR -> RGB @@ -469,7 +327,6 @@ def main( pred = run_model( input_data, temporal_coords, location_coords, model_obj, datamodule, img_size ) - # Save pred meta_data.update(count=1, dtype="uint8", compress="lzw", nodata=0) pred_file = os.path.join( @@ -487,6 +344,7 @@ def main( orig_img=torch.Tensor(input_data[0, :, 0, ...]), channels=channels, ) + rgb_orig = rgb_orig.to(torch.float32) pred[pred == 0.0] = np.nan img_pred = rgb_orig * 0.7 + pred * 0.3 @@ -503,9 +361,10 @@ def main( # Save image rgb if rgb_outputs: + name_suffix = os.path.splitext(os.path.basename(data_file))[0] rgb_file = os.path.join( output_dir, - f"original_rgb_{os.path.splitext(os.path.basename(data_file))[0]}.tiff", + f"original_rgb_{name_suffix}.tiff", ) save_geotiff( image=_convert_np_uint8(rgb_orig), @@ -515,6 +374,42 @@ def main( if __name__ == "__main__": - args = parse_args() + parser = argparse.ArgumentParser("MAE run inference", add_help=False) + + parser.add_argument( + "--data_file", + type=str, + default="./India_900498_S2Hand.tif", + help="Path to the file.", + ) + parser.add_argument( + "--model", + type=str, + default="christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM", + help="Path to a checkpoint file to load from.", + ) + parser.add_argument( + "--output_dir", + type=str, + default="output", + help="Path to the directory where to save outputs.", + ) + parser.add_argument( + "--input_indices", + default=[1, 2, 3, 8, 11, 12], + type=int, + nargs="+", + help=""" + 0-based indices of the six Prithvi channels to be selected from the input. + By default selects [1,2,3,8,11,12] for S2L1C data. + """, + ) + parser.add_argument( + "--rgb_outputs", + action="store_true", + help="If present, output files will only contain RGB channels. 
" + "Otherwise, all bands will be saved.", + ) + args = parser.parse_args() main(**vars(args)) diff --git a/requirements/test.in b/requirements/test.in index c6c68891d6a..9f66e2d6919 100644 --- a/requirements/test.in +++ b/requirements/test.in @@ -54,3 +54,4 @@ runai-model-streamer==0.11.0 runai-model-streamer-s3==0.11.0 fastsafetensors>=0.1.10 pydantic>=2.10 # 2.9 leads to error on python 3.10 +terratorch==1.1rc2 # required for PrithviMAE test \ No newline at end of file diff --git a/requirements/test.txt b/requirements/test.txt index aadbab03f6f..a2b230102d4 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -6,6 +6,10 @@ accelerate==1.0.1 # via # lm-eval # peft +aenum==3.1.16 + # via lightly +affine==2.4.0 + # via rasterio aiohappyeyeballs==2.4.3 # via aiohttp aiohttp==3.10.11 @@ -21,8 +25,18 @@ aiosignal==1.3.1 # via # aiohttp # ray +albucore==0.0.16 + # via terratorch +albumentations==1.4.6 + # via terratorch +alembic==1.16.4 + # via mlflow annotated-types==0.7.0 # via pydantic +antlr4-python3-runtime==4.9.3 + # via + # hydra-core + # omegaconf anyio==4.6.2.post1 # via # httpx @@ -34,10 +48,12 @@ arrow==1.3.0 attrs==24.2.0 # via # aiohttp + # fiona # hypothesis # jsonlines # jsonschema # pytest-subtests + # rasterio # referencing audioread==3.0.1 # via librosa @@ -46,9 +62,13 @@ backoff==2.2.1 # -r requirements/test.in # schemathesis bitsandbytes==0.46.1 - # via -r requirements/test.in + # via + # -r requirements/test.in + # lightning black==24.10.0 # via datamodel-code-generator +blinker==1.9.0 + # via flask blobfile==3.0.0 # via -r requirements/test.in bm25s==0.2.13 @@ -64,11 +84,18 @@ bounded-pool-executor==0.0.3 buildkite-test-collector==0.1.9 # via -r requirements/test.in cachetools==5.5.2 - # via google-auth + # via + # google-auth + # mlflow-skinny certifi==2024.8.30 # via + # fiona # httpcore # httpx + # lightly + # pyogrio + # pyproj + # rasterio # requests cffi==1.17.1 # via soundfile @@ -79,11 +106,28 @@ charset-normalizer==3.4.0 click==8.1.7 # via # black + # click-plugins + # cligj + # fiona + # flask # jiwer + # mlflow-skinny # nltk + # rasterio # ray # schemathesis # typer + # uvicorn +click-plugins==1.1.1.2 + # via + # fiona + # rasterio +cligj==0.7.2 + # via + # fiona + # rasterio +cloudpickle==3.1.1 + # via mlflow-skinny colorama==0.4.6 # via # sacrebleu @@ -99,6 +143,8 @@ cupy-cuda12x==13.3.0 # via ray cycler==0.12.1 # via matplotlib +databricks-sdk==0.59.0 + # via mlflow-skinny datamodel-code-generator==0.26.3 # via -r requirements/test.in dataproperty==1.0.1 @@ -122,13 +168,21 @@ distlib==0.3.9 # via virtualenv dnspython==2.7.0 # via email-validator +docker==7.1.0 + # via mlflow docopt==0.6.2 # via num2words -einops==0.8.0 +docstring-parser==0.17.0 + # via jsonargparse +efficientnet-pytorch==0.7.1 + # via segmentation-models-pytorch +einops==0.8.1 # via # -r requirements/test.in # encodec # mamba-ssm + # terratorch + # torchgeo # vector-quantize-pytorch # vocos einx==0.3.0 @@ -141,6 +195,8 @@ eval-type-backport==0.2.2 # via mteb evaluate==0.4.3 # via lm-eval +fastapi==0.116.1 + # via mlflow-skinny fastparquet==2024.11.0 # via genai-perf fastrlock==0.8.2 @@ -156,6 +212,10 @@ filelock==3.16.1 # torch # transformers # virtualenv +fiona==1.10.1 + # via torchgeo +flask==3.1.1 + # via mlflow fonttools==4.54.1 # via matplotlib fqdn==1.5.1 @@ -173,6 +233,8 @@ fsspec==2024.9.0 # evaluate # fastparquet # huggingface-hub + # lightning + # pytorch-lightning # torch ftfy==6.3.1 # via open-clip-torch @@ -180,18 +242,41 @@ genai-perf==0.0.8 # via -r 
requirements/test.in genson==1.3.0 # via datamodel-code-generator +geopandas==1.0.1 + # via terratorch +gitdb==4.0.12 + # via gitpython +gitpython==3.1.44 + # via mlflow-skinny google-api-core==2.24.2 # via opencensus google-auth==2.40.2 - # via google-api-core + # via + # databricks-sdk + # google-api-core googleapis-common-protos==1.70.0 # via google-api-core +graphene==3.4.3 + # via mlflow graphql-core==3.2.6 - # via hypothesis-graphql + # via + # graphene + # graphql-relay + # hypothesis-graphql +graphql-relay==3.2.0 + # via graphene +greenlet==3.2.3 + # via sqlalchemy grpcio==1.71.0 # via ray +gunicorn==23.0.0 + # via mlflow h11==0.14.0 - # via httpcore + # via + # httpcore + # uvicorn +h5py==3.13.0 + # via terratorch harfile==0.3.0 # via schemathesis hf-xet==1.1.3 @@ -204,7 +289,7 @@ httpx==0.27.2 # via # -r requirements/test.in # schemathesis -huggingface-hub==0.33.0 +huggingface-hub==0.33.1 # via # -r requirements/test.in # accelerate @@ -212,13 +297,19 @@ huggingface-hub==0.33.0 # evaluate # open-clip-torch # peft + # segmentation-models-pytorch # sentence-transformers + # terratorch # timm # tokenizers # transformers # vocos humanize==4.11.0 # via runai-model-streamer +hydra-core==1.3.2 + # via + # lightly + # lightning hypothesis==6.131.0 # via # hypothesis-graphql @@ -236,6 +327,14 @@ idna==3.10 # jsonschema # requests # yarl +imageio==2.37.0 + # via scikit-image +importlib-metadata==8.7.0 + # via + # mlflow-skinny + # opentelemetry-api +importlib-resources==6.5.2 + # via typeshed-client inflect==5.6.2 # via datamodel-code-generator iniconfig==2.0.0 @@ -244,9 +343,13 @@ isoduration==20.11.0 # via jsonschema isort==5.13.2 # via datamodel-code-generator +itsdangerous==2.2.0 + # via flask jinja2==3.1.6 # via # datamodel-code-generator + # flask + # mlflow # torch jiwer==3.0.5 # via -r requirements/test.in @@ -259,6 +362,10 @@ joblib==1.4.2 # librosa # nltk # scikit-learn +jsonargparse==4.35.0 + # via + # lightning + # terratorch jsonlines==4.0.0 # via lm-eval jsonpointer==3.0.0 @@ -277,12 +384,33 @@ kaleido==0.2.1 # via genai-perf kiwisolver==1.4.7 # via matplotlib +kornia==0.8.1 + # via torchgeo +kornia-rs==0.1.9 + # via kornia lazy-loader==0.4 - # via librosa + # via + # librosa + # scikit-image libnacl==2.1.0 # via tensorizer librosa==0.10.2.post1 # via -r requirements/test.in +lightly==1.5.20 + # via + # terratorch + # torchgeo +lightly-utils==0.0.2 + # via lightly +lightning==2.5.1.post0 + # via + # terratorch + # torchgeo +lightning-utilities==0.14.3 + # via + # lightning + # pytorch-lightning + # torchmetrics llvmlite==0.44.0 # via numba lm-eval==0.4.8 @@ -291,16 +419,27 @@ lxml==5.3.0 # via # blobfile # sacrebleu +mako==1.3.10 + # via alembic mamba-ssm==2.2.4 # via -r requirements/test.in +markdown==3.8.2 + # via mlflow markdown-it-py==3.0.0 # via rich markupsafe==3.0.1 # via + # flask # jinja2 + # mako # werkzeug matplotlib==3.9.2 - # via -r requirements/test.in + # via + # -r requirements/test.in + # lightning + # mlflow + # pycocotools + # torchgeo mbstrdecoder==1.1.3 # via # dataproperty @@ -310,6 +449,10 @@ mdurl==0.1.2 # via markdown-it-py mistral-common==1.8.0 # via -r requirements/test.in +mlflow==2.22.0 + # via terratorch +mlflow-skinny==2.22.0 + # via mlflow more-itertools==10.5.0 # via lm-eval mpmath==1.3.0 @@ -328,10 +471,14 @@ multiprocess==0.70.16 # via # datasets # evaluate +munch==4.0.0 + # via pretrainedmodels mypy-extensions==1.0.0 # via black networkx==3.2.1 - # via torch + # via + # scikit-image + # torch ninja==1.11.1.3 # via mamba-ssm nltk==3.9.1 @@ 
-348,6 +495,8 @@ numpy==1.26.4 # via # -r requirements/test.in # accelerate + # albucore + # albumentations # bitsandbytes # bm25s # contourpy @@ -358,9 +507,15 @@ numpy==1.26.4 # evaluate # fastparquet # genai-perf + # geopandas + # h5py + # imageio # librosa + # lightly + # lightly-utils # matplotlib # mistral-common + # mlflow # mteb # numba # numexpr @@ -368,18 +523,30 @@ numpy==1.26.4 # pandas # patsy # peft + # pycocotools + # pyogrio + # rasterio + # rioxarray # rouge-score # runai-model-streamer # sacrebleu + # scikit-image # scikit-learn # scipy + # segmentation-models-pytorch + # shapely # soxr # statsmodels + # tensorboardx # tensorizer + # tifffile + # torchgeo + # torchmetrics # torchvision # transformers # tritonclient # vocos + # xarray nvidia-cublas-cu12==12.8.3.14 # via # nvidia-cudnn-cu12 @@ -417,6 +584,10 @@ nvidia-nvjitlink-cu12==12.8.61 # torch nvidia-nvtx-cu12==12.8.55 # via torch +omegaconf==2.3.0 + # via + # hydra-core + # lightning open-clip-torch==2.32.0 # via -r requirements/test.in opencensus==0.11.4 @@ -426,7 +597,18 @@ opencensus-context==0.1.3 opencv-python-headless==4.11.0.86 # via # -r requirements/test.in + # albucore + # albumentations # mistral-common +opentelemetry-api==1.35.0 + # via + # mlflow-skinny + # opentelemetry-sdk + # opentelemetry-semantic-conventions +opentelemetry-sdk==1.35.0 + # via mlflow-skinny +opentelemetry-semantic-conventions==0.56b0 + # via opentelemetry-sdk packaging==24.2 # via # accelerate @@ -435,26 +617,44 @@ packaging==24.2 # datasets # evaluate # fastparquet + # geopandas + # gunicorn # huggingface-hub + # hydra-core + # kornia # lazy-loader + # lightning + # lightning-utilities # mamba-ssm # matplotlib + # mlflow-skinny # peft # plotly # pooch + # pyogrio # pytest # pytest-rerunfailures + # pytorch-lightning # ray + # rioxarray + # scikit-image # statsmodels + # tensorboardx + # torchmetrics # transformers # typepy + # xarray pandas==2.2.3 # via # datasets # evaluate # fastparquet # genai-perf + # geopandas + # mlflow # statsmodels + # torchgeo + # xarray pathspec==0.12.1 # via black pathvalidate==3.2.1 @@ -468,9 +668,14 @@ peft==0.13.2 pillow==10.4.0 # via # genai-perf + # imageio + # lightly-utils # matplotlib # mistral-common + # scikit-image + # segmentation-models-pytorch # sentence-transformers + # torchgeo # torchvision platformdirs==4.3.6 # via @@ -489,6 +694,8 @@ portalocker==2.10.1 # via sacrebleu pqdm==0.2.0 # via -r requirements/test.in +pretrainedmodels==0.7.4 + # via segmentation-models-pytorch prometheus-client==0.22.0 # via ray propcache==0.2.0 @@ -499,8 +706,10 @@ protobuf==5.28.3 # via # google-api-core # googleapis-common-protos + # mlflow-skinny # proto-plus # ray + # tensorboardx # tensorizer psutil==6.1.0 # via @@ -515,6 +724,7 @@ pyarrow==18.0.0 # via # datasets # genai-perf + # mlflow pyasn1==0.6.1 # via # pyasn1-modules @@ -523,6 +733,8 @@ pyasn1-modules==0.4.2 # via google-auth pybind11==2.13.6 # via lm-eval +pycocotools==2.0.8 + # via terratorch pycountry==24.6.1 # via pydantic-extra-types pycparser==2.22 @@ -532,8 +744,12 @@ pycryptodomex==3.22.0 pydantic==2.11.5 # via # -r requirements/test.in + # albumentations # datamodel-code-generator + # fastapi + # lightly # mistral-common + # mlflow-skinny # mteb # pydantic-extra-types # ray @@ -543,15 +759,24 @@ pydantic-extra-types==2.10.5 # via mistral-common pygments==2.18.0 # via rich +pyogrio==0.11.0 + # via geopandas pyparsing==3.2.0 - # via matplotlib + # via + # matplotlib + # rasterio +pyproj==3.7.1 + # via + # geopandas + # rioxarray + # 
torchgeo pyrate-limiter==3.7.0 # via schemathesis pystemmer==3.0.0 # via mteb pytablewriter==1.2.0 # via lm-eval -pytest==8.3.3 +pytest==8.3.5 # via # -r requirements/test.in # buildkite-test-collector @@ -564,6 +789,7 @@ pytest==8.3.3 # pytest-subtests # pytest-timeout # schemathesis + # terratorch pytest-asyncio==0.24.0 # via -r requirements/test.in pytest-forked==1.6.0 @@ -578,15 +804,23 @@ pytest-subtests==0.14.1 # via schemathesis pytest-timeout==2.3.1 # via -r requirements/test.in +python-box==7.3.2 + # via terratorch python-dateutil==2.9.0.post0 # via # arrow # botocore + # graphene + # lightly # matplotlib # pandas # typepy python-rapidjson==1.20 # via tritonclient +pytorch-lightning==2.5.2 + # via + # lightly + # lightning pytrec-eval-terrier==0.5.7 # via mteb pytz==2024.2 @@ -596,11 +830,17 @@ pytz==2024.2 pyyaml==6.0.2 # via # accelerate + # albumentations # datamodel-code-generator # datasets # genai-perf # huggingface-hub + # jsonargparse + # lightning + # mlflow-skinny + # omegaconf # peft + # pytorch-lightning # ray # responses # schemathesis @@ -609,6 +849,11 @@ pyyaml==6.0.2 # vocos rapidfuzz==3.12.1 # via jiwer +rasterio==1.4.3 + # via + # rioxarray + # terratorch + # torchgeo ray==2.43.0 # via -r requirements/test.in redis==5.2.0 @@ -627,12 +872,16 @@ regex==2024.9.11 requests==2.32.3 # via # buildkite-test-collector + # databricks-sdk # datasets + # docker # evaluate # google-api-core # huggingface-hub + # lightly # lm-eval # mistral-common + # mlflow-skinny # mteb # pooch # ray @@ -650,8 +899,11 @@ rfc3987==1.3.8 rich==13.9.4 # via # genai-perf + # lightning # mteb # typer +rioxarray==0.19.0 + # via terratorch rouge-score==0.1.2 # via lm-eval rpds-py==0.20.1 @@ -660,6 +912,8 @@ rpds-py==0.20.1 # referencing rsa==4.9.1 # via google-auth +rtree==1.4.0 + # via torchgeo runai-model-streamer==0.11.0 # via -r requirements/test.in runai-model-streamer-s3==0.11.0 @@ -677,21 +931,32 @@ safetensors==0.4.5 # transformers schemathesis==3.39.15 # via -r requirements/test.in +scikit-image==0.25.2 + # via albumentations scikit-learn==1.5.2 # via + # albumentations # librosa # lm-eval + # mlflow # mteb # sentence-transformers scipy==1.13.1 # via + # albumentations # bm25s # librosa + # mlflow # mteb + # scikit-image # scikit-learn # sentence-transformers # statsmodels # vocos +segmentation-models-pytorch==0.4.0 + # via + # terratorch + # torchgeo sentence-transformers==3.2.1 # via # -r requirements/test.in @@ -700,21 +965,30 @@ sentencepiece==0.2.0 # via mistral-common setuptools==77.0.3 # via + # lightning-utilities # mamba-ssm # pytablewriter # torch # triton +shapely==2.1.1 + # via + # geopandas + # torchgeo shellingham==1.5.4 # via typer six==1.16.0 # via # junit-xml + # lightly # opencensus # python-dateutil # rfc3339-validator # rouge-score + # segmentation-models-pytorch smart-open==7.1.0 # via ray +smmap==5.0.2 + # via gitdb sniffio==1.3.1 # via # anyio @@ -727,10 +1001,17 @@ soundfile==0.12.1 # librosa soxr==0.5.0.post1 # via librosa +sqlalchemy==2.0.41 + # via + # alembic + # mlflow sqlitedict==2.1.0 # via lm-eval +sqlparse==0.5.3 + # via mlflow-skinny starlette==0.46.2 # via + # fastapi # schemathesis # starlette-testclient starlette-testclient==0.4.1 @@ -751,18 +1032,29 @@ tenacity==9.0.0 # via # lm-eval # plotly +tensorboardx==2.6.4 + # via lightning tensorizer==2.10.1 # via -r requirements/test.in +terratorch==1.1rc2 + # via -r requirements/test.in threadpoolctl==3.5.0 # via scikit-learn +tifffile==2025.3.30 + # via + # scikit-image + # terratorch tiktoken==0.7.0 # via # 
lm-eval # mistral-common -timm==1.0.11 +timm==1.0.15 # via # -r requirements/test.in # open-clip-torch + # segmentation-models-pytorch + # terratorch + # torchgeo tokenizers==0.21.1 # via # -r requirements/test.in @@ -776,18 +1068,28 @@ torch==2.7.1+cu128 # -r requirements/test.in # accelerate # bitsandbytes + # efficientnet-pytorch # encodec # fastsafetensors + # kornia + # lightly + # lightning # lm-eval # mamba-ssm # mteb # open-clip-torch # peft + # pretrainedmodels + # pytorch-lightning # runai-model-streamer + # segmentation-models-pytorch # sentence-transformers # tensorizer + # terratorch # timm # torchaudio + # torchgeo + # torchmetrics # torchvision # vector-quantize-pytorch # vocos @@ -796,22 +1098,40 @@ torchaudio==2.7.1+cu128 # -r requirements/test.in # encodec # vocos +torchgeo==0.7.0 + # via terratorch +torchmetrics==1.7.4 + # via + # lightning + # pytorch-lightning + # terratorch + # torchgeo torchvision==0.22.1+cu128 # via # -r requirements/test.in + # lightly # open-clip-torch + # pretrainedmodels + # segmentation-models-pytorch + # terratorch # timm + # torchgeo tqdm==4.66.6 # via # datasets # evaluate # huggingface-hub + # lightly + # lightning # lm-eval # mteb # nltk # open-clip-torch # peft # pqdm + # pretrainedmodels + # pytorch-lightning + # segmentation-models-pytorch # sentence-transformers # tqdm-multiprocess # transformers @@ -843,18 +1163,34 @@ typer==0.15.2 # via fastsafetensors types-python-dateutil==2.9.0.20241206 # via arrow +typeshed-client==2.8.2 + # via jsonargparse typing-extensions==4.12.2 # via + # albumentations + # alembic + # fastapi + # graphene # huggingface-hub # librosa + # lightning + # lightning-utilities # mistral-common + # mlflow-skinny # mteb + # opentelemetry-api + # opentelemetry-sdk + # opentelemetry-semantic-conventions # pqdm # pydantic # pydantic-core # pydantic-extra-types + # pytorch-lightning + # sqlalchemy # torch + # torchgeo # typer + # typeshed-client # typing-inspection typing-inspection==0.4.1 # via pydantic @@ -866,9 +1202,13 @@ urllib3==2.2.3 # via # blobfile # botocore + # docker + # lightly # requests # responses # tritonclient +uvicorn==0.35.0 + # via mlflow-skinny vector-quantize-pytorch==1.21.2 # via -r requirements/test.in virtualenv==20.31.2 @@ -880,11 +1220,15 @@ wcwidth==0.2.13 webcolors==24.11.1 # via jsonschema werkzeug==3.1.3 - # via schemathesis + # via + # flask + # schemathesis word2number==1.1 # via lm-eval wrapt==1.17.2 # via smart-open +xarray==2025.7.1 + # via rioxarray xxhash==3.5.0 # via # datasets @@ -893,5 +1237,7 @@ yarl==1.17.1 # via # aiohttp # schemathesis +zipp==3.23.0 + # via importlib-metadata zstandard==0.23.0 # via lm-eval diff --git a/tests/models/multimodal/pooling/test_prithvi_mae.py b/tests/models/multimodal/pooling/test_prithvi_mae.py new file mode 100644 index 00000000000..f08d83c0821 --- /dev/null +++ b/tests/models/multimodal/pooling/test_prithvi_mae.py @@ -0,0 +1,63 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import pytest +import torch + +from vllm.utils import set_default_torch_num_threads + +from ....conftest import VllmRunner + + +def generate_test_mm_data(): + mm_data = { + "pixel_values": torch.full((6, 512, 512), 1.0, dtype=torch.float16), + "location_coords": torch.full((1, 2), 1.0, dtype=torch.float16), + } + return mm_data + + +def _run_test( + vllm_runner: type[VllmRunner], + model: str, +) -> None: + + prompt = [ + { + # This model deals with no text input + "prompt_token_ids": [1], + 
"multi_modal_data": generate_test_mm_data(), + } for _ in range(10) + ] + + with ( + set_default_torch_num_threads(1), + vllm_runner( + model, + task="embed", + dtype=torch.float16, + enforce_eager=True, + skip_tokenizer_init=True, + # Limit the maximum number of sequences to avoid the + # test going OOM during the warmup run + max_num_seqs=32, + ) as vllm_model, + ): + vllm_model.encode(prompt) + + +MODELS = ["christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM"] + + +@pytest.mark.core_model +@pytest.mark.parametrize("model", MODELS) +def test_models_image( + hf_runner, + vllm_runner, + image_assets, + model: str, +) -> None: + _run_test( + vllm_runner, + model, + ) diff --git a/vllm/config.py b/vllm/config.py index ccc9708a3ab..a844e771cd9 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -651,6 +651,8 @@ def __post_init__(self) -> None: self.original_max_model_len = self.max_model_len self.max_model_len = self.get_and_verify_max_len(self.max_model_len) self.multimodal_config = self._init_multimodal_config() + self.model_supports_multimodal_raw_input = ( + self.registry.supports_multimodal_raw_input(self.architectures)) if not self.skip_tokenizer_init: self._verify_tokenizer_mode() @@ -1243,10 +1245,10 @@ def get_sliding_window(self) -> Optional[Union[int, list[Optional[int]]]]: return self.get_hf_config_sliding_window() def get_vocab_size(self) -> int: - return self.hf_text_config.vocab_size + return getattr(self.hf_text_config, "vocab_size", 0) def get_hidden_size(self) -> int: - return self.hf_text_config.hidden_size + return getattr(self.hf_text_config, "hidden_size", 0) @property def is_deepseek_mla(self) -> bool: diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py index e2f8de1990b..3081995e693 100644 --- a/vllm/engine/llm_engine.py +++ b/vllm/engine/llm_engine.py @@ -238,14 +238,14 @@ def __init__( self.log_stats = log_stats self.use_cached_outputs = use_cached_outputs - if not self.model_config.skip_tokenizer_init: - self.tokenizer = self._init_tokenizer() - self.detokenizer = Detokenizer(self.tokenizer) - tokenizer_group = self.get_tokenizer_group() - else: + if self.model_config.skip_tokenizer_init: self.tokenizer = None self.detokenizer = None tokenizer_group = None + else: + self.tokenizer = self._init_tokenizer() + self.detokenizer = Detokenizer(self.tokenizer) + tokenizer_group = self.get_tokenizer_group() # Ensure that the function doesn't contain a reference to self, # to avoid engine GC issues diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 8f6a7db7aa8..957b57276b4 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -136,6 +136,40 @@ def supports_multimodal( return getattr(model, "supports_multimodal", False) +@runtime_checkable +class SupportsMultiModalWithRawInput(SupportsMultiModal, Protocol): + """The interface required for all multi-modal models.""" + + supports_multimodal_raw_input: ClassVar[Literal[True]] = True + """ + A flag that indicates this model supports multi-modal inputs and processes + them in their raw form and not embeddings. + + Note: + There is no need to redefine this flag if this class is in the + MRO of your model class. + """ + + +@overload +def supports_multimodal_raw_input( + model: object) -> TypeIs[SupportsMultiModalWithRawInput]: + ... + + +@overload +def supports_multimodal_raw_input( + model: type[object]) -> TypeIs[type[SupportsMultiModalWithRawInput]]: + ... 
+ + +def supports_multimodal_raw_input( + model: Union[type[object], object] +) -> Union[TypeIs[type[SupportsMultiModalWithRawInput]], + TypeIs[SupportsMultiModalWithRawInput]]: + return getattr(model, "supports_multimodal_raw_input", False) + + @runtime_checkable class SupportsScoreTemplate(Protocol): """The interface required for all models that support score template.""" diff --git a/vllm/model_executor/models/prithvi_geospatial_mae.py b/vllm/model_executor/models/prithvi_geospatial_mae.py index d51fcec07fd..0f00fd47fe4 100644 --- a/vllm/model_executor/models/prithvi_geospatial_mae.py +++ b/vllm/model_executor/models/prithvi_geospatial_mae.py @@ -16,6 +16,7 @@ # See the License for the specific language governing permissions and # limitations under the License. """Inference-only IBM/NASA Prithvi Geospatial model.""" + from collections.abc import Iterable, Mapping, Sequence from typing import Optional, Union @@ -27,13 +28,14 @@ from vllm.model_executor.layers.pooler import (AllPool, PoolerHead, PoolerIdentity, SimplePooler) from vllm.model_executor.model_loader.weight_utils import default_weight_loader -from vllm.model_executor.models.interfaces import (IsAttentionFree, - SupportsMultiModal, - SupportsV0Only) +from vllm.model_executor.models.interfaces import ( + IsAttentionFree, MultiModalEmbeddings, SupportsMultiModalWithRawInput) from vllm.model_executor.models.utils import AutoWeightsLoader from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalFieldConfig, - MultiModalInputs, MultiModalKwargs) + MultiModalFieldElem, MultiModalInputs, + MultiModalKwargs, MultiModalKwargsItem, + MultiModalSharedField, PlaceholderRange) from vllm.multimodal.parse import MultiModalDataItems from vllm.multimodal.processing import (BaseMultiModalProcessor, BaseProcessingInfo, PromptUpdate) @@ -62,8 +64,9 @@ def get_dummy_mm_data( # The size of pixel_values might change in the cases where we resize # the input but never exceeds the dimensions below. return { - "pixel_values": torch.full((1, 6, 512, 512), 1.0), - "location_coords": torch.full((1, 2), 1.0), + "pixel_values": torch.full((6, 512, 512), 1.0, + dtype=torch.float16), + "location_coords": torch.full((1, 2), 1.0, dtype=torch.float16), } @@ -75,8 +78,10 @@ def _get_mm_fields_config( hf_processor_mm_kwargs: Mapping[str, object], ) -> Mapping[str, MultiModalFieldConfig]: return dict( - pixel_values=MultiModalFieldConfig.batched("image"), - location_coords=MultiModalFieldConfig.batched("image"), + pixel_values=MultiModalFieldConfig.shared(batch_size=1, + modality="image"), + location_coords=MultiModalFieldConfig.shared(batch_size=1, + modality="image"), ) def _get_prompt_updates( @@ -99,23 +104,48 @@ def apply( for k, v in mm_data.items(): mm_kwargs[k] = v + mm_placeholders = {"image": [PlaceholderRange(offset=0, length=0)]} + + # This model receives in input a multi-dimensional tensor representing + # a single image patch and therefore it is not to be split + # into multiple elements, but rather to be considered a single one. + # Hence, the decision of using a MultiModalSharedField. + # The expected shape is (num_channels, width, height). + + # This model however allows the user to also submit multiple image + # patches as a batch, adding a further dimension to the above shape. + # At this stage we only support submitting one patch per request and + # batching is achieved via vLLM batching. + # TODO (christian-pinto): enable support for multi patch requests + # in tandem with vLLM batching. 
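+        # (Today a single request therefore carries one patch, e.g. a
+        # (6, 512, 512) pixel_values tensor plus a (1, 2) location_coords
+        # tensor, matching the dummy data defined earlier in this file.)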
+ multimodal_kwargs_items = [ + MultiModalKwargsItem.from_elems([ + MultiModalFieldElem( + modality="image", + key=key, + data=data, + field=MultiModalSharedField(1), + ) for key, data in mm_kwargs.items() + ]) + ] return MultiModalInputs( type="multimodal", prompt=prompt, prompt_token_ids=[1], - mm_kwargs=MultiModalKwargs(mm_kwargs), + mm_kwargs=MultiModalKwargs.from_items(multimodal_kwargs_items), mm_hashes=None, - mm_placeholders={}, + mm_placeholders=mm_placeholders, ) @MULTIMODAL_REGISTRY.register_processor( PrithviGeoSpatialMAEMultiModalProcessor, info=PrithviGeoSpatialMAEProcessingInfo, - dummy_inputs=PrithviGeoSpatialMAEInputBuilder) -class PrithviGeoSpatialMAE(nn.Module, IsAttentionFree, SupportsMultiModal, - SupportsV0Only): + dummy_inputs=PrithviGeoSpatialMAEInputBuilder, +) +class PrithviGeoSpatialMAE(nn.Module, IsAttentionFree, + SupportsMultiModalWithRawInput): """Prithvi Masked Autoencoder""" is_pooling_model = True @@ -128,10 +158,10 @@ def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]: raise ValueError("Only image modality is supported") def _instantiate_model(self, config: dict) -> Optional[nn.Module]: - # We might be able/need to support different tasks with this same model if config["task_args"]["task"] == "SemanticSegmentationTask": from terratorch.cli_tools import SemanticSegmentationTask + task = SemanticSegmentationTask( config["model_args"], config["task_args"]["model_factory"], @@ -144,7 +174,8 @@ def _instantiate_model(self, config: dict) -> Optional[nn.Module]: scheduler_hparams=config["scheduler_params"], plot_on_val=config["task_args"]["plot_on_val"], freeze_decoder=config["task_args"]["freeze_decoder"], - freeze_backbone=config["task_args"]["freeze_backbone"]) + freeze_backbone=config["task_args"]["freeze_backbone"], + ) return task.model else: @@ -168,12 +199,10 @@ def __init__(self, vllm_config: VllmConfig, prefix: str = ""): def _parse_and_validate_multimodal_data( self, **kwargs) -> tuple[torch.Tensor, Optional[torch.Tensor]]: - pixel_values = kwargs.pop("pixel_values", None) if not isinstance(pixel_values, torch.Tensor): raise ValueError(f"Incorrect type of pixel_values. " f"Got type: {type(pixel_values)}") - pixel_values = torch.unbind(pixel_values, dim=0)[0] location_coords = kwargs.pop("location_coords", None) if not isinstance(location_coords, torch.Tensor): @@ -185,6 +214,17 @@ def _parse_and_validate_multimodal_data( return pixel_values, location_coords + def get_input_embeddings( + self, + input_ids: torch.Tensor, + multimodal_embeddings: Optional[MultiModalEmbeddings] = None, + ) -> torch.Tensor: + # We do not really use any input tokens and therefore no embeddings + # to be calculated. However, due to the mandatory token ids in + # the input prompt we pass one token and the size of the dummy + # embedding tensors must reflect that. 
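+        # The tensor below is (num_input_tokens, 0): one row per dummy
+        # token, with zero embedding width.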
+ return torch.empty((input_ids.shape[0], 0)) + def forward( self, input_ids: Optional[torch.Tensor], diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index fafb6a70438..2aaac7798fc 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -22,8 +22,8 @@ from .interfaces import (has_inner_state, has_noops, is_attention_free, is_hybrid, supports_cross_encoding, - supports_multimodal, supports_pp, - supports_transcription, supports_v0_only) + supports_multimodal, supports_multimodal_raw_input, + supports_pp, supports_transcription, supports_v0_only) from .interfaces_base import is_text_generation_model logger = init_logger(__name__) @@ -287,6 +287,7 @@ class _ModelInfo: is_pooling_model: bool supports_cross_encoding: bool supports_multimodal: bool + supports_multimodal_raw_input: bool supports_pp: bool has_inner_state: bool is_attention_free: bool @@ -304,6 +305,7 @@ def from_model_cls(model: type[nn.Module]) -> "_ModelInfo": is_pooling_model=True, # Can convert any model into a pooling model supports_cross_encoding=supports_cross_encoding(model), supports_multimodal=supports_multimodal(model), + supports_multimodal_raw_input=supports_multimodal_raw_input(model), supports_pp=supports_pp(model), has_inner_state=has_inner_state(model), is_attention_free=is_attention_free(model), @@ -573,6 +575,13 @@ def is_multimodal_model( model_cls, _ = self.inspect_model_cls(architectures) return model_cls.supports_multimodal + def supports_multimodal_raw_input( + self, + architectures: Union[str, list[str]], + ) -> bool: + model_cls, _ = self.inspect_model_cls(architectures) + return model_cls.supports_multimodal_raw_input + def is_pp_supported_model( self, architectures: Union[str, list[str]], diff --git a/vllm/multimodal/registry.py b/vllm/multimodal/registry.py index 27aaa661c35..c44fcacd246 100644 --- a/vllm/multimodal/registry.py +++ b/vllm/multimodal/registry.py @@ -266,7 +266,7 @@ def create_processor( if not model_config.is_multimodal_model: raise ValueError(f"{model_config.model} is not a multimodal model") - if tokenizer is None: + if tokenizer is None and not model_config.skip_tokenizer_init: tokenizer = cached_tokenizer_from_config(model_config) if disable_cache is None: mm_config = model_config.get_multimodal_config() diff --git a/vllm/v1/engine/async_llm.py b/vllm/v1/engine/async_llm.py index 79b5d5ae4a2..95a474228d4 100644 --- a/vllm/v1/engine/async_llm.py +++ b/vllm/v1/engine/async_llm.py @@ -94,11 +94,14 @@ def __init__( self.log_requests = log_requests self.log_stats = log_stats - # Tokenizer (+ ensure liveness if running in another process). - self.tokenizer = init_tokenizer_from_configs( - model_config=vllm_config.model_config, - scheduler_config=vllm_config.scheduler_config, - lora_config=vllm_config.lora_config) + if self.model_config.skip_tokenizer_init: + self.tokenizer = None + else: + # Tokenizer (+ ensure liveness if running in another process). + self.tokenizer = init_tokenizer_from_configs( + model_config=vllm_config.model_config, + scheduler_config=vllm_config.scheduler_config, + lora_config=vllm_config.lora_config) # Processor (converts Inputs --> EngineCoreRequests). 
self.processor = Processor( @@ -525,6 +528,10 @@ async def get_tokenizer( self, lora_request: Optional[LoRARequest] = None, ) -> AnyTokenizer: + if self.tokenizer is None: + raise ValueError("Unable to get tokenizer because " + "skip_tokenizer_init is True") + return self.tokenizer.get_lora_tokenizer(lora_request) async def is_tracing_enabled(self) -> bool: diff --git a/vllm/v1/engine/llm_engine.py b/vllm/v1/engine/llm_engine.py index a2328c37ba0..29aca1ad698 100644 --- a/vllm/v1/engine/llm_engine.py +++ b/vllm/v1/engine/llm_engine.py @@ -82,11 +82,14 @@ def __init__( self.dp_group = None self.should_execute_dummy_batch = False - # Tokenizer (+ ensure liveness if running in another process). - self.tokenizer = init_tokenizer_from_configs( - model_config=vllm_config.model_config, - scheduler_config=vllm_config.scheduler_config, - lora_config=vllm_config.lora_config) + if self.model_config.skip_tokenizer_init: + self.tokenizer = None + else: + # Tokenizer (+ ensure liveness if running in another process). + self.tokenizer = init_tokenizer_from_configs( + model_config=vllm_config.model_config, + scheduler_config=vllm_config.scheduler_config, + lora_config=vllm_config.lora_config) # Processor (convert Inputs --> EngineCoreRequests) self.processor = Processor(vllm_config=vllm_config, diff --git a/vllm/v1/engine/output_processor.py b/vllm/v1/engine/output_processor.py index 2bcd61d1f0a..3be6c482121 100644 --- a/vllm/v1/engine/output_processor.py +++ b/vllm/v1/engine/output_processor.py @@ -327,14 +327,16 @@ def add_request( if request_id in self.request_states: raise ValueError(f"Request id {request_id} already running.") - req_state = RequestState.from_new_request( - tokenizer=self.tokenizer.get_lora_tokenizer(request.lora_request), - request=request, - prompt=prompt, - parent_req=parent_req, - request_index=request_index, - queue=queue, - log_stats=self.log_stats) + tokenizer = None if not self.tokenizer else \ + self.tokenizer.get_lora_tokenizer(request.lora_request) + + req_state = RequestState.from_new_request(tokenizer=tokenizer, + request=request, + prompt=prompt, + parent_req=parent_req, + request_index=request_index, + queue=queue, + log_stats=self.log_stats) self.request_states[request_id] = req_state self.lora_states.add_request(req_state) if parent_req: diff --git a/vllm/v1/engine/processor.py b/vllm/v1/engine/processor.py index 7af4ed54a22..725152f978d 100644 --- a/vllm/v1/engine/processor.py +++ b/vllm/v1/engine/processor.py @@ -380,7 +380,6 @@ def _validate_model_input( prompt_type: Literal["encoder", "decoder"], ): model_config = self.model_config - tokenizer = self.tokenizer.get_lora_tokenizer(lora_request) prompt_ids = prompt_inputs["prompt_token_ids"] if not prompt_ids: @@ -389,9 +388,14 @@ def _validate_model_input( else: raise ValueError(f"The {prompt_type} prompt cannot be empty") - max_input_id = max(prompt_ids, default=0) - if max_input_id > tokenizer.max_token_id: - raise ValueError(f"Token id {max_input_id} is out of vocabulary") + if self.model_config.skip_tokenizer_init: + tokenizer = None + else: + tokenizer = self.tokenizer.get_lora_tokenizer(lora_request) + max_input_id = max(prompt_ids, default=0) + if max_input_id > tokenizer.max_token_id: + raise ValueError( + f"Token id {max_input_id} is out of vocabulary") max_prompt_len = self.model_config.max_model_len if len(prompt_ids) > max_prompt_len: diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 2078fedac92..864cf91e785 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ 
b/vllm/v1/worker/gpu_model_runner.py @@ -126,6 +126,8 @@ def __init__( self.is_multimodal_model = model_config.is_multimodal_model self.is_pooling_model = model_config.pooler_config is not None + self.model_supports_multimodal_raw_input = ( + model_config.model_supports_multimodal_raw_input) self.max_model_len = model_config.max_model_len self.max_num_tokens = scheduler_config.max_num_batched_tokens self.max_num_reqs = scheduler_config.max_num_seqs @@ -328,6 +330,14 @@ def _may_reorder_batch(self, scheduler_output: "SchedulerOutput") -> None: Args: scheduler_output: The scheduler output. """ + # Attention free models have zero kv_cache_goups, however models + # like Mamba are also attention free but use the kv_cache for + # keeping its internal state. This is why we check the number + # of kv_cache groups instead of solely checking + # for self.model_config.is_attention_free. + if len(self.kv_cache_config.kv_cache_groups) == 0: + return + self.attn_metadata_builders[0].reorder_batch(self.input_batch, scheduler_output) @@ -565,6 +575,38 @@ def _update_states(self, scheduler_output: "SchedulerOutput") -> None: # Refresh batch metadata with any pending updates. self.input_batch.refresh_metadata() + def _init_model_kwargs_for_multimodal_model( + self, + scheduler_output: Optional["SchedulerOutput"] = None, + num_reqs: int = -1, + ) -> dict[str, Any]: + + model_kwargs: dict[str, Any] = {} + if self.model_supports_multimodal_raw_input: + # This model requires the raw multimodal data in input. + if scheduler_output: + multi_modal_kwargs_list = [] + for req in scheduler_output.scheduled_new_reqs: + req_mm_inputs = req.mm_inputs + if not isinstance(req_mm_inputs, list): + req_mm_inputs = list(req_mm_inputs) + multi_modal_kwargs_list.extend(req_mm_inputs) + multi_modal_kwargs = MultiModalKwargs.batch( + multi_modal_kwargs_list) + else: + # The only case where SchedulerOutput is None is for + # a dummy run let's get some dummy data. + dummy_data = [ + self.mm_registry.get_decoder_dummy_data( + model_config=self.model_config, + seq_len=1).multi_modal_data for i in range(num_reqs) + ] + multi_modal_kwargs = MultiModalKwargs.batch(dummy_data) + + model_kwargs.update(multi_modal_kwargs) + + return model_kwargs + def _get_cumsum_and_arange( self, num_tokens: np.ndarray, @@ -1359,10 +1401,14 @@ def execute_model( # embeddings), we always use embeddings (rather than token ids) # as input to the multimodal model, even when the input is text. input_ids = self.input_ids[:num_scheduled_tokens] + + model_kwargs = self._init_model_kwargs_for_multimodal_model( + scheduler_output=scheduler_output) inputs_embeds = self.model.get_input_embeddings( input_ids=input_ids, multimodal_embeddings=mm_embeds or None, ) + # TODO(woosuk): Avoid the copy. Optimize. self.inputs_embeds[:num_scheduled_tokens].copy_(inputs_embeds) inputs_embeds = self.inputs_embeds[:num_input_tokens] @@ -1374,6 +1420,7 @@ def execute_model( # then the embedding layer is not included in the CUDA graph. 
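            # So the token ids below are fed to the model directly,
            # inputs_embeds is left as None, and model_kwargs stays empty.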
input_ids = self.input_ids[:num_input_tokens] inputs_embeds = None + model_kwargs = {} if self.uses_mrope: positions = self.mrope_positions[:, :num_input_tokens] else: @@ -1406,6 +1453,10 @@ def execute_model( positions=positions, intermediate_tensors=intermediate_tensors, inputs_embeds=inputs_embeds, + **MultiModalKwargs.as_kwargs( + model_kwargs, + device=self.device, + ), ) self.maybe_wait_for_kv_save() @@ -2084,11 +2135,15 @@ def _dummy_run( num_scheduled_tokens): model = self.model if self.is_multimodal_model: + model_kwargs = self._init_model_kwargs_for_multimodal_model( + num_reqs=num_reqs) input_ids = None inputs_embeds = self.inputs_embeds[:num_tokens] else: input_ids = self.input_ids[:num_tokens] inputs_embeds = None + model_kwargs = {} + if self.uses_mrope: positions = self.mrope_positions[:, :num_tokens] else: @@ -2117,7 +2172,12 @@ def _dummy_run( positions=positions, intermediate_tensors=intermediate_tensors, inputs_embeds=inputs_embeds, + **MultiModalKwargs.as_kwargs( + model_kwargs, + device=self.device, + ), ) + if self.use_aux_hidden_state_outputs: hidden_states, _ = outputs else: From c80b511e16a1db17c89e16ac6f69faff6a55be17 Mon Sep 17 00:00:00 2001 From: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com> Date: Wed, 23 Jul 2025 11:00:47 -0700 Subject: [PATCH 291/552] Add test case for compiling multiple graphs (#21044) Signed-off-by: Yong Hoon Shin Signed-off-by: x22x22 --- .../compile/piecewise/test_multiple_graphs.py | 350 ++++++++++++++++++ vllm/compilation/compiler_interface.py | 6 + vllm/compilation/decorators.py | 35 +- 3 files changed, 390 insertions(+), 1 deletion(-) create mode 100644 tests/compile/piecewise/test_multiple_graphs.py diff --git a/tests/compile/piecewise/test_multiple_graphs.py b/tests/compile/piecewise/test_multiple_graphs.py new file mode 100644 index 00000000000..e460d709517 --- /dev/null +++ b/tests/compile/piecewise/test_multiple_graphs.py @@ -0,0 +1,350 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +Test (piecewise) compilation with a simple model where multiple submodules +are compiled and graph captured separately. 
+""" +import torch +from torch import nn +from torch.library import Library + +from vllm.compilation.backends import set_model_tag +from vllm.compilation.counter import compilation_counter +from vllm.compilation.decorators import (ignore_torch_compile, + support_torch_compile) +from vllm.config import (CompilationConfig, CompilationLevel, VllmConfig, + set_current_vllm_config) +from vllm.envs import VLLM_USE_V1 +from vllm.forward_context import set_forward_context +from vllm.utils import direct_register_custom_op + +# create a library to hold the custom op +silly_lib = Library("silly", "FRAGMENT") # noqa + +BATCH_SIZE = 32 +MLP_SIZE = 128 +HIDDEN_SIZE = 1024 +RANDOM_SEED = 0 + + +def silly_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, + out: torch.Tensor) -> None: + out.copy_(q) + out += k + out += v + + +def silly_attention_fake(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, + out: torch.Tensor) -> None: + return + + +direct_register_custom_op( + op_name="attention", + op_func=silly_attention, + mutates_args=["out"], + fake_impl=silly_attention_fake, + target_lib=silly_lib, +) + + +@support_torch_compile +class ParentModel(nn.Module): + + def __init__(self, + *, + vllm_config: VllmConfig, + prefix: str = '', + **kwargs) -> None: + super().__init__() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return x + + +class Attention(nn.Module): + + def __init__(self, mlp_size: int, hidden_size: int) -> None: + super().__init__() + self.pre_attn = nn.Linear(mlp_size, hidden_size, bias=False) + self.post_attn = nn.Linear(hidden_size, mlp_size, bias=False) + self.rms_norm_weight = nn.Parameter(torch.ones(hidden_size)) + + # Initialize to same weights for testing + nn.init.xavier_normal_( + self.pre_attn.weight.data, + generator=torch.Generator().manual_seed(RANDOM_SEED), + gain=0.001) + nn.init.xavier_normal_( + self.post_attn.weight.data, + generator=torch.Generator().manual_seed(RANDOM_SEED), + gain=0.001) + + def rms_norm_ref(self, x: torch.Tensor) -> torch.Tensor: + x_f32 = x.float() + return (x_f32 * torch.rsqrt( + torch.mean(x_f32.square(), dim=-1, keepdim=True) + 1e-6) * + self.rms_norm_weight).to(x.dtype) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.pre_attn(x) + x = self.rms_norm_ref(x) + attn_output = torch.empty_like(x) + torch.ops.silly.attention(x, x, x, attn_output) + x = attn_output + x = self.rms_norm_ref(x) + x = self.post_attn(x) + return x + + +@support_torch_compile +class CompiledAttention(nn.Module): + + def __init__(self, + *, + mlp_size: int, + hidden_size: int, + vllm_config: VllmConfig, + prefix: str = '', + **kwargs) -> None: + super().__init__() + self.attn = Attention(mlp_size, hidden_size) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return self.attn(x) + + +@support_torch_compile +class CompiledAttentionTwo(CompiledAttention): + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return self.attn(x) + x + + +@ignore_torch_compile +class SimpleModelWithTwoGraphs(ParentModel): + + def __init__(self, + *, + mlp_size: int, + hidden_size: int, + vllm_config: VllmConfig, + prefix: str = '', + **kwargs) -> None: + super().__init__(vllm_config=vllm_config, prefix=prefix) + # Test will fail without set_model_tag here with error: + # "ValueError: too many values to unpack (expected 3)" + # This is because CompiledAttention and CompiledAttentionTwo + # have different implmentations but the same torch.compile + # cache dir will be used as default prefix is 'model_tag' + with set_model_tag("attn_one"): + 
self.attn_one = CompiledAttention( + mlp_size=mlp_size, + hidden_size=hidden_size, + vllm_config=vllm_config, + prefix=f"{prefix}.attn_one", + ) + with set_model_tag("attn_two"): + self.attn_two = CompiledAttentionTwo( + mlp_size=mlp_size, + hidden_size=hidden_size, + vllm_config=vllm_config, + prefix=f"{prefix}.attn_two", + ) + + self.hidden_states = torch.zeros((BATCH_SIZE, MLP_SIZE)).cuda() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + bsz = x.shape[0] + # CUDAGraph expects same tensor addresses for each run + self.hidden_states[:bsz].copy_(x) + x = self.attn_one(self.hidden_states[:bsz]) + self.hidden_states[:bsz].copy_(x) + x = self.attn_two(self.hidden_states[:bsz]) + return x + + +def test_ignore_torch_compile_decorator(): + assert VLLM_USE_V1 + + # piecewise + vllm_config = VllmConfig(compilation_config=CompilationConfig( + level=CompilationLevel.PIECEWISE, + use_cudagraph=True, + splitting_ops=["silly.attention"], + cudagraph_capture_sizes=[1, 2], + )) + + @support_torch_compile + class A(nn.Module): + + def __init__(self, + *, + vllm_config: VllmConfig, + prefix: str = '', + **kwargs) -> None: + super().__init__() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = x + x + attn_output = torch.empty_like(x) + torch.ops.silly.attention(x, x, x, attn_output) + x = attn_output + x = x * 3 + return x + + @ignore_torch_compile + class B(A): + ... + + @support_torch_compile + class C(B): + ... + + with set_current_vllm_config(vllm_config): + mod_A = A(vllm_config=vllm_config, prefix='').eval().cuda() + + # A has support_torch_compile + with compilation_counter.expect( + num_graphs_seen=1, + num_piecewise_graphs_seen=3, + num_piecewise_capturable_graphs_seen=2, + num_backend_compilations=2, + num_cudagraph_captured=4, + # num_cudagraph_sizes * num_piecewise_capturable_graphs_seen + ), set_forward_context({}, vllm_config=vllm_config): + # first run is for compile + mod_A(torch.randn(BATCH_SIZE, MLP_SIZE).cuda()) + # run cudagraph captured sizes + mod_A(torch.randn(2, MLP_SIZE).cuda()) + mod_A(torch.randn(1, MLP_SIZE).cuda()) + + with set_current_vllm_config(vllm_config): + mod_B = B(vllm_config=vllm_config, prefix='').eval().cuda() + + # B's ignore_torch_compile should override A's support_torch_compile + with compilation_counter.expect( + num_graphs_seen=0, + num_piecewise_graphs_seen=0, + num_piecewise_capturable_graphs_seen=0, + num_backend_compilations=0, + num_cudagraph_captured=0, + ), set_forward_context({}, vllm_config=vllm_config): + mod_B(torch.randn(BATCH_SIZE, MLP_SIZE).cuda()) + mod_B(torch.randn(2, MLP_SIZE).cuda()) + mod_B(torch.randn(1, MLP_SIZE).cuda()) + + with set_current_vllm_config(vllm_config): + mod_C = C(vllm_config=vllm_config, prefix='').eval().cuda() + + # C's support_torch_compile should override B's ignore_torch_compile + with compilation_counter.expect( + num_graphs_seen=1, + num_piecewise_graphs_seen=3, + num_piecewise_capturable_graphs_seen=2, + num_backend_compilations=2, + num_cudagraph_captured=4, + # num_cudagraph_sizes * num_piecewise_capturable_graphs_seen + ), set_forward_context({}, vllm_config=vllm_config): + mod_C(torch.randn(BATCH_SIZE, MLP_SIZE).cuda()) + mod_C(torch.randn(2, MLP_SIZE).cuda()) + mod_C(torch.randn(1, MLP_SIZE).cuda()) + + +@torch.inference_mode +def run_model(vllm_config, model: nn.Module, inputs: torch.Tensor): + with set_forward_context({}, vllm_config=vllm_config): + # First run is for compile + model(inputs) + + # Run CUDAGraph captured sizes + model(inputs[:2]) + model(inputs[:1]) + + output = 
model(inputs[:2]) + + output = output.cpu() + return output.cpu() + + +def test_multi_graph_piecewise_compile_outputs_equal(): + outputs = [] + + # piecewise compile + vllm_config = VllmConfig(compilation_config=CompilationConfig( + level=CompilationLevel.PIECEWISE, + use_cudagraph=True, + splitting_ops=["silly.attention"], + cudagraph_capture_sizes=[1, 2], + )) + + with set_current_vllm_config(vllm_config): + model = SimpleModelWithTwoGraphs(mlp_size=MLP_SIZE, + hidden_size=HIDDEN_SIZE, + vllm_config=vllm_config, + prefix='').eval().cuda() + + # Pre-allocate memory for CUDAGraph which expects + # static tensor addresses + inputs = torch.randn(BATCH_SIZE, MLP_SIZE).cuda() + + with compilation_counter.expect( + num_graphs_seen=2, # two graphs for the model + num_piecewise_graphs_seen=6, + # attn_one, attn_two each has 3 piecewise graphs + # (pre attn, post attn, silly_attention) each + num_piecewise_capturable_graphs_seen=4, + # attn_one, attn_two has pre attn and post attn each, total=4 + num_backend_compilations=4, # num_piecewise_capturable_graphs_seen + num_cudagraph_captured=8, + # num_cudagraph_sizes * num_piecewise_capturable_graphs_seen + ): + outputs.append(run_model(vllm_config, model, inputs)) + + # no compile or cudagraph + vllm_config = VllmConfig(compilation_config=CompilationConfig( + level=CompilationLevel.NO_COMPILATION, )) + + with set_current_vllm_config(vllm_config): + model = SimpleModelWithTwoGraphs(mlp_size=MLP_SIZE, + hidden_size=HIDDEN_SIZE, + vllm_config=vllm_config, + prefix='').eval().cuda() + + with compilation_counter.expect( + num_graphs_seen=0, + num_piecewise_graphs_seen=0, + num_piecewise_capturable_graphs_seen=0, + num_backend_compilations=0, + num_cudagraph_captured=0, + ): + outputs.append(run_model(vllm_config, model, inputs)) + + # piecewise compile without CUDA graph + vllm_config = VllmConfig(compilation_config=CompilationConfig( + level=CompilationLevel.PIECEWISE, + use_cudagraph=False, + splitting_ops=["silly.attention"], + )) + + with set_current_vllm_config(vllm_config): + model = SimpleModelWithTwoGraphs(mlp_size=MLP_SIZE, + hidden_size=HIDDEN_SIZE, + vllm_config=vllm_config, + prefix='').eval().cuda() + + with compilation_counter.expect( + num_graphs_seen=2, + num_piecewise_graphs_seen=6, + num_piecewise_capturable_graphs_seen=4, + num_backend_compilations=4, + num_cudagraph_captured=0, # no cudagraph captured + ): + outputs.append(run_model(vllm_config, model, inputs)) + + # Generally don't expect outputs with and without inductor + # to be bitwise equivalent + assert torch.allclose(outputs[0], outputs[1]) + + # Expect bitwise equivalence using inductor w/ and w/o cudagraph + assert torch.equal(outputs[0], outputs[2]) diff --git a/vllm/compilation/compiler_interface.py b/vllm/compilation/compiler_interface.py index b529f84b798..7158fd68596 100644 --- a/vllm/compilation/compiler_interface.py +++ b/vllm/compilation/compiler_interface.py @@ -423,6 +423,12 @@ def _get_shape_env() -> AlwaysHitShapeEnv: if is_torch_equal_or_newer("2.6"): stack.enter_context( torch._inductor.config.patch(fx_graph_remote_cache=False)) + # InductorAdaptor (unfortunately) requires AOTAutogradCache + # to be turned off to run. It will fail to acquire the hash_str + # and error if not. + # StandaloneInductorAdaptor (PyTorch 2.8+) fixes this problem. 
+ stack.enter_context( + torch._functorch.config.patch(enable_autograd_cache=False)) stack.enter_context( torch._functorch.config.patch( enable_remote_autograd_cache=False)) diff --git a/vllm/compilation/decorators.py b/vllm/compilation/decorators.py index 05e4ca9f08b..f3592324d8c 100644 --- a/vllm/compilation/decorators.py +++ b/vllm/compilation/decorators.py @@ -20,9 +20,38 @@ logger = init_logger(__name__) +IGNORE_COMPILE_KEY = "_ignore_compile_vllm" + _T = TypeVar("_T", bound=type[nn.Module]) +def ignore_torch_compile(cls: _T) -> _T: + """ + A decorator to ignore support_torch_compile decorator + on the class. This is useful when a parent class has + a support_torch_compile decorator, but we don't want to + compile the class `cls` that inherits the parent class. + This only ignores compiling the forward of the class the + decorator is applied to. + + If the parent has ignore_torch_compile but the child has + support_torch_compile, the child will still be compiled. + + If the class has one or more submodules + that have support_torch_compile decorator applied, compile will + not be ignored for those submodules. + """ + setattr(cls, IGNORE_COMPILE_KEY, True) + return cls + + +def _should_ignore_torch_compile(cls) -> bool: + """ + Check if the class should be ignored for torch.compile. + """ + return getattr(cls, IGNORE_COMPILE_KEY, False) + + @overload def support_torch_compile( *, @@ -148,6 +177,8 @@ def _support_torch_compile( old_init = cls.__init__ + setattr(cls, IGNORE_COMPILE_KEY, False) + def __init__(self, *, vllm_config: VllmConfig, prefix: str = '', **kwargs): old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs) self.vllm_config = vllm_config @@ -156,9 +187,11 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = '', **kwargs): self.do_not_compile = \ vllm_config.compilation_config.level in [ CompilationLevel.NO_COMPILATION, CompilationLevel.DYNAMO_AS_IS - ] or not supports_dynamo() + ] or not supports_dynamo() or _should_ignore_torch_compile( + self.__class__) if self.do_not_compile: return + compilation_counter.num_models_seen += 1 TorchCompileWrapperWithCustomDispatcher.__init__( self, compilation_level=vllm_config.compilation_config.level) From 4e4275b62ab2e471d63f85d7f47a508ea454d446 Mon Sep 17 00:00:00 2001 From: QiliangCui Date: Wed, 23 Jul 2025 11:29:36 -0700 Subject: [PATCH 292/552] [TPU][TEST] Fix the downloading issue in TPU v1 test 11. 
(#21418) Signed-off-by: Qiliang Cui Signed-off-by: x22x22 --- .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh index 60f0d174bd6..d39acae0b04 100755 --- a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh +++ b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh @@ -62,7 +62,8 @@ echo "Results will be stored in: $RESULTS_DIR" echo "--- Installing Python dependencies ---" python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \ && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \ - && python3 -m pip install --progress-bar off lm_eval[api]==0.4.4 + && python3 -m pip install --progress-bar off lm_eval[api]==0.4.4 \ + && python3 -m pip install --progress-bar off hf-transfer echo "--- Python dependencies installed ---" export VLLM_USE_V1=1 export VLLM_XLA_CHECK_RECOMPILATION=1 @@ -150,7 +151,7 @@ run_and_track_test 9 "test_multimodal.py" \ run_and_track_test 10 "test_pallas.py" \ "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py" run_and_track_test 11 "test_struct_output_generate.py" \ - "python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\"" + "HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\"" run_and_track_test 12 "test_moe_pallas.py" \ "python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py" run_and_track_test 13 "test_lora.py" \ From 7ad7a24acac03b171c00ab2a5fde09980d9c273a Mon Sep 17 00:00:00 2001 From: 22quinn <33176974+22quinn@users.noreply.github.com> Date: Wed, 23 Jul 2025 14:24:52 -0700 Subject: [PATCH 293/552] [Core] Add `reload_weights` RPC method (#20096) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: x22x22 --- tests/v1/worker/test_gpu_model_runner.py | 7 ++++- vllm/v1/worker/gpu_model_runner.py | 21 +++++++-------- vllm/v1/worker/gpu_worker.py | 33 +++++++++++++++--------- vllm/v1/worker/tpu_model_runner.py | 21 ++++++++------- vllm/v1/worker/tpu_worker.py | 3 +++ 5 files changed, 51 insertions(+), 34 deletions(-) diff --git a/tests/v1/worker/test_gpu_model_runner.py b/tests/v1/worker/test_gpu_model_runner.py index 6ddcbfea24a..7fec4782517 100644 --- a/tests/v1/worker/test_gpu_model_runner.py +++ b/tests/v1/worker/test_gpu_model_runner.py @@ -460,11 +460,16 @@ def test_load_model_weights_inplace(dist_init, model_runner, model_runner_2): {"load_config": { "load_format": original_load_format }}) - model_runner_2.load_model() # Load real weights inplace + model_runner_2.reload_weights() # Load real weights inplace assert str(model_runner.get_model().state_dict()) == str( model_runner_2.get_model().state_dict()) +def test_reload_weights_before_load_model(model_runner): + with pytest.raises(AssertionError): + model_runner.reload_weights() + + def test_init_kv_cache_with_kv_sharing_invalid_target_layer_order(): torch.set_default_dtype(torch.float16) layer_0 = "model.layers.0.self_attn.attn" diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 864cf91e785..1ee379d3427 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -1873,17 +1873,9 @@ def load_model(self, eep_scale_up: bool = False) -> 
None: with DeviceMemoryProfiler() as m: time_before_load = time.perf_counter() model_loader = get_model_loader(self.load_config) - if not hasattr(self, "model"): - logger.info("Loading model from scratch...") - self.model = model_loader.load_model( - vllm_config=self.vllm_config, - model_config=self.model_config) - else: - logger.info( - "Model was already initialized. Loading weights inplace..." - ) - model_loader.load_weights(self.model, - model_config=self.model_config) + logger.info("Loading model from scratch...") + self.model = model_loader.load_model( + vllm_config=self.vllm_config, model_config=self.model_config) if self.lora_config: self.model = self.load_lora_model(self.model, self.model_config, @@ -1916,6 +1908,13 @@ def load_model(self, eep_scale_up: bool = False) -> None: rank_mapping, ) + def reload_weights(self) -> None: + assert getattr(self, "model", None) is not None, \ + "Cannot reload weights before model is loaded." + model_loader = get_model_loader(self.load_config) + logger.info("Reloading weights inplace...") + model_loader.load_weights(self.model, model_config=self.model_config) + def save_tensorized_model( self, tensorizer_config: "TensorizerConfig", diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py index 6411874883e..1c180322e12 100644 --- a/vllm/v1/worker/gpu_worker.py +++ b/vllm/v1/worker/gpu_worker.py @@ -4,6 +4,7 @@ import copy import gc import os +from contextlib import AbstractContextManager, nullcontext from typing import TYPE_CHECKING, Any, Optional import torch @@ -118,6 +119,21 @@ def wake_up(self, tags: Optional[list[str]] = None) -> None: buffer.data.copy_(self._sleep_saved_buffers[name].data) self._sleep_saved_buffers = {} + def _maybe_get_memory_pool_context(self, + tag: str) -> AbstractContextManager: + if self.vllm_config.model_config.enable_sleep_mode: + from vllm.device_allocator.cumem import CuMemAllocator + + allocator = CuMemAllocator.get_instance() + if tag == "weights": + assert allocator.get_current_usage() == 0, ( + "Sleep mode can only be " + "used for one instance per process.") + context = allocator.use_memory_pool(tag=tag) + else: + context = nullcontext() + return context + def initialize_cache(self, num_gpu_blocks: int, num_cpu_blocks: int) -> None: self.cache_config.num_gpu_blocks = num_gpu_blocks @@ -179,24 +195,17 @@ def init_device(self): # FIXME(youkaichao & ywang96): Use TorchDispatchMode instead of memory pool # to hijack tensor allocation. 
def load_model(self) -> None: - if self.vllm_config.model_config.enable_sleep_mode: - from vllm.device_allocator.cumem import CuMemAllocator - - allocator = CuMemAllocator.get_instance() - assert allocator.get_current_usage() == 0, ( - "Sleep mode can only be " - "used for one instance per process.") - context = allocator.use_memory_pool(tag="weights") - else: - from contextlib import nullcontext - context = nullcontext() eep_scale_up = os.environ.get("VLLM_ELASTIC_EP_SCALE_UP_LAUNCH") == "1" - with context: + with self._maybe_get_memory_pool_context(tag="weights"): self.model_runner.load_model(eep_scale_up=eep_scale_up) def update_config(self, overrides: dict[str, Any]) -> None: self.model_runner.update_config(overrides) + def reload_weights(self) -> None: + with self._maybe_get_memory_pool_context(tag="weights"): + self.model_runner.reload_weights() + @torch.inference_mode() def determine_available_memory(self) -> int: """Profiles the peak memory usage of the model to determine how much diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 31e9cff9124..f160384f8f6 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -1174,16 +1174,10 @@ def load_model(self) -> None: mesh=self.mesh) else: model_loader = get_model_loader(self.load_config) - if not hasattr(self, "model"): - logger.info("Loading model from scratch...") - model = model_loader.load_model( - vllm_config=self.vllm_config, - model_config=self.model_config) - else: - logger.info("Model was already initialized. \ - Loading weights inplace...") - model_loader.load_weights( - self.model, model_config=self.model_config) + logger.info("Loading model from scratch...") + model = model_loader.load_model( + vllm_config=self.vllm_config, + model_config=self.model_config) except RuntimeError as e: raise RuntimeError( f"Unable to load model, a likely reason is the model is " @@ -1205,6 +1199,13 @@ def load_model(self) -> None: self.model = model self.sampler = TPUSampler() + def reload_weights(self) -> None: + assert getattr(self, "model", None) is not None, \ + "Cannot reload weights before model is loaded." 
+ model_loader = get_model_loader(self.load_config) + logger.info("Reloading weights inplace...") + model_loader.load_weights(self.model, model_config=self.model_config) + @torch.no_grad() def _dummy_run(self, num_tokens: int, num_reqs: int, num_blocks: int) -> None: diff --git a/vllm/v1/worker/tpu_worker.py b/vllm/v1/worker/tpu_worker.py index 592d9fc17c9..1d61878ca08 100644 --- a/vllm/v1/worker/tpu_worker.py +++ b/vllm/v1/worker/tpu_worker.py @@ -265,6 +265,9 @@ def load_model(self) -> None: def update_config(self, overrides: dict[str, Any]) -> None: self.model_runner.update_config(overrides) + def reload_weights(self) -> None: + self.model_runner.reload_weights() + def compile_or_warm_up_model(self) -> None: if not self.model_config.enforce_eager: self.model_runner.capture_model() From c3deb55914df619f635c7036a91312439cbf5c82 Mon Sep 17 00:00:00 2001 From: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com> Date: Wed, 23 Jul 2025 15:59:30 -0700 Subject: [PATCH 294/552] [V1] Fix local chunked attention always disabled (#21419) Signed-off-by: Yong Hoon Shin Signed-off-by: x22x22 --- vllm/attention/layer.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/vllm/attention/layer.py b/vllm/attention/layer.py index 1b80fa19d54..178453ecdc4 100644 --- a/vllm/attention/layer.py +++ b/vllm/attention/layer.py @@ -143,6 +143,8 @@ def __init__( # the backends) if envs.VLLM_USE_V1: self.use_irope = extra_impl_args.pop("use_irope", False) + else: + self.use_irope = extra_impl_args.get("use_irope", False) quant_method = quant_config.get_quant_method( self, prefix=prefix) if quant_config else None @@ -177,7 +179,6 @@ def __init__( kv_sharing_target_layer_name, **extra_impl_args) self.backend = backend_name_to_enum(attn_backend.get_name()) self.dtype = dtype - self.use_irope = extra_impl_args.get("use_irope", False) # For cuda-alike (CUDA and ROCM) and cpu platforms, we control how # torch.compile works by registering the attention as one giant From 7f0b94dea133af506e6f794f1ad326a826bf8d0c Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 23 Jul 2025 19:36:48 -0400 Subject: [PATCH 295/552] [V0 Deprecation] Remove Prompt Adapters (#20588) Signed-off-by: mgoin Signed-off-by: x22x22 --- docs/api/README.md | 1 - docs/features/compatibility_matrix.md | 34 +- pyproject.toml | 1 - tests/entrypoints/openai/test_completion.py | 72 ++-- .../openai/test_return_tokens_as_ids.py | 1 - .../entrypoints/openai/test_serving_models.py | 3 +- tests/prompt_adapter/test_bloom.py | 48 --- .../test_multi_adapter_inference.py | 56 --- tests/prompt_adapter/test_pa_lora.py | 64 ---- tools/mypy.sh | 1 - vllm/config.py | 62 --- vllm/core/scheduler.py | 12 - vllm/engine/arg_utils.py | 49 +-- vllm/engine/async_llm_engine.py | 10 - vllm/engine/llm_engine.py | 68 +--- vllm/engine/multiprocessing/__init__.py | 4 - vllm/engine/multiprocessing/client.py | 9 +- vllm/engine/multiprocessing/engine.py | 14 +- vllm/engine/protocol.py | 2 - vllm/entrypoints/llm.py | 46 +-- vllm/entrypoints/logger.py | 7 +- vllm/entrypoints/openai/api_server.py | 1 - vllm/entrypoints/openai/cli_args.py | 36 +- vllm/entrypoints/openai/run_batch.py | 1 - vllm/entrypoints/openai/serving_chat.py | 11 +- .../openai/serving_classification.py | 10 +- vllm/entrypoints/openai/serving_completion.py | 7 +- vllm/entrypoints/openai/serving_embedding.py | 9 +- vllm/entrypoints/openai/serving_engine.py | 31 +- vllm/entrypoints/openai/serving_models.py | 31 -- vllm/entrypoints/openai/serving_pooling.py | 12 +- 
vllm/entrypoints/openai/serving_responses.py | 9 +- vllm/entrypoints/openai/serving_score.py | 22 +- .../openai/serving_tokenization.py | 21 +- vllm/entrypoints/openai/speech_to_text.py | 12 +- vllm/executor/executor_base.py | 31 -- vllm/inputs/preprocess.py | 35 +- vllm/prompt_adapter/__init__.py | 0 vllm/prompt_adapter/layers.py | 83 ---- vllm/prompt_adapter/models.py | 358 ------------------ vllm/prompt_adapter/request.py | 37 -- vllm/prompt_adapter/utils.py | 98 ----- vllm/prompt_adapter/worker_manager.py | 179 --------- vllm/sequence.py | 39 +- vllm/utils/__init__.py | 5 - vllm/v1/engine/async_llm.py | 7 +- vllm/v1/engine/llm_engine.py | 5 +- vllm/v1/engine/processor.py | 6 - vllm/v1/utils.py | 2 - vllm/v1/worker/gpu_model_runner.py | 1 - vllm/v1/worker/tpu_model_runner.py | 1 - vllm/v1/worker/tpu_worker.py | 1 - vllm/worker/enc_dec_model_runner.py | 7 +- vllm/worker/model_runner.py | 151 +------- vllm/worker/model_runner_base.py | 1 - vllm/worker/multi_step_model_runner.py | 3 - vllm/worker/pooling_model_runner.py | 7 - vllm/worker/utils.py | 4 - vllm/worker/worker.py | 14 - vllm/worker/worker_base.py | 1 - 60 files changed, 126 insertions(+), 1727 deletions(-) delete mode 100644 tests/prompt_adapter/test_bloom.py delete mode 100644 tests/prompt_adapter/test_multi_adapter_inference.py delete mode 100644 tests/prompt_adapter/test_pa_lora.py delete mode 100644 vllm/prompt_adapter/__init__.py delete mode 100644 vllm/prompt_adapter/layers.py delete mode 100644 vllm/prompt_adapter/models.py delete mode 100644 vllm/prompt_adapter/request.py delete mode 100644 vllm/prompt_adapter/utils.py delete mode 100644 vllm/prompt_adapter/worker_manager.py diff --git a/docs/api/README.md b/docs/api/README.md index 245c925f7f5..db4dab0ae53 100644 --- a/docs/api/README.md +++ b/docs/api/README.md @@ -14,7 +14,6 @@ API documentation for vLLM's configuration classes. 
- [vllm.config.DeviceConfig][] - [vllm.config.SpeculativeConfig][] - [vllm.config.LoRAConfig][] -- [vllm.config.PromptAdapterConfig][] - [vllm.config.MultiModalConfig][] - [vllm.config.PoolerConfig][] - [vllm.config.DecodingConfig][] diff --git a/docs/features/compatibility_matrix.md b/docs/features/compatibility_matrix.md index fdd75bfe33d..8be1585f8e7 100644 --- a/docs/features/compatibility_matrix.md +++ b/docs/features/compatibility_matrix.md @@ -34,23 +34,22 @@ th:not(:first-child) { } -| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | prmpt adptr | [SD](spec_decode.md) | CUDA graph | pooling | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | -|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| +| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | pooling | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | +|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| | [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | | | [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | | | [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | | -| prmpt adptr | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | | -| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | | -| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | -| pooling | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | | -| enc-dec | ❌ | [❌](gh-issue:7366) | ❌ | ❌ | [❌](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | | -| logP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | -| prmpt logP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | | -| async output | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | -| multi-step | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | | -| mm | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | -| best-of | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ✅ | ✅ | | -| beam-search | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ❔ | ✅ | ✅ | +| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | +| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | +| pooling | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | | +| enc-dec | ❌ | [❌](gh-issue:7366) | ❌ | [❌](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | | +| logP | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | +| prmpt logP | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | | +| async output | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | +| multi-step | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | | +| mm | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | +| best-of | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ✅ | ✅ | | +| beam-search | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ❔ | ✅ | ✅ | [](){ #feature-x-hardware } @@ -59,10 +58,9 @@ th:not(:first-child) { | Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU | |-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----| | [CP][chunked-prefill] | [❌](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| [APC](automatic_prefix_caching.md) | [❌](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| prmpt adptr | ✅ | ✅ | ✅ | ✅ | ✅ | 
[❌](gh-issue:8475) | ✅ | ❌ | -| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | +| [APC](automatic_prefix_caching.md) | [❌](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | | pooling | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ❌ | | enc-dec | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | diff --git a/pyproject.toml b/pyproject.toml index 0c8d2f82d1d..a65267942d4 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -72,7 +72,6 @@ line-length = 80 "vllm/core/**/*.py" = ["UP006", "UP035"] "vllm/engine/**/*.py" = ["UP006", "UP035"] "vllm/executor/**/*.py" = ["UP006", "UP035"] -"vllm/prompt_adapter/**/*.py" = ["UP006", "UP035"] "vllm/worker/**/*.py" = ["UP006", "UP035"] # Python 3.8 typing - skip utils for ROCm "vllm/utils/__init__.py" = ["UP006", "UP035"] diff --git a/tests/entrypoints/openai/test_completion.py b/tests/entrypoints/openai/test_completion.py index df9586ee84d..6eca3e767f3 100644 --- a/tests/entrypoints/openai/test_completion.py +++ b/tests/entrypoints/openai/test_completion.py @@ -2,6 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project # imports for guided decoding tests import json +import os import shutil from tempfile import TemporaryDirectory from typing import Optional @@ -26,10 +27,6 @@ # technically these adapters use a different base model, # but we're not testing generation quality here LORA_NAME = "typeof/zephyr-7b-beta-lora" -PA_NAME = "swapnilbp/llama_tweet_ptune" -# if PA_NAME changes, PA_NUM_VIRTUAL_TOKENS might also -# need to change to match the prompt adapter -PA_NUM_VIRTUAL_TOKENS = 8 GUIDED_DECODING_BACKENDS = ["outlines", "lm-format-enforcer", "xgrammar"] @@ -56,13 +53,7 @@ def zephyr_lora_added_tokens_files(zephyr_lora_files): @pytest.fixture(scope="module") -def zephyr_pa_files(): - return snapshot_download(repo_id=PA_NAME) - - -@pytest.fixture(scope="module") -def default_server_args(zephyr_lora_files, zephyr_lora_added_tokens_files, - zephyr_pa_files): +def default_server_args(zephyr_lora_files, zephyr_lora_added_tokens_files): return [ # use half precision for speed and memory savings in CI environment "--dtype", @@ -81,15 +72,6 @@ def default_server_args(zephyr_lora_files, zephyr_lora_added_tokens_files, "64", "--max-cpu-loras", "2", - # pa config - "--enable-prompt-adapter", - "--prompt-adapters", - f"zephyr-pa={zephyr_pa_files}", - f"zephyr-pa2={zephyr_pa_files}", - "--max-prompt-adapters", - "2", - "--max-prompt-adapter-token", - "128", ] @@ -98,8 +80,19 @@ def default_server_args(zephyr_lora_files, zephyr_lora_added_tokens_files, def server(default_server_args, request): if request.param: default_server_args.append(request.param) - with RemoteOpenAIServer(MODEL_NAME, default_server_args) as remote_server: - yield remote_server + + original_value = os.environ.get('VLLM_USE_V1') + os.environ['VLLM_USE_V1'] = '0' + try: + with RemoteOpenAIServer(MODEL_NAME, + default_server_args) as remote_server: + yield remote_server + finally: + # Restore original env value + if original_value is None: + os.environ.pop('VLLM_USE_V1', None) + else: + os.environ['VLLM_USE_V1'] = original_value @pytest_asyncio.fixture @@ -110,14 +103,11 @@ async def client(server): @pytest.mark.asyncio @pytest.mark.parametrize( - # first test base model, then test loras, then test prompt adapters - "model_name,num_virtual_tokens", - [(MODEL_NAME, 0), ("zephyr-lora", 0), ("zephyr-lora2", 0), - ("zephyr-pa", PA_NUM_VIRTUAL_TOKENS), - 
("zephyr-pa2", PA_NUM_VIRTUAL_TOKENS)], + # first test base model, then test loras + "model_name", + [MODEL_NAME, "zephyr-lora", "zephyr-lora2"], ) -async def test_single_completion(client: openai.AsyncOpenAI, model_name: str, - num_virtual_tokens: int): +async def test_single_completion(client: openai.AsyncOpenAI, model_name: str): completion = await client.completions.create(model=model_name, prompt="Hello, my name is", max_tokens=5, @@ -130,9 +120,7 @@ async def test_single_completion(client: openai.AsyncOpenAI, model_name: str, assert len(choice.text) >= 5 assert choice.finish_reason == "length" assert completion.usage == openai.types.CompletionUsage( - completion_tokens=5, - prompt_tokens=6 + num_virtual_tokens, - total_tokens=11 + num_virtual_tokens) + completion_tokens=5, prompt_tokens=6, total_tokens=11) # test using token IDs completion = await client.completions.create( @@ -175,9 +163,9 @@ async def test_added_lora_tokens_base_model(client: openai.AsyncOpenAI): @pytest.mark.asyncio @pytest.mark.parametrize( - # first test base model, then test loras, then test prompt adapters + # first test base model, then test loras "model_name", - [MODEL_NAME, "zephyr-lora", "zephyr-lora2", "zephyr-pa", "zephyr-pa2"], + [MODEL_NAME, "zephyr-lora", "zephyr-lora2"], ) async def test_no_logprobs(client: openai.AsyncOpenAI, model_name: str): # test using token IDs @@ -194,9 +182,9 @@ async def test_no_logprobs(client: openai.AsyncOpenAI, model_name: str): @pytest.mark.asyncio @pytest.mark.parametrize( - # just test 1 lora and 1 pa hereafter + # just test 1 lora "model_name", - [MODEL_NAME, "zephyr-lora", "zephyr-pa"], + [MODEL_NAME, "zephyr-lora"], ) async def test_zero_logprobs(client: openai.AsyncOpenAI, model_name: str): # test using token IDs @@ -217,7 +205,7 @@ async def test_zero_logprobs(client: openai.AsyncOpenAI, model_name: str): @pytest.mark.asyncio @pytest.mark.parametrize( "model_name", - [MODEL_NAME, "zephyr-lora", "zephyr-pa"], + [MODEL_NAME, "zephyr-lora"], ) async def test_some_logprobs(client: openai.AsyncOpenAI, model_name: str): # test using token IDs @@ -238,7 +226,7 @@ async def test_some_logprobs(client: openai.AsyncOpenAI, model_name: str): @pytest.mark.asyncio @pytest.mark.parametrize( "model_name", - [MODEL_NAME, "zephyr-lora", "zephyr-pa"], + [MODEL_NAME, "zephyr-lora"], ) async def test_too_many_completion_logprobs(client: openai.AsyncOpenAI, model_name: str): @@ -314,7 +302,7 @@ async def test_prompt_logprobs_completion(client: openai.AsyncOpenAI, @pytest.mark.asyncio @pytest.mark.parametrize( "model_name", - [MODEL_NAME, "zephyr-lora", "zephyr-pa"], + [MODEL_NAME, "zephyr-lora"], ) async def test_completion_streaming(client: openai.AsyncOpenAI, model_name: str): @@ -348,7 +336,7 @@ async def test_completion_streaming(client: openai.AsyncOpenAI, @pytest.mark.asyncio @pytest.mark.parametrize( "model_name", - [MODEL_NAME, "zephyr-lora", "zephyr-pa"], + [MODEL_NAME, "zephyr-lora"], ) async def test_parallel_streaming(client: openai.AsyncOpenAI, model_name: str): """Streaming for parallel sampling. 
@@ -382,7 +370,7 @@ async def test_parallel_streaming(client: openai.AsyncOpenAI, model_name: str): @pytest.mark.asyncio @pytest.mark.parametrize( "model_name", - [MODEL_NAME, "zephyr-lora", "zephyr-pa"], + [MODEL_NAME, "zephyr-lora"], ) async def test_completion_stream_options(client: openai.AsyncOpenAI, model_name: str): @@ -519,7 +507,7 @@ async def test_completion_stream_options(client: openai.AsyncOpenAI, @pytest.mark.asyncio @pytest.mark.parametrize( "model_name", - [MODEL_NAME, "zephyr-lora", "zephyr-pa"], + [MODEL_NAME, "zephyr-lora"], ) async def test_batch_completions(client: openai.AsyncOpenAI, model_name: str): # test both text and token IDs diff --git a/tests/entrypoints/openai/test_return_tokens_as_ids.py b/tests/entrypoints/openai/test_return_tokens_as_ids.py index 099062e55c7..af58fbd4b36 100644 --- a/tests/entrypoints/openai/test_return_tokens_as_ids.py +++ b/tests/entrypoints/openai/test_return_tokens_as_ids.py @@ -13,7 +13,6 @@ from .test_completion import default_server_args # noqa: F401 from .test_completion import zephyr_lora_added_tokens_files # noqa: F401 from .test_completion import zephyr_lora_files # noqa: F401 -from .test_completion import zephyr_pa_files # noqa: F401 from .test_completion import MODEL_NAME diff --git a/tests/entrypoints/openai/test_serving_models.py b/tests/entrypoints/openai/test_serving_models.py index 5f334c754a3..c3b458d717f 100644 --- a/tests/entrypoints/openai/test_serving_models.py +++ b/tests/entrypoints/openai/test_serving_models.py @@ -32,8 +32,7 @@ async def _async_serving_models_init() -> OpenAIServingModels: serving_models = OpenAIServingModels(engine_client=mock_engine_client, base_model_paths=BASE_MODEL_PATHS, model_config=mock_model_config, - lora_modules=None, - prompt_adapters=None) + lora_modules=None) await serving_models.init_static_loras() return serving_models diff --git a/tests/prompt_adapter/test_bloom.py b/tests/prompt_adapter/test_bloom.py deleted file mode 100644 index 2b603fe8f02..00000000000 --- a/tests/prompt_adapter/test_bloom.py +++ /dev/null @@ -1,48 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import pytest - -import vllm -from vllm.prompt_adapter.request import PromptAdapterRequest - -MODEL_PATH = "bigscience/bloomz-560m" -PA_PATH = 'stevhliu/bloomz-560m_PROMPT_TUNING_CAUSAL_LM' - - -def do_sample(llm, pa_name: str, pa_id: int): - - prompts = [ - "Tweet text : @nationalgridus I have no water and the bill is \ - current and paid. Can you do something about this? Label : ", - "Tweet text : @nationalgridus Looks good thanks! Label : " - ] - sampling_params = vllm.SamplingParams(temperature=0.0, - max_tokens=3, - stop_token_ids=[3]) - - outputs = llm.generate(prompts, - sampling_params, - prompt_adapter_request=PromptAdapterRequest( - pa_name, pa_id, PA_PATH, 8) if pa_id else None) - - # Print the outputs. 
- generated_texts = [] - for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text.strip() - generated_texts.append(generated_text) - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") - return generated_texts - - -@pytest.mark.parametrize("enforce_eager", [True, False]) -def test_twitter_prompt_adapter(enforce_eager: bool): - llm = vllm.LLM(MODEL_PATH, - enforce_eager=enforce_eager, - enable_prompt_adapter=True, - max_prompt_adapter_token=8) - - expected_output = ['complaint', 'no complaint'] - - assert do_sample(llm, "twitter_pa", pa_id=1) == expected_output diff --git a/tests/prompt_adapter/test_multi_adapter_inference.py b/tests/prompt_adapter/test_multi_adapter_inference.py deleted file mode 100644 index 4f273afb4e3..00000000000 --- a/tests/prompt_adapter/test_multi_adapter_inference.py +++ /dev/null @@ -1,56 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from vllm import EngineArgs, LLMEngine, SamplingParams -from vllm.prompt_adapter.request import PromptAdapterRequest - -MODEL_PATH = "bigscience/bloomz-560m" -pa_path = 'stevhliu/bloomz-560m_PROMPT_TUNING_CAUSAL_LM' -pa_path2 = 'swapnilbp/angry_tweet_ptune' - - -def do_sample(engine): - - prompts = [ - ("Tweet text: I have complaints! Label: ", - SamplingParams(temperature=0.0, max_tokens=3, stop_token_ids=[3]), - PromptAdapterRequest("hate_speech", 1, pa_path2, 8)), - ("Tweet text: I have no problems Label: ", - SamplingParams(temperature=0.0, max_tokens=3, stop_token_ids=[3]), - PromptAdapterRequest("hate_speech2", 2, pa_path2, 8)), - ("Tweet text: I have complaints! Label: ", - SamplingParams(temperature=0.0, max_tokens=3), None), - ("Tweet text: I have no problems Label: ", - SamplingParams(temperature=0.0, max_tokens=3, stop_token_ids=[3]), - PromptAdapterRequest("complain", 3, pa_path, 8)), - ] - - request_id = 0 - results = set() - while prompts or engine.has_unfinished_requests(): - if prompts: - prompt, sampling_params, pa_request = prompts.pop(0) - engine.add_request(str(request_id), - prompt, - sampling_params, - prompt_adapter_request=pa_request) - request_id += 1 - - request_outputs = engine.step() - - for request_output in request_outputs: - if request_output.finished: - results.add(request_output.outputs[0].text) - return results - - -def test_multi_prompt_adapters(): - engine_args = EngineArgs(model=MODEL_PATH, - max_prompt_adapters=3, - enable_prompt_adapter=True, - max_prompt_adapter_token=8) - engine = LLMEngine.from_engine_args(engine_args) - expected_output = { - ' quot;I', 'hate speech', 'no complaint', 'not hate speech' - } - assert do_sample(engine) == expected_output diff --git a/tests/prompt_adapter/test_pa_lora.py b/tests/prompt_adapter/test_pa_lora.py deleted file mode 100644 index ba2e15b81bc..00000000000 --- a/tests/prompt_adapter/test_pa_lora.py +++ /dev/null @@ -1,64 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from huggingface_hub import snapshot_download - -from vllm import EngineArgs, LLMEngine, SamplingParams -from vllm.lora.request import LoRARequest -from vllm.prompt_adapter.request import PromptAdapterRequest - -MODEL_PATH = "meta-llama/Llama-2-7b-hf" -pa_path = snapshot_download(repo_id="swapnilbp/llama_tweet_ptune") -lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test") - - -def do_sample(engine): - - prompt_text = "[user] Write a SQL query to answer the question based on the table 
schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]" # noqa: E501 - - # first prompt with a prompt adapter and second without adapter - prompts = [ - (prompt_text, - SamplingParams(temperature=0.0, max_tokens=100, - stop=["[/assistant]"]), - PromptAdapterRequest("hate_speech", 1, pa_path, - 8), LoRARequest("sql_test", 1, lora_path)), - (prompt_text, - SamplingParams(temperature=0.0, max_tokens=100, - stop=["[/assistant]"]), None, - LoRARequest("sql_test", 1, lora_path)), - ] - - request_id = 0 - results = set() - while prompts or engine.has_unfinished_requests(): - if prompts: - prompt, sampling_params, pa_request, lora_request = prompts.pop(0) - engine.add_request(str(request_id), - prompt, - sampling_params, - prompt_adapter_request=pa_request, - lora_request=lora_request) - request_id += 1 - - request_outputs = engine.step() - - for request_output in request_outputs: - if request_output.finished: - results.add(request_output.outputs[0].text) - return results - - -def test_lora_prompt_adapter(): - engine_args = EngineArgs(model=MODEL_PATH, - enable_prompt_adapter=True, - enable_lora=True, - max_num_seqs=60, - max_prompt_adapter_token=8) - engine = LLMEngine.from_engine_args(engine_args) - result = do_sample(engine) - - expected_output = { - " SELECT icao FROM table_name_74 WHERE airport = 'lilongwe international airport' " # noqa: E501 - } - assert result == expected_output diff --git a/tools/mypy.sh b/tools/mypy.sh index af4c61233ab..781d8fc0288 100755 --- a/tools/mypy.sh +++ b/tools/mypy.sh @@ -31,6 +31,5 @@ run_mypy vllm/inputs run_mypy vllm/lora run_mypy vllm/model_executor run_mypy vllm/plugins -run_mypy vllm/prompt_adapter run_mypy vllm/worker run_mypy vllm/v1 diff --git a/vllm/config.py b/vllm/config.py index a844e771cd9..0632bb3db23 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -3143,59 +3143,6 @@ def verify_with_model_config(self, model_config: ModelConfig): self.lora_dtype = getattr(torch, self.lora_dtype) -@config -@dataclass(config=ConfigDict(arbitrary_types_allowed=True)) -class PromptAdapterConfig: - """Configuration for PromptAdapters.""" - - max_prompt_adapters: int = 1 - """Max number of PromptAdapters in a batch.""" - max_prompt_adapter_token: int = 0 - """Max number of PromptAdapters tokens.""" - max_cpu_prompt_adapters: Optional[int] = None - """Maximum number of PromptAdapters to store in CPU memory. Must be >= than - `max_prompt_adapters`.""" - prompt_adapter_dtype: Union[torch.dtype, str] = "auto" - """Data type for PromptAdapter. If auto, will default to base model dtype. - """ - - def compute_hash(self) -> str: - """ - WARNING: Whenever a new field is added to this config, - ensure that it is included in the factors list if - it affects the computation graph. - - Provide a hash that uniquely identifies all the configs - that affect the structure of the computation - graph from input ids/embeddings to the final hidden states, - excluding anything before input ids/embeddings and after - the final hidden states. - """ - # no factors to consider. - # this config will not affect the computation graph. 
- factors: list[Any] = [] - hash_str = hashlib.md5(str(factors).encode(), - usedforsecurity=False).hexdigest() - return hash_str - - def __post_init__(self): - - if self.max_prompt_adapters < 1: - raise ValueError(f"max_prompt_adapters " - f"({self.max_prompt_adapters}) must be >= 1.") - if self.max_prompt_adapter_token == 0: - raise ValueError("max_prompt_adapter_token must be set.") - if self.max_cpu_prompt_adapters is None: - self.max_cpu_prompt_adapters = self.max_prompt_adapters - - def verify_with_model_config(self, model_config: ModelConfig): - if self.prompt_adapter_dtype == "auto": - self.prompt_adapter_dtype = model_config.dtype - elif isinstance(self.prompt_adapter_dtype, str): - self.prompt_adapter_dtype = getattr(torch, - self.prompt_adapter_dtype) - - @config @dataclass class MultiModalConfig: @@ -4431,8 +4378,6 @@ class VllmConfig: """Decoding configuration.""" observability_config: Optional[ObservabilityConfig] = None """Observability configuration.""" - prompt_adapter_config: Optional[PromptAdapterConfig] = None - """Prompt adapter configuration.""" quant_config: Optional[QuantizationConfig] = None """Quantization configuration.""" compilation_config: CompilationConfig = field( @@ -4529,10 +4474,6 @@ def compute_hash(self) -> str: vllm_factors.append(self.observability_config.compute_hash()) else: vllm_factors.append("None") - if self.prompt_adapter_config: - vllm_factors.append(self.prompt_adapter_config.compute_hash()) - else: - vllm_factors.append("None") if self.quant_config: pass # should be captured by model_config.quantization if self.compilation_config: @@ -4640,9 +4581,6 @@ def __post_init__(self): if self.lora_config is not None: self.lora_config.verify_with_cache_config(self.cache_config) self.lora_config.verify_with_model_config(self.model_config) - if self.prompt_adapter_config is not None: - self.prompt_adapter_config.verify_with_model_config( - self.model_config) if self.quant_config is None and self.model_config is not None: self.quant_config = VllmConfig._get_quantization_config( diff --git a/vllm/core/scheduler.py b/vllm/core/scheduler.py index 0ef0396996b..61346da145b 100644 --- a/vllm/core/scheduler.py +++ b/vllm/core/scheduler.py @@ -15,7 +15,6 @@ from vllm.core.interfaces import AllocStatus, BlockSpaceManager from vllm.logger import init_logger from vllm.lora.request import LoRARequest -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sequence import (Sequence, SequenceData, SequenceGroup, SequenceGroupBase, SequenceGroupMetadata, SequenceGroupMetadataDelta, SequenceStage, @@ -165,8 +164,6 @@ def __post_init__(self): if self.num_loras > 0: self._sort_by_lora_ids() - self.num_prompt_adapters: int = len(self.prompt_adapter_requests) - def is_empty(self) -> bool: # NOTE: We do not consider the ignored sequence groups. 
return (not self.scheduled_seq_groups and not self.blocks_to_swap_in @@ -194,14 +191,6 @@ def lora_requests(self) -> Set[LoRARequest]: if g.seq_group.lora_request is not None } - @property - def prompt_adapter_requests(self) -> Set[PromptAdapterRequest]: - return { - g.seq_group.prompt_adapter_request - for g in self.scheduled_seq_groups - if g.seq_group.prompt_adapter_request is not None - } - @dataclass class SchedulerRunningOutputs: @@ -1648,7 +1637,6 @@ def schedule( multi_modal_placeholders=( seq_group.multi_modal_placeholders if scheduler_outputs.num_prefill_groups > 0 else None), - prompt_adapter_request=seq_group.prompt_adapter_request, ) else: # When SPMD mode is enabled, we only send delta data except for diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 4a5efd40241..62792fade4e 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -30,9 +30,9 @@ LogprobsMode, LoRAConfig, ModelConfig, ModelDType, ModelImpl, MultiModalConfig, ObservabilityConfig, ParallelConfig, PoolerConfig, PrefixCachingHashAlgo, - PromptAdapterConfig, SchedulerConfig, SchedulerPolicy, - SpeculativeConfig, TaskOption, TokenizerMode, - VllmConfig, get_attr_docs, get_field) + SchedulerConfig, SchedulerPolicy, SpeculativeConfig, + TaskOption, TokenizerMode, VllmConfig, get_attr_docs, + get_field) from vllm.logger import init_logger from vllm.platforms import CpuArchEnum, current_platform from vllm.plugins import load_general_plugins @@ -358,11 +358,6 @@ class EngineArgs: max_cpu_loras: Optional[int] = LoRAConfig.max_cpu_loras lora_dtype: Optional[Union[str, torch.dtype]] = LoRAConfig.lora_dtype lora_extra_vocab_size: int = LoRAConfig.lora_extra_vocab_size - # PromptAdapter fields - enable_prompt_adapter: bool = False - max_prompt_adapters: int = PromptAdapterConfig.max_prompt_adapters - max_prompt_adapter_token: int = \ - PromptAdapterConfig.max_prompt_adapter_token num_scheduler_steps: int = SchedulerConfig.num_scheduler_steps multi_step_stream_outputs: bool = SchedulerConfig.multi_step_stream_outputs @@ -437,6 +432,8 @@ class EngineArgs: ParallelConfig.enable_multimodal_encoder_data_parallel async_scheduling: bool = SchedulerConfig.async_scheduling + # DEPRECATED + enable_prompt_adapter: bool = False def __post_init__(self): # support `EngineArgs(compilation_config={...})` @@ -729,23 +726,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: lora_group.add_argument("--default-mm-loras", **lora_kwargs["default_mm_loras"]) - # PromptAdapter related configs - prompt_adapter_kwargs = get_kwargs(PromptAdapterConfig) - prompt_adapter_group = parser.add_argument_group( - title="PromptAdapterConfig", - description=PromptAdapterConfig.__doc__, - ) - prompt_adapter_group.add_argument( - "--enable-prompt-adapter", - action=argparse.BooleanOptionalAction, - help="If True, enable handling of PromptAdapters.") - prompt_adapter_group.add_argument( - "--max-prompt-adapters", - **prompt_adapter_kwargs["max_prompt_adapters"]) - prompt_adapter_group.add_argument( - "--max-prompt-adapter-token", - **prompt_adapter_kwargs["max_prompt_adapter_token"]) - # Speculative arguments speculative_group = parser.add_argument_group( title="SpeculativeConfig", @@ -850,6 +830,12 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: parser.add_argument('--disable-log-stats', action='store_true', help='Disable logging statistics.') + parser.add_argument('--enable-prompt-adapter', + action='store_true', + deprecated=True, + help='[DEPRECATED] Prompt adapter 
has been ' + 'removed. Setting this flag to True or False' + ' has no effect on vLLM behavior.') return parser @@ -1234,11 +1220,6 @@ def create_engine_config( load_config = self.create_load_config() - prompt_adapter_config = PromptAdapterConfig( - max_prompt_adapters=self.max_prompt_adapters, - max_prompt_adapter_token=self.max_prompt_adapter_token) \ - if self.enable_prompt_adapter else None - decoding_config = DecodingConfig( backend=self.guided_decoding_backend, disable_fallback=self.guided_decoding_disable_fallback, @@ -1266,7 +1247,6 @@ def create_engine_config( load_config=load_config, decoding_config=decoding_config, observability_config=observability_config, - prompt_adapter_config=prompt_adapter_config, compilation_config=self.compilation_config, kv_transfer_config=self.kv_transfer_config, kv_events_config=self.kv_events_config, @@ -1342,12 +1322,6 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: recommend_to_remove=False) return False - # No Prompt Adapter so far. - if self.enable_prompt_adapter: - _raise_or_fallback(feature_name="--enable-prompt-adapter", - recommend_to_remove=False) - return False - # No text embedding inputs so far. if self.enable_prompt_embeds: _raise_or_fallback(feature_name="--enable-prompt-embeds", @@ -1469,7 +1443,6 @@ def _set_default_args_v0(self, model_config: ModelConfig) -> None: if (is_gpu and not use_sliding_window and not use_spec_decode and not self.enable_lora - and not self.enable_prompt_adapter and model_config.runner_type != "pooling"): self.enable_chunked_prefill = True logger.warning( diff --git a/vllm/engine/async_llm_engine.py b/vllm/engine/async_llm_engine.py index 06ae2a2f18f..39642d89167 100644 --- a/vllm/engine/async_llm_engine.py +++ b/vllm/engine/async_llm_engine.py @@ -29,7 +29,6 @@ from vllm.model_executor.layers.sampler import SamplerOutput from vllm.outputs import PoolingRequestOutput, RequestOutput from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import SamplingParams from vllm.sequence import ExecuteModelRequest from vllm.transformers_utils.tokenizer import AnyTokenizer @@ -435,7 +434,6 @@ async def add_request_async( arrival_time: Optional[float] = None, lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, data_parallel_rank: Optional[int] = None, tokenization_kwargs: Optional[dict[str, Any]] = None, @@ -468,7 +466,6 @@ async def add_request_async( processed_inputs = await self.input_preprocessor.preprocess_async( prompt, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, tokenization_kwargs=tokenization_kwargs, ) @@ -491,7 +488,6 @@ async def add_request_async( params=params, arrival_time=arrival_time, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, trace_headers=trace_headers, priority=priority, ) @@ -861,7 +857,6 @@ async def add_request( arrival_time: Optional[float] = None, lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, data_parallel_rank: Optional[int] = None, tokenization_kwargs: Optional[dict[str, Any]] = None, @@ -889,7 +884,6 @@ async def add_request( arrival_time=arrival_time or time.time(), lora_request=lora_request, trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, 
priority=priority, data_parallel_rank=data_parallel_rank, tokenization_kwargs=tokenization_kwargs, @@ -904,7 +898,6 @@ async def generate( request_id: str, lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, data_parallel_rank: Optional[int] = None, ) -> AsyncGenerator[RequestOutput, None]: @@ -922,8 +915,6 @@ async def generate( request_id: The unique id of the request. lora_request: LoRA request to use for generation, if any. trace_headers: OpenTelemetry trace headers. - prompt_adapter_request: Prompt Adapter request to use - for generation, if any. priority: The priority of the request. Only applicable with priority scheduling. data_parallel_rank: The (global) data parallel rank that must @@ -983,7 +974,6 @@ async def generate( sampling_params, lora_request=lora_request, trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, priority=priority, data_parallel_rank=data_parallel_rank, ): diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py index 3081995e693..e7919d90442 100644 --- a/vllm/engine/llm_engine.py +++ b/vllm/engine/llm_engine.py @@ -44,7 +44,6 @@ from vllm.outputs import (PoolingRequestOutput, RequestOutput, RequestOutputFactory) from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import RequestOutputKind, SamplingParams from vllm.sequence import (ExecuteModelRequest, ParallelSampleSequenceGroup, PoolingSequenceGroupOutput, Sequence, SequenceGroup, @@ -223,7 +222,6 @@ def __init__( self.load_config = vllm_config.load_config self.decoding_config = vllm_config.decoding_config or DecodingConfig( # noqa ) - self.prompt_adapter_config = vllm_config.prompt_adapter_config # noqa self.observability_config = vllm_config.observability_config or ObservabilityConfig( # noqa ) @@ -294,8 +292,6 @@ def get_tokenizer_for_seq(sequence: Sequence) -> AnyTokenizer: # Feature flags "enable_lora": bool(self.lora_config), - "enable_prompt_adapter": - bool(self.prompt_adapter_config), "enable_prefix_caching": self.cache_config.enable_prefix_caching, "enforce_eager": @@ -542,9 +538,6 @@ def _verify_args(self) -> None: self.lora_config.verify_with_model_config(self.model_config) self.lora_config.verify_with_scheduler_config( self.scheduler_config) - if self.prompt_adapter_config: - self.prompt_adapter_config.verify_with_model_config( - self.model_config) def _add_processed_request( self, @@ -553,7 +546,6 @@ def _add_processed_request( params: Union[SamplingParams, PoolingParams], arrival_time: float, lora_request: Optional[LoRARequest], - prompt_adapter_request: Optional[PromptAdapterRequest], trace_headers: Optional[Mapping[str, str]] = None, priority: int = 0, ) -> Optional[SequenceGroup]: @@ -569,7 +561,6 @@ def _add_processed_request( arrival_time=arrival_time, lora_request=lora_request, trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, priority=priority, ) return None @@ -583,11 +574,10 @@ def _add_processed_request( encoder_inputs, decoder_inputs = split_enc_dec_inputs(processed_inputs) seq = Sequence(seq_id, decoder_inputs, block_size, eos_token_id, - lora_request, prompt_adapter_request) + lora_request) encoder_seq = (None if encoder_inputs is None else Sequence( - seq_id, encoder_inputs, block_size, eos_token_id, lora_request, - prompt_adapter_request)) + seq_id, encoder_inputs, block_size, eos_token_id, lora_request)) # 
Create a SequenceGroup based on SamplingParams or PoolingParams if isinstance(params, SamplingParams): @@ -598,7 +588,6 @@ def _add_processed_request( arrival_time=arrival_time, lora_request=lora_request, trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, encoder_seq=encoder_seq, priority=priority) elif isinstance(params, PoolingParams): @@ -608,7 +597,6 @@ def _add_processed_request( params, arrival_time=arrival_time, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, encoder_seq=encoder_seq, priority=priority) else: @@ -637,7 +625,6 @@ def add_request( lora_request: Optional[LoRARequest] = None, tokenization_kwargs: Optional[dict[str, Any]] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, ) -> None: """Add a request to the engine's request pool. @@ -658,7 +645,6 @@ def add_request( the current monotonic time. lora_request: The LoRA request to add. trace_headers: OpenTelemetry trace headers. - prompt_adapter_request: The prompt adapter request to add. priority: The priority of the request. Only applicable with priority scheduling. @@ -719,7 +705,6 @@ def add_request( prompt, tokenization_kwargs=tokenization_kwargs, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, ) self._add_processed_request( @@ -728,7 +713,6 @@ def add_request( params=params, arrival_time=arrival_time, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, trace_headers=trace_headers, priority=priority, ) @@ -741,7 +725,6 @@ def _create_sequence_group_with_sampling( arrival_time: float, lora_request: Optional[LoRARequest], trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, encoder_seq: Optional[Sequence] = None, priority: int = 0, ) -> SequenceGroup: @@ -769,17 +752,15 @@ def _create_sequence_group_with_sampling( if self.vllm_config.speculative_config is not None: draft_size = \ self.vllm_config.speculative_config.num_speculative_tokens + 1 - seq_group = SequenceGroup( - request_id=request_id, - seqs=[seq], - arrival_time=arrival_time, - sampling_params=sampling_params, - lora_request=lora_request, - trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, - encoder_seq=encoder_seq, - priority=priority, - draft_size=draft_size) + seq_group = SequenceGroup(request_id=request_id, + seqs=[seq], + arrival_time=arrival_time, + sampling_params=sampling_params, + lora_request=lora_request, + trace_headers=trace_headers, + encoder_seq=encoder_seq, + priority=priority, + draft_size=draft_size) return seq_group @@ -790,7 +771,6 @@ def _create_sequence_group_with_pooling( pooling_params: PoolingParams, arrival_time: float, lora_request: Optional[LoRARequest], - prompt_adapter_request: Optional[PromptAdapterRequest], encoder_seq: Optional[Sequence] = None, priority: int = 0, ) -> SequenceGroup: @@ -798,15 +778,13 @@ def _create_sequence_group_with_pooling( # Defensive copy of PoolingParams, which are used by the pooler pooling_params = pooling_params.clone() # Create the sequence group. 
- seq_group = SequenceGroup( - request_id=request_id, - seqs=[seq], - arrival_time=arrival_time, - lora_request=lora_request, - pooling_params=pooling_params, - prompt_adapter_request=prompt_adapter_request, - encoder_seq=encoder_seq, - priority=priority) + seq_group = SequenceGroup(request_id=request_id, + seqs=[seq], + arrival_time=arrival_time, + lora_request=lora_request, + pooling_params=pooling_params, + encoder_seq=encoder_seq, + priority=priority) return seq_group def abort_request(self, request_id: Union[str, Iterable[str]]) -> None: @@ -1834,16 +1812,6 @@ def list_loras(self) -> Set[int]: def pin_lora(self, lora_id: int) -> bool: return self.model_executor.pin_lora(lora_id) - def add_prompt_adapter( - self, prompt_adapter_request: PromptAdapterRequest) -> bool: - return self.model_executor.add_prompt_adapter(prompt_adapter_request) - - def remove_prompt_adapter(self, prompt_adapter_id: int) -> bool: - return self.model_executor.remove_prompt_adapter(prompt_adapter_id) - - def list_prompt_adapters(self) -> List[int]: - return self.model_executor.list_prompt_adapters() - def start_profile(self) -> None: self.model_executor.start_profile() diff --git a/vllm/engine/multiprocessing/__init__.py b/vllm/engine/multiprocessing/__init__.py index db968cd6b5d..ff0405d2f84 100644 --- a/vllm/engine/multiprocessing/__init__.py +++ b/vllm/engine/multiprocessing/__init__.py @@ -10,7 +10,6 @@ from vllm.inputs import PromptType from vllm.lora.request import LoRARequest from vllm.outputs import RequestOutput -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import SamplingParams from vllm.utils import Device @@ -33,7 +32,6 @@ class RPCProcessRequest: request_id: str lora_request: Optional[LoRARequest] = None trace_headers: Optional[Mapping[str, str]] = None - prompt_adapter_request: Optional[PromptAdapterRequest] = None priority: int = 0 def __init__( @@ -43,7 +41,6 @@ def __init__( request_id: str, lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, ) -> None: super().__init__() @@ -53,7 +50,6 @@ def __init__( self.request_id = request_id self.lora_request = lora_request self.trace_headers = trace_headers - self.prompt_adapter_request = prompt_adapter_request self.priority = priority diff --git a/vllm/engine/multiprocessing/client.py b/vllm/engine/multiprocessing/client.py index 9e018ec7f34..67d9a3bf6ce 100644 --- a/vllm/engine/multiprocessing/client.py +++ b/vllm/engine/multiprocessing/client.py @@ -45,7 +45,6 @@ from vllm.lora.request import LoRARequest from vllm.model_executor.layers.sampler import SamplerOutput from vllm.outputs import PoolingRequestOutput, RequestOutput -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import SamplingParams from vllm.transformers_utils.tokenizer_group import init_tokenizer_from_configs from vllm.utils import Device @@ -448,7 +447,6 @@ def generate( request_id: str, lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, ) -> AsyncGenerator[RequestOutput, None]: """Generate outputs for a request. @@ -465,8 +463,6 @@ def generate( request_id: The unique id of the request. lora_request: LoRA request to use for generation, if any. trace_headers: OpenTelemetry trace headers. 
- prompt_adapter_request: Prompt Adapter request to use - for generation, if any. priority: Priority of the request (lower means earlier handling). Any priority other than 0 will lead to an error if the scheduling policy is not "priority". @@ -474,8 +470,7 @@ def generate( return cast( AsyncGenerator[RequestOutput, None], self._process_request(prompt, sampling_params, request_id, - lora_request, trace_headers, - prompt_adapter_request, priority)) + lora_request, trace_headers, priority)) def encode( self, @@ -521,7 +516,6 @@ async def _process_request( request_id: str, lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, ) -> Union[AsyncGenerator[RequestOutput, None], AsyncGenerator[ PoolingRequestOutput, None]]: @@ -575,7 +569,6 @@ async def _process_request( request_id=request_id, lora_request=lora_request, trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, priority=priority, )) diff --git a/vllm/engine/multiprocessing/engine.py b/vllm/engine/multiprocessing/engine.py index ef088bd3933..fe6eb0d8c2f 100644 --- a/vllm/engine/multiprocessing/engine.py +++ b/vllm/engine/multiprocessing/engine.py @@ -304,14 +304,12 @@ def _handle_process_request(self, request: RPCProcessRequest): self._send_outputs(rpc_err) try: - self.engine.add_request( - request_id=request_id, - prompt=request.prompt, - params=request.params, - lora_request=request.lora_request, - trace_headers=request.trace_headers, - prompt_adapter_request=request.prompt_adapter_request, - priority=request.priority) + self.engine.add_request(request_id=request_id, + prompt=request.prompt, + params=request.params, + lora_request=request.lora_request, + trace_headers=request.trace_headers, + priority=request.priority) if self.log_requests: logger.info("Added request %s.", request.request_id) diff --git a/vllm/engine/protocol.py b/vllm/engine/protocol.py index f5cc9c47405..671e9648a3d 100644 --- a/vllm/engine/protocol.py +++ b/vllm/engine/protocol.py @@ -16,7 +16,6 @@ from vllm.model_executor.layers.sampler import SamplerOutput from vllm.outputs import CompletionOutput, PoolingRequestOutput, RequestOutput from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import BeamSearchParams, SamplingParams from vllm.transformers_utils.tokenizer import AnyTokenizer from vllm.utils import Device, collect_from_async_generator, random_uuid @@ -55,7 +54,6 @@ def generate( request_id: str, lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, ) -> AsyncGenerator[RequestOutput, None]: """Generate outputs for a request.""" diff --git a/vllm/entrypoints/llm.py b/vllm/entrypoints/llm.py index c4f1b3b8661..2f766a2dae5 100644 --- a/vllm/entrypoints/llm.py +++ b/vllm/entrypoints/llm.py @@ -45,7 +45,6 @@ PoolingRequestOutput, RequestOutput, ScoringRequestOutput) from vllm.pooling_params import PoolingParams, PoolingTask -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import (BeamSearchParams, GuidedDecodingParams, RequestOutputKind, SamplingParams) from vllm.transformers_utils.tokenizer import (AnyTokenizer, MistralTokenizer, @@ -314,7 +313,6 @@ def generate( *, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] 
= None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, guided_options_request: Optional[Union[LLMGuidedOptions, GuidedDecodingRequest]] = None, ) -> list[RequestOutput]: @@ -330,7 +328,6 @@ def generate( prompt_token_ids: Optional[list[int]] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, guided_options_request: Optional[Union[LLMGuidedOptions, GuidedDecodingRequest]] = None, ) -> list[RequestOutput]: @@ -346,7 +343,6 @@ def generate( prompt_token_ids: Optional[list[list[int]]] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, guided_options_request: Optional[Union[LLMGuidedOptions, GuidedDecodingRequest]] = None, ) -> list[RequestOutput]: @@ -363,7 +359,6 @@ def generate( prompt_token_ids: list[int], use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, guided_options_request: Optional[Union[LLMGuidedOptions, GuidedDecodingRequest]] = None, ) -> list[RequestOutput]: @@ -380,7 +375,6 @@ def generate( prompt_token_ids: list[list[int]], use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, guided_options_request: Optional[Union[LLMGuidedOptions, GuidedDecodingRequest]] = None, ) -> list[RequestOutput]: @@ -395,7 +389,6 @@ def generate( prompt_token_ids: Union[list[int], list[list[int]]], use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, guided_options_request: Optional[Union[LLMGuidedOptions, GuidedDecodingRequest]] = None, ) -> list[RequestOutput]: @@ -415,7 +408,6 @@ def generate( prompt_token_ids: Optional[Union[list[int], list[list[int]]]] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, guided_options_request: Optional[Union[LLMGuidedOptions, GuidedDecodingRequest]] = None, priority: Optional[list[int]] = None, @@ -440,8 +432,6 @@ def generate( it is used to create the progress bar. If `False`, no progress bar is created. lora_request: LoRA request to use for generation, if any. - prompt_adapter_request: Prompt Adapter request to use for - generation, if any. priority: The priority of the requests, if any. Only applicable when priority scheduling policy is enabled. 
@@ -507,7 +497,6 @@ def generate( params=sampling_params, use_tqdm=use_tqdm, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, guided_options=guided_options_request, tokenization_kwargs=tokenization_kwargs, priority=priority, @@ -963,7 +952,6 @@ def encode( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: @@ -980,7 +968,6 @@ def encode( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: @@ -997,7 +984,6 @@ def encode( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: @@ -1015,7 +1001,6 @@ def encode( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: @@ -1033,7 +1018,6 @@ def encode( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: @@ -1049,7 +1033,6 @@ def encode( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: @@ -1070,7 +1053,6 @@ def encode( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, pooling_task: PoolingTask = "encode", tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: @@ -1092,8 +1074,6 @@ def encode( it is used to create the progress bar. If `False`, no progress bar is created. lora_request: LoRA request to use for generation, if any. - prompt_adapter_request: Prompt Adapter request to use for - generation, if any. pooling_task: Override the pooling task to use. 
Returns: @@ -1150,7 +1130,6 @@ def encode( use_tqdm=use_tqdm, lora_request=lora_request, tokenization_kwargs=tokenization_kwargs, - prompt_adapter_request=prompt_adapter_request, ) outputs = self._run_engine(use_tqdm=use_tqdm) @@ -1167,7 +1146,6 @@ def embed( pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, ) -> list[EmbeddingRequestOutput]: """ Generate an embedding vector for each prompt. @@ -1187,8 +1165,6 @@ def embed( it is used to create the progress bar. If `False`, no progress bar is created. lora_request: LoRA request to use for generation, if any. - prompt_adapter_request: Prompt Adapter request to use for - generation, if any. Returns: A list of `EmbeddingRequestOutput` objects containing the @@ -1205,7 +1181,6 @@ def embed( use_tqdm=use_tqdm, pooling_params=pooling_params, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, pooling_task="embed", ) @@ -1218,7 +1193,6 @@ def classify( *, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, ) -> list[ClassificationRequestOutput]: """ Generate class logits for each prompt. @@ -1236,8 +1210,6 @@ def classify( it is used to create the progress bar. If `False`, no progress bar is created. lora_request: LoRA request to use for generation, if any. - prompt_adapter_request: Prompt Adapter request to use for - generation, if any. Returns: A list of `ClassificationRequestOutput` objects containing the @@ -1253,7 +1225,6 @@ def classify( prompts, use_tqdm=use_tqdm, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, pooling_task="classify", ) @@ -1267,7 +1238,6 @@ def _embedding_score( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, ) -> list[ScoringRequestOutput]: encoded_output: list[PoolingRequestOutput] = self.encode( @@ -1275,7 +1245,6 @@ def _embedding_score( truncate_prompt_tokens=truncate_prompt_tokens, use_tqdm=use_tqdm, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, pooling_task="embed", ) @@ -1303,7 +1272,6 @@ def _cross_encoding_score( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, ) -> list[ScoringRequestOutput]: if isinstance(tokenizer, MistralTokenizer): @@ -1361,7 +1329,6 @@ def _cross_encoding_score( params=pooling_params, use_tqdm=use_tqdm, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, ) outputs = self._run_engine(use_tqdm=use_tqdm) @@ -1381,7 +1348,6 @@ def score( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, ) -> list[ScoringRequestOutput]: """Generate similarity scores for all pairs `` or ``. @@ -1412,8 +1378,6 @@ def score( it is used to create the progress bar. If `False`, no progress bar is created. lora_request: LoRA request to use for generation, if any. 
- prompt_adapter_request: Prompt Adapter request to use for - generation, if any. Returns: A list of `ScoringRequestOutput` objects containing the @@ -1504,8 +1468,7 @@ def ensure_str(prompt: SingletonPrompt): data_2, # type: ignore[arg-type] truncate_prompt_tokens, use_tqdm, - lora_request, - prompt_adapter_request) + lora_request) else: return self._embedding_score( tokenizer, @@ -1513,8 +1476,7 @@ def ensure_str(prompt: SingletonPrompt): data_2, # type: ignore[arg-type] truncate_prompt_tokens, use_tqdm, - lora_request, - prompt_adapter_request) + lora_request) def start_profile(self) -> None: self.llm_engine.start_profile() @@ -1625,7 +1587,6 @@ def _validate_and_add_requests( *, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[Sequence[LoRARequest], LoRARequest]], - prompt_adapter_request: Optional[PromptAdapterRequest], tokenization_kwargs: Optional[dict[str, Any]] = None, guided_options: Optional[GuidedDecodingRequest] = None, priority: Optional[list[int]] = None, @@ -1671,7 +1632,6 @@ def _validate_and_add_requests( tokenization_kwargs=tokenization_kwargs, lora_request=lora_request[i] if isinstance( lora_request, Sequence) else lora_request, - prompt_adapter_request=prompt_adapter_request, priority=priority[i] if priority else 0, ) @@ -1681,7 +1641,6 @@ def _add_request( params: Union[SamplingParams, PoolingParams], tokenization_kwargs: Optional[dict[str, Any]] = None, lora_request: Optional[LoRARequest] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, ) -> None: request_id = str(next(self.request_counter)) @@ -1691,7 +1650,6 @@ def _add_request( params, lora_request=lora_request, tokenization_kwargs=tokenization_kwargs, - prompt_adapter_request=prompt_adapter_request, priority=priority, ) diff --git a/vllm/entrypoints/logger.py b/vllm/entrypoints/logger.py index f3aee188dae..06ff3b417f8 100644 --- a/vllm/entrypoints/logger.py +++ b/vllm/entrypoints/logger.py @@ -8,7 +8,6 @@ from vllm.logger import init_logger from vllm.lora.request import LoRARequest from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import BeamSearchParams, SamplingParams logger = init_logger(__name__) @@ -30,7 +29,6 @@ def log_inputs( params: Optional[Union[SamplingParams, PoolingParams, BeamSearchParams]], lora_request: Optional[LoRARequest], - prompt_adapter_request: Optional[PromptAdapterRequest], ) -> None: max_log_len = self.max_log_len if max_log_len is not None: @@ -44,7 +42,6 @@ def log_inputs( "Received request %s: prompt: %r, " "params: %s, prompt_token_ids: %s, " "prompt_embeds shape: %s, " - "lora_request: %s, prompt_adapter_request: %s.", request_id, - prompt, params, prompt_token_ids, + "lora_request: %s.", request_id, prompt, params, prompt_token_ids, prompt_embeds.shape if prompt_embeds is not None else None, - lora_request, prompt_adapter_request) + lora_request) diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 57240bb4f33..d4135519aa4 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1620,7 +1620,6 @@ async def init_app_state( model_config=model_config, base_model_paths=base_model_paths, lora_modules=lora_modules, - prompt_adapters=args.prompt_adapters, ) await state.openai_serving_models.init_static_loras() state.openai_serving_responses = OpenAIServingResponses( diff --git a/vllm/entrypoints/openai/cli_args.py 
b/vllm/entrypoints/openai/cli_args.py index 28857f8caef..b1814866664 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -20,8 +20,7 @@ from vllm.engine.arg_utils import AsyncEngineArgs, optional_type from vllm.entrypoints.chat_utils import (ChatTemplateContentFormatOption, validate_chat_template) -from vllm.entrypoints.openai.serving_models import (LoRAModulePath, - PromptAdapterPath) +from vllm.entrypoints.openai.serving_models import LoRAModulePath from vllm.entrypoints.openai.tool_parsers import ToolParserManager from vllm.logger import init_logger from vllm.utils import FlexibleArgumentParser @@ -65,27 +64,6 @@ def __call__( setattr(namespace, self.dest, lora_list) -class PromptAdapterParserAction(argparse.Action): - - def __call__( - self, - parser: argparse.ArgumentParser, - namespace: argparse.Namespace, - values: Optional[Union[str, Sequence[str]]], - option_string: Optional[str] = None, - ): - if values is None: - values = [] - if isinstance(values, str): - raise TypeError("Expected values to be a list") - - adapter_list: list[PromptAdapterPath] = [] - for item in values: - name, path = item.split('=') - adapter_list.append(PromptAdapterPath(name, path)) - setattr(namespace, self.dest, adapter_list) - - @config @dataclass class FrontendArgs: @@ -115,9 +93,6 @@ class FrontendArgs: or JSON list format. Example (old format): `'name=path'` Example (new format): `{\"name\": \"name\", \"path\": \"lora_path\", \"base_model_name\": \"id\"}`""" - prompt_adapters: Optional[list[PromptAdapterPath]] = None - """Prompt adapter configurations in the format name=path. Multiple adapters - can be specified.""" chat_template: Optional[str] = None """The file path to the chat template, or the template in single-line form for the specified model.""" @@ -207,12 +182,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: frontend_kwargs["lora_modules"]["type"] = optional_type(str) frontend_kwargs["lora_modules"]["action"] = LoRAParserAction - # Special case: Prompt adapters need custom parser action and - # optional_type(str) - frontend_kwargs["prompt_adapters"]["type"] = optional_type(str) - frontend_kwargs["prompt_adapters"][ - "action"] = PromptAdapterParserAction - # Special case: Middleware needs append action frontend_kwargs["middleware"]["action"] = "append" frontend_kwargs["middleware"]["type"] = str @@ -288,9 +257,6 @@ def validate_parsed_serve_args(args: argparse.Namespace): if args.enable_auto_tool_choice and not args.tool_call_parser: raise TypeError("Error: --enable-auto-tool-choice requires " "--tool-call-parser") - if args.enable_prompt_embeds and args.enable_prompt_adapter: - raise ValueError( - "Cannot use prompt embeds and prompt adapter at the same time.") def log_non_default_args(args: argparse.Namespace): diff --git a/vllm/entrypoints/openai/run_batch.py b/vllm/entrypoints/openai/run_batch.py index 3dc5826909a..ef5bf6f9a81 100644 --- a/vllm/entrypoints/openai/run_batch.py +++ b/vllm/entrypoints/openai/run_batch.py @@ -337,7 +337,6 @@ async def main(args): model_config=model_config, base_model_paths=base_model_paths, lora_modules=None, - prompt_adapters=None, ) openai_serving_chat = OpenAIServingChat( engine, diff --git a/vllm/entrypoints/openai/serving_chat.py b/vllm/entrypoints/openai/serving_chat.py index a5eb16a5397..33d80743420 100644 --- a/vllm/entrypoints/openai/serving_chat.py +++ b/vllm/entrypoints/openai/serving_chat.py @@ -147,11 +147,8 @@ async def create_chat_completion( raise 
self.engine_client.dead_error try: - ( - lora_request, - prompt_adapter_request, - ) = self._maybe_get_adapters(request, - supports_default_mm_loras=True) + lora_request = self._maybe_get_adapters( + request, supports_default_mm_loras=True) model_name = self._get_model_name(request.model, lora_request) @@ -239,8 +236,7 @@ async def create_chat_completion( self._log_inputs(request_id, request_prompts[i], params=sampling_params, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + lora_request=lora_request) trace_headers = (None if raw_request is None else await self._get_trace_headers(raw_request.headers)) @@ -259,7 +255,6 @@ async def create_chat_completion( request_id, lora_request=lora_request, trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, priority=request.priority, ) diff --git a/vllm/entrypoints/openai/serving_classification.py b/vllm/entrypoints/openai/serving_classification.py index e4ea5ab8dc5..377f7f68471 100644 --- a/vllm/entrypoints/openai/serving_classification.py +++ b/vllm/entrypoints/openai/serving_classification.py @@ -49,19 +49,11 @@ async def _preprocess( return None try: - ( - ctx.lora_request, - ctx.prompt_adapter_request, - ) = self._maybe_get_adapters(ctx.request) + ctx.lora_request = self._maybe_get_adapters(ctx.request) ctx.tokenizer = await self.engine_client.get_tokenizer( ctx.lora_request) - if ctx.prompt_adapter_request is not None: - raise NotImplementedError( - "Prompt adapter is not supported for classification models" - ) - ( ctx.request_prompts, ctx.engine_prompts, diff --git a/vllm/entrypoints/openai/serving_completion.py b/vllm/entrypoints/openai/serving_completion.py index 1e1f655022f..323795ca437 100644 --- a/vllm/entrypoints/openai/serving_completion.py +++ b/vllm/entrypoints/openai/serving_completion.py @@ -121,10 +121,7 @@ async def create_completion( raw_request.state.request_metadata = request_metadata try: - ( - lora_request, - prompt_adapter_request, - ) = self._maybe_get_adapters(request) + lora_request = self._maybe_get_adapters(request) tokenizer = await self.engine_client.get_tokenizer(lora_request) @@ -197,7 +194,6 @@ async def create_completion( request_prompts[i], params=sampling_params, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, ) trace_headers = (None if raw_request is None else await @@ -221,7 +217,6 @@ async def create_completion( sampling_params, request_id_item, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, trace_headers=trace_headers, priority=request.priority, ) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 64f432db729..a5d42f3ecf5 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -62,18 +62,11 @@ async def _preprocess( ) -> Optional[ErrorResponse]: ctx = cast(EmbeddingServeContext, ctx) try: - ( - ctx.lora_request, - ctx.prompt_adapter_request, - ) = self._maybe_get_adapters(ctx.request) + ctx.lora_request = self._maybe_get_adapters(ctx.request) tokenizer = await self.engine_client.get_tokenizer(ctx.lora_request ) - if ctx.prompt_adapter_request is not None: - raise NotImplementedError("Prompt adapter is not supported " - "for embedding models") - if isinstance(ctx.request, EmbeddingChatRequest): ( _, diff --git a/vllm/entrypoints/openai/serving_engine.py b/vllm/entrypoints/openai/serving_engine.py index 14bcbafc6ab..7b230703d86 100644 --- a/vllm/entrypoints/openai/serving_engine.py +++ 
b/vllm/entrypoints/openai/serving_engine.py @@ -69,7 +69,6 @@ MultiModalDataDict) from vllm.outputs import PoolingRequestOutput, RequestOutput from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import BeamSearchParams, SamplingParams from vllm.sequence import Logprob, PromptLogprobs from vllm.tracing import (contains_trace_headers, extract_trace_headers, @@ -162,7 +161,6 @@ class ServeContext(RequestProcessingMixin, ResponseGenerationMixin, BaseModel, request_id: str created_time: int = Field(default_factory=lambda: int(time.time())) lora_request: Optional[LoRARequest] = None - prompt_adapter_request: Optional[PromptAdapterRequest] = None # Shared across most requests tokenizer: Optional[AnyTokenizer] = None @@ -344,12 +342,10 @@ async def _prepare_generators( return self.create_error_response( "Request prompts not available") - self._log_inputs( - request_id_item, - ctx.request_prompts[i], - params=pooling_params, - lora_request=ctx.lora_request, - prompt_adapter_request=ctx.prompt_adapter_request) + self._log_inputs(request_id_item, + ctx.request_prompts[i], + params=pooling_params, + lora_request=ctx.lora_request) # Mypy has an existing bug related to inferring the variance of # TypedDicts with `builtins.enumerate`: @@ -451,11 +447,6 @@ async def _check_model( if isinstance(load_result, ErrorResponse) and \ load_result.code == HTTPStatus.BAD_REQUEST.value: error_response = load_result - if request.model in [ - prompt_adapter.prompt_adapter_name - for prompt_adapter in self.models.prompt_adapter_requests - ]: - return None return error_response or self.create_error_response( message=f"The model `{request.model}` does not exist.", @@ -490,25 +481,21 @@ def _maybe_get_adapters( self, request: AnyRequest, supports_default_mm_loras: bool = False, - ) -> Union[tuple[None, None], tuple[LoRARequest, None], tuple[ - None, PromptAdapterRequest]]: + ) -> Optional[LoRARequest]: if request.model in self.models.lora_requests: - return self.models.lora_requests[request.model], None + return self.models.lora_requests[request.model] # Currently only support default modality specific loras # if we have exactly one lora matched on the request. 
if supports_default_mm_loras: default_mm_lora = self._get_active_default_mm_loras(request) if default_mm_lora is not None: - return default_mm_lora, None + return default_mm_lora if self._is_model_supported(request.model): - return None, None + return None - for prompt_adapter in self.models.prompt_adapter_requests: - if request.model == prompt_adapter.prompt_adapter_name: - return None, prompt_adapter # if _check_model has been called earlier, this will be unreachable raise ValueError(f"The model `{request.model}` does not exist.") @@ -1011,7 +998,6 @@ def _log_inputs( params: Optional[Union[SamplingParams, PoolingParams, BeamSearchParams]], lora_request: Optional[LoRARequest], - prompt_adapter_request: Optional[PromptAdapterRequest], ) -> None: if self.request_logger is None: return @@ -1035,7 +1021,6 @@ def _log_inputs( prompt_embeds, params=params, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, ) async def _get_trace_headers( diff --git a/vllm/entrypoints/openai/serving_models.py b/vllm/entrypoints/openai/serving_models.py index bc4f523c82e..27614fcb411 100644 --- a/vllm/entrypoints/openai/serving_models.py +++ b/vllm/entrypoints/openai/serving_models.py @@ -1,8 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import json -import pathlib from asyncio import Lock from collections import defaultdict from dataclasses import dataclass @@ -19,7 +17,6 @@ from vllm.logger import init_logger from vllm.lora.request import LoRARequest from vllm.lora.resolver import LoRAResolver, LoRAResolverRegistry -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.utils import AtomicCounter logger = init_logger(__name__) @@ -31,12 +28,6 @@ class BaseModelPath: model_path: str -@dataclass -class PromptAdapterPath: - name: str - local_path: str - - @dataclass class LoRAModulePath: name: str @@ -60,7 +51,6 @@ def __init__( base_model_paths: list[BaseModelPath], *, lora_modules: Optional[list[LoRAModulePath]] = None, - prompt_adapters: Optional[list[PromptAdapterPath]] = None, ): super().__init__() @@ -81,20 +71,6 @@ def __init__( LoRAResolverRegistry.get_resolver(lora_resolver_name)) self.lora_resolver_lock: dict[str, Lock] = defaultdict(Lock) - self.prompt_adapter_requests = [] - if prompt_adapters is not None: - for i, prompt_adapter in enumerate(prompt_adapters, start=1): - with pathlib.Path(prompt_adapter.local_path, - "adapter_config.json").open() as f: - adapter_config = json.load(f) - num_virtual_tokens = adapter_config["num_virtual_tokens"] - self.prompt_adapter_requests.append( - PromptAdapterRequest( - prompt_adapter_name=prompt_adapter.name, - prompt_adapter_id=i, - prompt_adapter_local_path=prompt_adapter.local_path, - prompt_adapter_num_virtual_tokens=num_virtual_tokens)) - async def init_static_loras(self): """Loads all static LoRA modules. 
Raises if any fail to load""" @@ -141,14 +117,7 @@ async def show_available_models(self) -> ModelList: permission=[ModelPermission()]) for lora in self.lora_requests.values() ] - prompt_adapter_cards = [ - ModelCard(id=prompt_adapter.prompt_adapter_name, - root=self.base_model_paths[0].name, - permission=[ModelPermission()]) - for prompt_adapter in self.prompt_adapter_requests - ] model_cards.extend(lora_cards) - model_cards.extend(prompt_adapter_cards) return ModelList(data=model_cards) async def load_lora_adapter( diff --git a/vllm/entrypoints/openai/serving_pooling.py b/vllm/entrypoints/openai/serving_pooling.py index eec21087b99..12334cdac36 100644 --- a/vllm/entrypoints/openai/serving_pooling.py +++ b/vllm/entrypoints/openai/serving_pooling.py @@ -94,17 +94,10 @@ async def create_pooling( try: truncate_prompt_tokens = _validate_truncation_size( self.max_model_len, truncate_prompt_tokens) - ( - lora_request, - prompt_adapter_request, - ) = self._maybe_get_adapters(request) + lora_request = self._maybe_get_adapters(request) tokenizer = await self.engine_client.get_tokenizer(lora_request) - if prompt_adapter_request is not None: - raise NotImplementedError("Prompt adapter is not supported " - "for pooling models") - if isinstance(request, PoolingChatRequest): ( _, @@ -153,8 +146,7 @@ async def create_pooling( self._log_inputs(request_id_item, request_prompts[i], params=pooling_params, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + lora_request=lora_request) trace_headers = (None if raw_request is None else await self._get_trace_headers(raw_request.headers)) diff --git a/vllm/entrypoints/openai/serving_responses.py b/vllm/entrypoints/openai/serving_responses.py index a359371848c..64880a3a537 100644 --- a/vllm/entrypoints/openai/serving_responses.py +++ b/vllm/entrypoints/openai/serving_responses.py @@ -133,10 +133,7 @@ async def create_responses( messages = self._construct_input_messages(request, prev_response) try: - ( - lora_request, - prompt_adapter_request, - ) = self._maybe_get_adapters(request) + lora_request = self._maybe_get_adapters(request) model_name = self._get_model_name(request.model, lora_request) tokenizer = await self.engine_client.get_tokenizer(lora_request) @@ -169,8 +166,7 @@ async def create_responses( self._log_inputs(request.request_id, request_prompts[i], params=sampling_params, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + lora_request=lora_request) trace_headers = (None if raw_request is None else await self._get_trace_headers(raw_request.headers)) @@ -181,7 +177,6 @@ async def create_responses( request.request_id, lora_request=lora_request, trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, priority=request.priority, ) generators.append(generator) diff --git a/vllm/entrypoints/openai/serving_score.py b/vllm/entrypoints/openai/serving_score.py index 35f6581768a..4da2094147c 100644 --- a/vllm/entrypoints/openai/serving_score.py +++ b/vllm/entrypoints/openai/serving_score.py @@ -27,7 +27,6 @@ from vllm.logger import init_logger from vllm.lora.request import LoRARequest from vllm.outputs import PoolingRequestOutput, ScoringRequestOutput -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.transformers_utils.tokenizer import AnyTokenizer, MistralTokenizer from vllm.utils import make_async, merge_async_iterators @@ -58,8 +57,6 @@ async def _embedding_score( request_id: str, tokenization_kwargs: Optional[dict[str, Any]] = None, lora_request: 
Optional[Union[LoRARequest, None]] = None, - prompt_adapter_request: Optional[Union[PromptAdapterRequest, - None]] = None, trace_headers: Optional[Mapping[str, str]] = None, ) -> Union[list[PoolingRequestOutput], ErrorResponse]: input_texts = texts_1 + texts_2 @@ -100,8 +97,7 @@ async def _embedding_score( self._log_inputs(request_id_item, input_texts[i], params=pooling_params, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + lora_request=lora_request) generators.append( self.engine_client.encode( @@ -176,8 +172,6 @@ async def _cross_encoding_score( request_id: str, tokenization_kwargs: Optional[dict[str, Any]] = None, lora_request: Optional[Union[LoRARequest, None]] = None, - prompt_adapter_request: Optional[Union[PromptAdapterRequest, - None]] = None, trace_headers: Optional[Mapping[str, str]] = None, ) -> Union[list[PoolingRequestOutput], ErrorResponse]: request_prompts: list[str] = [] @@ -261,8 +255,7 @@ async def _cross_encoding_score( self._log_inputs(request_id_item, request_prompts[i], params=pooling_params, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + lora_request=lora_request) generator = self.engine_client.encode( engine_prompt, @@ -295,14 +288,7 @@ async def _run_scoring( raw_request: Optional[Request] = None, truncate_prompt_tokens: Optional[int] = None, ) -> Union[list[PoolingRequestOutput], ErrorResponse]: - ( - lora_request, - prompt_adapter_request, - ) = self._maybe_get_adapters(request) - - if prompt_adapter_request is not None: - raise NotImplementedError("Prompt adapter is not supported " - "for scoring models") + lora_request = self._maybe_get_adapters(request) tokenizer = await self.engine_client.get_tokenizer(lora_request) @@ -340,7 +326,6 @@ async def _run_scoring( request_id=request_id, tokenization_kwargs=tokenization_kwargs, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, trace_headers=trace_headers) else: @@ -352,7 +337,6 @@ async def _run_scoring( request_id=request_id, tokenization_kwargs=tokenization_kwargs, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, trace_headers=trace_headers) async def create_score( diff --git a/vllm/entrypoints/openai/serving_tokenization.py b/vllm/entrypoints/openai/serving_tokenization.py index 8181b36ed0b..58d72047476 100644 --- a/vllm/entrypoints/openai/serving_tokenization.py +++ b/vllm/entrypoints/openai/serving_tokenization.py @@ -60,10 +60,7 @@ async def create_tokenize( request_id = f"tokn-{self._base_request_id(raw_request)}" try: - ( - lora_request, - prompt_adapter_request, - ) = self._maybe_get_adapters(request) + lora_request = self._maybe_get_adapters(request) tokenizer = await self.engine_client.get_tokenizer(lora_request) @@ -104,11 +101,8 @@ async def create_tokenize( self._log_inputs(request_id, request_prompts[i], params=None, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) + lora_request=lora_request) - # Silently ignore prompt adapter since it does not affect - # tokenization (Unlike in Embeddings API where an error is raised) if isinstance(engine_prompt, dict) and "prompt_token_ids" in engine_prompt: input_ids.extend(engine_prompt["prompt_token_ids"]) @@ -133,21 +127,14 @@ async def create_detokenize( request_id = f"tokn-{self._base_request_id(raw_request)}" - ( - lora_request, - prompt_adapter_request, - ) = self._maybe_get_adapters(request) + lora_request = self._maybe_get_adapters(request) tokenizer = await self.engine_client.get_tokenizer(lora_request) 
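Throughout the serving entrypoints above, call sites change from unpacking a `(lora_request, prompt_adapter_request)` tuple to receiving a single `Optional[LoRARequest]` from `_maybe_get_adapters()`. The condensed sketch below restates that post-patch flow outside of any one handler; `handle_request` and its arguments are hypothetical names, not code from this patch.

```python
from typing import Optional

from vllm.lora.request import LoRARequest


async def handle_request(serving, request, request_id, prompt, params):
    # Post-patch pattern seen in serving_chat, serving_completion,
    # serving_tokenization, etc.: a single Optional[LoRARequest] is
    # resolved, passed to the tokenizer lookup, and logged.
    lora_request: Optional[LoRARequest] = serving._maybe_get_adapters(request)
    tokenizer = await serving.engine_client.get_tokenizer(lora_request)
    serving._log_inputs(request_id, prompt, params=params,
                        lora_request=lora_request)
    return tokenizer, lora_request
```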
self._log_inputs(request_id, request.tokens, params=None, - lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request) - - # Silently ignore prompt adapter since it does not affect tokenization - # (Unlike in Embeddings API where an error is raised) + lora_request=lora_request) prompt_input = await self._tokenize_prompt_input_async( request, diff --git a/vllm/entrypoints/openai/speech_to_text.py b/vllm/entrypoints/openai/speech_to_text.py index 09b346dcef6..e26e1b748b8 100644 --- a/vllm/entrypoints/openai/speech_to_text.py +++ b/vllm/entrypoints/openai/speech_to_text.py @@ -150,19 +150,12 @@ async def _create_speech_to_text( raw_request.state.request_metadata = request_metadata try: - ( - lora_request, - prompt_adapter_request, - ) = self._maybe_get_adapters(request) + lora_request = self._maybe_get_adapters(request) if lora_request: return self.create_error_response( "Currently do not support LoRA for " f"{self.task_type.title()}.") - if prompt_adapter_request: - return self.create_error_response( - f"Currently do not support PromptAdapter for " - f"{self.task_type.title()}.") prompts, duration_s = await self._preprocess_speech_to_text( request=request, @@ -188,8 +181,7 @@ async def _create_speech_to_text( # It will not display special tokens like <|startoftranscript|> request.prompt, params=sampling_params, - lora_request=None, - prompt_adapter_request=None) + lora_request=None) list_result_generator = [ self.engine_client.generate( diff --git a/vllm/executor/executor_base.py b/vllm/executor/executor_base.py index ca9f1376b9f..483fdb1486f 100644 --- a/vllm/executor/executor_base.py +++ b/vllm/executor/executor_base.py @@ -17,7 +17,6 @@ from vllm.lora.request import LoRARequest from vllm.model_executor.layers.sampler import SamplerOutput from vllm.pooling_params import PoolingTask -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sequence import ExecuteModelRequest, PoolerOutput from vllm.utils import make_async from vllm.worker.worker_base import WorkerBase @@ -50,7 +49,6 @@ def __init__( self.scheduler_config = vllm_config.scheduler_config self.device_config = vllm_config.device_config self.speculative_config = vllm_config.speculative_config - self.prompt_adapter_config = vllm_config.prompt_adapter_config self.observability_config = vllm_config.observability_config self._init_executor() self.is_sleeping = False @@ -171,35 +169,6 @@ def list_loras(self) -> Set[int]: assert s == sets[0], "All workers should have the same LORAs." return sets[0] - def add_prompt_adapter( - self, prompt_adapter_request: PromptAdapterRequest) -> bool: - assert prompt_adapter_request.prompt_adapter_id > 0, \ - "prompt_adapter_id must be greater than 0." - return all( - self.collective_rpc("add_prompt_adapter", - args=(prompt_adapter_request, ))) - - def remove_prompt_adapter(self, prompt_adapter_id: int) -> bool: - assert prompt_adapter_id > 0, \ - "prompt_adapter_id must be greater than 0." - return all( - self.collective_rpc("remove_prompt_adapter", - args=(prompt_adapter_id, ))) - - def pin_prompt_adapter(self, prompt_adapter_id: int) -> bool: - assert prompt_adapter_id > 0, \ - "prompt_adapter_id must be greater than 0." - return all( - self.collective_rpc("pin_prompt_adapter", - args=(prompt_adapter_id, ))) - - def list_prompt_adapters(self) -> Set[int]: - sets = self.collective_rpc("list_prompt_adapters") - for s in sets: - assert (s == sets[0] - ), "All workers should have the same prompt adapters." 
- return sets[0] - def start_profile(self) -> None: self.collective_rpc("start_profile") diff --git a/vllm/inputs/preprocess.py b/vllm/inputs/preprocess.py index deda9bc23da..de5dc087665 100644 --- a/vllm/inputs/preprocess.py +++ b/vllm/inputs/preprocess.py @@ -13,7 +13,6 @@ from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalRegistry from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalEncDecInputs, MultiModalInputs) -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.transformers_utils.tokenizer import AnyTokenizer from vllm.transformers_utils.tokenizer_group import TokenizerGroup @@ -168,18 +167,6 @@ def _prepare_decoder_input_ids_for_generation( return decoder_input_ids - def _apply_prompt_adapter( - self, - prompt_token_ids: list[int], - prompt_adapter_request: Optional[PromptAdapterRequest], - ) -> list[int]: - if prompt_adapter_request: - prompt_token_ids = ( - [0] * prompt_adapter_request.prompt_adapter_num_virtual_tokens - + prompt_token_ids) - - return prompt_token_ids - def _get_tokenization_kw( self, overrides: Optional[dict[str, Any]] = None, @@ -786,15 +773,10 @@ async def _process_encoder_decoder_prompt_async( def _build_decoder_only_llm_inputs( self, prompt_inputs: DecoderOnlyInputs, - prompt_adapter_request: Optional[PromptAdapterRequest], ) -> DecoderOnlyInputs: if "prompt_token_ids" in prompt_inputs: prompt_inputs = cast(Union[TokenInputs, MultiModalInputs], prompt_inputs) # Needed for mypy - prompt_inputs["prompt_token_ids"] = self._apply_prompt_adapter( - prompt_inputs["prompt_token_ids"], - prompt_adapter_request=prompt_adapter_request, - ) return prompt_inputs @@ -803,7 +785,6 @@ def _process_decoder_only_prompt( prompt: SingletonPrompt, tokenization_kwargs: Optional[dict[str, Any]] = None, lora_request: Optional[LoRARequest] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, return_mm_hashes: bool = False, ) -> DecoderOnlyInputs: """ @@ -815,7 +796,6 @@ def _process_decoder_only_prompt( * prompt: input prompt * lora_request - * prompt_adapter_request * return_mm_hashes Returns: @@ -830,17 +810,13 @@ def _process_decoder_only_prompt( return_mm_hashes=return_mm_hashes, ) - return self._build_decoder_only_llm_inputs( - prompt_comps, - prompt_adapter_request=prompt_adapter_request, - ) + return self._build_decoder_only_llm_inputs(prompt_comps) async def _process_decoder_only_prompt_async( self, prompt: SingletonPrompt, tokenization_kwargs: Optional[dict[str, Any]] = None, lora_request: Optional[LoRARequest] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, return_mm_hashes: bool = False, ) -> DecoderOnlyInputs: """ @@ -854,17 +830,13 @@ async def _process_decoder_only_prompt_async( return_mm_hashes=return_mm_hashes, ) - return self._build_decoder_only_llm_inputs( - prompt_comps, - prompt_adapter_request=prompt_adapter_request, - ) + return self._build_decoder_only_llm_inputs(prompt_comps) def preprocess( self, prompt: PromptType, tokenization_kwargs: Optional[dict[str, Any]] = None, lora_request: Optional[LoRARequest] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, return_mm_hashes: bool = False, ) -> ProcessorInputs: """Preprocess the input prompt.""" @@ -886,7 +858,6 @@ def preprocess( prompt, tokenization_kwargs=tokenization_kwargs, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, return_mm_hashes=return_mm_hashes, ) @@ -895,7 +866,6 @@ async def preprocess_async( prompt: PromptType, tokenization_kwargs: Optional[dict[str, Any]] = 
None, lora_request: Optional[LoRARequest] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, return_mm_hashes: bool = False, ) -> ProcessorInputs: """ @@ -919,6 +889,5 @@ async def preprocess_async( prompt, tokenization_kwargs=tokenization_kwargs, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, return_mm_hashes=return_mm_hashes, ) diff --git a/vllm/prompt_adapter/__init__.py b/vllm/prompt_adapter/__init__.py deleted file mode 100644 index e69de29bb2d..00000000000 diff --git a/vllm/prompt_adapter/layers.py b/vllm/prompt_adapter/layers.py deleted file mode 100644 index b5b925d042f..00000000000 --- a/vllm/prompt_adapter/layers.py +++ /dev/null @@ -1,83 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -from dataclasses import dataclass -from typing import Optional - -import torch -from torch import nn - -from vllm.adapter_commons.layers import AdapterMapping -from vllm.config import PromptAdapterConfig -from vllm.model_executor.layers.vocab_parallel_embedding import ( - VocabParallelEmbedding) - - -@dataclass -class PromptAdapterMapping(AdapterMapping): - pass - - -class VocabParallelEmbeddingWithPromptAdapter(nn.Module): - - def __init__(self, base_layer: VocabParallelEmbedding) -> None: - super().__init__() - self.base_layer = base_layer - self.emb_layer = self.base_layer - if 'LoRA' in base_layer.__class__.__name__: - self.emb_layer = self.base_layer.base_layer - - def create_prompt_adapter_weights( - self, prompt_adapter_config: PromptAdapterConfig): - self.embeddings_tensors = torch.zeros( - ( - prompt_adapter_config.max_prompt_adapters, - prompt_adapter_config.max_prompt_adapter_token, - self.emb_layer.embedding_dim, - ), - dtype=self.emb_layer.weight.dtype, - device=self.emb_layer.weight.device, - ) - self.adapter_lengths = torch.zeros( - prompt_adapter_config.max_prompt_adapters, - dtype=torch.long, - device=self.emb_layer.weight.device) - - self.indices_gpu: torch.Tensor - self.embedding_indices_gpu: torch.Tensor - - def reset_prompt_adapter(self, index: int): - self.embeddings_tensors[index] = 0 - - def set_prompt_adapter( - self, - index: int, - adapter_model: Optional[torch.Tensor], - ): - self.reset_prompt_adapter(index) - if adapter_model is not None: - length = adapter_model.shape[0] - self.embeddings_tensors[index, :length] = adapter_model - self.adapter_lengths[index] = length - - def set_mapping( - self, - prompt_indices: torch.Tensor, - prompt_embedding_indices: torch.Tensor, - ): - self.indices_gpu = prompt_indices.to( - device=self.emb_layer.weight.device) - self.embedding_indices_gpu = prompt_embedding_indices.to( - device=self.emb_layer.weight.device) - - def forward(self, x: torch.Tensor) -> torch.Tensor: - hidden_states = self.base_layer(x) - if self.embedding_indices_gpu.ndim > 1: - valid_mask = self.indices_gpu != -1 - gathered_embeddings = self.embeddings_tensors[ - self.embedding_indices_gpu[:, 0], - self.embedding_indices_gpu[:, 1]] - - # Update hidden states - hidden_states[valid_mask] = gathered_embeddings - return hidden_states \ No newline at end of file diff --git a/vllm/prompt_adapter/models.py b/vllm/prompt_adapter/models.py deleted file mode 100644 index 864b50c861e..00000000000 --- a/vllm/prompt_adapter/models.py +++ /dev/null @@ -1,358 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import logging -import math -from typing import Any, Callable, Dict, List, Optional, 
Type - -import torch -from torch import nn - -from vllm.adapter_commons.models import (AdapterLRUCache, AdapterModel, - AdapterModelManager) -from vllm.adapter_commons.utils import (add_adapter, deactivate_adapter, - get_adapter, list_adapters, - remove_adapter, set_adapter_mapping) -from vllm.config import PromptAdapterConfig -from vllm.prompt_adapter.layers import ( - VocabParallelEmbeddingWithPromptAdapter) # yapf: disable -from vllm.prompt_adapter.layers import PromptAdapterMapping -from vllm.prompt_adapter.utils import load_peft_weights - -logger = logging.getLogger(__name__) - -_GLOBAL_PROMPT_ADAPTER_ID = 0 - - -def get_prompt_adapter_id(): - global _GLOBAL_PROMPT_ADAPTER_ID - _GLOBAL_PROMPT_ADAPTER_ID += 1 - return _GLOBAL_PROMPT_ADAPTER_ID - - -def convert_to_embedding_indices(indices): - embedding_indices = [] - count = 0 - - for value in indices: - if value == -1: - count = 0 - else: - embedding_indices.append([value, count]) - count += 1 - - return torch.tensor(embedding_indices) - - -def convert_mapping( - mapping: PromptAdapterMapping, - prompt_adapter_index_to_id: List[Optional[int]], -) -> torch.Tensor: - """Converts PromptAdapterMapping to index tensors. - - Args: - mapping: PromptAdapterMapping mapping rows in a - batch to PromptAdapter ids. - prompt_adapter_index_to_id: List mapping PromptAdapter - ids to PromptAdapter indices. - - Returns: - pa_indices: Tensor of shape [batch_size] mapping batch rows to - PromptAdapter indices. - """ - id_to_index = { - id_: idx - for idx, id_ in enumerate(prompt_adapter_index_to_id) - if id_ is not None - } - pa_indices = ([ - id_to_index.get(id_, -1) if id_ > 0 else -1 - for id_ in mapping.index_mapping - ]) - - pa_embedding_mapping = convert_to_embedding_indices(pa_indices) - pa_indices = torch.tensor(pa_indices) - return pa_indices, pa_embedding_mapping - - -class PromptAdapterModel(AdapterModel): - - def __init__(self, - prompt_adapter_id=None, - num_virtual_tokens=None, - prompt_embedding=None) -> None: - self.id = prompt_adapter_id - self.prompt_embedding = prompt_embedding - self.num_virtual_tokens = num_virtual_tokens - - @classmethod - def from_local_checkpoint( - cls, - adapter_model_path: str, - prompt_adapter_id: int, - num_virtual_tokens: int, - config: PromptAdapterConfig, - device: str = "cuda", - ) -> "PromptAdapterModel": - - if num_virtual_tokens > config.max_prompt_adapter_token: - raise ValueError( - f'num_virtual_tokens ({num_virtual_tokens}) should be <= ' - f'max_prompt_adapter_token({config.max_prompt_adapter_token})') - - adapters_weights = load_peft_weights(adapter_model_path, device) - prompt_embedding = adapters_weights["prompt_embeddings"].to( - config.prompt_adapter_dtype) - - return cls(prompt_adapter_id, num_virtual_tokens, prompt_embedding) - - -class PromptAdapterModelManager(AdapterModelManager): - """A manager that manages multiple Prompt Adapter models.""" - - def __init__( - self, - model: nn.Module, - max_num_seqs: int, - max_num_batched_tokens: int, - prompt_adapter_config: PromptAdapterConfig, - ): - """Create a PromptAdapterModel and adapter for a given model. - - Args: - model: the model to be adapted. - max_num_seqs: the maximum number of sequences model can run in a - single batch. - max_num_batched_tokens: the maximum number of tokens model can run - in a single batch. - prompt_adapter_config: the PromptAdapter config, - """ - self.model: nn.Module = model - # Dict instead of a Set for compatibility with LRUCache. 
- self.prompt_adapter_index_to_id: List[ - Optional[int]] = [None] * self.prompt_adapter_slots - self.max_num_seqs = max_num_seqs - self.max_num_batched_tokens = math.ceil(max_num_batched_tokens / 8) * 8 - self.prompt_adapter_config = prompt_adapter_config - self.model.prompt_adapter_manager = self - self.adapter_type = 'PromptAdapter' - - self.base_indices = torch.tensor([-1]) - self.base_embedding_indices = torch.tensor([]) - - self.modules: Dict[str, nn.Module] = {} - self._create_prompt_adapter_modules() - self._last_mapping: Optional[PromptAdapterMapping] = None - - @property - def prompt_adapter_slots(self) -> int: - return self.prompt_adapter_config.max_prompt_adapters - - @property - def adapter_slots(self) -> int: - return self.prompt_adapter_slots - - @property - def capacity(self) -> int: - return self.prompt_adapter_config.max_cpu_prompt_adapters - - def activate_adapter( - self, - prompt_adapter_id: int, - ) -> bool: - """Move PromptAdapter into a GPU buffer - to be used in the forward pass.""" - if prompt_adapter_id in self._active_adapters: - return False - first_free_slot = next( - ((i, prompt_adapter_id) for i, prompt_adapter_id in enumerate( - self.prompt_adapter_index_to_id) if prompt_adapter_id is None), - None) - if first_free_slot is None: - raise ValueError("No free prompt_adapter slots") - index, _ = first_free_slot - self._active_adapters[prompt_adapter_id] = None - prompt_adapter_model = (self._registered_adapters[prompt_adapter_id]) - logger.debug("Activating prompt_adapter. int id: %d, slot index: %d", - prompt_adapter_model.id, index) - self.prompt_adapter_index_to_id[index] = prompt_adapter_model.id - for _, v in self.modules.items(): - v.set_prompt_adapter(index, prompt_adapter_model.prompt_embedding) - return True - - def _deactivate_adapter(self, prompt_adapter_id: int): - try: - index = self.prompt_adapter_index_to_id.index(prompt_adapter_id) - self.prompt_adapter_index_to_id[index] = None - for _, v in self.modules.items(): - v.reset_prompt_adapter(index) - except ValueError: - pass - - def _add_adapter(self, prompt_adapter: PromptAdapterModel): - self._registered_adapters[prompt_adapter.id] = prompt_adapter - - def _set_adapter_mapping(self, mapping: PromptAdapterMapping) -> None: - base_indices, base_embedding_indices = convert_mapping( - mapping, self.prompt_adapter_index_to_id) - for k, v in self.modules.items(): - v.set_mapping(base_indices, base_embedding_indices) - - def _create_prompt_adapter_modules(self): - for module_name, module in self.model.named_modules( - remove_duplicate=False): - if "VocabParallel" in module.__class__.__name__: - new_module = VocabParallelEmbeddingWithPromptAdapter(module) - new_module.create_prompt_adapter_weights( - self.prompt_adapter_config) - replaced_module = self.replace_submodule( - self.model, module_name, new_module) - self.register_module(module.__class__.__name__, - replaced_module) - replaced_module.set_mapping(self.base_indices, - self.base_embedding_indices) - break - - def replace_submodule(self, model: nn.Module, module_name: str, - new_module: nn.Module) -> nn.Module: - """Replace a submodule in a model with a new module.""" - parent = model.get_submodule(".".join(module_name.split(".")[:-1])) - target_name = module_name.split(".")[-1] - setattr(parent, target_name, new_module) - return new_module - - def register_module(self, module_name: str, module: nn.Module): - self.modules[module_name] = module - - def pin_adapter(self, prompt_adapter_id: int) -> bool: - """Pin a PromptAdapterModel in the manager 
cache.""" - raise NotImplementedError( - "Pinning is not supported in PromptAdapterModelManager. " - "Use LRUCachePromptAdapterModelManager for pinning" - ) # type: ignore - - def remove_all_adapters(self): - """Remove all PromptAdapterModel from the manager.""" - self._registered_adapters.clear() - self.prompt_adapter_index_to_id = [None] * self.prompt_adapter_slots - self._active_adapters.clear() - - def deactivate_adapter(self, adapter_id: int) -> bool: - return deactivate_adapter(adapter_id, self._active_adapters, - self._deactivate_adapter) - - def add_adapter(self, adapter: PromptAdapterModel) -> bool: - return add_adapter(adapter, self._registered_adapters, self.capacity, - self._add_adapter) - - def set_adapter_mapping(self, mapping: PromptAdapterMapping) -> None: - self._last_mapping = set_adapter_mapping(mapping, self._last_mapping, - self._set_adapter_mapping) - - def remove_adapter(self, adapter_id: int) -> bool: - return remove_adapter(adapter_id, self._registered_adapters, - self.deactivate_adapter) - - def list_adapters(self) -> Dict[int, Any]: - return list_adapters(self._registered_adapters) - - def get_adapter(self, adapter_id: int) -> Optional[Any]: - return get_adapter(adapter_id, self._registered_adapters) - - -class PromptAdapterLRUCache(AdapterLRUCache[PromptAdapterModel]): - - def __init__(self, capacity: int, - deactivate_prompt_adapter_fn: Callable[[int], bool]): - super().__init__(capacity, deactivate_prompt_adapter_fn) - - -class LRUCachePromptAdapterModelManager(PromptAdapterModelManager): - """A model manager that manages multiple prompt_adapters with LRU cache.""" - - def __init__( - self, - model: nn.Module, - max_num_seqs: int, - max_num_batched_tokens: int, - prompt_adapter_config: PromptAdapterConfig, - ): - self.prompt_adapter_config = prompt_adapter_config - super().__init__(model, max_num_seqs, max_num_batched_tokens, - prompt_adapter_config) - self._registered_adapters = PromptAdapterLRUCache( - self.capacity, self.deactivate_adapter) - self._active_adapters = PromptAdapterLRUCache( - self.prompt_adapter_slots, self._deactivate_adapter) - - def list_adapters(self) -> Dict[int, PromptAdapterModel]: - """List all registered PromptAdapterModel.""" - return dict(self._registered_adapters.cache) - - def add_adapter(self, prompt_adapter: PromptAdapterModel) -> bool: - """Add a PromptAdapterModel to the manager.""" - if prompt_adapter.id not in self._registered_adapters: - self._add_adapter(prompt_adapter) - was_added = True - else: - # We always touch to update the LRU cache order - self._registered_adapters.touch(prompt_adapter.id) - was_added = False - return was_added - - def activate_adapter( - self, - prompt_adapter_id: int, - ) -> bool: - if prompt_adapter_id not in self._active_adapters and len( - self._active_adapters) >= self.prompt_adapter_slots: - self._active_adapters.remove_oldest() - result = super().activate_adapter(prompt_adapter_id) - # We always touch to update the LRU cache order - self._active_adapters.touch(prompt_adapter_id) - return result - - def remove_oldest_adapter(self) -> bool: - if len(self._registered_adapters) > 0: - self._registered_adapters.remove_oldest() - return True - return False - - def pin_adapter(self, prompt_adapter_id: int) -> bool: - """Pin a PromptAdapterModel in the manager cache.""" - self._pin_prompt_adapter_in_cpu_cache(prompt_adapter_id) - self._pin_prompt_adapter_in_gpu_cache(prompt_adapter_id) - return True - - def _pin_prompt_adapter_in_cpu_cache(self, prompt_adapter_id: int): - try: - 
self._registered_adapters.pin(prompt_adapter_id) - except ValueError as err: - raise ValueError( - "Pinning failed. " - f"Prompt Adapter {prompt_adapter_id} is not registered." - ) from err - - def _pin_prompt_adapter_in_gpu_cache(self, prompt_adapter_id: int): - if prompt_adapter_id not in self._active_adapters: - # move adapter to gpu if not already active - self.activate_adapter(prompt_adapter_id) - self._active_adapters.pin(prompt_adapter_id) - - -def create_prompt_adapter_manager( - model: nn.Module, - max_num_seqs: int, - max_num_batched_tokens: int, - prompt_adapter_config: PromptAdapterConfig, - prompt_adapter_manager_cls: Type[ - PromptAdapterModelManager] = PromptAdapterModelManager, - **kwargs) -> PromptAdapterModelManager: - """Create a PromptAdapterModel for a given model.""" - prompt_adapter_manager = prompt_adapter_manager_cls( - model=model, - max_num_seqs=max_num_seqs, - max_num_batched_tokens=max_num_batched_tokens, - prompt_adapter_config=prompt_adapter_config, - **kwargs) - return prompt_adapter_manager diff --git a/vllm/prompt_adapter/request.py b/vllm/prompt_adapter/request.py deleted file mode 100644 index 3ce50d0a26b..00000000000 --- a/vllm/prompt_adapter/request.py +++ /dev/null @@ -1,37 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import msgspec - -from vllm.adapter_commons.request import AdapterRequest - - -class PromptAdapterRequest( - msgspec.Struct, - array_like=True, # type: ignore[call-arg] - omit_defaults=True, # type: ignore[call-arg] - frozen=True): # type: ignore[call-arg] - """ - Request for a Prompt adapter. - """ - __metaclass__ = AdapterRequest - - prompt_adapter_name: str - prompt_adapter_id: int - prompt_adapter_local_path: str - prompt_adapter_num_virtual_tokens: int - - def __hash__(self): - return super().__hash__() - - @property - def adapter_id(self): - return self.prompt_adapter_id - - @property - def name(self): - return self.prompt_adapter_name - - @property - def local_path(self): - return self.prompt_adapter_local_path diff --git a/vllm/prompt_adapter/utils.py b/vllm/prompt_adapter/utils.py deleted file mode 100644 index ddd007868f6..00000000000 --- a/vllm/prompt_adapter/utils.py +++ /dev/null @@ -1,98 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# code borrowed from: https://github.com/huggingface/peft/blob/v0.12.0/src/peft/utils/save_and_load.py#L420 - -import os -from typing import Optional - -import torch -from huggingface_hub import file_exists, hf_hub_download -from huggingface_hub.utils import EntryNotFoundError -from safetensors.torch import load_file as safe_load_file - -from vllm.platforms import current_platform - -WEIGHTS_NAME = "adapter_model.bin" -SAFETENSORS_WEIGHTS_NAME = "adapter_model.safetensors" - - -# Get current device name based on available devices -def infer_device() -> str: - if current_platform.is_cuda_alike(): - return "cuda" - return "cpu" - - -def load_peft_weights(model_id: str, - device: Optional[str] = None, - **hf_hub_download_kwargs) -> dict: - r""" - A helper method to load the PEFT weights from the HuggingFace Hub or locally - - Args: - model_id (`str`): - The local path to the adapter weights or the name of the adapter to - load from the HuggingFace Hub. - device (`str`): - The device to load the weights onto. 
- hf_hub_download_kwargs (`dict`): - Additional arguments to pass to the `hf_hub_download` method when - loading from the HuggingFace Hub. - """ - path = (os.path.join(model_id, hf_hub_download_kwargs["subfolder"]) if - hf_hub_download_kwargs.get("subfolder") is not None else model_id) - - if device is None: - device = infer_device() - - if os.path.exists(os.path.join(path, SAFETENSORS_WEIGHTS_NAME)): - filename = os.path.join(path, SAFETENSORS_WEIGHTS_NAME) - use_safetensors = True - elif os.path.exists(os.path.join(path, WEIGHTS_NAME)): - filename = os.path.join(path, WEIGHTS_NAME) - use_safetensors = False - else: - token = hf_hub_download_kwargs.get("token") - if token is None: - token = hf_hub_download_kwargs.get("use_auth_token") - - hub_filename = (os.path.join(hf_hub_download_kwargs["subfolder"], - SAFETENSORS_WEIGHTS_NAME) - if hf_hub_download_kwargs.get("subfolder") is not None - else SAFETENSORS_WEIGHTS_NAME) - has_remote_safetensors_file = file_exists( - repo_id=model_id, - filename=hub_filename, - revision=hf_hub_download_kwargs.get("revision"), - repo_type=hf_hub_download_kwargs.get("repo_type"), - token=token, - ) - use_safetensors = has_remote_safetensors_file - - if has_remote_safetensors_file: - # Priority 1: load safetensors weights - filename = hf_hub_download( - model_id, - SAFETENSORS_WEIGHTS_NAME, - **hf_hub_download_kwargs, - ) - else: - try: - filename = hf_hub_download(model_id, WEIGHTS_NAME, - **hf_hub_download_kwargs) - except EntryNotFoundError: - raise ValueError( # noqa: B904 - f"Can't find weights for {model_id} in {model_id} or \ - in the Hugging Face Hub. " - f"Please check that the file {WEIGHTS_NAME} or \ - {SAFETENSORS_WEIGHTS_NAME} is present at {model_id}.") - - if use_safetensors: - adapters_weights = safe_load_file(filename, device=device) - else: - adapters_weights = torch.load(filename, - map_location=torch.device(device), - weights_only=True) - - return adapters_weights diff --git a/vllm/prompt_adapter/worker_manager.py b/vllm/prompt_adapter/worker_manager.py deleted file mode 100644 index 56265de8087..00000000000 --- a/vllm/prompt_adapter/worker_manager.py +++ /dev/null @@ -1,179 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import logging -from typing import Any, Optional, Set, Type - -import torch - -from vllm.adapter_commons.utils import (add_adapter_worker, - apply_adapters_worker, - list_adapters_worker, - set_active_adapters_worker) -from vllm.adapter_commons.worker_manager import AbstractWorkerManager -from vllm.config import PromptAdapterConfig -from vllm.prompt_adapter.models import (LRUCachePromptAdapterModelManager, - PromptAdapterModel, - PromptAdapterModelManager, - create_prompt_adapter_manager) -from vllm.prompt_adapter.request import PromptAdapterRequest - -logger = logging.getLogger(__name__) - - -class WorkerPromptAdapterManager(AbstractWorkerManager): - """WorkerPromptAdapterManager that manages - prompt_adapter models on the worker side. 
- - Every request, the requested prompt_adapters will be - loaded (unless they are already loaded), - and every other prompt_adapter will be unloaded.""" - - _manager_cls: Type[PromptAdapterModelManager] = PromptAdapterModelManager - - def __init__( - self, - max_num_seqs: int, - max_num_batched_tokens: int, - device: torch.device, - prompt_adapter_config: PromptAdapterConfig, - prompt_adapter_model_cls: Type[PromptAdapterModel] = PromptAdapterModel - ): - self._adapter_manager: PromptAdapterModelManager - self.max_num_seqs = max_num_seqs - self.max_num_batched_tokens = max_num_batched_tokens - self._prompt_adapter_model_cls = prompt_adapter_model_cls - self.prompt_adapter_config = prompt_adapter_config - super().__init__(device) - - @property - def is_enabled(self) -> bool: - return True - - def create_prompt_adapter_manager( - self, - model: torch.nn.Module, - ) -> Any: - prompt_adapter_manager = create_prompt_adapter_manager( - model, - max_num_seqs=self.max_num_seqs, - max_num_batched_tokens=self.max_num_batched_tokens, - prompt_adapter_config=self.prompt_adapter_config, - prompt_adapter_manager_cls=self._manager_cls, - ) - self._adapter_manager = prompt_adapter_manager - return prompt_adapter_manager.model - - def _load_adapter( - self, prompt_adapter_request: PromptAdapterRequest - ) -> PromptAdapterModel: - try: - prompt_adapter = ( - self._prompt_adapter_model_cls.from_local_checkpoint( - prompt_adapter_request.prompt_adapter_local_path, - prompt_adapter_id=prompt_adapter_request.prompt_adapter_id, - num_virtual_tokens=prompt_adapter_request. - prompt_adapter_num_virtual_tokens, - config=self.prompt_adapter_config, - device=str(self.device), - )) - except Exception as e: - raise RuntimeError( - f"Loading prompt_adapter " - f"{prompt_adapter_request.prompt_adapter_local_path}" - f" failed") from e - return prompt_adapter - - def add_dummy_prompt_adapter( - self, prompt_adapter_request: PromptAdapterRequest) -> bool: - return True - - def pin_adapter(self, adapter_id: int) -> bool: - return self._adapter_manager.pin_adapter(adapter_id) - - def set_active_adapters(self, requests: Set[Any], - mapping: Optional[Any]) -> None: - set_active_adapters_worker(requests, mapping, self._apply_adapters, - self._adapter_manager.set_adapter_mapping) - - def add_adapter(self, adapter_request: Any) -> bool: - return add_adapter_worker(adapter_request, self.list_adapters, - self._load_adapter, - self._adapter_manager.add_adapter, - self._adapter_manager.activate_adapter) - - def _apply_adapters(self, adapter_requests: Set[Any]) -> None: - apply_adapters_worker(adapter_requests, self.list_adapters, - self._adapter_manager.adapter_slots, - self.remove_adapter, self.add_adapter) - - def remove_adapter(self, adapter_id: int) -> bool: - return self._adapter_manager.remove_adapter(adapter_id) - - def remove_all_adapters(self): - self._adapter_manager.remove_all_adapters() - - def list_adapters(self) -> Set[int]: - return list_adapters_worker(self._adapter_manager.list_adapters) - - -class LRUCacheWorkerPromptAdapterManager(WorkerPromptAdapterManager): - """WorkerPromptAdapterManager that manages - prompt_adapter models on the worker side. - - Uses an LRU Cache. 
Every request, the requested - prompt_adapters will be loaded (unless they are already loaded) - and least recently used prompt_adapters will - be unloaded if the cache is above capacity.""" - - _prompt_adapter_manager_cls: Type[ - LRUCachePromptAdapterModelManager] = LRUCachePromptAdapterModelManager - - def create_prompt_adapter_manager( - self, - model: torch.nn.Module, - ) -> Any: - prompt_adapter_manager = create_prompt_adapter_manager( - model, - max_num_seqs=self.max_num_seqs, - max_num_batched_tokens=self.max_num_batched_tokens, - prompt_adapter_config=self.prompt_adapter_config, - prompt_adapter_manager_cls=self._prompt_adapter_manager_cls) - self._adapter_manager: LRUCachePromptAdapterModelManager = ( - prompt_adapter_manager) - return prompt_adapter_manager.model - - def _apply_adapters( - self, prompt_adapter_requests: Set[PromptAdapterRequest]) -> None: - prompt_adapters_map = { - prompt_adapter_request.prompt_adapter_id: prompt_adapter_request - for prompt_adapter_request in prompt_adapter_requests - if prompt_adapter_request - } - if len(prompt_adapters_map - ) > self._adapter_manager.prompt_adapter_slots: - raise RuntimeError( - f"Number of requested prompt_adapters " - f"({len(prompt_adapters_map)}) is greater " - "than the number of GPU prompt_adapter slots " - f"({self._adapter_manager.prompt_adapter_slots}).") - for prompt_adapter in prompt_adapters_map.values(): - self.add_adapter(prompt_adapter) - - def add_adapter(self, - prompt_adapter_request: PromptAdapterRequest) -> bool: - if prompt_adapter_request.prompt_adapter_id not in self.list_adapters( - ): - # Remove before we load the new prompt_adapter to save memory - if len(self._adapter_manager) + 1 > self._adapter_manager.capacity: - self._adapter_manager.remove_oldest_adapter() - prompt_adapter = self._load_adapter(prompt_adapter_request) - loaded = self._adapter_manager.add_adapter(prompt_adapter) - else: - # If the prompt_adapter is already loaded, just touch it to - # update its position in the caches - loaded = self._adapter_manager.get_adapter( - prompt_adapter_request.prompt_adapter_id) is not None - self._adapter_manager.activate_adapter( - prompt_adapter_request.prompt_adapter_id) - return loaded diff --git a/vllm/sequence.py b/vllm/sequence.py index 1f507add0d9..fe87b52f9df 100644 --- a/vllm/sequence.py +++ b/vllm/sequence.py @@ -19,7 +19,6 @@ from vllm.lora.request import LoRARequest from vllm.multimodal import MultiModalKwargs, MultiModalPlaceholderDict from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import RequestOutputKind, SamplingParams VLLM_TOKEN_ID_ARRAY_TYPE = "l" @@ -458,7 +457,6 @@ class Sequence: block size used by the block manager and cache engine. eos_token_id: The end-of-sequence (EOS) token id recognized by this LLM. lora_request: LoRA request. - prompt_adapter_request: Prompt Adapter request. 
""" def __init__( @@ -468,14 +466,12 @@ def __init__( block_size: int, eos_token_id: Optional[int] = None, lora_request: Optional[LoRARequest] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, ) -> None: self.seq_id = seq_id self.inputs = inputs self.block_size = block_size self.eos_token_id = eos_token_id self.lora_request = lora_request - self.prompt_adapter_request = prompt_adapter_request self.data = SequenceData.from_seqs( self.prompt_token_ids, @@ -537,11 +533,6 @@ def multi_modal_placeholders(self) -> MultiModalPlaceholderDict: def lora_int_id(self) -> int: return self.lora_request.lora_int_id if self.lora_request else 0 - @property - def prompt_adapter_id(self) -> int: - return self.prompt_adapter_request.prompt_adapter_id \ - if self.prompt_adapter_request else 0 - def get_output_text_to_return(self, buffer_length: int, delta: bool) -> str: """If delta is True, only new text since the last call to @@ -601,12 +592,12 @@ def extra_hash(self) -> Optional[int]: designed for prefix caching mode. The final sequence hash is determined by applying token_ids from the sequence's blocks. """ - if self.prompt_adapter_id == 0 and self.lora_int_id == 0: + if self.lora_int_id == 0: return None # NOTE: If there are additional factors influencing the block aside from # token_ids, include them as input parameters to the hash. - return hash((self.prompt_adapter_id, self.lora_int_id)) + return hash(self.lora_int_id) def num_hashed_tokens_of_block(self, logical_idx: int): return logical_idx * self.block_size + self.block_size @@ -707,7 +698,6 @@ class SequenceGroup: encoder_seq: Optional, the single encoder sequence. Should be None unless you are working with an encoder/decoder model. trace_headers: OpenTelemetry trace headers. - prompt_adapter_request: Prompt Adapter request. priority: User-defined priority of the request. draft_size: The number of speculative tokens plus one from the target model; equal to max number of tokens a step can generate @@ -725,7 +715,6 @@ def __init__(self, pooled_data: Optional[torch.Tensor] = None, encoder_seq: Optional[Sequence] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, draft_size: int = 1) -> None: self.request_id = request_id @@ -747,7 +736,6 @@ def __init__(self, self.state = SequenceGroupState() self.pooling_params = pooling_params self.pooled_data = pooled_data - self.prompt_adapter_request = prompt_adapter_request self.encoder_seq = encoder_seq self.trace_headers = trace_headers self.priority = priority @@ -802,16 +790,6 @@ def multi_modal_placeholders(self) -> MultiModalPlaceholderDict: def lora_int_id(self) -> int: return self.lora_request.lora_int_id if self.lora_request else 0 - @property - def prompt_adapter_id(self) -> int: - return self.prompt_adapter_request.prompt_adapter_id \ - if self.prompt_adapter_request else 0 - - @property - def prompt_adapter_num_virtual_tokens(self) -> int: - return self.prompt_adapter_request.prompt_adapter_num_virtual_tokens\ - if self.prompt_adapter_request else 0 - def init_multi_step(self, num_steps: int) -> None: self.state.num_steps = num_steps self.state.current_step = 0 @@ -1011,7 +989,6 @@ class SequenceGroupMetadata( (SequenceGroup.encoder_seq). Should be None unless you are working with an encoder/decoder model. - prompt_adapter_request: Prompt Adapter request. 
""" request_id: str @@ -1030,7 +1007,6 @@ class SequenceGroupMetadata( multi_modal_placeholders: Optional[MultiModalPlaceholderDict] = None encoder_seq_data: Optional[SequenceData] = None cross_block_table: Optional[list[int]] = None - prompt_adapter_request: Optional[PromptAdapterRequest] = None token_chunk_size: Optional[int] = None ### Stateful fields that are lazily defined. ### @@ -1052,16 +1028,6 @@ def __post_init__(self): def lora_int_id(self) -> int: return self.lora_request.lora_int_id if self.lora_request else 0 - @property - def prompt_adapter_id(self) -> int: - return self.prompt_adapter_request.prompt_adapter_id \ - if self.prompt_adapter_request else 0 - - @property - def prompt_adapter_num_virtual_tokens(self) -> int: - return self.prompt_adapter_request.prompt_adapter_num_virtual_tokens \ - if self.prompt_adapter_request else 0 - # Multi-Step Chunked-Prefill property @property def is_single_step_prompt(self) -> bool: @@ -1525,7 +1491,6 @@ def add_request(request_id: str, engine, params, **kwargs): pooled_data=seq_group.pooled_data, encoder_seq=seq_group.encoder_seq, trace_headers=seq_group.trace_headers, - prompt_adapter_request=seq_group.prompt_adapter_request, priority=seq_group.priority, ) diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index e4f495e22e2..5b9c3b6a50c 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -128,10 +128,6 @@ "backends currently supported with encoder/" "decoder models.") -STR_NOT_IMPL_ENC_DEC_PROMPT_ADAPTER = ("Prompt adapters are not " - "currently supported with encoder/" - "decoder models.") - # Efficiently import all enc/dec error strings # rather than having to import all of the above STR_NOT_IMPL_ENC_DEC_ERR_STRS = { @@ -145,7 +141,6 @@ "STR_NOT_IMPL_ENC_DEC_MM": STR_NOT_IMPL_ENC_DEC_MM, "STR_NOT_IMPL_ENC_DEC_SPEC_DEC": STR_NOT_IMPL_ENC_DEC_SPEC_DEC, "STR_NOT_IMPL_ENC_DEC_BACKEND": STR_NOT_IMPL_ENC_DEC_BACKEND, - "STR_NOT_IMPL_ENC_DEC_PROMPT_ADAPTER": STR_NOT_IMPL_ENC_DEC_PROMPT_ADAPTER, } # Constants related to forcing the attention backend selection diff --git a/vllm/v1/engine/async_llm.py b/vllm/v1/engine/async_llm.py index 95a474228d4..66e76777d75 100644 --- a/vllm/v1/engine/async_llm.py +++ b/vllm/v1/engine/async_llm.py @@ -20,7 +20,6 @@ from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalRegistry from vllm.outputs import PoolingRequestOutput, RequestOutput from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import SamplingParams from vllm.transformers_utils.config import ( maybe_register_config_serialize_by_value) @@ -221,7 +220,6 @@ async def add_request( lora_request: Optional[LoRARequest] = None, tokenization_kwargs: Optional[dict[str, Any]] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, data_parallel_rank: Optional[int] = None, ) -> RequestOutputCollector: @@ -238,8 +236,7 @@ async def add_request( # Convert Input --> Request. 
prompt_str, request = self.processor.process_inputs( request_id, prompt, params, arrival_time, lora_request, - tokenization_kwargs, trace_headers, prompt_adapter_request, - priority, data_parallel_rank) + tokenization_kwargs, trace_headers, priority, data_parallel_rank) if is_pooling or params.n == 1: await self._add_request(request, prompt_str, None, 0, queue) @@ -283,7 +280,6 @@ async def generate( request_id: str, lora_request: Optional[LoRARequest] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, data_parallel_rank: Optional[int] = None, ) -> AsyncGenerator[RequestOutput, None]: @@ -314,7 +310,6 @@ async def generate( sampling_params, lora_request=lora_request, trace_headers=trace_headers, - prompt_adapter_request=prompt_adapter_request, priority=priority, data_parallel_rank=data_parallel_rank, ) diff --git a/vllm/v1/engine/llm_engine.py b/vllm/v1/engine/llm_engine.py index 29aca1ad698..991242e1827 100644 --- a/vllm/v1/engine/llm_engine.py +++ b/vllm/v1/engine/llm_engine.py @@ -17,7 +17,6 @@ from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalRegistry from vllm.outputs import PoolingRequestOutput, RequestOutput from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import SamplingParams from vllm.transformers_utils.tokenizer_group import ( TokenizerGroup, init_tokenizer_from_configs) @@ -192,7 +191,6 @@ def add_request( lora_request: Optional[LoRARequest] = None, tokenization_kwargs: Optional[dict[str, Any]] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, ) -> None: # Validate the request_id type. @@ -203,8 +201,7 @@ def add_request( # Process raw inputs into the request. 
prompt_str, request = self.processor.process_inputs( request_id, prompt, params, arrival_time, lora_request, - tokenization_kwargs, trace_headers, prompt_adapter_request, - priority) + tokenization_kwargs, trace_headers, priority) n = params.n if isinstance(params, SamplingParams) else 1 diff --git a/vllm/v1/engine/processor.py b/vllm/v1/engine/processor.py index 725152f978d..0f2f404a130 100644 --- a/vllm/v1/engine/processor.py +++ b/vllm/v1/engine/processor.py @@ -16,7 +16,6 @@ from vllm.multimodal.processing import EncDecMultiModalProcessor from vllm.multimodal.utils import merge_and_sort_multimodal_metadata from vllm.pooling_params import PoolingParams -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sampling_params import SamplingParams from vllm.transformers_utils.tokenizer_group import TokenizerGroup from vllm.v1.engine import EngineCoreRequest @@ -226,7 +225,6 @@ def process_inputs( lora_request: Optional[LoRARequest] = None, tokenization_kwargs: Optional[dict[str, Any]] = None, trace_headers: Optional[Mapping[str, str]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, priority: int = 0, data_parallel_rank: Optional[int] = None, ) -> tuple[Optional[str], EngineCoreRequest]: @@ -237,8 +235,6 @@ def process_inputs( self._validate_params(params, lora_request) if trace_headers is not None: raise ValueError("V1 does not support tracing yet.") - if prompt_adapter_request is not None: - raise ValueError("V1 does not support prompt_adapter_request.") data_parallel_size = self.vllm_config.parallel_config.data_parallel_size if data_parallel_rank is not None and not (0 <= data_parallel_rank < @@ -253,12 +249,10 @@ def process_inputs( # 1. Tokenize text prompt, with LoRA request if one exists. # 2. For multimodal models with a merged preprocessor, preprocess # multimodal data and expand prompt token ids accordingly. - # 3. Apply prompt adapter to prompt token ids if one exists. 
processed_inputs: ProcessorInputs = self.input_preprocessor.preprocess( prompt, tokenization_kwargs=tokenization_kwargs, lora_request=lora_request, - prompt_adapter_request=prompt_adapter_request, return_mm_hashes=self.use_hash, ) from vllm.platforms import current_platform diff --git a/vllm/v1/utils.py b/vllm/v1/utils.py index 97fec4704b4..c74d8c543f7 100644 --- a/vllm/v1/utils.py +++ b/vllm/v1/utils.py @@ -318,8 +318,6 @@ def report_usage_stats( # Feature flags "enable_lora": bool(vllm_config.lora_config), - "enable_prompt_adapter": - bool(vllm_config.prompt_adapter_config), "enable_prefix_caching": vllm_config.cache_config.enable_prefix_caching, "enforce_eager": diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 1ee379d3427..3671b466070 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -104,7 +104,6 @@ def __init__( self.parallel_config = vllm_config.parallel_config self.scheduler_config = vllm_config.scheduler_config self.speculative_config = vllm_config.speculative_config - self.prompt_adapter_config = vllm_config.prompt_adapter_config self.observability_config = vllm_config.observability_config from vllm.model_executor.models.utils import set_cpu_offload_max_bytes diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index f160384f8f6..3bb033f1487 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -114,7 +114,6 @@ def __init__( self.original_parallel_config = original_parallel_config self.scheduler_config = vllm_config.scheduler_config self.speculative_config = vllm_config.speculative_config - self.prompt_adapter_config = vllm_config.prompt_adapter_config self.observability_config = vllm_config.observability_config self.device_config = vllm_config.device_config diff --git a/vllm/v1/worker/tpu_worker.py b/vllm/v1/worker/tpu_worker.py index 1d61878ca08..648d9c3195c 100644 --- a/vllm/v1/worker/tpu_worker.py +++ b/vllm/v1/worker/tpu_worker.py @@ -62,7 +62,6 @@ def __init__( self.scheduler_config = vllm_config.scheduler_config self.device_config = vllm_config.device_config self.speculative_config = vllm_config.speculative_config - self.prompt_adapter_config = vllm_config.prompt_adapter_config self.observability_config = vllm_config.observability_config self.parallel_config.rank = rank diff --git a/vllm/worker/enc_dec_model_runner.py b/vllm/worker/enc_dec_model_runner.py index 8d92edc5b38..cb5d5664ab5 100644 --- a/vllm/worker/enc_dec_model_runner.py +++ b/vllm/worker/enc_dec_model_runner.py @@ -91,10 +91,9 @@ def __init__( ''' EncoderDecoderModelRunner constructor. - `lora_config` and `prompt_adapter_config` are - unused (since these features are not yet supported for encoder/decoder - models) but these arguments are present here for compatibility with - the base-class constructor. + `lora_config` is unused (since these features are not yet supported + for encoder/decoder models) but these arguments are present here for + compatibility with the base-class constructor. 
''' self._maybe_force_supported_attention_backend() diff --git a/vllm/worker/model_runner.py b/vllm/worker/model_runner.py index bced3ba9ba1..4bea37c8530 100644 --- a/vllm/worker/model_runner.py +++ b/vllm/worker/model_runner.py @@ -45,10 +45,6 @@ from vllm.multimodal import (MULTIMODAL_REGISTRY, BatchedTensorInputs, MultiModalKwargs, MultiModalPlaceholderMap, MultiModalRegistry) -from vllm.prompt_adapter.layers import PromptAdapterMapping -from vllm.prompt_adapter.request import PromptAdapterRequest -from vllm.prompt_adapter.worker_manager import ( - LRUCacheWorkerPromptAdapterManager) from vllm.sampling_params import SamplingParams from vllm.sequence import IntermediateTensors, SequenceGroupMetadata from vllm.utils import (DeviceMemoryProfiler, GiB_bytes, PyObjectCache, @@ -95,8 +91,6 @@ class ModelInputForGPU(ModelRunnerInputBase): lora_mapping: Optional["LoRAMapping"] = None lora_requests: Optional[Set[LoRARequest]] = None attn_metadata: Optional["AttentionMetadata"] = None - prompt_adapter_mapping: Optional[PromptAdapterMapping] = None - prompt_adapter_requests: Optional[Set[PromptAdapterRequest]] = None multi_modal_kwargs: Optional[BatchedTensorInputs] = None request_ids_to_seq_ids: Optional[Dict[str, List[int]]] = None finished_requests_ids: Optional[List[str]] = None @@ -113,8 +107,6 @@ def as_broadcastable_tensor_dict(self) -> Dict[str, Any]: "lora_requests": self.lora_requests, "lora_mapping": self.lora_mapping, "multi_modal_kwargs": self.multi_modal_kwargs, - "prompt_adapter_mapping": self.prompt_adapter_mapping, - "prompt_adapter_requests": self.prompt_adapter_requests, "virtual_engine": self.virtual_engine, "request_ids_to_seq_ids": self.request_ids_to_seq_ids, "finished_requests_ids": self.finished_requests_ids, @@ -164,8 +156,6 @@ def as_broadcastable_tensor_dict(self) -> Dict[str, Any]: "lora_requests": self.lora_requests, "lora_mapping": self.lora_mapping, "multi_modal_kwargs": self.multi_modal_kwargs, - "prompt_adapter_mapping": self.prompt_adapter_mapping, - "prompt_adapter_requests": self.prompt_adapter_requests, "virtual_engine": self.virtual_engine, "request_ids_to_seq_ids": self.request_ids_to_seq_ids, "finished_requests_ids": self.finished_requests_ids, @@ -212,8 +202,6 @@ def simple_reinit(self): self.lora_index_mapping.clear() # type: ignore self.lora_prompt_mapping.clear() # type: ignore self.lora_requests.clear() # type: ignore - self.prompt_adapter_index_mapping.clear() # type: ignore - self.prompt_adapter_prompt_mapping.clear() # type: ignore def __init__( self, @@ -252,11 +240,6 @@ def __init__( lora_prompt_mapping: Optional[List[List[int]]] = None, lora_requests: Optional[Set[LoRARequest]] = None, - # Prompt adapter inputs. - prompt_adapter_index_mapping: Optional[List[int]] = None, - prompt_adapter_prompt_mapping: Optional[List[int]] = None, - prompt_adapter_request: Optional[PromptAdapterRequest] = None, - # Multi-modal inputs. 
multi_modal_kwargs: Optional[MultiModalKwargs] = None, multi_modal_placeholder_maps: Optional[Dict[ @@ -360,18 +343,6 @@ def __init__( else: self.lora_requests.clear() - if prompt_adapter_index_mapping: - self.prompt_adapter_index_mapping = \ - prompt_adapter_index_mapping - else: - self.prompt_adapter_index_mapping.clear() - - if prompt_adapter_prompt_mapping: - self.prompt_adapter_prompt_mapping = \ - prompt_adapter_prompt_mapping - else: - self.prompt_adapter_prompt_mapping.clear() - else: self.input_tokens = input_tokens or [] self.inputs_embeds = inputs_embeds @@ -390,12 +361,6 @@ def __init__( self.lora_prompt_mapping = lora_prompt_mapping or [] self.lora_requests = lora_requests or set() - self.prompt_adapter_index_mapping = ( - prompt_adapter_index_mapping or []) - self.prompt_adapter_prompt_mapping = ( - prompt_adapter_prompt_mapping or []) - - self.prompt_adapter_request = prompt_adapter_request self.multi_modal_kwargs = multi_modal_kwargs self.multi_modal_placeholder_maps = multi_modal_placeholder_maps self.prefix_cache_hit = prefix_cache_hit @@ -485,7 +450,6 @@ def __init__(self, # Compute functions for each sequence group. # WARNING: The order of the functions matters! self.per_seq_group_compute_fns = [ - self._compute_prompt_adapter_input, self._compute_multi_modal_input, ] @@ -496,8 +460,6 @@ def __init__(self, self.sliding_window = self.runner.sliding_window self.block_size = self.runner.block_size self.enable_lora = self.runner.lora_config is not None - self.enable_prompt_adapter = (self.runner.prompt_adapter_config - is not None) # Attention metadata inputs. if self.attn_backend is not None: @@ -693,34 +655,6 @@ def _compute_lora_input(self, inter_data: InterDataForSeqGroup, else: inter_data.lora_prompt_mapping.append([]) - def _compute_prompt_adapter_input( - self, inter_data: InterDataForSeqGroup, - seq_group_metadata: SequenceGroupMetadata): - """If prompt adapter is enabled, compute index and prompt mapping. - """ - # Note that when is_prompt=True, we expect only one sequence - # in the group. - if not self.enable_prompt_adapter: - return - - prompt_adapter_id = seq_group_metadata.prompt_adapter_id - if prompt_adapter_id <= 0 or not inter_data.is_prompt: - return - - # We expect only one sequence in the group when is_prompt=True. - assert inter_data.n_seqs == 1 - query_len = inter_data.query_lens[0] - inter_data.prompt_adapter_request = ( - seq_group_metadata.prompt_adapter_request) - - num_tokens = seq_group_metadata.prompt_adapter_num_virtual_tokens - inter_data.prompt_adapter_index_mapping = [ - prompt_adapter_id - ] * num_tokens + [0] * (query_len - num_tokens) - inter_data.prompt_adapter_prompt_mapping = [prompt_adapter_id] * ( - query_len if seq_group_metadata.sampling_params - and seq_group_metadata.sampling_params.prompt_logprobs else 1) - def _compute_multi_modal_input(self, inter_data: InterDataForSeqGroup, seq_group_metadata: SequenceGroupMetadata): """If multi-modal data is given, add it to the input.""" @@ -1009,29 +943,6 @@ def build(self) -> ModelInputForGPU: prompt_mapping=lora_prompt_mapping, is_prefill=not self.decode_only)) - # Prompt adapter data. 
- prompt_adapter_requests: Set[PromptAdapterRequest] = set() - prompt_adapter_mapping = None - if self.enable_prompt_adapter: - prompt_adapter_requests = set( - data.prompt_adapter_request for data in self.inter_data_list - if data.prompt_adapter_request is not None) - prompt_adapter_index_mapping = flatten_2d_lists([ - inter_data.prompt_adapter_index_mapping - for inter_data in self.inter_data_list - ]) - if cuda_graph_pad_size: - prompt_adapter_index_mapping.extend( - itertools.repeat(0, cuda_graph_pad_size)) - prompt_adapter_prompt_mapping = flatten_2d_lists([ - inter_data.prompt_adapter_prompt_mapping - for inter_data in self.inter_data_list - ]) - prompt_adapter_mapping = PromptAdapterMapping( - prompt_adapter_index_mapping, - prompt_adapter_prompt_mapping, - ) - # Multi-modal data. multi_modal_kwargs_list = [ data.multi_modal_kwargs for data in self.inter_data_list @@ -1051,9 +962,7 @@ def build(self) -> ModelInputForGPU: lora_requests=lora_requests, multi_modal_kwargs=multi_modal_kwargs, request_ids_to_seq_ids=request_ids_to_seq_ids, - finished_requests_ids=self.finished_requests_ids, - prompt_adapter_mapping=prompt_adapter_mapping, - prompt_adapter_requests=prompt_adapter_requests) + finished_requests_ids=self.finished_requests_ids) class GPUModelRunnerBase(ModelRunnerBase[TModelInputForGPU]): @@ -1148,7 +1057,6 @@ def __init__( self.model: nn.Module # Set after load_model # Set after load_model. self.lora_manager: Optional[LRUCacheWorkerLoRAManager] = None - self.prompt_adapter_manager: LRUCacheWorkerPromptAdapterManager = None self.sampler = get_sampler() set_cpu_offload_max_bytes( @@ -1207,14 +1115,7 @@ def load_model(self) -> None: logger.info("Model loading took %.4f GiB and %.6f seconds", self.model_memory_usage / GiB_bytes, time_after_load - time_before_load) - if self.prompt_adapter_config: - self.prompt_adapter_manager = LRUCacheWorkerPromptAdapterManager( - self.scheduler_config.max_num_seqs, - self.scheduler_config.max_num_batched_tokens, self.device, - self.prompt_adapter_config) - self.model = ( - self.prompt_adapter_manager.create_prompt_adapter_manager( - self.model)) + if self.vllm_config.compilation_config.level ==\ CompilationLevel.DYNAMO_AS_IS and supports_dynamo(): @@ -1466,40 +1367,6 @@ def list_loras(self) -> Set[int]: raise RuntimeError("LoRA is not enabled.") return self.lora_manager.list_adapters() - def remove_all_prompt_adapters(self): - if not self.prompt_adapter_manager: - raise RuntimeError("PromptAdapter is not enabled.") - self.prompt_adapter_manager.remove_all_adapters() - - def set_active_prompt_adapters( - self, prompt_adapter_requests: Set[PromptAdapterRequest], - prompt_adapter_mapping: PromptAdapterMapping) -> None: - if not self.prompt_adapter_manager: - raise RuntimeError("PromptAdapter is not enabled.") - self.prompt_adapter_manager.set_active_adapters( - prompt_adapter_requests, prompt_adapter_mapping) - - def add_prompt_adapter( - self, prompt_adapter_request: PromptAdapterRequest) -> bool: - if not self.prompt_adapter_manager: - raise RuntimeError("PromptAdapter is not enabled.") - return self.prompt_adapter_manager.add_adapter(prompt_adapter_request) - - def remove_prompt_adapter(self, prompt_adapter_id: int) -> bool: - if not self.prompt_adapter_manager: - raise RuntimeError("PromptAdapter is not enabled.") - return self.prompt_adapter_manager.remove_adapter(prompt_adapter_id) - - def pin_prompt_adapter(self, prompt_adapter_id: int) -> bool: - if not self.prompt_adapter_manager: - raise RuntimeError("PromptAdapter is not enabled.") - 
return self.prompt_adapter_manager.pin_adapter(prompt_adapter_id) - - def list_prompt_adapters(self) -> Set[int]: - if not self.prompt_adapter_manager: - raise RuntimeError("PromptAdapter is not enabled.") - return self.prompt_adapter_manager.list_adapters() - @torch.inference_mode() def capture_model(self, kv_caches: List[List[torch.Tensor]]) -> None: """Cuda graph capture a model. @@ -1609,13 +1476,6 @@ def capture_model(self, kv_caches: List[List[torch.Tensor]]) -> None: self.set_active_loras(set([dummy_lora_request]), lora_mapping) - if self.prompt_adapter_config: - prompt_adapter_mapping = PromptAdapterMapping( - [-1] * batch_size, - [-1] * batch_size, - ) - self.set_active_prompt_adapters( - set(), prompt_adapter_mapping) graph_runner = CUDAGraphRunner( self.model, self.attn_backend.get_name(), self.attn_state.graph_clone(batch_size), @@ -1776,13 +1636,6 @@ def execute_model( self.set_active_loras(model_input.lora_requests, model_input.lora_mapping) - if self.prompt_adapter_config: - assert model_input.prompt_adapter_requests is not None - assert model_input.prompt_adapter_mapping is not None - self.set_active_prompt_adapters( - model_input.prompt_adapter_requests, - model_input.prompt_adapter_mapping) - self.attn_state.begin_forward(model_input) # Currently cuda graph is only supported by the decode phase. diff --git a/vllm/worker/model_runner_base.py b/vllm/worker/model_runner_base.py index 62f26ac57a9..feca8a7a1e7 100644 --- a/vllm/worker/model_runner_base.py +++ b/vllm/worker/model_runner_base.py @@ -190,7 +190,6 @@ def __init__( self.scheduler_config = vllm_config.scheduler_config self.device_config = vllm_config.device_config self.speculative_config = vllm_config.speculative_config - self.prompt_adapter_config = vllm_config.prompt_adapter_config self.observability_config = vllm_config.observability_config # Map of request_id -> generator used for seeded random sampling diff --git a/vllm/worker/multi_step_model_runner.py b/vllm/worker/multi_step_model_runner.py index 0680e60b52a..2aa910bdff6 100644 --- a/vllm/worker/multi_step_model_runner.py +++ b/vllm/worker/multi_step_model_runner.py @@ -288,9 +288,6 @@ def maybe_advance_frozen_model_input(self, device: str, pin_memory: bool): assert fmi.lora_requests is not None assert len(fmi.lora_requests) == 0 assert fmi.attn_metadata is not None - assert fmi.prompt_adapter_mapping is None - assert fmi.prompt_adapter_requests is not None - assert len(fmi.prompt_adapter_requests) == 0 assert fmi.multi_modal_kwargs is not None assert len(fmi.multi_modal_kwargs) == 0 diff --git a/vllm/worker/pooling_model_runner.py b/vllm/worker/pooling_model_runner.py index d91b16be83d..e49783ad9b2 100644 --- a/vllm/worker/pooling_model_runner.py +++ b/vllm/worker/pooling_model_runner.py @@ -64,13 +64,6 @@ def execute_model( self.set_active_loras(model_input.lora_requests, model_input.lora_mapping) - if self.prompt_adapter_config: - assert model_input.prompt_adapter_requests is not None - assert model_input.prompt_adapter_mapping is not None - self.set_active_prompt_adapters( - model_input.prompt_adapter_requests, - model_input.prompt_adapter_mapping) - # Currently cuda graph is only supported by the decode phase. 
assert model_input.attn_metadata is not None prefill_meta = model_input.attn_metadata.prefill_metadata diff --git a/vllm/worker/utils.py b/vllm/worker/utils.py index 1a5f62cb3c4..512a1dca737 100644 --- a/vllm/worker/utils.py +++ b/vllm/worker/utils.py @@ -47,7 +47,3 @@ def assert_enc_dec_mr_supported_scenario( if enc_dec_mr.scheduler_config.num_lookahead_slots > 0: raise NotImplementedError( STR_NOT_IMPL_ENC_DEC_ERR_STRS['STR_NOT_IMPL_ENC_DEC_SPEC_DEC']) - - if enc_dec_mr.prompt_adapter_config is not None: - raise NotImplementedError(STR_NOT_IMPL_ENC_DEC_ERR_STRS[ - 'STR_NOT_IMPL_ENC_DEC_PROMPT_ADAPTER']) diff --git a/vllm/worker/worker.py b/vllm/worker/worker.py index 6b6943d7643..9dfea947568 100644 --- a/vllm/worker/worker.py +++ b/vllm/worker/worker.py @@ -22,7 +22,6 @@ from vllm.model_executor.layers.sampler import SamplerOutput from vllm.model_executor.model_loader.tensorizer import TensorizerConfig from vllm.platforms import current_platform -from vllm.prompt_adapter.request import PromptAdapterRequest from vllm.sequence import (ExecuteModelRequest, IntermediateTensors, SequenceGroupMetadata, SequenceGroupMetadataDelta) from vllm.utils import (GiB_bytes, MemorySnapshot, bind_kv_cache, @@ -513,19 +512,6 @@ def pin_lora(self, lora_id: int) -> bool: def list_loras(self) -> Set[int]: return self.model_runner.list_loras() - def add_prompt_adapter( - self, prompt_adapter_request: PromptAdapterRequest) -> bool: - return self.model_runner.add_prompt_adapter(prompt_adapter_request) - - def remove_prompt_adapter(self, prompt_adapter_id: int) -> bool: - return self.model_runner.remove_lora(prompt_adapter_id) - - def pin_prompt_adapter(self, prompt_adapter_id: int) -> bool: - return self.model_runner.pin_prompt_adapter(prompt_adapter_id) - - def list_prompt_adapters(self) -> Set[int]: - return self.model_runner.list_prompt_adapters() - @property def max_model_len(self) -> int: return self.model_config.max_model_len diff --git a/vllm/worker/worker_base.py b/vllm/worker/worker_base.py index 55705062d39..f1c9a0ab001 100644 --- a/vllm/worker/worker_base.py +++ b/vllm/worker/worker_base.py @@ -49,7 +49,6 @@ def __init__( self.scheduler_config = vllm_config.scheduler_config self.device_config = vllm_config.device_config self.speculative_config = vllm_config.speculative_config - self.prompt_adapter_config = vllm_config.prompt_adapter_config self.observability_config = vllm_config.observability_config self.kv_transfer_config = vllm_config.kv_transfer_config self.compilation_config = vllm_config.compilation_config From 886fcef024b021e60f5e8e090a240e39f5543a64 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 23 Jul 2025 20:20:14 -0400 Subject: [PATCH 296/552] [Core] Freeze gc during cuda graph capture to speed up init (#21146) Signed-off-by: Codex Signed-off-by: mgoin Signed-off-by: x22x22 --- vllm/envs.py | 7 +++++++ vllm/v1/worker/gpu_model_runner.py | 17 ++++++++++++++++- 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/vllm/envs.py b/vllm/envs.py index 16f635b3ac4..ca45d69eec1 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -140,6 +140,7 @@ VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB: Optional[int] = None VLLM_NIXL_ABORT_REQUEST_TIMEOUT: int = 120 VLLM_USE_CUDNN_PREFILL: bool = False + VLLM_ENABLE_CUDAGRAPH_GC: bool = False VLLM_LOOPBACK_IP: str = "" @@ -968,6 +969,12 @@ def get_vllm_port() -> Optional[int]: "VLLM_USE_TRTLLM_DECODE_ATTENTION": lambda: os.getenv("VLLM_USE_TRTLLM_DECODE_ATTENTION", None), + # Controls garbage collection during CUDA graph capture. 
+    # If set to 0 (default), enables GC freezing to speed up capture time.
+    # If set to 1, allows GC to run during capture.
+    "VLLM_ENABLE_CUDAGRAPH_GC":
+    lambda: bool(int(os.getenv("VLLM_ENABLE_CUDAGRAPH_GC", "0"))),
+
     # Used to force set up loopback IP
     "VLLM_LOOPBACK_IP":
     lambda: os.getenv("VLLM_LOOPBACK_IP", ""),
diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py
index 3671b466070..a5bf197ba16 100644
--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -2439,10 +2439,25 @@ def capture_model(self) -> None:
         start_time = time.perf_counter()
         start_free_gpu_memory = torch.cuda.mem_get_info()[0]
 
+        @contextmanager
+        def freeze_gc():
+            # Optimize garbage collection during CUDA graph capture.
+            # Clean up, then freeze all remaining objects from being included
+            # in future collections.
+            gc.collect()
+            should_freeze = not envs.VLLM_ENABLE_CUDAGRAPH_GC
+            if should_freeze:
+                gc.freeze()
+            try:
+                yield
+            finally:
+                if should_freeze:
+                    gc.unfreeze()
+
         # Trigger CUDA graph capture for specific shapes.
         # Capture the large shapes first so that the smaller shapes
         # can reuse the memory pool allocated for the large shapes.
-        with graph_capture(device=self.device):
+        with freeze_gc(), graph_capture(device=self.device):
             full_cg = self.full_cuda_graph
             # Only rank 0 should print progress bar during capture
             compilation_cases = reversed(self.cudagraph_batch_sizes)

From a7af151b0fa9c769d5a5241803edbf679c6108de Mon Sep 17 00:00:00 2001
From: Hardik Gupta <40640596+hardikkgupta@users.noreply.github.com>
Date: Wed, 23 Jul 2025 20:21:02 -0700
Subject: [PATCH 297/552] feat(gguf_loader): accept HF repo paths & URLs for
 GGUF (#20793)

Signed-off-by: Hardik
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: x22x22
---
 vllm/model_executor/model_loader/gguf_loader.py | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/vllm/model_executor/model_loader/gguf_loader.py b/vllm/model_executor/model_loader/gguf_loader.py
index 203c8076014..26af87c1ed6 100644
--- a/vllm/model_executor/model_loader/gguf_loader.py
+++ b/vllm/model_executor/model_loader/gguf_loader.py
@@ -6,6 +6,7 @@
 import gguf
 import torch
 import torch.nn as nn
+from huggingface_hub import hf_hub_download
 from transformers import AutoModelForCausalLM
 
 from vllm.config import LoadConfig, ModelConfig, VllmConfig
@@ -32,8 +33,18 @@ def __init__(self, load_config: LoadConfig):
     def _prepare_weights(self, model_name_or_path: str):
         if os.path.isfile(model_name_or_path):
             return model_name_or_path
+        # for raw HTTPS link
+        if model_name_or_path.startswith(
+            ("http://", "https://")) and model_name_or_path.endswith(".gguf"):
+            return hf_hub_download(url=model_name_or_path)
+        # repo id/filename.gguf
+        if "/" in model_name_or_path and model_name_or_path.endswith(".gguf"):
+            repo_id, filename = model_name_or_path.rsplit("/", 1)
+            return hf_hub_download(repo_id=repo_id, filename=filename)
         else:
-            raise ValueError(f"{model_name_or_path} is not a file.")
+            raise ValueError(
+                f"Unrecognised GGUF reference: {model_name_or_path} "
+                "(expected local file, raw URL, or <repo_id>/<filename>.gguf)")
 
     def _get_gguf_weights_map(self, model_config: ModelConfig):
         """

From c7335716818b0e6382c88adbef09e2b8b7ba0dee Mon Sep 17 00:00:00 2001
From: deven-labovitch
Date: Wed, 23 Jul 2025 23:22:19 -0400
Subject: [PATCH 298/552] [Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via 
env var instead of hardcoding (#21374) Signed-off-by: Deven Labovitch Signed-off-by: x22x22 --- docs/serving/openai_compatible_server.md | 5 +++++ vllm/entrypoints/openai/speech_to_text.py | 9 ++++----- vllm/envs.py | 7 +++++++ 3 files changed, 16 insertions(+), 5 deletions(-) diff --git a/docs/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md index 2cf45eeaab4..edec40f4176 100644 --- a/docs/serving/openai_compatible_server.md +++ b/docs/serving/openai_compatible_server.md @@ -351,6 +351,11 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai Code example: +#### API Enforced Limits + +Set the maximum audio file size (in MB) that VLLM will accept, via the +`VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` environment variable. Default is 25 MB. + #### Extra Parameters The following [sampling parameters][sampling-params] are supported. diff --git a/vllm/entrypoints/openai/speech_to_text.py b/vllm/entrypoints/openai/speech_to_text.py index e26e1b748b8..c2227a21a4b 100644 --- a/vllm/entrypoints/openai/speech_to_text.py +++ b/vllm/entrypoints/openai/speech_to_text.py @@ -11,6 +11,7 @@ import numpy as np from fastapi import Request +import vllm.envs as envs from vllm.config import ModelConfig from vllm.engine.protocol import EngineClient from vllm.entrypoints.logger import RequestLogger @@ -38,10 +39,6 @@ logger = init_logger(__name__) -# As per https://platform.openai.com/docs/guides/speech-to-text#overview. -# TODO configurable -MAX_AUDIO_CLIP_FILESIZE_MB = 25 - class OpenAISpeechToText(OpenAIServing): """Base class for speech-to-text operations like transcription and @@ -70,6 +67,8 @@ def __init__( self.asr_config = self.model_cls.get_speech_to_text_config( model_config, task_type) + self.max_audio_filesize_mb = envs.VLLM_MAX_AUDIO_CLIP_FILESIZE_MB + if self.default_sampling_params: logger.info( "Overwriting default completion sampling param with: %s", @@ -93,7 +92,7 @@ async def _preprocess_speech_to_text( lang = request.language or "en" self.model_cls.validate_language(lang) - if len(audio_data) / 1024**2 > MAX_AUDIO_CLIP_FILESIZE_MB: + if len(audio_data) / 1024**2 > self.max_audio_filesize_mb: raise ValueError("Maximum file size exceeded.") with io.BytesIO(audio_data) as bytes_: diff --git a/vllm/envs.py b/vllm/envs.py index ca45d69eec1..5c414e82d93 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -61,6 +61,7 @@ VLLM_IMAGE_FETCH_TIMEOUT: int = 5 VLLM_VIDEO_FETCH_TIMEOUT: int = 30 VLLM_AUDIO_FETCH_TIMEOUT: int = 10 + VLLM_MAX_AUDIO_CLIP_FILESIZE_MB: int = 25 VLLM_VIDEO_LOADER_BACKEND: str = "opencv" VLLM_MM_INPUT_CACHE_GIB: int = 8 VLLM_TARGET_DEVICE: str = "cuda" @@ -519,6 +520,12 @@ def get_vllm_port() -> Optional[int]: "VLLM_AUDIO_FETCH_TIMEOUT": lambda: int(os.getenv("VLLM_AUDIO_FETCH_TIMEOUT", "10")), + # Maximum filesize in MB for a single audio file when processing + # speech-to-text requests. Files larger than this will be rejected. + # Default is 25 MB + "VLLM_MAX_AUDIO_CLIP_FILESIZE_MB": + lambda: int(os.getenv("VLLM_MAX_AUDIO_CLIP_FILESIZE_MB", "25")), + # Backend for Video IO # - "opencv": Default backend that uses OpenCV stream buffered backend. 
# From 9d596494d406b4247c187a6fe9700f0912cd724d Mon Sep 17 00:00:00 2001 From: Ming Yang Date: Wed, 23 Jul 2025 20:22:42 -0700 Subject: [PATCH 299/552] [Misc] Add dummy maverick test to CI (#21324) Signed-off-by: Ming Yang Co-authored-by: Cyrus Leung Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 1 + tests/models/multimodal/generation/test_maverick.py | 3 +++ 2 files changed, 4 insertions(+) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index c7378bf8ba5..c2e56557ba9 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -718,6 +718,7 @@ steps: - VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py - VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/test_disagg.py - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown + - pytest -v -s models/multimodal/generation/test_maverick.py - label: Plugin Tests (2 GPUs) # 40min mirror_hardwares: [amdexperimental] diff --git a/tests/models/multimodal/generation/test_maverick.py b/tests/models/multimodal/generation/test_maverick.py index 083dc66148e..306cf39002d 100644 --- a/tests/models/multimodal/generation/test_maverick.py +++ b/tests/models/multimodal/generation/test_maverick.py @@ -23,6 +23,8 @@ from vllm import LLM, SamplingParams +from ....utils import multi_gpu_test + # Sample prompts for testing PROMPTS: list[str] = [ "Hello, my name is", @@ -541,6 +543,7 @@ def run_reduced_model(model_path: str, print("-" * 40) +@multi_gpu_test(num_gpus=2) @pytest.mark.parametrize( "original_model_name,text_layers,num_experts,vision_layers,", [("meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", 4, 4, 2)]) From 0793cd99c6afed6609aecd049ef602a06f847d3d Mon Sep 17 00:00:00 2001 From: Liangliang Ma Date: Thu, 24 Jul 2025 11:24:04 +0800 Subject: [PATCH 300/552] [XPU][UT] increase intel xpu CI test scope (#21492) Signed-off-by: Ma, Liangliang Signed-off-by: x22x22 --- .buildkite/scripts/hardware_ci/run-xpu-test.sh | 9 +++++++++ docker/Dockerfile.xpu | 2 +- tests/entrypoints/openai/correctness/test_lmeval.py | 5 +++-- 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/.buildkite/scripts/hardware_ci/run-xpu-test.sh b/.buildkite/scripts/hardware_ci/run-xpu-test.sh index 7589b48b584..deb61a9bafa 100644 --- a/.buildkite/scripts/hardware_ci/run-xpu-test.sh +++ b/.buildkite/scripts/hardware_ci/run-xpu-test.sh @@ -31,4 +31,13 @@ docker run \ VLLM_USE_V1=1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend mp cd tests pytest -v -s v1/core + pytest -v -s v1/engine + pytest -v -s v1/sample --ignore=v1/sample/test_logprobs.py --ignore=v1/sample/test_logprobs_e2e.py + pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py + pytest -v -s v1/structured_output + pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_eagle.py + pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py + pytest -v -s v1/test_serial_utils.py + pytest -v -s v1/test_utils.py + pytest -v -s v1/test_metrics_reader.py ' diff --git a/docker/Dockerfile.xpu b/docker/Dockerfile.xpu index 3130435ca72..7d5a589eb1d 100644 --- a/docker/Dockerfile.xpu +++ b/docker/Dockerfile.xpu @@ -47,7 +47,7 @@ FROM vllm-base AS vllm-openai # install additional dependencies for openai api server RUN --mount=type=cache,target=/root/.cache/pip \ - pip install accelerate hf_transfer 
pytest modelscope + pip install accelerate hf_transfer pytest pytest_asyncio lm_eval[api] modelscope ENV VLLM_USAGE_SOURCE production-docker-image \ TRITON_XPU_PROFILE 1 diff --git a/tests/entrypoints/openai/correctness/test_lmeval.py b/tests/entrypoints/openai/correctness/test_lmeval.py index 41b70f80e3b..a07a147cdc2 100644 --- a/tests/entrypoints/openai/correctness/test_lmeval.py +++ b/tests/entrypoints/openai/correctness/test_lmeval.py @@ -69,8 +69,9 @@ def run_test(more_args): @pytest.mark.skipif(not current_platform.is_cuda() - and not current_platform.is_tpu(), - reason="V1 currently only supported on CUDA and TPU") + and not current_platform.is_tpu() + and not current_platform.is_xpu(), + reason="V1 currently only supported on CUDA, XPU and TPU") def test_lm_eval_accuracy_v1_engine(monkeypatch: pytest.MonkeyPatch): """Run with the V1 Engine.""" From fb11717ccb650d964ebcb82af8953df8fc9a8457 Mon Sep 17 00:00:00 2001 From: Matthew Bonanni Date: Wed, 23 Jul 2025 23:41:23 -0400 Subject: [PATCH 301/552] [Bugfix] Fix casing warning (#21468) Signed-off-by: Matthew Bonanni Signed-off-by: x22x22 --- docker/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index d1fa92ce6d1..868b8170466 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -265,7 +265,7 @@ RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \ #################### EXTENSION Build IMAGE #################### #################### DEV IMAGE #################### -FROM base as dev +FROM base AS dev ARG PIP_INDEX_URL UV_INDEX_URL ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL From 07dc5f094ee62d2ce01cc9076486bec389c9e28d Mon Sep 17 00:00:00 2001 From: WeiQing Chen <40507679+david6666666@users.noreply.github.com> Date: Thu, 24 Jul 2025 11:42:11 +0800 Subject: [PATCH 302/552] [Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (#21437) Signed-off-by: David Chen <530634352@qq.com> Signed-off-by: x22x22 --- .../disagg_example_p2p_nccl_xpyd.sh | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh b/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh index 2966f386c93..76f5c0c99d0 100644 --- a/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh +++ b/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh @@ -93,6 +93,7 @@ ensure_python_library_installed() { cleanup() { echo "Stopping everything…" trap - INT TERM # prevent re-entrancy + pkill -9 -f "disagg_proxy_p2p_nccl_xpyd.py" kill -- -$$ # negative PID == "this whole process-group" wait # reap children so we don't leave zombies exit 0 From 7b09c92bf2a8c106e661f6b2f8df0fea7b92f970 Mon Sep 17 00:00:00 2001 From: KazusatoOoko <49611861+KazusatoOoko@users.noreply.github.com> Date: Wed, 23 Jul 2025 20:43:17 -0700 Subject: [PATCH 303/552] [BugFix]: Batch generation from prompt_embeds fails for long prompts (#21390) Signed-off-by: KazusatoOko Co-authored-by: KazusatoOko Signed-off-by: x22x22 --- vllm/worker/model_runner.py | 36 ++++++++++++++++++++++-------------- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/vllm/worker/model_runner.py b/vllm/worker/model_runner.py index 4bea37c8530..5a185e7451a 100644 --- a/vllm/worker/model_runner.py +++ b/vllm/worker/model_runner.py @@ -1785,24 +1785,32 @@ def execute_model( if model_input.inputs_embeds is not None: if self.is_driver_worker: - sampled = 
broadcast_tensor_dict( - {"token_ids": output.sampled_token_ids}) + sampled_token_ids = [] + valid_outputs = [] + for sequence_group_output in output.outputs: + if len(sequence_group_output.samples) == 0: + continue + assert len(sequence_group_output.samples) == 1 + valid_outputs.append(sequence_group_output) + sampled_token_ids.append( + sequence_group_output.samples[0].output_token) + sampled_token_ids = torch.tensor(sampled_token_ids).to( + self.device) + sampled_token_ids = broadcast_tensor_dict( + {"sampled_token_ids": + sampled_token_ids})["sampled_token_ids"] else: - sampled = broadcast_tensor_dict() - if sampled["token_ids"] is not None: - sampled_token_embeds = self.model.get_input_embeddings( - sampled["token_ids"].squeeze(1)) + sampled_token_ids = broadcast_tensor_dict( + )["sampled_token_ids"] + if len(sampled_token_ids) > 0: + sampled_token_embeds = \ + self.model.get_input_embeddings(sampled_token_ids) if self.is_driver_worker: self.sampler.include_gpu_probs_tensor = \ orig_include_gpu_probs - - output.sampled_token_embeds = sampled_token_embeds - - for token_embed, sequence_group_output in zip( - output.sampled_token_embeds, output.outputs): - assert len(sequence_group_output.samples) == 1 - sequence_group_output.samples[ - 0].output_embed = token_embed + for i, sequence_group_output in enumerate(valid_outputs): + sequence_group_output.samples[0].output_embed = \ + sampled_token_embeds[i] if not self.is_driver_worker: return [] From c0a91bae8324867e29dc73270f736f068d833180 Mon Sep 17 00:00:00 2001 From: Nick Hill Date: Thu, 24 Jul 2025 04:56:49 +0100 Subject: [PATCH 304/552] [BugFix] Fix KVConnector TP worker aggregation (#21473) Signed-off-by: Nick Hill Signed-off-by: x22x22 --- vllm/v1/worker/gpu_worker.py | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py index 1c180322e12..52294635114 100644 --- a/vllm/v1/worker/gpu_worker.py +++ b/vllm/v1/worker/gpu_worker.py @@ -16,7 +16,8 @@ from vllm.distributed import (ensure_model_parallel_initialized, init_distributed_environment, set_custom_all_reduce) -from vllm.distributed.kv_transfer import ensure_kv_transfer_initialized +from vllm.distributed.kv_transfer import (ensure_kv_transfer_initialized, + has_kv_transfer_group) from vllm.distributed.parallel_state import get_pp_group, get_tp_group from vllm.logger import init_logger from vllm.lora.request import LoRARequest @@ -342,19 +343,20 @@ def execute_model( assert isinstance(output, IntermediateTensors) get_pp_group().send_tensor_dict(output.tensors, all_gather_group=get_tp_group()) + if not has_kv_transfer_group(): + return None # In case of PP with kv transfer, we need to pass through the # finished_sending and finished_recving buffers. 
- empty_output = EMPTY_MODEL_RUNNER_OUTPUT + new_output = EMPTY_MODEL_RUNNER_OUTPUT if output.finished_sending or output.finished_recving: - empty_output = copy.copy(empty_output) - empty_output.finished_sending = output.finished_sending - empty_output.finished_recving = output.finished_recving - output = empty_output + new_output = copy.copy(new_output) + new_output.finished_sending = output.finished_sending + new_output.finished_recving = output.finished_recving + output = new_output assert isinstance(output, ModelRunnerOutput) - # return output only from the driver worker - return output if self.is_driver_worker else None + return output def profile(self, is_start: bool = True): if self.profiler is None: From 648cc37cd8aecbd42a0206c8d1e32fd5a2acab9e Mon Sep 17 00:00:00 2001 From: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Date: Wed, 23 Jul 2025 23:57:32 -0400 Subject: [PATCH 305/552] [DP] Internal Load Balancing Per Node [`one-pod-per-node`] (#21238) Signed-off-by: Robert Shaw Signed-off-by: Nick Hill Signed-off-by: Tyler Michael Smith Co-authored-by: Robert Shaw Co-authored-by: Nick Hill Co-authored-by: Tyler Michael Smith Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 2 + tests/v1/engine/test_engine_core_client.py | 4 +- tests/v1/test_hybrid_lb_dp.py | 352 +++++++++++++++++++++ vllm/config.py | 12 +- vllm/engine/arg_utils.py | 38 +++ vllm/entrypoints/cli/serve.py | 19 +- vllm/entrypoints/openai/cli_args.py | 7 - vllm/v1/engine/async_llm.py | 2 +- vllm/v1/engine/coordinator.py | 5 +- vllm/v1/engine/core.py | 19 +- vllm/v1/engine/core_client.py | 27 +- vllm/v1/engine/utils.py | 44 ++- 12 files changed, 486 insertions(+), 45 deletions(-) create mode 100644 tests/v1/test_hybrid_lb_dp.py diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index c2e56557ba9..948ce9e8667 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -166,6 +166,7 @@ steps: - tests/v1/test_async_llm_dp.py - tests/v1/test_external_lb_dp.py - tests/v1/test_internal_lb_dp.py + - tests/v1/test_hybrid_lb_dp.py - tests/v1/engine/test_engine_core_client.py commands: # test with tp=2 and external_dp=2 @@ -178,6 +179,7 @@ steps: - TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_async_llm_dp.py - TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_external_lb_dp.py - TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/test_internal_lb_dp.py + - TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/test_hybrid_lb_dp.py - pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp - pytest -v -s distributed/test_utils.py - pytest -v -s compile/test_basic_correctness.py diff --git a/tests/v1/engine/test_engine_core_client.py b/tests/v1/engine/test_engine_core_client.py index 65f1da803fb..2ac6dc796bd 100644 --- a/tests/v1/engine/test_engine_core_client.py +++ b/tests/v1/engine/test_engine_core_client.py @@ -565,8 +565,8 @@ def create_mock_executor(vllm_config): from vllm.v1.engine.utils import EngineZmqAddresses - def mock_startup_handshake(self, handshake_socket, on_head_node, - parallel_config): + def mock_startup_handshake(self, handshake_socket, local_client, + headless, parallel_config): return EngineZmqAddresses(inputs=["tcp://127.0.0.1:5555"], outputs=["tcp://127.0.0.1:5556"], coordinator_input=None, diff --git a/tests/v1/test_hybrid_lb_dp.py b/tests/v1/test_hybrid_lb_dp.py new file mode 100644 index 00000000000..08336489abe --- /dev/null +++ b/tests/v1/test_hybrid_lb_dp.py @@ -0,0 +1,352 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright 
contributors to the vLLM project +import asyncio +import os +import threading +import time +from contextlib import AsyncExitStack + +import openai # use the official client for correctness check +import pytest +import pytest_asyncio + +from tests.utils import RemoteOpenAIServer +from tests.v1.test_utils import check_request_balancing +from vllm.platforms import Platform + +MODEL_NAME = "ibm-research/PowerMoE-3b" + +# Number of data parallel ranks for hybrid LB testing (4 total) +DP_SIZE = int(os.getenv("DP_SIZE", "4")) +# Default tensor parallel size to use +TP_SIZE = int(os.getenv("TP_SIZE", "1")) + +# Number of nodes (2 nodes, each with 2 DP ranks) +NUM_NODES = 2 +DP_SIZE_LOCAL = DP_SIZE // NUM_NODES # 2 ranks per node + + +class HybridLBServerManager: + """Manages hybrid data parallel vLLM server instances where each node + runs a single logical API server that balances requests only to the + DP engines running on that same node.""" + + def __init__(self, + model_name: str, + dp_size: int, + api_server_count: int, + base_server_args: list, + dp_size_local: int = DP_SIZE_LOCAL, + tp_size: int = TP_SIZE): + self.model_name = model_name + self.dp_size = dp_size + self.dp_size_local = dp_size_local + self.tp_size = tp_size + self.api_server_count = api_server_count + self.base_server_args = base_server_args + self.servers: list[tuple[RemoteOpenAIServer, list[str]]] = [] + self.server_threads: list[threading.Thread] = [] + self.num_nodes = dp_size // dp_size_local + + def __enter__(self) -> list[tuple[RemoteOpenAIServer, list[str]]]: + """Start all server instances for hybrid LB mode.""" + for node_id in range(self.num_nodes): + # Create server args for this specific node + server_args = self.base_server_args.copy() + + # Calculate start rank for this node + start_rank = node_id * self.dp_size_local + + # Add hybrid LB specific arguments + server_args.extend([ + "--data-parallel-size", + str(self.dp_size), + "--data-parallel-size-local", + str(self.dp_size_local), + "--data-parallel-start-rank", + str(start_rank), + "--data-parallel-hybrid-lb", # Enable hybrid LB mode + "--tensor-parallel-size", + str(self.tp_size), + "--port", + str(8000 + node_id), # Different port for each node + "--api-server-count", + str(self.api_server_count), + "--data-parallel-address", + "127.0.0.1", + "--data-parallel-rpc-port", + "13345", + ]) + + # Use a thread to start each server to allow parallel initialization + def start_server(node: int, sargs: list[str]): + try: + # Calculate GPU devices for this node + gpus_per_node = self.dp_size_local * self.tp_size + gpu_start = node * gpus_per_node + gpu_end = gpu_start + gpus_per_node + + # Start the server + server = RemoteOpenAIServer( + self.model_name, + sargs, + auto_port=False, + env_dict={ + "CUDA_VISIBLE_DEVICES": + ",".join( + str(Platform.device_id_to_physical_device_id( + i)) for i in range(gpu_start, gpu_end)) + }) + server.__enter__() + print(f"Hybrid LB node {node} started successfully with " + f"{self.dp_size_local} local DP ranks and " + f"{self.api_server_count} API servers") + self.servers.append((server, sargs)) + except Exception as e: + print(f"Failed to start hybrid LB node {node}: {e}") + raise + + thread = threading.Thread(target=start_server, + args=(node_id, server_args)) + thread.start() + + self.server_threads.append(thread) + + # Wait for all servers to start + for thread in self.server_threads: + thread.join() + + # Give servers additional time to fully initialize and coordinate + time.sleep(3) + + if len(self.servers) != 
self.num_nodes: + raise Exception("Servers failed to start") + + return self.servers + + def __exit__(self, exc_type, exc_val, exc_tb): + """Stop all server instances.""" + while self.servers: + try: + self.servers.pop()[0].__exit__(exc_type, exc_val, exc_tb) + except Exception as e: + print(f"Error stopping server: {e}") + + +@pytest.fixture(scope="module") +def default_server_args(): + return [ + # use half precision for speed and memory savings in CI environment + "--dtype", + "bfloat16", + "--max-model-len", + "2048", + "--max-num-seqs", + "128", + "--enforce-eager", + ] + + +@pytest.fixture(scope="module", params=[1]) # Only 1 API server for now +def servers(request, default_server_args): + api_server_count = request.param + with HybridLBServerManager(MODEL_NAME, DP_SIZE, api_server_count, + default_server_args, DP_SIZE_LOCAL, + TP_SIZE) as server_list: + yield server_list + + +@pytest_asyncio.fixture +async def clients(servers: list[tuple[RemoteOpenAIServer, list[str]]]): + # Create a client for each node (each node has its own API endpoint) + async with AsyncExitStack() as stack: + yield [ + await stack.enter_async_context(server.get_async_client()) + for server, _ in servers + ] + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "model_name", + [MODEL_NAME], +) +async def test_hybrid_lb_completion(clients: list[openai.AsyncOpenAI], + servers: list[tuple[RemoteOpenAIServer, + list[str]]], + model_name: str) -> None: + + async def make_request(client: openai.AsyncOpenAI): + completion = await client.completions.create( + model=model_name, + prompt="Hello, my name is", + max_tokens=10, + temperature=1.0) + + assert completion.id is not None + assert completion.choices is not None and len(completion.choices) == 1 + + choice = completion.choices[0] + # The exact number of tokens can vary slightly with temperature=1.0, + # so we check for a reasonable minimum length. + assert len(choice.text) >= 1 + # Finish reason might not always be 'length' if the model finishes early + # or due to other reasons, especially with high temperature. + # So, we'll accept 'length' or 'stop'. + assert choice.finish_reason in ("length", "stop") + + # Token counts can also vary, so we check they are positive. 
+ assert completion.usage.completion_tokens > 0 + assert completion.usage.prompt_tokens > 0 + assert completion.usage.total_tokens > 0 + return completion + + # Test single request to each node + for i, client in enumerate(clients): + result = await make_request(client) + assert result is not None + print( + f"Hybrid LB node {i} handled single completion request successfully" + ) + + await asyncio.sleep(0.5) + + # Send requests to all nodes - each should balance within its local DP ranks + num_requests_per_node = 25 # Total 50 requests across 2 nodes + all_tasks = [] + + for i, client in enumerate(clients): + tasks = [make_request(client) for _ in range(num_requests_per_node)] + all_tasks.extend(tasks) + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests_per_node * len(clients) + assert all(completion is not None for completion in results) + + await asyncio.sleep(0.5) + + # Second burst of requests + all_tasks = [] + for i, client in enumerate(clients): + tasks = [make_request(client) for _ in range(num_requests_per_node)] + all_tasks.extend(tasks) + + results = await asyncio.gather(*all_tasks) + assert len(results) == num_requests_per_node * len(clients) + assert all(completion is not None for completion in results) + + _, server_args = servers[0] + api_server_count = ( + server_args.count('--api-server-count') + and server_args[server_args.index('--api-server-count') + 1] or 1) + print( + f"Successfully completed hybrid LB test with {len(clients)} nodes " + f"({DP_SIZE_LOCAL} DP ranks each, API server count: {api_server_count})" + ) + + # Check request balancing within each node + for i, (server, _) in enumerate(servers): + print(f"Checking request balancing for node {i}") + check_request_balancing(server, DP_SIZE_LOCAL) + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "model_name", + [MODEL_NAME], +) +async def test_hybrid_lb_completion_streaming(clients: list[ + openai.AsyncOpenAI], servers: list[tuple[RemoteOpenAIServer, list[str]]], + model_name: str) -> None: + prompt = "What is an LLM?" + + async def make_streaming_request(client: openai.AsyncOpenAI): + # Perform a non-streaming request to get the expected full output + single_completion = await client.completions.create( + model=model_name, + prompt=prompt, + max_tokens=5, + temperature=0.0, + ) + single_output = single_completion.choices[0].text + + # Perform the streaming request + stream = await client.completions.create(model=model_name, + prompt=prompt, + max_tokens=5, + temperature=0.0, + stream=True) + chunks: list[str] = [] + finish_reason_count = 0 + last_chunk = None + async for chunk in stream: + chunks.append(chunk.choices[0].text) + if chunk.choices[0].finish_reason is not None: + finish_reason_count += 1 + last_chunk = chunk # Keep track of the last chunk + + # finish reason should only return in the last block for OpenAI API + assert finish_reason_count == 1, ( + "Finish reason should appear exactly once.") + assert last_chunk is not None, ( + "Stream should have yielded at least one chunk.") + assert last_chunk.choices[ + 0].finish_reason == "length", "Finish reason should be 'length'." + # Check that the combined text matches the non-streamed version. + assert "".join( + chunks + ) == single_output, "Streamed output should match non-streamed output." 
+        return True  # Indicate success for this request
+
+    # Test single request to each node
+    for i, client in enumerate(clients):
+        result = await make_streaming_request(client)
+        assert result is not None
+        print(
+            f"Hybrid LB node {i} handled single streaming request successfully"
+        )
+
+    await asyncio.sleep(0.5)
+
+    # Send streaming requests to all nodes
+    num_requests_per_node = 25  # Total 50 requests across 2 nodes
+    all_tasks = []
+
+    for i, client in enumerate(clients):
+        tasks = [
+            make_streaming_request(client)
+            for _ in range(num_requests_per_node)
+        ]
+        all_tasks.extend(tasks)
+
+    results = await asyncio.gather(*all_tasks)
+    assert len(results) == num_requests_per_node * len(clients)
+    assert all(results), "Not all streaming requests completed successfully."
+
+    await asyncio.sleep(0.5)
+
+    # Second burst of streaming requests
+    all_tasks = []
+    for i, client in enumerate(clients):
+        tasks = [
+            make_streaming_request(client)
+            for _ in range(num_requests_per_node)
+        ]
+        all_tasks.extend(tasks)
+
+    results = await asyncio.gather(*all_tasks)
+    assert len(results) == num_requests_per_node * len(clients)
+    assert all(results), "Not all streaming requests completed successfully."
+
+    _, server_args = servers[0]
+    api_server_count = (
+        server_args.count('--api-server-count')
+        and server_args[server_args.index('--api-server-count') + 1] or 1)
+    print(f"Successfully completed hybrid LB streaming test with "
+          f"{len(clients)} nodes ({DP_SIZE_LOCAL} DP ranks each, "
+          f"API server count: {api_server_count})")
+
+    # Check request balancing within each node
+    for i, (server, _) in enumerate(servers):
+        print(f"Checking streaming request balancing for node {i}")
+        check_request_balancing(server, DP_SIZE_LOCAL)
diff --git a/vllm/config.py b/vllm/config.py
index 0632bb3db23..eb5ddef30f2 100644
--- a/vllm/config.py
+++ b/vllm/config.py
@@ -1908,8 +1908,16 @@ class ParallelConfig:
     """Backend to use for data parallel, either "mp" or "ray"."""
     data_parallel_external_lb: bool = False
     """Whether to use "external" DP LB mode. Applies only to online serving
-    and when data_parallel_size > 0. Set implicitly when
-    data_parallel_rank is provided explicitly to vllm serve."""
+    and when data_parallel_size > 0. This is useful for a "one-pod-per-rank"
+    wide-EP setup in Kubernetes. Set implicitly when --data-parallel-rank
+    is provided explicitly to vllm serve."""
+    data_parallel_hybrid_lb: bool = False
+    """Whether to use "hybrid" DP LB mode. Applies only to online serving
+    and when data_parallel_size > 0. Enables running an AsyncLLM
+    and API server on a "per-node" basis where vLLM load balances
+    between local data parallel ranks, but an external LB balances
+    between vLLM nodes/replicas. 
Set explicitly in conjunction with + --data-parallel-start-rank.""" enable_expert_parallel: bool = False """Use expert parallelism instead of tensor parallelism for MoE layers.""" enable_eplb: bool = False diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 62792fade4e..aec75f82631 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -295,9 +295,11 @@ class EngineArgs: tensor_parallel_size: int = ParallelConfig.tensor_parallel_size data_parallel_size: int = ParallelConfig.data_parallel_size data_parallel_rank: Optional[int] = None + data_parallel_start_rank: Optional[int] = None data_parallel_size_local: Optional[int] = None data_parallel_address: Optional[str] = None data_parallel_rpc_port: Optional[int] = None + data_parallel_hybrid_lb: bool = False data_parallel_backend: str = ParallelConfig.data_parallel_backend enable_expert_parallel: bool = ParallelConfig.enable_expert_parallel enable_eplb: bool = ParallelConfig.enable_eplb @@ -604,6 +606,11 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: type=int, help='Data parallel rank of this instance. ' 'When set, enables external load balancer mode.') + parallel_group.add_argument('--data-parallel-start-rank', + '-dpr', + type=int, + help='Starting data parallel rank ' + 'for secondary nodes.') parallel_group.add_argument('--data-parallel-size-local', '-dpl', type=int, @@ -625,6 +632,9 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: default='mp', help='Backend for data parallel, either ' '"mp" or "ray".') + parallel_group.add_argument( + "--data-parallel-hybrid-lb", + **parallel_kwargs["data_parallel_hybrid_lb"]) parallel_group.add_argument( "--enable-expert-parallel", **parallel_kwargs["enable_expert_parallel"]) @@ -972,6 +982,7 @@ def create_speculative_config( def create_engine_config( self, usage_context: Optional[UsageContext] = None, + headless: bool = False, ) -> VllmConfig: """ Create the VllmConfig. @@ -1060,15 +1071,41 @@ def create_engine_config( # but we should not do this here. placement_group = ray.util.get_current_placement_group() + assert not headless or not self.data_parallel_hybrid_lb, ( + "data_parallel_hybrid_lb is not applicable in " + "headless mode") + data_parallel_external_lb = self.data_parallel_rank is not None + # Local DP rank = 1, use pure-external LB. if data_parallel_external_lb: assert self.data_parallel_size_local in (1, None), ( "data_parallel_size_local must be 1 when data_parallel_rank " "is set") data_parallel_size_local = 1 + # Use full external lb if we have local_size of 1. + self.data_parallel_hybrid_lb = False elif self.data_parallel_size_local is not None: data_parallel_size_local = self.data_parallel_size_local + + if self.data_parallel_start_rank and not headless: + # Infer hybrid LB mode. + self.data_parallel_hybrid_lb = True + + if self.data_parallel_hybrid_lb and data_parallel_size_local == 1: + # Use full external lb if we have local_size of 1. + data_parallel_external_lb = True + self.data_parallel_hybrid_lb = False + + if data_parallel_size_local == self.data_parallel_size: + # Disable hybrid LB mode if set for a single node + self.data_parallel_hybrid_lb = False + + self.data_parallel_rank = self.data_parallel_start_rank or 0 else: + assert not self.data_parallel_hybrid_lb, ( + "data_parallel_size_local must be set to use " + "data_parallel_hybrid_lb.") + # Local DP size defaults to global DP size if not set. 
             data_parallel_size_local = self.data_parallel_size
@@ -1125,6 +1162,7 @@ def create_engine_config(
             data_parallel_master_ip=data_parallel_address,
             data_parallel_rpc_port=data_parallel_rpc_port,
             data_parallel_backend=self.data_parallel_backend,
+            data_parallel_hybrid_lb=self.data_parallel_hybrid_lb,
             enable_expert_parallel=self.enable_expert_parallel,
             enable_eplb=self.enable_eplb,
             num_redundant_experts=self.num_redundant_experts,
diff --git a/vllm/entrypoints/cli/serve.py b/vllm/entrypoints/cli/serve.py
index 1204ccc1c67..72460c2d91c 100644
--- a/vllm/entrypoints/cli/serve.py
+++ b/vllm/entrypoints/cli/serve.py
@@ -45,11 +45,6 @@ def cmd(args: argparse.Namespace) -> None:
         if args.headless or args.api_server_count < 1:
             run_headless(args)
         else:
-            if args.data_parallel_start_rank:
-                raise ValueError(
-                    "data_parallel_start_rank is only applicable "
-                    "in headless mode. "
-                    "Add --headless flag to enable headless mode.")
             if args.api_server_count > 1:
                 run_multi_api_server(args)
             else:
@@ -86,13 +81,14 @@ def run_headless(args: argparse.Namespace):
 
     # Create the EngineConfig.
     engine_args = vllm.AsyncEngineArgs.from_cli_args(args)
     usage_context = UsageContext.OPENAI_API_SERVER
-    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
+    vllm_config = engine_args.create_engine_config(usage_context=usage_context,
+                                                   headless=True)
 
     if not envs.VLLM_USE_V1:
         raise ValueError("Headless mode is only supported for V1")
 
-    if engine_args.data_parallel_rank is not None:
-        raise ValueError("data_parallel_rank is not applicable in "
+    if engine_args.data_parallel_hybrid_lb:
+        raise ValueError("data_parallel_hybrid_lb is not applicable in "
                          "headless mode")
 
     parallel_config = vllm_config.parallel_config
@@ -122,7 +118,7 @@ def signal_handler(signum, frame):
     engine_manager = CoreEngineProcManager(
         target_fn=EngineCoreProc.run_engine_core,
         local_engine_count=local_engine_count,
-        start_index=args.data_parallel_start_rank,
+        start_index=vllm_config.parallel_config.data_parallel_rank,
         local_start_index=0,
         vllm_config=vllm_config,
         local_client=False,
@@ -169,6 +165,11 @@ def run_multi_api_server(args: argparse.Namespace):
                 " api_server_count > 1")
             model_config.disable_mm_preprocessor_cache = True
 
+    if vllm_config.parallel_config.data_parallel_hybrid_lb:
+        raise NotImplementedError(
+            "Hybrid load balancing with --api-server-count > 1 "
+            "is not yet supported.")
+
     executor_class = Executor.get_class(vllm_config)
     log_stats = not engine_args.disable_log_stats
 
diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py
index b1814866664..3025a626368 100644
--- a/vllm/entrypoints/openai/cli_args.py
+++ b/vllm/entrypoints/openai/cli_args.py
@@ -222,13 +222,6 @@ def make_arg_parser(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
         default=False,
         help="Run in headless mode. See multi-node data parallel "
         "documentation for more details.")
-    parser.add_argument(
-        "--data-parallel-start-rank",
-        "-dpr",
-        type=int,
-        default=0,
-        help="Starting data parallel rank for secondary nodes. 
" - "Requires --headless.") parser.add_argument("--api-server-count", "-asc", type=int, diff --git a/vllm/v1/engine/async_llm.py b/vllm/v1/engine/async_llm.py index 66e76777d75..02cb80197fa 100644 --- a/vllm/v1/engine/async_llm.py +++ b/vllm/v1/engine/async_llm.py @@ -127,7 +127,7 @@ def __init__( if self.log_stats: self.logger_manager = StatLoggerManager( vllm_config=vllm_config, - engine_idxs=self.engine_core.engine_ranks, + engine_idxs=self.engine_core.engine_ranks_managed, custom_stat_loggers=stat_loggers, ) self.logger_manager.log_engine_initialized() diff --git a/vllm/v1/engine/coordinator.py b/vllm/v1/engine/coordinator.py index 005e71647aa..c0decd6ffa2 100644 --- a/vllm/v1/engine/coordinator.py +++ b/vllm/v1/engine/coordinator.py @@ -61,11 +61,12 @@ def __init__(self, parallel_config: ParallelConfig): host = parallel_config.data_parallel_master_ip external_lb = parallel_config.data_parallel_external_lb + hybrid_lb = parallel_config.data_parallel_hybrid_lb # Assume coordinator is colocated with front-end procs when not in - # external DP LB mode. + # either external or hybrid DP LB mode. front_publish_address = get_engine_client_zmq_addr( - local_only=not external_lb, host=host) + local_only=not external_lb and not hybrid_lb, host=host) local_only_eng = dp_size == parallel_config.data_parallel_size_local back_publish_address = get_engine_client_zmq_addr(local_only_eng, host) diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index ca636bf5a6f..4a971e0b312 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -467,13 +467,14 @@ def _perform_handshakes( For DP>1 with internal loadbalancing this is with the shared front-end process which may reside on a different node. - For DP>1 with external loadbalancing, two handshakes are performed: + For DP>1 with external or hybrid loadbalancing, two handshakes are + performed: - With the rank 0 front-end process which retrieves the DP Coordinator ZMQ addresses and DP process group address. - With the colocated front-end process which retrieves the client input/output socket addresses. - with the exception of the rank 0 engine itself which doesn't require - the second handshake. + with the exception of the rank 0 and colocated engines themselves which + don't require the second handshake. Here, "front-end" process can mean the process containing the engine core client (which is the API server process in the case the API @@ -482,15 +483,18 @@ def _perform_handshakes( """ input_ctx = zmq.Context() is_local = local_client and client_handshake_address is None + headless = not local_client handshake = self._perform_handshake(input_ctx, handshake_address, - identity, is_local, vllm_config, + identity, is_local, headless, + vllm_config, vllm_config.parallel_config) if client_handshake_address is None: with handshake as addresses: yield addresses else: + assert local_client local_handshake = self._perform_handshake( - input_ctx, client_handshake_address, identity, local_client, + input_ctx, client_handshake_address, identity, True, False, vllm_config) with handshake as addresses, local_handshake as client_addresses: addresses.inputs = client_addresses.inputs @@ -507,6 +511,7 @@ def _perform_handshake( handshake_address: str, identity: bytes, local_client: bool, + headless: bool, vllm_config: VllmConfig, parallel_config_to_update: Optional[ParallelConfig] = None, ) -> Generator[EngineZmqAddresses, None, None]: @@ -518,6 +523,7 @@ def _perform_handshake( bind=False) as handshake_socket: # Register engine with front-end. 
addresses = self.startup_handshake(handshake_socket, local_client, + headless, parallel_config_to_update) yield addresses @@ -531,6 +537,7 @@ def _perform_handshake( msgspec.msgpack.encode({ "status": "READY", "local": local_client, + "headless": headless, "num_gpu_blocks": num_gpu_blocks, "dp_stats_address": dp_stats_address, })) @@ -539,6 +546,7 @@ def _perform_handshake( def startup_handshake( handshake_socket: zmq.Socket, local_client: bool, + headless: bool, parallel_config: Optional[ParallelConfig] = None, ) -> EngineZmqAddresses: @@ -547,6 +555,7 @@ def startup_handshake( msgspec.msgpack.encode({ "status": "HELLO", "local": local_client, + "headless": headless, })) # Receive initialization message. diff --git a/vllm/v1/engine/core_client.py b/vllm/v1/engine/core_client.py index 2ebb76a97eb..69ae3690d00 100644 --- a/vllm/v1/engine/core_client.py +++ b/vllm/v1/engine/core_client.py @@ -429,18 +429,23 @@ def __init__( parallel_config = vllm_config.parallel_config dp_size = parallel_config.data_parallel_size dp_rank = parallel_config.data_parallel_rank - external_dp_lb = parallel_config.data_parallel_external_lb - + dp_local_size = parallel_config.data_parallel_size_local offline_mode = parallel_config.data_parallel_rank_local is not None - self.engine_ranks = ([dp_rank] if - (offline_mode or external_dp_lb) else list( - range(dp_size))) + # Client manages local+remote EngineCores in pure internal LB case. + # Client manages local EngineCores in hybrid and external LB case. + local_engines_only = (parallel_config.data_parallel_hybrid_lb + or parallel_config.data_parallel_external_lb) + + num_ranks = dp_local_size if local_engines_only else dp_size + self.engine_ranks_managed = [dp_rank] if offline_mode else list( + range(dp_rank, dp_rank + num_ranks)) assert parallel_config.data_parallel_size_local <= len( - self.engine_ranks) + self.engine_ranks_managed) # ZMQ identity of each engine that this client will talk to. self.core_engines: list[EngineIdentity] = [ - index.to_bytes(2, "little") for index in self.engine_ranks + rank.to_bytes(2, "little") + for rank in self.engine_ranks_managed ] # Wait for ready messages from each engine on the input socket. @@ -895,6 +900,12 @@ def _ensure_stats_update_task(self): return assert self.stats_update_address is not None + assert len(self.engine_ranks_managed) > 0 + # NOTE: running and waiting counts are all global from + # the Coordinator include all global EngineCores. This + # slice includes just the cores managed by this client. 
+ count_slice = slice(self.engine_ranks_managed[0], + self.engine_ranks_managed[-1] + 1) async def run_engine_stats_update_task(): with make_zmq_socket(self.ctx, self.stats_update_address, @@ -959,7 +970,7 @@ async def run_engine_stats_update_task(): counts, wave, running = msgspec.msgpack.decode(buf) self.current_wave = wave self.engines_running = running - self.lb_engines = counts + self.lb_engines = counts[count_slice] resources.stats_update_task = asyncio.create_task( run_engine_stats_update_task()) diff --git a/vllm/v1/engine/utils.py b/vllm/v1/engine/utils.py index 6dde477576b..092b5b90bb5 100644 --- a/vllm/v1/engine/utils.py +++ b/vllm/v1/engine/utils.py @@ -544,7 +544,8 @@ def launch_core_engines( local_start_index = parallel_config.data_parallel_rank_local dp_rank = parallel_config.data_parallel_rank host = parallel_config.data_parallel_master_ip - external_dp_lb = parallel_config.data_parallel_external_lb + local_engines_only = (parallel_config.data_parallel_hybrid_lb + or parallel_config.data_parallel_external_lb) # In offline mode there is an LLM instance per DP rank and # one core engine per LLM, see @@ -553,8 +554,8 @@ def launch_core_engines( # client_local_only = True for cases where this front-end # sends requests only to colocated engines. - client_local_only = offline_mode or external_dp_lb or (local_engine_count - == dp_size) + client_local_only = (offline_mode or local_engines_only + or (local_engine_count == dp_size)) # Set up input and output addresses. addresses = EngineZmqAddresses( @@ -598,14 +599,27 @@ def launch_core_engines( yield engine_actor_manager, coordinator, addresses return - if offline_mode or (external_dp_lb and dp_rank > 0): + if offline_mode: assert local_engine_count == 1 engines_to_handshake = [CoreEngine(index=dp_rank, local=True)] - else: + elif dp_rank == 0: + # Rank 0 holds Coordinator, so it handshakes with all Cores + # in both external dplb and internal dplb mode. + # Note this also covers the case where we have zero local engines + # and rank 0 is headless. engines_to_handshake = [ CoreEngine(index=i, local=(i < local_engine_count)) for i in range(dp_size) ] + else: + # Rank > 0 handshakes with just the local cores it is managing. + assert local_engines_only, ( + "Attempting to launch core_engines from dp_rank > 0, but " + "found internal DPLB, which is incompatible.") + engines_to_handshake = [ + CoreEngine(index=i, local=True) + for i in range(dp_rank, dp_rank + local_engine_count) + ] # Whether the started engines will handshake only with co-located # front-end processes. In external_dp_lb mode, ranks > 0 handshake with @@ -616,7 +630,7 @@ def launch_core_engines( handshake_address = get_engine_client_zmq_addr( handshake_local_only, host, parallel_config.data_parallel_rpc_port) - if external_dp_lb and dp_rank > 0: + if local_engines_only and dp_rank > 0: assert not handshake_local_only local_handshake_address = get_open_zmq_ipc_path() client_handshake_address = local_handshake_address @@ -631,8 +645,6 @@ def launch_core_engines( # Start local engines. if local_engine_count: - # In server mode, start_index and local_start_index will - # both be 0. 
local_engine_manager = CoreEngineProcManager( EngineCoreProc.run_engine_core, vllm_config=vllm_config, @@ -678,6 +690,9 @@ def wait_for_engine_startup( poller = zmq.Poller() poller.register(handshake_socket, zmq.POLLIN) + remote_should_be_headless = not parallel_config.data_parallel_hybrid_lb \ + and not parallel_config.data_parallel_external_lb + if proc_manager is not None: for sentinel in proc_manager.sentinels(): poller.register(sentinel, zmq.POLLIN) @@ -713,13 +728,24 @@ def wait_for_engine_startup( raise RuntimeError(f"Message from engine with unexpected data " f"parallel rank: {eng_index}") msg = msgspec.msgpack.decode(ready_msg_bytes) - status, local = msg["status"], msg["local"] + status, local, headless = msg["status"], msg["local"], msg["headless"] if local != engine.local: raise RuntimeError(f"{status} message from " f"{'local' if local else 'remote'} " f"engine {eng_index}, expected it to be " f"{'local' if engine.local else 'remote'}") + # Remote engines must be headless iff we aren't in hybrid dp lb mode. + if not local and headless != remote_should_be_headless: + if headless: + raise RuntimeError(f"Remote engine {eng_index} must not use " + f"--headless in external or hybrid dp lb " + f"mode") + else: + raise RuntimeError(f"Remote engine {eng_index} must use " + f"--headless unless in external or hybrid " + f"dp lb mode") + if status == "HELLO" and engine.state == CoreEngineState.NEW: # Send init message with DP config info. From 72283036b5ed3554898104803aa5db3286323f5b Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Wed, 23 Jul 2025 21:10:30 -0700 Subject: [PATCH 306/552] Dump input metadata on crash for async scheduling (#21258) Signed-off-by: Woosuk Kwon Signed-off-by: x22x22 --- vllm/v1/engine/core.py | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index 4a971e0b312..772f15576fb 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -234,9 +234,14 @@ def abort_requests(self, request_ids: list[str]): self.scheduler.finish_requests(request_ids, RequestStatus.FINISHED_ABORTED) - def execute_model(self, scheduler_output: SchedulerOutput): + def execute_model_with_error_logging( + self, + model_fn: Callable[[SchedulerOutput], ModelRunnerOutput], + scheduler_output: SchedulerOutput, + ) -> ModelRunnerOutput: + """Execute the model and log detailed info on failure.""" try: - return self.model_executor.execute_model(scheduler_output) + return model_fn(scheduler_output) except Exception as err: # We do not want to catch BaseException here since we're only # interested in dumping info when the exception is due to an @@ -259,7 +264,9 @@ def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]: if not self.scheduler.has_requests(): return {}, False scheduler_output = self.scheduler.schedule() - model_output = self.execute_model(scheduler_output) + model_output = self.execute_model_with_error_logging( + self.model_executor.execute_model, # type: ignore + scheduler_output) engine_core_outputs = self.scheduler.update_from_output( scheduler_output, model_output) # type: ignore @@ -306,8 +313,11 @@ def step_with_batch_queue( # so we need more work. if not scheduled_batch and not self.batch_queue.empty(): future, scheduler_output = self.batch_queue.get_nowait() + # Blocking until the first result is available. 
- model_output = future.result() + model_output = self.execute_model_with_error_logging( + lambda _: future.result(), scheduler_output) + self.batch_queue.task_done() engine_core_outputs = (self.scheduler.update_from_output( scheduler_output, model_output)) From 1d44e07fc3830576af0a29f73d06edf2a0f7574d Mon Sep 17 00:00:00 2001 From: Yinghai Lu Date: Wed, 23 Jul 2025 21:44:04 -0700 Subject: [PATCH 307/552] [BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses (#21211) Signed-off-by: Yinghai Lu Signed-off-by: Nick Hill Signed-off-by: Rui Qiao Co-authored-by: Nick Hill Co-authored-by: Rui Qiao Signed-off-by: x22x22 --- vllm/v1/engine/core.py | 51 +++++++++++++++++++++++++---------------- vllm/v1/engine/utils.py | 44 ++++++++++++++++++++++++++++++----- 2 files changed, 69 insertions(+), 26 deletions(-) diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index 772f15576fb..7779b559c20 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -910,22 +910,6 @@ def _init_data_parallel(self, vllm_config: VllmConfig): logger.debug("Setting kv_transfer_config.engine_id to %s", vllm_config.kv_transfer_config.engine_id) - from vllm.platforms import current_platform - device_control_env_var = current_platform.device_control_env_var - world_size = vllm_config.parallel_config.world_size - # Set CUDA_VISIBLE_DEVICES or equivalent. - try: - os.environ[device_control_env_var] = ",".join( - str(current_platform.device_id_to_physical_device_id(i)) - for i in range(local_dp_rank * - world_size, (local_dp_rank + 1) * world_size)) - except IndexError as e: - raise Exception( - f"Error setting {device_control_env_var}: " - f"local range: [{local_dp_rank * world_size}, " - f"{(local_dp_rank + 1) * world_size}) " - f"base value: \"{os.getenv(device_control_env_var)}\"") from e - self.dp_rank = dp_rank self.dp_group = vllm_config.parallel_config.stateless_init_dp_group() @@ -1088,14 +1072,41 @@ def __init__( vllm_config.parallel_config.data_parallel_rank_local = \ local_dp_rank - # Ray sets CUDA_VISIBLE_DEVICES to empty string, - # we clean this up to be able to properly initialize - # data parallel groups. - del os.environ['CUDA_VISIBLE_DEVICES'] + # Set CUDA_VISIBLE_DEVICES as early as possible in actor life cycle + # NOTE: in MP we set CUDA_VISIBLE_DEVICES at process creation time, + # and this cannot be done in the same way for Ray because: + # 1) Ray manages life cycle of all ray workers (including + # DPEngineCoreActor) + # 2) Ray sets CUDA_VISIBLE_DEVICES based on num_gpus configuration + # To bypass 2, we need to also set + # RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, but vLLM workers created + # thereafter would have CUDA_VISIBLE_DEVICES set, which is sticky: + # https://github.com/ray-project/ray/blob/e752fc319ddedd9779a0989b6d3613909bad75c9/python/ray/_private/worker.py#L456 # noqa: E501 + # But vLLM worker assumes visibility into all local GPUs, therefore + # this results in incorrect indexing into the GPU ID list. + self._set_cuda_visible_devices(vllm_config, local_dp_rank) super().__init__(vllm_config, local_client, "", executor_class, log_stats) + def _set_cuda_visible_devices(self, vllm_config: VllmConfig, + local_dp_rank: int): + from vllm.platforms import current_platform + device_control_env_var = current_platform.device_control_env_var + world_size = vllm_config.parallel_config.world_size + # Set CUDA_VISIBLE_DEVICES or equivalent. 
+ try: + os.environ[device_control_env_var] = ",".join( + str(current_platform.device_id_to_physical_device_id(i)) + for i in range(local_dp_rank * + world_size, (local_dp_rank + 1) * world_size)) + except IndexError as e: + raise Exception( + f"Error setting {device_control_env_var}: " + f"local range: [{local_dp_rank * world_size}, " + f"{(local_dp_rank + 1) * world_size}) " + f"base value: \"{os.getenv(device_control_env_var)}\"") from e + def _decorate_logs(self): pass diff --git a/vllm/v1/engine/utils.py b/vllm/v1/engine/utils.py index 092b5b90bb5..f39aa405932 100644 --- a/vllm/v1/engine/utils.py +++ b/vllm/v1/engine/utils.py @@ -10,12 +10,14 @@ from multiprocessing import Process, connection from multiprocessing.process import BaseProcess from typing import TYPE_CHECKING, Callable, Optional, Union +from unittest.mock import patch import msgspec import zmq from vllm.config import CacheConfig, ParallelConfig, VllmConfig from vllm.logger import init_logger +from vllm.platforms import current_platform from vllm.ray.ray_env import get_env_vars_to_copy from vllm.utils import get_mp_context, get_open_zmq_ipc_path, zmq_socket_ctx from vllm.v1.engine.coordinator import DPCoordinator @@ -105,10 +107,13 @@ def __init__( "client_handshake_address"] = client_handshake_address self.processes: list[BaseProcess] = [] + local_dp_ranks = [] for index in range(local_engine_count): local_index = local_start_index + index global_index = start_index + index + # Start EngineCore in background process. + local_dp_ranks.append(local_index) self.processes.append( context.Process(target=target_fn, name=f"EngineCore_{global_index}", @@ -118,9 +123,14 @@ def __init__( })) self._finalizer = weakref.finalize(self, shutdown, self.processes) + + data_parallel = vllm_config.parallel_config.data_parallel_size > 1 try: - for proc in self.processes: - proc.start() + for proc, local_dp_rank in zip(self.processes, local_dp_ranks): + with set_device_control_env_var( + vllm_config, local_dp_rank) if ( + data_parallel) else contextlib.nullcontext(): + proc.start() finally: # Kill other procs if not all are running. if self.finished_procs(): @@ -145,6 +155,30 @@ def finished_procs(self) -> dict[str, int]: } +@contextlib.contextmanager +def set_device_control_env_var(vllm_config: VllmConfig, + local_dp_rank: int) -> Iterator[None]: + """ + Temporarily set CUDA_VISIBLE_DEVICES or equivalent + for engine subprocess. 
+ """ + world_size = vllm_config.parallel_config.world_size + evar = current_platform.device_control_env_var + try: + value = ",".join( + str(current_platform.device_id_to_physical_device_id(i)) + for i in range(local_dp_rank * world_size, (local_dp_rank + 1) * + world_size)) + except IndexError as e: + raise Exception(f"Error setting {evar}: " + f"local range: [{local_dp_rank * world_size}, " + f"{(local_dp_rank + 1) * world_size}) " + "base value: " + f"\"{os.getenv(evar)}\"") from e + with patch.dict(os.environ, values=((evar, value), )): + yield + + class CoreEngineActorManager: """ Utility class to handle creation, readiness, and shutdown @@ -215,10 +249,9 @@ def __init__( self.placement_group_is_local = [] refs = [] - for index in range(dp_size): - local_index = local_dp_ranks[index] + for index, local_index, pg in zip(range(dp_size), local_dp_ranks, + placement_groups): dp_vllm_config = copy.deepcopy(vllm_config) - pg = placement_groups[index] dp_vllm_config.parallel_config.placement_group = pg local_client = index < local_engine_count actor = ray.remote(DPEngineCoreActor).options( @@ -264,7 +297,6 @@ def create_dp_placement_groups( local_engine_count = \ vllm_config.parallel_config.data_parallel_size_local - nodes = list_nodes() nodes = sorted(list_nodes(), key=lambda node: node.node_ip != dp_master_ip) assert nodes[0].node_ip == dp_master_ip, ( From 972af6bf162758262f1156ff1827d8d9da028050 Mon Sep 17 00:00:00 2001 From: Julien Denize <40604584+juliendenize@users.noreply.github.com> Date: Thu, 24 Jul 2025 06:51:32 +0200 Subject: [PATCH 308/552] Add think chunk (#21333) Signed-off-by: Julien Denize Signed-off-by: x22x22 --- requirements/common.txt | 2 +- requirements/nightly_torch_test.txt | 2 +- requirements/test.in | 2 +- requirements/test.txt | 7 +- tests/entrypoints/test_chat_utils.py | 167 +++++++++ .../test_mistral_reasoning_parser.py | 341 ++++++++++++++++++ tests/reasoning/utils.py | 59 +++ vllm/entrypoints/chat_utils.py | 29 +- vllm/reasoning/__init__.py | 2 + vllm/reasoning/mistral_reasoning_parser.py | 47 +++ vllm/transformers_utils/tokenizers/mistral.py | 37 +- 11 files changed, 682 insertions(+), 13 deletions(-) create mode 100644 tests/reasoning/test_mistral_reasoning_parser.py create mode 100644 vllm/reasoning/mistral_reasoning_parser.py diff --git a/requirements/common.txt b/requirements/common.txt index 1876a7e9af0..96ab646bb50 100644 --- a/requirements/common.txt +++ b/requirements/common.txt @@ -33,7 +33,7 @@ pyzmq >= 25.0.0 msgspec gguf >= 0.13.0 importlib_metadata; python_version < '3.10' -mistral_common[opencv] >= 1.8.0 +mistral_common[image,audio] >= 1.8.2 opencv-python-headless >= 4.11.0 # required for video IO pyyaml six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that needs to be the latest version for python 3.12 diff --git a/requirements/nightly_torch_test.txt b/requirements/nightly_torch_test.txt index 9c378dcf68f..0a72ddefda7 100644 --- a/requirements/nightly_torch_test.txt +++ b/requirements/nightly_torch_test.txt @@ -23,7 +23,7 @@ jiwer # required for audio tests timm # required for internvl test transformers_stream_generator # required for qwen-vl test matplotlib # required for qwen-vl test -mistral_common[opencv] >= 1.8.0 # required for voxtral test +mistral_common[image,audio] >= 1.8.2 # required for voxtral test num2words # required for smolvlm test opencv-python-headless >= 4.11.0 # required for video test datamodel_code_generator # required for minicpm3 test diff --git a/requirements/test.in b/requirements/test.in index 
9f66e2d6919..429d1a50422 100644 --- a/requirements/test.in +++ b/requirements/test.in @@ -28,7 +28,7 @@ torchvision==0.22.1 transformers_stream_generator # required for qwen-vl test mamba_ssm # required for plamo2 test matplotlib # required for qwen-vl test -mistral_common[opencv] >= 1.8.0 # required for voxtral test +mistral_common[image,audio] >= 1.8.2 # required for voxtral test num2words # required for smolvlm test open_clip_torch==2.32.0 # Required for nemotron_vl test opencv-python-headless >= 4.11.0 # required for video test diff --git a/requirements/test.txt b/requirements/test.txt index a2b230102d4..8e5af8d74ba 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -447,7 +447,7 @@ mbstrdecoder==1.1.3 # typepy mdurl==0.1.2 # via markdown-it-py -mistral-common==1.8.0 +mistral-common==1.8.2 # via -r requirements/test.in mlflow==2.22.0 # via terratorch @@ -999,8 +999,11 @@ soundfile==0.12.1 # via # -r requirements/test.in # librosa + # mistral-common soxr==0.5.0.post1 - # via librosa + # via + # librosa + # mistral-common sqlalchemy==2.0.41 # via # alembic diff --git a/tests/entrypoints/test_chat_utils.py b/tests/entrypoints/test_chat_utils.py index e321ca70001..ed57fe39df6 100644 --- a/tests/entrypoints/test_chat_utils.py +++ b/tests/entrypoints/test_chat_utils.py @@ -6,6 +6,10 @@ from typing import Literal, Optional import pytest +from mistral_common.tokens.tokenizers.base import (SpecialTokenPolicy, + SpecialTokens) +from mistral_common.tokens.tokenizers.tekken import (SpecialTokenInfo, + Tekkenizer) from vllm.assets.audio import AudioAsset from vllm.assets.image import ImageAsset @@ -21,6 +25,7 @@ from vllm.multimodal.utils import (encode_audio_base64, encode_image_base64, encode_video_base64) from vllm.transformers_utils.tokenizer_group import TokenizerGroup +from vllm.transformers_utils.tokenizers.mistral import MistralTokenizer from ..models.registry import HF_EXAMPLE_MODELS from ..utils import VLLM_PATH @@ -1374,3 +1379,165 @@ def test_resolve_content_format_examples(template_path, expected_format): ) assert resolved_format == expected_format + + +def test_parse_chat_messages_include_thinking_chunk(mistral_model_config, + mistral_tokenizer): + messages = [{ + "role": + "system", + "content": [{ + "type": "text", + "text": "You are a helpful assistant." + }, { + "type": + "thinking", + "closed": + True, + "thinking": + "Only return the answer when you are confident." + }] + }, { + "role": "user", + "content": "What is 2+2?" + }, { + "role": + "assistant", + "content": [{ + "type": "text", + "text": "Let me think about it." + }, { + "type": "thinking", + "closed": True, + "thinking": "2+2 = 4" + }, { + "type": "text", + "text": "The answer is 4.", + }], + }] + + conversation_with_thinking, _ = parse_chat_messages( + messages, + mistral_model_config, + mistral_tokenizer, + content_format="openai", + ) + + expected_conversation = [{ + "role": + "system", + "content": [{ + "type": "text", + "text": "You are a helpful assistant." + }, { + "type": "text", + "text": "Only return the answer when you are confident." + }], + }, { + "role": + "user", + "content": [{ + "type": "text", + "text": "What is 2+2?" + }], + }, { + "role": + "assistant", + "content": [ + { + "type": "text", + "text": "Let me think about it." + }, + { + "type": "text", + "text": "2+2 = 4" + }, + { + "type": "text", + "text": "The answer is 4." 
+ }, + ] + }] + + assert conversation_with_thinking == expected_conversation + + +def test_apply_mistral_chat_template_thinking_chunk(): + # Moved import here to avoid yapf and isort conflicts + from vllm.entrypoints.chat_utils import apply_mistral_chat_template + messages = [{ + "role": + "system", + "content": [{ + "type": "text", + "text": "You are a helpful assistant." + }, { + "type": + "thinking", + "closed": + True, + "thinking": + "Only return the answer when you are confident." + }] + }, { + "role": "user", + "content": "What is 2+2?" + }, { + "role": + "assistant", + "content": [{ + "type": "text", + "text": "Let me think about it." + }, { + "type": "thinking", + "closed": True, + "thinking": "2+2 = 4" + }, { + "type": "text", + "text": "The answer is 4.", + }], + }, { + "role": "user", + "content": "Thanks, what is 3+3?" + }] + + # TODO(Julien): upon model release change to a tokenizer already configured. + # ================================================================= + mistral_tokenizer = MistralTokenizer.from_pretrained( + "mistralai/Devstral-Small-2507") + assert isinstance(mistral_tokenizer.tokenizer, Tekkenizer) + # Add think special tokens to the tokenizer + mistral_tokenizer.tokenizer._all_special_tokens[35] = SpecialTokenInfo( + rank=35, is_control=True, token_str=SpecialTokens.begin_think.value) + mistral_tokenizer.tokenizer._all_special_tokens[36] = SpecialTokenInfo( + rank=36, is_control=True, token_str=SpecialTokens.end_think.value) + mistral_tokenizer.tokenizer._special_tokens_reverse_vocab = { + k: v + for k, v in + mistral_tokenizer.tokenizer._special_tokens_reverse_vocab.items() + if v not in {35, 36} + } + mistral_tokenizer.tokenizer._special_tokens_reverse_vocab[ + SpecialTokens.begin_think.value] = 35 + mistral_tokenizer.tokenizer._special_tokens_reverse_vocab[ + SpecialTokens.end_think.value] = 36 + mistral_tokenizer.instruct.BEGIN_THINK = 35 + mistral_tokenizer.instruct.END_THINK = 36 + # ================================================================= + + tokens_ids = apply_mistral_chat_template(mistral_tokenizer, + messages, + chat_template=None, + tools=None) + + string_tokens = mistral_tokenizer.mistral.decode( + tokens_ids, special_token_policy=SpecialTokenPolicy.KEEP) + + expected_tokens = ( + r"[SYSTEM_PROMPT]You are a helpful assistant.[THINK]Only return the" + r" answer when you are confident.[/THINK][/SYSTEM_PROMPT]" + r"[INST]What is 2+2?[/INST]" + r"Let me think about it.[THINK]2+2 = 4[/THINK]The answer is 4." + r"[INST]Thanks, what is 3+3?[/INST]") + + assert string_tokens == expected_tokens diff --git a/tests/reasoning/test_mistral_reasoning_parser.py b/tests/reasoning/test_mistral_reasoning_parser.py new file mode 100644 index 00000000000..91a22f6f5d7 --- /dev/null +++ b/tests/reasoning/test_mistral_reasoning_parser.py @@ -0,0 +1,341 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import pytest +from mistral_common.tokens.tokenizers.base import SpecialTokens +from mistral_common.tokens.tokenizers.tekken import (SpecialTokenInfo, + Tekkenizer) + +from tests.reasoning.utils import run_reasoning_extraction_mistral +from vllm.reasoning import ReasoningParser, ReasoningParserManager +from vllm.transformers_utils.tokenizers.mistral import MistralTokenizer + +parser_name = "mistral" + + +@pytest.fixture(scope="module") +def mistral_tokenizer(): + # TODO(Julien): upon model release change to a tokenizer already configured. 
+ # ================================================================= + mistral_tokenizer = MistralTokenizer.from_pretrained( + "mistralai/Devstral-Small-2507") + assert isinstance(mistral_tokenizer.tokenizer, Tekkenizer) + # Add think special tokens to the tokenizer + mistral_tokenizer.tokenizer._all_special_tokens[35] = SpecialTokenInfo( + rank=35, is_control=True, token_str=SpecialTokens.begin_think.value) + mistral_tokenizer.tokenizer._all_special_tokens[36] = SpecialTokenInfo( + rank=36, is_control=True, token_str=SpecialTokens.end_think.value) + mistral_tokenizer.tokenizer._special_tokens_reverse_vocab = { + k: v + for k, v in + mistral_tokenizer.tokenizer._special_tokens_reverse_vocab.items() + if v not in {35, 36} + } + mistral_tokenizer.tokenizer._special_tokens_reverse_vocab[ + SpecialTokens.begin_think.value] = 35 + mistral_tokenizer.tokenizer._special_tokens_reverse_vocab[ + SpecialTokens.end_think.value] = 36 + mistral_tokenizer.instruct.BEGIN_THINK = 35 + mistral_tokenizer.instruct.END_THINK = 36 + # ================================================================= + return mistral_tokenizer + + +SIMPLE_REASONING = { + "output": "This is a reasoning section[/THINK]This is the rest", + "reasoning_content": "This is a reasoning section", + "content": "This is the rest", + "is_reasoning_end": True, +} +COMPLETE_REASONING = { + "output": "This is a reasoning section[/THINK]", + "reasoning_content": "This is a reasoning section", + "content": None, + "is_reasoning_end": True, +} +NO_CONTENT = { + "output": "This is content", + "reasoning_content": "This is content", + "content": None, + "is_reasoning_end": False, +} +NO_REASONING_STREAMING = { + "output": "This is a reasoning section", + "reasoning_content": "This is a reasoning section", + "content": None, + "is_reasoning_end": False, +} +MULTIPLE_LINES = { + "output": "This\nThat[/THINK]This is the rest\nThat", + "reasoning_content": "This\nThat", + "content": "This is the rest\nThat", + "is_reasoning_end": True, +} +SHORTEST_REASONING_NO_STREAMING = { + "output": "[/THINK]This is the rest", + "reasoning_content": "", + "content": "This is the rest", + "is_reasoning_end": True, +} +SHORTEST_REASONING = { + "output": "[/THINK]This is the rest", + "reasoning_content": None, + "content": "This is the rest", + "is_reasoning_end": True, +} +REASONING_WITH_THINK = { + "output": "[THINK]This is a reasoning section[/THINK]This is the rest", + "reasoning_content": "This is a reasoning section", + "content": "This is the rest", + "is_reasoning_end": True, +} +COMPLETE_REASONING_WITH_THINK = { + "output": "[THINK]This is a reasoning section[/THINK]", + "reasoning_content": "This is a reasoning section", + "content": None, + "is_reasoning_end": True, +} +MULTIPLE_LINES_WITH_THINK = { + "output": "[THINK]This\nThat[/THINK]This is the rest\nThat", + "reasoning_content": "This\nThat", + "content": "This is the rest\nThat", + "is_reasoning_end": True, +} +SHORTEST_REASONING_NO_STREAMING_WITH_THINK = { + "output": "[/THINK]This is the rest", + "reasoning_content": "", + "content": "This is the rest", + "is_reasoning_end": True, +} +SHORTEST_REASONING_WITH_THINK = { + "output": "[/THINK]This is the rest", + "reasoning_content": None, + "content": "This is the rest", + "is_reasoning_end": True, +} +THINK_NO_END = { + "output": "[THINK]This is a reasoning section", + "reasoning_content": "This is a reasoning section", + "content": None, + "is_reasoning_end": False, +} +EMPTY = { + "output": "", + "reasoning_content": "", + "content": None, + 
"is_reasoning_end": False, +} +EMPTY_STREAMING = { + "output": "", + "reasoning_content": None, + "content": None, + "is_reasoning_end": False, +} +NEW_LINE = { + "output": "\n[THINK]This is a reasoning section[/THINK]\nThis is the rest", + "reasoning_content": "This is a reasoning section", + "content": "\nThis is the rest", + "is_reasoning_end": True, +} +# Streaming cannot handle new lines at the beginning of the output +# because we need to support [THINK]...[/THINK] and [/THINK]... +# We cannot know if the text before [THINK] is reasoning content +# or not. +NEW_LINE_STREAMING = { + "output": "\n[THINK]This is a reasoning section[/THINK]\nThis is the rest", + "reasoning_content": "\nThis is a reasoning section", + "content": "\nThis is the rest", + "is_reasoning_end": True, +} + +TEST_CASES = [ + pytest.param( + False, + SIMPLE_REASONING, + id="simple_reasoning", + ), + pytest.param( + True, + SIMPLE_REASONING, + id="simple_reasoning_streaming", + ), + pytest.param( + False, + COMPLETE_REASONING, + id="complete_reasoning", + ), + pytest.param( + True, + COMPLETE_REASONING, + id="complete_reasoning_streaming", + ), + pytest.param( + False, + NO_CONTENT, + id="no_content_token", + ), + pytest.param( + True, + NO_REASONING_STREAMING, + id="no_reasoning_token_streaming", + ), + pytest.param( + False, + MULTIPLE_LINES, + id="multiple_lines", + ), + pytest.param( + True, + MULTIPLE_LINES, + id="multiple_lines_streaming", + ), + pytest.param( + True, + SHORTEST_REASONING, + id="shortest", + ), + pytest.param( + False, + SHORTEST_REASONING_NO_STREAMING, + id="shortest_streaming", + ), + pytest.param( + False, + REASONING_WITH_THINK, + id="reasoning_with_think", + ), + pytest.param( + True, + REASONING_WITH_THINK, + id="reasoning_with_think_streaming", + ), + pytest.param( + False, + COMPLETE_REASONING_WITH_THINK, + id="complete_reasoning_with_think", + ), + pytest.param( + True, + COMPLETE_REASONING_WITH_THINK, + id="complete_reasoning_with_think_streaming", + ), + pytest.param( + False, + MULTIPLE_LINES_WITH_THINK, + id="multiple_lines_with_think", + ), + pytest.param( + True, + MULTIPLE_LINES_WITH_THINK, + id="multiple_lines_with_think_streaming", + ), + pytest.param( + False, + SHORTEST_REASONING_NO_STREAMING_WITH_THINK, + id="shortest_with_think", + ), + pytest.param( + True, + SHORTEST_REASONING_WITH_THINK, + id="shortest_with_think_streaming", + ), + pytest.param( + False, + THINK_NO_END, + id="think_no_end", + ), + pytest.param( + True, + THINK_NO_END, + id="think_no_end_streaming", + ), + pytest.param( + False, + EMPTY, + id="empty", + ), + pytest.param( + True, + EMPTY_STREAMING, + id="empty_streaming", + ), + pytest.param( + False, + NEW_LINE, + id="new_line", + ), + pytest.param( + True, + NEW_LINE_STREAMING, + id="new_line_streaming", + ), +] + + +@pytest.mark.parametrize("streaming, param_dict", TEST_CASES) +def test_mistral_reasoning( + streaming: bool, + param_dict: dict, + mistral_tokenizer: MistralTokenizer, +): + output = param_dict["output"] + + index_think = output.find("[THINK]") + len_think = len("[THINK]") + index_end_think = output.find("[/THINK]") + len_end_think = len("[/THINK]") + + # encode everything to tokens ids + output_tokens = [] + if index_think != -1: + output_before_think = output[:index_think] + output_tokens += mistral_tokenizer.tokenizer.encode( + output_before_think, False, False) + output_tokens += [mistral_tokenizer.instruct.BEGIN_THINK] + + if index_end_think != -1: + output_middle = output[index_think + len_think:index_end_think] + 
output_after_think = output[index_end_think + len_end_think:] + output_tokens += mistral_tokenizer.tokenizer.encode( + output_middle, False, False) + output_tokens += [mistral_tokenizer.instruct.END_THINK] + output_tokens += mistral_tokenizer.tokenizer.encode( + output_after_think, False, False) + else: + output_middle = output[index_think + len_think:] + output_tokens += mistral_tokenizer.tokenizer.encode( + output_middle, False, False) + elif index_end_think != -1: + output_before_think = output[:index_end_think] + output_after_think = output[index_end_think + len_end_think:] + output_tokens += mistral_tokenizer.tokenizer.encode( + output_before_think, False, False) + output_tokens += [mistral_tokenizer.instruct.END_THINK] + output_tokens += mistral_tokenizer.tokenizer.encode( + output_after_think, False, False) + else: + output_tokens += mistral_tokenizer.tokenizer.encode( + output, False, False) + + parser: ReasoningParser = ReasoningParserManager.get_reasoning_parser( + parser_name)(mistral_tokenizer) + + reasoning, content = run_reasoning_extraction_mistral(parser, + output_tokens, + streaming=streaming) + + assert reasoning == param_dict["reasoning_content"] + assert content == param_dict["content"] + + # Test is_reasoning_end + is_reasoning_end = parser.is_reasoning_end(output_tokens) + assert is_reasoning_end == param_dict["is_reasoning_end"] + + # Test extract_content + if param_dict["content"] is not None: + content = parser.extract_content_ids(output_tokens) + assert content == mistral_tokenizer.tokenizer.encode( + param_dict["content"], bos=False, eos=False) + else: + content = parser.extract_content_ids(output_tokens) + assert content == [] diff --git a/tests/reasoning/utils.py b/tests/reasoning/utils.py index ddcf89796fb..9af5fa5addb 100644 --- a/tests/reasoning/utils.py +++ b/tests/reasoning/utils.py @@ -6,6 +6,7 @@ from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, DeltaMessage) from vllm.reasoning import ReasoningParser +from vllm.transformers_utils.tokenizers.mistral import MistralTokenizer class StreamingReasoningReconstructor: @@ -54,6 +55,32 @@ def run_reasoning_extraction( return reasoning, content +def run_reasoning_extraction_mistral( + reasoning_parser: ReasoningParser, + model_output: list[int], + request: Union[ChatCompletionRequest, None] = None, + streaming: bool = False, +) -> tuple[Optional[str], Optional[str]]: + assert isinstance(reasoning_parser.model_tokenizer, + MistralTokenizer), type(reasoning_parser.model_tokenizer) + if streaming: + reconstructor = run_reasoning_extraction_streaming_mistral( + reasoning_parser, + model_output, + request, + ) + return ( + reconstructor.reasoning_content, + reconstructor.other_content or None, + ) + else: + str_output = reasoning_parser.model_tokenizer.convert_ids_to_tokens( + model_output) + reasoning, content = run_reasoning_extraction_nonstreaming( + reasoning_parser, str_output, request) + return reasoning, content + + def run_reasoning_extraction_nonstreaming( reasoning_parser: ReasoningParser, model_output: list[str], @@ -94,3 +121,35 @@ def run_reasoning_extraction_streaming( previous_text = current_text previous_tokens = current_tokens return reconstructor + + +def run_reasoning_extraction_streaming_mistral( + reasoning_parser: ReasoningParser, + model_deltas: list[int], + request: Union[ChatCompletionRequest, None] = None, +) -> StreamingReasoningReconstructor: + assert isinstance(reasoning_parser.model_tokenizer, + MistralTokenizer), type(reasoning_parser.model_tokenizer) + request = 
request or ChatCompletionRequest(messages=[], model="test-model") + reconstructor = StreamingReasoningReconstructor() + previous_text = "" + previous_tokens: list[int] = [] + for model_delta in model_deltas: + token_delta = [model_delta] + delta = reasoning_parser.model_tokenizer.convert_ids_to_tokens( + [model_delta])[0] + current_text = previous_text + delta + current_tokens = previous_tokens + token_delta + delta_message = reasoning_parser.extract_reasoning_content_streaming( + previous_text, + current_text, + delta, + previous_tokens, + current_tokens, + token_delta, + ) + if delta_message is not None: + reconstructor.append_delta(delta_message) + previous_text = current_text + previous_tokens = current_tokens + return reconstructor diff --git a/vllm/entrypoints/chat_utils.py b/vllm/entrypoints/chat_utils.py index 496caef4256..a6602391d40 100644 --- a/vllm/entrypoints/chat_utils.py +++ b/vllm/entrypoints/chat_utils.py @@ -151,6 +151,27 @@ class CustomChatCompletionContentSimpleVideoParam(TypedDict, total=False): video_url: Required[str] +class CustomThinkCompletionContentParam(TypedDict, total=False): + """A Think Completion Content Param that accepts a plain text and a boolean. + + Example: + { + "thinking": "I am thinking about the answer", + "closed": True, + "type": "thinking" + } + """ + + thinking: Required[str] + """The thinking content.""" + + closed: bool + """Whether the thinking is closed.""" + + type: Required[Literal["thinking"]] + """The thinking type.""" + + ChatCompletionContentPartParam: TypeAlias = Union[ OpenAIChatCompletionContentPartParam, ChatCompletionContentPartAudioParam, ChatCompletionContentPartInputAudioParam, @@ -159,7 +180,8 @@ class CustomChatCompletionContentSimpleVideoParam(TypedDict, total=False): CustomChatCompletionContentSimpleImageParam, ChatCompletionContentPartImageEmbedsParam, CustomChatCompletionContentSimpleAudioParam, - CustomChatCompletionContentSimpleVideoParam, str] + CustomChatCompletionContentSimpleVideoParam, str, + CustomThinkCompletionContentParam] class CustomChatCompletionMessageParam(TypedDict, total=False): @@ -938,6 +960,7 @@ def _get_full_multimodal_text_prompt(placeholder_storage: dict[str, list], _InputAudioParser = partial(cast, ChatCompletionContentPartInputAudioParam) _RefusalParser = partial(cast, ChatCompletionContentPartRefusalParam) _PILImageParser = partial(cast, CustomChatCompletionContentPILImageParam) +_ThinkParser = partial(cast, CustomThinkCompletionContentParam) # Need to validate url objects _ImageParser = TypeAdapter(ChatCompletionContentPartImageParam).validate_python _AudioParser = TypeAdapter(ChatCompletionContentPartAudioParam).validate_python @@ -954,6 +977,8 @@ def _get_full_multimodal_text_prompt(placeholder_storage: dict[str, list], ] = { "text": lambda part: _TextParser(part).get("text", None), + "thinking": + lambda part: _ThinkParser(part).get("thinking", None), "input_text": lambda part: _TextParser(part).get("text", None), "input_image": @@ -1100,7 +1125,7 @@ def _parse_chat_message_content_part( "with empty / unparsable content.", part, part_type) return None - if part_type in ("text", "input_text", "refusal"): + if part_type in ("text", "input_text", "refusal", "thinking"): str_content = cast(str, content) if wrap_dicts: return {'type': 'text', 'text': str_content} diff --git a/vllm/reasoning/__init__.py b/vllm/reasoning/__init__.py index bae593c1dff..d61e4f11dfa 100644 --- a/vllm/reasoning/__init__.py +++ b/vllm/reasoning/__init__.py @@ -6,6 +6,7 @@ from .glm4_moe_reasoning_parser import 
Glm4MoeModelReasoningParser from .granite_reasoning_parser import GraniteReasoningParser from .hunyuan_a13b_reasoning_parser import HunyuanA13BReasoningParser +from .mistral_reasoning_parser import MistralReasoningParser from .qwen3_reasoning_parser import Qwen3ReasoningParser __all__ = [ @@ -16,4 +17,5 @@ "HunyuanA13BReasoningParser", "Qwen3ReasoningParser", "Glm4MoeModelReasoningParser", + "MistralReasoningParser", ] diff --git a/vllm/reasoning/mistral_reasoning_parser.py b/vllm/reasoning/mistral_reasoning_parser.py new file mode 100644 index 00000000000..6c707a4079f --- /dev/null +++ b/vllm/reasoning/mistral_reasoning_parser.py @@ -0,0 +1,47 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +from vllm.logger import init_logger +from vllm.reasoning import ReasoningParser, ReasoningParserManager +from vllm.reasoning.deepseek_r1_reasoning_parser import ( + DeepSeekR1ReasoningParser) +from vllm.transformers_utils.tokenizers.mistral import MistralTokenizer + +logger = init_logger(__name__) + + +@ReasoningParserManager.register_module("mistral") +class MistralReasoningParser(DeepSeekR1ReasoningParser): + """ + Reasoning parser for Mistral models. + + The Mistral models uses [THINK]...[/THINK] tokens to denote reasoning + text. This parser extracts the reasoning content from the model output. + """ + + def __init__(self, tokenizer: MistralTokenizer): + if not isinstance(tokenizer, MistralTokenizer): + raise ValueError( + "The tokenizer must be an instance of MistralTokenizer.") + + ReasoningParser.__init__(self, tokenizer) + + if not self.model_tokenizer: + raise ValueError( + "The model tokenizer must be passed to the ReasoningParser " + "constructor during construction.") + + from mistral_common.tokens.tokenizers.base import SpecialTokens + + self.start_token = SpecialTokens.begin_think + self.end_token = SpecialTokens.end_think + + self.start_token_id = tokenizer.tokenizer.get_control_token( + self.start_token) + self.end_token_id = tokenizer.tokenizer.get_control_token( + self.end_token) + + if self.start_token_id is None or self.end_token_id is None: + raise RuntimeError( + "Mistral reasoning parser could not locate think start/end " + "tokens in the tokenizer!") diff --git a/vllm/transformers_utils/tokenizers/mistral.py b/vllm/transformers_utils/tokenizers/mistral.py index 24ac4580d67..f83405cfc01 100644 --- a/vllm/transformers_utils/tokenizers/mistral.py +++ b/vllm/transformers_utils/tokenizers/mistral.py @@ -145,6 +145,21 @@ def find_tokenizer_file(files: list[str]): return matched_files[0] +def _aggregate_content(content: list) -> list[dict[str, Any]]: + aggregated_content: list[dict[str, Any]] = [] + for chunk in content: + if chunk.get("type" + ) == "text" and aggregated_content and aggregated_content[ + -1].get("type") == "text": + aggregated_content[-1]["text"] += "\n\n" + chunk.get("text") + else: + aggregated_content.append(chunk) + if len(aggregated_content) == 1 and aggregated_content[0].get( + "type") == "text": + content = aggregated_content[0]["text"] + return content + + def make_mistral_chat_completion_request( messages: list["ChatCompletionMessageParam"], tools: Optional[list[dict[str, @@ -162,10 +177,10 @@ def make_mistral_chat_completion_request( # Convert list text content to string if message.get("role") in ("assistant", "tool"): - content = message.get("content") + content: Any = message.get("content") if isinstance(content, list): - content = "\n".join(chunk.get("text") for chunk in content) - 
message["content"] = content + content = _aggregate_content(content) + message["content"] = content # The Mistral client, in comparison to the OpenAI client, requires the # "parameters" dict to be present, even if it's empty. @@ -465,6 +480,8 @@ def convert_ids_to_tokens( skip_special_tokens: bool = True, ) -> list[str]: from mistral_common.tokens.tokenizers.base import SpecialTokens + from mistral_common.tokens.tokenizers.instruct import ( + InstructTokenizerV13) # TODO(Patrick) - potentially allow special tokens to not be skipped assert ( @@ -474,10 +491,18 @@ def convert_ids_to_tokens( assert self.is_tekken or self.is_spm, type(self.tokenizer) if self.is_tekken: - # skip special tokens except tool call - ids = [ - i for i in ids if i > self.tokenizer.num_special_tokens or i == + # skip special tokens except tool call and think tokens + non_skip_special_tokens = { self.tokenizer.get_control_token(SpecialTokens.tool_calls) + } + if isinstance(self.instruct, InstructTokenizerV13): + if self.instruct.BEGIN_THINK: + non_skip_special_tokens.add(self.instruct.BEGIN_THINK) + if self.instruct.END_THINK: + non_skip_special_tokens.add(self.instruct.END_THINK) + ids = [ + i for i in ids if i > self.tokenizer.num_special_tokens + or i in non_skip_special_tokens ] tokens = [self.tokenizer.id_to_piece(id) for id in ids] From eca7bb1ebf439886cd0c1191ff24ffe8b22552e8 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Thu, 24 Jul 2025 08:16:23 +0100 Subject: [PATCH 309/552] Deduplicate Transformers backend code using inheritance (#21461) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/model_executor/models/transformers.py | 199 +++++---------------- 1 file changed, 49 insertions(+), 150 deletions(-) diff --git a/vllm/model_executor/models/transformers.py b/vllm/model_executor/models/transformers.py index 610f8e752db..8cd95605cdf 100644 --- a/vllm/model_executor/models/transformers.py +++ b/vllm/model_executor/models/transformers.py @@ -39,7 +39,6 @@ from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.vocab_parallel_embedding import ( ParallelLMHead, VocabParallelEmbedding) -from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalKwargs from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalFieldConfig, @@ -55,8 +54,8 @@ from .interfaces import (SupportsLoRA, SupportsMultiModal, SupportsPP, SupportsQuant) from .utils import (AutoWeightsLoader, PPMissingLayer, WeightsMapper, - flatten_bn, is_pp_missing_parameter, - make_empty_intermediate_tensors_factory, maybe_prefix) + flatten_bn, make_empty_intermediate_tensors_factory, + maybe_prefix) logger = init_logger(__name__) @@ -414,40 +413,40 @@ def __exit__(self, exc_type, exc_value, traceback): setattr(self.config, key, value) -class TransformersModel: +class TransformersBase(nn.Module, SupportsQuant, SupportsLoRA, SupportsPP): + embedding_padding_modules = ["lm_head"] + embedding_modules = ["embed_tokens" + ] # TODO transformers will have a util to get it def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() logger.info("Using Transformers backend.") - config: PretrainedConfig = vllm_config.model_config.hf_config - cache_config: CacheConfig = vllm_config.cache_config - device_config: DeviceConfig = vllm_config.device_config 
- model_config: ModelConfig = vllm_config.model_config - parallel_config: ParallelConfig = vllm_config.parallel_config - quant_config: QuantizationConfig = vllm_config.quant_config - - self.config = config - self.text_config = config.get_text_config() - self.cache_config = cache_config - self.device_config = device_config - self.model_config = model_config - self.parallel_config = parallel_config - self.quant_config = quant_config + self.config: PretrainedConfig = vllm_config.model_config.hf_config + self.text_config: PretrainedConfig = self.config.get_text_config() + self.cache_config: CacheConfig = vllm_config.cache_config + self.device_config: DeviceConfig = vllm_config.device_config + self.model_config: ModelConfig = vllm_config.model_config + self.parallel_config: ParallelConfig = vllm_config.parallel_config + self.quant_config: QuantizationConfig = vllm_config.quant_config self.pp_group = get_pp_group() self.pp_size = self.pp_group.world_size self.pp_rank = self.pp_group.rank_in_group self.tp_size = get_tensor_model_parallel_world_size() + # To be updated in child classes for use in `load_weights` + self.skip_prefixes: Optional[list[str]] = None + # vLLM handles interleaved sliding window attention by creating a new # interleaved_sliding_window attribute and deleting the sliding_window # attribute. This breaks the constructors in Transformers so we # temporarily add the attribute back to construct the model. config_override = nullcontext() - if hasattr(config, "interleaved_sliding_window"): + if hasattr(self.config, "interleaved_sliding_window"): config_override = ConfigOverride( - config, sliding_window=config.interleaved_sliding_window) + self.config, + sliding_window=self.config.interleaved_sliding_window) # Set correct attn and init on "meta" to delay allocating GPU tensors # TODO: @raushan, use the public `model.set_attn_implementation()` @@ -455,23 +454,22 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.text_config._attn_implementation = "vllm" with init_on_device_without_buffers("meta"), config_override: self.model: PreTrainedModel = AutoModel.from_config( - config, - torch_dtype=model_config.dtype, - trust_remote_code=model_config.trust_remote_code, + self.config, + torch_dtype=self.model_config.dtype, + trust_remote_code=self.model_config.trust_remote_code, ) self.pipeline_parallel() self.tensor_parallel() # Input embeddings - text_config = config.get_text_config() if not isinstance(self.model.get_input_embeddings(), PPMissingLayer): self.model.set_input_embeddings( VocabParallelEmbedding( - text_config.vocab_size, - text_config.hidden_size, - org_num_embeddings=text_config.vocab_size, - quant_config=quant_config, + self.text_config.vocab_size, + self.text_config.hidden_size, + org_num_embeddings=self.text_config.vocab_size, + quant_config=self.quant_config, )) # Attention layers @@ -481,8 +479,8 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.init_parameters(self.model) self.make_empty_intermediate_tensors = ( - make_empty_intermediate_tensors_factory(["hidden_states"], - text_config.hidden_size)) + make_empty_intermediate_tensors_factory( + ["hidden_states"], self.text_config.hidden_size)) def pipeline_parallel(self): """ @@ -654,78 +652,40 @@ def forward( def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: - params_dict = dict(self.named_parameters()) - - loaded_params = set[str]() - for name, loaded_weight in weights: - # Use "model" instead of base_model_prefix because - # the base model 
attribute in vLLM is always `model` - if not name.startswith(prefix := "model."): - name = prefix + name - - if is_pp_missing_parameter(name, self): - continue - if name in params_dict: - param = params_dict[name] - weight_loader = getattr(param, "weight_loader", - default_weight_loader) - weight_loader(param, loaded_weight) - loaded_params.add(name) - return loaded_params + loader = AutoWeightsLoader(self, skip_prefixes=self.skip_prefixes) + return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper) @support_torch_compile -class TransformersForCausalLM(nn.Module, SupportsQuant, SupportsLoRA, - SupportsPP): - embedding_padding_modules = ["lm_head"] - embedding_modules = ["embed_tokens" - ] # TODO transformers will have a util to get it +class TransformersForCausalLM(TransformersBase): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - config: PretrainedConfig = vllm_config.model_config.hf_config - quant_config: QuantizationConfig = vllm_config.quant_config - - self.config = config + super().__init__(vllm_config=vllm_config, prefix=prefix) - self.transformers_model = TransformersModel(vllm_config=vllm_config, - prefix=prefix) - self.model = self.transformers_model.model + # Tell `TransformersBase.load_weights` to skip + # `lm_head` if the model has tied word embeddings + if self.text_config.tie_word_embeddings: + self.skip_prefixes = ["lm_head."] if get_pp_group().is_last_rank: - self.unpadded_vocab_size = config.vocab_size + self.unpadded_vocab_size = self.text_config.vocab_size self.lm_head = ParallelLMHead( - config.vocab_size, - config.hidden_size, - quant_config=quant_config, + self.text_config.vocab_size, + self.text_config.hidden_size, + quant_config=self.quant_config, prefix=maybe_prefix(prefix, "lm_head"), ) - if config.tie_word_embeddings: + if self.text_config.tie_word_embeddings: self.lm_head = self.lm_head.tie_weights( self.model.get_input_embeddings()) - logit_scale = getattr(config, "logit_scale", 1.0) - self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, - config.vocab_size, - logit_scale) + logit_scale = getattr(self.text_config, "logit_scale", 1.0) + self.logits_processor = LogitsProcessor( + self.unpadded_vocab_size, self.text_config.vocab_size, + logit_scale) else: self.lm_head = PPMissingLayer() - self.make_empty_intermediate_tensors = ( - self.transformers_model.make_empty_intermediate_tensors) - - def forward( - self, - input_ids: Optional[torch.Tensor], - positions: torch.Tensor, - intermediate_tensors: Optional[IntermediateTensors] = None, - inputs_embeds: Optional[torch.Tensor] = None, - ) -> Union[torch.Tensor, IntermediateTensors]: - model_output = self.transformers_model.forward(input_ids, positions, - intermediate_tensors, - inputs_embeds) - return model_output - def compute_logits( self, hidden_states: torch.Tensor, @@ -735,23 +695,12 @@ def compute_logits( sampling_metadata) return logits - def load_weights(self, weights: Iterable[tuple[str, - torch.Tensor]]) -> set[str]: - skip_prefixes = ["lm_head." 
- ] if self.config.tie_word_embeddings else None - loader = AutoWeightsLoader(self, skip_prefixes=skip_prefixes) - return loader.load_weights(weights) - @MULTIMODAL_REGISTRY.register_processor( MultiModalProcessor, info=MultiModalProcessingInfo, dummy_inputs=MultiModalDummyInputsBuilder) -class TransformersForMultimodalLM(nn.Module, SupportsQuant, SupportsLoRA, - SupportsPP, SupportsMultiModal): - embedding_padding_modules = ["lm_head"] - embedding_modules = ["embed_tokens"] - +class TransformersForMultimodalLM(TransformersForCausalLM, SupportsMultiModal): # Backwards compatibility for prev released models. State dicts back then # had different formats and cannot be loaded with `AutoModel` mapping as is hf_to_vllm_mapper = WeightsMapper( @@ -776,40 +725,10 @@ class TransformersForMultimodalLM(nn.Module, SupportsQuant, SupportsLoRA, }) def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - config: PretrainedConfig = vllm_config.model_config.hf_config - quant_config: QuantizationConfig = vllm_config.quant_config + super().__init__(vllm_config=vllm_config, prefix=prefix) - self.config = config self.dtype = vllm_config.model_config.dtype - self.transformers_model = TransformersModel(vllm_config=vllm_config, - prefix=prefix) - self.model = self.transformers_model.model - text_config = config.get_text_config() - - if get_pp_group().is_last_rank: - self.unpadded_vocab_size = text_config.vocab_size - self.lm_head = ParallelLMHead( - text_config.vocab_size, - text_config.hidden_size, - quant_config=quant_config, - prefix=maybe_prefix(prefix, "lm_head"), - ) - if text_config.tie_word_embeddings: - self.lm_head = self.lm_head.tie_weights( - self.model.get_input_embeddings()) - - logit_scale = getattr(config, "logit_scale", 1.0) - self.logits_processor = LogitsProcessor(self.unpadded_vocab_size, - text_config.vocab_size, - logit_scale) - else: - self.lm_head = PPMissingLayer() - - self.make_empty_intermediate_tensors = ( - self.transformers_model.make_empty_intermediate_tensors) - def forward( self, input_ids: Optional[torch.Tensor], @@ -828,30 +747,10 @@ def forward( input_ids, multimodal_embeds) input_ids = None - model_output = self.transformers_model.forward(input_ids, positions, - intermediate_tensors, - inputs_embeds) + model_output = super().forward(input_ids, positions, + intermediate_tensors, inputs_embeds) return model_output - def compute_logits( - self, - hidden_states: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[torch.Tensor]: - logits = self.logits_processor(self.lm_head, hidden_states, - sampling_metadata) - return logits - - def load_weights(self, weights: Iterable[tuple[str, - torch.Tensor]]) -> set[str]: - loader = AutoWeightsLoader( - self, - skip_prefixes=([ - "lm_head." 
- ] if self.config.get_text_config().tie_word_embeddings else None), - ) - return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper) - def get_multimodal_embeddings(self, **kwargs): pixel_values = kwargs.pop("pixel_values", None) pixel_values = pixel_values if pixel_values is not None else kwargs.pop( From e4fa7e2a2c76b58d4ce0a47057fa793c26f31554 Mon Sep 17 00:00:00 2001 From: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Date: Thu, 24 Jul 2025 03:37:19 -0400 Subject: [PATCH 310/552] [Bugfix][ROCm] Fix for warp_size uses on host (#21205) Signed-off-by: Gregory Shtrasberg Signed-off-by: x22x22 --- csrc/attention/attention_kernels.cuh | 2 +- csrc/attention/paged_attention_v1.cu | 5 ++- csrc/attention/paged_attention_v2.cu | 5 ++- csrc/cuda_compat.h | 31 ++++++++++++++-- csrc/moe/topk_softmax_kernels.cu | 47 +++++++++++++++---------- csrc/quantization/activation_kernels.cu | 2 +- csrc/quantization/gguf/gguf_kernel.cu | 2 +- csrc/rocm/attention.cu | 2 +- csrc/rocm/skinny_gemms.cu | 2 +- 9 files changed, 67 insertions(+), 31 deletions(-) diff --git a/csrc/attention/attention_kernels.cuh b/csrc/attention/attention_kernels.cuh index 8f24be89578..57382c1ddc6 100644 --- a/csrc/attention/attention_kernels.cuh +++ b/csrc/attention/attention_kernels.cuh @@ -24,7 +24,7 @@ #include "attention_dtypes.h" #include "attention_utils.cuh" -#include "cuda_compat.h" +#include "../cuda_compat.h" #ifdef USE_ROCM #include diff --git a/csrc/attention/paged_attention_v1.cu b/csrc/attention/paged_attention_v1.cu index 7a5ef10f8ef..307300e5566 100644 --- a/csrc/attention/paged_attention_v1.cu +++ b/csrc/attention/paged_attention_v1.cu @@ -16,9 +16,8 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - #include "attention_kernels.cuh" -#include "cuda_compat.h" +#include "../cuda_compat.h" #define MAX(a, b) ((a) > (b) ? (a) : (b)) #define MIN(a, b) ((a) < (b) ? (a) : (b)) @@ -75,7 +74,7 @@ void paged_attention_v1_launcher( const float* k_scale_ptr = reinterpret_cast(k_scale.data_ptr()); const float* v_scale_ptr = reinterpret_cast(v_scale.data_ptr()); - constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE; + const int NUM_WARPS = NUM_THREADS / WARP_SIZE; int padded_max_seq_len = DIVIDE_ROUND_UP(max_seq_len, BLOCK_SIZE) * BLOCK_SIZE; int logits_size = padded_max_seq_len * sizeof(float); diff --git a/csrc/attention/paged_attention_v2.cu b/csrc/attention/paged_attention_v2.cu index b45b28dad05..eb9b4feb4a8 100644 --- a/csrc/attention/paged_attention_v2.cu +++ b/csrc/attention/paged_attention_v2.cu @@ -16,9 +16,8 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - #include "attention_kernels.cuh" -#include "cuda_compat.h" +#include "../cuda_compat.h" #define MAX(a, b) ((a) > (b) ? (a) : (b)) #define MIN(a, b) ((a) < (b) ? 
(a) : (b)) @@ -79,7 +78,7 @@ void paged_attention_v2_launcher( const float* k_scale_ptr = reinterpret_cast(k_scale.data_ptr()); const float* v_scale_ptr = reinterpret_cast(v_scale.data_ptr()); - constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE; + const int NUM_WARPS = NUM_THREADS / WARP_SIZE; int max_num_partitions = DIVIDE_ROUND_UP(max_seq_len, PARTITION_SIZE); int logits_size = PARTITION_SIZE * sizeof(float); int outputs_size = (NUM_WARPS / 2) * head_size * sizeof(float); diff --git a/csrc/cuda_compat.h b/csrc/cuda_compat.h index affa051c759..d7d589db62c 100644 --- a/csrc/cuda_compat.h +++ b/csrc/cuda_compat.h @@ -4,8 +4,35 @@ #include #endif -#if defined(USE_ROCM) && defined(__GFX9__) - #define WARP_SIZE 64 +#ifdef USE_ROCM +struct Utils { + static __host__ int get_warp_size() { + static bool is_cached = false; + static int result; + + if (!is_cached) { + int device_id; + cudaDeviceProp deviceProp; + cudaGetDevice(&device_id); + cudaGetDeviceProperties(&deviceProp, device_id); + + result = deviceProp.warpSize; + is_cached = true; + } + + return result; + } + + static __device__ constexpr int get_warp_size() { + #ifdef __GFX9__ + return 64; + #else + return 32; + #endif + } +}; + + #define WARP_SIZE Utils::get_warp_size() #else #define WARP_SIZE 32 #endif diff --git a/csrc/moe/topk_softmax_kernels.cu b/csrc/moe/topk_softmax_kernels.cu index 064b76c9cd4..0b505d2e04a 100644 --- a/csrc/moe/topk_softmax_kernels.cu +++ b/csrc/moe/topk_softmax_kernels.cu @@ -190,8 +190,8 @@ __launch_bounds__(TPB) __global__ void moeTopK( 2) This implementation assumes k is small, but will work for any k. */ -template -__launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__ +template +__launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__ void topkGatingSoftmax(const float* input, const bool* finished, float* output, const int num_rows, IndType* indices, int* source_rows, const int k, const int start_expert, const int end_expert) { @@ -209,12 +209,12 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__ // Restrictions based on previous section. static_assert(VPT % ELTS_PER_LDG == 0, "The elements per thread must be a multiple of the elements per ldg"); - static_assert(WARP_SIZE % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp"); + static_assert(WARP_SIZE_PARAM % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp"); static_assert(THREADS_PER_ROW == (THREADS_PER_ROW & -THREADS_PER_ROW), "THREADS_PER_ROW must be power of 2"); - static_assert(THREADS_PER_ROW <= WARP_SIZE, "THREADS_PER_ROW can be at most warp size"); + static_assert(THREADS_PER_ROW <= WARP_SIZE_PARAM, "THREADS_PER_ROW can be at most warp size"); // We have NUM_EXPERTS elements per row. We specialize for small #experts - static constexpr int ELTS_PER_WARP = WARP_SIZE * VPT; + static constexpr int ELTS_PER_WARP = WARP_SIZE_PARAM * VPT; static constexpr int ROWS_PER_WARP = ELTS_PER_WARP / ELTS_PER_ROW; static constexpr int ROWS_PER_CTA = WARPS_PER_CTA * ROWS_PER_WARP; @@ -393,41 +393,51 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__ namespace detail { // Constructs some constants needed to partition the work across threads at compile time. 
-template +template struct TopkConstants { static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(float); - static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE) == 0, ""); - static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE)); + static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0, ""); + static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM)); static constexpr int VPT = VECs_PER_THREAD * ELTS_PER_LDG; static constexpr int THREADS_PER_ROW = EXPERTS / VPT; - static constexpr int ROWS_PER_WARP = WARP_SIZE / THREADS_PER_ROW; + static const int ROWS_PER_WARP = WARP_SIZE_PARAM / THREADS_PER_ROW; }; } // namespace detail -template +template void topkGatingSoftmaxLauncherHelper(const float* input, const bool* finished, float* output, IndType* indices, int* source_row, const int num_rows, const int k, const int start_expert, const int end_expert, cudaStream_t stream) { static constexpr std::size_t MAX_BYTES_PER_LDG = 16; static constexpr int BYTES_PER_LDG = MIN(MAX_BYTES_PER_LDG, sizeof(float) * EXPERTS); - using Constants = detail::TopkConstants; + using Constants = detail::TopkConstants; static constexpr int VPT = Constants::VPT; static constexpr int ROWS_PER_WARP = Constants::ROWS_PER_WARP; const int num_warps = (num_rows + ROWS_PER_WARP - 1) / ROWS_PER_WARP; const int num_blocks = (num_warps + WARPS_PER_TB - 1) / WARPS_PER_TB; - dim3 block_dim(WARP_SIZE, WARPS_PER_TB); - topkGatingSoftmax<<>>( + dim3 block_dim(WARP_SIZE_PARAM, WARPS_PER_TB); + topkGatingSoftmax<<>>( input, finished, output, num_rows, indices, source_row, k, start_expert, end_expert); } -#define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB) \ - topkGatingSoftmaxLauncherHelper( \ - gating_output, nullptr, topk_weights, topk_indices, \ - token_expert_indices, num_tokens, topk, 0, num_experts, \ - stream); +#define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB) \ + switch (warpSize) { \ + case 32: \ + topkGatingSoftmaxLauncherHelper( \ + gating_output, nullptr, topk_weights, topk_indices, \ + token_expert_indices, num_tokens, topk, 0, num_experts, stream); \ + break; \ + case 64: \ + topkGatingSoftmaxLauncherHelper( \ + gating_output, nullptr, topk_weights, topk_indices, \ + token_expert_indices, num_tokens, topk, 0, num_experts, stream); \ + break; \ + default: \ + TORCH_CHECK(false, "Unsupported warp size: ", warpSize); \ + } template void topkGatingSoftmaxKernelLauncher( @@ -441,6 +451,7 @@ void topkGatingSoftmaxKernelLauncher( const int topk, cudaStream_t stream) { static constexpr int WARPS_PER_TB = 4; + auto warpSize = WARP_SIZE; switch (num_experts) { case 1: LAUNCH_SOFTMAX(1, WARPS_PER_TB); diff --git a/csrc/quantization/activation_kernels.cu b/csrc/quantization/activation_kernels.cu index 67e9149c137..8bc2b9bff3d 100644 --- a/csrc/quantization/activation_kernels.cu +++ b/csrc/quantization/activation_kernels.cu @@ -4,7 +4,7 @@ #include #include "core/math.hpp" -#include "cuda_compat.h" +#include "../cuda_compat.h" #include "dispatch_utils.h" #include "quantization/fp8/common.cuh" diff --git a/csrc/quantization/gguf/gguf_kernel.cu b/csrc/quantization/gguf/gguf_kernel.cu index 3b5180b5162..76fe73e9504 100644 --- a/csrc/quantization/gguf/gguf_kernel.cu +++ b/csrc/quantization/gguf/gguf_kernel.cu @@ -4,7 +4,7 @@ #include #include -#include "cuda_compat.h" +#include "../../cuda_compat.h" #include "dispatch_utils.h" #include "ggml-common.h" diff --git 
a/csrc/rocm/attention.cu b/csrc/rocm/attention.cu index 3bddd12cad0..65cb1c1d147 100644 --- a/csrc/rocm/attention.cu +++ b/csrc/rocm/attention.cu @@ -19,7 +19,7 @@ #include #include #include -#include "cuda_compat.h" +#include "../cuda_compat.h" #include #include "../attention/dtype_fp8.cuh" diff --git a/csrc/rocm/skinny_gemms.cu b/csrc/rocm/skinny_gemms.cu index 6212570c79d..eb47139208c 100644 --- a/csrc/rocm/skinny_gemms.cu +++ b/csrc/rocm/skinny_gemms.cu @@ -9,7 +9,7 @@ #include #include -#include "cuda_compat.h" +#include "../cuda_compat.h" #include "dispatch_utils.h" #include "quantization/fp8/common.cuh" From 7614653fdc77342798e2d1244744d2f10bcbe4c4 Mon Sep 17 00:00:00 2001 From: Chengji Yao Date: Thu, 24 Jul 2025 00:38:39 -0700 Subject: [PATCH 311/552] [TPU][Bugfix] fix moe layer (#21340) Signed-off-by: Chengji Yao Co-authored-by: Simon Mo Signed-off-by: x22x22 --- tests/v1/tpu/test_basic.py | 1 + vllm/model_executor/layers/fused_moe/layer.py | 19 ++++++++++++++++++- 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/tests/v1/tpu/test_basic.py b/tests/v1/tpu/test_basic.py index c8cd099a98c..b9ee9d66a38 100644 --- a/tests/v1/tpu/test_basic.py +++ b/tests/v1/tpu/test_basic.py @@ -18,6 +18,7 @@ MODELS = [ "Qwen/Qwen2.5-1.5B-Instruct", + "Qwen/Qwen1.5-MoE-A2.7B", # TODO: Enable this models with v6e # "Qwen/Qwen2-7B-Instruct", # "meta-llama/Llama-3.1-8B", diff --git a/vllm/model_executor/layers/fused_moe/layer.py b/vllm/model_executor/layers/fused_moe/layer.py index 4a6a3b95ec7..2a283a6d12b 100644 --- a/vllm/model_executor/layers/fused_moe/layer.py +++ b/vllm/model_executor/layers/fused_moe/layer.py @@ -481,8 +481,16 @@ def forward_cpu( e_score_correction_bias: Optional[torch.Tensor] = None, apply_router_weight_on_input: bool = False, activation: str = "silu", - **kwargs, + enable_eplb: bool = False, + expert_load_view: Optional[torch.Tensor] = None, + logical_to_physical_map: Optional[torch.Tensor] = None, + logical_replica_count: Optional[torch.Tensor] = None, ): + if enable_eplb is not False or expert_load_view is not None or \ + logical_to_physical_map is not None or \ + logical_replica_count is not None: + raise NotImplementedError("Expert load balancing is not supported " + "for CPU.") return layer.cpu_fused_moe( layer, x, @@ -518,6 +526,10 @@ def forward_tpu( e_score_correction_bias: Optional[torch.Tensor] = None, apply_router_weight_on_input: bool = False, activation: str = "silu", + enable_eplb: bool = False, + expert_load_view: Optional[torch.Tensor] = None, + logical_to_physical_map: Optional[torch.Tensor] = None, + logical_replica_count: Optional[torch.Tensor] = None, ) -> torch.Tensor: assert not use_grouped_topk assert num_expert_group is None @@ -531,6 +543,11 @@ def forward_tpu( raise NotImplementedError( "Expert score correction bias is not supported for TPU.") assert activation == "silu", f"{activation} is not supported for TPU." 
+ if enable_eplb is not False or expert_load_view is not None or \ + logical_to_physical_map is not None or \ + logical_replica_count is not None: + raise NotImplementedError("Expert load balancing is not supported " + "for TPU.") return fused_moe_pallas(hidden_states=x, w1=layer.w13_weight, w2=layer.w2_weight, From a02bedc7e37ca790e66afbd138a2833bdc6bdeef Mon Sep 17 00:00:00 2001 From: Zhou Fang Date: Thu, 24 Jul 2025 00:40:11 -0700 Subject: [PATCH 312/552] [v1][Core] Clean up usages of `SpecializedManager` (#21407) Signed-off-by: Zhou Fang Signed-off-by: x22x22 --- ...cialized_manager.py => test_single_type_kv_cache_manager.py} | 0 vllm/v1/core/single_type_kv_cache_manager.py | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) rename tests/v1/core/{test_specialized_manager.py => test_single_type_kv_cache_manager.py} (100%) diff --git a/tests/v1/core/test_specialized_manager.py b/tests/v1/core/test_single_type_kv_cache_manager.py similarity index 100% rename from tests/v1/core/test_specialized_manager.py rename to tests/v1/core/test_single_type_kv_cache_manager.py diff --git a/vllm/v1/core/single_type_kv_cache_manager.py b/vllm/v1/core/single_type_kv_cache_manager.py index 65a196e044a..e8a44c7773a 100644 --- a/vllm/v1/core/single_type_kv_cache_manager.py +++ b/vllm/v1/core/single_type_kv_cache_manager.py @@ -27,7 +27,7 @@ def __init__( caching_hash_fn: Callable, ) -> None: """ - Initializes the SpecializedManager. + Initializes the SingleTypeKVCacheManager. Args: kv_cache_spec: The kv_cache_spec for this manager. block_pool: The block pool. From 4990b3a48c3e4eb816257be768b01138346cbd23 Mon Sep 17 00:00:00 2001 From: Nick Hill Date: Thu, 24 Jul 2025 09:27:30 +0100 Subject: [PATCH 313/552] [Misc] Fix duplicate FusedMoEConfig debug messages (#21455) Signed-off-by: Nick Hill Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/config.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/config.py b/vllm/model_executor/layers/fused_moe/config.py index f5ed2861b8f..9e4ee5a3d7b 100644 --- a/vllm/model_executor/layers/fused_moe/config.py +++ b/vllm/model_executor/layers/fused_moe/config.py @@ -325,8 +325,8 @@ class FusedMoEConfig: def __post_init__(self): if self.dp_size > 1: - logger.debug("Using FusedMoEConfig::max_num_tokens=%d", - self.max_num_tokens) + logger.debug_once("Using FusedMoEConfig::max_num_tokens=%d", + self.max_num_tokens) assert self.max_num_tokens > 0 From c630bf22f04f01964ad25c443c52d35880595209 Mon Sep 17 00:00:00 2001 From: 22quinn <33176974+22quinn@users.noreply.github.com> Date: Thu, 24 Jul 2025 01:49:44 -0700 Subject: [PATCH 314/552] [Core] Support model loader plugins (#21067) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: x22x22 --- .../test_fastsafetensors_loader.py | 4 +- tests/model_executor/model_loader/__init__.py | 0 .../model_loader/test_registry.py | 37 ++++++ .../test_runai_model_streamer_loader.py | 7 +- vllm/config.py | 30 +---- vllm/engine/arg_utils.py | 28 ++--- vllm/model_executor/model_loader/__init__.py | 114 +++++++++++++----- .../model_loader/default_loader.py | 18 +-- .../model_loader/sharded_state_loader.py | 7 +- 9 files changed, 159 insertions(+), 86 deletions(-) create mode 100644 tests/model_executor/model_loader/__init__.py create mode 100644 tests/model_executor/model_loader/test_registry.py diff --git a/tests/fastsafetensors_loader/test_fastsafetensors_loader.py b/tests/fastsafetensors_loader/test_fastsafetensors_loader.py index 
1b95bf59f67..afd411ff487 100644 --- a/tests/fastsafetensors_loader/test_fastsafetensors_loader.py +++ b/tests/fastsafetensors_loader/test_fastsafetensors_loader.py @@ -2,7 +2,6 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from vllm import SamplingParams -from vllm.config import LoadFormat test_model = "openai-community/gpt2" @@ -17,7 +16,6 @@ def test_model_loader_download_files(vllm_runner): - with vllm_runner(test_model, - load_format=LoadFormat.FASTSAFETENSORS) as llm: + with vllm_runner(test_model, load_format="fastsafetensors") as llm: deserialized_outputs = llm.generate(prompts, sampling_params) assert deserialized_outputs diff --git a/tests/model_executor/model_loader/__init__.py b/tests/model_executor/model_loader/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/model_executor/model_loader/test_registry.py b/tests/model_executor/model_loader/test_registry.py new file mode 100644 index 00000000000..93a3e34835b --- /dev/null +++ b/tests/model_executor/model_loader/test_registry.py @@ -0,0 +1,37 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import pytest +from torch import nn + +from vllm.config import LoadConfig, ModelConfig +from vllm.model_executor.model_loader import (get_model_loader, + register_model_loader) +from vllm.model_executor.model_loader.base_loader import BaseModelLoader + + +@register_model_loader("custom_load_format") +class CustomModelLoader(BaseModelLoader): + + def __init__(self, load_config: LoadConfig) -> None: + super().__init__(load_config) + + def download_model(self, model_config: ModelConfig) -> None: + pass + + def load_weights(self, model: nn.Module, + model_config: ModelConfig) -> None: + pass + + +def test_register_model_loader(): + load_config = LoadConfig(load_format="custom_load_format") + assert isinstance(get_model_loader(load_config), CustomModelLoader) + + +def test_invalid_model_loader(): + with pytest.raises(ValueError): + + @register_model_loader("invalid_load_format") + class InValidModelLoader: + pass diff --git a/tests/runai_model_streamer_test/test_runai_model_streamer_loader.py b/tests/runai_model_streamer_test/test_runai_model_streamer_loader.py index e27d9958f29..84c615b6b8d 100644 --- a/tests/runai_model_streamer_test/test_runai_model_streamer_loader.py +++ b/tests/runai_model_streamer_test/test_runai_model_streamer_loader.py @@ -2,9 +2,10 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from vllm import SamplingParams -from vllm.config import LoadConfig, LoadFormat +from vllm.config import LoadConfig from vllm.model_executor.model_loader import get_model_loader +load_format = "runai_streamer" test_model = "openai-community/gpt2" prompts = [ @@ -18,7 +19,7 @@ def get_runai_model_loader(): - load_config = LoadConfig(load_format=LoadFormat.RUNAI_STREAMER) + load_config = LoadConfig(load_format=load_format) return get_model_loader(load_config) @@ -28,6 +29,6 @@ def test_get_model_loader_with_runai_flag(): def test_runai_model_loader_download_files(vllm_runner): - with vllm_runner(test_model, load_format=LoadFormat.RUNAI_STREAMER) as llm: + with vllm_runner(test_model, load_format=load_format) as llm: deserialized_outputs = llm.generate(prompts, sampling_params) assert deserialized_outputs diff --git a/vllm/config.py b/vllm/config.py index eb5ddef30f2..02a3ed93910 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -65,7 +65,7 @@ from vllm.model_executor.layers.quantization import 
QuantizationMethods from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) - from vllm.model_executor.model_loader import BaseModelLoader + from vllm.model_executor.model_loader import LoadFormats from vllm.model_executor.model_loader.tensorizer import TensorizerConfig ConfigType = type[DataclassInstance] @@ -78,6 +78,7 @@ QuantizationConfig = Any QuantizationMethods = Any BaseModelLoader = Any + LoadFormats = Any TensorizerConfig = Any ConfigType = type HfOverrides = Union[dict[str, Any], Callable[[type], type]] @@ -1773,29 +1774,12 @@ def verify_with_parallel_config( logger.warning("Possibly too large swap space. %s", msg) -class LoadFormat(str, enum.Enum): - AUTO = "auto" - PT = "pt" - SAFETENSORS = "safetensors" - NPCACHE = "npcache" - DUMMY = "dummy" - TENSORIZER = "tensorizer" - SHARDED_STATE = "sharded_state" - GGUF = "gguf" - BITSANDBYTES = "bitsandbytes" - MISTRAL = "mistral" - RUNAI_STREAMER = "runai_streamer" - RUNAI_STREAMER_SHARDED = "runai_streamer_sharded" - FASTSAFETENSORS = "fastsafetensors" - - @config @dataclass class LoadConfig: """Configuration for loading the model weights.""" - load_format: Union[str, LoadFormat, - "BaseModelLoader"] = LoadFormat.AUTO.value + load_format: Union[str, LoadFormats] = "auto" """The format of the model weights to load:\n - "auto" will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available.\n @@ -1816,7 +1800,8 @@ class LoadConfig: - "gguf" will load weights from GGUF format files (details specified in https://github.com/ggml-org/ggml/blob/master/docs/gguf.md).\n - "mistral" will load weights from consolidated safetensors files used by - Mistral models.""" + Mistral models. + - Other custom values can be supported via plugins.""" download_dir: Optional[str] = None """Directory to download and load the weights, default to the default cache directory of Hugging Face.""" @@ -1864,10 +1849,7 @@ def compute_hash(self) -> str: return hash_str def __post_init__(self): - if isinstance(self.load_format, str): - load_format = self.load_format.lower() - self.load_format = LoadFormat(load_format) - + self.load_format = self.load_format.lower() if self.ignore_patterns is not None and len(self.ignore_patterns) > 0: logger.info( "Ignoring the following patterns when downloading weights: %s", diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index aec75f82631..70996800471 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -26,13 +26,12 @@ DetailedTraceModules, Device, DeviceConfig, DistributedExecutorBackend, GuidedDecodingBackend, GuidedDecodingBackendV1, HfOverrides, KVEventsConfig, - KVTransferConfig, LoadConfig, LoadFormat, - LogprobsMode, LoRAConfig, ModelConfig, ModelDType, - ModelImpl, MultiModalConfig, ObservabilityConfig, - ParallelConfig, PoolerConfig, PrefixCachingHashAlgo, - SchedulerConfig, SchedulerPolicy, SpeculativeConfig, - TaskOption, TokenizerMode, VllmConfig, get_attr_docs, - get_field) + KVTransferConfig, LoadConfig, LogprobsMode, + LoRAConfig, ModelConfig, ModelDType, ModelImpl, + MultiModalConfig, ObservabilityConfig, ParallelConfig, + PoolerConfig, PrefixCachingHashAlgo, SchedulerConfig, + SchedulerPolicy, SpeculativeConfig, TaskOption, + TokenizerMode, VllmConfig, get_attr_docs, get_field) from vllm.logger import init_logger from vllm.platforms import CpuArchEnum, current_platform from vllm.plugins import load_general_plugins @@ -47,10 +46,12 @@ if TYPE_CHECKING: from 
vllm.executor.executor_base import ExecutorBase from vllm.model_executor.layers.quantization import QuantizationMethods + from vllm.model_executor.model_loader import LoadFormats from vllm.usage.usage_lib import UsageContext else: ExecutorBase = Any QuantizationMethods = Any + LoadFormats = Any UsageContext = Any logger = init_logger(__name__) @@ -276,7 +277,7 @@ class EngineArgs: trust_remote_code: bool = ModelConfig.trust_remote_code allowed_local_media_path: str = ModelConfig.allowed_local_media_path download_dir: Optional[str] = LoadConfig.download_dir - load_format: str = LoadConfig.load_format + load_format: Union[str, LoadFormats] = LoadConfig.load_format config_format: str = ModelConfig.config_format dtype: ModelDType = ModelConfig.dtype kv_cache_dtype: CacheDType = CacheConfig.cache_dtype @@ -547,9 +548,7 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: title="LoadConfig", description=LoadConfig.__doc__, ) - load_group.add_argument("--load-format", - choices=[f.value for f in LoadFormat], - **load_kwargs["load_format"]) + load_group.add_argument("--load-format", **load_kwargs["load_format"]) load_group.add_argument("--download-dir", **load_kwargs["download_dir"]) load_group.add_argument("--model-loader-extra-config", @@ -864,10 +863,9 @@ def create_model_config(self) -> ModelConfig: # NOTE: This is to allow model loading from S3 in CI if (not isinstance(self, AsyncEngineArgs) and envs.VLLM_CI_USE_S3 - and self.model in MODELS_ON_S3 - and self.load_format == LoadFormat.AUTO): # noqa: E501 + and self.model in MODELS_ON_S3 and self.load_format == "auto"): self.model = f"{MODEL_WEIGHTS_S3_BUCKET}/{self.model}" - self.load_format = LoadFormat.RUNAI_STREAMER + self.load_format = "runai_streamer" return ModelConfig( model=self.model, @@ -1299,7 +1297,7 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: ############################################################# # Unsupported Feature Flags on V1. 
- if self.load_format == LoadFormat.SHARDED_STATE.value: + if self.load_format == "sharded_state": _raise_or_fallback( feature_name=f"--load_format {self.load_format}", recommend_to_remove=False) diff --git a/vllm/model_executor/model_loader/__init__.py b/vllm/model_executor/model_loader/__init__.py index 78681a04637..2dada794a8f 100644 --- a/vllm/model_executor/model_loader/__init__.py +++ b/vllm/model_executor/model_loader/__init__.py @@ -1,11 +1,12 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import Optional +from typing import Literal, Optional from torch import nn -from vllm.config import LoadConfig, LoadFormat, ModelConfig, VllmConfig +from vllm.config import LoadConfig, ModelConfig, VllmConfig +from vllm.logger import init_logger from vllm.model_executor.model_loader.base_loader import BaseModelLoader from vllm.model_executor.model_loader.bitsandbytes_loader import ( BitsAndBytesModelLoader) @@ -20,34 +21,92 @@ from vllm.model_executor.model_loader.utils import ( get_architecture_class_name, get_model_architecture, get_model_cls) +logger = init_logger(__name__) + +# Reminder: Please update docstring in `LoadConfig` +# if a new load format is added here +LoadFormats = Literal[ + "auto", + "bitsandbytes", + "dummy", + "fastsafetensors", + "gguf", + "mistral", + "npcache", + "pt", + "runai_streamer", + "runai_streamer_sharded", + "safetensors", + "sharded_state", + "tensorizer", +] +_LOAD_FORMAT_TO_MODEL_LOADER: dict[str, type[BaseModelLoader]] = { + "auto": DefaultModelLoader, + "bitsandbytes": BitsAndBytesModelLoader, + "dummy": DummyModelLoader, + "fastsafetensors": DefaultModelLoader, + "gguf": GGUFModelLoader, + "mistral": DefaultModelLoader, + "npcache": DefaultModelLoader, + "pt": DefaultModelLoader, + "runai_streamer": RunaiModelStreamerLoader, + "runai_streamer_sharded": ShardedStateLoader, + "safetensors": DefaultModelLoader, + "sharded_state": ShardedStateLoader, + "tensorizer": TensorizerLoader, +} + + +def register_model_loader(load_format: str): + """Register a customized vllm model loader. + + When a load format is not supported by vllm, you can register a customized + model loader to support it. + + Args: + load_format (str): The model loader format name. + + Examples: + >>> from vllm.config import LoadConfig + >>> from vllm.model_executor.model_loader import get_model_loader, register_model_loader + >>> from vllm.model_executor.model_loader.base_loader import BaseModelLoader + >>> + >>> @register_model_loader("my_loader") + ... class MyModelLoader(BaseModelLoader): + ... def download_model(self): + ... pass + ... + ... def load_weights(self): + ... 
pass + >>> + >>> load_config = LoadConfig(load_format="my_loader") + >>> type(get_model_loader(load_config)) + + """ # noqa: E501 + + def _wrapper(model_loader_cls): + if load_format in _LOAD_FORMAT_TO_MODEL_LOADER: + logger.warning( + "Load format `%s` is already registered, and will be " + "overwritten by the new loader class `%s`.", load_format, + model_loader_cls) + if not issubclass(model_loader_cls, BaseModelLoader): + raise ValueError("The model loader must be a subclass of " + "`BaseModelLoader`.") + _LOAD_FORMAT_TO_MODEL_LOADER[load_format] = model_loader_cls + logger.info("Registered model loader `%s` with load format `%s`", + model_loader_cls, load_format) + return model_loader_cls + + return _wrapper + def get_model_loader(load_config: LoadConfig) -> BaseModelLoader: """Get a model loader based on the load format.""" - if isinstance(load_config.load_format, type): - return load_config.load_format(load_config) - - if load_config.load_format == LoadFormat.DUMMY: - return DummyModelLoader(load_config) - - if load_config.load_format == LoadFormat.TENSORIZER: - return TensorizerLoader(load_config) - - if load_config.load_format == LoadFormat.SHARDED_STATE: - return ShardedStateLoader(load_config) - - if load_config.load_format == LoadFormat.BITSANDBYTES: - return BitsAndBytesModelLoader(load_config) - - if load_config.load_format == LoadFormat.GGUF: - return GGUFModelLoader(load_config) - - if load_config.load_format == LoadFormat.RUNAI_STREAMER: - return RunaiModelStreamerLoader(load_config) - - if load_config.load_format == LoadFormat.RUNAI_STREAMER_SHARDED: - return ShardedStateLoader(load_config, runai_model_streamer=True) - - return DefaultModelLoader(load_config) + load_format = load_config.load_format + if load_format not in _LOAD_FORMAT_TO_MODEL_LOADER: + raise ValueError(f"Load format `{load_format}` is not supported") + return _LOAD_FORMAT_TO_MODEL_LOADER[load_format](load_config) def get_model(*, @@ -66,6 +125,7 @@ def get_model(*, "get_architecture_class_name", "get_model_architecture", "get_model_cls", + "register_model_loader", "BaseModelLoader", "BitsAndBytesModelLoader", "GGUFModelLoader", diff --git a/vllm/model_executor/model_loader/default_loader.py b/vllm/model_executor/model_loader/default_loader.py index 2fcae7eb6e6..36568e881eb 100644 --- a/vllm/model_executor/model_loader/default_loader.py +++ b/vllm/model_executor/model_loader/default_loader.py @@ -13,7 +13,7 @@ from transformers.utils import SAFE_WEIGHTS_INDEX_NAME from vllm import envs -from vllm.config import LoadConfig, LoadFormat, ModelConfig +from vllm.config import LoadConfig, ModelConfig from vllm.logger import init_logger from vllm.model_executor.model_loader.base_loader import BaseModelLoader from vllm.model_executor.model_loader.weight_utils import ( @@ -104,19 +104,19 @@ def _prepare_weights( use_safetensors = False index_file = SAFE_WEIGHTS_INDEX_NAME # Some quantized models use .pt files for storing the weights. 
- if load_format == LoadFormat.AUTO: + if load_format == "auto": allow_patterns = ["*.safetensors", "*.bin"] - elif (load_format == LoadFormat.SAFETENSORS - or load_format == LoadFormat.FASTSAFETENSORS): + elif (load_format == "safetensors" + or load_format == "fastsafetensors"): use_safetensors = True allow_patterns = ["*.safetensors"] - elif load_format == LoadFormat.MISTRAL: + elif load_format == "mistral": use_safetensors = True allow_patterns = ["consolidated*.safetensors"] index_file = "consolidated.safetensors.index.json" - elif load_format == LoadFormat.PT: + elif load_format == "pt": allow_patterns = ["*.pt"] - elif load_format == LoadFormat.NPCACHE: + elif load_format == "npcache": allow_patterns = ["*.bin"] else: raise ValueError(f"Unknown load_format: {load_format}") @@ -178,7 +178,7 @@ def _get_weights_iterator( hf_folder, hf_weights_files, use_safetensors = self._prepare_weights( source.model_or_path, source.revision, source.fall_back_to_pt, source.allow_patterns_overrides) - if self.load_config.load_format == LoadFormat.NPCACHE: + if self.load_config.load_format == "npcache": # Currently np_cache only support *.bin checkpoints assert use_safetensors is False weights_iterator = np_cache_weights_iterator( @@ -189,7 +189,7 @@ def _get_weights_iterator( self.load_config.use_tqdm_on_load, ) elif use_safetensors: - if self.load_config.load_format == LoadFormat.FASTSAFETENSORS: + if self.load_config.load_format == "fastsafetensors": weights_iterator = fastsafetensors_weights_iterator( hf_weights_files, self.load_config.use_tqdm_on_load, diff --git a/vllm/model_executor/model_loader/sharded_state_loader.py b/vllm/model_executor/model_loader/sharded_state_loader.py index 2fd9cfba3f6..3edd4ec4007 100644 --- a/vllm/model_executor/model_loader/sharded_state_loader.py +++ b/vllm/model_executor/model_loader/sharded_state_loader.py @@ -32,12 +32,9 @@ class ShardedStateLoader(BaseModelLoader): DEFAULT_PATTERN = "model-rank-{rank}-part-{part}.safetensors" - def __init__(self, - load_config: LoadConfig, - runai_model_streamer: bool = False): + def __init__(self, load_config: LoadConfig): super().__init__(load_config) - self.runai_model_streamer = runai_model_streamer extra_config = ({} if load_config.model_loader_extra_config is None else load_config.model_loader_extra_config.copy()) self.pattern = extra_config.pop("pattern", self.DEFAULT_PATTERN) @@ -152,7 +149,7 @@ def load_weights(self, model: nn.Module, def iterate_over_files( self, paths) -> Generator[tuple[str, torch.Tensor], None, None]: - if self.runai_model_streamer: + if self.load_config.load_format == "runai_streamer_sharded": yield from runai_safetensors_weights_iterator(paths, True) else: from safetensors.torch import safe_open From a2a71ff746f691b1285b81a0fdaf6ea6d7bc2691 Mon Sep 17 00:00:00 2001 From: Yuxuan Zhang <2448370773@qq.com> Date: Thu, 24 Jul 2025 16:52:43 +0800 Subject: [PATCH 315/552] remove GLM-4.5 quantization wrong Code (#21435) Signed-off-by: x22x22 --- vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py | 2 +- vllm/model_executor/models/glm4_moe.py | 1 - vllm/reasoning/glm4_moe_reasoning_parser.py | 2 +- 3 files changed, 2 insertions(+), 3 deletions(-) diff --git a/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py b/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py index c3f9d792357..40cdf7275a8 100644 --- a/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py +++ b/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py @@ -20,7 +20,7 @@ logger = init_logger(__name__) 
-@ToolParserManager.register_module("glm4_moe") +@ToolParserManager.register_module("glm45") class Glm4MoeModelToolParser(ToolParser): def __init__(self, tokenizer: AnyTokenizer): diff --git a/vllm/model_executor/models/glm4_moe.py b/vllm/model_executor/models/glm4_moe.py index bdca293d21d..095bfbc401b 100644 --- a/vllm/model_executor/models/glm4_moe.py +++ b/vllm/model_executor/models/glm4_moe.py @@ -390,7 +390,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.embed_tokens = VocabParallelEmbedding( config.vocab_size, config.hidden_size, - quant_config=quant_config, prefix=f"{prefix}.embed_tokens") else: self.embed_tokens = PPMissingLayer() diff --git a/vllm/reasoning/glm4_moe_reasoning_parser.py b/vllm/reasoning/glm4_moe_reasoning_parser.py index 6511fb49d10..460e38d2d39 100644 --- a/vllm/reasoning/glm4_moe_reasoning_parser.py +++ b/vllm/reasoning/glm4_moe_reasoning_parser.py @@ -14,7 +14,7 @@ logger = init_logger(__name__) -@ReasoningParserManager.register_module("glm4_moe") +@ReasoningParserManager.register_module("glm45") class Glm4MoeModelReasoningParser(ReasoningParser): """ Reasoning parser for the Glm4MoeModel model. From c9a71b7240a2f838a5ed358212b38a8dbaec276a Mon Sep 17 00:00:00 2001 From: Shintarou Okada Date: Thu, 24 Jul 2025 18:56:36 +0900 Subject: [PATCH 316/552] Replace `--expand-tools-even-if-tool-choice-none` with `--exclude-tools-when-tool-choice-none` for v0.10.0 (#20544) Signed-off-by: okada Signed-off-by: okada shintarou Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/features/tool_calling.md | 3 ++- vllm/entrypoints/openai/api_server.py | 2 ++ vllm/entrypoints/openai/cli_args.py | 3 +++ vllm/entrypoints/openai/serving_chat.py | 7 ++++++- 4 files changed, 13 insertions(+), 2 deletions(-) diff --git a/docs/features/tool_calling.md b/docs/features/tool_calling.md index ce74683a162..37d502ef9ce 100644 --- a/docs/features/tool_calling.md +++ b/docs/features/tool_calling.md @@ -103,7 +103,8 @@ When tool_choice='required' is set, the model is guaranteed to generate one or m vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request. -However, when `tool_choice='none'` is specified, vLLM includes tool definitions from the prompt. +!!! note + When tools are specified in the request, vLLM includes tool definitions in the prompt by default, regardless of the `tool_choice` setting. To exclude tool definitions when `tool_choice='none'`, use the `--exclude-tools-when-tool-choice-none` option. ## Automatic Function Calling diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index d4135519aa4..89e5e7ed8d3 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1646,6 +1646,8 @@ async def init_app_state( chat_template_content_format=args.chat_template_content_format, return_tokens_as_token_ids=args.return_tokens_as_token_ids, enable_auto_tools=args.enable_auto_tool_choice, + exclude_tools_when_tool_choice_none=args. 
+ exclude_tools_when_tool_choice_none, tool_parser=args.tool_call_parser, reasoning_parser=args.reasoning_parser, enable_prompt_tokens_details=args.enable_prompt_tokens_details, diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index 3025a626368..7f60fe71302 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -133,6 +133,9 @@ class FrontendArgs: """If specified, API server will add X-Request-Id header to responses. Caution: this hurts performance at high QPS.""" enable_auto_tool_choice: bool = False + """If specified, exclude tool definitions in prompts when + tool_choice='none'.""" + exclude_tools_when_tool_choice_none: bool = False """Enable auto tool choice for supported models. Use `--tool-call-parser` to specify which parser to use.""" tool_call_parser: Optional[str] = None diff --git a/vllm/entrypoints/openai/serving_chat.py b/vllm/entrypoints/openai/serving_chat.py index 33d80743420..832a3d501de 100644 --- a/vllm/entrypoints/openai/serving_chat.py +++ b/vllm/entrypoints/openai/serving_chat.py @@ -63,6 +63,7 @@ def __init__( return_tokens_as_token_ids: bool = False, reasoning_parser: str = "", enable_auto_tools: bool = False, + exclude_tools_when_tool_choice_none: bool = False, tool_parser: Optional[str] = None, enable_prompt_tokens_details: bool = False, enable_force_include_usage: bool = False, @@ -111,6 +112,8 @@ def __init__( raise TypeError("Error: --enable-auto-tool-choice requires " f"tool_parser:'{tool_parser}' which has not " "been registered") from e + self.exclude_tools_when_tool_choice_none = ( + exclude_tools_when_tool_choice_none) self.enable_prompt_tokens_details = enable_prompt_tokens_details self.enable_force_include_usage = enable_force_include_usage @@ -174,7 +177,9 @@ async def create_chat_completion( "--enable-auto-tool-choice and --tool-call-parser to be set" ) - if request.tools is None: + if (request.tools is None + or (request.tool_choice == "none" + and self.exclude_tools_when_tool_choice_none)): tool_dicts = None else: tool_dicts = [tool.model_dump() for tool in request.tools] From 426adc053f053e97f393006e60293a4339684606 Mon Sep 17 00:00:00 2001 From: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Date: Thu, 24 Jul 2025 03:13:40 -0700 Subject: [PATCH 317/552] [Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() (#21501) Signed-off-by: Rui Qiao Signed-off-by: x22x22 --- vllm/v1/engine/core.py | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index 7779b559c20..5b8b95e932e 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -1082,8 +1082,13 @@ def __init__( # RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, but vLLM workers created # thereafter would have CUDA_VISIBLE_DEVICES set, which is sticky: # https://github.com/ray-project/ray/blob/e752fc319ddedd9779a0989b6d3613909bad75c9/python/ray/_private/worker.py#L456 # noqa: E501 - # But vLLM worker assumes visibility into all local GPUs, therefore - # this results in incorrect indexing into the GPU ID list. + # This is problematic because when the vLLM worker (a Ray actor) + # executes a task, it indexes into the sticky CUDA_VISIBLE_DEVICES + # rather than directly using the GPU ID, potentially resulting in + # index out of bounds error. 
See: + # https://github.com/ray-project/ray/pull/40461/files#diff-31e8159767361e4bc259b6d9883d9c0d5e5db780fcea4a52ead4ee3ee4a59a78R1860 # noqa: E501 + # and get_accelerator_ids_for_accelerator_resource() in worker.py + # of ray. self._set_cuda_visible_devices(vllm_config, local_dp_rank) super().__init__(vllm_config, local_client, "", executor_class, From 340d23b6a8e40d038e0fd72b3121040775a12abe Mon Sep 17 00:00:00 2001 From: Chauncey Date: Thu, 24 Jul 2025 18:15:23 +0800 Subject: [PATCH 318/552] [Feat] Allow custom naming of vLLM processes (#21445) Signed-off-by: chaunceyjiang Signed-off-by: x22x22 --- requirements/common.txt | 1 + requirements/docs.txt | 1 + vllm/entrypoints/cli/serve.py | 4 ++-- vllm/entrypoints/openai/api_server.py | 7 ++++--- vllm/envs.py | 6 ++++++ vllm/utils/__init__.py | 14 ++++++++++++++ vllm/v1/engine/coordinator.py | 11 ++++++----- vllm/v1/engine/core.py | 4 +++- vllm/v1/executor/multiproc_executor.py | 9 ++++++--- vllm/v1/utils.py | 6 +++--- 10 files changed, 46 insertions(+), 17 deletions(-) diff --git a/requirements/common.txt b/requirements/common.txt index 96ab646bb50..d29b3e59d35 100644 --- a/requirements/common.txt +++ b/requirements/common.txt @@ -48,3 +48,4 @@ scipy # Required for phi-4-multimodal-instruct ninja # Required for xgrammar, rocm, tpu, xpu pybase64 # fast base64 implementation cbor2 # Required for cross-language serialization of hashable objects +setproctitle # Used to set process names for better debugging and monitoring diff --git a/requirements/docs.txt b/requirements/docs.txt index 1ddc825a9cd..950906b2ff3 100644 --- a/requirements/docs.txt +++ b/requirements/docs.txt @@ -22,6 +22,7 @@ pillow psutil pybase64 pydantic +setproctitle torch transformers zmq diff --git a/vllm/entrypoints/cli/serve.py b/vllm/entrypoints/cli/serve.py index 72460c2d91c..b144431dee9 100644 --- a/vllm/entrypoints/cli/serve.py +++ b/vllm/entrypoints/cli/serve.py @@ -21,7 +21,7 @@ from vllm.executor.multiproc_worker_utils import _add_prefix from vllm.logger import init_logger from vllm.usage.usage_lib import UsageContext -from vllm.utils import FlexibleArgumentParser, get_tcp_uri +from vllm.utils import FlexibleArgumentParser, bind_process_name, get_tcp_uri from vllm.v1.engine.core import EngineCoreProc from vllm.v1.engine.utils import CoreEngineProcManager, launch_core_engines from vllm.v1.executor.abstract import Executor @@ -77,7 +77,7 @@ def run_headless(args: argparse.Namespace): if args.api_server_count > 1: raise ValueError("api_server_count can't be set in headless mode") - + bind_process_name("APIServer_Headless") # Create the EngineConfig. 
engine_args = vllm.AsyncEngineArgs.from_cli_args(args) usage_context = UsageContext.OPENAI_API_SERVER diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 89e5e7ed8d3..ba257990d4a 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -101,8 +101,9 @@ maybe_register_config_serialize_by_value) from vllm.transformers_utils.tokenizer import MistralTokenizer from vllm.usage.usage_lib import UsageContext -from vllm.utils import (Device, FlexibleArgumentParser, get_open_zmq_ipc_path, - is_valid_ipv6_address, set_ulimit) +from vllm.utils import (Device, FlexibleArgumentParser, bind_process_name, + get_open_zmq_ipc_path, is_valid_ipv6_address, + set_ulimit) from vllm.v1.metrics.prometheus import get_prometheus_registry from vllm.version import __version__ as VLLM_VERSION @@ -1804,7 +1805,7 @@ async def run_server_worker(listen_address, ToolParserManager.import_tool_parser(args.tool_parser_plugin) server_index = client_config.get("client_index", 0) if client_config else 0 - + bind_process_name("APIServer", str(server_index)) # Load logging config for uvicorn if specified log_config = load_log_config(args.log_config_file) if log_config is not None: diff --git a/vllm/envs.py b/vllm/envs.py index 5c414e82d93..0eff741519a 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -985,6 +985,12 @@ def get_vllm_port() -> Optional[int]: # Used to force set up loopback IP "VLLM_LOOPBACK_IP": lambda: os.getenv("VLLM_LOOPBACK_IP", ""), + + # Used to set the process name prefix for vLLM processes. + # This is useful for debugging and monitoring purposes. + # The default value is "VLLM". + "VLLM_PROCESS_NAME_PREFIX": + lambda: os.getenv("VLLM_PROCESS_NAME_PREFIX", "VLLM"), } # --8<-- [end:env-vars-definition] diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index 5b9c3b6a50c..9f4140ac64e 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -58,6 +58,7 @@ import numpy.typing as npt import psutil import regex as re +import setproctitle import torch import torch.types import yaml @@ -3278,3 +3279,16 @@ def has_deep_gemm() -> bool: """Whether the optional `deep_gemm` package is available.""" return _has_module("deep_gemm") + + +def bind_process_name(name: str, suffix: str = "") -> None: + """Bind the process name to a specific name with an optional suffix. + + Args: + name: The base name to bind the process to. + suffix: An optional suffix to append to the base name. 
+ """ + name = f"{envs.VLLM_PROCESS_NAME_PREFIX}::{name}" + if suffix: + name = f"{name}_{suffix}" + setproctitle.setproctitle(name) diff --git a/vllm/v1/engine/coordinator.py b/vllm/v1/engine/coordinator.py index c0decd6ffa2..fc45eea3a73 100644 --- a/vllm/v1/engine/coordinator.py +++ b/vllm/v1/engine/coordinator.py @@ -13,7 +13,8 @@ from vllm.utils import get_mp_context, make_zmq_socket from vllm.v1.engine import EngineCoreOutputs, EngineCoreRequestType from vllm.v1.serial_utils import MsgpackDecoder -from vllm.v1.utils import get_engine_client_zmq_addr, shutdown +from vllm.v1.utils import (bind_process_name, get_engine_client_zmq_addr, + shutdown) logger = init_logger(__name__) @@ -79,7 +80,7 @@ def __init__(self, parallel_config: ParallelConfig): context = get_mp_context() self.proc: multiprocessing.Process = context.Process( - target=CoordinatorProc.run_coordinator, + target=DPCoordinatorProc.run_coordinator, name="VLLM_DP_Coordinator", kwargs={ "engine_count": parallel_config.data_parallel_size, @@ -113,12 +114,12 @@ def __init__(self): self.request_counts = [0, 0] # [waiting, running] -class CoordinatorProc: +class DPCoordinatorProc: def __init__(self, engine_count: int, min_stats_update_interval_ms: int = 100): - + bind_process_name(self.__class__.__name__) self.ctx = zmq.Context() self.engines = [EngineState() for _ in range(engine_count)] @@ -137,7 +138,7 @@ def run_coordinator( back_publish_address: str, min_stats_update_interval_ms: int = 100, ): - coordinator = CoordinatorProc( + coordinator = DPCoordinatorProc( engine_count=engine_count, min_stats_update_interval_ms=min_stats_update_interval_ms) try: diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index 5b8b95e932e..88c511606d7 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -25,7 +25,8 @@ from vllm.lora.request import LoRARequest from vllm.transformers_utils.config import ( maybe_register_config_serialize_by_value) -from vllm.utils import make_zmq_socket, resolve_obj_by_qualname +from vllm.utils import (bind_process_name, make_zmq_socket, + resolve_obj_by_qualname) from vllm.v1.core.kv_cache_utils import (get_kv_cache_config, unify_kv_cache_configs) from vllm.v1.core.sched.interface import SchedulerInterface @@ -411,6 +412,7 @@ def __init__( client_handshake_address: Optional[str] = None, engine_index: int = 0, ): + bind_process_name(self.__class__.__name__, f"{engine_index}") self.input_queue = queue.Queue[tuple[EngineCoreRequestType, Any]]() self.output_queue = queue.Queue[Union[tuple[int, EngineCoreOutputs], bytes]]() diff --git a/vllm/v1/executor/multiproc_executor.py b/vllm/v1/executor/multiproc_executor.py index 11ddade3eb7..993a90752bb 100644 --- a/vllm/v1/executor/multiproc_executor.py +++ b/vllm/v1/executor/multiproc_executor.py @@ -30,8 +30,8 @@ from vllm.executor.multiproc_worker_utils import ( _add_prefix, set_multiprocessing_worker_envs) from vllm.logger import init_logger -from vllm.utils import (get_distributed_init_method, get_loopback_ip, - get_mp_context, get_open_port) +from vllm.utils import (bind_process_name, get_distributed_init_method, + get_loopback_ip, get_mp_context, get_open_port) from vllm.v1.executor.abstract import Executor, FailureCallback from vllm.v1.outputs import ModelRunnerOutput from vllm.worker.worker_base import WorkerWrapperBase @@ -365,7 +365,10 @@ def __init__( } wrapper.init_worker(all_kwargs) self.worker = wrapper - + bind_process_name( + self.worker.worker.__class__.__name__, + f"TP{self.rank}_DP{vllm_config.parallel_config.data_parallel_rank}" + ) 
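For reviewers who want to see what this naming scheme actually produces in `ps`/`top`, the following standalone sketch mirrors the helper's formatting. The prefix, worker name, and rank suffix are illustrative values, not output captured from vLLM.

```python
# Minimal sketch of the title format built by the new helper, assuming the
# default prefix "VLLM" (overridable via VLLM_PROCESS_NAME_PREFIX).
import setproctitle


def format_title(name: str, suffix: str = "", prefix: str = "VLLM") -> str:
    title = f"{prefix}::{name}"
    if suffix:
        title = f"{title}_{suffix}"
    return title


# e.g. a worker at tensor-parallel rank 0, data-parallel rank 0
setproctitle.setproctitle(format_title("Worker", "TP0_DP0"))
print(setproctitle.getproctitle())  # -> VLLM::Worker_TP0_DP0
```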
pid = os.getpid() _add_prefix(sys.stdout, f"VllmWorker rank={rank}", pid) _add_prefix(sys.stderr, f"VllmWorker rank={rank}", pid) diff --git a/vllm/v1/utils.py b/vllm/v1/utils.py index c74d8c543f7..bb5a36f3838 100644 --- a/vllm/v1/utils.py +++ b/vllm/v1/utils.py @@ -15,8 +15,8 @@ from vllm.logger import init_logger from vllm.usage.usage_lib import (UsageContext, is_usage_stats_enabled, usage_message) -from vllm.utils import (get_open_port, get_open_zmq_ipc_path, get_tcp_uri, - kill_process_tree) +from vllm.utils import (bind_process_name, get_open_port, + get_open_zmq_ipc_path, get_tcp_uri, kill_process_tree) if TYPE_CHECKING: from vllm.v1.engine.coordinator import DPCoordinator @@ -144,7 +144,7 @@ def __init__( self.listen_address = listen_address self.sock = sock self.args = args - + bind_process_name(self.__class__.__name__) # Start API servers spawn_context = multiprocessing.get_context("spawn") self.processes: list[BaseProcess] = [] From 6d84270a2bd5f2d9b9bf891d5baee891c3cd10cf Mon Sep 17 00:00:00 2001 From: cjackal <44624812+cjackal@users.noreply.github.com> Date: Thu, 24 Jul 2025 19:20:38 +0900 Subject: [PATCH 319/552] bump `flashinfer` to `v0.2.8` (#21385) Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com> Signed-off-by: x22x22 --- docker/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index 868b8170466..3c2bdc2066e 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -390,7 +390,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist # Install FlashInfer from source ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git" -ARG FLASHINFER_GIT_REF="v0.2.8rc1" +ARG FLASHINFER_GIT_REF="v0.2.8" RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH' . /etc/environment git clone --depth 1 --recursive --shallow-submodules \ From 841628b8c1ad5f043d67053fcea303e7e892e4d5 Mon Sep 17 00:00:00 2001 From: Lucas Wilkinson Date: Thu, 24 Jul 2025 06:21:46 -0400 Subject: [PATCH 320/552] [Attention] Optimize FlashInfer MetadataBuilder Build call (#21137) Signed-off-by: Lucas Wilkinson Signed-off-by: x22x22 --- tests/v1/attention/test_attention_backends.py | 13 +- tests/v1/attention/utils.py | 2 +- vllm/v1/attention/backends/flashinfer.py | 157 +++++++++--------- 3 files changed, 94 insertions(+), 78 deletions(-) diff --git a/tests/v1/attention/test_attention_backends.py b/tests/v1/attention/test_attention_backends.py index b4e0101a0d4..9bd0b99798d 100644 --- a/tests/v1/attention/test_attention_backends.py +++ b/tests/v1/attention/test_attention_backends.py @@ -11,7 +11,8 @@ create_vllm_config, get_attention_backend) from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, cdiv -from vllm.v1.attention.backends.utils import CommonAttentionMetadata +from vllm.v1.attention.backends.utils import (CommonAttentionMetadata, + set_kv_cache_layout) from vllm.v1.kv_cache_interface import FullAttentionSpec BACKENDS_TO_TEST = [ @@ -212,7 +213,7 @@ def run_attention_backend(backend: _Backend, kv_cache_spec: FullAttentionSpec, from vllm.v1.attention.backends.flashinfer import PerLayerParameters - def mock_get_per_layer_parameters(vllm_config): + def mock_get_per_layer_parameters(vllm_config, impl_cls): # Return mock parameters for a single layer head_size = vllm_config.model_config.get_head_size() return { @@ -297,7 +298,8 @@ def test_backend_correctness(batch_spec_name: str, model: str): 5. Comparing the vLLM backend's output to the ground-truth SDPA output. 
""" batch_spec = BATCH_SPECS[batch_spec_name] - vllm_config = create_vllm_config(model_name=model) + vllm_config = create_vllm_config(model_name=model, + max_model_len=max(batch_spec.seq_lens)) device = torch.device("cuda:0") kv_cache_spec = create_standard_kv_cache_spec(vllm_config) @@ -419,6 +421,11 @@ def test_backend_correctness(batch_spec_name: str, model: str): if backend_name == _Backend.FLASHINFER_VLLM_V1: kv_cache_for_backend = kv_cache.transpose(0, 1) + # For FlashInfer default to HND layout and + kv_cache_for_backend = kv_cache_for_backend.transpose( + 2, 3).contiguous().transpose(2, 3) + set_kv_cache_layout("HND") + backend_output = run_attention_backend(backend_name, kv_cache_spec, vllm_config, device, common_attn_metadata, diff --git a/tests/v1/attention/utils.py b/tests/v1/attention/utils.py index 30cfbdda5d8..69bd4a2060a 100644 --- a/tests/v1/attention/utils.py +++ b/tests/v1/attention/utils.py @@ -66,7 +66,7 @@ def create_common_attn_metadata( num_computed_tokens_cpu = torch.tensor(context_lens, dtype=torch.int32) # Create block table (random for testing) - max_blocks = max(batch_spec.seq_lens) // block_size + 1 + max_blocks = (max(batch_spec.seq_lens) + block_size - 1) // block_size block_table_tensor = torch.randint(0, max_block_idx, (batch_spec.batch_size, max_blocks), diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index 953ef26c814..94d80d441d8 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -18,6 +18,7 @@ from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.platforms import current_platform +from vllm.utils import cdiv from vllm.v1.attention.backends.flash_attn import use_cascade_attention from vllm.v1.attention.backends.utils import ( AttentionMetadataBuilder, CommonAttentionMetadata, PerLayerParameters, @@ -158,7 +159,7 @@ class FlashInferMetadata: # (batch_size + 1,). The cumulative subquery lengths of the sequences in # the batch, used to index into subquery. E.g., if the subquery length # is [4, 6], it is [0, 4, 10]. - qo_indptr: torch.Tensor + qo_indptr_cpu: torch.Tensor # An example for paged_kv_indices, paged_kv_indptr: # request 1, page indices [0, 5, 8] # request 2, page indices [1, 6, 7] @@ -167,13 +168,13 @@ class FlashInferMetadata: # [0, 5, 8, 1, 6, 7, 3, 4] # paged_kv_indptr is used to index into paged_kv_indices: # [0, 3, 6, 8] - # The indptr of the paged kv cache, shape: [batch_size + 1] - paged_kv_indptr: torch.Tensor - # The page indices of the paged kv cache + # The indptr of the paged kv cache, shape: [batch_size + 1] (CPU for plan) + paged_kv_indptr_cpu: torch.Tensor + # The page indices of the paged kv cache (on device for plan) paged_kv_indices: torch.Tensor # The number of entries in the last page of each request in - # the paged kv cache, shape: [batch_size] - paged_kv_last_page_len: torch.Tensor + # the paged kv cache, shape: [batch_size] (CPU for plan) + paged_kv_last_page_len_cpu: torch.Tensor # The number of query/output heads num_qo_heads: int # The number of key/value heads @@ -201,22 +202,17 @@ class FlashInferMetadata: num_prefills: int num_prefill_tokens: int - # For cascade attention. + # For cascade attention (CPU for planning). 
use_cascade: bool - shared_qo_indptr: Optional[torch.Tensor] = None - shared_kv_page_indptr: Optional[torch.Tensor] = None - shared_kv_page_indices: Optional[torch.Tensor] = None - shared_kv_last_page_len: Optional[torch.Tensor] = None + shared_qo_indptr_cpu: Optional[torch.Tensor] = None + shared_kv_page_indptr_cpu: Optional[torch.Tensor] = None + shared_kv_page_indices_cpu: Optional[torch.Tensor] = None + shared_kv_last_page_len_cpu: Optional[torch.Tensor] = None prefill_wrapper: Optional[BatchPrefillWithPagedKVCacheWrapper] = None decode_wrapper: Optional[BatchDecodeWithPagedKVCacheWrapper] = None cascade_wrapper: Optional[MultiLevelCascadeAttentionWrapper] = None - @property - def query_start_loc(self): - # The GPUModelRunner expects to be able to access this property. - return self.qo_indptr - def __post_init__(self): if self.head_dim is not None: FlashInferBackend.validate_head_size(self.head_dim) @@ -238,6 +234,12 @@ def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, self.vllm_config = vllm_config self.cache_config = vllm_config.cache_config self.kv_cache_spec = kv_cache_spec + max_num_blocks_per_request = cdiv( + vllm_config.model_config.max_model_len, + self.kv_cache_spec.block_size) + self.block_table_arange = torch.arange(max_num_blocks_per_request, + dtype=torch.int32, + device=self.device) def reorder_batch(self, input_batch: InputBatch, scheduler_output: SchedulerOutput) -> bool: @@ -285,21 +287,25 @@ def _plan(self, num_prefills: int, num_decodes: int, if self.global_hyperparameters is None: self.global_hyperparameters = infer_global_hyperparameters( get_per_layer_parameters(self.vllm_config, FlashInferImpl)) + if attn_metadata.use_cascade: attn_metadata.cascade_wrapper = self._get_cascade_wrapper() attn_metadata.cascade_wrapper.plan( - [attn_metadata.shared_qo_indptr, attn_metadata.qo_indptr], [ - attn_metadata.shared_kv_page_indptr, - attn_metadata.paged_kv_indptr + attn_metadata.shared_qo_indptr_cpu, + attn_metadata.qo_indptr_cpu + ], + [ + attn_metadata.shared_kv_page_indptr_cpu, + attn_metadata.paged_kv_indptr_cpu ], [ - attn_metadata.shared_kv_page_indices, + attn_metadata.shared_kv_page_indices_cpu, attn_metadata.paged_kv_indices ], [ - attn_metadata.shared_kv_last_page_len, - attn_metadata.paged_kv_last_page_len + attn_metadata.shared_kv_last_page_len_cpu, + attn_metadata.paged_kv_last_page_len_cpu ], attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, @@ -320,22 +326,22 @@ def _plan(self, num_prefills: int, num_decodes: int, # Decodes are first so prefills start after the last decode prefill_start = num_decodes attn_metadata.prefill_wrapper = self._get_prefill_wrapper() - assert attn_metadata.qo_indptr[prefill_start:].shape[ + assert attn_metadata.qo_indptr_cpu[prefill_start:].shape[ 0] == num_prefills + 1 - assert attn_metadata.paged_kv_indptr[prefill_start:].shape[ + assert attn_metadata.paged_kv_indptr_cpu[prefill_start:].shape[ 0] == num_prefills + 1 - assert attn_metadata.paged_kv_last_page_len[ + assert attn_metadata.paged_kv_last_page_len_cpu[ prefill_start:].shape[0] == num_prefills # Since prefill_wrapper.run() will be called with # query[num_decode_tokens:] we need to adjust the qo_indptr # to be relative to the start of the prefill queries. 
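As a concrete illustration of the rebasing described in the comment above, here is a tiny numeric sketch with assumed values (two decode requests of one token each, followed by prefills of 4 and 6 tokens); it is not taken from real metadata.

```python
import torch

# query_start_loc for [decode, decode, prefill(4), prefill(6)] requests
qo_indptr_cpu = torch.tensor([0, 1, 2, 6, 12], dtype=torch.int32)
prefill_start = 2  # decodes are ordered first, so prefills begin at index 2

# Slice from the first prefill and subtract its offset so the indptr becomes
# relative to query[num_decode_tokens:], which is what the prefill path uses.
rebased = qo_indptr_cpu[prefill_start:] - qo_indptr_cpu[prefill_start]
assert rebased.tolist() == [0, 4, 10]
```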
- qo_indptr = attn_metadata.qo_indptr[ - prefill_start:] - attn_metadata.qo_indptr[prefill_start] + qo_indptr_cpu = attn_metadata.qo_indptr_cpu[ + prefill_start:] - attn_metadata.qo_indptr_cpu[prefill_start] attn_metadata.prefill_wrapper.plan( - qo_indptr, - attn_metadata.paged_kv_indptr[prefill_start:], + qo_indptr_cpu, + attn_metadata.paged_kv_indptr_cpu[prefill_start:], attn_metadata.paged_kv_indices, - attn_metadata.paged_kv_last_page_len[prefill_start:], + attn_metadata.paged_kv_last_page_len_cpu[prefill_start:], attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, attn_metadata.head_dim, @@ -357,9 +363,9 @@ def _plan(self, num_prefills: int, num_decodes: int, attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, attn_metadata.head_dim): attn_metadata.decode_wrapper.plan( - attn_metadata.paged_kv_indptr[:num_decodes + 1], + attn_metadata.paged_kv_indptr_cpu[:num_decodes + 1], attn_metadata.paged_kv_indices, - attn_metadata.paged_kv_last_page_len[:num_decodes], + attn_metadata.paged_kv_last_page_len_cpu[:num_decodes], attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, attn_metadata.head_dim, @@ -383,55 +389,58 @@ def build(self, split_decodes_and_prefills(common_attn_metadata) page_size = self.kv_cache_spec.block_size - device = self.device - qo_indptr = common_attn_metadata.query_start_loc max_seq_len = common_attn_metadata.seq_lens_cpu.max() seq_lens = common_attn_metadata.seq_lens + seq_lens_cpu = common_attn_metadata.seq_lens_cpu block_table_tensor = common_attn_metadata.block_table_tensor - block_table_bounds = (seq_lens + page_size - 1) // page_size + block_table_bounds_cpu = (seq_lens_cpu + page_size - 1) // page_size use_cascade = common_prefix_len > 0 if use_cascade: # Grab the blocks of the shared prefix from the first request. assert common_prefix_len % page_size == 0 num_common_kv_blocks = common_prefix_len // page_size - shared_qo_indptr = torch.tensor([0, num_actual_tokens], - dtype=torch.int32, - device=device) - shared_kv_page_indptr = torch.tensor([0, num_common_kv_blocks], - dtype=torch.int32, - device=device) - shared_kv_page_indices = block_table_tensor[ + + # Create CPU versions directly for cascade (no GPU versions needed) + shared_qo_indptr_cpu = torch.tensor([0, num_actual_tokens], + dtype=torch.int32, + device='cpu') + shared_kv_page_indptr_cpu = torch.tensor([0, num_common_kv_blocks], + dtype=torch.int32, + device='cpu') + shared_kv_page_indices_cpu = block_table_tensor[ 0, :num_common_kv_blocks] - shared_kv_last_page_len = torch.tensor([page_size], - dtype=torch.int32, - device=device) + shared_kv_last_page_len_cpu = torch.tensor([page_size], + dtype=torch.int32, + device='cpu') + # Remove the blocks of the shared prefix from all requests. 
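The CPU-side page bookkeeping that `build()` performs in this file is easier to follow with small numbers. The page size and sequence lengths below are assumed purely for illustration.

```python
import torch

page_size = 16
seq_lens_cpu = torch.tensor([1, 16, 17, 40], dtype=torch.int32)

# Ceiling division gives the number of KV-cache pages each request occupies.
num_pages = (seq_lens_cpu + page_size - 1) // page_size

# The last page is partially filled unless the sequence length is an exact
# multiple of the page size, in which case it holds a full page.
last_page_len = seq_lens_cpu % page_size
last_page_len = torch.where(last_page_len == 0, page_size, last_page_len)

assert num_pages.tolist() == [1, 1, 2, 3]
assert last_page_len.tolist() == [1, 16, 1, 8]
```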
block_table_tensor = block_table_tensor[:, num_common_kv_blocks:] - block_table_bounds -= num_common_kv_blocks + block_table_bounds_cpu -= num_common_kv_blocks else: - shared_qo_indptr = None - shared_kv_page_indptr = None - shared_kv_page_indices = None - shared_kv_last_page_len = None - - mask = (torch.arange(block_table_tensor.size(1), - dtype=block_table_tensor.dtype, - device=block_table_tensor.device).unsqueeze(0) + shared_qo_indptr_cpu = None + shared_kv_page_indptr_cpu = None + shared_kv_page_indices_cpu = None + shared_kv_last_page_len_cpu = None + + max_num_blocks = block_table_bounds_cpu.max() + block_table_bounds = block_table_bounds_cpu.to(self.device, + non_blocking=True) + mask = (self.block_table_arange[:max_num_blocks].unsqueeze(0) < block_table_bounds.unsqueeze(1)) - paged_kv_indices = block_table_tensor[mask] - - paged_kv_indptr = torch.cat([ - torch.zeros(1, - dtype=block_table_bounds.dtype, - device=block_table_bounds.device), - block_table_bounds.cumsum(dim=0, dtype=torch.int32) - ]) - - paged_kv_last_page_len = seq_lens % page_size - paged_kv_last_page_len = torch.where(paged_kv_last_page_len == 0, - page_size, paged_kv_last_page_len) + paged_kv_indices = block_table_tensor[:, :max_num_blocks][mask] + + paged_kv_indptr_cpu = torch.zeros(len(block_table_bounds_cpu) + 1, + dtype=torch.int32, + device='cpu') + paged_kv_indptr_cpu[1:] = block_table_bounds_cpu.cumsum( + dim=0, dtype=torch.int32) + + paged_kv_last_page_len_cpu = seq_lens_cpu % page_size + paged_kv_last_page_len_cpu = torch.where( + paged_kv_last_page_len_cpu == 0, page_size, + paged_kv_last_page_len_cpu) cache_dtype = self.cache_config.cache_dtype if cache_dtype.startswith("fp8"): kv_cache_dtype = FlashInferBackend.get_fp8_dtype_for_flashinfer( @@ -440,10 +449,10 @@ def build(self, kv_cache_dtype = self.kv_cache_spec.dtype attn_metadata = FlashInferMetadata( num_actual_tokens=num_actual_tokens, - qo_indptr=qo_indptr, - paged_kv_indptr=paged_kv_indptr, + qo_indptr_cpu=common_attn_metadata.query_start_loc_cpu, + paged_kv_indptr_cpu=paged_kv_indptr_cpu, paged_kv_indices=paged_kv_indices, - paged_kv_last_page_len=paged_kv_last_page_len, + paged_kv_last_page_len_cpu=paged_kv_last_page_len_cpu, num_qo_heads=self.vllm_config.model_config.get_num_attention_heads( self.vllm_config.parallel_config), num_kv_heads=self.kv_cache_spec.num_kv_heads, @@ -457,14 +466,14 @@ def build(self, num_prefills=num_prefills, num_prefill_tokens=num_prefill_tokens, use_cascade=use_cascade, - shared_qo_indptr=shared_qo_indptr, - shared_kv_page_indptr=shared_kv_page_indptr, - shared_kv_page_indices=shared_kv_page_indices, - shared_kv_last_page_len=shared_kv_last_page_len, + shared_qo_indptr_cpu=shared_qo_indptr_cpu, + shared_kv_page_indptr_cpu=shared_kv_page_indptr_cpu, + shared_kv_page_indices_cpu=shared_kv_page_indices_cpu, + shared_kv_last_page_len_cpu=shared_kv_last_page_len_cpu, max_seq_len=max_seq_len, seq_lens=seq_lens, block_table_tensor=block_table_tensor, - workspace_buffer=self._workspace_buffer, + workspace_buffer=self._get_workspace_buffer(), ) self._plan(num_prefills, num_decodes, attn_metadata) From b355317381b78776c41bd4bb91ef75bcf7811860 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Thu, 24 Jul 2025 11:22:12 +0100 Subject: [PATCH 321/552] [Model] Officially support Emu3 with Transformers backend (#21319) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/models/supported_models.md | 6 ++++++ 
tests/models/multimodal/test_mapping.py | 17 ++++++++++------- tests/models/registry.py | 5 +++-- vllm/model_executor/model_loader/utils.py | 6 +++--- vllm/model_executor/models/registry.py | 9 +++++++-- 5 files changed, 29 insertions(+), 14 deletions(-) diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 4553c46afb0..4dd4f8f4c22 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -626,6 +626,12 @@ Specified using `--task generate`. | `TarsierForConditionalGeneration` | Tarsier | T + IE+ | `omni-search/Tarsier-7b`, `omni-search/Tarsier-34b` | | ✅︎ | ✅︎ | | `Tarsier2ForConditionalGeneration`^ | Tarsier2 | T + IE+ + VE+ | `omni-research/Tarsier2-Recap-7b`, `omni-research/Tarsier2-7b-0115` | | ✅︎ | ✅︎ | +Some models are supported only via the [Transformers backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it! + +| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | +|--------------|--------|--------|-------------------|-----------------------------|-----------------------------------------|---------------------| +| `Emu3ForConditionalGeneration` | Emu3 | T + I | `BAAI/Emu3-Chat-hf` | ✅︎ | ✅︎ | ✅︎ | + ^ You need to set the architecture name via `--hf-overrides` to match the one in vLLM.     • For example, to use DeepSeek-VL2 series models:       `--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'` diff --git a/tests/models/multimodal/test_mapping.py b/tests/models/multimodal/test_mapping.py index 5f20452aff3..f323dfd04cb 100644 --- a/tests/models/multimodal/test_mapping.py +++ b/tests/models/multimodal/test_mapping.py @@ -23,18 +23,14 @@ def create_repo_dummy_weights(repo: str) -> Iterable[tuple[str, torch.Tensor]]: return ((name, torch.empty(0)) for name in weight_names) -def create_model_dummy_weights( - repo: str, - model_arch: str, -) -> Iterable[tuple[str, torch.Tensor]]: +def create_dummy_model(repo: str, model_arch: str) -> PreTrainedModel: """ Create weights from a dummy meta deserialized hf model with name conversion """ model_cls: PreTrainedModel = getattr(transformers, model_arch) config = AutoConfig.from_pretrained(repo) with torch.device("meta"): - model: PreTrainedModel = model_cls._from_config(config) - return model.named_parameters() + return model_cls._from_config(config) def model_architectures_for_test() -> list[str]: @@ -70,14 +66,21 @@ def test_hf_model_weights_mapper(model_arch: str): model_cls = MULTIMODAL_REGISTRY._get_model_cls(model_config) original_weights = create_repo_dummy_weights(model_id) - hf_converted_weights = create_model_dummy_weights(model_id, model_arch) + hf_dummy_model = create_dummy_model(model_id, model_arch) + hf_converted_weights = hf_dummy_model.named_parameters() + hf_converted_buffers = hf_dummy_model.named_buffers() mapper: WeightsMapper = model_cls.hf_to_vllm_mapper mapped_original_weights = mapper.apply(original_weights) mapped_hf_converted_weights = mapper.apply(hf_converted_weights) + mapped_hf_converted_buffers = mapper.apply(hf_converted_buffers) ref_weight_names = set(map(lambda x: x[0], 
mapped_original_weights)) weight_names = set(map(lambda x: x[0], mapped_hf_converted_weights)) + buffer_names = set(map(lambda x: x[0], mapped_hf_converted_buffers)) + + # Some checkpoints may have buffers, we ignore them for this test + ref_weight_names -= buffer_names weights_missing = ref_weight_names - weight_names weights_unmapped = weight_names - ref_weight_names diff --git a/tests/models/registry.py b/tests/models/registry.py index 84ca0bc6000..3b92462e58a 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -357,6 +357,7 @@ def check_available_online( max_transformers_version="4.48", # noqa: E501 transformers_version_reason="HF model is not compatible.", # noqa: E501 hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]}), # noqa: E501 + "Emu3ForConditionalGeneration": _HfExamplesInfo("BAAI/Emu3-Chat-hf"), "FuyuForCausalLM": _HfExamplesInfo("adept/fuyu-8b"), "Gemma3ForConditionalGeneration": _HfExamplesInfo("google/gemma-3-4b-it"), "GraniteSpeechForConditionalGeneration": _HfExamplesInfo("ibm-granite/granite-speech-3.3-2b"), # noqa: E501 @@ -501,7 +502,7 @@ def check_available_online( speculative_model="XiaomiMiMo/MiMo-7B-RL") } -_TRANSFORMERS_MODELS = { +_TRANSFORMERS_BACKEND_MODELS = { "TransformersForCausalLM": _HfExamplesInfo("hmellor/Ilama-3.2-1B", trust_remote_code=True), # noqa: E501 "TransformersForMultimodalLM": _HfExamplesInfo("OpenGVLab/InternVL3-1B-hf"), } @@ -512,7 +513,7 @@ def check_available_online( **_SEQUENCE_CLASSIFICATION_EXAMPLE_MODELS, **_MULTIMODAL_EXAMPLE_MODELS, **_SPECULATIVE_DECODING_EXAMPLE_MODELS, - **_TRANSFORMERS_MODELS, + **_TRANSFORMERS_BACKEND_MODELS, } diff --git a/vllm/model_executor/model_loader/utils.py b/vllm/model_executor/model_loader/utils.py index 4b30336f013..a0cd94c969a 100644 --- a/vllm/model_executor/model_loader/utils.py +++ b/vllm/model_executor/model_loader/utils.py @@ -26,7 +26,7 @@ as_seq_cls_model) from vllm.model_executor.models.interfaces import SupportsQuant from vllm.model_executor.models.registry import (_PREVIOUSLY_SUPPORTED_MODELS, - _TRANSFORMERS_MODELS) + _TRANSFORMERS_BACKEND_MODELS) from vllm.utils import is_pin_memory_available logger = init_logger(__name__) @@ -178,7 +178,7 @@ def resolve_transformers_arch(model_config: ModelConfig, "happen.") for i, arch in enumerate(architectures): - if arch in _TRANSFORMERS_MODELS: + if arch in _TRANSFORMERS_BACKEND_MODELS: continue if model_config.model_impl == ModelImpl.AUTO: @@ -241,7 +241,7 @@ def get_model_architecture( vllm_supported_archs = ModelRegistry.get_supported_archs() is_supported = lambda arch: (arch in vllm_supported_archs and arch not in - _TRANSFORMERS_MODELS) + _TRANSFORMERS_BACKEND_MODELS) vllm_not_supported = not any(is_supported(arch) for arch in architectures) if vllm_not_supported: diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 2aaac7798fc..7470b31e125 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -254,7 +254,11 @@ # "MLPSpeculatorPreTrainedModel": ("mlp_speculator", "MLPSpeculator"), } -_TRANSFORMERS_MODELS = { +_TRANSFORMERS_SUPPORTED_MODELS = { + "Emu3ForConditionalGeneration": ("transformers", "TransformersForMultimodalLM"), # noqa: E501 +} + +_TRANSFORMERS_BACKEND_MODELS = { "TransformersForMultimodalLM": ("transformers", "TransformersForMultimodalLM"), # noqa: E501 "TransformersForCausalLM": ("transformers", "TransformersForCausalLM"), } @@ -266,7 +270,8 @@ **_CROSS_ENCODER_MODELS, **_MULTIMODAL_MODELS, 
**_SPECULATIVE_DECODING_MODELS, - **_TRANSFORMERS_MODELS, + **_TRANSFORMERS_SUPPORTED_MODELS, + **_TRANSFORMERS_BACKEND_MODELS, } # This variable is used as the args for subprocess.run(). We From 8d0cf97375c2463165a90ee420a5cb8b95d44495 Mon Sep 17 00:00:00 2001 From: Ming Yang Date: Thu, 24 Jul 2025 03:23:59 -0700 Subject: [PATCH 322/552] [Bugfix] Fix CUDA arch flags for MoE permute (#21426) Signed-off-by: Ming Yang Signed-off-by: x22x22 --- CMakeLists.txt | 6 +- tests/kernels/test_shuffle_rows.py | 294 +++++++++++++++++++++++++++++ 2 files changed, 297 insertions(+), 3 deletions(-) create mode 100644 tests/kernels/test_shuffle_rows.py diff --git a/CMakeLists.txt b/CMakeLists.txt index 98ed682fee7..529ce29029b 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -635,7 +635,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") "in CUDA target architectures.") endif() endif() - + cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}") if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS) set(SRCS "csrc/quantization/cutlass_w8a8/moe/blockwise_scaled_group_mm_sm100.cu") @@ -842,8 +842,8 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") "csrc/moe/moe_permute_unpermute_op.cu") set_gencode_flags_for_srcs( - SRCS "${MARLIN_PERMUTE_SRC}" - CUDA_ARCHS "${MOE_PERMUTE_ARCHS}") + SRCS "${MOE_PERMUTE_SRC}" + CUDA_ARCHS "${CUDA_ARCHS}") list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}") endif() diff --git a/tests/kernels/test_shuffle_rows.py b/tests/kernels/test_shuffle_rows.py new file mode 100644 index 00000000000..7d02e1764e7 --- /dev/null +++ b/tests/kernels/test_shuffle_rows.py @@ -0,0 +1,294 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Tests for the shuffle_rows function + +Run `pytest tests/kernels/test_shuffle_rows.py`. 
+""" + +import pytest +import torch + +from vllm._custom_ops import shuffle_rows +from vllm.platforms import current_platform + + +@pytest.mark.parametrize("num_tokens", [1, 16, 64, 128, 256, 512, 1024]) +@pytest.mark.parametrize("hidden_size", [128, 256, 512, 1024, 2048, 4096]) +@pytest.mark.parametrize("dtype", + [torch.float16, torch.bfloat16, torch.float32]) +def test_shuffle_rows_basic(num_tokens: int, hidden_size: int, + dtype: torch.dtype): + """Test basic functionality of shuffle_rows with various tensor sizes and + dtypes.""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + # Create input tensor + input_tensor = torch.randn(num_tokens, + hidden_size, + device="cuda", + dtype=dtype) + + # Create a simple permutation map (identity mapping) + dst2src_map = torch.arange(num_tokens, device="cuda", dtype=torch.int32) + + # Test shuffle_rows + output = shuffle_rows(input_tensor, dst2src_map) + + # With identity mapping, output should be identical to input + torch.testing.assert_close(output, input_tensor, atol=0, rtol=0) + + # Check output shape + assert output.shape == (num_tokens, hidden_size) + assert output.dtype == dtype + assert output.device == input_tensor.device + + +@pytest.mark.parametrize("num_tokens", [16, 64, 128]) +@pytest.mark.parametrize("hidden_size", [128, 512, 1024]) +@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16]) +def test_shuffle_rows_permutation(num_tokens: int, hidden_size: int, + dtype: torch.dtype): + """Test shuffle_rows with actual permutation.""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + # Create input tensor + input_tensor = torch.randn(num_tokens, + hidden_size, + device="cuda", + dtype=dtype) + + # Create a reverse permutation map + dst2src_map = torch.arange(num_tokens - 1, + -1, + -1, + device="cuda", + dtype=torch.int32) + + # Test shuffle_rows + output = shuffle_rows(input_tensor, dst2src_map) + + # Check that the output is the reverse of the input + expected_output = torch.flip(input_tensor, dims=[0]) + torch.testing.assert_close(output, expected_output, atol=1e-6, rtol=1e-5) + + # Check output shape and properties + assert output.shape == (num_tokens, hidden_size) + assert output.dtype == dtype + assert output.device == input_tensor.device + + +@pytest.mark.parametrize("num_tokens", [32, 64]) +@pytest.mark.parametrize("hidden_size", [256, 512]) +def test_shuffle_rows_expansion(num_tokens: int, hidden_size: int): + """Test shuffle_rows with expansion (more output tokens than input + tokens).""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + dtype = torch.float16 + + # Create input tensor + input_tensor = torch.randn(num_tokens, + hidden_size, + device="cuda", + dtype=dtype) + + # Create a mapping that duplicates some tokens (expansion) + expanded_size = num_tokens * 2 + dst2src_map = torch.randint(0, + num_tokens, (expanded_size, ), + device="cuda", + dtype=torch.int32) + + # Test shuffle_rows + output = shuffle_rows(input_tensor, dst2src_map) + + # Check output shape + assert output.shape == (expanded_size, hidden_size) + assert output.dtype == dtype + assert output.device == input_tensor.device + + # Verify that each output row matches the corresponding input row + for i in range(expanded_size): + src_idx = dst2src_map[i].item() + torch.testing.assert_close(output[i], + input_tensor[src_idx], + atol=1e-6, + rtol=1e-5) + + +@pytest.mark.parametrize("num_tokens", [16, 64]) +@pytest.mark.parametrize("hidden_size", 
[128, 512]) +def test_shuffle_rows_random_permutation(num_tokens: int, hidden_size: int): + """Test shuffle_rows with random permutation.""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + dtype = torch.float16 + + # Set seed for reproducibility + torch.manual_seed(42) + + # Create input tensor + input_tensor = torch.randn(num_tokens, + hidden_size, + device="cuda", + dtype=dtype) + + # Create a random permutation map + dst2src_map = torch.randperm(num_tokens, device="cuda", dtype=torch.int32) + + # Test shuffle_rows + output = shuffle_rows(input_tensor, dst2src_map) + + # Check output shape and properties + assert output.shape == (num_tokens, hidden_size) + assert output.dtype == dtype + assert output.device == input_tensor.device + + # Verify that each output row matches the corresponding input row + for i in range(num_tokens): + src_idx = dst2src_map[i].item() + torch.testing.assert_close(output[i], + input_tensor[src_idx], + atol=1e-6, + rtol=1e-5) + + +def test_shuffle_rows_edge_cases(): + """Test shuffle_rows with edge cases.""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + dtype = torch.float16 + + # Test with single token + input_tensor = torch.randn(1, 128, device="cuda", dtype=dtype) + dst2src_map = torch.tensor([0], device="cuda", dtype=torch.int32) + output = shuffle_rows(input_tensor, dst2src_map) + torch.testing.assert_close(output, input_tensor, atol=0, rtol=0) + + # Test with single feature dimension + input_tensor = torch.randn(16, 1, device="cuda", dtype=dtype) + dst2src_map = torch.arange(16, device="cuda", dtype=torch.int32) + output = shuffle_rows(input_tensor, dst2src_map) + torch.testing.assert_close(output, input_tensor, atol=0, rtol=0) + + +def test_shuffle_rows_moe_like_scenario(): + """Test shuffle_rows in a scenario similar to MoE usage.""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + dtype = torch.float16 + batch_size = 32 + hidden_size = 1024 + topk = 2 + + # Simulate input tokens + input_tensor = torch.randn(batch_size, + hidden_size, + device="cuda", + dtype=dtype) + + # Simulate expert assignment (each token goes to topk experts) + # This creates a mapping where tokens are duplicated for multiple experts + total_tokens = batch_size * topk + dst2src_map = torch.zeros(total_tokens, device="cuda", dtype=torch.int32) + + # Fill the mapping to simulate MoE token distribution + for i in range(batch_size): + for k in range(topk): + dst2src_map[i * topk + k] = i + + # Test shuffle_rows + output = shuffle_rows(input_tensor, dst2src_map) + + # Check output shape + assert output.shape == (total_tokens, hidden_size) + assert output.dtype == dtype + assert output.device == input_tensor.device + + # Verify that tokens are correctly duplicated + for i in range(batch_size): + for k in range(topk): + output_idx = i * topk + k + torch.testing.assert_close(output[output_idx], + input_tensor[i], + atol=1e-6, + rtol=1e-5) + + +@pytest.mark.parametrize("dtype", + [torch.float16, torch.bfloat16, torch.float32]) +def test_shuffle_rows_dtype_consistency(dtype: torch.dtype): + """Test that shuffle_rows preserves dtype correctly.""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + num_tokens = 64 + hidden_size = 512 + + # Create input tensor with specific dtype + input_tensor = torch.randn(num_tokens, + hidden_size, + device="cuda", + dtype=dtype) + dst2src_map = torch.arange(num_tokens, device="cuda", dtype=torch.int32) + + # Test 
shuffle_rows + output = shuffle_rows(input_tensor, dst2src_map) + + # Verify dtype is preserved + assert output.dtype == dtype + assert output.device == input_tensor.device + torch.testing.assert_close(output, input_tensor, atol=1e-6, rtol=1e-5) + + +def test_shuffle_rows_device_consistency(): + """Test that shuffle_rows maintains device consistency.""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + num_tokens = 32 + hidden_size = 256 + dtype = torch.float16 + + # Create input tensor on CUDA + input_tensor = torch.randn(num_tokens, + hidden_size, + device="cuda", + dtype=dtype) + dst2src_map = torch.arange(num_tokens, device="cuda", dtype=torch.int32) + + # Test shuffle_rows + output = shuffle_rows(input_tensor, dst2src_map) + + # Verify device is maintained + assert output.device == input_tensor.device + assert output.device.type == "cuda" + + +def test_shuffle_rows_contiguous_output(): + """Test that shuffle_rows produces contiguous output.""" + if not current_platform.is_cuda(): + pytest.skip("shuffle_rows requires CUDA") + + num_tokens = 64 + hidden_size = 512 + dtype = torch.float16 + + # Create input tensor + input_tensor = torch.randn(num_tokens, + hidden_size, + device="cuda", + dtype=dtype) + dst2src_map = torch.arange(num_tokens, device="cuda", dtype=torch.int32) + + # Test shuffle_rows + output = shuffle_rows(input_tensor, dst2src_map) + + # Verify output is contiguous + assert output.is_contiguous() From 54870f6760a8ca0c518aa1829f89d39aa6404599 Mon Sep 17 00:00:00 2001 From: elvischenv <219235043+elvischenv@users.noreply.github.com> Date: Thu, 24 Jul 2025 18:25:41 +0800 Subject: [PATCH 323/552] [Fix] Update mamba_ssm to 2.2.5 (#21421) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: x22x22 --- docker/Dockerfile | 8 -------- docs/contributing/ci/update_pytorch_version.md | 2 +- requirements/test.in | 2 +- requirements/test.txt | 6 ++++-- 4 files changed, 6 insertions(+), 12 deletions(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index 3c2bdc2066e..11991829968 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -276,10 +276,6 @@ ARG PYTORCH_CUDA_INDEX_BASE_URL ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" -# Workaround for #17068 -RUN --mount=type=cache,target=/root/.cache/uv \ - uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4" - COPY requirements/lint.txt requirements/lint.txt COPY requirements/test.txt requirements/test.txt COPY requirements/dev.txt requirements/dev.txt @@ -452,10 +448,6 @@ ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" -# Workaround for #17068 -RUN --mount=type=cache,target=/root/.cache/uv \ - uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4" - # install development dependencies (for testing) RUN --mount=type=cache,target=/root/.cache/uv \ CUDA_MAJOR="${CUDA_VERSION%%.*}"; \ diff --git a/docs/contributing/ci/update_pytorch_version.md b/docs/contributing/ci/update_pytorch_version.md index 1fe18d5d885..5046db11a47 100644 --- a/docs/contributing/ci/update_pytorch_version.md +++ b/docs/contributing/ci/update_pytorch_version.md @@ -134,7 +134,7 @@ MAX_JOBS=16 uv pip install --system \ ```bash uv pip install --system \ - --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4" + --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.5" ``` 
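Stepping back to the `test_shuffle_rows.py` suite introduced above: the contract those tests assert can be summarized in a few lines of plain PyTorch. This is only a CPU reference for the semantics, not the CUDA kernel, and the shapes and index map are arbitrary.

```python
import torch

# shuffle_rows(input, dst2src_map) gathers rows so that
# output[i] == input[dst2src_map[i]]; the map may repeat source rows, which
# is how MoE-style expansion to batch_size * topk rows is produced.
input_tensor = torch.randn(4, 8)
dst2src_map = torch.tensor([3, 2, 1, 0, 0, 3], dtype=torch.int32)

reference = input_tensor[dst2src_map.long()]

assert reference.shape == (6, 8)
assert torch.equal(reference[0], input_tensor[3])
assert torch.equal(reference[4], input_tensor[0])
```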
### causal-conv1d diff --git a/requirements/test.in b/requirements/test.in index 429d1a50422..c794d1b3cb8 100644 --- a/requirements/test.in +++ b/requirements/test.in @@ -26,7 +26,7 @@ torch==2.7.1 torchaudio==2.7.1 torchvision==0.22.1 transformers_stream_generator # required for qwen-vl test -mamba_ssm # required for plamo2 test +mamba_ssm==2.2.5 # required for plamo2 test matplotlib # required for qwen-vl test mistral_common[image,audio] >= 1.8.2 # required for voxtral test num2words # required for smolvlm test diff --git a/requirements/test.txt b/requirements/test.txt index 8e5af8d74ba..c4e3c33f373 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -421,7 +421,7 @@ lxml==5.3.0 # sacrebleu mako==1.3.10 # via alembic -mamba-ssm==2.2.4 +mamba-ssm==2.2.5 # via -r requirements/test.in markdown==3.8.2 # via mlflow @@ -1152,7 +1152,9 @@ transformers==4.53.2 transformers-stream-generator==0.0.5 # via -r requirements/test.in triton==3.3.1 - # via torch + # via + # mamba-ssm + # torch tritonclient==2.51.0 # via # -r requirements/test.in From 35f60138cff3b4491eb28206dc4f2ec54bf033bb Mon Sep 17 00:00:00 2001 From: Sanger Steel Date: Thu, 24 Jul 2025 09:56:18 -0400 Subject: [PATCH 324/552] [Docs] Update Tensorizer usage documentation (#21190) Signed-off-by: Sanger Steel Signed-off-by: William Goldby Co-authored-by: William Goldby Signed-off-by: x22x22 --- docs/models/extensions/tensorizer.md | 99 +++++++++++++++++++++++-- examples/others/tensorize_vllm_model.py | 29 ++++---- 2 files changed, 110 insertions(+), 18 deletions(-) diff --git a/docs/models/extensions/tensorizer.md b/docs/models/extensions/tensorizer.md index 6ea61b080cd..f70ab0c6f4e 100644 --- a/docs/models/extensions/tensorizer.md +++ b/docs/models/extensions/tensorizer.md @@ -5,9 +5,98 @@ vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or at runtime extremely quickly directly to the GPU, resulting in significantly shorter Pod startup times and CPU memory usage. Tensor encryption is also supported. -For more information on CoreWeave's Tensorizer, please refer to -[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see -the [vLLM example script](../../examples/others/tensorize_vllm_model.md). +vLLM fully integrates Tensorizer in to its model loading machinery. The following will give a brief overview on how to get started with using Tensorizer on vLLM. -!!! note - Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`. +## Installing Tensorizer + +To install `tensorizer`, run `pip install vllm[tensorizer]`. + +## The basics + +To load a model using Tensorizer, the model first needs to be serialized by +Tensorizer. [The example script](../../examples/others/tensorize_vllm_model.md) takes care of this process. + +Let's walk through a basic example by serializing `facebook/opt-125m` using the script, and then loading it for inference. + +## Serializing a vLLM model with Tensorizer + +To serialize a model with Tensorizer, call the example script with the necessary +CLI arguments. 
The docstring for the script itself explains the CLI args +and how to use it properly in great detail, and we'll use one of the examples from the docstring directly, assuming we want to serialize and save our model at our S3 bucket example `s3://my-bucket`: + +```bash +python examples/others/tensorize_vllm_model.py \ + --model facebook/opt-125m \ + serialize \ + --serialized-directory s3://my-bucket \ + --suffix v1 +``` + +This saves the model tensors at `s3://my-bucket/vllm/facebook/opt-125m/v1`. If you intend on applying a LoRA adapter to your tensorized model, you can pass the HF id of the LoRA adapter in the above command, and the artifacts will be saved there too: + +```bash +python examples/others/tensorize_vllm_model.py \ + --model facebook/opt-125m \ + --lora-path \ + serialize \ + --serialized-directory s3://my-bucket \ + --suffix v1 +``` + +## Serving the model using Tensorizer + +Once the model is serialized where you want it, you can load the model using `vllm serve` or the `LLM` entrypoint. You can pass the directory where you saved the model to the `model` argument for `LLM()` and `vllm serve`. For example, to serve the tensorized model saved previously with the LoRA adapter, you'd do: + +```bash +vllm serve s3://my-bucket/vllm/facebook/opt-125m/v1 \ + --load-format tensorizer \ + --enable-lora +``` + +Or, with `LLM()`: + +```python +from vllm import LLM +llm = LLM( + "s3://my-bucket/vllm/facebook/opt-125m/v1", + load_format="tensorizer", + enable_lora=True +) +``` + +## Options for configuring Tensorizer + +`tensorizer`'s core objects that serialize and deserialize models are `TensorSerializer` and `TensorDeserializer` respectively. In order to pass arbitrary kwargs to these, which will configure the serialization and deserialization processes, you can provide them as keys to `model_loader_extra_config` with `serialization_kwargs` and `deserialization_kwargs` respectively. Full docstrings detailing all parameters for the aforementioned objects can be found in `tensorizer`'s [serialization.py](https://github.com/coreweave/tensorizer/blob/main/tensorizer/serialization.py) file. + +As an example, CPU concurrency can be limited when serializing with `tensorizer` via the `limit_cpu_concurrency` parameter in the initializer for `TensorSerializer`. 
To set `limit_cpu_concurrency` to some arbitrary value, you would do so like this when serializing: + +```bash +python examples/others/tensorize_vllm_model.py \ + --model facebook/opt-125m \ + --lora-path \ + serialize \ + --serialized-directory s3://my-bucket \ + --serialization-kwargs '{"limit_cpu_concurrency": 2}' \ + --suffix v1 +``` + +As an example when customizing the loading process via `TensorDeserializer`, you could limit the number of concurrency readers during deserialization with the `num_readers` parameter in the initializer via `model_loader_extra_config` like so: + +```bash +vllm serve s3://my-bucket/vllm/facebook/opt-125m/v1 \ + --load-format tensorizer \ + --enable-lora \ + --model-loader-extra-config '{"deserialization_kwargs": {"num_readers": 2}}' +``` + +Or with `LLM()`: + +```python +from vllm import LLM +llm = LLM( + "s3://my-bucket/vllm/facebook/opt-125m/v1", + load_format="tensorizer", + enable_lora=True, + model_loader_extra_config={"deserialization_kwargs": {"num_readers": 2}} +) +``` diff --git a/examples/others/tensorize_vllm_model.py b/examples/others/tensorize_vllm_model.py index 64a6c42ae23..559c7c493ac 100644 --- a/examples/others/tensorize_vllm_model.py +++ b/examples/others/tensorize_vllm_model.py @@ -84,18 +84,22 @@ Once a model is serialized, tensorizer can be invoked with the `LLM` class directly to load models: - llm = LLM(model="facebook/opt-125m", - load_format="tensorizer", - model_loader_extra_config=TensorizerConfig( - tensorizer_uri = path_to_tensors, - num_readers=3, - ) - ) +```python +from vllm import LLM +llm = LLM( + "s3://my-bucket/vllm/facebook/opt-125m/v1", + load_format="tensorizer" +) +``` + A serialized model can be used during model loading for the vLLM OpenAI -inference server. `model_loader_extra_config` is exposed as the CLI arg -`--model-loader-extra-config`, and accepts a JSON string literal of the -TensorizerConfig arguments desired. +inference server: + +``` +vllm serve s3://my-bucket/vllm/facebook/opt-125m/v1 \ + --load-format tensorizer +``` In order to see all of the available arguments usable to configure loading with tensorizer that are given to `TensorizerConfig`, run: @@ -116,10 +120,9 @@ `--enable-lora`. For instance: ``` -vllm serve \ +vllm serve s3://my-bucket/vllm/facebook/opt-125m/v1 \ --load-format tensorizer \ - --model-loader-extra-config '{"tensorizer_uri": ".tensors"}' \ - --enable-lora + --enable-lora ``` """ From 79796dc90741cec14d7e1cc695b5de5b4149d9a2 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 24 Jul 2025 08:13:05 -0700 Subject: [PATCH 325/552] [Docs] Rewrite Distributed Inference and Serving guide (#20593) Signed-off-by: Ricardo Decal Co-authored-by: Simon Mo Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/serving/distributed_serving.md | 131 +++++++++++++++++----------- 1 file changed, 79 insertions(+), 52 deletions(-) diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index a1f522cc5f1..d1ea29404de 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -1,31 +1,38 @@ -# Distributed Inference and Serving +# Distributed inference and serving -## How to decide the distributed inference strategy? +## Distributed inference strategies for a single-model replica -Before going into the details of distributed inference and serving, let's first make it clear when to use distributed inference and what are the strategies available. 
The common practice is: +To choose a distributed inference strategy for a single-model replica, use the following guidelines: -- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference. -- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4. -- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2. +- **Single GPU (no distributed inference):** if the model fits on a single GPU, distributed inference is probably unnecessary. Run inference on that GPU. +- **Single-node multi-GPU using tensor parallel inference:** if the model is too large for a single GPU but fits on a single node with multiple GPUs, use *tensor parallelism*. For example, set `tensor_parallel_size=4` when using a node with 4 GPUs. +- **Multi-node multi-GPU using tensor parallel and pipeline parallel inference:** if the model is too large for a single node, combine *tensor parallelism* with *pipeline parallelism*. Set `tensor_parallel_size` to the number of GPUs per node and `pipeline_parallel_size` to the number of nodes. For example, set `tensor_parallel_size=8` and `pipeline_parallel_size=2` when using 2 nodes with 8 GPUs per node. -In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes. +Increase the number of GPUs and nodes until there is enough GPU memory for the model. Set `tensor_parallel_size` to the number of GPUs per node and `pipeline_parallel_size` to the number of nodes. -After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough. +After you provision sufficient resources to fit the model, run `vllm`. Look for log messages like: -!!! note - There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs. 
+```text +INFO 07-23 13:56:04 [kv_cache_utils.py:775] GPU KV cache size: 643,232 tokens +INFO 07-23 13:56:04 [kv_cache_utils.py:779] Maximum concurrency for 40,960 tokens per request: 15.70x +``` + +The `GPU KV cache size` line reports the total number of tokens that can be stored in the GPU KV cache at once. The `Maximum concurrency` line provides an estimate of how many requests can be served concurrently if each request requires the specified number of tokens (40,960 in the example above). The tokens-per-request number is taken from the model configuration's maximum sequence length, `ModelConfig.max_model_len`. If these numbers are lower than your throughput requirements, add more GPUs or nodes to your cluster. -### Distributed serving of MoE (Mixture of Experts) models +!!! note "Edge case: uneven GPU splits" + If the model fits within a single node but the GPU count doesn't evenly divide the model size, enable pipeline parallelism, which splits the model along layers and supports uneven splits. In this scenario, set `tensor_parallel_size=1` and `pipeline_parallel_size` to the number of GPUs. Furthermore, if the GPUs on the node do not have NVLINK interconnect (e.g. L40S), leverage pipeline parallelism instead of tensor parallelism for higher throughput and lower communication overhead. -It is often advantageous to exploit the inherent parallelism of experts by using a separate parallelism strategy for the expert layers. vLLM supports large-scale deployment combining Data Parallel attention with Expert or Tensor Parallel MoE layers. See the page on [Data Parallel Deployment](data_parallel_deployment.md) for more information. +### Distributed serving of *Mixture of Experts* (*MoE*) models -## Running vLLM on a single node +It's often advantageous to exploit the inherent parallelism of experts by using a separate parallelism strategy for the expert layers. vLLM supports large-scale deployment combining Data Parallel attention with Expert or Tensor Parallel MoE layers. For more information, see [Data Parallel Deployment](data_parallel_deployment.md). -vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inference currently requires Ray. +## Single-node deployment -Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured `tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the `LLM` class `distributed_executor_backend` argument or `--distributed-executor-backend` API server argument. Set it to `mp` for multiprocessing or `ray` for Ray. It's not required for Ray to be installed for the multiprocessing case. +vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. The implementation includes [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). -To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. 
For example, to run inference on 4 GPUs: +The default distributed runtimes are [Ray](https://github.com/ray-project/ray) for multi-node inference and native Python `multiprocessing` for single-node inference. You can override the defaults by setting `distributed_executor_backend` in the `LLM` class or `--distributed-executor-backend` in the API server. Use `mp` for `multiprocessing` or `ray` for Ray. + +For multi-GPU inference, set `tensor_parallel_size` in the `LLM` class to the desired GPU count. For example, to run inference on 4 GPUs: ```python from vllm import LLM @@ -33,84 +40,96 @@ llm = LLM("facebook/opt-13b", tensor_parallel_size=4) output = llm.generate("San Francisco is a") ``` -To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs: +For multi-GPU serving, include `--tensor-parallel-size` when starting the server. For example, to run the API server on 4 GPUs: ```bash vllm serve facebook/opt-13b \ --tensor-parallel-size 4 ``` -You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism: +To enable pipeline parallelism, add `--pipeline-parallel-size`. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism: ```bash +# Eight GPUs total vllm serve gpt2 \ --tensor-parallel-size 4 \ --pipeline-parallel-size 2 ``` -## Running vLLM on multiple nodes +## Multi-node deployment + +If a single node lacks sufficient GPUs to hold the model, deploy vLLM across multiple nodes. Multi-node deployments require Ray as the runtime engine. Ensure that every node provides an identical execution environment, including the model path and Python packages. Using container images is recommended because they provide a convenient way to keep environments consistent and to hide host heterogeneity. -If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path, the Python environment. The recommended way is to use docker images to ensure the same environment, and hide the heterogeneity of the host machines via mapping them into the same docker configuration. +### Ray cluster setup with containers -The first step, is to start containers and organize them into a cluster. We have provided the helper script to start the cluster. Please note, this script launches docker without administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can have `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command. +The helper script `` starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command. 
-Pick a node as the head node, and run the following command: +Choose one node as the head node and run: ```bash bash run_cluster.sh \ vllm/vllm-openai \ - ip_of_head_node \ + \ --head \ /path/to/the/huggingface/home/in/this/node \ - -e VLLM_HOST_IP=ip_of_this_node + -e VLLM_HOST_IP= ``` -On the rest of the worker nodes, run the following command: +On each worker node, run: ```bash bash run_cluster.sh \ vllm/vllm-openai \ - ip_of_head_node \ + \ --worker \ /path/to/the/huggingface/home/in/this/node \ - -e VLLM_HOST_IP=ip_of_this_node + -e VLLM_HOST_IP= ``` -Then you get a ray cluster of **containers**. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. The IP addresses of each worker node should be specified in the `VLLM_HOST_IP` environment variable, and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses. +Note that `VLLM_HOST_IP` is unique for each worker. Keep the shells running these commands open; closing any shell terminates the cluster. Ensure that all nodes can communicate with each other through their IP addresses. -!!! warning - It is considered best practice to set `VLLM_HOST_IP` to an address on a private network segment for the vLLM cluster. The traffic sent here is not encrypted. The endpoints are also exchanging data in a format that could be exploited to execute arbitrary code should a malicious party gain access to the network. Please ensure that this network is not reachable by any untrusted parties. +!!! warning "Network security" + For security, set `VLLM_HOST_IP` to an address on a private network segment. Traffic sent over this network is unencrypted, and the endpoints exchange data in a format that can be exploited to execute arbitrary code if an adversary gains network access. Ensure that untrusted parties cannot reach the network. -!!! warning - Since this is a ray cluster of **containers**, all the following commands should be executed in the **containers**, otherwise you are executing the commands on the host machine, which is not connected to the ray cluster. To enter the container, you can use `docker exec -it node /bin/bash`. +From any node, enter a container and run `ray status` and `ray list nodes` to verify that Ray finds the expected number of nodes and GPUs. -Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` and `ray list nodes` to check the status of the Ray cluster. You should see the right number of nodes and GPUs. +!!! tip + Alternatively, set up the Ray cluster using KubeRay. For more information, see [KubeRay vLLM documentation](https://docs.ray.io/en/latest/cluster/kubernetes/examples/vllm-rayservice.html). -After that, on any node, use `docker exec -it node /bin/bash` to enter the container again. **In the container**, you can use vLLM as usual, just as you have all the GPUs on one node: vLLM will be able to leverage GPU resources of all nodes in the Ray cluster, and therefore, only run the `vllm` command on this node but not other nodes. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. 
For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2: +### Running vLLM on a Ray cluster + +!!! tip + If Ray is running inside containers, run the commands in the remainder of this guide _inside the containers_, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it /bin/bash`. + +Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient. + +The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2: ```bash - vllm serve /path/to/the/model/in/the/container \ - --tensor-parallel-size 8 \ - --pipeline-parallel-size 2 +vllm serve /path/to/the/model/in/the/container \ + --tensor-parallel-size 8 \ + --pipeline-parallel-size 2 ``` -You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16: +Alternatively, you can set `tensor_parallel_size` to the total number of GPUs in the cluster: ```bash vllm serve /path/to/the/model/in/the/container \ --tensor-parallel-size 16 ``` -To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the InfiniBand is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses InfiniBand with GPUDirect RDMA, which is efficient. +## Troubleshooting distributed deployments + +To make tensor parallelism performant, ensure that communication between nodes is efficient, for example, by using high-speed network cards such as InfiniBand. To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Contact your system administrator for more information about the required flags. One way to confirm if InfiniBand is working is to run `vllm` with the `NCCL_DEBUG=TRACE` environment variable set, for example `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL uses a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL uses InfiniBand with GPUDirect RDMA, which is efficient. -### GPUDirect RDMA +## Enabling GPUDirect RDMA -To enable GPUDirect RDMA with vLLM, specific configuration tweaks are needed. 
This setup ensures: +To enable GPUDirect RDMA with vLLM, configure the following settings: -- `IPC_LOCK` Security Context: Add the `IPC_LOCK` capability to the container’s security context to lock memory pages and prevent swapping to disk. -- Shared Memory with `/dev/shm`: Mount `/dev/shm` in the pod spec to provide shared memory for IPC. +- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. +- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC). -When using Docker, you can set up the container as follows: +If you use Docker, set up the container as follows: ```bash docker run --gpus all \ @@ -120,7 +139,7 @@ docker run --gpus all \ vllm/vllm-openai ``` -When using Kubernetes, you can set up the pod spec as follows: +If you use Kubernetes, set up the pod spec as follows: ```yaml ... @@ -146,13 +165,21 @@ spec: ... ``` -!!! warning - After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script][troubleshooting-incorrect-hardware-driver] for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See for more information. +Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. To enable InfiniBand, append flags such as `--privileged -e NCCL_IB_HCA=mlx5` to `run_cluster.sh`. For cluster-specific settings, consult your system administrator. + +To confirm InfiniBand operation, enable detailed NCCL logs: + +```bash +NCCL_DEBUG=TRACE vllm serve ... +``` + +Search the logs for the transport method. Entries containing `[send] via NET/Socket` indicate raw TCP sockets, which perform poorly for cross-node tensor parallelism. Entries containing `[send] via NET/IB/GDRDMA` indicate InfiniBand with GPUDirect RDMA, which provides high performance. -!!! warning - Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes. +!!! tip "Verify inter-node GPU communication" + After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to `run_cluster.sh`, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . - When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model. +!!! 
tip "Pre-download Hugging Face models" + If you use Hugging Face models, downloading the model before starting vLLM is recommended. Download the model on every node to the same path, or store the model on a distributed file system accessible by all nodes. Then pass the path to the model in place of the repository ID. Otherwise, supply a Hugging Face token by appending `-e HF_TOKEN=` to `run_cluster.sh`. -!!! warning - If you keep receiving the error message `Error: No available node types can fulfill resource request` but you have enough GPUs in the cluster, chances are your nodes have multiple IP addresses and vLLM cannot find the right one, especially when you are using multi-node inference. Please make sure vLLM and ray use the same IP address. You can set the `VLLM_HOST_IP` environment variable to the right IP address in the `run_cluster.sh` script (different for each node!), and check `ray status` and `ray list nodes` to see the IP address used by Ray. See for more information. +!!! tip + The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in `run_cluster.sh` (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see . From 99277f89a6536e01ab9761f7f9300c335cc5ec92 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Thu, 24 Jul 2025 11:13:24 -0400 Subject: [PATCH 326/552] [Bug] Fix Compressed Tensor NVFP4 `cutlass_fp4_group_mm` illegal memory access (#21465) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- .../quantization/cutlass_w8a8/moe/moe_data.cu | 27 ++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/csrc/quantization/cutlass_w8a8/moe/moe_data.cu b/csrc/quantization/cutlass_w8a8/moe/moe_data.cu index 993c30c48c8..857cca1e82d 100644 --- a/csrc/quantization/cutlass_w8a8/moe/moe_data.cu +++ b/csrc/quantization/cutlass_w8a8/moe/moe_data.cu @@ -47,13 +47,12 @@ __global__ void compute_problem_sizes(const int32_t* __restrict__ topk_ids, __global__ void compute_expert_offsets( const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets, - int32_t* atomic_buffer, const int num_experts, const int topk_length) { + int32_t* atomic_buffer, const int num_experts, const bool swap_ab) { int32_t tot_offset = 0; expert_offsets[0] = 0; for (int i = 0; i < num_experts; ++i) { atomic_buffer[i] = tot_offset; - tot_offset += topk_length > SWAP_AB_THRESHOLD ? problem_sizes1[i * 3] - : problem_sizes1[i * 3 + 1]; + tot_offset += swap_ab ? problem_sizes1[i * 3 + 1] : problem_sizes1[i * 3]; expert_offsets[i + 1] = tot_offset; } } @@ -61,15 +60,14 @@ __global__ void compute_expert_offsets( __global__ void compute_expert_blockscale_offsets( const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets, int32_t* blockscale_offsets, int32_t* atomic_buffer, const int num_experts, - const int topk_length) { + const bool swap_ab) { int32_t tot_offset = 0; int32_t tot_offset_round = 0; expert_offsets[0] = 0; blockscale_offsets[0] = 0; for (int i = 0; i < num_experts; ++i) { - int32_t cur_offset = topk_length > SWAP_AB_THRESHOLD - ? problem_sizes1[i * 3] - : problem_sizes1[i * 3 + 1]; + int32_t cur_offset = + swap_ab ? 
problem_sizes1[i * 3 + 1] : problem_sizes1[i * 3]; atomic_buffer[i] = tot_offset; tot_offset += cur_offset; expert_offsets[i + 1] = tot_offset; @@ -119,15 +117,19 @@ void get_cutlass_moe_mm_data_caller( int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel()); - if (topk_ids.numel() > SWAP_AB_THRESHOLD) { - compute_problem_sizes<<>>( + // Swap-AB should be disabled for FP4 path + bool may_swap_ab = (!blockscale_offsets.has_value()) && + (topk_ids.numel() <= SWAP_AB_THRESHOLD); + + if (may_swap_ab) { + compute_problem_sizes<<>>( static_cast(topk_ids.data_ptr()), static_cast(problem_sizes1.data_ptr()), static_cast(problem_sizes2.data_ptr()), static_cast(atomic_buffer.data_ptr()), topk_ids.numel(), n, k); } else { - compute_problem_sizes<<>>( + compute_problem_sizes<<>>( static_cast(topk_ids.data_ptr()), static_cast(problem_sizes1.data_ptr()), static_cast(problem_sizes2.data_ptr()), @@ -136,18 +138,19 @@ void get_cutlass_moe_mm_data_caller( } if (blockscale_offsets.has_value()) { + // fp4 path compute_expert_blockscale_offsets<<<1, 1, 0, stream>>>( static_cast(problem_sizes1.data_ptr()), static_cast(expert_offsets.data_ptr()), static_cast(blockscale_offsets.value().data_ptr()), static_cast(atomic_buffer.data_ptr()), num_experts, - topk_ids.numel()); + may_swap_ab); } else { compute_expert_offsets<<<1, 1, 0, stream>>>( static_cast(problem_sizes1.data_ptr()), static_cast(expert_offsets.data_ptr()), static_cast(atomic_buffer.data_ptr()), num_experts, - topk_ids.numel()); + may_swap_ab); } compute_arg_sorts<<>>( static_cast(topk_ids.data_ptr()), From 3ee60ba74137aaa1856057922d18a4a48b1a7716 Mon Sep 17 00:00:00 2001 From: Shu Wang Date: Thu, 24 Jul 2025 10:13:31 -0500 Subject: [PATCH 327/552] Update flashinfer CUTLASS MoE Kernel (#21408) Signed-off-by: Shu Wang. 
Signed-off-by: x22x22 --- .../fused_moe/flashinfer_cutlass_prepare_finalize.py | 4 ++-- vllm/model_executor/layers/quantization/modelopt.py | 4 ++-- vllm/utils/flashinfer.py | 8 ++++---- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py index e658990e95e..02e1d1f1fd0 100644 --- a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py @@ -11,7 +11,7 @@ from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig from vllm.model_executor.layers.fused_moe.utils import ( extract_required_args, moe_kernel_quantize_input) -from vllm.utils.flashinfer import block_scale_interleave +from vllm.utils.flashinfer import nvfp4_block_scale_interleave def get_local_sizes(local_tokens): @@ -92,7 +92,7 @@ def prepare( dim=0, sizes=get_local_sizes(local_tokens)) a1_m, a1_n = a1q.shape - a1q_scale = block_scale_interleave(a1q_scale) + a1q_scale = nvfp4_block_scale_interleave(a1q_scale) return a1q, a1q_scale, None, topk_ids, topk_weights diff --git a/vllm/model_executor/layers/quantization/modelopt.py b/vllm/model_executor/layers/quantization/modelopt.py index 460334d77f0..81611ed07aa 100644 --- a/vllm/model_executor/layers/quantization/modelopt.py +++ b/vllm/model_executor/layers/quantization/modelopt.py @@ -1254,8 +1254,8 @@ def apply( x, layer.w13_weight, layer.w2_weight), ( "Flashinfer CUTLASS Fused MoE not applicable!") - a1_gscale = torch.min(layer.w13_input_scale_quant) - a2_gscale = torch.min(layer.w2_input_scale_quant) + a1_gscale = layer.w13_input_scale_quant + a2_gscale = layer.w2_input_scale_quant extra_expert_args = { 'g1_alphas': layer.g1_alphas, 'g2_alphas': layer.g2_alphas, diff --git a/vllm/utils/flashinfer.py b/vllm/utils/flashinfer.py index 1ddafbae7fc..b25e3a49f18 100644 --- a/vllm/utils/flashinfer.py +++ b/vllm/utils/flashinfer.py @@ -69,8 +69,8 @@ def wrapper(*args, **kwargs): flashinfer_cutlass_fused_moe = _lazy_import_wrapper("flashinfer.fused_moe", "cutlass_fused_moe") fp4_quantize = _lazy_import_wrapper("flashinfer", "fp4_quantize") -block_scale_interleave = _lazy_import_wrapper("flashinfer", - "block_scale_interleave") +nvfp4_block_scale_interleave = _lazy_import_wrapper( + "flashinfer", "nvfp4_block_scale_interleave") # Special case for autotune since it returns a context manager autotune = _lazy_import_wrapper( @@ -95,7 +95,7 @@ def has_flashinfer_cutlass_fused_moe() -> bool: required_functions = [ ("flashinfer.fused_moe", "cutlass_fused_moe"), ("flashinfer", "fp4_quantize"), - ("flashinfer", "block_scale_interleave"), + ("flashinfer", "nvfp4_block_scale_interleave"), ] for module_name, attr_name in required_functions: @@ -110,7 +110,7 @@ def has_flashinfer_cutlass_fused_moe() -> bool: "flashinfer_trtllm_fp8_block_scale_moe", "flashinfer_cutlass_fused_moe", "fp4_quantize", - "block_scale_interleave", + "nvfp4_block_scale_interleave", "autotune", "has_flashinfer_moe", "has_flashinfer_cutlass_fused_moe", From 16ad88ea8e6eaef3f23d3406eef676b230b8d7f7 Mon Sep 17 00:00:00 2001 From: Chaojun Zhang Date: Thu, 24 Jul 2025 23:23:36 +0800 Subject: [PATCH 328/552] [XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform (#21036) Signed-off-by: chzhang Signed-off-by: x22x22 --- vllm/compilation/pass_manager.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git 
a/vllm/compilation/pass_manager.py b/vllm/compilation/pass_manager.py index 58216a1f0ed..11e03daced1 100644 --- a/vllm/compilation/pass_manager.py +++ b/vllm/compilation/pass_manager.py @@ -5,12 +5,15 @@ from vllm.config import VllmConfig from vllm.logger import init_logger +from vllm.platforms import current_platform + +if current_platform.is_cuda_alike(): + from .fusion import FusionPass + from .collective_fusion import AllReduceFusionPass, AsyncTPPass + from .fusion_attn import AttnFusionPass from .activation_quant_fusion import ActivationQuantFusionPass -from .collective_fusion import AllReduceFusionPass, AsyncTPPass from .fix_functionalization import FixFunctionalizationPass -from .fusion import FusionPass -from .fusion_attn import AttnFusionPass from .inductor_pass import CustomGraphPass, InductorPass, get_pass_context from .noop_elimination import NoOpEliminationPass from .sequence_parallelism import SequenceParallelismPass From bf8af92e2cc0bfd0d08d55f3919117d26727605e Mon Sep 17 00:00:00 2001 From: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Date: Thu, 24 Jul 2025 08:53:45 -0700 Subject: [PATCH 329/552] [P/D] Move FakeNixlWrapper to test dir (#21328) Signed-off-by: Rui Qiao Signed-off-by: x22x22 --- .../kv_connector/unit/test_nixl_connector.py | 169 ++++++++++++++---- .../kv_transfer/kv_connector/utils.py | 10 +- vllm/mocks/__init__.py | 0 vllm/mocks/mock_nixl_connector.py | 76 -------- 4 files changed, 140 insertions(+), 115 deletions(-) delete mode 100644 vllm/mocks/__init__.py delete mode 100644 vllm/mocks/mock_nixl_connector.py diff --git a/tests/v1/kv_connector/unit/test_nixl_connector.py b/tests/v1/kv_connector/unit/test_nixl_connector.py index 99bde919c72..c5ca7df8368 100644 --- a/tests/v1/kv_connector/unit/test_nixl_connector.py +++ b/tests/v1/kv_connector/unit/test_nixl_connector.py @@ -1,10 +1,15 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import contextlib +import inspect import os import tempfile import textwrap import time +import uuid +from collections import defaultdict +from typing import Optional from unittest.mock import patch import pytest @@ -16,30 +21,118 @@ KVConnectorRole, NixlAgentMetadata, NixlConnector, NixlConnectorMetadata, NixlConnectorWorker) from vllm.forward_context import ForwardContext -from vllm.mocks.mock_nixl_connector import FakeNixlWrapper from vllm.sampling_params import SamplingParams from .utils import create_request, create_scheduler, create_vllm_config -def _make_stub_pkg() -> str: - """Return a directory that makes - `from nixl._api import nixl_agent` resolve to our FakeNixlWrapper.""" - td = tempfile.mkdtemp() - pkg_root = os.path.join(td, "nixl", "_api") - os.makedirs(pkg_root, exist_ok=True) +class FakeNixlWrapper: + """Mock implementation of NixlWrapper for testing. - stub = textwrap.dedent("""\ - # Forward the real FakeNixlWrapper that the driver already defined. - print("In fake package") - from vllm.mocks.mock_nixl_connector import FakeNixlWrapper as nixl_agent - """) - with open(os.path.join(pkg_root, "__init__.py"), "w") as f: - f.write(stub) + We don't inherit from nixl._api.nixl_agent because nixl may not be + installed. + + Note: The complete source of this class is also used in the + `_make_fake_nixl_pkg` function to create a fake nixl package + for Ray workers. 
+ """ + + AGENT_METADATA = b"fake_agent_metadata" + REMOTE_AGENT_NAME = "remote_agent" + + def __init__(self, agent_name: str, *args, **kwargs): + self._cycles_before_xfer_done = 0 + self._check_xfer_state_cycles: defaultdict[int, int] = defaultdict( + lambda: 0) + + def get_reg_descs(self, caches_data, memory_type: str) -> list: + return [str(uuid.uuid4()) for _ in caches_data] + + def register_memory(self, descs) -> None: + pass + + def get_xfer_descs(self, blocks_data, memory_type: str) -> list: + return [str(uuid.uuid4()) for _ in blocks_data] + + def prep_xfer_dlist(self, agent_name: str, descs: list) -> int: + return uuid.uuid4().int + + def get_agent_metadata(self) -> bytes: + return self.AGENT_METADATA + + def add_remote_agent(self, agent_metadata: bytes) -> str: + return self.REMOTE_AGENT_NAME + + def get_new_notifs(self) -> dict[str, list[bytes]]: + # Used to collect done_sending, which we don't test yet. + return {} + + def check_xfer_state(self, handle: int) -> str: + if self._check_xfer_state_cycles[ + handle] >= self._cycles_before_xfer_done: + return "DONE" + self._check_xfer_state_cycles[handle] += 1 + return "PROC" + + def release_xfer_handle(self, handle: int) -> None: + pass + + def send_notif(self, agent_name: str, notif_msg: bytes) -> None: + pass + + def make_prepped_xfer(self, + xfer_type: str, + local_xfer_side_handle: int, + local_block_descs_ids: list[int], + remote_xfer_side_handle: int, + remote_block_descs_ids: list[int], + notif_msg: Optional[bytes] = None) -> int: + return uuid.uuid4().int - # touch parent package - open(os.path.join(td, "nixl", "__init__.py"), "w").close() - return td + def transfer(self, handle: int) -> str: + return "PROC" + + ############################################################ + # Follow are for changing the behavior during testing. + ############################################################ + + def set_cycles_before_xfer_done(self, cycles: int): + """Set the number of cycles before a transfer is considered done.""" + self._cycles_before_xfer_done = cycles + + +@contextlib.contextmanager +def _make_fake_nixl_pkg(): + """Context manager that creates a temporary package making + `from nixl._api import nixl_agent` resolve to our FakeNixlWrapper. + + Automatically cleans up the temporary directory when done. 
+ """ + with tempfile.TemporaryDirectory() as td: + pkg_root = os.path.join(td, "nixl", "_api") + os.makedirs(pkg_root, exist_ok=True) + + # Get the source code of FakeNixlWrapper class and dedent it + fake_nixl_source = inspect.getsource(FakeNixlWrapper) + fake_nixl_source = textwrap.dedent(fake_nixl_source) + + stub = f"""\ +# Copy of FakeNixlWrapper implementation for Ray workers +import uuid +from collections import defaultdict +from typing import Optional + +{fake_nixl_source} + +# Export as nixl_agent +nixl_agent = FakeNixlWrapper +""" + with open(os.path.join(pkg_root, "__init__.py"), "w") as f: + f.write(stub) + + # touch parent package + open(os.path.join(td, "nixl", "__init__.py"), "w").close() + yield td def test_basic_interface(): @@ -351,27 +444,37 @@ def test_abort_timeout_on_prefiller(monkeypatch, distributed_executor_backend): kv_connector="NixlConnector", kv_role="kv_both", ) + llm_kwargs = { + "model": model_name, + "enforce_eager": True, + "gpu_memory_utilization": 0.5, + "kv_transfer_config": kv_transfer_config, + "distributed_executor_backend": distributed_executor_backend, + } + timeout = 6 monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0") monkeypatch.setenv("VLLM_NIXL_ABORT_REQUEST_TIMEOUT", str(timeout)) - # Build runtime_env only if we’re using Ray + # Build runtime_env only if we're using Ray if distributed_executor_backend == "ray": - runtime_env = { - "working_dir": _make_stub_pkg(), # ship stub package - "env_vars": { - "VLLM_NIXL_ABORT_REQUEST_TIMEOUT": str(timeout), - }, - } - ray.init(runtime_env=runtime_env) - - llm = LLM( - model=model_name, - enforce_eager=True, - gpu_memory_utilization=0.5, - kv_transfer_config=kv_transfer_config, - distributed_executor_backend=distributed_executor_backend, - ) + with _make_fake_nixl_pkg() as working_dir: + runtime_env = { + "working_dir": working_dir, # ship fake nixl package + "env_vars": { + "VLLM_NIXL_ABORT_REQUEST_TIMEOUT": str(timeout), + }, + } + ray.init(runtime_env=runtime_env) + + _run_abort_timeout_test(llm_kwargs, timeout) + else: + _run_abort_timeout_test(llm_kwargs, timeout) + + +def _run_abort_timeout_test(llm_kwargs: dict, timeout: int): + """Helper function to run the abort timeout test logic.""" + llm = LLM(**llm_kwargs) remote_prefill_opts = { "do_remote_decode": True, "do_remote_prefill": False, diff --git a/vllm/distributed/kv_transfer/kv_connector/utils.py b/vllm/distributed/kv_transfer/kv_connector/utils.py index c179d6cc29b..459a5329891 100644 --- a/vllm/distributed/kv_transfer/kv_connector/utils.py +++ b/vllm/distributed/kv_transfer/kv_connector/utils.py @@ -120,8 +120,8 @@ class KVOutputAggregator: output corresponding to Rank 0 for scheduler.""" def __init__(self, world_size: int): - # Complete transfer tracker. Used by to track finished requests - # [req_id -> n_finished_workers] + # Complete transfer tracker. 
Used to track finished requests + # [req_id -> n_remaining_workers] self._recv_remaining_count = defaultdict[str, int](lambda: world_size) self._send_remaining_count = defaultdict[str, int](lambda: world_size) @@ -134,12 +134,10 @@ def update_finished_set(req_ids: Optional[set[str]], remaining_count_dict: dict[str, int], finished_set: set[str]) -> None: for req_id in req_ids or (): - new_count = remaining_count_dict[req_id] - 1 - if new_count == 0: + remaining_count_dict[req_id] -= 1 + if remaining_count_dict[req_id] == 0: finished_set.add(req_id) del remaining_count_dict[req_id] - else: - remaining_count_dict[req_id] = new_count finished_sending = set[str]() finished_recving = set[str]() diff --git a/vllm/mocks/__init__.py b/vllm/mocks/__init__.py deleted file mode 100644 index e69de29bb2d..00000000000 diff --git a/vllm/mocks/mock_nixl_connector.py b/vllm/mocks/mock_nixl_connector.py deleted file mode 100644 index 54e2c5ee3b0..00000000000 --- a/vllm/mocks/mock_nixl_connector.py +++ /dev/null @@ -1,76 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import uuid -from collections import defaultdict -from typing import Optional - - -class FakeNixlWrapper: - """Mock implementation of NixlWrapper for testing. - - We don't inherit from nixl._api.nixl_agent because nixl may not be - installed. - """ - - AGENT_METADATA = b"fake_agent_metadata" - REMOTE_AGENT_NAME = "remote_agent" - - def __init__(self, agent_name: str, *args, **kwargs): - self._cycles_before_xfer_done = 0 - self._check_xfer_state_cycles: defaultdict[int, int] = defaultdict( - lambda: 0) - - def get_reg_descs(self, caches_data, memory_type: str) -> list: - return [str(uuid.uuid4()) for _ in caches_data] - - def register_memory(self, descs) -> None: - pass - - def get_xfer_descs(self, blocks_data, memory_type: str) -> list: - return [str(uuid.uuid4()) for _ in blocks_data] - - def prep_xfer_dlist(self, agent_name: str, descs: list) -> int: - return uuid.uuid4().int - - def get_agent_metadata(self) -> bytes: - return self.AGENT_METADATA - - def add_remote_agent(self, agent_metadata: bytes) -> str: - return self.REMOTE_AGENT_NAME - - def get_new_notifs(self) -> dict[str, list[bytes]]: - # Used to collect done_sending, which we don't test yet. - return {} - - def check_xfer_state(self, handle: int) -> str: - if self._check_xfer_state_cycles[ - handle] >= self._cycles_before_xfer_done: - return "DONE" - self._check_xfer_state_cycles[handle] += 1 - return "PROC" - - def release_xfer_handle(self, handle: int) -> None: - pass - - def send_notif(self, agent_name: str, notif_msg: bytes) -> None: - pass - - def make_prepped_xfer(self, - xfer_type: str, - local_xfer_side_handle: int, - local_block_descs_ids: list[int], - remote_xfer_side_handle: int, - remote_block_descs_ids: list[int], - notif_msg: Optional[bytes] = None) -> int: - return uuid.uuid4().int - - def transfer(self, handle: int) -> str: - return "PROC" - - ############################################################ - # Follow are for changing the behavior during testing. 
-    ############################################################
-
-    def set_cycles_before_xfer_done(self, cycles: int):
-        """Set the number of cycles before a transfer is considered done."""
-        self._cycles_before_xfer_done = cycles

From 2bfe64e39c1f4d8b550504a20d1543f015ffc6d9 Mon Sep 17 00:00:00 2001
From: x22x22 
Date: Fri, 25 Jul 2025 00:46:25 +0800
Subject: [PATCH 330/552] Optimize logging in EmbeddingMixin and remove the
 unnecessary prompt_adapter_request parameter for more concise, readable
 code.

Signed-off-by: x22x22 
---
 vllm/entrypoints/openai/serving_embedding.py | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py
index a5d42f3ecf5..a7f95cf2b85 100644
--- a/vllm/entrypoints/openai/serving_embedding.py
+++ b/vllm/entrypoints/openai/serving_embedding.py
@@ -233,8 +233,8 @@ def _should_use_chunked_processing(self, request) -> bool:
                         "its chunk (similar to sliding window attention), "
                         "which changes token representations before pooling. "
                         "While MEAN pooling provides a reasonable "
-                        "approximation "
-                        "through weighted averaging aggregation, other pooling "
+                        "approximation through weighted averaging aggregation, "
+                        "other pooling "
                         "types use different aggregation strategies that "
                         "further approximate the original behavior. Set "
                         "'allow_non_mean_chunking: true' in pooler config "
@@ -316,8 +316,7 @@ async def _process_chunked_request(
             self._log_inputs(chunk_request_id,
                              chunk_request_prompt,
                              params=pooling_params,
-                             lora_request=ctx.lora_request,
-                             prompt_adapter_request=ctx.prompt_adapter_request)
+                             lora_request=ctx.lora_request)
 
             # Create generator for this chunk
             generator = self.engine_client.encode(
@@ -468,12 +467,10 @@ async def _prepare_generators(
                 # Normal processing for short prompts or non-token prompts
                 request_id_item = f"{ctx.request_id}-{i}"
 
-                self._log_inputs(
-                    request_id_item,
-                    request_prompt,
-                    params=pooling_params,
-                    lora_request=ctx.lora_request,
-                    prompt_adapter_request=ctx.prompt_adapter_request)
+                self._log_inputs(request_id_item,
+                                 request_prompt,
+                                 params=pooling_params,
+                                 lora_request=ctx.lora_request)
 
                 # Mypy has an existing bug related to inferring the variance
                 # of TypedDicts with `builtins.enumerate`:

From c35f33443d222f162227c8f9922c32cbb8a2c166 Mon Sep 17 00:00:00 2001
From: Juncheng Gu <6314092+juncgu@users.noreply.github.com>
Date: Thu, 24 Jul 2025 09:58:42 -0700
Subject: [PATCH 331/552] [P/D] Support CPU Transfer in NixlConnector (#18293)

Signed-off-by: Juncheng Gu 
Signed-off-by: Richard Liu 
Co-authored-by: Richard Liu <39319471+richardsliu@users.noreply.github.com>
Co-authored-by: Richard Liu 
Signed-off-by: x22x22 
---
 requirements/tpu.txt                          |   1 +
 .../run_tpu_disagg_accuracy_test.sh           | 162 +++++++++++
 .../run_tpu_edge_case_test.sh                 | 128 +++++++++
 .../nixl_integration/test_disagg_accuracy.py  | 162 +++++++++++
 .../nixl_integration/test_edge_cases.py       |   9 +-
 .../nixl_integration/toy_proxy_server.py      |   6 +-
 .../kv_transfer/kv_connector/v1/base.py       |  15 +-
 .../kv_connector/v1/nixl_connector.py         | 272 +++++++++++++++---
 vllm/v1/worker/gpu_model_runner.py            |  58 +---
 .../worker/kv_connector_model_runner_mixin.py |  70 +++++
 vllm/v1/worker/tpu_model_runner.py            | 105 ++++++-
 vllm/v1/worker/tpu_worker.py                  |  15 +-
 12 files changed, 893 insertions(+), 110 deletions(-)
 create mode 100644 tests/v1/kv_connector/nixl_integration/run_tpu_disagg_accuracy_test.sh
 create mode 100644 
tests/v1/kv_connector/nixl_integration/run_tpu_edge_case_test.sh create mode 100644 tests/v1/kv_connector/nixl_integration/test_disagg_accuracy.py create mode 100644 vllm/v1/worker/kv_connector_model_runner_mixin.py diff --git a/requirements/tpu.txt b/requirements/tpu.txt index 354771482ee..d86f643d388 100644 --- a/requirements/tpu.txt +++ b/requirements/tpu.txt @@ -10,6 +10,7 @@ jinja2>=3.1.6 ray[default] ray[data] setuptools==78.1.0 +nixl==0.3.0 # Install torch_xla --pre diff --git a/tests/v1/kv_connector/nixl_integration/run_tpu_disagg_accuracy_test.sh b/tests/v1/kv_connector/nixl_integration/run_tpu_disagg_accuracy_test.sh new file mode 100644 index 00000000000..45779d16914 --- /dev/null +++ b/tests/v1/kv_connector/nixl_integration/run_tpu_disagg_accuracy_test.sh @@ -0,0 +1,162 @@ +#!/bin/bash +set -xe + +# Hosts / ports +PREFILL_HOST=${PREFILL_HOST:-"localhost"} +PREFILL_PORT=${PREFILL_PORT:-8100} +PREFILL_NIXL_SIDE_PORT=${PREFILL_NIXL_SIDE_PORT:-5577} +DECODE_HOST=${DECODE_HOST:-"localhost"} +DECODE_PORT=${DECODE_PORT:-8200} +PROXY_HOST=${PROXY_HOST:-"localhost"} +PROXY_PORT=${PROXY_PORT:-8192} +BASELINE_HOST=${BASELINE_HOST:-"localhost"} +BASELINE_PORT=${BASELINE_PORT:-9290} + + +# Model to run. +MODEL_NAME=${MODEL_NAME:-"meta-llama/Llama-3.2-3B-Instruct"} +MAX_MODEL_LEN=${MAX_MODEL_LEN:-1024} +BLOCK_SIZE=${BLOCK_SIZE:-32} + + +# execution env +GIT_ROOT=$(git rev-parse --show-toplevel) +EXP_ROOT="${GIT_ROOT}/tests/v1/kv_connector/nixl_integration" +CONDA_PATH=${CONDA_PATH:-"/home/${USER}/anaconda3"} +CONDA_ENV_NAME=${CONDA_ENV_NAME:-"nixl"} + +OUTPUT_FILE=${OUTPUT_FILE:-"${EXP_ROOT}/.tpu_accuracy_test_outputs.txt"} + +# Trap the SIGINT signal (triggered by Ctrl+C) +trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT + + +# Waits for vLLM server to start. +wait_for_server() { + local host=$1 + local port=$2 + timeout 1200 bash -c " + until curl -s ${host}:${port}/v1/completions > /dev/null; do + sleep 1 + done" && return 0 || return 1 +} + +# Cleanup function +cleanup() { + echo "Caught Ctrl+C, cleaning up..." + # Cleanup commands + pgrep python | xargs kill -9 || true + # pkill -f python || true + echo "Cleanup complete. Exiting." 
+} + +launch_baseline() { + BASELINE_BASE_CMD="source ${CONDA_PATH}/bin/activate ${CONDA_ENV_NAME}; + VLLM_LOGGING_LEVEL=DEBUG \ + VLLM_USE_V1=1 \ + PJRT_DEVICE=TPU \ + VLLM_WORKER_MULTIPROC_METHOD=spawn \ + VLLM_ENABLE_V1_MULTIPROCESSING=0 vllm serve $MODEL_NAME \ + --host ${BASELINE_HOST} \ + --port ${BASELINE_PORT} \ + --max-model-len ${MAX_MODEL_LEN}\ + --seed 42 \ + --block-size ${BLOCK_SIZE} \ + --gpu-memory-utilization 0.5 \ + --disable-log-requests \ + --enforce-eager" + echo ${BASELINE_BASE_CMD} + ssh -tt ${BASELINE_HOST} "${BASELINE_BASE_CMD}" & +} + +launch_pd() { + PREFILL_BASE_CMD="source ${CONDA_PATH}/bin/activate ${CONDA_ENV_NAME}; + UCX_TLS=tcp \ + VLLM_MULTIPROC_EXECUTE_MODEL_TIMEOUT_S=200 \ + VLLM_LOGGING_LEVEL=DEBUG \ + VLLM_USE_V1=1 \ + VLLM_NIXL_SIDE_CHANNEL_HOST=${PREFILL_HOST} \ + VLLM_NIXL_SIDE_CHANNEL_PORT=${PREFILL_NIXL_SIDE_PORT} \ + PJRT_DEVICE=TPU \ + VLLM_WORKER_MULTIPROC_METHOD=spawn \ + VLLM_ENABLE_V1_MULTIPROCESSING=0 vllm serve $MODEL_NAME \ + --host ${PREFILL_HOST} \ + --port ${PREFILL_PORT} \ + --max-model-len ${MAX_MODEL_LEN}\ + --seed 42 \ + --block-size ${BLOCK_SIZE} \ + --enforce-eager \ + --gpu-memory-utilization 0.5 \ + --disable-log-requests \ + --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"cpu\"}'" + + + DECODE_BASE_CMD="source ${CONDA_PATH}/bin/activate ${CONDA_ENV_NAME}; + UCX_TLS=tcp \ + VLLM_MULTIPROC_EXECUTE_MODEL_TIMEOUT_S=200 \ + VLLM_LOGGING_LEVEL=DEBUG \ + VLLM_USE_V1=1 \ + PJRT_DEVICE=TPU \ + VLLM_WORKER_MULTIPROC_METHOD=spawn \ + VLLM_ENABLE_V1_MULTIPROCESSING=0 vllm serve $MODEL_NAME \ + --host ${DECODE_HOST} \ + --port ${DECODE_PORT} \ + --max-model-len ${MAX_MODEL_LEN}\ + --seed 42 \ + --block-size ${BLOCK_SIZE} \ + --enforce-eager \ + --gpu-memory-utilization 0.5 \ + --disable-log-requests \ + --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"cpu\"}'" + + echo ${PREFILL_BASE_CMD} + echo ${DECODE_BASE_CMD} + sleep 2 + + # execute on hosts + ssh -tt ${PREFILL_HOST} "${PREFILL_BASE_CMD}" & + ssh -tt ${DECODE_HOST} "${DECODE_BASE_CMD}" & + sleep 1 + wait_for_server ${PREFILL_HOST} ${PREFILL_PORT} + sleep 1 + wait_for_server ${DECODE_HOST} ${DECODE_PORT} + sleep 1 +} + +launch_pd_proxy(){ + PROXY_BASE_CMD="source ${CONDA_PATH}/bin/activate ${CONDA_ENV_NAME}; + python3 ${EXP_ROOT}/toy_proxy_server.py \ + --prefiller-host ${PREFILL_HOST} --prefiller-port ${PREFILL_PORT} \ + --decoder-host ${DECODE_HOST} --decoder-port ${DECODE_PORT} \ + --host=${PROXY_HOST} --port ${PROXY_PORT}" + echo ${PROXY_BASE_CMD} + ssh -tt ${PROXY_HOST} "${PROXY_BASE_CMD}" & +} + +run_tests(){ + local service_url=$1 + local mode=$2 + python3 ${EXP_ROOT}/test_disagg_accuracy.py --service_url=${service_url} --model_name=${MODEL_NAME} --mode=${mode} --file_name=${OUTPUT_FILE} +} + + +# run non-disagg. baseline & save outputs +launch_baseline +sleep 2 +wait_for_server ${BASELINE_HOST} ${BASELINE_PORT} +run_tests "http://${BASELINE_HOST}:${BASELINE_PORT}" "baseline" +cleanup +sleep 10 + + +# run disagg. 
& do exact-match with the outputs from baseline +launch_pd +launch_pd_proxy +sleep 10 +run_tests "http://${PROXY_HOST}:${PROXY_PORT}" "disagg" +echo "-----P/D success----" + +rm ${OUTPUT_FILE} +cleanup + +exit 0 \ No newline at end of file diff --git a/tests/v1/kv_connector/nixl_integration/run_tpu_edge_case_test.sh b/tests/v1/kv_connector/nixl_integration/run_tpu_edge_case_test.sh new file mode 100644 index 00000000000..c37c92fdf5d --- /dev/null +++ b/tests/v1/kv_connector/nixl_integration/run_tpu_edge_case_test.sh @@ -0,0 +1,128 @@ +#!/bin/bash +set -xe + +# Hosts / ports +PREFILL_HOST=${PREFILL_HOST:-"localhost"} +PREFILL_PORT=${PREFILL_PORT:-8100} +PREFILL_NIXL_SIDE_PORT=${PREFILL_NIXL_SIDE_PORT:-5577} +DECODE_HOST=${DECODE_HOST:-"localhost"} +DECODE_PORT=${DECODE_PORT:-8200} +PROXY_HOST=${PROXY_HOST:-"localhost"} +PROXY_PORT=${PROXY_PORT:-8192} +BASELINE_HOST=${BASELINE_HOST:-"localhost"} +BASELINE_PORT=${BASELINE_PORT:-9290} + + +# Model to run. +MODEL_NAME=${MODEL_NAME:-"meta-llama/Llama-3.2-3B-Instruct"} +MAX_MODEL_LEN=${MAX_MODEL_LEN:-1024} +BLOCK_SIZE=${BLOCK_SIZE:-32} + + +# execution env +GIT_ROOT=$(git rev-parse --show-toplevel) +EXP_ROOT="${GIT_ROOT}/tests/v1/kv_connector/nixl_integration" +CONDA_PATH=${CONDA_PATH:-"/home/${USER}/anaconda3"} +CONDA_ENV_NAME=${CONDA_ENV_NAME:-"nixl"} + +OUTPUT_FILE=${OUTPUT_FILE:-"${EXP_ROOT}/.tpu_accuracy_test_outputs.txt"} + +# Trap the SIGINT signal (triggered by Ctrl+C) +trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT + +# Waits for vLLM server to start. +wait_for_server() { + local host=$1 + local port=$2 + timeout 1200 bash -c " + until curl -s ${host}:${port}/v1/completions > /dev/null; do + sleep 1 + done" && return 0 || return 1 +} + +# Cleanup function +cleanup() { + echo "Caught Ctrl+C, cleaning up..." + # Cleanup commands + pgrep python | xargs kill -9 || true + # pkill -f python || true + echo "Cleanup complete. Exiting." 
+} + + +launch_pd() { + PREFILL_BASE_CMD="source ${CONDA_PATH}/bin/activate ${CONDA_ENV_NAME}; + UCX_TLS=tcp \ + VLLM_MULTIPROC_EXECUTE_MODEL_TIMEOUT_S=200 \ + VLLM_LOGGING_LEVEL=DEBUG \ + VLLM_USE_V1=1 \ + VLLM_NIXL_SIDE_CHANNEL_HOST=${PREFILL_HOST} \ + VLLM_NIXL_SIDE_CHANNEL_PORT=${PREFILL_NIXL_SIDE_PORT} \ + PJRT_DEVICE=TPU \ + VLLM_WORKER_MULTIPROC_METHOD=spawn \ + VLLM_ENABLE_V1_MULTIPROCESSING=0 vllm serve $MODEL_NAME \ + --host ${PREFILL_HOST} \ + --port ${PREFILL_PORT} \ + --max-model-len ${MAX_MODEL_LEN}\ + --seed 42 \ + --block-size ${BLOCK_SIZE} \ + --enforce-eager \ + --gpu-memory-utilization 0.5 \ + --disable-log-requests \ + --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"cpu\"}'" + + + DECODE_BASE_CMD="source ${CONDA_PATH}/bin/activate ${CONDA_ENV_NAME}; + UCX_TLS=tcp \ + VLLM_MULTIPROC_EXECUTE_MODEL_TIMEOUT_S=200 \ + VLLM_LOGGING_LEVEL=DEBUG \ + VLLM_USE_V1=1 \ + PJRT_DEVICE=TPU \ + VLLM_WORKER_MULTIPROC_METHOD=spawn \ + VLLM_ENABLE_V1_MULTIPROCESSING=0 vllm serve $MODEL_NAME \ + --host ${DECODE_HOST} \ + --port ${DECODE_PORT} \ + --max-model-len ${MAX_MODEL_LEN}\ + --seed 42 \ + --block-size ${BLOCK_SIZE} \ + --enforce-eager \ + --gpu-memory-utilization 0.5 \ + --disable-log-requests \ + --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"cpu\"}'" + + echo ${PREFILL_BASE_CMD} + echo ${DECODE_BASE_CMD} + sleep 2 + + # execute on hosts + ssh -tt ${PREFILL_HOST} "${PREFILL_BASE_CMD}" & + ssh -tt ${DECODE_HOST} "${DECODE_BASE_CMD}" & + sleep 1 + wait_for_server ${PREFILL_HOST} ${PREFILL_PORT} + sleep 1 + wait_for_server ${DECODE_HOST} ${DECODE_PORT} + sleep 1 +} + +launch_pd_proxy(){ + PROXY_BASE_CMD="source ${CONDA_PATH}/bin/activate ${CONDA_ENV_NAME}; + python3 ${EXP_ROOT}/toy_proxy_server.py \ + --prefiller-host ${PREFILL_HOST} --prefiller-port ${PREFILL_PORT} \ + --decoder-host ${DECODE_HOST} --decoder-port ${DECODE_PORT} \ + --host=${PROXY_HOST} --port ${PROXY_PORT}" + echo ${PROXY_BASE_CMD} + ssh -tt ${PROXY_HOST} "${PROXY_BASE_CMD}" & +} + + +# run disagg. & do exact-match with the outputs from baseline +launch_pd +launch_pd_proxy +sleep 10 + +PREFILL_HOST=${PREFILL_HOST} \ +PREFILL_PORT=${PREFILL_PORT} \ +DECODE_HOST=${DECODE_HOST} \ +DECODE_PORT=${DECODE_PORT} \ +PROXY_HOST=${PROXY_HOST} \ +PROXY_PORT=${PROXY_PORT} python -m pytest -s -v ${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_edge_cases.py \ No newline at end of file diff --git a/tests/v1/kv_connector/nixl_integration/test_disagg_accuracy.py b/tests/v1/kv_connector/nixl_integration/test_disagg_accuracy.py new file mode 100644 index 00000000000..00e62f351ce --- /dev/null +++ b/tests/v1/kv_connector/nixl_integration/test_disagg_accuracy.py @@ -0,0 +1,162 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import argparse +import json +import os +import time + +import openai +import requests + +MAX_OUTPUT_LEN = 30 + +SAMPLE_PROMPTS = ( + "Red Hat is the best company in the world to work for because it works on " + "open source software, which means that all the contributions are " + "delivered to the community. 
As a result, when working on projects like " + "vLLM we are able to meet many amazing people from various organizations " + "like AMD, Google, NVIDIA, ", + "We hold these truths to be self-evident, that all men are created equal, " + "that they are endowed by their Creator with certain unalienable Rights, " + "that among these are Life, Liberty and the pursuit of Happiness.--That " + "to secure these rights, Governments are instituted among Men, deriving " + "their just powers from the consent of the governed, ", +) + + +def check_vllm_server(url: str, timeout=5, retries=3) -> bool: + """ + Checks if the vLLM server is ready by sending a GET request to the + /health endpoint. + + Args: + url (str): The base URL of the vLLM server. + timeout (int): Timeout in seconds for the request. + retries (int): Number of retries if the server is not ready. + + Returns: + bool: True if the server is ready, False otherwise. + """ + for attempt in range(retries): + try: + response = requests.get(url, timeout=timeout) + if response.status_code == 200: + return True + else: + print(f"Attempt {attempt + 1}: Server returned status code " + "{response.status_code}") + except requests.exceptions.RequestException as e: + print(f"Attempt {attempt + 1}: Error connecting to server: {e}") + time.sleep(1) # Wait before retrying + return False + + +def run_simple_prompt(base_url: str, model_name: str, + input_prompt: str) -> str: + client = openai.OpenAI(api_key="EMPTY", base_url=base_url) + completion = client.completions.create(model=model_name, + prompt=input_prompt, + max_tokens=MAX_OUTPUT_LEN, + temperature=0.0, + seed=42) + + # print("-" * 50) + # print(f"Completion results for {model_name}:") + # print(completion) + # print("-" * 50) + return completion.choices[0].text + + +def main(): + """ + This script demonstrates how to accept two optional string arguments + ("service_url" and "file_name") from the command line, each with a + default value of an empty string, using the argparse module. + """ + parser = argparse.ArgumentParser(description="vLLM client script") + + parser.add_argument( + "--service_url", # Name of the first argument + type=str, + required=True, + help="The vLLM service URL.") + + parser.add_argument( + "--model_name", # Name of the first argument + type=str, + required=True, + help="model_name", + ) + + parser.add_argument( + "--mode", # Name of the second argument + type=str, + default="baseline", + help="mode: baseline==non-disagg, or disagg", + ) + + parser.add_argument( + "--file_name", # Name of the second argument + type=str, + default=".vllm_output.txt", + help="the file that saves the output tokens ", + ) + + args = parser.parse_args() + + for arg in vars(args): + print(f"{arg}: {getattr(args, arg)}") + + if args.mode == "baseline": + # non-disagg + health_check_url = f"{args.service_url}/health" + else: + # disagg proxy + health_check_url = f"{args.service_url}/healthcheck" + if not os.path.exists(args.file_name): + raise ValueError( + f"In disagg mode, the output file {args.file_name} from " + "non-disagg. 
baseline does not exist.") + + service_url = f"{args.service_url}/v1" + + if not check_vllm_server(health_check_url): + raise RuntimeError( + f"vllm server: {args.service_url} is not ready yet!") + + output_strs = dict() + for prompt in SAMPLE_PROMPTS: + output_str = run_simple_prompt(base_url=service_url, + model_name=args.model_name, + input_prompt=prompt) + print(f"Prompt: {prompt}, output: {output_str}") + output_strs[prompt] = output_str + + if args.mode == "baseline": + # baseline: save outputs + try: + with open(args.file_name, 'w') as json_file: + json.dump(output_strs, json_file, indent=4) + except OSError as e: + print(f"Error writing to file: {e}") + raise + else: + # disagg. verify outputs + baseline_outputs = None + try: + with open(args.file_name) as json_file: + baseline_outputs = json.load(json_file) + except OSError as e: + print(f"Error writing to file: {e}") + raise + assert isinstance(baseline_outputs, dict) + assert len(baseline_outputs) == len(output_strs) + for prompt, output in baseline_outputs.items(): + assert prompt in output_strs, f"{prompt} not included" + assert output == output_strs[prompt], ( + f"baseline_output: {output} != PD output: {output_strs[prompt]}" + ) + + +if __name__ == "__main__": + main() diff --git a/tests/v1/kv_connector/nixl_integration/test_edge_cases.py b/tests/v1/kv_connector/nixl_integration/test_edge_cases.py index 95465a25fc9..8439e30be15 100644 --- a/tests/v1/kv_connector/nixl_integration/test_edge_cases.py +++ b/tests/v1/kv_connector/nixl_integration/test_edge_cases.py @@ -4,8 +4,11 @@ import openai +PREFILL_HOST = os.getenv("PREFILL_HOST", "localhost") PREFILL_PORT = os.getenv("PREFILL_PORT", None) +DECODE_HOST = os.getenv("DECODE_HOST", "localhost") DECODE_PORT = os.getenv("DECODE_PORT", None) +PROXY_HOST = os.getenv("PROXY_HOST", "localhost") PROXY_PORT = os.getenv("PROXY_PORT", None) if PREFILL_PORT is None or DECODE_PORT is None or PROXY_PORT is None: @@ -21,15 +24,15 @@ def test_edge_cases(): # Set the OpenAI API key and base URL decode_client = openai.OpenAI( api_key="MY_KEY", - base_url=f"http://localhost:{DECODE_PORT}/v1", + base_url=f"http://{DECODE_HOST}:{DECODE_PORT}/v1", ) prefill_client = openai.OpenAI( api_key="MY_KEY", - base_url=f"http://localhost:{PREFILL_PORT}/v1", + base_url=f"http://{PREFILL_HOST}:{PREFILL_PORT}/v1", ) proxy_client = openai.OpenAI( api_key="MY_KEY", - base_url=f"http://localhost:{PROXY_PORT}/v1", + base_url=f"http://{PROXY_HOST}:{PROXY_PORT}/v1", ) # Get the list of models diff --git a/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py b/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py index c58cb0286f1..66e237da0f8 100644 --- a/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py +++ b/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py @@ -3,6 +3,7 @@ import argparse import itertools +import logging import os import uuid from contextlib import asynccontextmanager @@ -11,9 +12,8 @@ from fastapi import FastAPI, Request from fastapi.responses import StreamingResponse -from vllm.logger import init_logger - -logger = init_logger(__name__) +logger = logging.getLogger(__name__) +logger.setLevel(logging.DEBUG) @asynccontextmanager diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/base.py b/vllm/distributed/kv_transfer/kv_connector/v1/base.py index e1245775bea..8bbdd7e0621 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/base.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/base.py @@ -32,7 +32,7 @@ import enum from abc import ABC, abstractmethod 
-from typing import TYPE_CHECKING, Any, Optional +from typing import TYPE_CHECKING, Any, Callable, Literal, Optional import torch @@ -46,6 +46,12 @@ from vllm.v1.core.kv_cache_manager import KVCacheBlocks from vllm.v1.request import Request +# s_tensor_list, d_tensor_list, s_indices, d_indices, direction +CopyBlocksOp = Callable[[ + dict[str, torch.Tensor], dict[ + str, torch.Tensor], list[int], list[int], Literal["h2d", "d2h"] +], None] + logger = init_logger(__name__) @@ -127,6 +133,13 @@ def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]): """ return + def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp): + """ + Set the xPU-specific ops for copying KV between host and device. + Needed when host buffer is used for kv transfer (e.g., in NixlConnector) + """ + return + @abstractmethod def start_load_kv(self, forward_context: "ForwardContext", **kwargs) -> None: diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py b/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py index 0c5986bfafa..c06cda356f5 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py @@ -1,6 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import contextlib +import logging import math import queue import threading @@ -20,14 +21,14 @@ from vllm.attention.selector import backend_name_to_enum, get_attn_backend from vllm.config import VllmConfig from vllm.distributed.kv_transfer.kv_connector.v1.base import ( - KVConnectorBase_V1, KVConnectorMetadata, KVConnectorRole) + CopyBlocksOp, KVConnectorBase_V1, KVConnectorMetadata, KVConnectorRole) from vllm.distributed.parallel_state import ( get_tensor_model_parallel_rank, get_tensor_model_parallel_world_size, get_tp_group) from vllm.distributed.utils import divide from vllm.forward_context import ForwardContext from vllm.logger import init_logger -from vllm.platforms import _Backend +from vllm.platforms import _Backend, current_platform from vllm.utils import make_zmq_path, make_zmq_socket, round_down from vllm.v1.core.sched.output import SchedulerOutput from vllm.v1.request import RequestStatus @@ -40,6 +41,7 @@ Transfer = tuple[int, float] # (xfer_handle, start_time) EngineId = str ReqId = str + GET_META_MSG = b"get_meta_msg" logger = init_logger(__name__) @@ -52,6 +54,13 @@ logger.warning("NIXL is not available") NixlWrapper = None +# Supported xPUs and types of kv transfer buffer. +# {xPU: tuple of supported kv buffer types} +_NIXL_SUPPORTED_XPUS = { + "cuda": ("cuda", ), + "tpu": ("cpu", ), +} + class NixlAgentMetadata( msgspec.Struct, @@ -80,6 +89,7 @@ class NixlConnectorMetadata(KVConnectorMetadata): def __init__(self): self.reqs_to_recv: dict[ReqId, ReqMeta] = {} + self.reqs_to_save: dict[ReqId, ReqMeta] = {} self.reqs_to_send: dict[ReqId, float] = {} def add_new_req( @@ -87,8 +97,12 @@ def add_new_req( request_id: ReqId, local_block_ids: list[int], kv_transfer_params: dict[str, Any], + load_remote_cache: bool = True, + save_to_host: bool = False, ): - self.reqs_to_recv[request_id] = ReqMeta( + # save and load are mutually exclusive + assert load_remote_cache ^ save_to_host + _req = ReqMeta( local_block_ids=local_block_ids, remote_block_ids=kv_transfer_params["remote_block_ids"], remote_engine_id=kv_transfer_params["remote_engine_id"], @@ -97,6 +111,10 @@ def add_new_req( # P workers don't need to receive tp_size from proxy here. 
tp_size=kv_transfer_params.get("tp_size", 1), ) + if save_to_host: + self.reqs_to_save[request_id] = _req + if load_remote_cache: + self.reqs_to_recv[request_id] = _req class NixlConnector(KVConnectorBase_V1): @@ -155,6 +173,10 @@ def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]): assert self.connector_worker is not None self.connector_worker.register_kv_caches(kv_caches) + def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp): + assert self.connector_worker is not None + self.connector_worker.set_host_xfer_buffer_ops(copy_operation) + def get_finished(self, finished_req_ids: set[str]) -> tuple[set[str], set[str]]: """Get the finished recving and sending requests.""" @@ -177,8 +199,11 @@ def save_kv_layer(self, layer_name: str, kv_layer: torch.Tensor, pass def wait_for_save(self): - """NixlConnector does not save explicitly.""" - pass + assert self.connector_worker is not None + assert isinstance(self._connector_metadata, NixlConnectorMetadata) + if self.connector_worker.use_host_buffer and \ + self.connector_worker.copy_blocks: + self.connector_worker.save_kv_to_host(self._connector_metadata) class NixlConnectorScheduler: @@ -193,12 +218,15 @@ def __init__(self, vllm_config: VllmConfig, engine_id: str): envs.VLLM_NIXL_SIDE_CHANNEL_PORT + vllm_config.parallel_config.data_parallel_rank * vllm_config.parallel_config.tensor_parallel_size) + self.use_host_buffer = \ + vllm_config.kv_transfer_config.kv_buffer_device == "cpu" logger.info("Initializing NIXL Scheduler %s", engine_id) # Requests that need to start recv/send. # New requests are added by update_state_after_alloc in # the scheduler. Used to make metadata passed to Worker. self._reqs_need_recv: dict[ReqId, tuple[Request, list[int]]] = {} + self._reqs_need_save: dict[ReqId, tuple[Request, list[int]]] = {} # Reqs to send and their expiration time self._reqs_need_send: dict[ReqId, float] = {} @@ -248,7 +276,25 @@ def update_state_after_alloc(self, request: "Request", "num_external_tokens=%s, kv_transfer_params=%s", num_external_tokens, params) - if params is not None and params.get("do_remote_prefill"): + if not params: + return + if self.use_host_buffer and params.get("do_remote_decode"): + # NOTE: when accelerator is not directly supported by Nixl, + # prefilled blocks need to be saved to host memory before transfer. + + # figure out full computed blocks to save + block_ids = blocks.get_block_ids()[0] + all_full = request.num_tokens % self.block_size == 0 + full_block_ids = (block_ids if all_full else block_ids[:-1]) + # TODO: skip the blocks that are already in the host xfer buffer. + # Currently, the host xfer buffer block is 1-to-1 mapped to device + # kv blocks, so host blocks won't be flushed as long as its device + # block is not overwritten; and it will be safe to skip saving them + # to host xfer buffer. + if full_block_ids: + self._reqs_need_save[request.request_id] = \ + (request, full_block_ids) + elif params.get("do_remote_prefill"): if params.get("remote_block_ids"): if all(p in params for p in ("remote_engine_id", "remote_host", "remote_port")): @@ -260,6 +306,7 @@ def update_state_after_alloc(self, request: "Request", # Get unhashed blocks to pull from remote. self._reqs_need_recv[request.request_id] = ( request, local_block_ids) + else: logger.warning( "Got invalid KVTransferParams: %s. 
This " @@ -284,10 +331,21 @@ def build_connector_meta( kv_transfer_params=req.kv_transfer_params, ) - # Clear the list once workers start the transfers - self._reqs_need_recv.clear() + for req_id, (req, block_ids) in self._reqs_need_save.items(): + assert req.kv_transfer_params is not None + meta.add_new_req( + request_id=req_id, + local_block_ids=block_ids, + kv_transfer_params=req.kv_transfer_params, + load_remote_cache=False, + save_to_host=True, + ) meta.reqs_to_send = self._reqs_need_send + + # Clear the list once workers start the transfers + self._reqs_need_recv.clear() + self._reqs_need_save.clear() self._reqs_need_send = {} return meta @@ -379,9 +437,36 @@ def __init__(self, vllm_config: VllmConfig, engine_id: str): self.tp_rank = get_tensor_model_parallel_rank() self.world_size = get_tensor_model_parallel_world_size() self.tp_group = get_tp_group() + self.num_blocks = 0 # KV Caches and nixl tracking data. - self.kv_caches: dict[str, torch.Tensor] = {} + self.device_type = current_platform.device_type + self.kv_buffer_device: str = \ + vllm_config.kv_transfer_config.kv_buffer_device + if self.device_type not in _NIXL_SUPPORTED_XPUS: + raise RuntimeError(f"{self.device_type} is not supported.") + elif self.kv_buffer_device not in _NIXL_SUPPORTED_XPUS[ + self.device_type]: + raise RuntimeError( + f"{self.device_type} with {self.kv_buffer_device} kv_buffer " + "is not supported.") + self.device_kv_caches: dict[str, torch.Tensor] = {} + + # cpu kv buffer for xfer + # used when xPU memory can not be registered under nixl + self.host_xfer_buffers: dict[str, torch.Tensor] = {} + self.use_host_buffer = self.kv_buffer_device == "cpu" + if self.kv_buffer_device == "cuda": + self.nixl_memory_type = "VRAM" + elif self.kv_buffer_device == "cpu": + self.nixl_memory_type = "DRAM" + else: + raise RuntimeError( + f"{self.device_type} with {self.kv_buffer_device} kv_buffer " + "is not supported.") + + # Note: host xfer buffer ops when use_host_buffer is True + self.copy_blocks: Optional[CopyBlocksOp] = None # Map of engine_id -> kv_caches_base_addr. For TP case, each local # rank will still only pull from a single remote TP worker. @@ -404,6 +489,7 @@ def __init__(self, vllm_config: VllmConfig, engine_id: str): # In progress transfers. # [req_id -> list[handle]] + self._recving_metadata: dict[ReqId, ReqMeta] = {} self._recving_transfers = defaultdict[ReqId, list[Transfer]](list) # Track the expiration time of requests that are waiting to be sent. self._reqs_to_send: dict[ReqId, float] = {} @@ -440,6 +526,7 @@ def __init__(self, vllm_config: VllmConfig, engine_id: str): self.backend_name = backend.get_name() attn_backend = backend_name_to_enum(self.backend_name) self._use_flashinfer = attn_backend == _Backend.FLASHINFER_VLLM_V1 + self._use_pallas_v1 = attn_backend == _Backend.PALLAS_VLLM_V1 logger.debug("Detected attention backend %s", self.backend_name) self._tp_size: dict[EngineId, int] = {self.engine_id: self.world_size} @@ -529,6 +616,31 @@ def _nixl_handshake( # Remote rank -> agent name. 
return {p_remote_rank: remote_agent_name} + def initialize_host_xfer_buffer( + self, kv_caches: dict[str, torch.Tensor]) -> None: + """ + Initialize transfer buffer in CPU mem for accelerators + NOT directly supported by NIXL (e.g., tpu) + """ + xfer_buffers: dict[str, torch.Tensor] = {} + try: + for layer_name, kv_cache in kv_caches.items(): + kv_shape = kv_cache.shape + kv_dtype = kv_cache.dtype + xfer_buffers[layer_name] = torch.empty(kv_shape, + dtype=kv_dtype, + device="cpu") + except MemoryError as e: + logger.error("NIXLConnectorWorker gets %s.", e) + raise + + self.host_xfer_buffers = xfer_buffers + + def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp): + """Assign copy (d2h, h2d) operations when host buffer is used.""" + assert self.use_host_buffer + self.copy_blocks = copy_operation + def _background_nixl_handshake(self, req_id: str, remote_engine_id: EngineId, meta: ReqMeta): # Do NIXL handshake in background and add to _ready_requests when done. @@ -562,47 +674,76 @@ def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]): _, first_kv_cache = next(iter(kv_caches.items())) kv_elem_size = first_kv_cache.element_size() + if self.use_host_buffer: + self.initialize_host_xfer_buffer(kv_caches=kv_caches) + assert len(self.host_xfer_buffers) == len(kv_caches), ( + f"host_buffer: {len(self.host_xfer_buffers)}, " + f"kv_caches: {len(kv_caches)}") + xfer_buffers = self.host_xfer_buffers + else: + xfer_buffers = kv_caches + assert not self.host_xfer_buffers, ( + "host_xfer_buffer should not be initialized when " + f"kv_buffer_device is {self.kv_buffer_device}") + # TODO(tms): Find a more robust way to detect and handle MLA # NOTE (NickLucche) To move blocks efficiently with NIXL, the expected # KV memory layout is HND, as opposed to the default NHD. Note that it # will only affects the strides. For MLA instead, we make require no # such thing and resort to the standard layout. use_mla = len(first_kv_cache.shape) == 3 - assert use_mla == self.use_mla - - # TODO (NickLucche) not compatible with hybrid allocator. Enforce check - # once it goes live, as a single kv layout is expected for xfers. - if use_mla: - # MLA case. + if self.device_type == "tpu": + assert not use_mla, f"{self.kv_buffer_device} does not support MLA." + assert self._use_pallas_v1, f"attn backend: {self.backend_name}" + # tpu (v1) kv shape per layer: + # (num_blocks, block_size, num_kv_heads * 2, head_size) self.num_blocks = first_kv_cache.shape[0] - block_rank = 2 # [block_size, latent_dim] + block_rank = 3 # [block_size, kv_heads, head_dim] block_shape = first_kv_cache.shape[-block_rank:] - block_size, kv_latent_dim = block_shape - self.slot_size_bytes = kv_elem_size * kv_latent_dim - else: - # [2 (k and v), num_blocks, ...] - if self._use_flashinfer: - # FlashInfer swaps 2<->num_blocks dimensions. + block_size, n_kv_heads_x_2, head_dim = block_shape + self.slot_size_bytes = kv_elem_size * n_kv_heads_x_2 * head_dim + elif self.device_type == "cuda": + assert use_mla == self.use_mla + # TODO (NickLucche) not compatible with hybrid allocator. + # Enforce check once it goes live, as a single kv layout + # is expected for xfers. + if use_mla: + # MLA case. 
self.num_blocks = first_kv_cache.shape[0] - block_rank = 4 # [2, block_size, kv_heads, head_dim] + block_rank = 2 # [block_size, latent_dim] + block_shape = first_kv_cache.shape[-block_rank:] + block_size, kv_latent_dim = block_shape + self.slot_size_bytes = kv_elem_size * kv_latent_dim else: - self.num_blocks = first_kv_cache.shape[1] - block_rank = 3 # [block_size, kv_heads, head_dim] - block_shape = first_kv_cache.shape[-block_rank:] - block_size, n_kv_heads, head_dim = block_shape[-3:] - # head size in bytes. - self.slot_size_bytes = kv_elem_size * n_kv_heads * head_dim - assert block_size == self.block_size + # [2 (k and v), num_blocks, ...] + if self._use_flashinfer: + # FlashInfer swaps 2<->num_blocks dimensions. + self.num_blocks = first_kv_cache.shape[0] + block_rank = 4 # [2, block_size, kv_heads, head_dim] + else: + self.num_blocks = first_kv_cache.shape[1] + block_rank = 3 # [block_size, kv_heads, head_dim] + block_shape = first_kv_cache.shape[-block_rank:] + block_size, n_kv_heads, head_dim = block_shape[-3:] + # head size in bytes. + self.slot_size_bytes = kv_elem_size * n_kv_heads * head_dim + assert block_size == self.block_size + else: + raise RuntimeError( + f"{self.device_type} ({self.backend_name}) is not supported.") + # TODO(tms): self.block_len needs to be per-layer for sliding window, # hybrid attn, etc # block size in bytes self.block_len = kv_elem_size * math.prod(block_shape) logger.info( - "Registering KV_Caches: use_mla: %s, num_blocks: %s, " - "block_shape: %s, per_layer_kv_cache_shape: %s", use_mla, - self.num_blocks, block_shape, first_kv_cache.shape) + "Registering KV_Caches. use_mla: %s, kv_buffer_device: %s, " + "use_host_buffer: %s, num_blocks: %s, block_shape: %s, " + "per_layer_kv_cache_shape: %s", use_mla, self.kv_buffer_device, + self.use_host_buffer, self.num_blocks, block_shape, + first_kv_cache.shape) self.dst_num_blocks[self.engine_id] = self.num_blocks - self.kv_caches = kv_caches + self.device_kv_caches = kv_caches kv_caches_base_addr = [] caches_data = [] @@ -614,19 +755,21 @@ def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]): # (roughly 8KB vs 5KB). # Conversely for FlashInfer, K and V are transferred in the same tensor # to better exploit the memory layout (ie num_blocks is the first dim). - for cache_or_caches in kv_caches.values(): + for cache_or_caches in xfer_buffers.values(): # Normalize to always be a list of caches - cache_list = [cache_or_caches] if use_mla or self._use_flashinfer \ - else cache_or_caches + cache_list = [cache_or_caches] if use_mla \ + or self._use_pallas_v1 or self._use_flashinfer \ + else cache_or_caches for cache in cache_list: base_addr = cache.data_ptr() region_len = self.num_blocks * self.block_len - caches_data.append( - (base_addr, region_len, cache.device.index, "")) + # NOTE: use tp_rank for device_id since multi-node TP + # is rarely used. 
+ caches_data.append((base_addr, region_len, self.tp_rank, "")) kv_caches_base_addr.append(base_addr) self.kv_caches_base_addr[self.engine_id] = kv_caches_base_addr self.num_regions = len(caches_data) - self.num_layers = len(self.kv_caches.keys()) + self.num_layers = len(xfer_buffers.keys()) # TODO(mgoin): remove this once we have hybrid memory allocator # Optimization for models with local attention (Llama 4) @@ -648,7 +791,8 @@ def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]): self.block_window_per_layer) assert len(self.block_window_per_layer) == self.num_layers - descs = self.nixl_wrapper.get_reg_descs(caches_data, "VRAM") + descs = self.nixl_wrapper.get_reg_descs(caches_data, + self.nixl_memory_type) logger.debug("Registering descs: %s", caches_data) self.nixl_wrapper.register_memory(descs) logger.debug("Done registering descs") @@ -666,11 +810,13 @@ def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]): block_offset = block_id * self.block_len addr = base_addr + block_offset # (addr, len, device id) + # TODO: does device_id matter to DRAM? blocks_data.append((addr, self.block_len, self.tp_rank)) logger.debug("Created %s blocks for src engine %s and rank %s", len(blocks_data), self.engine_id, self.tp_rank) - descs = self.nixl_wrapper.get_xfer_descs(blocks_data, "VRAM") + descs = self.nixl_wrapper.get_xfer_descs(blocks_data, + self.nixl_memory_type) # NIXL_INIT_AGENT to be used for preparations of local descs. self.src_xfer_side_handle = self.nixl_wrapper.prep_xfer_dlist( "NIXL_INIT_AGENT", descs) @@ -755,6 +901,8 @@ def add_remote_agent(self, tp_ratio = divide(self._tp_size[self.engine_id], self._tp_size[engine_id]) assert tp_ratio > 0, "Decode TP cannot be smaller than prefill TP" + assert not self._use_pallas_v1 or tp_ratio == 1, \ + "TPU (pallas_v1) DOES NOT support heterogeneous TP yet." # Handle tp_size>num_kv_heads: replicate KV cache. total_num_kv_heads = self.model_config.get_total_num_kv_heads() @@ -813,13 +961,43 @@ def add_remote_agent(self, self.tp_rank) # Register with NIXL. - descs = self.nixl_wrapper.get_xfer_descs(blocks_data, "VRAM") + descs = self.nixl_wrapper.get_xfer_descs(blocks_data, + self.nixl_memory_type) self.dst_xfer_side_handles[ engine_id] = self.nixl_wrapper.prep_xfer_dlist( remote_agent_name, descs) return remote_agent_name + def sync_recved_kv_to_device(self, req_id: str, meta: ReqMeta): + """copy recved kv from host buffer to device.""" + assert self.use_host_buffer + assert self.copy_blocks is not None + + local_block_ids = meta.local_block_ids + self.copy_blocks(self.host_xfer_buffers, self.device_kv_caches, + local_block_ids, local_block_ids, "h2d") + if logger.isEnabledFor(logging.DEBUG): + logger.debug( + "synced recved kv of request[%s] to device kv buffer," + "local_block_ids: %s. ", req_id, + ",".join(map(str, meta.local_block_ids))) + + def save_kv_to_host(self, metadata: NixlConnectorMetadata): + """copy kv from device to host buffer.""" + assert self.use_host_buffer + assert self.copy_blocks is not None + + for req_id, meta in metadata.reqs_to_save.items(): + if logger.isEnabledFor(logging.DEBUG): + logger.debug( + "save_load_kv for request[%s] to host xfer buffer." + "local_block_ids: %s. ", req_id, + ",".join(map(str, meta.local_block_ids))) + # blocking + self.copy_blocks(self.device_kv_caches, self.host_xfer_buffers, + meta.local_block_ids, meta.local_block_ids, "d2h") + def get_finished(self) -> tuple[set[str], set[str]]: """ Get requests that are done sending or recving on this specific worker. 
@@ -834,6 +1012,12 @@ def get_finished(self) -> tuple[set[str], set[str]]: "and %s requests done recving", self.tp_rank, len(done_sending), len(done_recving)) + if self.use_host_buffer: + for req_id in done_recving: + meta = self._recving_metadata.pop(req_id) + assert meta, f"{req_id} not found in recving_metadata list" + self.sync_recved_kv_to_device(req_id, meta) + # Handle timeout to avoid stranding blocks on remote. now = time.perf_counter() while self._reqs_to_send: @@ -904,6 +1088,8 @@ def start_load_kv(self, metadata: NixlConnectorMetadata): "Num local_block_ids: %s. Num remote_block_ids: %s. ", req_id, remote_engine_id, len(meta.local_block_ids), len(meta.remote_block_ids)) + if self.use_host_buffer: + self._recving_metadata[req_id] = meta if remote_engine_id not in self._remote_agents: # Initiate handshake with remote engine to exchange metadata. with self._handshake_lock: diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index a5bf197ba16..32004ced4aa 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -1,7 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -import copy import gc import time from contextlib import contextmanager @@ -23,12 +22,10 @@ from vllm.distributed.eplb.eplb_state import EplbState from vllm.distributed.kv_transfer import (get_kv_transfer_group, has_kv_transfer_group) -from vllm.distributed.kv_transfer.kv_connector.v1 import KVConnectorBase_V1 from vllm.distributed.parallel_state import ( get_pp_group, get_tp_group, graph_capture, is_global_first_rank, prepare_communication_buffer_for_model) -from vllm.forward_context import (DPMetadata, get_forward_context, - set_forward_context) +from vllm.forward_context import DPMetadata, set_forward_context from vllm.logger import init_logger from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaBase from vllm.model_executor.layers.rotary_embedding import MRotaryEmbedding @@ -66,6 +63,8 @@ from vllm.v1.spec_decode.metadata import SpecDecodeMetadata from vllm.v1.spec_decode.ngram_proposer import NgramProposer from vllm.v1.worker.gpu_input_batch import CachedRequestState, InputBatch +from vllm.v1.worker.kv_connector_model_runner_mixin import ( + KVConnectorModelRunnerMixin) from vllm.v1.worker.lora_model_runner_mixin import LoRAModelRunnerMixin from ..sample.logits_processor import LogitsProcessorManager @@ -88,7 +87,7 @@ logger = init_logger(__name__) -class GPUModelRunner(LoRAModelRunnerMixin): +class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin): def __init__( self, @@ -1357,7 +1356,8 @@ def execute_model( # Return empty ModelRunnerOutput if there's no work to do. return EMPTY_MODEL_RUNNER_OUTPUT - return self.kv_connector_no_forward(scheduler_output) + return self.kv_connector_no_forward(scheduler_output, + self.vllm_config) # Prepare the decoder inputs. (attn_metadata, attention_cuda_graphs, logits_indices, @@ -1745,52 +1745,6 @@ def propose_draft_token_ids( spec_token_ids = draft_token_ids.tolist() return spec_token_ids - @staticmethod - def maybe_setup_kv_connector(scheduler_output: "SchedulerOutput"): - # Update KVConnector with the KVConnector metadata forward(). 
- if has_kv_transfer_group(): - kv_connector = get_kv_transfer_group() - assert isinstance(kv_connector, KVConnectorBase_V1) - assert scheduler_output.kv_connector_metadata is not None - kv_connector.bind_connector_metadata( - scheduler_output.kv_connector_metadata) - - # Background KV cache transfers happen here. - # These transfers are designed to be async and the requests - # involved may be disjoint from the running requests. - # Do this here to save a collective_rpc. - kv_connector.start_load_kv(get_forward_context()) - - @staticmethod - def maybe_wait_for_kv_save() -> None: - if has_kv_transfer_group(): - get_kv_transfer_group().wait_for_save() - - @staticmethod - def get_finished_kv_transfers( - scheduler_output: "SchedulerOutput", - ) -> tuple[Optional[set[str]], Optional[set[str]]]: - if has_kv_transfer_group(): - return get_kv_transfer_group().get_finished( - scheduler_output.finished_req_ids) - return None, None - - def kv_connector_no_forward( - self, scheduler_output: "SchedulerOutput") -> ModelRunnerOutput: - # KV send/recv even if no work to do. - with set_forward_context(None, self.vllm_config): - self.maybe_setup_kv_connector(scheduler_output) - finished_sending, finished_recving = ( - self.get_finished_kv_transfers(scheduler_output)) - - if not finished_sending and not finished_recving: - return EMPTY_MODEL_RUNNER_OUTPUT - - output = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT) - output.finished_sending = finished_sending - output.finished_recving = finished_recving - return output - def propose_ngram_draft_token_ids( self, sampled_token_ids: list[list[int]], diff --git a/vllm/v1/worker/kv_connector_model_runner_mixin.py b/vllm/v1/worker/kv_connector_model_runner_mixin.py new file mode 100644 index 00000000000..5a3186058fc --- /dev/null +++ b/vllm/v1/worker/kv_connector_model_runner_mixin.py @@ -0,0 +1,70 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +Define KV connector functionality mixin for model runners. +""" +import copy +from typing import TYPE_CHECKING, Optional + +from vllm.config import VllmConfig +from vllm.distributed.kv_transfer import (get_kv_transfer_group, + has_kv_transfer_group) +from vllm.distributed.kv_transfer.kv_connector.v1 import KVConnectorBase_V1 +from vllm.forward_context import get_forward_context, set_forward_context +from vllm.logger import init_logger +from vllm.v1.outputs import EMPTY_MODEL_RUNNER_OUTPUT, ModelRunnerOutput + +if TYPE_CHECKING: + from vllm.v1.core.sched.output import SchedulerOutput + +logger = init_logger(__name__) + + +# Defined as a kv connector functionality mixin for ModelRunner (GPU, TPU) +class KVConnectorModelRunnerMixin: + + @staticmethod + def maybe_setup_kv_connector(scheduler_output: "SchedulerOutput"): + # Update KVConnector with the KVConnector metadata forward(). + if has_kv_transfer_group(): + kv_connector = get_kv_transfer_group() + assert isinstance(kv_connector, KVConnectorBase_V1) + assert scheduler_output.kv_connector_metadata is not None + kv_connector.bind_connector_metadata( + scheduler_output.kv_connector_metadata) + + # Background KV cache transfers happen here. + # These transfers are designed to be async and the requests + # involved may be disjoint from the running requests. + # Do this here to save a collective_rpc. 
+ kv_connector.start_load_kv(get_forward_context()) + + @staticmethod + def maybe_wait_for_kv_save() -> None: + if has_kv_transfer_group(): + get_kv_transfer_group().wait_for_save() + + @staticmethod + def get_finished_kv_transfers( + scheduler_output: "SchedulerOutput", + ) -> tuple[Optional[set[str]], Optional[set[str]]]: + if has_kv_transfer_group(): + return get_kv_transfer_group().get_finished( + scheduler_output.finished_req_ids) + return None, None + + def kv_connector_no_forward(self, scheduler_output: "SchedulerOutput", + vllm_config: VllmConfig) -> ModelRunnerOutput: + # KV send/recv even if no work to do. + with set_forward_context(None, vllm_config): + self.maybe_setup_kv_connector(scheduler_output) + finished_sending, finished_recving = ( + self.get_finished_kv_transfers(scheduler_output)) + + if not finished_sending and not finished_recving: + return EMPTY_MODEL_RUNNER_OUTPUT + + output = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT) + output.finished_sending = finished_sending + output.finished_recving = finished_recving + return output diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index 3bb033f1487..e8c80084589 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -3,7 +3,7 @@ import bisect import gc import time -from typing import TYPE_CHECKING, Any, Optional, cast +from typing import TYPE_CHECKING, Any, Literal, Optional, Union, cast from unittest.mock import patch import numpy as np @@ -20,6 +20,8 @@ from vllm.compilation.wrapper import TorchCompileWrapperWithCustomDispatcher from vllm.config import (ParallelConfig, VllmConfig, get_layers_from_vllm_config, update_config) +from vllm.distributed.kv_transfer import (get_kv_transfer_group, + has_kv_transfer_group) from vllm.forward_context import set_forward_context from vllm.logger import init_logger from vllm.lora.layers import BaseLayerWithLoRA @@ -46,6 +48,8 @@ LogprobsTensors, ModelRunnerOutput) from vllm.v1.sample.tpu.metadata import TPUSupportedSamplingMetadata from vllm.v1.sample.tpu.sampler import Sampler as TPUSampler +from vllm.v1.worker.kv_connector_model_runner_mixin import ( + KVConnectorModelRunnerMixin) from vllm.v1.worker.lora_model_runner_mixin import LoRAModelRunnerMixin from vllm.v1.worker.tpu_input_batch import CachedRequestState, InputBatch @@ -97,7 +101,7 @@ # The dummy_run should be comprehensive, ensuring all potential input shapes and # branch predictions are included as subgraph inputs to facilitate # pre-compilation. -class TPUModelRunner(LoRAModelRunnerMixin): +class TPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin): def __init__( self, @@ -971,8 +975,12 @@ def execute_model( # Update cached state self._update_states(scheduler_output) if not scheduler_output.total_num_scheduled_tokens: - # Return empty ModelRunnerOutput if there's no work to do. - return EMPTY_MODEL_RUNNER_OUTPUT + if not has_kv_transfer_group(): + # Return empty ModelRunnerOutput if there's no work to do. + return EMPTY_MODEL_RUNNER_OUTPUT + + return self.kv_connector_no_forward(scheduler_output, + self.vllm_config) if self.is_multimodal_model: # Run the multimodal encoder if any. @@ -986,6 +994,12 @@ def execute_model( start_index = 0 combined_selected_tokens: list[torch.Tensor] = [] combined_logprobs: list[LogprobsLists] = [] + + # NOTE: setup current batch's metadata for kv connector. 
+ # Currently, only verified with NixlConnector + with set_forward_context(None, self.vllm_config): + self.maybe_setup_kv_connector(scheduler_output) + while start_index < self.input_batch.num_reqs: attn_metadata, logits_indices, padded_num_reqs, num_reqs,\ end_index = self._prepare_inputs(scheduler_output, start_index) @@ -1032,6 +1046,14 @@ def execute_model( start_index = end_index + # NOTE: current kv load and save get h2d/d2h copies involved. + # Those copies are blocking. Once they become async., kv_save + # should be called right after each single forward pass, + # instead of the forwards of the entire input batch. + self.maybe_wait_for_kv_save() + finished_sending, finished_recving = ( + self.get_finished_kv_transfers(scheduler_output)) + selected_token_ids = torch.cat(combined_selected_tokens, dim=0) if tpu_sampling_metadata.logprobs: @@ -1126,6 +1148,8 @@ def concat_lists(input_lists): logprobs=logprobs_lists, prompt_logprobs_dict=prompt_logprobs_dict, pooler_output=[], + finished_sending=finished_sending, + finished_recving=finished_recving, ) # Check there are no new graphs compiled - all the graphs should be @@ -1637,6 +1661,10 @@ def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None: for cache in self.kv_caches: xs.mark_sharding(cache, self.mesh, (None, 'x', None, None)) + if has_kv_transfer_group(): + get_kv_transfer_group().register_kv_caches(kv_caches) + get_kv_transfer_group().set_host_xfer_buffer_ops(copy_kv_blocks) + def reset_dynamo_cache(self): if self.is_multimodal_model: compiled_model = self.model.get_language_model().model @@ -1851,6 +1879,75 @@ def _get_padded_token_len(paddings: list[int], x: int) -> int: return paddings[index] +def _make_src_and_dst_indices( + src_block_ids: list[int], + dst_block_ids: list[int], + src_device: Union[torch.device, str], + dst_device: Union[torch.device, str], +) -> tuple[torch.Tensor, torch.Tensor]: + src_indices = torch.tensor(src_block_ids, + device=src_device, + dtype=torch.int64) + dst_indices = torch.tensor(dst_block_ids, + device=dst_device, + dtype=torch.int64) + return src_indices, dst_indices + + +@torch.compile(backend="openxla") +def _insert_blocks_to_tpu( + cpu_cache: torch.Tensor, + tpu_cache: torch.Tensor, + cpu_block_indices: torch.Tensor, + tpu_block_indices: torch.Tensor, +) -> None: + torch.ops.xla.dynamo_set_buffer_donor_(tpu_cache, True) + tpu_cache[tpu_block_indices] = cpu_cache[cpu_block_indices].to( + tpu_cache.device) + + +@torch.compile(backend="openxla") +def _swap_out_tpu_blocks( + tpu_cache: torch.Tensor, + cpu_cache: torch.Tensor, + tpu_block_indices: torch.Tensor, + cpu_block_indices: torch.Tensor, +) -> None: + """ tpu blocks to cpu blocks""" + torch.ops.xla.dynamo_set_buffer_donor_(tpu_cache, True) + cpu_cache[cpu_block_indices] = tpu_cache[tpu_block_indices].cpu() + + +def copy_kv_blocks( + src_kv_caches: dict[str, torch.Tensor], + dst_kv_caches: dict[str, torch.Tensor], + src_block_ids: list[int], + dst_block_ids: list[int], + direction: Literal["h2d", "d2h"], +) -> None: + """Copy kv blocks between different buffers.""" + if not src_kv_caches or not dst_kv_caches or \ + not src_block_ids or not dst_block_ids or \ + len(src_block_ids) != len(dst_block_ids): + return + + src_device = next(iter(src_kv_caches.values())).device + dst_device = next(iter(dst_kv_caches.values())).device + + src_indices, dst_indices = _make_src_and_dst_indices( + src_block_ids=src_block_ids, + dst_block_ids=dst_block_ids, + src_device=src_device, + dst_device=dst_device) + + _copy_fn = 
_insert_blocks_to_tpu if direction == "h2d" else \ + _swap_out_tpu_blocks + for layer_name in src_kv_caches: + src_tensor = src_kv_caches[layer_name] + dst_tensor = dst_kv_caches[layer_name] + _copy_fn(src_tensor, dst_tensor, src_indices, dst_indices) + + def _get_padded_num_kv_cache_update_slices( num_tokens: int, max_num_reqs: int, page_size: int, num_slices_per_kv_cache_update_block: int) -> int: diff --git a/vllm/v1/worker/tpu_worker.py b/vllm/v1/worker/tpu_worker.py index 648d9c3195c..254b058d2cd 100644 --- a/vllm/v1/worker/tpu_worker.py +++ b/vllm/v1/worker/tpu_worker.py @@ -12,9 +12,11 @@ import torch_xla.runtime as xr import vllm.envs as envs -from vllm.config import ParallelConfig, VllmConfig +from vllm.config import VllmConfig from vllm.distributed import (ensure_model_parallel_initialized, init_distributed_environment) +from vllm.distributed.kv_transfer import (ensure_kv_transfer_initialized, + has_kv_transfer_group) from vllm.logger import init_logger from vllm.lora.request import LoRARequest from vllm.model_executor import set_random_seed @@ -118,7 +120,7 @@ def init_device(self): # Initialize the distributed environment. self._init_tpu_worker_distributed_environment( - self.parallel_config, self.rank, self.distributed_init_method, + self.vllm_config, self.rank, self.distributed_init_method, self.local_rank) # Device initialization should happen after initializing @@ -242,7 +244,9 @@ def execute_model( scheduler_output: "SchedulerOutput", ) -> Optional[ModelRunnerOutput]: output = self.model_runner.execute_model(scheduler_output) - return output if self.is_driver_worker else None + # every worker's output is needed when kv_transfer_group is setup + return output if self.is_driver_worker or has_kv_transfer_group( + ) else None def profile(self, is_start: bool = True): if self.rank < 1: @@ -294,7 +298,7 @@ def check_health(self) -> None: def _init_tpu_worker_distributed_environment( self, - parallel_config: ParallelConfig, + vllm_config: VllmConfig, rank: int, distributed_init_method: Optional[str] = None, local_rank: int = -1, @@ -306,6 +310,7 @@ def _init_tpu_worker_distributed_environment( # the input objects on CPU. The all-reduce and all-gather ops on TPU # are invoked by `xm.all_reduce` and `xm.all_gather` which use their # own context. + parallel_config = vllm_config.parallel_config init_distributed_environment( world_size=parallel_config.world_size, rank=rank, @@ -317,6 +322,8 @@ def _init_tpu_worker_distributed_environment( parallel_config.tensor_parallel_size, parallel_config.pipeline_parallel_size) + ensure_kv_transfer_initialized(vllm_config) + try: from tpu_commons.worker import TPUWorker as TPUCommonsWorker From e639d1b8cf6c70af2285fc0bbc8e097ce52409ae Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 24 Jul 2025 10:36:56 -0700 Subject: [PATCH 332/552] [Docs][minor] Fix broken gh-file link in distributed serving docs (#21543) Signed-off-by: Ricardo Decal Signed-off-by: x22x22 --- docs/serving/distributed_serving.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index d1ea29404de..4f111115f30 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -62,7 +62,7 @@ If a single node lacks sufficient GPUs to hold the model, deploy vLLM across mul ### Ray cluster setup with containers -The helper script `` starts containers across nodes and initializes Ray. 
By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command. +The helper script starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command. Choose one node as the head node and run: From 106943d319f9954867df7b6ca68cce32fe22e274 Mon Sep 17 00:00:00 2001 From: Simon Mo Date: Thu, 24 Jul 2025 12:36:06 -0700 Subject: [PATCH 333/552] [Docs] Add Expert Parallelism Initial Documentation (#21373) Signed-off-by: simon-mo Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/serving/expert_parallel_deployment.md | 244 +++++++++++++++++++++ 1 file changed, 244 insertions(+) create mode 100644 docs/serving/expert_parallel_deployment.md diff --git a/docs/serving/expert_parallel_deployment.md b/docs/serving/expert_parallel_deployment.md new file mode 100644 index 00000000000..d79b6fc5901 --- /dev/null +++ b/docs/serving/expert_parallel_deployment.md @@ -0,0 +1,244 @@ +# Expert Parallel Deployment + +vLLM supports Expert Parallelism (EP), which allows experts in Mixture-of-Experts (MoE) models to be deployed on separate GPUs, increasing locality, efficiency, and throughput overall. + +EP is typically coupled with Data Parallelism (DP). While DP can be used independently of EP, EP is more efficient when used in conjunction with DP. You can read more about data parallelism [here](data_parallel_deployment.md). + +## Prerequisites + +Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future: + +1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels). +2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation). +3. **For disaggregated serving**: Install UCX and NIXL following the [script](gh-file:tools/install_nixl.sh). + +### Backend Selection Guide + +vLLM provides three communication backends for EP: + +| Backend | Use Case | Features | Best For | +|---------|----------|----------|----------| +| `pplx` | Single node | Chunked prefill support | Development, best for intra-node deployments | +| `deepep_high_throughput` | Multi-node prefill | Grouped GEMM with continuous layout | High-throughput scenarios, prefill-dominated workloads | +| `deepep_low_latency` | Multi-node decode | CUDA graph support, masked layout | Low-latency scenarios, decode-dominated workloads | + +## Single Node Deployment + +!!! warning + EP is an experimental feature. Argument names and default values may change in the future. + +### Configuration + +Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as: + +``` +EP_SIZE = TP_SIZE × DP_SIZE +``` + +Where: +- `TP_SIZE`: Tensor parallel size (always 1 for now) +- `DP_SIZE`: Data parallel size +- `EP_SIZE`: Expert parallel size (computed automatically) + +### Example Command + +The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. 
The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section. + +```bash +# Single node EP deployment with pplx backend +VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 \ + vllm serve deepseek-ai/DeepSeek-V3-0324 \ + --tensor-parallel-size 1 \ # Tensor parallelism across 1 GPU + --data-parallel-size 8 \ # Data parallelism across 8 processes + --enable-expert-parallel # Enable expert parallelism +``` + +## Multi-Node Deployment + +For multi-node deployment, use the DeepEP communication kernel with one of two modes (see [Backend Selection Guide](#backend-selection-guide) above). + +### Deployment Steps + +1. **Run one command per node** - Each node requires its own launch command +2. **Configure networking** - Ensure proper IP addresses and port configurations +3. **Set node roles** - First node handles requests, additional nodes run in headless mode + +### Example: 2-Node Deployment + +The following example deploys `DeepSeek-V3-0324` across 2 nodes using `deepep_low_latency` mode: + +```bash +# Node 1 (Primary - handles incoming requests) +VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \ + vllm serve deepseek-ai/DeepSeek-V3-0324 \ + --tensor-parallel-size 1 \ # TP size per node + --enable-expert-parallel \ # Enable EP + --data-parallel-size 16 \ # Total DP size across all nodes + --data-parallel-size-local 8 \ # Local DP size on this node (8 GPUs per node) + --data-parallel-address 192.168.1.100 \ # Replace with actual IP of Node 1 + --data-parallel-rpc-port 13345 \ # RPC communication port, can be any port as long as reachable by all nodes + --api-server-count=8 # Number of API servers for load handling (scaling this out to total ranks are recommended) + +# Node 2 (Secondary - headless mode, no API server) +VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \ + vllm serve deepseek-ai/DeepSeek-V3-0324 \ + --tensor-parallel-size 1 \ # TP size per node + --enable-expert-parallel \ # Enable EP + --data-parallel-size 16 \ # Total DP size across all nodes + --data-parallel-size-local 8 \ # Local DP size on this node + --data-parallel-start-rank 8 \ # Starting rank offset for this node + --data-parallel-address 192.168.1.100 \ # IP of primary node (Node 1) + --data-parallel-rpc-port 13345 \ # Same RPC port as primary + --headless # No API server, worker only +``` + +### Key Configuration Notes + +- **Headless mode**: Secondary nodes run with `--headless` flag, meaning all client requests are handled by the primary node +- **Rank calculation**: `--data-parallel-start-rank` should equal the cumulative local DP size of previous nodes +- **Load scaling**: Adjust `--api-server-count` on the primary node to handle higher request loads + +### Network Configuration + +!!! important "InfiniBand Clusters" + On InfiniBand networked clusters, set this environment variable to prevent initialization hangs: + ```bash + export GLOO_SOCKET_IFNAME=eth0 + ``` + This ensures torch distributed group discovery uses Ethernet instead of InfiniBand for initial setup. + +## Expert Parallel Load Balancer (EPLB) + +While MoE models are typically trained so that each expert receives a similar number of tokens, in practice the distribution of tokens across experts can be highly skewed. vLLM provides an Expert Parallel Load Balancer (EPLB) to redistribute expert mappings across EP ranks, evening the load across experts. 
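+For intuition, the balancedness that EPLB tracks (see `--eplb-log-balancedness` below) is roughly the average number of tokens routed per expert divided by the maximum routed to any single expert. The snippet below is a minimal illustrative sketch of that metric only, not vLLM's internal implementation; the function name and its input are assumptions made for the example:
+
+```python
+# Illustrative only: compute a balancedness score from per-expert token counts.
+# tokens_per_expert is assumed to be the token counts collected over a window.
+def balancedness(tokens_per_expert: list[int]) -> float:
+    if not tokens_per_expert or max(tokens_per_expert) == 0:
+        return 1.0  # nothing routed yet; treat as perfectly balanced
+    avg = sum(tokens_per_expert) / len(tokens_per_expert)
+    return avg / max(tokens_per_expert)  # 1.0 = even load, approaches 0 as skew grows
+
+print(balancedness([100, 100, 100, 100]))  # 1.0  (even load)
+print(balancedness([400, 50, 50, 100]))    # 0.375 (skewed load)
+```
+
+A score close to 1.0 means experts receive similar traffic; lower scores indicate skew that EPLB tries to correct by remapping experts across EP ranks.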
+ +### Configuration + +Enable EPLB with the `--enable-eplb` flag. + +!!! note "Model Support" + Currently only DeepSeek V3 architecture is supported. + +When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution. + +### EPLB Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `--eplb-window-size` | Number of engine steps to track for rebalancing decisions | - | +| `--eplb-step-interval` | Frequency of rebalancing (every N engine steps) | - | +| `--eplb-log-balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` | +| `--num-redundant-experts` | Additional global experts per EP rank beyond equal distribution | `0` | + +### Expert Distribution Formula + +- **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts +- **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts + +### Example Command + +Single node deployment with EPLB enabled: + +```bash +# Single node with EPLB load balancing +VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 vllm serve deepseek-ai/DeepSeek-V3-0324 \ + --tensor-parallel-size 1 \ # Tensor parallelism + --data-parallel-size 8 \ # Data parallelism + --enable-expert-parallel \ # Enable EP + --enable-eplb \ # Enable load balancer + --eplb-log-balancedness \ # Log balancing metrics + --eplb-window-size 1000 \ # Track last 1000 engine steps + --eplb-step-interval 3000 # Rebalance every 3000 steps +``` + +For multi-node deployment, add these EPLB flags to each node's command. We recommend setting `--num-redundant-experts` to 32 in large scale use cases so the most popular experts are always available. + +## Disaggregated Serving (Prefill/Decode Split) + +For production deployments requiring strict SLA guarantees for time-to-first-token and inter-token latency, disaggregated serving allows independent scaling of prefill and decode operations. + +### Architecture Overview + +- **Prefill Instance**: Uses `deepep_high_throughput` backend for optimal prefill performance +- **Decode Instance**: Uses `deepep_low_latency` backend for minimal decode latency +- **KV Cache Transfer**: Connects instances via NIXL or other KV connectors + +### Setup Steps + +1. **Install KV Connector**: Install NIXL using the [installation script](gh-file:tools/install_nixl.sh) + +2. **Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}` + +3. **Client Orchestration**: Use the client-side script below to coordinate prefill/decode operations. We are actively working on routing solutions. 
+ +### Client Orchestration Example + +```python +from openai import OpenAI +import uuid + +try: + # 1: Set up clients for prefill and decode instances + openai_api_key = "EMPTY" # vLLM doesn't require a real API key + + # Replace these IP addresses with your actual instance addresses + prefill_client = OpenAI( + api_key=openai_api_key, + base_url="http://192.168.1.100:8000/v1", # Prefill instance URL + ) + decode_client = OpenAI( + api_key=openai_api_key, + base_url="http://192.168.1.101:8001/v1", # Decode instance URL + ) + + # Get model name from prefill instance + models = prefill_client.models.list() + model = models.data[0].id + print(f"Using model: {model}") + + # 2: Prefill Phase + # Generate unique request ID to link prefill and decode operations + request_id = str(uuid.uuid4()) + print(f"Request ID: {request_id}") + + prefill_response = prefill_client.completions.create( + model=model, + # Prompt must exceed vLLM's block size (16 tokens) for PD to work + prompt="Write a detailed explanation of Paged Attention for Transformers works including the management of KV cache for multi-turn conversations", + max_tokens=1, # Force prefill-only operation + extra_body={ + "kv_transfer_params": { + "do_remote_decode": True, # Enable remote decode + "do_remote_prefill": False, # This is the prefill instance + "remote_engine_id": None, # Will be populated by vLLM + "remote_block_ids": None, # Will be populated by vLLM + "remote_host": None, # Will be populated by vLLM + "remote_port": None # Will be populated by vLLM + } + }, + extra_headers={"X-Request-Id": request_id} + ) + + print("-" * 50) + print("✓ Prefill completed successfully") + print(f"Prefill response: {prefill_response.choices[0].text}") + + # 3: Decode Phase + # Transfer KV cache parameters from prefill to decode instance + decode_response = decode_client.completions.create( + model=model, + prompt="This prompt is ignored during decode", # Original prompt not needed + max_tokens=150, # Generate up to 150 tokens + extra_body={ + "kv_transfer_params": prefill_response.kv_transfer_params # Pass KV cache info + }, + extra_headers={"X-Request-Id": request_id} # Same request ID + ) + + print("-" * 50) + print("✓ Decode completed successfully") + print(f"Final response: {decode_response.choices[0].text}") + +except Exception as e: + print(f"❌ Error during disaggregated serving: {e}") + print("Check that both prefill and decode instances are running and accessible") +``` From 0d98f496611f509008f7e3658f953c14e6f44458 Mon Sep 17 00:00:00 2001 From: weiliang <617878975@qq.com> Date: Fri, 25 Jul 2025 05:06:11 +0800 Subject: [PATCH 334/552] update flashinfer to v0.2.9rc1 (#21485) Signed-off-by: Weiliang Liu Signed-off-by: x22x22 --- docker/Dockerfile | 2 +- vllm/attention/backends/flashinfer.py | 10 +++------- vllm/v1/attention/backends/flashinfer.py | 9 ++------- 3 files changed, 6 insertions(+), 15 deletions(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index 11991829968..2e8c15bbd32 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -386,7 +386,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist # Install FlashInfer from source ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git" -ARG FLASHINFER_GIT_REF="v0.2.8" +ARG FLASHINFER_GIT_REF="v0.2.9rc1" RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH' . 
/etc/environment git clone --depth 1 --recursive --shallow-submodules \ diff --git a/vllm/attention/backends/flashinfer.py b/vllm/attention/backends/flashinfer.py index 56d3da699f4..e6e60e75624 100644 --- a/vllm/attention/backends/flashinfer.py +++ b/vllm/attention/backends/flashinfer.py @@ -1169,16 +1169,12 @@ def forward( query=decode_query, kv_cache=kv_cache.permute(*stride_order), workspace_buffer=workspace_buffer, - num_heads=num_heads, - num_kv_heads=num_kv_heads, - scale=softmax_scale, block_tables=attn_metadata.block_tables, seq_lens=decode_meta.seq_lens_tensor, - block_size=attn_metadata.page_size, max_seq_len=attn_metadata.max_decode_seq_len, - kv_cache_dtype=kv_cache_dtype, - k_scale=layer._k_scale_float, - v_scale=layer._v_scale_float) + bmm1_scale=layer._k_scale_float * softmax_scale, + bmm2_scale=layer._v_scale_float, + ) if prefill_output is None and decode_output is not None: # Decode only batch. diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index 94d80d441d8..b72745ef156 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -678,15 +678,10 @@ def forward( query=decode_query, kv_cache=kv_cache_permute, workspace_buffer=attn_metadata.workspace_buffer, - num_heads=self.num_heads, - num_kv_heads=self.num_kv_heads, - scale=self.scale, block_tables=block_tables_decode, seq_lens=seq_lens_decode, - block_size=attn_metadata.page_size, max_seq_len=attn_metadata.max_seq_len, - kv_cache_dtype=self.kv_cache_dtype, - k_scale=layer._k_scale_float, - v_scale=layer._v_scale_float, + bmm1_scale=layer._k_scale_float * self.scale, + bmm2_scale=layer._v_scale_float, )) return output_padded From 0d1a43f073861f38eed7b73d451ac640ac6407e4 Mon Sep 17 00:00:00 2001 From: QiliangCui Date: Thu, 24 Jul 2025 15:33:04 -0700 Subject: [PATCH 335/552] [TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. 
(#21539) Signed-off-by: Qiliang Cui Signed-off-by: x22x22 --- .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh | 2 +- tests/entrypoints/llm/test_accuracy.py | 3 --- 2 files changed, 1 insertion(+), 4 deletions(-) diff --git a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh index d39acae0b04..5514d7770cf 100755 --- a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh +++ b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh @@ -135,7 +135,7 @@ run_and_track_test 1 "test_compilation.py" \ run_and_track_test 2 "test_basic.py" \ "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_basic.py" run_and_track_test 3 "test_accuracy.py::test_lm_eval_accuracy_v1_engine" \ - "python3 -m pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine" + "HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine" run_and_track_test 4 "test_quantization_accuracy.py" \ "python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py" run_and_track_test 5 "examples/offline_inference/tpu.py" \ diff --git a/tests/entrypoints/llm/test_accuracy.py b/tests/entrypoints/llm/test_accuracy.py index 6c5706d1634..39bc8ab07d4 100644 --- a/tests/entrypoints/llm/test_accuracy.py +++ b/tests/entrypoints/llm/test_accuracy.py @@ -73,9 +73,6 @@ def test_lm_eval_accuracy_v1_engine(model, monkeypatch: pytest.MonkeyPatch): if current_platform.is_tpu(): # Limit compilation time for TPU V1 - # xet doesn't work well for both Qwen/Qwen3-1.7B and - # google/gemma-3-1b-it - m.setenv("HF_HUB_DISABLE_XET", "1") more_args = "max_model_len=2048,max_num_seqs=64" # Add TP test (if provided) From bcfbeca6c377001eeab4f8ccd2dfc68586a976f9 Mon Sep 17 00:00:00 2001 From: Woosuk Kwon Date: Thu, 24 Jul 2025 15:56:08 -0700 Subject: [PATCH 336/552] [MoE] More balanced expert sharding (#21497) Signed-off-by: Woosuk Kwon Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/layer.py | 22 +++++++++---------- 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/layer.py b/vllm/model_executor/layers/fused_moe/layer.py index 2a283a6d12b..254cd2e10b8 100644 --- a/vllm/model_executor/layers/fused_moe/layer.py +++ b/vllm/model_executor/layers/fused_moe/layer.py @@ -591,22 +591,20 @@ def determine_expert_map( if ep_size == 1: return (global_num_experts, None) - local_num_experts = global_num_experts // ep_size + # Distribute experts as evenly as possible to each rank. + base_experts = global_num_experts // ep_size + remainder = global_num_experts % ep_size + if ep_rank < remainder: + local_num_experts = base_experts + 1 + else: + local_num_experts = base_experts # Create a tensor of size num_experts filled with -1 expert_map = torch.full((global_num_experts, ), -1, dtype=torch.int32) # Create a expert map for the local experts - if ep_rank < (ep_size - 1): - # Each non-last rank gets local_num_experts experts. - expert_map[ep_rank * local_num_experts: - (ep_rank + 1) * local_num_experts] = \ - torch.arange(0, local_num_experts, dtype=torch.int32) - else: - # All remaining experts are assigned to the last rank. 
- local_num_experts = (global_num_experts - ep_rank * local_num_experts) - - expert_map[-local_num_experts:] = \ - torch.arange(0, local_num_experts, dtype=torch.int32) + start_idx = ep_rank * base_experts + min(ep_rank, remainder) + expert_map[start_idx:start_idx + local_num_experts] = torch.arange( + 0, local_num_experts, dtype=torch.int32) return (local_num_experts, expert_map) From 14c7ae780acdc77099c39426cb998f4ad793f4d7 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Fri, 25 Jul 2025 11:05:55 +0800 Subject: [PATCH 337/552] [Frontend] `run-batch` supports V1 (#21541) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- benchmarks/benchmark_throughput.py | 3 +- tests/entrypoints/openai/test_metrics.py | 5 ++- vllm/benchmarks/throughput.py | 4 ++- vllm/entrypoints/openai/api_server.py | 22 +++++++++--- vllm/entrypoints/openai/run_batch.py | 43 ++++++++++++++++-------- 5 files changed, 54 insertions(+), 23 deletions(-) diff --git a/benchmarks/benchmark_throughput.py b/benchmarks/benchmark_throughput.py index 14461121fec..c0a7f1d5825 100644 --- a/benchmarks/benchmark_throughput.py +++ b/benchmarks/benchmark_throughput.py @@ -167,7 +167,8 @@ async def run_vllm_async( from vllm import SamplingParams async with build_async_engine_client_from_engine_args( - engine_args, disable_frontend_multiprocessing + engine_args, + disable_frontend_multiprocessing=disable_frontend_multiprocessing, ) as llm: model_config = await llm.get_model_config() assert all( diff --git a/tests/entrypoints/openai/test_metrics.py b/tests/entrypoints/openai/test_metrics.py index 2d7b845736b..9107d089834 100644 --- a/tests/entrypoints/openai/test_metrics.py +++ b/tests/entrypoints/openai/test_metrics.py @@ -295,8 +295,6 @@ async def test_metrics_exist(server: RemoteOpenAIServer, def test_metrics_exist_run_batch(use_v1: bool): - if use_v1: - pytest.skip("Skipping test on vllm V1") input_batch = """{"custom_id": "request-0", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/multilingual-e5-small", "input": "You are a helpful assistant."}}""" # noqa: E501 base_url = "0.0.0.0" @@ -323,7 +321,8 @@ def test_metrics_exist_run_batch(use_v1: bool): base_url, "--port", port, - ], ) + ], + env={"VLLM_USE_V1": "1" if use_v1 else "0"}) def is_server_up(url): try: diff --git a/vllm/benchmarks/throughput.py b/vllm/benchmarks/throughput.py index af2ca965712..0fe042e2736 100644 --- a/vllm/benchmarks/throughput.py +++ b/vllm/benchmarks/throughput.py @@ -148,7 +148,9 @@ async def run_vllm_async( from vllm import SamplingParams async with build_async_engine_client_from_engine_args( - engine_args, disable_frontend_multiprocessing) as llm: + engine_args, + disable_frontend_multiprocessing=disable_frontend_multiprocessing, + ) as llm: model_config = await llm.get_model_config() assert all( model_config.max_model_len >= (request.prompt_len + diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index ba257990d4a..8540d25d4e9 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -149,6 +149,9 @@ async def _force_log(): @asynccontextmanager async def build_async_engine_client( args: Namespace, + *, + usage_context: UsageContext = UsageContext.OPENAI_API_SERVER, + disable_frontend_multiprocessing: Optional[bool] = None, client_config: Optional[dict[str, Any]] = None, ) -> AsyncIterator[EngineClient]: @@ -156,15 +159,24 @@ async def build_async_engine_client( # Ensures everything is shutdown and cleaned up on error/exit engine_args = 
AsyncEngineArgs.from_cli_args(args) + if disable_frontend_multiprocessing is None: + disable_frontend_multiprocessing = bool( + args.disable_frontend_multiprocessing) + async with build_async_engine_client_from_engine_args( - engine_args, args.disable_frontend_multiprocessing, - client_config) as engine: + engine_args, + usage_context=usage_context, + disable_frontend_multiprocessing=disable_frontend_multiprocessing, + client_config=client_config, + ) as engine: yield engine @asynccontextmanager async def build_async_engine_client_from_engine_args( engine_args: AsyncEngineArgs, + *, + usage_context: UsageContext = UsageContext.OPENAI_API_SERVER, disable_frontend_multiprocessing: bool = False, client_config: Optional[dict[str, Any]] = None, ) -> AsyncIterator[EngineClient]: @@ -177,7 +189,6 @@ async def build_async_engine_client_from_engine_args( """ # Create the EngineConfig (determines if we can use V1). - usage_context = UsageContext.OPENAI_API_SERVER vllm_config = engine_args.create_engine_config(usage_context=usage_context) # V1 AsyncLLM. @@ -1811,7 +1822,10 @@ async def run_server_worker(listen_address, if log_config is not None: uvicorn_kwargs['log_config'] = log_config - async with build_async_engine_client(args, client_config) as engine_client: + async with build_async_engine_client( + args, + client_config=client_config, + ) as engine_client: maybe_register_tokenizer_info_endpoint(args) app = build_app(args) diff --git a/vllm/entrypoints/openai/run_batch.py b/vllm/entrypoints/openai/run_batch.py index ef5bf6f9a81..57705509232 100644 --- a/vllm/entrypoints/openai/run_batch.py +++ b/vllm/entrypoints/openai/run_batch.py @@ -3,6 +3,7 @@ import asyncio import tempfile +from argparse import Namespace from collections.abc import Awaitable from http import HTTPStatus from io import StringIO @@ -13,10 +14,12 @@ from prometheus_client import start_http_server from tqdm import tqdm +from vllm.config import VllmConfig from vllm.engine.arg_utils import AsyncEngineArgs, optional_type -from vllm.engine.async_llm_engine import AsyncLLMEngine +from vllm.engine.protocol import EngineClient from vllm.entrypoints.logger import RequestLogger # yapf: disable +from vllm.entrypoints.openai.api_server import build_async_engine_client from vllm.entrypoints.openai.protocol import (BatchRequestInput, BatchRequestOutput, BatchResponseData, @@ -310,36 +313,37 @@ async def run_request(serving_engine_func: Callable, return batch_output -async def main(args): +async def run_batch( + engine_client: EngineClient, + vllm_config: VllmConfig, + args: Namespace, +) -> None: if args.served_model_name is not None: served_model_names = args.served_model_name else: served_model_names = [args.model] - engine_args = AsyncEngineArgs.from_cli_args(args) - engine = AsyncLLMEngine.from_engine_args( - engine_args, usage_context=UsageContext.OPENAI_BATCH_RUNNER) + if args.disable_log_requests: + request_logger = None + else: + request_logger = RequestLogger(max_log_len=args.max_log_len) - model_config = await engine.get_model_config() base_model_paths = [ BaseModelPath(name=name, model_path=args.model) for name in served_model_names ] - if args.disable_log_requests: - request_logger = None - else: - request_logger = RequestLogger(max_log_len=args.max_log_len) + model_config = vllm_config.model_config # Create the openai serving objects. 
openai_serving_models = OpenAIServingModels( - engine_client=engine, + engine_client=engine_client, model_config=model_config, base_model_paths=base_model_paths, lora_modules=None, ) openai_serving_chat = OpenAIServingChat( - engine, + engine_client, model_config, openai_serving_models, args.response_role, @@ -349,7 +353,7 @@ async def main(args): enable_prompt_tokens_details=args.enable_prompt_tokens_details, ) if "generate" in model_config.supported_tasks else None openai_serving_embedding = OpenAIServingEmbedding( - engine, + engine_client, model_config, openai_serving_models, request_logger=request_logger, @@ -362,7 +366,7 @@ async def main(args): "num_labels", 0) == 1) openai_serving_scores = ServingScores( - engine, + engine_client, model_config, openai_serving_models, request_logger=request_logger, @@ -457,6 +461,17 @@ async def main(args): await write_file(args.output_file, responses, args.output_tmp_dir) +async def main(args: Namespace): + async with build_async_engine_client( + args, + usage_context=UsageContext.OPENAI_BATCH_RUNNER, + disable_frontend_multiprocessing=False, + ) as engine_client: + vllm_config = await engine_client.get_vllm_config() + + await run_batch(engine_client, vllm_config, args) + + if __name__ == "__main__": args = parse_args() From 32164eb5fdccf8fe405851d9c4be9dea9cdced8c Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Fri, 25 Jul 2025 04:05:58 +0100 Subject: [PATCH 338/552] [Docs] Fix `site_url` for RunLLM (#21564) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- mkdocs.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mkdocs.yaml b/mkdocs.yaml index b392fb160c2..8f731a2c1fc 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -1,5 +1,5 @@ site_name: vLLM -site_url: https://docs.vllm.ai +site_url: !ENV READTHEDOCS_CANONICAL_URL repo_url: https://github.com/vllm-project/vllm edit_uri: edit/main/docs/ exclude_docs: | From a3fd326fc2e85b899e9d02e9c990a3ca29a48083 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Thu, 24 Jul 2025 23:07:22 -0400 Subject: [PATCH 339/552] [Bug] Fix DeepGemm Init Error (#21554) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- vllm/model_executor/layers/quantization/utils/fp8_utils.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/vllm/model_executor/layers/quantization/utils/fp8_utils.py b/vllm/model_executor/layers/quantization/utils/fp8_utils.py index ee5f2b51564..8a7e809d082 100644 --- a/vllm/model_executor/layers/quantization/utils/fp8_utils.py +++ b/vllm/model_executor/layers/quantization/utils/fp8_utils.py @@ -366,7 +366,7 @@ def per_token_group_quant_fp8( dtype: Optional[torch.dtype] = None, column_major_scales: bool = False, out_q: Optional[torch.Tensor] = None, - use_ue8m0: bool = is_blackwell_deep_gemm_used(), + use_ue8m0: Optional[bool] = None, ) -> tuple[torch.Tensor, torch.Tensor]: """Function to perform per-token-group quantization on an input tensor `x`. It converts the tensor values into signed float8 values and returns the @@ -383,6 +383,10 @@ def per_token_group_quant_fp8( tuple[torch.Tensor, torch.Tensor]: The quantized tensor and the scaling factor. 
""" + # TODO(wentao): refactor this + # use_ue8m0 should be a global flag that could be set by user + if use_ue8m0 is None: + use_ue8m0 = is_blackwell_deep_gemm_used() dtype = current_platform.fp8_dtype() if dtype is None else dtype assert (x.shape[-1] % group_size == 0), ( f"the last dimension of `x` {x.shape[-1]} must be divisible " From 82d366e67be2eec40c5a206f4fb6938e247f3bef Mon Sep 17 00:00:00 2001 From: Yuxuan Zhang <2448370773@qq.com> Date: Fri, 25 Jul 2025 11:07:38 +0800 Subject: [PATCH 340/552] Fix GLM-4 PP Missing Layer When using with PP. (#21531) Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> Signed-off-by: x22x22 --- vllm/model_executor/models/glm4_moe.py | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/vllm/model_executor/models/glm4_moe.py b/vllm/model_executor/models/glm4_moe.py index 095bfbc401b..43824abb571 100644 --- a/vllm/model_executor/models/glm4_moe.py +++ b/vllm/model_executor/models/glm4_moe.py @@ -612,14 +612,20 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.num_expert_groups = config.n_group self.moe_layers: list[FusedMoE] = [] + example_moe = None for layer in self.model.layers: + if isinstance(layer, PPMissingLayer): + continue + assert isinstance(layer, Glm4MoeDecoderLayer) if isinstance(layer.mlp, Glm4MoE): + # Pick last one layer since the first ones may be dense layers. + example_moe = layer.mlp self.moe_layers.append(layer.mlp.experts) - # Pick last one layer since the first ones may be dense layers. - example_moe = typing.cast( - Glm4MoE, self.model.layers[config.num_hidden_layers - 1].mlp) + if example_moe is None: + raise RuntimeError("No Glm4MoE layer found in model.layers.") + self.num_logical_experts = example_moe.n_logical_experts self.num_physical_experts = example_moe.n_physical_experts self.num_local_physical_experts = example_moe.n_local_physical_experts From 687a04461054a95aa6166b5b4a4d8c3537af6228 Mon Sep 17 00:00:00 2001 From: Burkhard Ringlein Date: Fri, 25 Jul 2025 05:16:59 +0200 Subject: [PATCH 341/552] [Kernel] adding fused_moe configs for upcoming granite4 (#21332) Signed-off-by: Burkhard Ringlein Co-authored-by: Thomas Parnell Signed-off-by: x22x22 --- ...256,device_name=NVIDIA_H100_80GB_HBM3.json | 146 ++++++++++++++++++ ...512,device_name=NVIDIA_H100_80GB_HBM3.json | 146 ++++++++++++++++++ ...384,device_name=NVIDIA_H100_80GB_HBM3.json | 146 ++++++++++++++++++ ...768,device_name=NVIDIA_H100_80GB_HBM3.json | 146 ++++++++++++++++++ 4 files changed, 584 insertions(+) create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=62,N=256,device_name=NVIDIA_H100_80GB_HBM3.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=62,N=512,device_name=NVIDIA_H100_80GB_HBM3.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=72,N=384,device_name=NVIDIA_H100_80GB_HBM3.json create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=72,N=768,device_name=NVIDIA_H100_80GB_HBM3.json diff --git a/vllm/model_executor/layers/fused_moe/configs/E=62,N=256,device_name=NVIDIA_H100_80GB_HBM3.json b/vllm/model_executor/layers/fused_moe/configs/E=62,N=256,device_name=NVIDIA_H100_80GB_HBM3.json new file mode 100644 index 00000000000..147a836602f --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=62,N=256,device_name=NVIDIA_H100_80GB_HBM3.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + 
"BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 4 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 2 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 2 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 2 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 2 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 2 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 2 + }, + "128": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 2 + }, + "256": { + "BLOCK_SIZE_M": 32, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 8, + "num_stages": 3 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 2 + }, + "1024": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 256, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 2 + }, + "1536": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "2048": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "3072": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "4096": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=62,N=512,device_name=NVIDIA_H100_80GB_HBM3.json b/vllm/model_executor/layers/fused_moe/configs/E=62,N=512,device_name=NVIDIA_H100_80GB_HBM3.json new file mode 100644 index 00000000000..a01e9c317ea --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=62,N=512,device_name=NVIDIA_H100_80GB_HBM3.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 2 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 2 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 5 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + 
"num_stages": 3 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 256, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 2 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 2 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 2 + }, + "128": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 2 + }, + "256": { + "BLOCK_SIZE_M": 32, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 2 + }, + "512": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 8, + "num_stages": 2 + }, + "1024": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 256, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 2 + }, + "1536": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "2048": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "3072": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "4096": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=72,N=384,device_name=NVIDIA_H100_80GB_HBM3.json b/vllm/model_executor/layers/fused_moe/configs/E=72,N=384,device_name=NVIDIA_H100_80GB_HBM3.json new file mode 100644 index 00000000000..a7cfd175d72 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=72,N=384,device_name=NVIDIA_H100_80GB_HBM3.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 128, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 2 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 4 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 4 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 4 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 2 + }, + "96": { + "BLOCK_SIZE_M": 32, + "BLOCK_SIZE_N": 256, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "128": { + "BLOCK_SIZE_M": 32, + "BLOCK_SIZE_N": 256, + 
"BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 2 + }, + "512": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "1024": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "1536": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "2048": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "3072": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "4096": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + } +} diff --git a/vllm/model_executor/layers/fused_moe/configs/E=72,N=768,device_name=NVIDIA_H100_80GB_HBM3.json b/vllm/model_executor/layers/fused_moe/configs/E=72,N=768,device_name=NVIDIA_H100_80GB_HBM3.json new file mode 100644 index 00000000000..3caae02cb91 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=72,N=768,device_name=NVIDIA_H100_80GB_HBM3.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 8, + "num_stages": 5 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 4 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 8, + "num_stages": 3 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 8, + "num_stages": 3 + }, + "48": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 8, + "num_stages": 3 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 2 + }, + "96": { + "BLOCK_SIZE_M": 32, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 4 + }, + "128": { + "BLOCK_SIZE_M": 32, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 4 + }, + "256": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 4 + }, + "512": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "1024": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "1536": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 
+ }, + "2048": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + }, + "3072": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 8, + "num_stages": 3 + }, + "4096": { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 8, + "num_stages": 3 + } +} From 5946508efff6ef0ec81cb53fc9125a552a36420a Mon Sep 17 00:00:00 2001 From: Varun Sundar Rabindranath Date: Fri, 25 Jul 2025 08:47:29 +0530 Subject: [PATCH 342/552] [Bugfix] DeepGemm utils : Fix hardcoded type-cast (#21517) Signed-off-by: Varun Sundar Rabindranath Co-authored-by: Varun Sundar Rabindranath Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/deep_gemm_utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/model_executor/layers/fused_moe/deep_gemm_utils.py b/vllm/model_executor/layers/fused_moe/deep_gemm_utils.py index 8cc5a747c67..c8469501af5 100644 --- a/vllm/model_executor/layers/fused_moe/deep_gemm_utils.py +++ b/vllm/model_executor/layers/fused_moe/deep_gemm_utils.py @@ -52,7 +52,7 @@ def compute_aligned_M(M: int, num_topk: int, local_num_experts: int, @triton.jit def apply_expert_map(expert_id, expert_map): if expert_id != -1: - expert_id = tl.load(expert_map + expert_id).to(tl.int64) + expert_id = tl.load(expert_map + expert_id).to(expert_id.dtype) return expert_id From 933017a66836becfd5c542aab39e160457d9d304 Mon Sep 17 00:00:00 2001 From: Nick Hill Date: Fri, 25 Jul 2025 04:18:16 +0100 Subject: [PATCH 343/552] [DP] Support api-server-count > 0 in hybrid DP LB mode (#21510) Signed-off-by: Nick Hill Signed-off-by: x22x22 --- tests/v1/test_hybrid_lb_dp.py | 2 +- vllm/entrypoints/cli/serve.py | 12 ++++-------- 2 files changed, 5 insertions(+), 9 deletions(-) diff --git a/tests/v1/test_hybrid_lb_dp.py b/tests/v1/test_hybrid_lb_dp.py index 08336489abe..74708b61765 100644 --- a/tests/v1/test_hybrid_lb_dp.py +++ b/tests/v1/test_hybrid_lb_dp.py @@ -147,7 +147,7 @@ def default_server_args(): ] -@pytest.fixture(scope="module", params=[1]) # Only 1 API server for now +@pytest.fixture(scope="module", params=[1, 4]) def servers(request, default_server_args): api_server_count = request.param with HybridLBServerManager(MODEL_NAME, DP_SIZE, api_server_count, diff --git a/vllm/entrypoints/cli/serve.py b/vllm/entrypoints/cli/serve.py index b144431dee9..68eb2580991 100644 --- a/vllm/entrypoints/cli/serve.py +++ b/vllm/entrypoints/cli/serve.py @@ -165,18 +165,14 @@ def run_multi_api_server(args: argparse.Namespace): " api_server_count > 1") model_config.disable_mm_preprocessor_cache = True - if vllm_config.parallel_config.data_parallel_hybrid_lb: - raise NotImplementedError( - "Hybrid load balancing with --api-server-count > 0" - "is not yet supported.") - executor_class = Executor.get_class(vllm_config) log_stats = not engine_args.disable_log_stats parallel_config = vllm_config.parallel_config dp_rank = parallel_config.data_parallel_rank external_dp_lb = parallel_config.data_parallel_external_lb - assert external_dp_lb or dp_rank == 0 + hybrid_dp_lb = parallel_config.data_parallel_hybrid_lb + assert external_dp_lb or hybrid_dp_lb or dp_rank == 0 api_server_manager: Optional[APIServerProcessManager] = None @@ -196,12 +192,12 @@ def run_multi_api_server(args: argparse.Namespace): stats_update_address=coordinator.get_stats_publish_address() if coordinator else None) - # For dp ranks > 0 in external DP LB mode, we must delay the 
+ # For dp ranks > 0 in external/hybrid DP LB modes, we must delay the # start of the API servers until the local engine is started # (after the launcher context manager exits), # since we get the front-end stats update address from the coordinator # via the handshake with the local engine. - if dp_rank == 0 or not external_dp_lb: + if dp_rank == 0 or not (external_dp_lb or hybrid_dp_lb): # Start API servers using the manager. api_server_manager = APIServerProcessManager( **api_server_manager_kwargs) From ad4270127caab7e15be58c64d64e3560bf3cf776 Mon Sep 17 00:00:00 2001 From: QiliangCui Date: Thu, 24 Jul 2025 20:44:50 -0700 Subject: [PATCH 344/552] [TPU][Test] Temporarily suspend this MoE model in test_basic.py. (#21560) Signed-off-by: Qiliang Cui Signed-off-by: x22x22 --- tests/v1/tpu/test_basic.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tests/v1/tpu/test_basic.py b/tests/v1/tpu/test_basic.py index b9ee9d66a38..865b58bc7f4 100644 --- a/tests/v1/tpu/test_basic.py +++ b/tests/v1/tpu/test_basic.py @@ -18,7 +18,8 @@ MODELS = [ "Qwen/Qwen2.5-1.5B-Instruct", - "Qwen/Qwen1.5-MoE-A2.7B", + # TODO: Enable this model when fixed. + # "Qwen/Qwen1.5-MoE-A2.7B", # TODO: Enable this models with v6e # "Qwen/Qwen2-7B-Instruct", # "meta-llama/Llama-3.1-8B", From cd79556e341dbc4c1a48cd759d6c6ac729eabcb3 Mon Sep 17 00:00:00 2001 From: Zhou Fang Date: Thu, 24 Jul 2025 20:51:15 -0700 Subject: [PATCH 345/552] [Docs] Add `requirements/common.txt` to run unit tests (#21572) Signed-off-by: Zhou Fang Signed-off-by: x22x22 --- docs/contributing/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/contributing/README.md b/docs/contributing/README.md index f2d439e37cc..e3ae5055b99 100644 --- a/docs/contributing/README.md +++ b/docs/contributing/README.md @@ -98,7 +98,7 @@ For additional features and advanced configurations, refer to the official [MkDo ??? 
console "Commands" ```bash - pip install -r requirements/dev.txt + pip install -r requirements/common.txt -r requirements/dev.txt # Linting, formatting and static type checking pre-commit install --hook-type pre-commit --hook-type commit-msg From e8bb37800f6d19587bbaa3a1b7e55f902d401a20 Mon Sep 17 00:00:00 2001 From: Benji Beck Date: Thu, 24 Jul 2025 21:43:52 -0700 Subject: [PATCH 346/552] Integrate TensorSchema with shape validation for Phi3VImagePixelInputs (#21232) Signed-off-by: Benji Beck Signed-off-by: x22x22 --- tests/standalone_tests/test_tensor_schema.py | 126 +++++++++++ vllm/model_executor/models/phi3v.py | 108 ++++------ vllm/utils/tensor_schema.py | 210 +++++++++++++++++++ 3 files changed, 372 insertions(+), 72 deletions(-) create mode 100644 tests/standalone_tests/test_tensor_schema.py create mode 100644 vllm/utils/tensor_schema.py diff --git a/tests/standalone_tests/test_tensor_schema.py b/tests/standalone_tests/test_tensor_schema.py new file mode 100644 index 00000000000..c5b77bb09bb --- /dev/null +++ b/tests/standalone_tests/test_tensor_schema.py @@ -0,0 +1,126 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import pytest +import torch + +from vllm.model_executor.models.phi3v import Phi3VImagePixelInputs + + +def test_tensor_schema_valid_tensor(): + Phi3VImagePixelInputs( + data=torch.randn(16, 64, 3, 32, 32), + image_sizes=torch.randint(0, 256, (16, 2)), + ) + + +def test_tensor_schema_optional_fields(): + Phi3VImagePixelInputs( + data=torch.randn(16, 64, 3, 32, 32), + image_sizes=None, + ) + + Phi3VImagePixelInputs(data=torch.randn(16, 64, 3, 32, 32), ) + + +def test_tensor_schema_constant_dim_failure(): + with pytest.raises(ValueError, match="dim\\[2\\] expected 3, got 4"): + Phi3VImagePixelInputs( + data=torch.randn(16, 64, 4, 32, 32), # dim[2] = 4 + image_sizes=torch.randint(0, 256, (16, 2)), + ) + + +def test_tensor_schema_symbolic_dim_mismatch(): + with pytest.raises(ValueError, match="expected 'bn'=12, got 16"): + Phi3VImagePixelInputs( + data=torch.randn(12, 64, 3, 32, 32), + image_sizes=torch.randint(0, 256, (16, 2)), + ) + + +def test_tensor_schema_list_tensor_valid(): + Phi3VImagePixelInputs( + data=[torch.randn(64, 3, 32, 32) for _ in range(16)], + image_sizes=torch.randint(0, 256, (16, 2)), + ) + + +def test_tensor_schema_variable_patch_counts_valid(): + # Each image has a different number of patches (p) + # Each tensor has shape (p, 3, 32, 32) + data = [ + torch.randn(16, 3, 32, 32), # p = 16 + torch.randn(32, 3, 32, 32), # p = 32 + torch.randn(64, 3, 32, 32), # p = 64 + ] + image_sizes = torch.randint(0, 256, (3, 2)) # bn = 3 + Phi3VImagePixelInputs( + data=data, + image_sizes=image_sizes, + ) + + +def test_tensor_schema_tuple_tensor_valid(): + Phi3VImagePixelInputs( + data=tuple(torch.randn(64, 3, 32, 32) for _ in range(16)), + image_sizes=torch.randint(0, 256, (16, 2)), + ) + + +def test_tensor_schema_inconsistent_shapes_in_list(): + with pytest.raises(ValueError, match="contains inconsistent shapes"): + Phi3VImagePixelInputs( + data=[torch.randn(64, 3, 32, 32), + torch.randn(64, 3, 16, 16)] + + [torch.randn(64, 3, 32, 32) for _ in range(14)], + image_sizes=torch.randint(0, 256, (16, 2)), + ) + + +def test_tensor_schema_empty_list(): + with pytest.raises(ValueError, match="is an empty list"): + Phi3VImagePixelInputs( + data=[], + image_sizes=torch.randint(0, 256, (0, 2)), + ) + + +def test_tensor_schema_validation_disabled_skips_shape_check(): + # This should NOT raise, because validation 
is turned off + # This would normally fail (dim[2] should be 3, not 4) + Phi3VImagePixelInputs( + data=torch.randn(16, 64, 4, 32, 32), + image_sizes=torch.randint(0, 256, (16, 2)), + validate=False, + ) + + +def test_tensor_schema_with_valid_resolve_binding_dims(): + data = torch.randn(16, 64, 3, 336, 336) # h=336, w=336 + image_sizes = torch.randint(0, 256, (16, 2)) + + Phi3VImagePixelInputs( + data=data, + image_sizes=image_sizes, + resolve_bindings={ + "h": 336, + "w": 336 + }, + ) + + +def test_tensor_schema_with_invalid_resolve_binding_dims(): + data = torch.randn(16, 64, 3, 36, 36) # h=36, w=36 + image_sizes = torch.randint(0, 256, (16, 2)) + + # Should raise because 'h' and 'w' don't match resolve bindings + with pytest.raises(ValueError, match="dim\\[3\\] expected 336, got 36"): + Phi3VImagePixelInputs( + data=data, + image_sizes=image_sizes, + resolve_bindings={ + "h": 336, + "w": 336 + }, + ) diff --git a/vllm/model_executor/models/phi3v.py b/vllm/model_executor/models/phi3v.py index 745cf7aa251..aa739f22fd7 100644 --- a/vllm/model_executor/models/phi3v.py +++ b/vllm/model_executor/models/phi3v.py @@ -16,7 +16,7 @@ # See the License for the specific language governing permissions and # limitations under the License. from collections.abc import Iterable, Mapping, Sequence -from typing import Any, Literal, Optional, TypedDict, Union +from typing import Annotated, Any, Literal, Optional, Union import regex as re import torch @@ -45,6 +45,7 @@ from vllm.multimodal.profiling import BaseDummyInputsBuilder from vllm.sequence import IntermediateTensors from vllm.utils import is_list_of +from vllm.utils.tensor_schema import TensorSchema, TensorShape from .clip import CLIPVisionModel from .interfaces import (MultiModalEmbeddings, SupportsMultiModal, SupportsPP, @@ -93,32 +94,42 @@ def _init_img_processor(hf_config: PretrainedConfig, return img_processor -class Phi3VImagePixelInputs(TypedDict): - type: Literal["pixel_values"] - data: Union[torch.Tensor, list[torch.Tensor]] +class Phi3VImagePixelInputs(TensorSchema): """ - Shape: - `(batch_size * num_images, 1 + num_patches, num_channels, height, width)` - - Note that `num_patches` may be different per batch and image, - in which case the data is passed as a list instead of a batched tensor. + Dimensions: + - b: Batch size + - n: Number of images + - p: Number of patches + - h: Height of each patch + - w: Width of each patch """ - image_sizes: torch.Tensor - """ - Shape: `(batch_size * num_images, 2)` + type: Literal["pixel_values", "image_embeds"] = "pixel_values" - This should be in `(height, width)` format. - """ + # Supports either a stacked tensor or a list of (p, 3, h, w) tensors + data: Annotated[ + Union[torch.Tensor, list[torch.Tensor]], + TensorShape("bn", "p", 3, "h", "w", dynamic_dims={"p"} + ), # 'p' may vary across items + ] + # Stacked tensor with height and width for each image + image_sizes: Annotated[Optional[torch.Tensor], TensorShape("bn", 2)] -class Phi3VImageEmbeddingInputs(TypedDict): - type: Literal["image_embeds"] - data: Union[torch.Tensor, list[torch.Tensor]] - """Shape: `(batch_size * num_images, image_feature_size, hidden_size)` - `hidden_size` must match the hidden size of language model backbone. 
+class Phi3VImageEmbeddingInputs(TensorSchema): """ + Dimensions: + - b: Batch size + - n: Number of images + - f: Image feature size (e.g., number of tokens per image) + - h: Hidden size (must match language model backbone) + """ + type: Literal["image_embeds"] = "image_embeds" + data: Annotated[ + Union[torch.Tensor, list[torch.Tensor]], + TensorShape("bn", "f", "h"), + ] Phi3VImageInputs = Union[Phi3VImagePixelInputs, Phi3VImageEmbeddingInputs] @@ -563,44 +574,6 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.make_empty_intermediate_tensors = ( self.language_model.make_empty_intermediate_tensors) - def _validate_image_sizes(self, data: torch.Tensor) -> torch.Tensor: - expected_dims = (2, ) - - def _validate_shape(d: torch.Tensor): - actual_dims = tuple(d.shape) - - if actual_dims != expected_dims: - expected_expr = str(expected_dims) - raise ValueError( - f"The expected shape of image sizes per image per batch " - f"is {expected_expr}. You supplied {tuple(d.shape)}.") - - for d in data: - _validate_shape(d) - - return data - - def _validate_pixel_values( - self, data: Union[torch.Tensor, list[torch.Tensor]] - ) -> Union[torch.Tensor, list[torch.Tensor]]: - - h = w = CLIP_VIT_LARGE_PATCH14_336_CONFIG.image_size - expected_dims = (3, h, w) - - def _validate_shape(d: torch.Tensor): - actual_dims = tuple(d.shape[1:]) - - if actual_dims != expected_dims: - expected_expr = ("num_patches", *map(str, expected_dims)) - raise ValueError( - "The expected shape of pixel values per image per batch " - f"is {expected_expr}. You supplied {tuple(d.shape)}.") - - for d in data: - _validate_shape(d) - - return data - def _parse_and_validate_image_input( self, **kwargs: object) -> Optional[Phi3VImageInputs]: pixel_values = kwargs.pop("pixel_values", None) @@ -611,25 +584,16 @@ def _parse_and_validate_image_input( return None if pixel_values is not None: - if not isinstance(pixel_values, (torch.Tensor, list)): - raise ValueError("Incorrect type of pixel values. " - f"Got type: {type(pixel_values)}") - - if not isinstance(image_sizes, (torch.Tensor, list)): - raise ValueError("Incorrect type of image sizes. " - f"Got type: {type(image_sizes)}") - return Phi3VImagePixelInputs( type="pixel_values", - data=self._validate_pixel_values(flatten_bn(pixel_values)), - image_sizes=self._validate_image_sizes( - flatten_bn(image_sizes, concat=True))) + data=flatten_bn(pixel_values), + image_sizes=flatten_bn(image_sizes, concat=True), + resolve_bindings={ + "h": CLIP_VIT_LARGE_PATCH14_336_CONFIG.image_size, + "w": CLIP_VIT_LARGE_PATCH14_336_CONFIG.image_size + }) if image_embeds is not None: - if not isinstance(image_embeds, torch.Tensor): - raise ValueError("Incorrect type of image embeddings. " - f"Got type: {type(image_embeds)}") - return Phi3VImageEmbeddingInputs( type="image_embeds", data=flatten_bn(image_embeds), diff --git a/vllm/utils/tensor_schema.py b/vllm/utils/tensor_schema.py new file mode 100644 index 00000000000..485a0a72ddc --- /dev/null +++ b/vllm/utils/tensor_schema.py @@ -0,0 +1,210 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from typing import Annotated, Any, Union, get_args, get_origin, get_type_hints + +import torch + +from vllm.logger import init_logger + +logger = init_logger(__name__) + + +class TensorShape: + + def __init__(self, + *dims: Union[int, str], + dynamic_dims: set[str, ...] 
= None) -> None: + self.dims = dims + self.dynamic_dims = dynamic_dims if dynamic_dims else set() + + def resolve(self, **bindings: dict[str, + int]) -> tuple[Union[int, str], ...]: + resolved = [] + for dim in self.dims: + if isinstance(dim, str) and dim in bindings: + resolved.append(bindings[dim]) + else: + resolved.append(dim) + return tuple(resolved) + + def __str__(self) -> str: + """Return a string representation of the tensor shape.""" + dim_strs = [] + for dim in self.dims: + if isinstance(dim, str): + if dim in self.dynamic_dims: + dim_strs.append( + f"{dim}*") # Mark dynamic dimensions with * + else: + dim_strs.append(dim) + else: + dim_strs.append(str(dim)) + return f"({', '.join(dim_strs)})" + + +class TensorSchema: + + def __init__(self, + *, + validate: bool = True, + resolve_bindings: dict[str, int] = None, + **kwargs: Any) -> None: + self._resolve_bindings = resolve_bindings if resolve_bindings else {} + + for key, value in kwargs.items(): + setattr(self, key, value) + + if validate: + self.validate() + + def __getitem__(self, item) -> Any: + return getattr(self, item) + + def _match_shape_with_dynamic(self, actual: tuple[int, ...], + reference: tuple[int, ...], + expected_shape: tuple[Union[int, str], ...], + dynamic_dims: set[str, ...]) -> bool: + if len(actual) != len(reference) or len(actual) > len(expected_shape): + return False + + for i, (a, r) in enumerate(zip(actual, reference)): + # When validating list inputs, we match shape suffixes only + # (e.g. "p", 3, "h", "w"), assuming the list length corresponds + # to the leading symbolic dim (e.g. "bn"). This allows comparing + # only the trailing dimensions of each element in the list. + dim = expected_shape[-len(actual) + i] + # Skip this dimension if it's marked dynamic + if dim in dynamic_dims: + continue + if a != r: + return False + return True + + def _validate_nested_tensors( + self, value: Union[list[torch.Tensor, ...], + tuple[torch.Tensor, ...]], field_name: str, + expected_shape: tuple[Union[int, str], ...], + dynamic_dims: set[str, ...]) -> tuple[int, ...]: + """Validate a list/tuple of tensors and return the actual shape.""" + if not value: + raise ValueError(f"{field_name} is an empty list") + + # Ensure all tensors in the list have the same + # shape, besides dynamic dimensions + first = value[0] + for i, v in enumerate(value): + if not isinstance(v, torch.Tensor): + raise ValueError(f"{field_name}[{i}] is not a " + f"torch.Tensor") + if not self._match_shape_with_dynamic( + v.shape, + first.shape, + expected_shape, + dynamic_dims, + ): + raise ValueError(f"{field_name} contains inconsistent " + f"shapes: {first.shape} vs {v.shape} " + f"at index {i}") + + # Treat the list as a stacked tensor: + # shape = (len(list), *tensor.shape) + return (len(value), ) + first.shape + + def _validate_tensor_shape_expected(self, actual_shape: tuple[int, ...], + expected_shape: tuple[Union[int, str], + ...], + field_name: str, shape_env: dict[str, + int], + dynamic_dims: set[str, ...]) -> None: + """Validate that the actual tensor shape matches the expected shape.""" + if len(actual_shape) != len(expected_shape): + raise ValueError(f"{field_name} has rank {len(actual_shape)} " + f"but expected {len(expected_shape)}") + + for i, dim in enumerate(expected_shape): + if dim in dynamic_dims: + continue + elif isinstance(dim, int): + if actual_shape[i] != dim: + raise ValueError(f"{field_name} dim[{i}] expected " + f"{dim}, got {actual_shape[i]}") + elif isinstance(dim, str): + if dim in shape_env: + if actual_shape[i] != 
shape_env[dim]: + raise ValueError(f"{field_name} dim[{i}] expected " + f"'{dim}'={shape_env[dim]}, got " + f"{actual_shape[i]}") + else: + shape_env[dim] = actual_shape[i] + else: + raise TypeError(f"{field_name} dim[{i}] has unsupported " + f"type: {type(dim)}") + + def validate(self) -> None: + type_hints = get_type_hints(self.__class__, include_extras=True) + shape_env = {} + + for field_name, field_type in type_hints.items(): + # Check if field is missing + if (not hasattr(self, field_name) + or getattr(self, field_name) is None): + # Check if field is marked as optional + actual_type = field_type + if get_origin(field_type) is Annotated: + args = get_args(field_type) + actual_type = args[0] + + # Check arg was provided as Union + if get_origin(actual_type) is Union: + args = get_args(actual_type) + # Skip validation when Union contains None + if type(None) in args: + continue + # If not optional, raise error + raise ValueError(f"Required field '{field_name}' is missing") + + # Field exists, proceed with validation + value = getattr(self, field_name) + + if get_origin(field_type) is not None: + args = get_args(field_type) + + for arg in args: + if isinstance(arg, TensorShape): + expected_shape = arg.resolve(**self._resolve_bindings) + if isinstance(value, (list, tuple)): + actual_shape = self._validate_nested_tensors( + value, field_name, expected_shape, + arg.dynamic_dims) + + elif isinstance(value, torch.Tensor): + actual_shape = value.shape + + else: + type_names = [] + for arg in args: + if hasattr(arg, "__name__"): + type_names.append(str(arg.__name__)) + else: + type_names.append(str(arg)) + + expected_types = ", ".join(type_names) + raise ValueError( + f"{field_name} is not one of the expected " + f"types: {expected_types}") + + self._validate_tensor_shape_expected( + actual_shape, expected_shape, field_name, + shape_env, arg.dynamic_dims) + + def print_shapes(self) -> None: + """Print TensorShape annotations for debugging.""" + logger.debug("Shapes in %s:", self.__class__.__name__) + type_hints = get_type_hints(self.__class__, include_extras=True) + + for field_name, field_type in type_hints.items(): + if get_origin(field_type) is not None: + args = get_args(field_type) + for arg in args: + if isinstance(arg, TensorShape): + logger.debug(" %s: %s", field_name, str(arg)) From 38ef6d842eebe60a55da18a2465ecaa3535d9dd5 Mon Sep 17 00:00:00 2001 From: "Li, Jiang" Date: Fri, 25 Jul 2025 12:58:03 +0800 Subject: [PATCH 347/552] [CI] Update CODEOWNERS for CPU and Intel GPU (#21582) Signed-off-by: jiang1.li Signed-off-by: x22x22 --- .github/CODEOWNERS | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 8c68bc8f02b..24410553716 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -52,3 +52,15 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson # Docs /docs @hmellor mkdocs.yaml @hmellor + +# CPU +/vllm/v1/worker/^cpu @bigPYJ1151 +/csrc/cpu @bigPYJ1151 +/vllm/platforms/cpu.py @bigPYJ1151 +/cmake/cpu_extension.cmake @bigPYJ1151 +/docker/Dockerfile.cpu @bigPYJ1151 + +# Intel GPU +/vllm/v1/worker/^xpu @jikunshang +/vllm/platforms/xpu.py @jikunshang +/docker/Dockerfile.xpu @jikunshang From 289ca2dcb31fb56f78674ad856bbf9a5bb89f6e8 Mon Sep 17 00:00:00 2001 From: Ning Xie Date: Fri, 25 Jul 2025 13:44:38 +0800 Subject: [PATCH 348/552] [Bugfix] fix modelscope snapshot_download serialization (#21536) Signed-off-by: Andy Xie Signed-off-by: x22x22 --- vllm/model_executor/model_loader/default_loader.py | 12 ++++++------ 1 file changed, 6 
insertions(+), 6 deletions(-) diff --git a/vllm/model_executor/model_loader/default_loader.py b/vllm/model_executor/model_loader/default_loader.py index 36568e881eb..2b8e4427591 100644 --- a/vllm/model_executor/model_loader/default_loader.py +++ b/vllm/model_executor/model_loader/default_loader.py @@ -69,10 +69,10 @@ def _maybe_download_from_modelscope( # pylint: disable=C. from modelscope.hub.snapshot_download import snapshot_download - if not os.path.exists(model): - # Use file lock to prevent multiple processes from - # downloading the same model weights at the same time. - with get_lock(model, self.load_config.download_dir): + # Use file lock to prevent multiple processes from + # downloading the same model weights at the same time. + with get_lock(model, self.load_config.download_dir): + if not os.path.exists(model): model_path = snapshot_download( model_id=model, cache_dir=self.load_config.download_dir, @@ -81,8 +81,8 @@ def _maybe_download_from_modelscope( revision=revision, ignore_file_pattern=self.load_config.ignore_patterns, ) - else: - model_path = model + else: + model_path = model return model_path return None From 2eebbd912ab1b2cc56935b20114105d07414bf94 Mon Sep 17 00:00:00 2001 From: Jason Gu <1057337859@qq.com> Date: Fri, 25 Jul 2025 13:45:16 +0800 Subject: [PATCH 349/552] [Model] Support tensor parallel for timm ViT in Deepseek_vl2 (#21494) Signed-off-by: wzqd <1057337859@qq.com> Signed-off-by: x22x22 --- vllm/model_executor/models/deepseek_vl2.py | 40 ++++++++++++++++++++-- 1 file changed, 38 insertions(+), 2 deletions(-) diff --git a/vllm/model_executor/models/deepseek_vl2.py b/vllm/model_executor/models/deepseek_vl2.py index a222c4cbe9d..0ca6b28073e 100644 --- a/vllm/model_executor/models/deepseek_vl2.py +++ b/vllm/model_executor/models/deepseek_vl2.py @@ -14,9 +14,11 @@ from transformers import BatchFeature from vllm.config import VllmConfig +from vllm.distributed import get_tensor_model_parallel_world_size from vllm.model_executor import SamplingMetadata from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.model_loader.utils import set_default_torch_dtype +from vllm.model_executor.models.transformers import replace_linear_class from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalFieldConfig, MultiModalKwargs, NestedTensors) @@ -379,6 +381,37 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.make_empty_intermediate_tensors = ( self.language_model.make_empty_intermediate_tensors) + def _get_parent_and_attr(self, root: torch.nn.Module, dotted_name: str): + """Return (parent_module, final_attr_name) for a dotted module path.""" + names = dotted_name.split('.') + parent = root + for n in names[:-1]: + parent = getattr(parent, n) + return parent, names[-1] + + #patch for timm ViT instance to support tensor parallel + def patch_vit_for_tp(self, vit: torch.nn.Module, + quant_config: QuantizationConfig): + try: + import timm + except ImportError as e: + raise ImportError("Please install timm") from e + + for name, module in vit.named_modules(): + if isinstance(module, nn.Linear): + parent, attr_name = self._get_parent_and_attr(vit, name) + if isinstance(parent, timm.layers.Mlp) and attr_name == "fc1": + new_linear = replace_linear_class(module, "colwise", + quant_config) + setattr(parent, attr_name, new_linear) + elif isinstance(parent, + timm.layers.Mlp) and attr_name == "fc2": + new_linear = replace_linear_class(module, "rowwise", + quant_config) + 
setattr(parent, attr_name, new_linear) + + return vit + def _init_vision_module( self, vision_config: VisionEncoderConfig, @@ -388,8 +421,8 @@ def _init_vision_module( # TODO: refactor vision model through timm wrapper from transformers try: import timm - except ImportError: - raise ImportError("Please install timm") from ImportError + except ImportError as e: + raise ImportError("Please install timm") from e with set_default_torch_dtype(torch.float16): model = timm.create_model( @@ -400,6 +433,9 @@ def _init_vision_module( dynamic_img_pad=True, ) + if get_tensor_model_parallel_world_size() > 1: + model = self.patch_vit_for_tp(model, quant_config) + model = model.to(dtype=torch.get_default_dtype()) return model From d09777ed1da2c6d71438f6b5bfae5b2e4cdc540b Mon Sep 17 00:00:00 2001 From: hfan Date: Fri, 25 Jul 2025 01:46:06 -0400 Subject: [PATCH 350/552] [Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings (#21479) Signed-off-by: Hongmin Fan Signed-off-by: x22x22 --- vllm/model_executor/models/gemma3_mm.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/model_executor/models/gemma3_mm.py b/vllm/model_executor/models/gemma3_mm.py index d14f5fa3d60..d756f54c49b 100644 --- a/vllm/model_executor/models/gemma3_mm.py +++ b/vllm/model_executor/models/gemma3_mm.py @@ -627,7 +627,7 @@ def forward(self, inputs_embeds = self.get_input_embeddings(input_ids, vision_embeddings) - if vision_embeddings is not None: + if (vision_embeddings is not None) and len(vision_embeddings) != 0: kwargs = self.prepare_attn_masks( input_ids, positions, From c1d3d7c3914d028dc7d924329a3e2617a66be22f Mon Sep 17 00:00:00 2001 From: Chengji Yao Date: Thu, 24 Jul 2025 22:46:43 -0700 Subject: [PATCH 351/552] [Misc][Tools] make max-model-len a parameter in auto_tune script (#21321) Signed-off-by: Chengji Yao Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 --- benchmarks/auto_tune/README.md | 4 ++++ benchmarks/auto_tune/auto_tune.sh | 14 +++++++++++--- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/benchmarks/auto_tune/README.md b/benchmarks/auto_tune/README.md index 7732f50b1d2..ae5962fe925 100644 --- a/benchmarks/auto_tune/README.md +++ b/benchmarks/auto_tune/README.md @@ -39,6 +39,7 @@ You must set the following variables at the top of the script before execution. | `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) | | `INPUT_LEN` | **Required.** Request input length. | `4000` | | `OUTPUT_LEN` | **Required.** Request output length. | `16` | +| `MAX_MODEL_LEN` | **Required.** Max model length. | `4096` | | `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` | | `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. | `500` | | `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. 
| `"128 256"` | @@ -69,6 +70,7 @@ Here are a few examples of how to configure the script for different goals: ```bash INPUT_LEN=1800 OUTPUT_LEN=20 +MAX_MODEL_LEN=2048 MIN_CACHE_HIT_PCT=0 MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number ``` @@ -80,6 +82,7 @@ MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number ```bash INPUT_LEN=1800 OUTPUT_LEN=20 +MAX_MODEL_LEN=2048 MIN_CACHE_HIT_PCT=0 MAX_LATENCY_ALLOWED_MS=500 ``` @@ -91,6 +94,7 @@ MAX_LATENCY_ALLOWED_MS=500 ```bash INPUT_LEN=1800 OUTPUT_LEN=20 +MAX_MODEL_LEN=2048 MIN_CACHE_HIT_PCT=60 MAX_LATENCY_ALLOWED_MS=500 ``` diff --git a/benchmarks/auto_tune/auto_tune.sh b/benchmarks/auto_tune/auto_tune.sh index eaa28ea5c92..8d3e1d4bee3 100644 --- a/benchmarks/auto_tune/auto_tune.sh +++ b/benchmarks/auto_tune/auto_tune.sh @@ -4,13 +4,15 @@ # See details in README (benchmarks/auto_tune/README.md). TAG=$(date +"%Y_%m_%d_%H_%M") -BASE="" +SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) +BASE="$SCRIPT_DIR/../../.." MODEL="meta-llama/Llama-3.1-8B-Instruct" SYSTEM="TPU" TP=1 DOWNLOAD_DIR="" INPUT_LEN=4000 OUTPUT_LEN=16 +MAX_MODEL_LEN=4096 MIN_CACHE_HIT_PCT=0 MAX_LATENCY_ALLOWED_MS=100000000000 NUM_SEQS_LIST="128 256" @@ -36,6 +38,13 @@ current_hash=$(git rev-parse HEAD) echo "hash:$current_hash" >> "$RESULT" echo "current_hash: $current_hash" +TOTAL_LEN=$((INPUT_LEN + OUTPUT_LEN)) +RED='\033[0;31m' +if (( TOTAL_LEN > MAX_MODEL_LEN )); then + echo -e "${RED}FAILED: INPUT_LEN($INPUT_LEN) + OUTPUT_LEN($OUTPUT_LEN) = $TOTAL_LEN, which is > MAX_MODEL_LEN = $MAX_MODEL_LEN.\033[0m" >&2 + exit 1 +fi + best_throughput=0 best_max_num_seqs=0 best_num_batched_tokens=0 @@ -60,7 +69,7 @@ start_server() { --enable-prefix-caching \ --load-format dummy \ --download-dir "$DOWNLOAD_DIR" \ - --max-model-len $(( INPUT_LEN+OUTPUT_LEN )) > "$vllm_log" 2>&1 & + --max-model-len $MAX_MODEL_LEN > "$vllm_log" 2>&1 & # wait for 10 minutes... 
server_started=0 @@ -245,4 +254,3 @@ done echo "finish permutations" echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT" - From b634dcfc41de3a8be627a725f58bc44ffae4084c Mon Sep 17 00:00:00 2001 From: Ignacio Sica Date: Fri, 25 Jul 2025 02:53:59 -0300 Subject: [PATCH 352/552] [CI/Build] fix cpu_extension for apple silicon (#21195) Signed-off-by: ignaciosica Signed-off-by: x22x22 --- cmake/cpu_extension.cmake | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/cmake/cpu_extension.cmake b/cmake/cpu_extension.cmake index 21fcee66d60..e0da46e2acc 100644 --- a/cmake/cpu_extension.cmake +++ b/cmake/cpu_extension.cmake @@ -58,6 +58,22 @@ function (find_isa CPUINFO TARGET OUT) endif() endfunction() + +function(check_sysctl TARGET OUT) + execute_process(COMMAND sysctl -n "${TARGET}" + RESULT_VARIABLE SYSCTL_RET + OUTPUT_VARIABLE SYSCTL_INFO + ERROR_QUIET + OUTPUT_STRIP_TRAILING_WHITESPACE) + if(SYSCTL_RET EQUAL 0 AND + (SYSCTL_INFO STREQUAL "1" OR SYSCTL_INFO GREATER 0)) + set(${OUT} ON PARENT_SCOPE) + else() + set(${OUT} OFF PARENT_SCOPE) + endif() +endfunction() + + function (is_avx512_disabled OUT) set(DISABLE_AVX512 $ENV{VLLM_CPU_DISABLE_AVX512}) if(DISABLE_AVX512 AND DISABLE_AVX512 STREQUAL "true") @@ -70,7 +86,10 @@ endfunction() is_avx512_disabled(AVX512_DISABLED) if (MACOSX_FOUND AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64") - set(APPLE_SILICON_FOUND TRUE) + message(STATUS "Apple Silicon Detected") + set(ENABLE_NUMA OFF) + check_sysctl(hw.optional.neon ASIMD_FOUND) + check_sysctl(hw.optional.arm.FEAT_BF16 ARM_BF16_FOUND) else() find_isa(${CPUINFO} "avx2" AVX2_FOUND) find_isa(${CPUINFO} "avx512f" AVX512_FOUND) @@ -82,7 +101,6 @@ else() find_isa(${CPUINFO} "S390" S390_FOUND) endif() - if (AVX512_FOUND AND NOT AVX512_DISABLED) list(APPEND CXX_COMPILE_FLAGS "-mavx512f" @@ -149,9 +167,6 @@ elseif (ASIMD_FOUND) set(MARCH_FLAGS "-march=armv8.2-a+dotprod+fp16") endif() list(APPEND CXX_COMPILE_FLAGS ${MARCH_FLAGS}) -elseif(APPLE_SILICON_FOUND) - message(STATUS "Apple Silicon Detected") - set(ENABLE_NUMA OFF) elseif (S390_FOUND) message(STATUS "S390 detected") # Check for S390 VXE support From b5ed2e1a3baf6d5ed3ef3c50c0a419ae3083d5e3 Mon Sep 17 00:00:00 2001 From: Yang Chen Date: Thu, 24 Jul 2025 22:54:23 -0700 Subject: [PATCH 353/552] [Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS (#21262) Signed-off-by: Yang Chen Signed-off-by: x22x22 --- CMakeLists.txt | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 529ce29029b..ea56b8451f2 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -768,6 +768,14 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") list(APPEND VLLM_MOE_EXT_SRC "csrc/moe/moe_wna16.cu") endif() +if(VLLM_GPU_LANG STREQUAL "CUDA") + set(MOE_PERMUTE_SRC + "csrc/moe/permute_unpermute_kernels/moe_permute_unpermute_kernel.cu" + "csrc/moe/moe_permute_unpermute_op.cu") + + list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}") +endif() + set_gencode_flags_for_srcs( SRCS "${VLLM_MOE_EXT_SRC}" CUDA_ARCHS "${CUDA_ARCHS}") @@ -836,17 +844,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") endif() endif() -if(VLLM_GPU_LANG STREQUAL "CUDA") - set(MOE_PERMUTE_SRC - 
"csrc/moe/permute_unpermute_kernels/moe_permute_unpermute_kernel.cu" - "csrc/moe/moe_permute_unpermute_op.cu") - - set_gencode_flags_for_srcs( - SRCS "${MOE_PERMUTE_SRC}" - CUDA_ARCHS "${CUDA_ARCHS}") - - list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}") -endif() message(STATUS "Enabling moe extension.") define_gpu_extension_target( _moe_C From 83ae2b9ae2a0d2c2666aa6b50c4d80aea0d1df2a Mon Sep 17 00:00:00 2001 From: Chengji Yao Date: Thu, 24 Jul 2025 23:01:53 -0700 Subject: [PATCH 354/552] [TPU][Bugfix] fix OOM issue in CI test (#21550) Signed-off-by: Chengji Yao Signed-off-by: x22x22 --- tests/v1/tpu/test_basic.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/v1/tpu/test_basic.py b/tests/v1/tpu/test_basic.py index 865b58bc7f4..dd89059ded5 100644 --- a/tests/v1/tpu/test_basic.py +++ b/tests/v1/tpu/test_basic.py @@ -59,7 +59,7 @@ def test_basic( # actually test chunked prompt max_num_batched_tokens=1024, max_model_len=8192, - gpu_memory_utilization=0.7, + gpu_memory_utilization=0.95, max_num_seqs=max_num_seqs, tensor_parallel_size=tensor_parallel_size) as vllm_model: vllm_outputs = vllm_model.generate_greedy(example_prompts, From bf2960ec551e6a2248be0b5704fe0fdc09a25bef Mon Sep 17 00:00:00 2001 From: Nick Hill Date: Fri, 25 Jul 2025 10:27:24 +0100 Subject: [PATCH 355/552] [Tests] Harden DP tests (#21508) Signed-off-by: Nick Hill Signed-off-by: x22x22 --- tests/v1/test_external_lb_dp.py | 9 +++-- tests/v1/test_hybrid_lb_dp.py | 60 ++++++++++++++--------------- tests/v1/test_internal_lb_dp.py | 68 +++++++++++++++++++++++---------- 3 files changed, 82 insertions(+), 55 deletions(-) diff --git a/tests/v1/test_external_lb_dp.py b/tests/v1/test_external_lb_dp.py index 98fefad1ff4..4a5c47fead5 100644 --- a/tests/v1/test_external_lb_dp.py +++ b/tests/v1/test_external_lb_dp.py @@ -11,7 +11,7 @@ import pytest_asyncio from tests.utils import RemoteOpenAIServer -from vllm.platforms import Platform +from vllm.platforms import current_platform MODEL_NAME = "ibm-research/PowerMoE-3b" @@ -70,10 +70,11 @@ def start_server(r: int, sargs: list[str]): sargs, auto_port=False, env_dict={ - "CUDA_VISIBLE_DEVICES": + current_platform.device_control_env_var: ",".join( - str(Platform.device_id_to_physical_device_id( - i)) + str( + current_platform. + device_id_to_physical_device_id(i)) for i in range(r * TP_SIZE, (r + 1) * TP_SIZE)) }) server.__enter__() diff --git a/tests/v1/test_hybrid_lb_dp.py b/tests/v1/test_hybrid_lb_dp.py index 74708b61765..293b1257be6 100644 --- a/tests/v1/test_hybrid_lb_dp.py +++ b/tests/v1/test_hybrid_lb_dp.py @@ -12,7 +12,7 @@ from tests.utils import RemoteOpenAIServer from tests.v1.test_utils import check_request_balancing -from vllm.platforms import Platform +from vllm.platforms import current_platform MODEL_NAME = "ibm-research/PowerMoE-3b" @@ -92,10 +92,12 @@ def start_server(node: int, sargs: list[str]): sargs, auto_port=False, env_dict={ - "CUDA_VISIBLE_DEVICES": + current_platform.device_control_env_var: ",".join( - str(Platform.device_id_to_physical_device_id( - i)) for i in range(gpu_start, gpu_end)) + str( + current_platform. 
+ device_id_to_physical_device_id(i)) + for i in range(gpu_start, gpu_end)) }) server.__enter__() print(f"Hybrid LB node {node} started successfully with " @@ -180,7 +182,7 @@ async def make_request(client: openai.AsyncOpenAI): completion = await client.completions.create( model=model_name, prompt="Hello, my name is", - max_tokens=10, + max_tokens=5, temperature=1.0) assert completion.id is not None @@ -212,27 +214,28 @@ async def make_request(client: openai.AsyncOpenAI): await asyncio.sleep(0.5) # Send requests to all nodes - each should balance within its local DP ranks - num_requests_per_node = 25 # Total 50 requests across 2 nodes + num_requests = 200 # Total 200 requests across 2 nodes all_tasks = [] - - for i, client in enumerate(clients): - tasks = [make_request(client) for _ in range(num_requests_per_node)] - all_tasks.extend(tasks) + for i in range(num_requests): + client = clients[i % len(clients)] + all_tasks.append(asyncio.create_task(make_request(client))) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) - assert len(results) == num_requests_per_node * len(clients) + assert len(results) == num_requests assert all(completion is not None for completion in results) await asyncio.sleep(0.5) # Second burst of requests all_tasks = [] - for i, client in enumerate(clients): - tasks = [make_request(client) for _ in range(num_requests_per_node)] - all_tasks.extend(tasks) + for i in range(num_requests): + client = clients[i % len(clients)] + all_tasks.append(asyncio.create_task(make_request(client))) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) - assert len(results) == num_requests_per_node * len(clients) + assert len(results) == num_requests assert all(completion is not None for completion in results) _, server_args = servers[0] @@ -309,33 +312,28 @@ async def make_streaming_request(client: openai.AsyncOpenAI): await asyncio.sleep(0.5) # Send streaming requests to all nodes - num_requests_per_node = 25 # Total 50 requests across 2 nodes + num_requests = 200 # Total 200 requests across 2 nodes all_tasks = [] - - for i, client in enumerate(clients): - tasks = [ - make_streaming_request(client) - for _ in range(num_requests_per_node) - ] - all_tasks.extend(tasks) + for i in range(num_requests): + client = clients[i % len(clients)] + all_tasks.append(asyncio.create_task(make_streaming_request(client))) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) - assert len(results) == num_requests_per_node * len(clients) + assert len(results) == num_requests assert all(results), "Not all streaming requests completed successfully." await asyncio.sleep(0.5) # Second burst of streaming requests all_tasks = [] - for i, client in enumerate(clients): - tasks = [ - make_streaming_request(client) - for _ in range(num_requests_per_node) - ] - all_tasks.extend(tasks) + for i in range(num_requests): + client = clients[i % len(clients)] + all_tasks.append(asyncio.create_task(make_streaming_request(client))) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) - assert len(results) == num_requests_per_node * len(clients) + assert len(results) == num_requests assert all(results), "Not all streaming requests completed successfully." 
_, server_args = servers[0] diff --git a/tests/v1/test_internal_lb_dp.py b/tests/v1/test_internal_lb_dp.py index 9aef4d5821e..ca80d3a4949 100644 --- a/tests/v1/test_internal_lb_dp.py +++ b/tests/v1/test_internal_lb_dp.py @@ -11,7 +11,7 @@ from tests.utils import RemoteOpenAIServer from tests.v1.test_utils import check_request_balancing -from vllm.platforms import Platform +from vllm.platforms import current_platform MODEL_NAME = "ibm-research/PowerMoE-3b" @@ -96,10 +96,12 @@ def start_server(r: int, sargs: list[str]): sargs, auto_port=False, env_dict={ - "CUDA_VISIBLE_DEVICES": + current_platform.device_control_env_var: ",".join( - str(Platform.device_id_to_physical_device_id( - i)) for i in range(r, r + gpus_per_node)) + str( + current_platform. + device_id_to_physical_device_id(i)) + for i in range(r, r + gpus_per_node)) }) server.__enter__() if r == 0: @@ -219,9 +221,11 @@ def start_engines_server(): engines_server_args, auto_port=False, env_dict={ - "CUDA_VISIBLE_DEVICES": + current_platform.device_control_env_var: ",".join( - str(Platform.device_id_to_physical_device_id(i)) + str( + current_platform. + device_id_to_physical_device_id(i)) for i in range(self.dp_size * self.tp_size)) }) server.__enter__() @@ -330,7 +334,7 @@ async def make_request(): completion = await client.completions.create( model=model_name, prompt="Hello, my name is", - max_tokens=10, + max_tokens=5, temperature=1.0) assert completion.id is not None @@ -361,8 +365,11 @@ async def make_request(): await asyncio.sleep(0.5) # Send multiple requests - internal LB should distribute across DP ranks - num_requests = 50 - all_tasks = [make_request() for _ in range(num_requests)] + num_requests = 200 + all_tasks = [] + for _ in range(num_requests): + all_tasks.append(asyncio.create_task(make_request())) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) assert len(results) == num_requests @@ -371,7 +378,10 @@ async def make_request(): await asyncio.sleep(0.5) # Second burst of requests - all_tasks = [make_request() for _ in range(num_requests)] + all_tasks = [] + for _ in range(num_requests): + all_tasks.append(asyncio.create_task(make_request())) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) assert len(results) == num_requests @@ -449,8 +459,11 @@ async def make_streaming_request(): # Send multiple streaming requests - internal LB should distribute across # DP ranks - num_requests = 50 - all_tasks = [make_streaming_request() for _ in range(num_requests)] + num_requests = 200 + all_tasks = [] + for _ in range(num_requests): + all_tasks.append(asyncio.create_task(make_streaming_request())) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) assert len(results) == num_requests @@ -459,7 +472,10 @@ async def make_streaming_request(): await asyncio.sleep(0.5) # Second burst of streaming requests - all_tasks = [make_streaming_request() for _ in range(num_requests)] + all_tasks = [] + for _ in range(num_requests): + all_tasks.append(asyncio.create_task(make_streaming_request())) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) assert len(results) == num_requests @@ -492,7 +508,7 @@ async def make_request(): completion = await api_only_client.completions.create( model=model_name, prompt="Hello, my name is", - max_tokens=10, + max_tokens=5, temperature=1.0) assert completion.id is not None @@ -522,8 +538,11 @@ async def make_request(): # Send multiple requests - should be distributed across engines on # headless server - num_requests = 50 - 
all_tasks = [make_request() for _ in range(num_requests)] + num_requests = 200 + all_tasks = [] + for _ in range(num_requests): + all_tasks.append(asyncio.create_task(make_request())) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) assert len(results) == num_requests @@ -532,7 +551,10 @@ async def make_request(): await asyncio.sleep(0.5) # Second burst of requests - all_tasks = [make_request() for _ in range(num_requests)] + all_tasks = [] + for _ in range(num_requests): + all_tasks.append(asyncio.create_task(make_request())) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) assert len(results) == num_requests @@ -610,8 +632,11 @@ async def make_streaming_request(): await asyncio.sleep(0.5) # Send multiple streaming requests - should be distributed across engines - num_requests = 50 - all_tasks = [make_streaming_request() for _ in range(num_requests)] + num_requests = 200 + all_tasks = [] + for _ in range(num_requests): + all_tasks.append(asyncio.create_task(make_streaming_request())) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) assert len(results) == num_requests @@ -620,7 +645,10 @@ async def make_streaming_request(): await asyncio.sleep(0.5) # Second burst of streaming requests - all_tasks = [make_streaming_request() for _ in range(num_requests)] + all_tasks = [] + for _ in range(num_requests): + all_tasks.append(asyncio.create_task(make_streaming_request())) + await asyncio.sleep(0.01) results = await asyncio.gather(*all_tasks) assert len(results) == num_requests From 7438f06f44724300b191e15beae8dc03cfcc369f Mon Sep 17 00:00:00 2001 From: Xu Wenqing <121550081+Xu-Wenqing@users.noreply.github.com> Date: Fri, 25 Jul 2025 17:36:55 +0800 Subject: [PATCH 356/552] Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (#21598) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: 许文卿 Signed-off-by: x22x22 --- ...E=160,N=320,device_name=NVIDIA_H20-3e.json | 146 ++++++++++++++++++ 1 file changed, 146 insertions(+) create mode 100644 vllm/model_executor/layers/fused_moe/configs/E=160,N=320,device_name=NVIDIA_H20-3e.json diff --git a/vllm/model_executor/layers/fused_moe/configs/E=160,N=320,device_name=NVIDIA_H20-3e.json b/vllm/model_executor/layers/fused_moe/configs/E=160,N=320,device_name=NVIDIA_H20-3e.json new file mode 100644 index 00000000000..52f2a8278c8 --- /dev/null +++ b/vllm/model_executor/layers/fused_moe/configs/E=160,N=320,device_name=NVIDIA_H20-3e.json @@ -0,0 +1,146 @@ +{ + "1": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 32, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "2": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 64, + "num_warps": 4, + "num_stages": 4 + }, + "4": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "8": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "16": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "24": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "32": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "48": { + 
"BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "64": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 4 + }, + "96": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "128": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "256": { + "BLOCK_SIZE_M": 16, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "512": { + "BLOCK_SIZE_M": 32, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "1024": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "1536": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 64, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 16, + "num_warps": 4, + "num_stages": 3 + }, + "2048": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 1, + "num_warps": 4, + "num_stages": 3 + }, + "3072": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + }, + "4096": { + "BLOCK_SIZE_M": 64, + "BLOCK_SIZE_N": 128, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 32, + "num_warps": 4, + "num_stages": 3 + } +} From 10ed9ade3462cb8ca4341e928b205ec9210a3b42 Mon Sep 17 00:00:00 2001 From: Kebe Date: Fri, 25 Jul 2025 18:42:23 +0800 Subject: [PATCH 357/552] [Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' (#21579) Signed-off-by: Kebe Signed-off-by: x22x22 --- vllm/transformers_utils/config.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py index 8d1f59e6ead..da475c3b50a 100644 --- a/vllm/transformers_utils/config.py +++ b/vllm/transformers_utils/config.py @@ -584,7 +584,7 @@ def get_pooling_config_name(pooling_name: str) -> Union[str, None]: @cache -def get_sentence_transformer_tokenizer_config(model: str, +def get_sentence_transformer_tokenizer_config(model: Union[str, Path], revision: Optional[str] = 'main' ): """ @@ -592,7 +592,7 @@ def get_sentence_transformer_tokenizer_config(model: str, given Sentence Transformer BERT model. Parameters: - - model (str): The name of the Sentence Transformer + - model (str|Path): The name of the Sentence Transformer BERT model. - revision (str, optional): The revision of the m odel to use. Defaults to 'main'. 
@@ -620,7 +620,7 @@ def get_sentence_transformer_tokenizer_config(model: str, if encoder_dict: break - if not encoder_dict and not model.startswith("/"): + if not encoder_dict and not Path(model).is_absolute(): try: # If model is on HuggingfaceHub, get the repo files repo_files = list_repo_files(model, From 7a251e9cc0cf3ce2b37e4ce9c6067fec277f14da Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Fri, 25 Jul 2025 18:57:34 +0800 Subject: [PATCH 358/552] [Quantization] Enable BNB support for more MoE models (#21370) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- vllm/model_executor/models/dots1.py | 148 +++++++++++++------------ vllm/model_executor/models/glm4_moe.py | 25 +++-- 2 files changed, 93 insertions(+), 80 deletions(-) diff --git a/vllm/model_executor/models/dots1.py b/vllm/model_executor/models/dots1.py index 4bdcbfabbbc..9b21a794461 100644 --- a/vllm/model_executor/models/dots1.py +++ b/vllm/model_executor/models/dots1.py @@ -54,8 +54,8 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .interfaces import SupportsPP -from .utils import (PPMissingLayer, is_pp_missing_parameter, +from .interfaces import SupportsLoRA, SupportsPP +from .utils import (AutoWeightsLoader, PPMissingLayer, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, maybe_prefix) @@ -327,6 +327,7 @@ def forward( return hidden_states, residual +@support_torch_compile class Dots1Model(nn.Module): fall_back_to_pt_during_load = False @@ -404,68 +405,12 @@ def forward( hidden_states, _ = self.norm(hidden_states, residual) return hidden_states - -@support_torch_compile -class Dots1ForCausalLM(nn.Module, SupportsPP): - - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - config = vllm_config.model_config.hf_config - quant_config = vllm_config.quant_config - self.config = config - self.quant_config = quant_config - self.model = Dots1Model(vllm_config=vllm_config, - prefix=maybe_prefix(prefix, "model")) - if get_pp_group().is_last_rank: - self.lm_head = ParallelLMHead(config.vocab_size, - config.hidden_size, - quant_config=quant_config) - else: - self.lm_head = PPMissingLayer() - self.logits_processor = LogitsProcessor(config.vocab_size) - self.make_empty_intermediate_tensors = ( - self.model.make_empty_intermediate_tensors) - - def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: - return self.model.get_input_embeddings(input_ids) - - def forward( - self, - input_ids: torch.Tensor, - positions: torch.Tensor, - intermediate_tensors: Optional[IntermediateTensors] = None, - inputs_embeds: Optional[torch.Tensor] = None, - ) -> Union[torch.Tensor, IntermediateTensors]: - hidden_states = self.model( - input_ids, - positions, - intermediate_tensors, - inputs_embeds, - ) - return hidden_states - - def compute_logits( - self, - hidden_states: torch.Tensor, - sampling_metadata: SamplingMetadata, - ) -> Optional[torch.Tensor]: - logits = self.logits_processor(self.lm_head, hidden_states, - sampling_metadata) - return logits - - def make_empty_intermediate_tensors( - self, batch_size: int, dtype: torch.dtype, - device: torch.device) -> IntermediateTensors: - return IntermediateTensors({ - "hidden_states": - torch.zeros((batch_size, self.config.hidden_size), - dtype=dtype, - device=device), - "residual": - torch.zeros((batch_size, self.config.hidden_size), - dtype=dtype, - device=device), - }) + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return 
FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.n_routed_experts) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: @@ -477,14 +422,9 @@ def load_weights(self, weights: Iterable[tuple[str, ("gate_up_proj", "up_proj", 1), ] - expert_params_mapping = FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="gate_proj", - ckpt_down_proj_name="down_proj", - ckpt_up_proj_name="up_proj", - num_experts=self.config.n_routed_experts) - params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: if "rotary_emb.inv_freq" in name: continue @@ -534,3 +474,71 @@ def load_weights(self, weights: Iterable[tuple[str, weight_loader(param, loaded_weight) loaded_params.add(name) return loaded_params + + +class Dots1ForCausalLM(nn.Module, SupportsPP, SupportsLoRA): + + packed_modules_mapping = { + "qkv_proj": [ + "q_proj", + "k_proj", + "v_proj", + ], + "gate_up_proj": [ + "gate_proj", + "up_proj", + ], + } + + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + config = vllm_config.model_config.hf_config + quant_config = vllm_config.quant_config + self.config = config + self.quant_config = quant_config + self.model = Dots1Model(vllm_config=vllm_config, + prefix=maybe_prefix(prefix, "model")) + if get_pp_group().is_last_rank: + self.lm_head = ParallelLMHead(config.vocab_size, + config.hidden_size, + quant_config=quant_config) + else: + self.lm_head = PPMissingLayer() + self.logits_processor = LogitsProcessor(config.vocab_size) + self.make_empty_intermediate_tensors = ( + self.model.make_empty_intermediate_tensors) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + return self.model.get_input_embeddings(input_ids) + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + hidden_states = self.model( + input_ids, + positions, + intermediate_tensors, + inputs_embeds, + ) + return hidden_states + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + logits = self.logits_processor(self.lm_head, hidden_states, + sampling_metadata) + return logits + + def load_weights(self, weights: Iterable[tuple[str, + torch.Tensor]]) -> set[str]: + loader = AutoWeightsLoader(self) + return loader.load_weights(weights) + + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return self.model.get_expert_mapping() diff --git a/vllm/model_executor/models/glm4_moe.py b/vllm/model_executor/models/glm4_moe.py index 43824abb571..6a196fef572 100644 --- a/vllm/model_executor/models/glm4_moe.py +++ b/vllm/model_executor/models/glm4_moe.py @@ -53,7 +53,7 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .interfaces import SupportsPP +from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, PPMissingLayer, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, maybe_prefix) @@ -461,6 +461,15 @@ def make_empty_intermediate_tensors( device=device), }) + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + # Params for weights, 
fp8 weight scales, fp8 activation scales + # (param_name, weight_name, expert_id, shard_id) + return FusedMoE.make_expert_params_mapping( + ckpt_gate_proj_name="gate_proj", + ckpt_down_proj_name="down_proj", + ckpt_up_proj_name="up_proj", + num_experts=self.config.n_routed_experts) + def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: stacked_params_mapping = [ @@ -472,16 +481,9 @@ def load_weights(self, weights: Iterable[tuple[str, ("gate_up_proj", "up_proj", 1), ] - # Params for weights, fp8 weight scales, fp8 activation scales - # (param_name, weight_name, expert_id, shard_id) - expert_params_mapping = FusedMoE.make_expert_params_mapping( - ckpt_gate_proj_name="gate_proj", - ckpt_down_proj_name="down_proj", - ckpt_up_proj_name="up_proj", - num_experts=self.config.n_routed_experts) - params_dict = dict(self.named_parameters()) loaded_params: set[str] = set() + expert_params_mapping = self.get_expert_mapping() for name, loaded_weight in weights: spec_layer = get_spec_layer_idx_from_weight_name(self.config, name) if spec_layer is not None: @@ -570,7 +572,7 @@ def load_weights(self, weights: Iterable[tuple[str, return loaded_params -class Glm4MoeForCausalLM(nn.Module, SupportsPP): +class Glm4MoeForCausalLM(nn.Module, SupportsPP, SupportsLoRA): packed_modules_mapping = { "qkv_proj": [ "q_proj", @@ -677,6 +679,9 @@ def load_weights(self, weights: Iterable[tuple[str, loader = AutoWeightsLoader(self) return loader.load_weights(weights) + def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: + return self.model.get_expert_mapping() + def get_spec_layer_idx_from_weight_name(config: PretrainedConfig, weight_name: str) -> Optional[int]: From fe2ded5d831cda4cbfff01ce2b91121abc24e7e0 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Fri, 25 Jul 2025 20:36:45 +0800 Subject: [PATCH 359/552] [V1] Get supported tasks from model runner instead of model config (#21585) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/entrypoints/llm.py | 24 +++++++++++----- vllm/entrypoints/openai/api_server.py | 32 +++++++++++++--------- vllm/entrypoints/openai/run_batch.py | 21 +++++++++----- vllm/executor/executor_base.py | 8 +++--- vllm/model_executor/layers/pooler.py | 3 +- vllm/model_executor/models/bert.py | 2 +- vllm/model_executor/models/gritlm.py | 2 +- vllm/model_executor/models/modernbert.py | 2 +- vllm/pooling_params.py | 5 ++-- vllm/tasks.py | 11 ++++++++ vllm/v1/engine/async_llm.py | 4 +++ vllm/v1/engine/core.py | 11 ++++++-- vllm/v1/engine/core_client.py | 16 +++++++++++ vllm/v1/engine/llm_engine.py | 4 +++ vllm/v1/worker/gpu_model_runner.py | 35 +++++++++++++++++++++--- vllm/v1/worker/gpu_worker.py | 6 ++-- vllm/v1/worker/tpu_model_runner.py | 31 +++++++++++++++++++-- vllm/v1/worker/tpu_worker.py | 6 ++-- vllm/worker/model_runner_base.py | 31 +++++++++++++++++++-- 19 files changed, 200 insertions(+), 54 deletions(-) create mode 100644 vllm/tasks.py diff --git a/vllm/entrypoints/llm.py b/vllm/entrypoints/llm.py index 2f766a2dae5..2c961156bc8 100644 --- a/vllm/entrypoints/llm.py +++ b/vllm/entrypoints/llm.py @@ -14,6 +14,7 @@ from tqdm.auto import tqdm from typing_extensions import TypeVar, deprecated +import vllm.envs as envs from vllm.beam_search import (BeamSearchInstance, BeamSearchOutput, BeamSearchSequence, create_sort_beams_key_function) @@ -44,9 +45,10 @@ from vllm.outputs import (ClassificationRequestOutput, EmbeddingRequestOutput, PoolingRequestOutput, RequestOutput, ScoringRequestOutput) -from vllm.pooling_params import PoolingParams, PoolingTask 
+from vllm.pooling_params import PoolingParams from vllm.sampling_params import (BeamSearchParams, GuidedDecodingParams, RequestOutputKind, SamplingParams) +from vllm.tasks import PoolingTask from vllm.transformers_utils.tokenizer import (AnyTokenizer, MistralTokenizer, get_cached_tokenizer) from vllm.usage.usage_lib import UsageContext @@ -277,6 +279,16 @@ def __init__( self.request_counter = Counter() self.default_sampling_params: Union[dict[str, Any], None] = None + if envs.VLLM_USE_V1: + supported_tasks = self.llm_engine \ + .get_supported_tasks() # type: ignore + else: + supported_tasks = self.llm_engine.model_config.supported_tasks + + logger.info("Supported_tasks: %s", supported_tasks) + + self.supported_tasks = supported_tasks + def get_tokenizer( self, lora_request: Optional[LoRARequest] = None, @@ -1170,8 +1182,7 @@ def embed( A list of `EmbeddingRequestOutput` objects containing the embedding vectors in the same order as the input prompts. """ - model_config = self.llm_engine.model_config - if "embed" not in model_config.supported_tasks: + if "embed" not in self.supported_tasks: raise ValueError("Embedding API is not supported by this model. " "Please set `--task embed`.") @@ -1215,8 +1226,7 @@ def classify( A list of `ClassificationRequestOutput` objects containing the embedding vectors in the same order as the input prompts. """ - model_config = self.llm_engine.model_config - if "classify" not in model_config.supported_tasks: + if "classify" not in self.supported_tasks: raise ValueError( "Classification API is not supported by this model. " "Please set `--task classify`.") @@ -1397,8 +1407,8 @@ def score( raise ValueError(" ".join(messages)) - if all(t not in model_config.supported_tasks - for t in ("embed", "classify")): + supported_tasks = self.supported_tasks + if all(t not in supported_tasks for t in ("embed", "classify")): raise ValueError("Score API is not supported by this model. 
" "Please set `--task embed` or `--task classify`.") diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 8540d25d4e9..5b87aed06e9 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1586,6 +1586,14 @@ async def init_app_state( state.vllm_config = vllm_config model_config = vllm_config.model_config + if envs.VLLM_USE_V1: + supported_tasks = await engine_client \ + .get_supported_tasks() # type: ignore + else: + supported_tasks = model_config.supported_tasks + + logger.info("Supported_tasks: %s", supported_tasks) + resolved_chat_template = load_chat_template(args.chat_template) if resolved_chat_template is not None: # Get the tokenizer to check official template @@ -1647,7 +1655,7 @@ async def init_app_state( reasoning_parser=args.reasoning_parser, enable_prompt_tokens_details=args.enable_prompt_tokens_details, enable_force_include_usage=args.enable_force_include_usage, - ) if "generate" in model_config.supported_tasks else None + ) if "generate" in supported_tasks else None state.openai_serving_chat = OpenAIServingChat( engine_client, model_config, @@ -1664,7 +1672,7 @@ async def init_app_state( reasoning_parser=args.reasoning_parser, enable_prompt_tokens_details=args.enable_prompt_tokens_details, enable_force_include_usage=args.enable_force_include_usage, - ) if "generate" in model_config.supported_tasks else None + ) if "generate" in supported_tasks else None state.openai_serving_completion = OpenAIServingCompletion( engine_client, model_config, @@ -1673,7 +1681,7 @@ async def init_app_state( return_tokens_as_token_ids=args.return_tokens_as_token_ids, enable_prompt_tokens_details=args.enable_prompt_tokens_details, enable_force_include_usage=args.enable_force_include_usage, - ) if "generate" in model_config.supported_tasks else None + ) if "generate" in supported_tasks else None state.openai_serving_pooling = OpenAIServingPooling( engine_client, model_config, @@ -1681,7 +1689,7 @@ async def init_app_state( request_logger=request_logger, chat_template=resolved_chat_template, chat_template_content_format=args.chat_template_content_format, - ) if "encode" in model_config.supported_tasks else None + ) if "encode" in supported_tasks else None state.openai_serving_embedding = OpenAIServingEmbedding( engine_client, model_config, @@ -1689,24 +1697,22 @@ async def init_app_state( request_logger=request_logger, chat_template=resolved_chat_template, chat_template_content_format=args.chat_template_content_format, - ) if "embed" in model_config.supported_tasks else None + ) if "embed" in supported_tasks else None state.openai_serving_classification = ServingClassification( engine_client, model_config, state.openai_serving_models, request_logger=request_logger, - ) if "classify" in model_config.supported_tasks else None + ) if "classify" in supported_tasks else None - enable_serving_reranking = ("classify" in model_config.supported_tasks - and getattr(model_config.hf_config, - "num_labels", 0) == 1) + enable_serving_reranking = ("classify" in supported_tasks and getattr( + model_config.hf_config, "num_labels", 0) == 1) state.openai_serving_scores = ServingScores( engine_client, model_config, state.openai_serving_models, request_logger=request_logger, - ) if ("embed" in model_config.supported_tasks - or enable_serving_reranking) else None + ) if ("embed" in supported_tasks or enable_serving_reranking) else None state.openai_serving_tokenization = OpenAIServingTokenization( engine_client, @@ -1721,13 +1727,13 @@ 
async def init_app_state( model_config, state.openai_serving_models, request_logger=request_logger, - ) if "transcription" in model_config.supported_tasks else None + ) if "transcription" in supported_tasks else None state.openai_serving_translation = OpenAIServingTranslation( engine_client, model_config, state.openai_serving_models, request_logger=request_logger, - ) if "transcription" in model_config.supported_tasks else None + ) if "transcription" in supported_tasks else None state.task = model_config.task state.enable_server_load_tracking = args.enable_server_load_tracking diff --git a/vllm/entrypoints/openai/run_batch.py b/vllm/entrypoints/openai/run_batch.py index 57705509232..137b368dad2 100644 --- a/vllm/entrypoints/openai/run_batch.py +++ b/vllm/entrypoints/openai/run_batch.py @@ -14,6 +14,7 @@ from prometheus_client import start_http_server from tqdm import tqdm +import vllm.envs as envs from vllm.config import VllmConfig from vllm.engine.arg_utils import AsyncEngineArgs, optional_type from vllm.engine.protocol import EngineClient @@ -335,6 +336,14 @@ async def run_batch( model_config = vllm_config.model_config + if envs.VLLM_USE_V1: + supported_tasks = await engine_client \ + .get_supported_tasks() # type: ignore + else: + supported_tasks = model_config.supported_tasks + + logger.info("Supported_tasks: %s", supported_tasks) + # Create the openai serving objects. openai_serving_models = OpenAIServingModels( engine_client=engine_client, @@ -351,7 +360,7 @@ async def run_batch( chat_template=None, chat_template_content_format="auto", enable_prompt_tokens_details=args.enable_prompt_tokens_details, - ) if "generate" in model_config.supported_tasks else None + ) if "generate" in supported_tasks else None openai_serving_embedding = OpenAIServingEmbedding( engine_client, model_config, @@ -359,19 +368,17 @@ async def run_batch( request_logger=request_logger, chat_template=None, chat_template_content_format="auto", - ) if "embed" in model_config.supported_tasks else None + ) if "embed" in supported_tasks else None - enable_serving_reranking = ("classify" in model_config.supported_tasks - and getattr(model_config.hf_config, - "num_labels", 0) == 1) + enable_serving_reranking = ("classify" in supported_tasks and getattr( + model_config.hf_config, "num_labels", 0) == 1) openai_serving_scores = ServingScores( engine_client, model_config, openai_serving_models, request_logger=request_logger, - ) if ("embed" in model_config.supported_tasks - or enable_serving_reranking) else None + ) if ("embed" in supported_tasks or enable_serving_reranking) else None tracker = BatchProgressTracker() logger.info("Reading batch from %s...", args.input_file) diff --git a/vllm/executor/executor_base.py b/vllm/executor/executor_base.py index 483fdb1486f..97d0d6f08b8 100644 --- a/vllm/executor/executor_base.py +++ b/vllm/executor/executor_base.py @@ -16,8 +16,8 @@ from vllm.logger import init_logger from vllm.lora.request import LoRARequest from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.pooling_params import PoolingTask from vllm.sequence import ExecuteModelRequest, PoolerOutput +from vllm.tasks import SupportedTask from vllm.utils import make_async from vllm.worker.worker_base import WorkerBase @@ -136,9 +136,9 @@ def rpc_func(worker: WorkerBase) -> _R: return self.collective_rpc(rpc_func) @cached_property # Avoid unnecessary RPC calls - def supported_pooling_tasks(self) -> tuple[PoolingTask, ...]: - output = self.collective_rpc("get_supported_pooling_tasks") - return tuple({task for tasks 
in output for task in tasks}) + def supported_tasks(self) -> tuple[SupportedTask, ...]: + output = self.collective_rpc("get_supported_tasks") + return output[0] def execute_model( self, execute_model_req: ExecuteModelRequest diff --git a/vllm/model_executor/layers/pooler.py b/vllm/model_executor/layers/pooler.py index c06cca08022..5bfd4aaccc1 100644 --- a/vllm/model_executor/layers/pooler.py +++ b/vllm/model_executor/layers/pooler.py @@ -16,8 +16,9 @@ from vllm.model_executor.pooling_metadata import ( # noqa: E501 PoolingMetadata as V0PoolingMetadata) from vllm.model_executor.pooling_metadata import PoolingTensors -from vllm.pooling_params import PoolingParams, PoolingTask +from vllm.pooling_params import PoolingParams from vllm.sequence import PoolerOutput, PoolingSequenceGroupOutput +from vllm.tasks import PoolingTask from vllm.utils import resolve_obj_by_qualname from vllm.v1.pool.metadata import PoolingMetadata as V1PoolingMetadata diff --git a/vllm/model_executor/models/bert.py b/vllm/model_executor/models/bert.py index 9dc6115f850..c3066aaa2b8 100644 --- a/vllm/model_executor/models/bert.py +++ b/vllm/model_executor/models/bert.py @@ -26,8 +26,8 @@ from vllm.model_executor.layers.vocab_parallel_embedding import ( VocabParallelEmbedding) from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors +from vllm.tasks import PoolingTask from .interfaces import SupportsCrossEncoding, SupportsQuant, SupportsV0Only from .utils import AutoWeightsLoader, WeightsMapper, maybe_prefix diff --git a/vllm/model_executor/models/gritlm.py b/vllm/model_executor/models/gritlm.py index 8a3fbc6a49f..c99970284a9 100644 --- a/vllm/model_executor/models/gritlm.py +++ b/vllm/model_executor/models/gritlm.py @@ -16,8 +16,8 @@ get_prompt_token_ids) from vllm.model_executor.models.llama import LlamaForCausalLM from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.pooling_params import PoolingTask from vllm.sequence import PoolerOutput +from vllm.tasks import PoolingTask from vllm.transformers_utils.tokenizer import cached_tokenizer_from_config from .interfaces import SupportsV0Only diff --git a/vllm/model_executor/models/modernbert.py b/vllm/model_executor/models/modernbert.py index be1c3438d9d..fc2b0c1f518 100644 --- a/vllm/model_executor/models/modernbert.py +++ b/vllm/model_executor/models/modernbert.py @@ -23,8 +23,8 @@ VocabParallelEmbedding) from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.pooling_metadata import PoolingMetadata -from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors +from vllm.tasks import PoolingTask from .interfaces import SupportsCrossEncoding, SupportsV0Only from .utils import WeightsMapper, maybe_prefix diff --git a/vllm/pooling_params.py b/vllm/pooling_params.py index 868facbe255..23eb775f2dc 100644 --- a/vllm/pooling_params.py +++ b/vllm/pooling_params.py @@ -1,17 +1,16 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from typing import TYPE_CHECKING, Literal, Optional +from typing import TYPE_CHECKING, Optional import msgspec from vllm.sampling_params import RequestOutputKind +from vllm.tasks import PoolingTask if TYPE_CHECKING: from vllm.config import ModelConfig -PoolingTask = Literal["encode", "embed", "classify", "score"] - class PoolingParams( msgspec.Struct, diff --git a/vllm/tasks.py b/vllm/tasks.py new file 
mode 100644 index 00000000000..85c5c6e4362 --- /dev/null +++ b/vllm/tasks.py @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from typing import Literal, get_args + +GenerationTask = Literal["generate", "transcription"] +GENERATION_TASKS = get_args(GenerationTask) + +PoolingTask = Literal["encode", "embed", "classify", "score"] +POOLING_TASKS = get_args(PoolingTask) + +SupportedTask = Literal[GenerationTask, PoolingTask] diff --git a/vllm/v1/engine/async_llm.py b/vllm/v1/engine/async_llm.py index 02cb80197fa..ed0d9620f47 100644 --- a/vllm/v1/engine/async_llm.py +++ b/vllm/v1/engine/async_llm.py @@ -21,6 +21,7 @@ from vllm.outputs import PoolingRequestOutput, RequestOutput from vllm.pooling_params import PoolingParams from vllm.sampling_params import SamplingParams +from vllm.tasks import SupportedTask from vllm.transformers_utils.config import ( maybe_register_config_serialize_by_value) from vllm.transformers_utils.tokenizer import AnyTokenizer @@ -211,6 +212,9 @@ def shutdown(self): if handler := getattr(self, "output_handler", None): handler.cancel() + async def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + return await self.engine_core.get_supported_tasks_async() + async def add_request( self, request_id: str, diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index 88c511606d7..4124ee05326 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -23,6 +23,7 @@ from vllm.logger import init_logger from vllm.logging_utils.dump_input import dump_engine_exception from vllm.lora.request import LoRARequest +from vllm.tasks import POOLING_TASKS, SupportedTask from vllm.transformers_utils.config import ( maybe_register_config_serialize_by_value) from vllm.utils import (bind_process_name, make_zmq_socket, @@ -195,11 +196,17 @@ def _initialize_kv_caches( "warmup model) took %.2f seconds"), elapsed) return num_gpu_blocks, num_cpu_blocks, scheduler_kv_cache_config + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + return self.model_executor.supported_tasks + def add_request(self, request: EngineCoreRequest): """Add request to the scheduler.""" if pooling_params := request.pooling_params: - supported_pooling_tasks = ( - self.model_executor.supported_pooling_tasks) + supported_pooling_tasks = [ + task for task in self.get_supported_tasks() + if task in POOLING_TASKS + ] + if pooling_params.task not in supported_pooling_tasks: raise ValueError(f"Unsupported task: {pooling_params.task!r} " f"Supported tasks: {supported_pooling_tasks}") diff --git a/vllm/v1/engine/core_client.py b/vllm/v1/engine/core_client.py index 69ae3690d00..b14d85bbf8e 100644 --- a/vllm/v1/engine/core_client.py +++ b/vllm/v1/engine/core_client.py @@ -21,6 +21,7 @@ from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.lora.request import LoRARequest +from vllm.tasks import SupportedTask from vllm.utils import get_open_port, get_open_zmq_inproc_path, make_zmq_socket from vllm.v1.engine import (EngineCoreOutputs, EngineCoreRequest, EngineCoreRequestType, @@ -104,6 +105,9 @@ def shutdown(self): def get_output(self) -> EngineCoreOutputs: raise NotImplementedError + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + raise NotImplementedError + def add_request(self, request: EngineCoreRequest) -> None: raise NotImplementedError @@ -170,6 +174,9 @@ async def scale_elastic_ep(self, new_data_parallel_size: int) -> None: async def get_output_async(self) -> EngineCoreOutputs: raise 
NotImplementedError + async def get_supported_tasks_async(self) -> tuple[SupportedTask, ...]: + raise NotImplementedError + async def add_request_async(self, request: EngineCoreRequest) -> None: raise NotImplementedError @@ -238,6 +245,9 @@ def get_output(self) -> EngineCoreOutputs: outputs, _ = self.engine_core.step() return outputs.get(0) or EngineCoreOutputs() + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + return self.engine_core.get_supported_tasks() + def add_request(self, request: EngineCoreRequest) -> None: self.engine_core.add_request(request) @@ -608,6 +618,9 @@ def call_utility(self, method: str, *args) -> Any: return future.result() + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + return self.call_utility("get_supported_tasks") + def add_request(self, request: EngineCoreRequest) -> None: if self.is_dp: self.engines_running = True @@ -802,6 +815,9 @@ async def _call_utility_async(self, method: str, *args, self._ensure_output_queue_task() return await future + async def get_supported_tasks_async(self) -> tuple[SupportedTask, ...]: + return await self.call_utility_async("get_supported_tasks") + async def add_request_async(self, request: EngineCoreRequest) -> None: request.client_index = self.client_index await self._send_input(EngineCoreRequestType.ADD, request) diff --git a/vllm/v1/engine/llm_engine.py b/vllm/v1/engine/llm_engine.py index 991242e1827..efbdffbc090 100644 --- a/vllm/v1/engine/llm_engine.py +++ b/vllm/v1/engine/llm_engine.py @@ -18,6 +18,7 @@ from vllm.outputs import PoolingRequestOutput, RequestOutput from vllm.pooling_params import PoolingParams from vllm.sampling_params import SamplingParams +from vllm.tasks import SupportedTask from vllm.transformers_utils.tokenizer_group import ( TokenizerGroup, init_tokenizer_from_configs) from vllm.usage.usage_lib import UsageContext @@ -176,6 +177,9 @@ def has_unfinished_requests_dp(self, has_unfinished: bool) -> bool: def validate_outputs(cls, outputs, output_type): return outputs + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + return self.engine_core.get_supported_tasks() + def abort_request(self, request_ids: list[str]) -> None: """Remove request_ids from EngineCore and Detokenizer.""" diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 32004ced4aa..5fe594db667 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -30,15 +30,17 @@ from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaBase from vllm.model_executor.layers.rotary_embedding import MRotaryEmbedding from vllm.model_executor.model_loader import TensorizerLoader, get_model_loader -from vllm.model_executor.models.interfaces import is_mixture_of_experts -from vllm.model_executor.models.interfaces_base import (VllmModelForPooling, - is_pooling_model) +from vllm.model_executor.models.interfaces import (is_mixture_of_experts, + supports_transcription) +from vllm.model_executor.models.interfaces_base import ( + VllmModelForPooling, is_pooling_model, is_text_generation_model) from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.inputs import MultiModalKwargs, PlaceholderRange from vllm.multimodal.utils import group_mm_inputs_by_modality -from vllm.pooling_params import PoolingParams, PoolingTask +from vllm.pooling_params import PoolingParams from vllm.sampling_params import SamplingType from vllm.sequence import IntermediateTensors, PoolerOutput +from vllm.tasks import GenerationTask, PoolingTask, SupportedTask from 
vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, DeviceMemoryProfiler, GiB_bytes, LazyLoader, check_use_alibi, get_dtype_size, is_pin_memory_available, round_up) @@ -1153,6 +1155,21 @@ def _gather_mm_embeddings( def get_model(self) -> nn.Module: return self.model + def get_supported_generation_tasks(self) -> list[GenerationTask]: + model = self.get_model() + supported_tasks = list[GenerationTask]() + + if is_text_generation_model(model): + supported_tasks.append("generate") + + if supports_transcription(model): + if model.supports_transcription_only: + return ["transcription"] + + supported_tasks.append("transcription") + + return supported_tasks + def get_supported_pooling_tasks(self) -> list[PoolingTask]: model = self.get_model() if not is_pooling_model(model): @@ -1160,6 +1177,16 @@ def get_supported_pooling_tasks(self) -> list[PoolingTask]: return list(model.pooler.get_supported_tasks()) + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + tasks = list[SupportedTask]() + + if self.model_config.runner_type == "generate": + tasks.extend(self.get_supported_generation_tasks()) + if self.model_config.runner_type == "pooling": + tasks.extend(self.get_supported_pooling_tasks()) + + return tuple(tasks) + def apply_grammar_bitmask( self, scheduler_output: "SchedulerOutput", diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py index 52294635114..dcfb038d28c 100644 --- a/vllm/v1/worker/gpu_worker.py +++ b/vllm/v1/worker/gpu_worker.py @@ -23,8 +23,8 @@ from vllm.lora.request import LoRARequest from vllm.model_executor import set_random_seed from vllm.platforms import current_platform -from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors +from vllm.tasks import SupportedTask from vllm.utils import GiB_bytes, MemorySnapshot, memory_profiling from vllm.v1.engine import ReconfigureDistributedRequest, ReconfigureRankType from vllm.v1.kv_cache_interface import KVCacheConfig, KVCacheSpec @@ -320,8 +320,8 @@ def compile_or_warm_up_model(self) -> None: def get_model(self) -> nn.Module: return self.model_runner.get_model() - def get_supported_pooling_tasks(self) -> list[PoolingTask]: - return self.model_runner.get_supported_pooling_tasks() + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + return self.model_runner.get_supported_tasks() @torch.inference_mode() def execute_model( diff --git a/vllm/v1/worker/tpu_model_runner.py b/vllm/v1/worker/tpu_model_runner.py index e8c80084589..59cbb015057 100644 --- a/vllm/v1/worker/tpu_model_runner.py +++ b/vllm/v1/worker/tpu_model_runner.py @@ -27,13 +27,15 @@ from vllm.lora.layers import BaseLayerWithLoRA from vllm.model_executor.model_loader import get_model_loader from vllm.model_executor.model_loader.tpu import TPUModelLoader -from vllm.model_executor.models.interfaces_base import is_pooling_model +from vllm.model_executor.models.interfaces import supports_transcription +from vllm.model_executor.models.interfaces_base import ( + is_pooling_model, is_text_generation_model) from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.inputs import (BatchedTensorInputs, MultiModalKwargs, PlaceholderRange) from vllm.multimodal.utils import group_mm_inputs_by_modality -from vllm.pooling_params import PoolingTask from vllm.sequence import IntermediateTensors +from vllm.tasks import GenerationTask, PoolingTask, SupportedTask from vllm.utils import (LayerBlockType, cdiv, is_pin_memory_available, prev_power_of_2) from vllm.v1.attention.backends.pallas import (TPU_STR_DTYPE_TO_TORCH_DTYPE, 
@@ -489,6 +491,21 @@ def _update_states(self, scheduler_output: "SchedulerOutput") -> bool: def get_model(self) -> nn.Module: return self.model + def get_supported_generation_tasks(self) -> list[GenerationTask]: + model = self.get_model() + supported_tasks = list[GenerationTask]() + + if is_text_generation_model(model): + supported_tasks.append("generate") + + if supports_transcription(model): + if model.supports_transcription_only: + return ["transcription"] + + supported_tasks.append("transcription") + + return supported_tasks + def get_supported_pooling_tasks(self) -> list[PoolingTask]: model = self.get_model() if not is_pooling_model(model): @@ -496,6 +513,16 @@ def get_supported_pooling_tasks(self) -> list[PoolingTask]: return list(model.pooler.get_supported_tasks()) + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + tasks = list[SupportedTask]() + + if self.model_config.runner_type == "generate": + tasks.extend(self.get_supported_generation_tasks()) + if self.model_config.runner_type == "pooling": + tasks.extend(self.get_supported_pooling_tasks()) + + return tuple(tasks) + def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: """ Generates the KVCacheSpec by parsing the kv cache format from each diff --git a/vllm/v1/worker/tpu_worker.py b/vllm/v1/worker/tpu_worker.py index 254b058d2cd..72e0e4230a0 100644 --- a/vllm/v1/worker/tpu_worker.py +++ b/vllm/v1/worker/tpu_worker.py @@ -21,7 +21,7 @@ from vllm.lora.request import LoRARequest from vllm.model_executor import set_random_seed from vllm.platforms import current_platform -from vllm.pooling_params import PoolingTask +from vllm.tasks import SupportedTask from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, cdiv from vllm.v1.attention.backends.pallas import TPU_HEAD_SIZE_ALIGNMENT from vllm.v1.core.sched.output import SchedulerOutput @@ -282,8 +282,8 @@ def compile_or_warm_up_model(self) -> None: def get_model(self) -> nn.Module: return self.model_runner.get_model() - def get_supported_pooling_tasks(self) -> list[PoolingTask]: - return self.model_runner.get_supported_pooling_tasks() + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + return self.model_runner.get_supported_tasks() def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]: return self.model_runner.get_kv_cache_spec() diff --git a/vllm/worker/model_runner_base.py b/vllm/worker/model_runner_base.py index feca8a7a1e7..7b8fe2f802d 100644 --- a/vllm/worker/model_runner_base.py +++ b/vllm/worker/model_runner_base.py @@ -12,9 +12,11 @@ from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.model_executor.layers.sampler import SamplerOutput -from vllm.model_executor.models.interfaces_base import is_pooling_model -from vllm.pooling_params import PoolingTask +from vllm.model_executor.models.interfaces import supports_transcription +from vllm.model_executor.models.interfaces_base import ( + is_pooling_model, is_text_generation_model) from vllm.sequence import IntermediateTensors, SequenceGroupMetadata +from vllm.tasks import GenerationTask, PoolingTask, SupportedTask if TYPE_CHECKING: from vllm.attention import AttentionMetadata @@ -224,6 +226,21 @@ def prepare_model_input( def get_model(self) -> nn.Module: raise NotImplementedError + def get_supported_generation_tasks(self) -> list[GenerationTask]: + model = self.get_model() + supported_tasks = list[GenerationTask]() + + if is_text_generation_model(model): + supported_tasks.append("generate") + + if supports_transcription(model): + if model.supports_transcription_only: + return 
["transcription"] + + supported_tasks.append("transcription") + + return supported_tasks + def get_supported_pooling_tasks(self) -> list[PoolingTask]: model = self.get_model() if not is_pooling_model(model): @@ -231,6 +248,16 @@ def get_supported_pooling_tasks(self) -> list[PoolingTask]: return list(model.pooler.get_supported_tasks()) + def get_supported_tasks(self) -> tuple[SupportedTask, ...]: + tasks = list[SupportedTask]() + + if self.model_config.runner_type == "generate": + tasks.extend(self.get_supported_generation_tasks()) + if self.model_config.runner_type == "pooling": + tasks.extend(self.get_supported_pooling_tasks()) + + return tuple(tasks) + def execute_model( self, model_input: T, From 59f108c041de49ca87588de7065620b4e0475098 Mon Sep 17 00:00:00 2001 From: Mengqing Cao Date: Fri, 25 Jul 2025 20:53:07 +0800 Subject: [PATCH 360/552] [Bugfix][Logprobs] Fix logprobs op to support more backend (#21591) Signed-off-by: MengqingCao Signed-off-by: x22x22 --- vllm/v1/sample/ops/logprobs.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/vllm/v1/sample/ops/logprobs.py b/vllm/v1/sample/ops/logprobs.py index a4d65485140..82875b7c845 100644 --- a/vllm/v1/sample/ops/logprobs.py +++ b/vllm/v1/sample/ops/logprobs.py @@ -4,8 +4,10 @@ import torch +from vllm.platforms import current_platform -@torch.compile(dynamic=True) + +@torch.compile(dynamic=True, backend=current_platform.simple_compile_backend) def batched_count_greater_than(x: torch.Tensor, values: torch.Tensor) -> torch.Tensor: """ From b81f4146b7bb1c3ee63486a064578edd89d48507 Mon Sep 17 00:00:00 2001 From: xyxinyang <43821961+xyxinyang@users.noreply.github.com> Date: Fri, 25 Jul 2025 21:02:53 +0800 Subject: [PATCH 361/552] [Model] Fix Ernie4.5MoE e_score_correction_bias parameter (#21586) Signed-off-by: zhouchong Co-authored-by: zhouchong Signed-off-by: x22x22 --- vllm/model_executor/models/ernie45_moe.py | 25 +++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/vllm/model_executor/models/ernie45_moe.py b/vllm/model_executor/models/ernie45_moe.py index 984003e62d1..5824b0967e7 100644 --- a/vllm/model_executor/models/ernie45_moe.py +++ b/vllm/model_executor/models/ernie45_moe.py @@ -123,14 +123,19 @@ def __init__( quant_config=None, prefix=f"{prefix}.gate") - self.experts = FusedMoE(num_experts=config.moe_num_experts, - top_k=config.moe_k, - hidden_size=config.hidden_size, - intermediate_size=config.moe_intermediate_size, - reduce_results=False, - renormalize=True, - quant_config=quant_config, - prefix=f"{prefix}.experts") + self.gate.e_score_correction_bias = nn.Parameter( + torch.empty(config.moe_num_experts)) + + self.experts = FusedMoE( + num_experts=config.moe_num_experts, + top_k=config.moe_k, + hidden_size=config.hidden_size, + intermediate_size=config.moe_intermediate_size, + reduce_results=False, + renormalize=True, + quant_config=quant_config, + prefix=f"{prefix}.experts", + e_score_correction_bias=self.gate.e_score_correction_bias) if self.moe_num_shared_experts is not None: intermediate_size = (config.moe_intermediate_size * @@ -459,6 +464,10 @@ def load_weights(self, weights: Iterable[tuple[str, if "mtp" in name: continue + if "e_score_correction_bias" in name: + name = name.replace("moe_statics", "gate") + loaded_weight = loaded_weight.squeeze(0) + for (param_name, weight_name, shard_id) in stacked_params_mapping: # Skip non-stacked layers and experts (experts handled below). 
if weight_name not in name: From 1dce08944517f2b800e6f98380256ce08a2e1856 Mon Sep 17 00:00:00 2001 From: bigshanedogg Date: Fri, 25 Jul 2025 22:05:42 +0900 Subject: [PATCH 362/552] [MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B (#20931) Signed-off-by: bigshanedogg Signed-off-by: x22x22 --- docs/models/supported_models.md | 1 + examples/offline_inference/vision_language.py | 80 ++ .../vision_language_multi_image.py | 48 + .../multimodal/processing/test_common.py | 1 + tests/models/registry.py | 3 + .../models/hyperclovax_vision.py | 1231 +++++++++++++++++ vllm/model_executor/models/registry.py | 1 + 7 files changed, 1365 insertions(+) create mode 100644 vllm/model_executor/models/hyperclovax_vision.py diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 4dd4f8f4c22..a8d442a1ae7 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -365,6 +365,7 @@ th { | `Grok1ModelForCausalLM` | Grok1 | `hpcai-tech/grok-1`. | ✅︎ | ✅︎ | ✅︎ | | `HunYuanDenseV1ForCausalLM` | Hunyuan-7B-Instruct-0124 | `tencent/Hunyuan-7B-Instruct-0124` | ✅︎ | | ✅︎ | | `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | ✅︎ | | ✅︎ | +| `HCXVisionForCausalLM` | HyperCLOVAX-SEED-Vision-Instruct-3B | `naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B` | | | ✅︎ | | `InternLMForCausalLM` | InternLM | `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc. | ✅︎ | ✅︎ | ✅︎ | | `InternLM2ForCausalLM` | InternLM2 | `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc. | ✅︎ | ✅︎ | ✅︎ | | `InternLM3ForCausalLM` | InternLM3 | `internlm/internlm3-8b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | diff --git a/examples/offline_inference/vision_language.py b/examples/offline_inference/vision_language.py index e4811c02337..eb6b4108485 100644 --- a/examples/offline_inference/vision_language.py +++ b/examples/offline_inference/vision_language.py @@ -316,6 +316,85 @@ def run_h2ovl(questions: list[str], modality: str) -> ModelRequestData: ) +# naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B +def run_hyperclovax_seed_vision( + questions: list[str], modality: str +) -> ModelRequestData: + model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B" + tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) + + engine_args = EngineArgs( + model=model_name, + trust_remote_code=True, + max_model_len=8192 if modality == "image" else 16384, + limit_mm_per_prompt={modality: 1}, + ) + + messages = list() + for question in questions: + if modality == "image": + """ + ocr: List the words in the image in raster order. + Even if the word order feels unnatural for reading, + the model will handle it as long as it follows raster order. + e.g. "Naver, CLOVA, bigshane" + lens_keywords: List the entity names in the image. + e.g. "iPhone" + lens_local_keywords: List the entity names with quads in the image. + e.g. 
"[0.07, 0.21, 0.92, 0.90] iPhone" + """ + messages.append( + [ + { + "role": "user", + "content": [ + { + "type": "image", + "ocr": "", + "lens_keywords": "", + "lens_local_keywords": "", + }, + { + "type": "text", + "text": question, + }, + ], + } + ] + ) + elif modality == "video": + messages.append( + [ + { + "role": "user", + "content": [ + { + "type": "video", + }, + { + "type": "text", + "text": question, + }, + ], + } + ] + ) + else: + raise ValueError(f"Unsupported modality: {modality}") + + prompts = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + + return ModelRequestData( + engine_args=engine_args, + prompts=prompts, + stop_token_ids=None, + ) + + # Idefics3-8B-Llama3 def run_idefics3(questions: list[str], modality: str) -> ModelRequestData: assert modality == "image" @@ -1222,6 +1301,7 @@ def run_skyworkr1v(questions: list[str], modality: str) -> ModelRequestData: "glm4v": run_glm4v, "glm4_1v": run_glm4_1v, "h2ovl_chat": run_h2ovl, + "hyperclovax_seed_vision": run_hyperclovax_seed_vision, "idefics3": run_idefics3, "internvl_chat": run_internvl, "nemotron_vl": run_nemotron_vl, diff --git a/examples/offline_inference/vision_language_multi_image.py b/examples/offline_inference/vision_language_multi_image.py index eb4f3b6c8f4..2e14fc807e1 100644 --- a/examples/offline_inference/vision_language_multi_image.py +++ b/examples/offline_inference/vision_language_multi_image.py @@ -289,6 +289,53 @@ def load_internvl(question: str, image_urls: list[str]) -> ModelRequestData: ) +def load_hyperclovax_seed_vision( + question: str, image_urls: list[str] +) -> ModelRequestData: + model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B" + tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) + + engine_args = EngineArgs( + model=model_name, + trust_remote_code=True, + max_model_len=16384, + limit_mm_per_prompt={"image": len(image_urls)}, + ) + + message = {"role": "user", "content": list()} + for _image_url in image_urls: + message["content"].append( + { + "type": "image", + "image": _image_url, + "ocr": "", + "lens_keywords": "", + "lens_local_keywords": "", + } + ) + message["content"].append( + { + "type": "text", + "text": question, + } + ) + + prompt = tokenizer.apply_chat_template( + [ + message, + ], + tokenize=False, + add_generation_prompt=True, + ) + + return ModelRequestData( + engine_args=engine_args, + prompt=prompt, + stop_token_ids=None, + image_data=[fetch_image(url) for url in image_urls], + ) + + def load_llava(question: str, image_urls: list[str]) -> ModelRequestData: # NOTE: CAUTION! Original Llava models wasn't really trained on multi-image inputs, # it will generate poor response for multi-image inputs! 
@@ -900,6 +947,7 @@ def load_tarsier2(question: str, image_urls: list[str]) -> ModelRequestData: "h2ovl_chat": load_h2ovl, "idefics3": load_idefics3, "internvl_chat": load_internvl, + "hyperclovax_seed_vision": load_hyperclovax_seed_vision, "keye_vl": load_keye_vl, "kimi_vl": load_kimi_vl, "llava": load_llava, diff --git a/tests/models/multimodal/processing/test_common.py b/tests/models/multimodal/processing/test_common.py index fd584252317..c2e9a73fa82 100644 --- a/tests/models/multimodal/processing/test_common.py +++ b/tests/models/multimodal/processing/test_common.py @@ -278,6 +278,7 @@ def _test_processing_correctness_one( "HuggingFaceTB/SmolVLM2-2.2B-Instruct", "moonshotai/Kimi-VL-A3B-Instruct", "meta-llama/Llama-4-Scout-17B-16E-Instruct", + "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B", "llava-hf/llava-1.5-7b-hf", "llava-hf/llava-v1.6-mistral-7b-hf", "llava-hf/LLaVA-NeXT-Video-7B-hf", diff --git a/tests/models/registry.py b/tests/models/registry.py index 3b92462e58a..1800262ced6 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -201,6 +201,9 @@ def check_available_online( trust_remote_code=True), "HunYuanDenseV1ForCausalLM":_HfExamplesInfo("tencent/Hunyuan-7B-Instruct-0124", trust_remote_code=True), + "HCXVisionForCausalLM": _HfExamplesInfo( + "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B", + trust_remote_code=True), "InternLMForCausalLM": _HfExamplesInfo("internlm/internlm-chat-7b", trust_remote_code=True), "InternLM2ForCausalLM": _HfExamplesInfo("internlm/internlm2-chat-7b", diff --git a/vllm/model_executor/models/hyperclovax_vision.py b/vllm/model_executor/models/hyperclovax_vision.py new file mode 100644 index 00000000000..3e8e50b35c0 --- /dev/null +++ b/vllm/model_executor/models/hyperclovax_vision.py @@ -0,0 +1,1231 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# copied from : https://github.com/huggingface/transformers +import ast +import sys +from collections import defaultdict +from collections.abc import Iterable, Mapping, Sequence +from functools import partial +from itertools import chain +from typing import Any, Literal, Optional, TypedDict, Union + +import numpy as np +import PIL +from einops import rearrange +from PIL import Image + +if sys.version_info >= (3, 11): + import typing + Unpack = typing.Unpack +else: + import typing_extensions + Unpack = typing_extensions.Unpack + +import torch +import torch.nn as nn +from timm.layers import LayerNorm, LayerNorm2d +from timm.models.regnet import RegStage +from transformers import (AutoProcessor, BatchFeature, CLIPVisionConfig, + SiglipVisionConfig) +from transformers.modeling_utils import no_init_weights + +from vllm.config import VllmConfig +from vllm.inputs import InputProcessingContext +from vllm.model_executor.layers.quantization import QuantizationConfig +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.multimodal import MULTIMODAL_REGISTRY +from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalFieldConfig, + MultiModalKwargs) +from vllm.multimodal.parse import ImageSize, MultiModalDataItems +from vllm.multimodal.processing import (BaseMultiModalProcessor, + BaseProcessingInfo, ProcessingCache, + PromptReplacement, PromptUpdate) +from vllm.multimodal.profiling import BaseDummyInputsBuilder +from vllm.sequence import IntermediateTensors + +from .clip import CLIPVisionModel +from .interfaces import MultiModalEmbeddings, SupportsMultiModal, SupportsPP +from .siglip 
import SiglipVisionModel +from .utils import AutoWeightsLoader, init_vllm_registered_model, maybe_prefix +from .vision import get_vision_encoder_info + +EOT = "<|endofturn|>" +IMAGE_TOKEN: str = "<|dummy3|>" +VIDEO_TOKEN: str = "<|_unuse_missing_100270|>" + + +class HCXVisionMultimodalPixelInputs(TypedDict): + type: Literal["pixel_values"] + pixel_values_images: list[torch.Tensor] + """ + Shape: `[(num_grids, num_channels, height, width), ...]` if anyres + + Note that `height` or `width` may be different per batch and image, + in which case the data is passed as a list instead of a batched tensor. + """ + image_sizes_images: list[tuple[Union[int, float]]] + """ + Shape: `[(height, width), ...]` + """ + vision_query_lengths_images: list[Union[int, float]] + pixel_values_videos: list[tuple[Union[int, float]]] + """ + Shape: `[(num_grids, num_channels, height, width), ...]` if anyres + """ + vision_query_lengths_videos: list[Union[int, float]] + + +HCXVisionMultimodalInputs = Union[HCXVisionMultimodalPixelInputs] + + +class HCXVisionProcessingInfo(BaseProcessingInfo): + + def get_hf_config(self): + return self.ctx.get_hf_config() + + def get_vision_encoder_info(self): + return get_vision_encoder_info(self.get_hf_config()) + + def get_hf_processor( + self, + **kwargs: object, + ): + processor_cls = type( + AutoProcessor.from_pretrained( + self.ctx.model_config.model, + trust_remote_code=self.ctx.model_config.trust_remote_code, + )) + return self.ctx.get_hf_processor( + processor_cls, + **kwargs, + ) + + def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]: + return {"image": None, "video": None} + + def get_num_image_tokens( + self, + *, + vision_query_length: Union[int, list[int]], + ) -> int: + if isinstance(vision_query_length, int): + return vision_query_length + else: + return sum(vision_query_length) + + def get_num_video_tokens( + self, + *, + vision_query_length: Union[int, list[int]], + ) -> int: + if isinstance(vision_query_length, int): + return vision_query_length + else: + return sum(vision_query_length) + + def get_image_size_with_most_features(self) -> ImageSize: + vision_encoder_info = self.get_vision_encoder_info() + width = height = vision_encoder_info.get_image_size() + return ImageSize(width=width, height=height) + + def get_max_image_tokens(self) -> int: + target_width, target_height = self.get_image_size_with_most_features() + + return self.get_num_image_tokens( + image_width=target_width, + image_height=target_height, + ) + + +class HCXVisionDummyInputsBuilder( + BaseDummyInputsBuilder[HCXVisionProcessingInfo]): + + def get_dummy_text( + self, + mm_counts: Mapping[str, int], + ) -> str: + dummy_text = IMAGE_TOKEN * mm_counts.get( + "image", 0) + VIDEO_TOKEN * mm_counts.get("video", 0) + return dummy_text + + def get_dummy_mm_data( + self, + seq_len: int, + mm_counts: Mapping[str, int], + ) -> MultiModalDataDict: + num_images = mm_counts.get("image", 0) + num_videos = mm_counts.get("video", 0) + + target_width, target_height = \ + self.info.get_image_size_with_most_features() + target_num_frames = 32 + return { + "image": + self._get_dummy_images( + width=target_width, + height=target_height, + num_images=num_images, + ), + "video": + self._get_dummy_videos( + width=target_width - 1, + height=target_height - 1, + num_frames=target_num_frames, + num_videos=num_videos, + ) + } + + +class HCXVisionMultiModalProcessor( + BaseMultiModalProcessor[HCXVisionProcessingInfo]): + + def _call_hf_processor( + self, + prompt: str, + mm_data: Mapping[str, object], + 
mm_kwargs: Mapping[str, object], + tok_kwargs: Mapping[str, object], + ) -> BatchFeature: + + def replace_multimodal_token( + token_ids: torch.Tensor, + target_token: int, + repeats: list, + ): + output = list() + _repeats_idx = 0 + for token_id in token_ids: + if token_id == target_token: + output += [ + token_id.item(), + ] * repeats[_repeats_idx] + _repeats_idx += 1 + else: + output += [ + token_id.item(), + ] + return torch.tensor(output, device=token_ids.device) + + for video_idx, video_arr in enumerate(mm_data.get("videos", list())): + if video_arr.dtype == np.uint8: + continue + mm_data["videos"][video_idx] = video_arr.astype(np.uint8) + + processed_outputs = self.info.ctx.call_hf_processor( + hf_processor=self.info.get_hf_processor(**mm_kwargs), + data=dict( + text=prompt, + images=None, + videos=None, + ), + ) # text-only + + if len(mm_data) > 0: + # batchify input as a single item + images = mm_data.get("images", None) + num_images = 0 + if images is not None: + num_images = len(images) + images = [ + images, + ] # batchify + + videos = mm_data.get("videos", + None) # list of video in single conversation + num_videos = 0 + if videos is not None: + num_videos = len(videos) + videos = [ + videos, + ] # batchify + + _processed_outputs = self.info.ctx.call_hf_processor( + hf_processor=self.info.get_hf_processor(**mm_kwargs), + data=dict( + text=None, + images=images, + videos=videos, + ), + ) # mm-only + + for k, v in _processed_outputs.items(): + if len(v) < 1: + continue + elif k.endswith("_images"): + # list of list of 4D tensor -> list of 4D tensor + _processed_outputs[k] = v[0] + elif k.endswith("_videos"): + # list of list of 4D tensor -> list of 4D tensor + v = v[0] + if k == "pixel_values_videos": + v = torch.cat(v, dim=0) + _c, _w, _h = v.shape[-3:] + v = v.reshape(num_videos, -1, _c, _w, _h) + v = list(torch.unbind(v, dim=0)) + _processed_outputs[k] = v + + if num_images > 0: + tokenizer = self.info.get_tokenizer() + processed_outputs["input_ids"] = torch.stack([ + replace_multimodal_token( + token_ids=_input_ids, + target_token=tokenizer.convert_tokens_to_ids( + IMAGE_TOKEN), + repeats=_processed_outputs[ + "vision_query_lengths_images"], + ) for _input_ids in processed_outputs["input_ids"] + ], + dim=0) + + if num_videos > 0: + tokenizer = self.info.get_tokenizer() + processed_outputs["input_ids"] = torch.stack([ + replace_multimodal_token( + token_ids=_input_ids, + target_token=tokenizer.convert_tokens_to_ids( + VIDEO_TOKEN), + repeats=_processed_outputs[ + "vision_query_lengths_videos"], + ) for _input_ids in processed_outputs["input_ids"] + ], + dim=0) + + _ratios = [ + len(_pixel_values) for _pixel_values in + _processed_outputs["pixel_values_videos"] + ] + _num_per_videos = [ + int(_e / sum(_ratios) * + len(_processed_outputs["vision_query_lengths_videos"])) + for _e in _ratios + ] + _processed_outputs["vision_query_lengths_videos"] = [ + _processed_outputs["vision_query_lengths_videos"] + [sum(_num_per_videos[:_i]):sum(_num_per_videos[:_i + 1])] + for _i in range(0, num_videos) + ] + + processed_outputs.update(_processed_outputs) + + return processed_outputs + + def _get_prompt_updates( + self, + mm_items: MultiModalDataItems, + hf_processor_mm_kwargs: Mapping[str, object], + out_mm_kwargs: MultiModalKwargs, + ) -> Sequence[PromptUpdate]: + hf_config = self.info.get_hf_config() + placeholder = { + "image": hf_config.image_token_id, + "video": hf_config.video_token_id, + } + + def get_replacement_hyperclovax( + item_idx: int, + modality: str, + out_mm_kwargs: 
MultiModalKwargs, + ): + num_tokens = None + if modality == "image": + num_tokens = self.info.get_num_image_tokens( + vision_query_length=out_mm_kwargs[ + "vision_query_lengths_images"][item_idx], ) + if modality == "video": + num_tokens = self.info.get_num_video_tokens( + vision_query_length=out_mm_kwargs[ + "vision_query_lengths_videos"][item_idx], ) + assert isinstance(num_tokens, int) + return [ + placeholder[modality], + ] * num_tokens + + return [ + PromptReplacement( + modality=modality, + target=[ + placeholder[modality], + ], + replacement=partial( + get_replacement_hyperclovax, + modality=modality, + out_mm_kwargs=out_mm_kwargs, + ), + ) for modality in ("image", "video") + ] + + def _get_mm_fields_config( + self, + hf_inputs: BatchFeature, + hf_processor_mm_kwargs: Mapping[str, object], + ) -> Mapping[str, MultiModalFieldConfig]: + return dict( + # image + pixel_values_images=MultiModalFieldConfig.batched("image"), + image_sizes_images=MultiModalFieldConfig.batched("image"), + vision_query_lengths_images=MultiModalFieldConfig.batched("image"), + num_queries_vis_abstractors_images=MultiModalFieldConfig.batched( + "image"), + num_queries_vis_abstractors_slow_images=MultiModalFieldConfig. + batched("image"), + first_last_frames_slows_images=MultiModalFieldConfig.batched( + "image"), + # video + pixel_values_videos=MultiModalFieldConfig.batched("video"), + image_sizes_videos=MultiModalFieldConfig.batched("video"), + vision_query_lengths_videos=MultiModalFieldConfig.batched("video"), + num_queries_vis_abstractors_videos=MultiModalFieldConfig.batched( + "video"), + num_queries_vis_abstractors_slow_videos=MultiModalFieldConfig. + batched("video"), + first_last_frames_slows_videos=MultiModalFieldConfig.batched( + "video"), + ) + + +def _build_hcxvision_hf_info( + ctx: InputProcessingContext, ) -> HCXVisionProcessingInfo: + return HCXVisionProcessingInfo(ctx) + + +def _build_hcxvision_hf_processor( + info: HCXVisionProcessingInfo, + dummy_inputs: BaseDummyInputsBuilder[HCXVisionProcessingInfo], + *, + cache: Optional[ProcessingCache] = None, +) -> BaseMultiModalProcessor: + if isinstance(info, HCXVisionProcessingInfo): + return HCXVisionMultiModalProcessor( + info, + dummy_inputs, # type: ignore + cache=cache, + ) + + raise NotImplementedError(type(info)) + + +def init_vision_tower_for_hcxvision( + vision_config, + quant_config: Optional[QuantizationConfig], + *, + use_nth_layer: Optional[int] = None, + require_post_norm: Optional[bool] = None, + prefix: str = "", +) -> Union[CLIPVisionModel, SiglipVisionModel]: + num_hidden_layers = vision_config.num_hidden_layers + if not isinstance(use_nth_layer, int): + pass + elif use_nth_layer >= 0: + num_hidden_layers = use_nth_layer + 1 + else: + num_hidden_layers = num_hidden_layers + use_nth_layer + 1 + + if isinstance(vision_config, CLIPVisionConfig): + return CLIPVisionModel( + vision_config, + quant_config=quant_config, + num_hidden_layers_override=num_hidden_layers, + require_post_norm=require_post_norm, + prefix=prefix, + ) + elif isinstance(vision_config, SiglipVisionConfig): + return SiglipVisionModel( + vision_config, + quant_config=quant_config, + num_hidden_layers_override=num_hidden_layers, + require_post_norm=require_post_norm, + prefix=prefix, + ) + + msg = f"Unsupported vision config: {type(vision_config)}" + raise NotImplementedError(msg) + + +class HCXVisionMlp(nn.Module): + + def __init__( + self, + mm_projector_type, + in_features, + hidden_features=None, + out_features=None, + act_layer=nn.GELU, + ): + super().__init__() + 
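# Default the hidden and output widths to the input width when not given.
+        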
out_features = out_features or in_features + hidden_features = hidden_features or in_features + self.mm_projector_type = mm_projector_type + if self.mm_projector_type == "mlp": + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = act_layer() + self.fc2 = nn.Linear(hidden_features, out_features) + elif self.mm_projector_type == "inverted_mlp": + self.fc1 = nn.Linear(in_features, 2 * hidden_features) + self.act = act_layer() + self.fc2 = nn.Linear(2 * hidden_features, out_features) + else: + raise NotImplementedError("{} is not implemented".format( + self.mm_projector_type)) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.fc2(x) + return x + + +class HCXVisionCAbstractor(nn.Module): + """ + This module is based on C-Abstractor, whose license is under apache-2.0. + You can check the original code at + https://github.com/khanrc/honeybee/blob/main/honeybee/projectors/projectors.py + and we made necessary modifications. + """ + + def __init__( + self, + num_queries: int, + num_input_tokens: int, + encoder_hidden_size: int, + hidden_size: int, + output_hidden_size: int, + pos_emb: bool = True, + prenorm: bool = False, + ): + super().__init__() + self.num_input_tokens = num_input_tokens + self.output_hidden_size = output_hidden_size + + # Positional embedding + if pos_emb: + self.pos_emb = torch.nn.Parameter( + torch.zeros(1, num_input_tokens, encoder_hidden_size)) + self.pos_emb.data.normal_(mean=0.0, std=0.02) + else: + self.pos_emb = None + + # (Optional) Pre-normalization layer + if prenorm: + self.prenorm = LayerNorm(encoder_hidden_size) + else: + self.prenorm = None + + self.build_net(num_queries, encoder_hidden_size, hidden_size, + output_hidden_size) + self.dtype = next(self.parameters()).dtype + + def forward( + self, + x: torch.Tensor, + num_queries_vis_abstractors: Optional[list[list[int]]] = None, + num_grids: Optional[list[int]] = None, + ) -> torch.Tensor: + if self.prenorm is not None: + x = self.prenorm(x) + + if self.pos_emb is not None: + x = x + self.pos_emb + + x = self._forward( + x, + num_queries_vis_abstractors=num_queries_vis_abstractors, + num_grids=num_grids, + ) # (B, L, output_hidden_size) + + return x + + def _forward( + self, + x: torch.Tensor, + num_queries_vis_abstractors: Optional[list[list[int]]] = None, + num_grids: Optional[list[int]] = None, + ) -> torch.Tensor: + # x: [B, L, dim] + B, L, dim = x.shape + hw = int(L**0.5) + x = rearrange(x, "b (h w) d -> b d h w", h=hw, w=hw) + + if num_queries_vis_abstractors is not None: + assert num_grids is not None + return self._forward_adaptive_num_query( + x, num_queries_vis_abstractors, num_grids) + + x = self.net(x) + x = rearrange(x, "b d h w -> b (h w) d") + x = self.readout(x) + return x + + def _forward_adaptive_num_query( + self, + x: torch.Tensor, + num_queries_vis_abstractors: Optional[list[list[int]]] = None, + num_grids: Optional[list[int]] = None, + ) -> list[torch.Tensor]: + # self.net is consisted by 3 layers (s1, sampler, s2) + assert len(self.net) == 3 + + x = self.net[0](x) # s1 + new_x = [] + for i, num_queries in enumerate(num_queries_vis_abstractors): + hw = int(num_queries**0.5) + sampler = nn.AdaptiveAvgPool2d((hw, hw)) + out = sampler(x[num_grids[i]:num_grids[i + 1], :]) + out = self.net[2](out) # s2 + + out = rearrange(out, "b d h w -> b (h w) d") + out = self.readout(out) + + new_x.append(out) + return new_x + + def build_net( + self, + n_queries: int, + encoder_hidden_size: int, + hidden_size: int, + output_hidden_size: int, + depth: int = 3, + 
mlp_depth: int = 2, + ): + assert (n_queries**0.5).is_integer( + ), f"n_queries must be square number. n_queries: {n_queries}" + hw = int(n_queries**0.5) + + # RegBlock = ResBlock + SE + RegBlock = partial( + RegStage, + stride=1, + dilation=1, + act_layer=nn.SiLU, + norm_layer=LayerNorm2d, + ) + + s1 = RegBlock( + depth, + encoder_hidden_size, + hidden_size, + ) + sampler = nn.AdaptiveAvgPool2d((hw, hw)) + s2 = RegBlock( + depth, + hidden_size, + hidden_size, + ) + + self.net = nn.Sequential(s1, sampler, s2) + self.readout = self.build_mlp(mlp_depth, hidden_size, + output_hidden_size) + + def build_mlp( + self, + depth: int, + hidden_size: int, + output_hidden_size: int, + ): + layers = [nn.Linear(hidden_size, output_hidden_size)] + for _ in range(1, depth): + layers.append(nn.SiLU()) + layers.append(nn.Linear(output_hidden_size, output_hidden_size)) + return nn.Sequential(*layers) + + +@MULTIMODAL_REGISTRY.register_processor( + _build_hcxvision_hf_processor, + info=_build_hcxvision_hf_info, + dummy_inputs=HCXVisionDummyInputsBuilder) +class HCXVisionForCausalLM(nn.Module, SupportsMultiModal, SupportsPP): + + packed_modules_mapping = { + "qkv_proj": ["q_proj", "k_proj", "v_proj"], + "gate_up_proj": ["gate_proj", "up_proj"] + } + + def __init__( + self, + *, + vllm_config: VllmConfig, + prefix: str = "", + **kwargs: Optional[Any], + ) -> None: + super().__init__() + + # init configs + config = vllm_config.model_config.hf_config + quant_config = vllm_config.quant_config + # text_config + text_config = config.text_config + if text_config.model_type in ["gpt2", "hyperclovax", "llama"]: + text_config._attn_implementation = "sdpa" + if text_config.model_type != "hyperclovax": + text_config.logits_scaling = 1.0 + # vision_config + vision_config = config.vision_config + vision_config.auto_map = {} + vision_config.anyres = config.anyres + vision_config.max_num_grids = config.max_num_grids + self.dtype = vllm_config.model_config.dtype + + ## possible_resolution should be matched with preprocessor_config.json + config.possible_resolutions = self._init_possible_resolutions( + config, vision_config) + + # init models & parameters + with no_init_weights(): # weight will be loaded in from_pretrained + self.vision_model = init_vision_tower_for_hcxvision( + vision_config, + quant_config, + use_nth_layer=getattr(config, "use_nth_layer", -1), + require_post_norm=False, + prefix=maybe_prefix(prefix, "vision_model"), + ) + self.mm_projector = self._init_mm_projector(config, text_config, + vision_config) + + self.lm_head_vocab_size = getattr(text_config, "padded_vocab_size", + text_config.vocab_size) + self.language_model = init_vllm_registered_model( + vllm_config=vllm_config, + hf_config=text_config, + prefix=maybe_prefix(prefix, "language_model"), + ) + + if config.anyres: + self.image_newline = nn.Parameter( + torch.empty(text_config.hidden_size, dtype=self.dtype)) + + self.config = config + self.vision_config = vision_config + self.text_config = text_config + + # use_sum_loss = bool(kwargs.pop("use_sum_loss", False)) + # self.reduction = self._init_reduction_type(use_sum_loss) + + @classmethod + def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]: + if modality.startswith("image"): + return IMAGE_TOKEN + if modality.startswith("video"): + return VIDEO_TOKEN + + raise ValueError("Only image or video modality is supported") + + def get_language_model(self) -> torch.nn.Module: + return self.language_model + + def get_multimodal_embeddings( + self, + **kwargs: 
Unpack[HCXVisionMultimodalInputs], + ) -> Optional[MultiModalEmbeddings]: + + multimodal_embeddings = list() + if kwargs.get("pixel_values_images") is not None: + for _pixel_values_images, _image_sizes_images in zip( + kwargs["pixel_values_images"], + kwargs["image_sizes_images"]): + _pixel_values_images = _pixel_values_images.unsqueeze(dim=0) + _image_sizes_images = _image_sizes_images.unsqueeze(dim=0) + _len_pixel_values_images = [ + len(pixel_value) for pixel_value in _pixel_values_images + ] + if isinstance(_image_sizes_images, torch.Tensor): + _image_sizes_images = _image_sizes_images.detach().cpu( + ).tolist() + _multimodal_embeddings_images = self.forward_images( + pixel_values_images=_pixel_values_images, + image_sizes_images=_image_sizes_images, + len_pixel_values_images=_len_pixel_values_images, + ) + _multimodal_embeddings_images = torch.cat( + _multimodal_embeddings_images, dim=0) + multimodal_embeddings.append(_multimodal_embeddings_images) + + if kwargs.get("pixel_values_videos") is not None: + for _pixel_values_videos, _vision_query_lengths_videos in zip( + kwargs["pixel_values_videos"], + kwargs["vision_query_lengths_videos"]): + _len_pixel_values_videos = [ + len(_vision_query_lengths) + for _vision_query_lengths in _vision_query_lengths_videos + ] + _c, _w, _h = _pixel_values_videos.shape[-3:] + _pixel_values_videos = _pixel_values_videos.reshape( + sum(_len_pixel_values_videos), -1, _c, _w, + _h).unsqueeze(dim=0) + _multimodal_embeddings_videos = self.forward_videos( + pixel_values_videos=_pixel_values_videos, + len_pixel_values_videos=_len_pixel_values_videos, + ) + _multimodal_embeddings_videos = torch.cat( + _multimodal_embeddings_videos, dim=0) + multimodal_embeddings.append(_multimodal_embeddings_videos) + return multimodal_embeddings + + def get_input_embeddings( + self, + input_ids: torch.Tensor, + multimodal_embeddings: Optional[MultiModalEmbeddings] = None, + **kwargs, + ) -> torch.Tensor: + inputs_embeds = self.language_model.get_input_embeddings(input_ids) + if (kwargs.get("pixel_values_images") is not None + or kwargs.get("pixel_values_videos") + is not None): # v0 compatibility + multimodal_embeddings = self.get_multimodal_embeddings(**kwargs) + if multimodal_embeddings is not None: + multimodal_embeddings = torch.cat(multimodal_embeddings, dim=0) + _mask_image = input_ids == self.config.image_token_id + _mask_video = input_ids == self.config.video_token_id + assert _mask_image.sum() + _mask_video.sum() == len( + multimodal_embeddings) + + if multimodal_embeddings.dtype != inputs_embeds.dtype: + multimodal_embeddings = multimodal_embeddings.to( + dtype=inputs_embeds.dtype) + if multimodal_embeddings.device != inputs_embeds.device: + multimodal_embeddings = multimodal_embeddings.to( + device=inputs_embeds.device) + + if _mask_image.sum() > 0: + inputs_embeds[ + _mask_image] = multimodal_embeddings[:sum(_mask_image)] + if _mask_video.sum() > 0: + inputs_embeds[_mask_video] = multimodal_embeddings[ + -sum(_mask_video):] + return inputs_embeds + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + **kwargs: object, + ) -> Union[torch.Tensor, IntermediateTensors]: + if intermediate_tensors is not None: + inputs_embeds = None + + # NOTE: In v1, inputs_embeds is always generated at model runner, this + # condition is for v0 compatibility. 
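+        # In that case, build them here from input_ids and any multimodal kwargs.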
+ elif inputs_embeds is None: + inputs_embeds = self.get_input_embeddings(input_ids=input_ids, + **kwargs) + input_ids = None + hidden_states = self.language_model.model(input_ids, + positions, + intermediate_tensors, + inputs_embeds=inputs_embeds) + return hidden_states + + def forward_images( + self, + pixel_values_images: list[list[torch.FloatTensor]], + image_sizes_images: list[list[tuple[int, int]]], + len_pixel_values_images: list[int], + ) -> list[list[torch.Tensor]]: + if sum(len_pixel_values_images) == 0: + return None + + concat_pixel_values_images = torch.cat(list( + chain(*pixel_values_images)), + dim=0) + + visual_token_idx = 0 if "siglip" in self.vision_config.model_type else 1 + image_forward_outs = self.vision_model( + concat_pixel_values_images)[:, visual_token_idx:] + + image_forward_outs = image_forward_outs.to( + dtype=self.mm_projector.dtype) + image_forward_outs = self.mm_projector(image_forward_outs) # b (h w) d + + split_sizes = [ + pixel_value.shape[0] for pixel_value in chain(*pixel_values_images) + ] + image_forward_outs = torch.split(image_forward_outs, + split_sizes, + dim=0) + + # newline for anyres postprocessing + image_features = anyres_postprocessing( + image_forward_outs=image_forward_outs, + image_sizes=[ + image_size for image_sizes in image_sizes_images + for image_size in image_sizes + ], + num_queries_vis_abstractor=self.config. + num_queries_vis_abstractor_image, + unpad=self.config.unpad, + patch_size=self.vision_config.patch_size, + grid_size=self.vision_config.image_size, + image_newline=self.image_newline, + possible_resolutions=self.config.possible_resolutions, + ) + return image_features + + def forward_videos( + self, + pixel_values_videos: list[list[torch.FloatTensor]], + len_pixel_values_videos: list[int], + ) -> list[torch.Tensor]: + + len_video_grids = sum(len_pixel_values_videos) + if len_video_grids == 0: + return None + + # Run Vision Model + concat_pixel_values_videos = torch.cat(list( + chain(*pixel_values_videos)), + dim=0) + + visual_token_idx = 0 if "siglip" in self.vision_config.model_type else 1 + video_forward_outs = self.vision_model( + concat_pixel_values_videos)[:, visual_token_idx:] + + video_forward_outs = video_forward_outs.to( + dtype=self.mm_projector.dtype) + + # Run MM-Projector + # len(num_grids) == len(num_queries_vis_abstractors) + 1 + grid_idx = 0 + num_grids = [ + grid_idx + ] # e.g. [0, 9, 18, 19, 27, 28, 36, 37, 45, 46, 54, 55, 56] + num_queries_vis_abstractors = [ + ] # e.g. 
[81, 81, 81, 9, 81, 9, 81, 9, 81, 9, 81, 9] + len_total_frames = video_forward_outs.shape[0] + + if self.config.first_last_frames_slow: + # slowfast (first_last_frames_slow) + assert len_total_frames != 0 + if len_total_frames <= 2: + num_queries_vis_abstractors.append( + self.config.num_queries_vis_abstractor_video_slow) + grid_idx += len_total_frames + num_grids.append(grid_idx) + else: + num_queries_vis_abstractors.append( + self.config.num_queries_vis_abstractor_video_slow) + grid_idx += 1 + num_grids.append(grid_idx) + + num_queries_vis_abstractors.append( + self.config.num_queries_vis_abstractor_video_fast) + grid_idx += len_total_frames - 2 + num_grids.append(grid_idx) + + num_queries_vis_abstractors.append( + self.config.num_queries_vis_abstractor_video_slow) + grid_idx += 1 + num_grids.append(grid_idx) + else: + # slowfast + for pixel_values_frames in pixel_values_videos: + for pixel_values_frame in pixel_values_frames: + if len(pixel_values_frame) > 0: + num_queries_vis_abstractors.append( + self.config.num_queries_vis_abstractor_video_slow) + grid_idx += 1 + num_grids.append(grid_idx) + num_queries_vis_abstractors.append( + self.config.num_queries_vis_abstractor_video_fast) + grid_idx = grid_idx + len(pixel_values_frame) - 1 + num_grids.append(grid_idx) + + video_forward_outs = self.mm_projector(video_forward_outs, + num_queries_vis_abstractors, + num_grids) + + video_features = [] # what we want to return + target_features = [] + target_group_size = 0 + group_counter = 0 + video_groups = [ + len(frame) for frames in pixel_values_videos for frame in frames + ] # for concat video features after projector + + for forward_out in video_forward_outs: + target_group_size += len(forward_out) + target_features.append(forward_out.flatten(0, 1)) + + video_group_size = video_groups[group_counter] + if video_group_size == target_group_size: + video_features.append(torch.cat(target_features, dim=0)) + target_features = [] + group_counter += 1 + target_group_size = 0 + + elif video_group_size < target_group_size: + raise RuntimeError(f"video_group_size < target_group_size!! \ + [{video_group_size} < {target_group_size}]") + + assert len(target_features + ) == 0, f"target_features is not empty!! 
{target_features}" + assert len(video_groups) == len(video_features) + + return video_features + + def _prepare_multimodal_kwargs(self, **kwargs: object): + output = defaultdict(list) + for k, v in kwargs.items(): + if len(v) < 1 or len(v[0]) < 1: + continue # if empty batch of empty sample + + new_k, is_video = k, False + if (not k.endswith("_images") and not k.endswith("_videos")): + pass + else: + new_k, is_video = k.split("_")[:-1], k.split("_")[-1] + new_k = "_".join(new_k) + is_video = is_video == "videos" + + for _sample_idx, _v in enumerate(v): # batch -> sample + if new_k not in ["pixel_values"]: + if len(output[new_k]) < _sample_idx + 1: + output[new_k].append(list()) + _v = _v.detach().cpu().numpy().tolist() + output[new_k][_sample_idx] += _v + elif isinstance(_v, torch.Tensor): + if len(output[new_k]) < _sample_idx + 1: + output[new_k].append(list()) + output["is_videos"].append(list()) + _v = list(torch.unbind(_v, dim=0)) + output[new_k][_sample_idx] += _v + output["is_videos"][_sample_idx] += [ + is_video, + ] * len(_v) + return dict(output) + + def compute_logits( + self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata, + ) -> Optional[torch.Tensor]: + return self.language_model.compute_logits(hidden_states, + sampling_metadata) + + def load_weights( + self, + weights: Iterable[tuple[str, torch.Tensor]], + ) -> set[str]: + loader = AutoWeightsLoader(self) + return loader.load_weights(weights) + + def _init_possible_resolutions( + self, + config, + vision_config, + ): + if not getattr(config, "possible_resolutions", []): + possible_resolutions = [] + if config.anyres: + assert config.max_num_grids > 0 + for i in range(1, config.max_num_grids + 1): + for j in range(1, config.max_num_grids + 1): + if i == 1 and j == 1 and not config.use_1x1_grid: + continue + if i * j <= config.max_num_grids: + possible_resolutions.append([i, j]) + + possible_resolutions = [[ + ys * vision_config.image_size, + xs * vision_config.image_size + ] for ys, xs in possible_resolutions] + return possible_resolutions + else: + return config.possible_resolutions + + def _init_mm_projector( + self, + config, + text_config, + vision_config, + ): + input_hidden_size = vision_config.hidden_size + if config.mm_projector_type == "linear": + mm_projector = nn.Linear(input_hidden_size, + text_config.hidden_size) + mm_projector.dtype = next(mm_projector.parameters()).dtype + elif config.mm_projector_type == "cabstractor": + mm_projector = HCXVisionCAbstractor( + num_queries=config.num_queries_vis_abstractor_image, + num_input_tokens=(vision_config.image_size // + vision_config.patch_size)**2, + encoder_hidden_size=input_hidden_size, + hidden_size=input_hidden_size, + output_hidden_size=text_config.hidden_size, + pos_emb=config.proj_pos_emb, + prenorm=config.proj_prenorm, + ) + else: + mm_projector = HCXVisionMlp( + config.mm_projector_type, + input_hidden_size, + hidden_features=input_hidden_size, + out_features=self.text_config.hidden_size, + ) + return mm_projector + + +def unpad_image(tensor: torch.Tensor, + original_size: tuple[int, int]) -> torch.Tensor: + original_width, original_height = original_size + current_height, current_width = tensor.shape[1:] + + original_aspect_ratio = original_width / original_height + current_aspect_ratio = current_width / current_height + + if original_aspect_ratio > current_aspect_ratio: + scale_factor = current_width / original_width + new_height = int(original_height * scale_factor) + padding = (current_height - new_height) // 2 + unpadded_tensor = 
tensor[:, padding:current_height - padding, :] + else: + scale_factor = current_height / original_height + new_width = int(original_width * scale_factor) + padding = (current_width - new_width) // 2 + unpadded_tensor = tensor[:, :, padding:current_width - padding] + + return unpadded_tensor + + +def select_best_resolution(original_size: tuple, + possible_resolutions: list) -> tuple: + original_height, original_width = original_size + best_fit = None + max_effective_resolution = 0 + min_wasted_resolution = float("inf") + + for height, width in possible_resolutions: + scale = min(width / original_width, height / original_height) + downscaled_width, downscaled_height = int(original_width * scale), int( + original_height * scale) + effective_resolution = min(downscaled_width * downscaled_height, + original_width * original_height) + wasted_resolution = (width * height) - effective_resolution + + if effective_resolution > max_effective_resolution or ( + effective_resolution == max_effective_resolution + and wasted_resolution < min_wasted_resolution): + max_effective_resolution = effective_resolution + min_wasted_resolution = wasted_resolution + best_fit = (height, width) + + return best_fit + + +def get_anyres_image_grid_shape( + image_size: tuple[int, int], + grid_pinpoints: Union[str, list[tuple[int, int]]], + patch_size: int, +) -> tuple[int, int]: + possible_resolutions = grid_pinpoints if isinstance( + grid_pinpoints, list) else ast.literal_eval(grid_pinpoints) + + original_width, original_height = image_size + height, width = select_best_resolution((original_height, original_width), + possible_resolutions) + return width // patch_size, height // patch_size + + +def reshape_and_unpad_image_features( + image_feature: torch.Tensor, + height: int, + width: int, + image_size: tuple[int, int], + possible_resolutions: list[tuple[int, int]], + grid_size: int, + unpad: bool, + image_newline: torch.Tensor, +) -> torch.Tensor: + base_image_feature = image_feature[0] + image_feature = image_feature[1:] + + assert (height * width == base_image_feature.shape[0] + ), f"height: {height}, width: {width}, \ + base_image_feature.shape[0]: {base_image_feature.shape[0]}" + + num_patch_width, num_patch_height = get_anyres_image_grid_shape( + image_size, possible_resolutions, grid_size) + image_feature = image_feature.view(num_patch_height, num_patch_width, + height, width, -1) + + if unpad: + image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous() + image_feature = image_feature.flatten(1, 2).flatten(2, 3) + image_feature = unpad_image(image_feature, image_size) + image_feature = torch.cat( + ( + image_feature, + image_newline[:, None, None].expand( + *image_feature.shape[:-1], 1).to(image_feature.device), + ), + dim=-1, + ) + image_feature = image_feature.flatten(1, 2).transpose(0, 1) + else: + image_feature = image_feature.permute(0, 2, 1, 3, 4).contiguous() + image_feature = image_feature.flatten(0, 3) + image_feature = torch.cat((base_image_feature, image_feature), dim=0) + + return image_feature + + +def anyres_postprocessing( + image_forward_outs: list[torch.FloatTensor], + image_sizes: list[list[int]], + possible_resolutions: list[tuple[int, int]], + patch_size: int, + grid_size: int, + image_newline: torch.FloatTensor, + num_queries_vis_abstractor: int = -1, + unpad: bool = False, +) -> list[torch.FloatTensor]: + height = width = grid_size // patch_size + + if num_queries_vis_abstractor > 0: + assert (num_queries_vis_abstractor**0.5 + ).is_integer(), "n_queries must be square number" + height 
= width = int(num_queries_vis_abstractor**0.5) + + # post-processing (unpad, add newline) + new_image_features = [] + for image_idx, image_feature in enumerate(image_forward_outs): + if image_feature.shape[0] > 1: + image_feature = reshape_and_unpad_image_features( + image_feature=image_feature, + height=height, + width=width, + image_size=image_sizes[image_idx], + possible_resolutions=possible_resolutions, + grid_size=grid_size, # Pass grid info if needed by helper + unpad=unpad, + image_newline=image_newline, + ) + else: + image_feature = image_feature[0] + image_feature = torch.cat( + (image_feature, image_newline[None].to(image_feature.device)), + dim=0) + new_image_features.append(image_feature) + image_features = new_image_features + return image_features + + +def resize_image( + image: Union[np.ndarray, PIL.Image.Image], + max_side: int = 378, +) -> np.ndarray: + image_arr = image + if isinstance(image, np.ndarray): + image = Image.fromarray(image) + + width, height = image.size + cur_max_size = max(width, height) + if cur_max_size <= max_side: + return image_arr + + scale = max_side / cur_max_size + width = int(width * scale) + height = int(height * scale) + image = image.resize((width, height), Image.LANCZOS) + image_arr = np.array(image) + return image_arr diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 7470b31e125..14a8ac7876f 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -81,6 +81,7 @@ "Grok1ModelForCausalLM": ("grok1", "Grok1ForCausalLM"), "HunYuanMoEV1ForCausalLM": ("hunyuan_v1", "HunYuanMoEV1ForCausalLM"), "HunYuanDenseV1ForCausalLM": ("hunyuan_v1", "HunYuanDenseV1ForCausalLM"), + "HCXVisionForCausalLM": ("hyperclovax_vision", "HCXVisionForCausalLM"), "InternLMForCausalLM": ("llama", "LlamaForCausalLM"), "InternLM2ForCausalLM": ("internlm2", "InternLM2ForCausalLM"), "InternLM2VEForCausalLM": ("internlm2_ve", "InternLM2VEForCausalLM"), From 0895e62c5d012e755b604258fe453db781394fd3 Mon Sep 17 00:00:00 2001 From: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Date: Fri, 25 Jul 2025 06:49:11 -0700 Subject: [PATCH 363/552] [Frontend] Add request_id to the Request object so they can be controlled better via external load balancers (#21009) Signed-off-by: Kourosh Hakhamaneshi Signed-off-by: x22x22 --- vllm/entrypoints/openai/protocol.py | 21 +++++++++++++++++++ vllm/entrypoints/openai/serving_completion.py | 4 +++- vllm/entrypoints/openai/serving_embedding.py | 5 +++-- 3 files changed, 27 insertions(+), 3 deletions(-) diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py index 6c6ec207a3c..b6b3bf3f530 100644 --- a/vllm/entrypoints/openai/protocol.py +++ b/vllm/entrypoints/openai/protocol.py @@ -1007,6 +1007,13 @@ class CompletionRequest(OpenAIBaseModel): "default: 0). Any priority other than 0 will raise an error " "if the served model does not use priority scheduling."), ) + request_id: str = Field( + default_factory=lambda: f"{random_uuid()}", + description=( + "The request_id related to this request. If the caller does " + "not set it, a random_uuid will be generated. This id is used " + "through out the inference process and return in response."), + ) logits_processors: Optional[LogitsProcessors] = Field( default=None, description=( @@ -1251,6 +1258,13 @@ class EmbeddingCompletionRequest(OpenAIBaseModel): "default: 0). 
Any priority other than 0 will raise an error " "if the served model does not use priority scheduling."), ) + request_id: str = Field( + default_factory=lambda: f"{random_uuid()}", + description=( + "The request_id related to this request. If the caller does " + "not set it, a random_uuid will be generated. This id is used " + "through out the inference process and return in response."), + ) # --8<-- [end:embedding-extra-params] @@ -1302,6 +1316,13 @@ class EmbeddingChatRequest(OpenAIBaseModel): "default: 0). Any priority other than 0 will raise an error " "if the served model does not use priority scheduling."), ) + request_id: str = Field( + default_factory=lambda: f"{random_uuid()}", + description=( + "The request_id related to this request. If the caller does " + "not set it, a random_uuid will be generated. This id is used " + "through out the inference process and return in response."), + ) # --8<-- [end:chat-embedding-extra-params] @model_validator(mode="before") diff --git a/vllm/entrypoints/openai/serving_completion.py b/vllm/entrypoints/openai/serving_completion.py index 323795ca437..22c6b625039 100644 --- a/vllm/entrypoints/openai/serving_completion.py +++ b/vllm/entrypoints/openai/serving_completion.py @@ -113,7 +113,9 @@ async def create_completion( return self.create_error_response( "Echo is unsupported with prompt embeds.") - request_id = f"cmpl-{self._base_request_id(raw_request)}" + request_id = ( + f"cmpl-" + f"{self._base_request_id(raw_request, request.request_id)}") created_time = int(time.time()) request_metadata = RequestResponseMetadata(request_id=request_id) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index a7f95cf2b85..93bed980f9a 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -883,8 +883,9 @@ async def create_embedding( for the API specification. This API mimics the OpenAI Embedding API. 
""" model_name = self._get_model_name(request.model) - request_id = (f"{self.request_id_prefix}-" - f"{self._base_request_id(raw_request)}") + request_id = ( + f"{self.request_id_prefix}-" + f"{self._base_request_id(raw_request, request.request_id)}") ctx = EmbeddingServeContext( request=request, From 9dd097379c04dac2b0b458243cd64b49e48ce985 Mon Sep 17 00:00:00 2001 From: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com> Date: Fri, 25 Jul 2025 09:49:36 -0400 Subject: [PATCH 364/552] [Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel (#20839) Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com> Signed-off-by: Yu Chin Fabian Lim Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com> Co-authored-by: Yu Chin Fabian Lim Signed-off-by: x22x22 --- .../layers/mamba/mamba_mixer2.py | 21 +-- .../layers/mamba/ops/layernorm_gated.py | 168 ++++++++++++++++++ 2 files changed, 176 insertions(+), 13 deletions(-) create mode 100644 vllm/model_executor/layers/mamba/ops/layernorm_gated.py diff --git a/vllm/model_executor/layers/mamba/mamba_mixer2.py b/vllm/model_executor/layers/mamba/mamba_mixer2.py index e32b2be4d40..2c95099e53a 100644 --- a/vllm/model_executor/layers/mamba/mamba_mixer2.py +++ b/vllm/model_executor/layers/mamba/mamba_mixer2.py @@ -24,6 +24,7 @@ extra_groups_for_head_shards, get_mamba_state_shape) from vllm.model_executor.layers.mamba.ops.causal_conv1d import ( causal_conv1d_fn, causal_conv1d_update) +from vllm.model_executor.layers.mamba.ops.layernorm_gated import rms_norm_gated from vllm.model_executor.layers.mamba.ops.mamba_ssm import ( selective_state_update) from vllm.model_executor.layers.mamba.ops.ssd_combined import ( @@ -133,21 +134,15 @@ def forward_cuda( return x * nn.functional.silu(gate.to( torch.float32)).to(input_dtype) - if self.tp_size > 1 or self.n_groups != 1: + if (((self.n_groups % self.tp_size) != 0) or self.n_groups != 1): return self.forward_native(x, gate) - from vllm import _custom_ops as ops - - # cast x and gate to float32 before silu - out = torch.empty_like(x) - y = x * nn.functional.silu(gate.to(torch.float32)) - ops.rms_norm( - out, - y.to(x.dtype), - self.weight.data, - self.variance_epsilon, - ) - return out + return rms_norm_gated(x, + self.weight.data, + bias=None, + z=gate, + eps=self.variance_epsilon, + norm_before_gate=False) def mamba_v2_sharded_weight_loader( diff --git a/vllm/model_executor/layers/mamba/ops/layernorm_gated.py b/vllm/model_executor/layers/mamba/ops/layernorm_gated.py new file mode 100644 index 00000000000..f3a45ab097c --- /dev/null +++ b/vllm/model_executor/layers/mamba/ops/layernorm_gated.py @@ -0,0 +1,168 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# Copyright (c) 2024, Tri Dao. 
+# Adapted from https://github.com/state-spaces/mamba/blob/60dadf2e0ee730ac337035d5533de10bc26e4847/mamba_ssm/ops/triton/layernorm_gated.py + +import torch + +from vllm.triton_utils import tl, triton + + +@triton.heuristics({"HAS_BIAS": lambda args: args["B"] is not None}) +@triton.heuristics({"HAS_Z": lambda args: args["Z"] is not None}) +@triton.jit +def _layer_norm_fwd_1pass_kernel( + X, # pointer to the input + Y, # pointer to the output + W, # pointer to the weights + B, # pointer to the biases + Z, # pointer to the other branch + Mean, # pointer to the mean + Rstd, # pointer to the 1/std + stride_x_row: tl.int64, + stride_y_row: tl.int64, + stride_z_row: tl.int64, + M: tl.int64, # number of rows in X + N: tl.int64, # number of columns in X + eps, # epsilon to avoid division by zero + BLOCK_N: tl.constexpr, + HAS_BIAS: tl.constexpr, + HAS_Z: tl.constexpr, + NORM_BEFORE_GATE: tl.constexpr, + IS_RMS_NORM: tl.constexpr, +): + # Map the program id to the row of X and Y it should compute. + row = tl.program_id(0) + group = tl.program_id(1) + X += row * stride_x_row + group * N + Y += row * stride_y_row + group * N + if HAS_Z: + Z += row * stride_z_row + group * N + if not IS_RMS_NORM: + Mean += group * M + Rstd += group * M + W += group * N + if HAS_BIAS: + B += group * N + # Compute mean and variance + cols = tl.arange(0, BLOCK_N) + x = tl.load(X + cols, mask=cols < N, other=0.).to(tl.float32) + if HAS_Z and not NORM_BEFORE_GATE: + z = tl.load(Z + cols, mask=cols < N).to(tl.float32) + x *= z * tl.sigmoid(z) + if not IS_RMS_NORM: + mean = tl.sum(x, axis=0) / N + tl.store(Mean + row, mean) + xbar = tl.where(cols < N, x - mean, 0.) + var = tl.sum(xbar * xbar, axis=0) / N + else: + xbar = tl.where(cols < N, x, 0.) + var = tl.sum(xbar * xbar, axis=0) / N + rstd = 1 / tl.sqrt(var + eps) + tl.store(Rstd + row, rstd) + # Normalize and apply linear transformation + mask = cols < N + w = tl.load(W + cols, mask=mask).to(tl.float32) + if HAS_BIAS: + b = tl.load(B + cols, mask=mask).to(tl.float32) + x_hat = (x - mean) * rstd if not IS_RMS_NORM else x * rstd + y = x_hat * w + b if HAS_BIAS else x_hat * w + if HAS_Z and NORM_BEFORE_GATE: + z = tl.load(Z + cols, mask=mask).to(tl.float32) + y *= z * tl.sigmoid(z) + # Write output + tl.store(Y + cols, y, mask=mask) + + +def _layer_norm_fwd(x, + weight, + bias, + eps, + z=None, + out=None, + group_size=None, + norm_before_gate=True, + is_rms_norm=False): + M, N = x.shape + if group_size is None: + group_size = N + assert N % group_size == 0 + ngroups = N // group_size + assert x.stride(-1) == 1 + if z is not None: + assert z.stride(-1) == 1 + assert z.shape == (M, N) + assert weight.shape == (N, ) + assert weight.stride(-1) == 1 + if bias is not None: + assert bias.stride(-1) == 1 + assert bias.shape == (N, ) + # allocate output + if out is not None: + assert out.shape == x.shape + else: + out = torch.empty_like(x) + assert out.stride(-1) == 1 + mean = torch.empty((ngroups * M, ), dtype=torch.float32, + device=x.device) if not is_rms_norm else None + rstd = torch.empty((ngroups * M, ), dtype=torch.float32, device=x.device) + # Less than 64KB per feature: enqueue fused kernel + MAX_FUSED_SIZE = 65536 // x.element_size() + BLOCK_N = min(MAX_FUSED_SIZE, triton.next_power_of_2(group_size)) + if group_size > BLOCK_N: + raise RuntimeError( + "This layer norm doesn't support feature dim >= 64KB.") + # heuristics for number of warps + num_warps = min(max(BLOCK_N // 256, 1), 8) + grid = (M, ngroups) + with torch.cuda.device(x.device.index): + 
_layer_norm_fwd_1pass_kernel[grid](x, + out, + weight, + bias, + z, + mean, + rstd, + x.stride(0), + out.stride(0), + z.stride(0) if z is not None else 0, + M, + group_size, + eps, + BLOCK_N=BLOCK_N, + NORM_BEFORE_GATE=norm_before_gate, + IS_RMS_NORM=is_rms_norm, + num_warps=num_warps) + return out, mean, rstd + + +def rms_norm_gated(x, + weight, + bias, + z=None, + eps=1e-6, + group_size=None, + norm_before_gate=True): + x_shape_og = x.shape + # reshape input data into 2D tensor + x = x.reshape(-1, x.shape[-1]) + if x.stride(-1) != 1: + x = x.contiguous() + if z is not None: + assert z.shape == x_shape_og + z = z.reshape(-1, z.shape[-1]) + if z.stride(-1) != 1: + z = z.contiguous() + weight = weight.contiguous() + if bias is not None: + bias = bias.contiguous() + y, _, _ = _layer_norm_fwd(x, + weight, + bias, + eps, + z=z, + group_size=group_size, + norm_before_gate=norm_before_gate, + is_rms_norm=True) + + return y.reshape(x_shape_og) From a1d17e9c22687e8fdee2394081607c938b0af096 Mon Sep 17 00:00:00 2001 From: who who who Date: Fri, 25 Jul 2025 21:50:21 +0800 Subject: [PATCH 365/552] [ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. (#20295) Signed-off-by: fsx950223 Signed-off-by: amd-ruitang3 Co-authored-by: amd-ruitang3 Signed-off-by: x22x22 --- .../attention/test_aiter_flash_attn.py | 191 +++++++++++++++ vllm/v1/attention/backends/rocm_aiter_fa.py | 225 ++++++++++-------- 2 files changed, 320 insertions(+), 96 deletions(-) create mode 100644 tests/kernels/attention/test_aiter_flash_attn.py diff --git a/tests/kernels/attention/test_aiter_flash_attn.py b/tests/kernels/attention/test_aiter_flash_attn.py new file mode 100644 index 00000000000..d0687c62b11 --- /dev/null +++ b/tests/kernels/attention/test_aiter_flash_attn.py @@ -0,0 +1,191 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +from typing import Optional + +import pytest +import torch + +import vllm.v1.attention.backends.rocm_aiter_fa # noqa: F401 +from vllm.platforms import current_platform + +NUM_HEADS = [(4, 4), (8, 2), (16, 2)] +HEAD_SIZES = [128, 256] +BLOCK_SIZES = [16, 32] +DTYPES = [torch.float16, torch.bfloat16] +QDTYPES = [None] +# one value large enough to test overflow in index calculation. 
+# one value small enough to test the schema op check +NUM_BLOCKS = [32768, 2048] + + +def ref_paged_attn( + query: torch.Tensor, + key_cache: torch.Tensor, + value_cache: torch.Tensor, + query_lens: list[int], + kv_lens: list[int], + block_tables: torch.Tensor, + scale: float, + sliding_window: Optional[int] = None, + soft_cap: Optional[float] = None, +) -> torch.Tensor: + num_seqs = len(query_lens) + block_tables = block_tables.cpu().numpy() + _, block_size, num_kv_heads, head_size = key_cache.shape + + outputs: list[torch.Tensor] = [] + start_idx = 0 + for i in range(num_seqs): + query_len = query_lens[i] + kv_len = kv_lens[i] + q = query[start_idx:start_idx + query_len] + q *= scale + + num_kv_blocks = (kv_len + block_size - 1) // block_size + block_indices = block_tables[i, :num_kv_blocks] + + k = key_cache[block_indices].view(-1, num_kv_heads, head_size) + k = k[:kv_len] + v = value_cache[block_indices].view(-1, num_kv_heads, head_size) + v = v[:kv_len] + + if q.shape[1] != k.shape[1]: + k = torch.repeat_interleave(k, q.shape[1] // k.shape[1], dim=1) + v = torch.repeat_interleave(v, q.shape[1] // v.shape[1], dim=1) + attn = torch.einsum("qhd,khd->hqk", q, k).float() + empty_mask = torch.ones(query_len, kv_len) + mask = torch.triu(empty_mask, diagonal=kv_len - query_len + 1).bool() + if sliding_window is not None: + sliding_window_mask = torch.triu(empty_mask, + diagonal=kv_len - + (query_len + sliding_window) + + 1).bool().logical_not() + mask |= sliding_window_mask + if soft_cap is not None: + attn = soft_cap * torch.tanh(attn / soft_cap) + attn.masked_fill_(mask, float("-inf")) + attn = torch.softmax(attn, dim=-1).to(v.dtype) + out = torch.einsum("hqk,khd->qhd", attn, v) + + outputs.append(out) + start_idx += query_len + + return torch.cat(outputs, dim=0) + + +@pytest.mark.skipif(not current_platform.is_rocm(), + reason="Only ROCm is supported") +@pytest.mark.parametrize("seq_lens", + [[(10, 1328), (5, 18), + (129, 463)], [(8, 523), (24, 37), (3, 2011)]]) +@pytest.mark.parametrize("num_heads", NUM_HEADS) +@pytest.mark.parametrize("head_size", HEAD_SIZES) +@pytest.mark.parametrize("block_size", BLOCK_SIZES) +@pytest.mark.parametrize("sliding_window", [None, 256]) +@pytest.mark.parametrize("dtype", DTYPES) +@pytest.mark.parametrize("soft_cap", [None]) +@pytest.mark.parametrize("num_blocks", NUM_BLOCKS) +@pytest.mark.parametrize("q_dtype", QDTYPES) +@torch.inference_mode() +def test_varlen_with_paged_kv( + seq_lens: list[tuple[int, int]], + num_heads: tuple[int, int], + head_size: int, + sliding_window: Optional[int], + dtype: torch.dtype, + block_size: int, + soft_cap: Optional[float], + num_blocks: int, + q_dtype: Optional[torch.dtype], +) -> None: + torch.set_default_device("cuda") + current_platform.seed_everything(0) + num_seqs = len(seq_lens) + query_lens = [x[0] for x in seq_lens] + kv_lens = [x[1] for x in seq_lens] + num_query_heads = num_heads[0] + num_kv_heads = num_heads[1] + assert num_query_heads % num_kv_heads == 0 + max_query_len = max(query_lens) + max_kv_len = max(kv_lens) + window_size = ((sliding_window - 1, 0) if sliding_window is not None else + (-1, -1)) + scale = head_size**-0.5 + + query = torch.randn(sum(query_lens), + num_query_heads, + head_size, + dtype=dtype) + key_cache = torch.randn(num_blocks, + block_size, + num_kv_heads, + head_size, + dtype=dtype) + value_cache = torch.randn_like(key_cache) + cu_query_lens = torch.tensor([0] + query_lens, + dtype=torch.int32).cumsum(dim=0, + dtype=torch.int32) + + cu_seq_lens = torch.tensor([0] + kv_lens, + 
dtype=torch.int32).cumsum(dim=0, + dtype=torch.int32) + kv_lens = torch.tensor(kv_lens, dtype=torch.int32) + + max_num_blocks_per_seq = (max_kv_len + block_size - 1) // block_size + block_tables = torch.randint(0, + num_blocks, + (num_seqs, max_num_blocks_per_seq), + dtype=torch.int32) + + output = torch.empty_like(query) + + maybe_quantized_query = query + maybe_quantized_key_cache = key_cache + maybe_quantized_value_cache = value_cache + k_descale = None + v_descale = None + if q_dtype is not None: + # QKV are drawn from N(0, 1): no need for a fp8 scaling factor + maybe_quantized_query = query.to(q_dtype) + maybe_quantized_key_cache = key_cache.to(q_dtype) + maybe_quantized_value_cache = value_cache.to(q_dtype) + + scale_shape = (num_seqs, num_kv_heads) + k_descale = torch.ones(scale_shape, dtype=torch.float32) + v_descale = torch.ones(scale_shape, dtype=torch.float32) + + torch.ops.vllm.flash_attn_varlen_func( + maybe_quantized_query, + maybe_quantized_key_cache, + maybe_quantized_value_cache, + out=output, + cu_seqlens_q=cu_query_lens, + max_seqlen_q=max_query_len, + max_seqlen_k=max_kv_len, + softmax_scale=scale, + alibi_slopes=None, + window_size=window_size, + block_table=block_tables, + cu_seqlens_k=cu_seq_lens, + k_scale=k_descale, + v_scale=v_descale, + ) + + ref_output = ref_paged_attn( + query=query, + key_cache=key_cache, + value_cache=value_cache, + query_lens=query_lens, + kv_lens=kv_lens, + block_tables=block_tables, + scale=scale, + sliding_window=sliding_window, + soft_cap=soft_cap, + ) + + atol, rtol = 2e-2, 2e-2 + if q_dtype is not None: + atol, rtol = 1.5e-1, 1.5e-1 + torch.testing.assert_close(output, ref_output, atol=atol, rtol=rtol), \ + f"{torch.max(torch.abs(output - ref_output))}" diff --git a/vllm/v1/attention/backends/rocm_aiter_fa.py b/vllm/v1/attention/backends/rocm_aiter_fa.py index 0739d259667..85a5dc8c91c 100644 --- a/vllm/v1/attention/backends/rocm_aiter_fa.py +++ b/vllm/v1/attention/backends/rocm_aiter_fa.py @@ -2,20 +2,21 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project """Attention layer with AiterFlashAttention.""" from dataclasses import dataclass -from typing import Optional +from typing import ClassVar, Optional import torch -from vllm import _custom_ops as ops from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl, - AttentionMetadata, AttentionType, - is_quantized_kv_cache) + AttentionMetadata, AttentionType) from vllm.config import VllmConfig from vllm.logger import init_logger from vllm.platforms import current_platform -from vllm.v1.attention.backends.utils import CommonAttentionMetadata +from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder, + CommonAttentionMetadata) from vllm.v1.kv_cache_interface import AttentionSpec +_PARTITION_SIZE_ROCM = 256 + if current_platform.is_rocm(): import aiter @@ -32,38 +33,54 @@ def _vllm_layout_trans_kernel( b_seq_lens_loc, block_table, block_table_stride_0, + k_scale, + v_scale, + output_dtype: tl.constexpr, E_DIM: tl.constexpr, BLOCK_SIZE: tl.constexpr, ): batch_idx = tl.program_id(0) block_idx = tl.program_id(1) - batch_token_indexes = tl.load(b_seq_lens_loc + batch_idx + - tl.arange(0, 2)) - batch_token_start, batch_token_end = tl.split(batch_token_indexes) - seq_len = batch_token_end - batch_token_start batch_query_indexes = tl.load(b_query_lens_loc + batch_idx + tl.arange(0, 2)) batch_query_start, batch_query_end = tl.split(batch_query_indexes) query_len = batch_query_end - batch_query_start + if query_len <= 1: return + + 
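# Decode-only sequences returned early above; for the rest, look up this
+        # sequence's [start, end) offsets in the cumulative seq-lens table.
+        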
batch_token_indexes = tl.load(b_seq_lens_loc + batch_idx + + tl.arange(0, 2)) + batch_token_start, batch_token_end = tl.split(batch_token_indexes) + seq_len = batch_token_end - batch_token_start + if block_idx * BLOCK_SIZE < seq_len: block_mask = (block_idx * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)[:, None]) < seq_len kv_idx = tl.load(block_table + batch_idx * block_table_stride_0 + - block_idx) + block_idx).to(tl.int64) kv_buffer_off = kv_idx * BLOCK_SIZE * E_DIM + tl.arange( 0, BLOCK_SIZE)[:, None] * E_DIM + tl.arange(0, E_DIM)[None, :] k_vals = tl.load(k_buffer_ptr + kv_buffer_off, mask=block_mask, other=0.0) + if k_vals.dtype.is_fp8(): + k_vals = (k_vals.to(tl.float32) * + tl.load(k_scale)).to(output_dtype) + else: + k_vals = k_vals.to(output_dtype) + v_vals = tl.load(v_buffer_ptr + kv_buffer_off, mask=block_mask, other=0.0) - + if v_vals.dtype.is_fp8(): + v_vals = (v_vals.to(tl.float32) * + tl.load(v_scale)).to(output_dtype) + else: + v_vals = v_vals.to(output_dtype) kv_values_off = batch_token_start * E_DIM + \ block_idx * BLOCK_SIZE * E_DIM + \ tl.arange(0, BLOCK_SIZE)[:, None] * E_DIM + \ @@ -72,29 +89,44 @@ def _vllm_layout_trans_kernel( tl.store(v_values_ptr + kv_values_off, v_vals, mask=block_mask) def vllm_layout_trans(b_query_lens_loc, b_seq_lens_loc, block_table, - k_buffer, v_buffer, max_seq_len, total_tokens): - H_KV = v_buffer.shape[2] - D = v_buffer.shape[3] - BLOCK_SIZE = v_buffer.shape[1] - dtype = k_buffer.dtype - k_values = torch.empty((total_tokens, H_KV, D), - dtype=dtype, - device="cuda") - v_values = torch.empty((total_tokens, H_KV, D), - dtype=dtype, - device="cuda") + k_cache, v_cache, max_seq_len, k_scale, v_scale, + output_dtype, total_tokens): + H_KV = v_cache.shape[2] + D = v_cache.shape[3] + BLOCK_SIZE = v_cache.shape[1] + + k_values = torch.empty( + (total_tokens, H_KV, D), + dtype=output_dtype, + device=k_cache.device, + ) + v_values = torch.empty( + (total_tokens, H_KV, D), + dtype=output_dtype, + device=v_cache.device, + ) grid = (block_table.shape[0], (max_seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE) - _vllm_layout_trans_kernel[grid](k_buffer, - v_buffer, + if output_dtype == torch.float16: + output_dtype = tl.float16 + elif output_dtype == torch.bfloat16: + output_dtype = tl.bfloat16 + else: + raise ValueError(f"Unsupported output dtype: {output_dtype}") + + _vllm_layout_trans_kernel[grid](k_cache, + v_cache, k_values, v_values, b_query_lens_loc, b_seq_lens_loc, block_table, block_table.stride(0), + k_scale, + v_scale, + output_dtype=output_dtype, E_DIM=H_KV * D, BLOCK_SIZE=BLOCK_SIZE) @@ -107,16 +139,22 @@ def flash_attn_varlen_func_impl( out: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, - total_tokens: int, max_seqlen_q: int, max_seqlen_k: int, softmax_scale: float, window_size: Optional[list[int]], # -1 means infinite context window alibi_slopes: Optional[list[float]], block_table: torch.Tensor, + k_scale: torch.Tensor, + v_scale: torch.Tensor, + total_tokens: int = 0, ) -> torch.Tensor: + if total_tokens == 0: + total_tokens = int(cu_seqlens_k[-1].item()) k, v = vllm_layout_trans(cu_seqlens_q, cu_seqlens_k, block_table, - k_cache, v_cache, max_seqlen_k, total_tokens) + k_cache, v_cache, max_seqlen_k, k_scale, + v_scale, q.dtype, total_tokens) + output = aiter.flash_attn_varlen_func( q=q, k=k, @@ -141,19 +179,21 @@ def flash_attn_varlen_func_fake( out: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, - total_tokens: int, max_seqlen_q: int, max_seqlen_k: int, softmax_scale: float, window_size: 
Optional[list[int]], # -1 means infinite context window alibi_slopes: Optional[list[float]], block_table: torch.Tensor, + k_scale: torch.Tensor, + v_scale: torch.Tensor, + total_tokens: int = 0, ) -> torch.Tensor: return torch.empty(q.shape[0], q.shape[1], v_cache.shape[-2], - dtype=torch.float8_e4m3fnuz, - device="cuda") + dtype=q.dtype, + device=q.device) direct_register_custom_op("flash_attn_varlen_func", flash_attn_varlen_func_impl, ["out"], @@ -163,7 +203,33 @@ def flash_attn_varlen_func_fake( logger = init_logger(__name__) -class AiterFlashAttentionMetadataBuilder: +@dataclass +class AiterFlashAttentionMetadata: + # NOTE(sang): Definition of context_len, query_len, and seq_len. + # |---------- N-1 iteration --------| + # |---------------- N iteration ---------------------| + # |- tokenA -|......................|-- newTokens ---| + # |---------- context_len ----------| + # |-------------------- seq_len ---------------------| + # |-- query_len ---| + + num_actual_tokens: int # Number of tokens excluding padding. + max_query_len: int + query_start_loc: torch.Tensor + max_seq_len: int + seq_lens: torch.Tensor + slot_mapping: torch.Tensor + block_table: torch.Tensor + + # For cascade attention. + use_cascade: bool + common_prefix_len: int + total_tokens: int + + +class AiterFlashAttentionMetadataBuilder( + AttentionMetadataBuilder[AiterFlashAttentionMetadata]): + full_cudagraph_supported: ClassVar[bool] = True def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, device: torch.device): @@ -180,14 +246,23 @@ def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, self.headdim = self.model_config.get_head_size() self.block_size = kv_cache_spec.block_size self.kv_cache_spec = kv_cache_spec - # Sliding window size to be used with the AOT scheduler will be # populated on first build() call. 
self.aot_sliding_window: Optional[tuple[int, int]] = None + self.total_tokens: int = 0 def reorder_batch(self, input_batch, scheduler_output) -> bool: return False + def build_for_cudagraph_capture( + self, common_attn_metadata: CommonAttentionMetadata): + self.total_tokens = self.model_config.max_model_len \ + * self.vllm_config.scheduler_config.max_num_partial_prefills + res = self.build(common_prefix_len=0, + common_attn_metadata=common_attn_metadata) + self.total_tokens = 0 + return res + def build(self, common_prefix_len: int, common_attn_metadata: CommonAttentionMetadata, @@ -195,43 +270,29 @@ def build(self, num_actual_tokens = common_attn_metadata.num_actual_tokens max_query_len = common_attn_metadata.max_query_len - max_seq_len = int(common_attn_metadata.seq_lens_cpu.max()) - total_tokens = int(common_attn_metadata.seq_lens_cpu.sum()) query_start_loc = common_attn_metadata.query_start_loc seq_lens = common_attn_metadata.seq_lens block_table_tensor = common_attn_metadata.block_table_tensor slot_mapping = common_attn_metadata.slot_mapping - cu_seq_lens = torch.zeros(seq_lens.shape[0] + 1, - dtype=torch.int32, - device=self.device) - torch.cumsum(seq_lens, - dim=0, - dtype=cu_seq_lens.dtype, - out=cu_seq_lens[1:]) + def schedule(batch_size, cu_query_lens, max_query_len, seqlens, + max_seq_len, causal): + return None use_cascade = common_prefix_len > 0 - cu_prefix_query_lens = None - prefix_kv_lens = None - suffix_kv_lens = None - attn_metadata = AiterFlashAttentionMetadata( num_actual_tokens=num_actual_tokens, max_query_len=max_query_len, query_start_loc=query_start_loc, max_seq_len=max_seq_len, seq_lens=seq_lens, - cu_seq_lens=cu_seq_lens, - total_tokens=total_tokens, block_table=block_table_tensor, slot_mapping=slot_mapping, use_cascade=use_cascade, common_prefix_len=common_prefix_len, - cu_prefix_query_lens=cu_prefix_query_lens, - prefix_kv_lens=prefix_kv_lens, - suffix_kv_lens=suffix_kv_lens, + total_tokens=self.total_tokens, ) return attn_metadata @@ -254,7 +315,7 @@ def get_supported_dtypes(cls) -> list[torch.dtype]: @classmethod def get_supported_head_sizes(cls) -> list[int]: - return [32, 64, 96, 128, 160, 192, 224, 256] + return [64, 128, 256] @classmethod def validate_head_size(cls, head_size: int) -> None: @@ -295,34 +356,6 @@ def get_kv_cache_shape( return (2, num_blocks, block_size, num_kv_heads, head_size) -@dataclass -class AiterFlashAttentionMetadata: - # NOTE(sang): Definition of context_len, query_len, and seq_len. - # |---------- N-1 iteration --------| - # |---------------- N iteration ---------------------| - # |- tokenA -|......................|-- newTokens ---| - # |---------- context_len ----------| - # |-------------------- seq_len ---------------------| - # |-- query_len ---| - - num_actual_tokens: int # Number of tokens excluding padding. - max_query_len: int - query_start_loc: torch.Tensor - max_seq_len: int - seq_lens: torch.Tensor - cu_seq_lens: torch.Tensor - total_tokens: int - block_table: torch.Tensor - slot_mapping: torch.Tensor - - # For cascade attention. 
- use_cascade: bool - common_prefix_len: int - cu_prefix_query_lens: Optional[torch.Tensor] - prefix_kv_lens: Optional[torch.Tensor] - suffix_kv_lens: Optional[torch.Tensor] - - class AiterFlashAttentionImpl(AttentionImpl): def __init__( @@ -366,10 +399,6 @@ def __init__( "encoder/decoder cross-attention " "are not implemented for " "FlashAttentionImpl") - if is_quantized_kv_cache(self.kv_cache_dtype): - raise NotImplementedError( - "AiterFlashAttention does not support fp8 kv-cache on this " - "device.") def forward( self, @@ -440,12 +469,6 @@ def forward( if self.kv_cache_dtype.startswith("fp8"): key_cache = key_cache.view(torch.float8_e4m3fnuz) value_cache = value_cache.view(torch.float8_e4m3fnuz) - num_tokens, num_heads, head_size = query.shape - query, _ = ops.scaled_fp8_quant( - query.reshape( - (num_tokens, num_heads * head_size)).contiguous(), - layer._q_scale) - query = query.reshape((num_tokens, num_heads, head_size)) if not attn_metadata.use_cascade: cu_seqlens_q = attn_metadata.query_start_loc @@ -455,8 +478,16 @@ def forward( block_table = attn_metadata.block_table if max_seqlen_q > 1: - cu_seq_lens = attn_metadata.cu_seq_lens - total_tokens = attn_metadata.total_tokens + + cu_seq_lens = torch.zeros(seqused_k.shape[0] + 1, + dtype=torch.int32, + device=query.device) + + torch.cumsum(seqused_k, + dim=0, + dtype=cu_seq_lens.dtype, + out=cu_seq_lens[1:]) + torch.ops.vllm.flash_attn_varlen_func( query[:num_actual_tokens], key_cache, @@ -465,29 +496,31 @@ def forward( cu_seqlens_q=cu_seqlens_q, max_seqlen_q=max_seqlen_q, max_seqlen_k=max_seqlen_k, - total_tokens=total_tokens, softmax_scale=self.scale, alibi_slopes=self.alibi_slopes, window_size=self.sliding_window, block_table=block_table, - cu_seqlens_k=cu_seq_lens) + cu_seqlens_k=cu_seq_lens, + k_scale=layer._k_scale, + v_scale=layer._v_scale, + total_tokens=attn_metadata.total_tokens, + ) _, num_heads, head_size = query.shape - _PARTITION_SIZE_ROCM = 256 + nbytes_per_qo_elem = torch.finfo(query.dtype).bits // 8 num_seqs = seqused_k.shape[0] - nbyes_per_qo_elem = torch.finfo(output.dtype).bits // 8 max_num_partitions = (max_seqlen_k + _PARTITION_SIZE_ROCM - 1) // _PARTITION_SIZE_ROCM workspace_buffer = torch.empty( (num_seqs * num_heads * max_num_partitions * head_size) * - nbyes_per_qo_elem + 2 * + nbytes_per_qo_elem + 2 * (num_seqs * num_heads * max_num_partitions) * 4, dtype=torch.uint8, device=output.device, ) - aiter.paged_attention_v1( + torch.ops.aiter.paged_attention_v1( output[:num_actual_tokens], workspace_buffer, query[:num_actual_tokens], From 2640f7b9f4e2549b77136220e74ce852151b0f05 Mon Sep 17 00:00:00 2001 From: czhu-cohere Date: Fri, 25 Jul 2025 06:53:21 -0700 Subject: [PATCH 366/552] [Kernel] Improve machete memory bound perf (#21556) Signed-off-by: czhu-cohere Signed-off-by: x22x22 --- csrc/quantization/machete/machete_prepacked_layout.cuh | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/csrc/quantization/machete/machete_prepacked_layout.cuh b/csrc/quantization/machete/machete_prepacked_layout.cuh index 81aaa6c4f3a..4a7d6341e6c 100644 --- a/csrc/quantization/machete/machete_prepacked_layout.cuh +++ b/csrc/quantization/machete/machete_prepacked_layout.cuh @@ -187,8 +187,12 @@ struct PrepackedLayoutBTemplate { CUTE_HOST_DEVICE static constexpr auto TVbNbKL_to_offset_copy( Shape_NKL shape_mkl) { auto layout = TVbNbKL_to_offset(shape_mkl); - return make_layout(coalesce(get<0>(layout)), get<1>(layout), - get<2>(layout)); + // for 4-bit elements, having >= 64 values per column + // allows TMA 
to load full 32-byte sectors + auto inner_layout = + make_layout(make_shape(_256{}, size<0>(layout) / _256{})); + + return make_layout(inner_layout, get<1>(layout), get<2>(layout)); } // ((BlockN, BlockK), (BlocksN, BlocksK), L) -> (storage_idx) From bab02d9d28206649849fb3f1fbc4e8268af03fda Mon Sep 17 00:00:00 2001 From: mgazz Date: Fri, 25 Jul 2025 15:01:27 +0100 Subject: [PATCH 367/552] Add support for Prithvi in Online serving mode (#21518) Signed-off-by: Michele Gazzetti Co-authored-by: Cyrus Leung Signed-off-by: x22x22 --- .../entrypoints/openai/test_skip_tokenizer.py | 93 +++++++++++++++++++ vllm/engine/multiprocessing/client.py | 20 ++-- vllm/entrypoints/openai/serving_engine.py | 14 ++- vllm/entrypoints/openai/serving_pooling.py | 6 +- .../models/prithvi_geospatial_mae.py | 5 +- 5 files changed, 128 insertions(+), 10 deletions(-) create mode 100644 tests/entrypoints/openai/test_skip_tokenizer.py diff --git a/tests/entrypoints/openai/test_skip_tokenizer.py b/tests/entrypoints/openai/test_skip_tokenizer.py new file mode 100644 index 00000000000..32d28277e0e --- /dev/null +++ b/tests/entrypoints/openai/test_skip_tokenizer.py @@ -0,0 +1,93 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import base64 +import io + +import numpy as np +import pytest +import requests +import torch + +from ...utils import RemoteOpenAIServer + +MODEL_NAME = "christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM" +DTYPE = "float16" + + +@pytest.fixture(autouse=True) +def v1(run_with_both_engines): + # Simple autouse wrapper to run both engines for each test + # This can be promoted up to conftest.py to run for every + # test in a package + pass + + +@pytest.fixture(scope="module") +def server(): + args = [ + "--task", + "embed", + # use half precision for speed and memory savings in CI environment + "--dtype", + DTYPE, + "--enforce-eager", + "--trust-remote-code", + "--skip-tokenizer-init", + "--max-num-seqs", + "32" + ] + + with RemoteOpenAIServer(MODEL_NAME, args) as remote_server: + yield remote_server + + +@pytest.mark.asyncio +@pytest.mark.parametrize("model_name", [MODEL_NAME]) +async def test_single_request(server: RemoteOpenAIServer, model_name: str): + + pixel_values = torch.full((6, 512, 512), 1.0, dtype=torch.float16) + location_coords = torch.full((1, 2), 1.0, dtype=torch.float16) + + buffer_tiff = io.BytesIO() + torch.save(pixel_values, buffer_tiff) + buffer_tiff.seek(0) + binary_data = buffer_tiff.read() + base64_tensor_embedding = base64.b64encode(binary_data).decode('utf-8') + + buffer_coord = io.BytesIO() + torch.save(location_coords, buffer_coord) + buffer_coord.seek(0) + binary_data = buffer_coord.read() + base64_coord_embedding = base64.b64encode(binary_data).decode('utf-8') + + prompt = { + "model": + model_name, + "additional_data": { + "prompt_token_ids": [1] + }, + "encoding_format": + "base64", + "messages": [{ + "role": + "user", + "content": [{ + "type": "image_embeds", + "image_embeds": { + "pixel_values": base64_tensor_embedding, + "location_coords": base64_coord_embedding, + }, + }], + }] + } + + # test single pooling + response = requests.post(server.url_for("pooling"), json=prompt) + response.raise_for_status() + + output = response.json()["data"][0]['data'] + + np_response = np.frombuffer(base64.b64decode(output), dtype=np.float32) + + assert len(np_response) == 524288 diff --git a/vllm/engine/multiprocessing/client.py b/vllm/engine/multiprocessing/client.py index 67d9a3bf6ce..cde8fc367fb 100644 --- 
a/vllm/engine/multiprocessing/client.py +++ b/vllm/engine/multiprocessing/client.py @@ -97,11 +97,16 @@ def __init__(self, ipc_path: str, engine_config: VllmConfig, self.model_config = engine_config.model_config self.decoding_config = engine_config.decoding_config - # Create the tokenizer group. - self.tokenizer = init_tokenizer_from_configs( - model_config=self.model_config, - scheduler_config=engine_config.scheduler_config, - lora_config=engine_config.lora_config) + if self.vllm_config.model_config.skip_tokenizer_init: + self.tokenizer = None + + else: + # Create the tokenizer group. + self.tokenizer = init_tokenizer_from_configs( + model_config=self.model_config, + scheduler_config=engine_config.scheduler_config, + lora_config=engine_config.lora_config) + self.input_preprocessor = InputPreprocessor(self.model_config, self.tokenizer) @@ -375,7 +380,10 @@ async def get_input_preprocessor(self) -> InputPreprocessor: return self.input_preprocessor async def get_tokenizer(self, lora_request: Optional[LoRARequest] = None): - return await self.tokenizer.get_lora_tokenizer_async(lora_request) + if self.tokenizer is None: + return None + else: + return await self.tokenizer.get_lora_tokenizer_async(lora_request) async def get_vllm_config(self) -> VllmConfig: return self.vllm_config diff --git a/vllm/entrypoints/openai/serving_engine.py b/vllm/entrypoints/openai/serving_engine.py index 7b230703d86..fb4598b2f73 100644 --- a/vllm/entrypoints/openai/serving_engine.py +++ b/vllm/entrypoints/openai/serving_engine.py @@ -886,7 +886,10 @@ async def _preprocess_chat( _chat_template_kwargs.update(chat_template_kwargs or {}) request_prompt: Union[str, list[int]] - if isinstance(tokenizer, MistralTokenizer): + + if tokenizer is None: + request_prompt = "placeholder" + elif isinstance(tokenizer, MistralTokenizer): request_prompt = apply_mistral_chat_template( tokenizer, messages=messages, @@ -927,7 +930,14 @@ async def _preprocess_chat( request = tool_parser(tokenizer).adjust_request( # type: ignore request=request) - if isinstance(request_prompt, str): + if tokenizer is None: + assert isinstance(request_prompt, str), ( + "Prompt has to be a string", \ + "when the tokenizer is not initialised" + ) + prompt_inputs = TextTokensPrompt(prompt=request_prompt, + prompt_token_ids=[1]) + elif isinstance(request_prompt, str): prompt_inputs = await self._tokenize_prompt_input_async( request, tokenizer, diff --git a/vllm/entrypoints/openai/serving_pooling.py b/vllm/entrypoints/openai/serving_pooling.py index 12334cdac36..38745d001ad 100644 --- a/vllm/entrypoints/openai/serving_pooling.py +++ b/vllm/entrypoints/openai/serving_pooling.py @@ -96,7 +96,11 @@ async def create_pooling( self.max_model_len, truncate_prompt_tokens) lora_request = self._maybe_get_adapters(request) - tokenizer = await self.engine_client.get_tokenizer(lora_request) + if self.model_config.skip_tokenizer_init: + tokenizer = None + else: + tokenizer = await self.engine_client.get_tokenizer(lora_request + ) if isinstance(request, PoolingChatRequest): ( diff --git a/vllm/model_executor/models/prithvi_geospatial_mae.py b/vllm/model_executor/models/prithvi_geospatial_mae.py index 0f00fd47fe4..304a9e987ee 100644 --- a/vllm/model_executor/models/prithvi_geospatial_mae.py +++ b/vllm/model_executor/models/prithvi_geospatial_mae.py @@ -103,7 +103,10 @@ def apply( mm_kwargs = {} for k, v in mm_data.items(): - mm_kwargs[k] = v + if isinstance(v, dict) and k == "image": + mm_kwargs.update(v) + else: + mm_kwargs[k] = v mm_placeholders = {"image": 
[PlaceholderRange(offset=0, length=0)]} # This model receives in input a multi-dimensional tensor representing From 94d9349134ddc1173ffcfc1c3a3471bca3927d1a Mon Sep 17 00:00:00 2001 From: Kebe Date: Fri, 25 Jul 2025 22:33:56 +0800 Subject: [PATCH 368/552] [CI] Unifying Dockerfiles for ARM and X86 Builds (#21343) Signed-off-by: Kebe Signed-off-by: x22x22 --- .github/workflows/lint-and-deploy.yaml | 2 +- docker/Dockerfile.arm | 62 ------------------- docker/Dockerfile.cpu | 24 ++++++- .../installation/cpu/arm.inc.md | 2 +- requirements/cpu.txt | 6 +- 5 files changed, 29 insertions(+), 67 deletions(-) delete mode 100644 docker/Dockerfile.arm diff --git a/.github/workflows/lint-and-deploy.yaml b/.github/workflows/lint-and-deploy.yaml index 74a7a3a3530..d5736c0aee2 100644 --- a/.github/workflows/lint-and-deploy.yaml +++ b/.github/workflows/lint-and-deploy.yaml @@ -7,7 +7,7 @@ permissions: jobs: lint-and-deploy: - runs-on: ubuntu-latest + runs-on: ubuntu-24.04-arm steps: - name: Checkout uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 diff --git a/docker/Dockerfile.arm b/docker/Dockerfile.arm deleted file mode 100644 index bad09368423..00000000000 --- a/docker/Dockerfile.arm +++ /dev/null @@ -1,62 +0,0 @@ -# This vLLM Dockerfile is used to construct an image that can build and run vLLM on ARM CPU platform. - -FROM ubuntu:22.04 AS cpu-test-arm - -ENV CCACHE_DIR=/root/.cache/ccache - -ENV CMAKE_CXX_COMPILER_LAUNCHER=ccache - -RUN --mount=type=cache,target=/var/cache/apt \ - apt-get update -y \ - && apt-get install -y curl ccache git wget vim numactl gcc-12 g++-12 python3 python3-pip libtcmalloc-minimal4 libnuma-dev \ - && apt-get install -y ffmpeg libsm6 libxext6 libgl1 \ - && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 - -# tcmalloc provides better memory allocation efficiency, e.g., holding memory in caches to speed up access of commonly-used objects. -RUN --mount=type=cache,target=/root/.cache/pip \ - pip install py-cpuinfo # Use this to gather CPU info and optimize based on ARM Neoverse cores - -# Set LD_PRELOAD for tcmalloc on ARM -ENV LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4" - -RUN echo 'ulimit -c 0' >> ~/.bashrc - -WORKDIR /workspace - -ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" -ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL} -RUN --mount=type=cache,target=/root/.cache/pip \ - --mount=type=bind,src=requirements/build.txt,target=requirements/build.txt \ - pip install --upgrade pip && \ - pip install -r requirements/build.txt - -FROM cpu-test-arm AS build - -WORKDIR /workspace/vllm - -RUN --mount=type=cache,target=/root/.cache/pip \ - --mount=type=bind,src=requirements/common.txt,target=requirements/common.txt \ - --mount=type=bind,src=requirements/cpu.txt,target=requirements/cpu.txt \ - pip install -v -r requirements/cpu.txt - -COPY . . 
-ARG GIT_REPO_CHECK=0 -RUN --mount=type=bind,source=.git,target=.git \ - if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi - -# Disabling AVX512 specific optimizations for ARM -ARG VLLM_CPU_DISABLE_AVX512="true" -ENV VLLM_CPU_DISABLE_AVX512=${VLLM_CPU_DISABLE_AVX512} - -RUN --mount=type=cache,target=/root/.cache/pip \ - --mount=type=cache,target=/root/.cache/ccache \ - --mount=type=bind,source=.git,target=.git \ - VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel && \ - pip install dist/*.whl && \ - rm -rf dist - -WORKDIR /workspace/ - -RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks - -ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"] \ No newline at end of file diff --git a/docker/Dockerfile.cpu b/docker/Dockerfile.cpu index 982c1ddf274..5e49e87131e 100644 --- a/docker/Dockerfile.cpu +++ b/docker/Dockerfile.cpu @@ -1,4 +1,11 @@ -# This vLLM Dockerfile is used to construct image that can build and run vLLM on x86 CPU platform. +# This vLLM Dockerfile is used to build images that can run vLLM on both x86_64 and arm64 CPU platforms. +# +# Supported platforms: +# - linux/amd64 (x86_64) +# - linux/arm64 (aarch64) +# +# Use the `--platform` option with `docker buildx build` to specify the target architecture, e.g.: +# docker buildx build --platform=linux/arm64 -f docker/Dockerfile.cpu . # # Build targets: # vllm-openai (default): used for serving deployment @@ -53,7 +60,20 @@ RUN --mount=type=cache,target=/root/.cache/uv \ uv pip install --upgrade pip && \ uv pip install -r requirements/cpu.txt -ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so:$LD_PRELOAD" +ARG TARGETARCH +ENV TARGETARCH=${TARGETARCH} + +RUN if [ "$TARGETARCH" = "arm64" ]; then \ + PRELOAD_PATH="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4"; \ + else \ + PRELOAD_PATH="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so"; \ + fi && \ + echo "export LD_PRELOAD=$PRELOAD_PATH" >> ~/.bashrc + +# Ensure that the LD_PRELOAD environment variable for export is in effect. +SHELL ["/bin/bash", "-c"] + +ENV LD_PRELOAD=${LD_PRELOAD} RUN echo 'ulimit -c 0' >> ~/.bashrc diff --git a/docs/getting_started/installation/cpu/arm.inc.md b/docs/getting_started/installation/cpu/arm.inc.md index 63ae351b395..cac578eefb1 100644 --- a/docs/getting_started/installation/cpu/arm.inc.md +++ b/docs/getting_started/installation/cpu/arm.inc.md @@ -33,7 +33,7 @@ Testing has been conducted on AWS Graviton3 instances for compatibility. # --8<-- [end:pre-built-images] # --8<-- [start:build-image-from-source] ```bash -docker build -f docker/Dockerfile.arm \ +docker build -f docker/Dockerfile.cpu \ --tag vllm-cpu-env . 
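# Note (assumption-free restatement of the Dockerfile.cpu header above): for
# cross-architecture builds the same unified Dockerfile can be driven through
# buildx, e.g. `docker buildx build --platform=linux/arm64 -f docker/Dockerfile.cpu .`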
# Launching OpenAI server diff --git a/requirements/cpu.txt b/requirements/cpu.txt index d80354342bc..6860275acab 100644 --- a/requirements/cpu.txt +++ b/requirements/cpu.txt @@ -10,7 +10,8 @@ setuptools>=77.0.3,<80.0.0 --extra-index-url https://download.pytorch.org/whl/cpu torch==2.6.0+cpu; platform_machine == "x86_64" # torch>2.6.0+cpu has performance regression on x86 platform, see https://github.com/pytorch/pytorch/pull/151218 torch==2.7.0; platform_system == "Darwin" -torch==2.7.0; platform_machine == "ppc64le" or platform_machine == "aarch64" +torch==2.7.0; platform_machine == "ppc64le" +torch==2.6.0; platform_machine == "aarch64" # for arm64 CPUs, torch 2.7.0 has a issue: https://github.com/vllm-project/vllm/issues/17960 # required for the image processor of minicpm-o-2_6, this must be updated alongside torch torchaudio; platform_machine != "ppc64le" and platform_machine != "s390x" @@ -25,3 +26,6 @@ datasets # for benchmark scripts intel-openmp==2024.2.1; platform_machine == "x86_64" intel_extension_for_pytorch==2.6.0; platform_machine == "x86_64" # torch>2.6.0+cpu has performance regression on x86 platform, see https://github.com/pytorch/pytorch/pull/151218 triton==3.2.0; platform_machine == "x86_64" # Triton is required for torch 2.6+cpu, as it is imported in torch.compile. + +# Use this to gather CPU info and optimize based on ARM Neoverse cores +py-cpuinfo; platform_machine == "aarch64" From b084cd2a0e20898d388707638038b808a6928824 Mon Sep 17 00:00:00 2001 From: Wenhua Cheng Date: Fri, 25 Jul 2025 23:52:42 +0800 Subject: [PATCH 369/552] [Docs] add auto-round quantization readme (#21600) Signed-off-by: Wenhua Cheng Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- docs/features/quantization/README.md | 1 + docs/features/quantization/auto_round.md | 103 +++++++++++++++++++++++ 2 files changed, 104 insertions(+) create mode 100644 docs/features/quantization/auto_round.md diff --git a/docs/features/quantization/README.md b/docs/features/quantization/README.md index e8c3b112307..e18c128f30f 100644 --- a/docs/features/quantization/README.md +++ b/docs/features/quantization/README.md @@ -6,6 +6,7 @@ Contents: - [Supported Hardware](supported_hardware.md) - [AutoAWQ](auto_awq.md) +- [AutoRound](auto_round.md) - [BitsAndBytes](bnb.md) - [BitBLAS](bitblas.md) - [GGUF](gguf.md) diff --git a/docs/features/quantization/auto_round.md b/docs/features/quantization/auto_round.md new file mode 100644 index 00000000000..2dfd847bb7d --- /dev/null +++ b/docs/features/quantization/auto_round.md @@ -0,0 +1,103 @@ +# AutoRound + +[AutoRound](https://github.com/intel/auto-round) is Intel’s advanced quantization algorithm designed to produce highly efficient **INT2, INT3, INT4, and INT8** +quantized large language models—striking an optimal balance between accuracy and deployment performance. + +AutoRound applies weight-only quantization to transformer-based models, enabling significant memory savings and faster +inference while maintaining near-original accuracy. It supports a wide range of hardware platforms, including **CPUs, +Intel GPUs, HPUs, and CUDA-enabled devices**. + +Please refer to the [AutoRound guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md) for more details. 
+ +Key Features: + +✅ **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** are supported + +✅ **10+ vision-language models (VLMs)** are supported + +✅ **Per-layer mixed-bit quantization** for fine-grained control + +✅ **RTN (Round-To-Nearest) mode** for quick quantization with slight accuracy loss + +✅ **Multiple quantization recipes**: best, base, and light + +✅ Advanced utilities such as immediate packing and support for **10+ backends** + +## Installation + +```bash +uv pip install auto-round +``` + +## Quantizing a model + +For VLMs, please change to `auto-round-mllm` in CLI usage and `AutoRoundMLLM` in API usage. + +### CLI usage + +```bash +auto-round \ + --model Qwen/Qwen3-0.6B \ + --bits 4 \ + --group_size 128 \ + --format "auto_round" \ + --output_dir ./tmp_autoround +``` + +```bash +auto-round \ + --model Qwen/Qwen3-0.6B \ + --format "gguf:q4_k_m" \ + --output_dir ./tmp_autoround +``` + +### API usage + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +from auto_round import AutoRound + +model_name = "Qwen/Qwen3-0.6B" +model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto") +tokenizer = AutoTokenizer.from_pretrained(model_name) + +bits, group_size, sym = 4, 128, True +autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym) + +# the best accuracy, 4-5X slower, low_gpu_mem_usage could save ~20G but ~30% slower +# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym) + +# 2-3X speedup, slight accuracy drop at W4G128 +# autoround = AutoRound(model, tokenizer, nsamples=128, iters=50, lr=5e-3, bits=bits, group_size=group_size, sym=sym ) + +output_dir = "./tmp_autoround" +# format= 'auto_round'(default), 'auto_gptq', 'auto_awq' +autoround.quantize_and_save(output_dir, format="auto_round") +``` + +## Running a quantized model with vLLM + +Here is some example code to run auto-round format in vLLM: + +```python +from vllm import LLM, SamplingParams + +prompts = [ + "Hello, my name is", +] +sampling_params = SamplingParams(temperature=0.6, top_p=0.95) +model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound" +llm = LLM(model=model_name) + +outputs = llm.generate(prompts, sampling_params) + +for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") +``` + +# Acknowledgement + +Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and +ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound. From 4bcf8bd3a27dccc84a5aa5ea2027c786d83229fa Mon Sep 17 00:00:00 2001 From: QiliangCui Date: Fri, 25 Jul 2025 13:22:01 -0700 Subject: [PATCH 370/552] [TPU][Test] Rollback PR-21550. 
(#21619) Signed-off-by: Qiliang Cui Signed-off-by: x22x22 --- tests/v1/tpu/test_basic.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/v1/tpu/test_basic.py b/tests/v1/tpu/test_basic.py index dd89059ded5..865b58bc7f4 100644 --- a/tests/v1/tpu/test_basic.py +++ b/tests/v1/tpu/test_basic.py @@ -59,7 +59,7 @@ def test_basic( # actually test chunked prompt max_num_batched_tokens=1024, max_model_len=8192, - gpu_memory_utilization=0.95, + gpu_memory_utilization=0.7, max_num_seqs=max_num_seqs, tensor_parallel_size=tensor_parallel_size) as vllm_model: vllm_outputs = vllm_model.generate_greedy(example_prompts, From a2b51daeccf37a5f5848349730af121e49621032 Mon Sep 17 00:00:00 2001 From: Daniel Han Date: Fri, 25 Jul 2025 17:06:48 -0700 Subject: [PATCH 371/552] Add Unsloth to RLHF.md (#21636) Signed-off-by: x22x22 --- docs/training/rlhf.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/training/rlhf.md b/docs/training/rlhf.md index 4f75e4e0149..f608a630ab7 100644 --- a/docs/training/rlhf.md +++ b/docs/training/rlhf.md @@ -2,10 +2,14 @@ Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors. -vLLM can be used to generate the completions for RLHF. The best way to do this is with libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [verl](https://github.com/volcengine/verl). +vLLM can be used to generate the completions for RLHF. Some ways to do this include using libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [verl](https://github.com/volcengine/verl) and [unsloth](https://github.com/unslothai/unsloth). 
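For orientation, a minimal sketch of the generation side only (prompts in, sampled completions out) is shown below; weight syncing, reward scoring, and colocation are what the libraries above and the examples below add on top. The model name and sampling settings are illustrative placeholders, not taken from any of the linked recipes:

```python
from vllm import LLM, SamplingParams

# Placeholder policy checkpoint; in RLHF this would be the current policy weights.
llm = LLM(model="facebook/opt-125m")

# Sample several candidate completions per prompt for downstream reward scoring.
sampling_params = SamplingParams(n=4, temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain why the sky is blue.",
    "Write a haiku about gradient descent.",
]
outputs = llm.generate(prompts, sampling_params)

for request_output in outputs:
    for completion in request_output.outputs:
        # Each candidate would next be scored by a reward model during RLHF.
        print(request_output.prompt, "->", completion.text)
```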
See the following basic examples to get started if you don't want to use an existing library: - [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md) - [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md) - [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md) + +See the following notebooks showing how to use vLLM for GRPO: + +- [Qwen-3 4B GRPO using Unsloth + vLLM](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb) From 94ca46e22c52eff4879fd1608d2192308c648da6 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Fri, 25 Jul 2025 20:07:07 -0400 Subject: [PATCH 372/552] [Perf] Cuda Kernel for Int8 Per Token Group Quant (#21476) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- csrc/ops.h | 5 +++++ .../compressed_tensors/int8_quant_kernels.cu | 10 ++++++++++ csrc/quantization/fp8/per_token_group_quant.cu | 6 +++++- csrc/quantization/per_token_group_quant_8bit.h | 10 ++++++++++ csrc/torch_bindings.cpp | 8 ++++++++ .../layers/quantization/utils/int8_utils.py | 11 +++++++++-- 6 files changed, 47 insertions(+), 3 deletions(-) create mode 100644 csrc/quantization/per_token_group_quant_8bit.h diff --git a/csrc/ops.h b/csrc/ops.h index 97a247d9d62..207291eceb1 100644 --- a/csrc/ops.h +++ b/csrc/ops.h @@ -292,6 +292,11 @@ void per_token_group_quant_fp8(const torch::Tensor& input, torch::Tensor& output_q, torch::Tensor& output_s, int64_t group_size, double eps, double fp8_min, double fp8_max, bool scale_ue8m0); + +void per_token_group_quant_int8(const torch::Tensor& input, + torch::Tensor& output_q, + torch::Tensor& output_s, int64_t group_size, + double eps, double int8_min, double int8_max); #endif void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, diff --git a/csrc/quantization/compressed_tensors/int8_quant_kernels.cu b/csrc/quantization/compressed_tensors/int8_quant_kernels.cu index 5cd2ac17976..6a81f159f46 100644 --- a/csrc/quantization/compressed_tensors/int8_quant_kernels.cu +++ b/csrc/quantization/compressed_tensors/int8_quant_kernels.cu @@ -1,6 +1,8 @@ #include #include +#include "../per_token_group_quant_8bit.h" + #include #include "../../dispatch_utils.h" @@ -336,3 +338,11 @@ void dynamic_scaled_int8_quant( } }); } + +void per_token_group_quant_int8(const torch::Tensor& input, + torch::Tensor& output_q, + torch::Tensor& output_s, int64_t group_size, + double eps, double int8_min, double int8_max) { + per_token_group_quant_8bit(input, output_q, output_s, group_size, eps, + int8_min, int8_max); +} \ No newline at end of file diff --git a/csrc/quantization/fp8/per_token_group_quant.cu b/csrc/quantization/fp8/per_token_group_quant.cu index afc41faeca9..2609054f207 100644 --- a/csrc/quantization/fp8/per_token_group_quant.cu +++ b/csrc/quantization/fp8/per_token_group_quant.cu @@ -1,6 +1,8 @@ #include #include +#include "../per_token_group_quant_8bit.h" + #include #include @@ -120,7 +122,7 @@ void per_token_group_quant_8bit(const torch::Tensor& input, torch::Tensor& output_q, torch::Tensor& output_s, int64_t group_size, double eps, double min_8bit, double max_8bit, - bool scale_ue8m0 = false) { + bool scale_ue8m0) { TORCH_CHECK(input.is_contiguous()); TORCH_CHECK(output_q.is_contiguous()); @@ -198,6 +200,8 @@ void per_token_group_quant_8bit(const torch::Tensor& input, input.scalar_type(), 
"per_token_group_quant_8bit", ([&] { if (dst_type == at::ScalarType::Float8_e4m3fn) { LAUNCH_KERNEL(scalar_t, c10::Float8_e4m3fn); + } else if (dst_type == at::ScalarType::Char) { + LAUNCH_KERNEL(scalar_t, int8_t); } })); diff --git a/csrc/quantization/per_token_group_quant_8bit.h b/csrc/quantization/per_token_group_quant_8bit.h new file mode 100644 index 00000000000..537b61bc430 --- /dev/null +++ b/csrc/quantization/per_token_group_quant_8bit.h @@ -0,0 +1,10 @@ +#pragma once +#include + +// TODO(wentao): refactor the folder to 8bit, then includes fp8 and int8 folders +// 8-bit per-token-group quantization helper used by both FP8 and INT8 +void per_token_group_quant_8bit(const torch::Tensor& input, + torch::Tensor& output_q, + torch::Tensor& output_s, int64_t group_size, + double eps, double min_8bit, double max_8bit, + bool scale_ue8m0 = false); \ No newline at end of file diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp index 95f8541bc9e..85b6abef00b 100644 --- a/csrc/torch_bindings.cpp +++ b/csrc/torch_bindings.cpp @@ -624,6 +624,14 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { ops.impl("per_token_group_fp8_quant", torch::kCUDA, &per_token_group_quant_fp8); + // Compute per-token-group INT8 quantized tensor and scaling factor. + ops.def( + "per_token_group_quant_int8(Tensor input, Tensor! output_q, Tensor! " + "output_s, int group_size, float eps, float int8_min, float int8_max) -> " + "()"); + ops.impl("per_token_group_quant_int8", torch::kCUDA, + &per_token_group_quant_int8); + // reorder weight for AllSpark Ampere W8A16 Fused Gemm kernel ops.def( "rearrange_kn_weight_as_n32k16_order(Tensor b_qweight, Tensor b_scales, " diff --git a/vllm/model_executor/layers/quantization/utils/int8_utils.py b/vllm/model_executor/layers/quantization/utils/int8_utils.py index 1fdf7d174e2..6840cabbf1a 100644 --- a/vllm/model_executor/layers/quantization/utils/int8_utils.py +++ b/vllm/model_executor/layers/quantization/utils/int8_utils.py @@ -238,13 +238,20 @@ def per_token_group_quant_int8( int8_min = iinfo.min x_q = torch.empty_like(x, device=x.device, dtype=dtype) - M = x.numel() // group_size - N = group_size x_s = torch.empty( x.shape[:-1] + (x.shape[-1] // group_size, ), device=x.device, dtype=torch.float32, ) + # prefer CUDA kernel if available + if current_platform.is_cuda(): + torch.ops._C.per_token_group_quant_int8(x, x_q, x_s, group_size, eps, + float(int8_min), + float(int8_max)) + return x_q, x_s + + M = x.numel() // group_size + N = group_size BLOCK = triton.next_power_of_2(N) # heuristics for number of warps From 9ddd8f6737fcfd5e97862e7e3c7e8fd0eb08d2b5 Mon Sep 17 00:00:00 2001 From: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com> Date: Fri, 25 Jul 2025 17:07:26 -0700 Subject: [PATCH 373/552] Add interleaved RoPE test for Llama4 (Maverick) (#21478) Signed-off-by: Yong Hoon Shin Signed-off-by: x22x22 --- .../multimodal/generation/test_maverick.py | 94 +++++++++++++++---- 1 file changed, 74 insertions(+), 20 deletions(-) diff --git a/tests/models/multimodal/generation/test_maverick.py b/tests/models/multimodal/generation/test_maverick.py index 306cf39002d..bacc9ef94f4 100644 --- a/tests/models/multimodal/generation/test_maverick.py +++ b/tests/models/multimodal/generation/test_maverick.py @@ -22,6 +22,9 @@ GenerationConfig) from vllm import LLM, SamplingParams +from vllm.v1.executor.abstract import Executor +from vllm.v1.kv_cache_interface import (ChunkedLocalAttentionSpec, + FullAttentionSpec) from ....utils import multi_gpu_test @@ -69,6 +72,26 @@ def 
run_maverick_serving(model: str): raise +def get_rope_layers_config(model_path: str) -> list[int]: + """ + Get the interleaved RoPE configuration from HuggingFace config + + Args: + model_path: Path to the local directory containing the reduced + Maverick model checkpoint + + Returns: + List of 0 or 1 indicating whether each layer uses RoPE and local attn + 0 indicates that RoPE is not used while 1 indicates that RoPE is used. + """ + config_path = Path(model_path) / "config.json" + model_config = json.loads(config_path.read_text()) + text_config = model_config["text_config"] + no_rope_layers = text_config["no_rope_layers"] + print(f"Found no_rope_layers: {no_rope_layers}") + return no_rope_layers + + def create_reduced_maverick_model( original_model_name: str = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", @@ -113,7 +136,6 @@ def create_reduced_maverick_model( print("Loading original model configuration...") original_config = AutoConfig.from_pretrained(original_model_name, trust_remote_code=True) - print("Creating reduced configuration...") reduced_config = create_reduced_config(original_config, text_layers, num_experts, vision_layers) @@ -510,21 +532,32 @@ def save_weights_to_safetensors(weights: dict[str, torch.Tensor], f"{index_data['metadata']['total_size'] / (1024**3):.2f} GB") -def run_reduced_model(model_path: str, - should_profile: bool = False, - **kwargs) -> None: - """Test the created reduced model with vLLM.""" - - print(f"\nTesting reduced model at {model_path}...") - - llm = LLM( - model=model_path, - trust_remote_code=True, - max_model_len=512, # Small context for testing - gpu_memory_utilization=0.3, # Conservative memory usage - **kwargs, +def check_attention_spec_interleaved_rope( + llm: LLM, + num_attention_layers: int, + num_ranks: int, + rope_layers: list[int], +): + """Check that the attention spec is correct.""" + assert isinstance(llm.llm_engine.model_executor, Executor) + kv_cache_specs_per_rank = llm.llm_engine.model_executor.get_kv_cache_specs( ) - + for rank in range(num_ranks): + kv_cache_specs = kv_cache_specs_per_rank[rank] + assert len(kv_cache_specs.keys()) == num_attention_layers + for i in range(num_attention_layers): + if rope_layers[i] == 0: + expected_spec = FullAttentionSpec + else: + expected_spec = ChunkedLocalAttentionSpec + assert isinstance( + kv_cache_specs[ + f"language_model.model.layers.{i}.self_attn.attn"], + expected_spec) + + +def run_reduced_model(llm: LLM, should_profile: bool = False) -> None: + """Test the created reduced model with vLLM.""" sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50) @@ -551,6 +584,7 @@ def run_reduced_model(model_path: str, @pytest.mark.parametrize("tp,ep", [(2, True)]) @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available") def test_dummy_maverick( + monkeypatch, original_model_name: str, text_layers: int, num_experts: int, @@ -562,6 +596,10 @@ def test_dummy_maverick( force_recreate: bool = True, profile: bool = False, ) -> None: + # Disable multiprocessing allows us to access model executor from LLM engine + monkeypatch.setenv("VLLM_USE_V1", "1") + monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0") + model_path = create_reduced_maverick_model( original_model_name=original_model_name, output_dir=output_dir, @@ -573,11 +611,27 @@ def test_dummy_maverick( print(f"\nReduced model created successfully at: {model_path}") - run_reduced_model(model_path=model_path, - should_profile=profile, - enforce_eager=enforce_eager, - tensor_parallel_size=tp, - 
enable_expert_parallel=ep) + rope_layers = get_rope_layers_config(model_path) + + llm = LLM( + model=model_path, + trust_remote_code=True, + max_model_len=512, # Small context for testing + gpu_memory_utilization=0.3, # Conservative memory usage + enforce_eager=enforce_eager, + tensor_parallel_size=tp, + enable_expert_parallel=ep, + ) + + check_attention_spec_interleaved_rope( + llm, + text_layers, + tp, + rope_layers, + ) + + print(f"\nTesting reduced model at {model_path}...") + run_reduced_model(llm=llm, should_profile=profile) def main(): From 87a9c1b27cbffee9b575725d5ac36037580923ff Mon Sep 17 00:00:00 2001 From: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Date: Fri, 25 Jul 2025 17:07:58 -0700 Subject: [PATCH 374/552] [Bugfix] Fix sync_and_slice_intermediate_tensors (#21537) Signed-off-by: Rui Qiao Signed-off-by: x22x22 --- vllm/v1/worker/gpu_model_runner.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 5fe594db667..6ddb2c422df 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -1270,7 +1270,7 @@ def sync_and_slice_intermediate_tensors( if sync_self: assert intermediate_tensors is not None for k, v in intermediate_tensors.items(): - is_scattered = "residual" and is_residual_scattered + is_scattered = k == "residual" and is_residual_scattered copy_len = num_tokens // tp if is_scattered else \ num_tokens self.intermediate_tensors[k][:copy_len].copy_( From 04daaa7e6027caef4db21210c5e16c2c732ff728 Mon Sep 17 00:00:00 2001 From: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Date: Fri, 25 Jul 2025 17:08:30 -0700 Subject: [PATCH 375/552] [Bugfix] Always set RAY_ADDRESS for Ray actor before spawn (#21540) Signed-off-by: Rui Qiao Signed-off-by: x22x22 --- vllm/utils/__init__.py | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/vllm/utils/__init__.py b/vllm/utils/__init__.py index 9f4140ac64e..054037b8932 100644 --- a/vllm/utils/__init__.py +++ b/vllm/utils/__init__.py @@ -2883,26 +2883,27 @@ def _maybe_force_spawn(): if os.environ.get("VLLM_WORKER_MULTIPROC_METHOD") == "spawn": return - reason = None - if cuda_is_initialized(): - reason = "CUDA is initialized" - elif xpu_is_initialized(): - reason = "XPU is initialized" - elif is_in_ray_actor(): + reasons = [] + if is_in_ray_actor(): # even if we choose to spawn, we need to pass the ray address # to the subprocess so that it knows how to connect to the ray cluster. # env vars are inherited by subprocesses, even if we use spawn. import ray os.environ["RAY_ADDRESS"] = ray.get_runtime_context().gcs_address - reason = "In a Ray actor and can only be spawned" + reasons.append("In a Ray actor and can only be spawned") + + if cuda_is_initialized(): + reasons.append("CUDA is initialized") + elif xpu_is_initialized(): + reasons.append("XPU is initialized") - if reason is not None: + if reasons: logger.warning( "We must use the `spawn` multiprocessing start method. " "Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. " "See https://docs.vllm.ai/en/latest/usage/" "troubleshooting.html#python-multiprocessing " - "for more information. Reason: %s", reason) + "for more information. 
Reasons: %s", "; ".join(reasons)) os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" From 94c289e5535b7e6fbf3bf1870decfded1a7c532f Mon Sep 17 00:00:00 2001 From: Chengji Yao Date: Fri, 25 Jul 2025 17:09:00 -0700 Subject: [PATCH 376/552] [TPU] Update ptxla nightly version to 20250724 (#21555) Signed-off-by: Chengji Yao Signed-off-by: x22x22 --- docker/Dockerfile.tpu | 2 +- requirements/tpu.txt | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docker/Dockerfile.tpu b/docker/Dockerfile.tpu index 3474ff50de7..b9fc9def881 100644 --- a/docker/Dockerfile.tpu +++ b/docker/Dockerfile.tpu @@ -1,4 +1,4 @@ -ARG NIGHTLY_DATE="20250714" +ARG NIGHTLY_DATE="20250724" ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.12_tpuvm_$NIGHTLY_DATE" FROM $BASE_IMAGE diff --git a/requirements/tpu.txt b/requirements/tpu.txt index d86f643d388..2d0d8bd8457 100644 --- a/requirements/tpu.txt +++ b/requirements/tpu.txt @@ -19,8 +19,8 @@ nixl==0.3.0 --find-links https://storage.googleapis.com/libtpu-releases/index.html --find-links https://storage.googleapis.com/jax-releases/jax_nightly_releases.html --find-links https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html -torch==2.9.0.dev20250716 -torchvision==0.24.0.dev20250716 -torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250716-cp311-cp311-linux_x86_64.whl ; python_version == "3.11" -torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250716-cp312-cp312-linux_x86_64.whl ; python_version == "3.12" +torch==2.9.0.dev20250724 +torchvision==0.24.0.dev20250724 +torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250724-cp311-cp311-linux_x86_64.whl ; python_version == "3.11" +torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0.dev20250724-cp312-cp312-linux_x86_64.whl ; python_version == "3.12" From 306d170bf849586994982aa16a70c78bf350685b Mon Sep 17 00:00:00 2001 From: Alex Kogan <82225080+sakogan@users.noreply.github.com> Date: Fri, 25 Jul 2025 21:09:34 -0400 Subject: [PATCH 377/552] [Feature] Add support for MoE models in the calibration-free RTN-based quantization (#20766) Signed-off-by: Alex Kogan Signed-off-by: x22x22 --- tests/quantization/test_rtn.py | 5 +- .../model_executor/layers/quantization/rtn.py | 234 +++++++++++++++--- 2 files changed, 201 insertions(+), 38 deletions(-) diff --git a/tests/quantization/test_rtn.py b/tests/quantization/test_rtn.py index 133b2d9e4df..bc2b468f97d 100644 --- a/tests/quantization/test_rtn.py +++ b/tests/quantization/test_rtn.py @@ -8,7 +8,10 @@ from tests.quantization.utils import is_quant_method_supported -MODELS = ["microsoft/Phi-3-mini-4k-instruct"] +MODELS = [ + "microsoft/Phi-3-mini-4k-instruct", # dense model + "ai21labs/Jamba-tiny-dev", # MoE model +] @pytest.mark.skipif(not is_quant_method_supported("rtn"), diff --git a/vllm/model_executor/layers/quantization/rtn.py b/vllm/model_executor/layers/quantization/rtn.py index 68309716cf9..cceaf9857c4 100644 --- a/vllm/model_executor/layers/quantization/rtn.py +++ b/vllm/model_executor/layers/quantization/rtn.py @@ -3,18 +3,19 @@ # Copyright © 2025, Oracle and/or its affiliates. 
import os -from typing import Any, Optional +from typing import Any, Callable, Optional import torch import torch.nn.functional as F from torch.nn.parameter import Parameter from vllm.logger import init_logger +from vllm.model_executor.layers.fused_moe import FusedMoE, FusedMoEMethodBase from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase, set_weight_attrs) from vllm.model_executor.layers.quantization import QuantizationMethods from vllm.model_executor.layers.quantization.base_config import ( - QuantizationConfig) + QuantizationConfig, QuantizeMethodBase) logger = init_logger(__name__) """By default, use 8 bit as target precision, but it can be @@ -71,9 +72,11 @@ def from_config(cls, config: dict[str, Any]) -> "RTNConfig": return cls(weight_bits, group_size) def get_quant_method(self, layer: torch.nn.Module, - prefix: str) -> Optional["RTNLinearMethod"]: + prefix: str) -> Optional["QuantizeMethodBase"]: if isinstance(layer, LinearBase): return RTNLinearMethod(self) + elif isinstance(layer, FusedMoE): + return RTNMoEMethod(self) return None @@ -94,11 +97,18 @@ def narrow(self, dim, start, length): self.data.narrow(dim, start // factor, length // factor), self.scale.narrow(dim, start, length), self.quant_config) + def __getitem__(self, key): + return RTNTensor(self.data[key], self.scale[key], self.quant_config) + @property def shape(self): shape = self.data.shape factor = 1 if self.quant_config.weight_bits == 8 else 2 - return torch.Size((shape[0] * factor, shape[1])) + batch_present = len(shape) == 3 + if batch_present: + return torch.Size((shape[0], shape[1] * factor, shape[2])) + else: + return torch.Size((shape[0] * factor, shape[1])) def copy_(self, loaded_weight: torch.Tensor) -> None: qweight, weight_scale = rtn_quantize(loaded_weight.cuda(), @@ -165,7 +175,7 @@ def create_weights( weight = RTNParameter(data=torch.empty(output_size_per_partition // factor, input_size_per_partition, - dtype=torch.int8), + dtype=torch.uint8), scale=scale, quant_config=self.quant_config) @@ -180,18 +190,7 @@ def create_weights( layer.output_size_per_partition = output_size_per_partition def process_weights_after_loading(self, layer: torch.nn.Module) -> None: - """torch.compile does not know how to deal with a Parameter subclass - (aka RTNParameter). As we don't really need RTNParameters for the - forward pass, we replace them with equivalent instances of Parameters. 
- """ - old_weight = layer.weight - assert isinstance(old_weight, RTNParameter) - data = old_weight.data.data - - delattr(layer, "weight") - - new_weight = Parameter(data=data, requires_grad=False) - layer.register_parameter("weight", new_weight) + fix_weights(layer, "weight") def apply(self, layer: torch.nn.Module, @@ -209,6 +208,128 @@ def apply(self, return out +class RTNMoEMethod(FusedMoEMethodBase): + + def __init__(self, quant_config: RTNConfig): + self.quant_config = quant_config + + def create_weights(self, layer: torch.nn.Module, num_experts: int, + hidden_size: int, intermediate_size_per_partition: int, + params_dtype: torch.dtype, **extra_weight_attrs): + + factor = 1 if self.quant_config.weight_bits == 8 else 2 + + # Fused gate_up_proj (column parallel) + num_groups_per_col = (hidden_size // self.quant_config.group_size + if self.quant_config.group_size != -1 else 1) + w13_scale = Parameter( + torch.empty(num_experts, + 2 * intermediate_size_per_partition, + num_groups_per_col, + dtype=params_dtype), + requires_grad=False, + ) + layer.register_parameter("w13_scale", w13_scale) + + w13_weight = RTNParameter(data=torch.empty( + num_experts, + 2 * intermediate_size_per_partition // factor, + hidden_size, + dtype=torch.uint8), + scale=w13_scale, + quant_config=self.quant_config) + layer.register_parameter("w13_weight", w13_weight) + set_weight_attrs(w13_weight, extra_weight_attrs) + + # down_proj (row parallel) + num_groups_per_col = (intermediate_size_per_partition // + self.quant_config.group_size + if self.quant_config.group_size != -1 else 1) + w2_scale = Parameter(torch.zeros(num_experts, + hidden_size, + num_groups_per_col, + dtype=params_dtype), + requires_grad=False) + layer.register_parameter("w2_scale", w2_scale) + + w2_weight = RTNParameter(data=torch.empty( + num_experts, + hidden_size // factor, + intermediate_size_per_partition, + dtype=torch.uint8), + scale=w2_scale, + quant_config=self.quant_config) + layer.register_parameter("w2_weight", w2_weight) + set_weight_attrs(w2_weight, extra_weight_attrs) + + def process_weights_after_loading(self, layer: torch.nn.Module) -> None: + weight_bits = self.quant_config.weight_bits + fix_weights(layer, "w13_weight", weight_bits == 4) + fix_weights(layer, "w2_weight", weight_bits == 4) + + def apply( + self, + layer: torch.nn.Module, + x: torch.Tensor, + router_logits: torch.Tensor, + top_k: int, + renormalize: bool, + use_grouped_topk: bool = False, + topk_group: Optional[int] = None, + num_expert_group: Optional[int] = None, + global_num_experts: int = -1, + expert_map: Optional[torch.Tensor] = None, + custom_routing_function: Optional[Callable] = None, + scoring_func: str = "softmax", + e_score_correction_bias: Optional[torch.Tensor] = None, + apply_router_weight_on_input: bool = False, + activation: str = "silu", + enable_eplb: bool = False, + expert_load_view: Optional[torch.Tensor] = None, + logical_to_physical_map: Optional[torch.Tensor] = None, + logical_replica_count: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + if enable_eplb: + raise NotImplementedError( + "EPLB not supported for `RTNMoEMethod` yet.") + + from vllm.model_executor.layers.fused_moe import fused_experts + + topk_weights, topk_ids = FusedMoE.select_experts( + hidden_states=x, + router_logits=router_logits, + use_grouped_topk=use_grouped_topk, + top_k=top_k, + renormalize=renormalize, + topk_group=topk_group, + num_expert_group=num_expert_group, + custom_routing_function=custom_routing_function, + scoring_func=scoring_func, + 
e_score_correction_bias=e_score_correction_bias) + + weight_bits = self.quant_config.weight_bits + group_size = self.quant_config.group_size + + ret = fused_experts( + x, + layer.w13_weight, + layer.w2_weight, + topk_weights=topk_weights, + topk_ids=topk_ids, + inplace=True, + activation=activation, + use_int4_w4a16=weight_bits == 4, + use_int8_w8a16=weight_bits == 8, + global_num_experts=global_num_experts, + w1_scale=layer.w13_scale, + w2_scale=layer.w2_scale, + apply_router_weight_on_input=apply_router_weight_on_input, + expert_map=expert_map, + block_shape=[0, group_size]) + + return ret + + def rtn_quantize(tensor: torch.Tensor, num_bits: int, group_size: int) -> tuple[torch.Tensor, torch.Tensor]: """Quantize a tensor using per-group static scaling factor. @@ -221,34 +342,44 @@ def rtn_quantize(tensor: torch.Tensor, num_bits: int, If equal to -1, each row in the input tensor is treated as one group. """ + batch_present = len(tensor.shape) == 3 + if not batch_present: + tensor = tensor.unsqueeze(0) q_range = 2**num_bits - num_groups = (tensor.shape[0] * tensor.shape[1] // - group_size if group_size != -1 else tensor.shape[0]) + num_groups = (tensor.shape[1] * tensor.shape[2] // + group_size if group_size != -1 else tensor.shape[1]) """Calculate a scaling factor per input group. """ - input_flat = tensor.reshape(num_groups, -1) - input_min = torch.min(input_flat, dim=1, keepdim=True)[0] - input_max = torch.max(input_flat, dim=1, keepdim=True)[0] + input_flat = tensor.reshape(tensor.shape[0], num_groups, -1) + input_min = torch.min(input_flat, dim=2, keepdim=True)[0] + input_max = torch.max(input_flat, dim=2, keepdim=True)[0] input_max_abs = torch.max(input_min.abs(), input_max.abs()) scale = (input_max_abs * 2.0 / (q_range - 1)) - """Scale each input group, truncate and round to the nearest integer. + """Scale each input group, round to the nearest integer, shift + the range and truncate. """ scaled_input = input_flat / scale - scaled_input = scaled_input.clamp(-q_range // 2, q_range // 2 - 1) scaled_input = scaled_input.round() + scaled_input += q_range // 2 + scaled_input = scaled_input.clamp(0, q_range - 1) - scale = scale.reshape(tensor.shape[0], -1).contiguous() - inputs_q = scaled_input.reshape(tensor.shape).to(torch.int8) + scale = scale.reshape(tensor.shape[0], tensor.shape[1], -1).contiguous() + inputs_q = scaled_input.reshape(tensor.shape).to(torch.uint8) inputs_q = inputs_q.contiguous() if num_bits == 4: """Pack two 4-bit values into each byte. """ - inputs_q = (inputs_q[:, 1::2] << 4) | (inputs_q[:, ::2] & 0xf) - inputs_q = inputs_q.reshape(tensor.shape[0] // 2, tensor.shape[1]) + inputs_q = (inputs_q[:, :, 1::2] << 4) | (inputs_q[:, :, ::2] & 0xf) + inputs_q = inputs_q.reshape(tensor.shape[0], tensor.shape[1] // 2, + tensor.shape[2]) inputs_q = inputs_q.contiguous() + if not batch_present: + inputs_q = inputs_q.squeeze(0) + scale = scale.squeeze(0) + return inputs_q, scale @@ -259,31 +390,60 @@ def rtn_dequantize(tensor: torch.Tensor, scale: torch.Tensor) -> torch.Tensor: tensor: The input tensor. scale: The tensor with per-group scale factors. 
""" + batch_present = len(tensor.shape) == 3 + if not batch_present: + tensor = tensor.unsqueeze(0) + scale = scale.unsqueeze(0) - num_groups = scale.size(0) * scale.size(1) - input_dim, output_dim = tensor.shape + num_groups = scale.size(1) * scale.size(2) + batch, input_dim, output_dim = tensor.shape - num_bits = 8 if input_dim == scale.size(0) else 4 + num_bits = 8 if input_dim == scale.size(1) else 4 + q_range = 2**num_bits if num_bits == 4: input_dim *= 2 - data = torch.empty((input_dim, output_dim), + data = torch.empty((batch, input_dim, output_dim), dtype=scale.dtype, device=tensor.device) if num_bits == 8: data.copy_(tensor) + data -= q_range // 2 else: """Unpack two 4-bit values from each byte. """ - tensor = tensor.reshape(input_dim, output_dim // 2) + tensor = tensor.reshape(batch, input_dim, output_dim // 2) for i in range(2): - data[:, i::2] = (tensor << 4 * (1 - i)) >> 4 + data[:, :, i::2] = ((tensor << 4 * + (1 - i)) >> 4).to(torch.int8) - q_range // 2 """Scale each input group with its scaling factor. """ - scale = scale.reshape(num_groups, -1) - data = data.reshape(num_groups, -1) + scale = scale.reshape(batch, num_groups, -1) + data = data.reshape(batch, num_groups, -1) data = torch.mul(data, scale) - input_deq = data.reshape((input_dim, output_dim)).contiguous() + input_deq = data.reshape((batch, input_dim, output_dim)).contiguous() + if not batch_present: + input_deq = input_deq.squeeze(0) + return input_deq + + +def fix_weights(layer: torch.nn.Module, + param_name: str, + reshape: bool = False): + """torch.compile does not know how to deal with a Parameter subclass + (aka RTNParameter). As we don't really need RTNParameters for the + forward pass, we replace them with equivalent instances of Parameters. + """ + old_weight = getattr(layer, param_name) + assert isinstance(old_weight, RTNParameter) + data = old_weight.data.data + + delattr(layer, param_name) + + if reshape: + data = data.reshape(old_weight.shape[0], old_weight.shape[1] * 2, -1) + new_weight = Parameter(data=data, requires_grad=False) + layer.register_parameter(param_name, new_weight) From 7ccd355cf3f7475a3c792d5a687a8f49594a60cc Mon Sep 17 00:00:00 2001 From: Farzad Abdolhosseini Date: Sat, 26 Jul 2025 04:12:31 +0300 Subject: [PATCH 378/552] [Model] Ultravox: Support Llama 4 and Gemma 3 backends (#17818) Signed-off-by: Farzad Abdolhosseini Signed-off-by: Patrick Li Co-authored-by: Patrick Li Signed-off-by: x22x22 --- tests/models/registry.py | 2 ++ vllm/model_executor/models/registry.py | 1 + vllm/model_executor/models/ultravox.py | 38 +++++++++++++-------- vllm/transformers_utils/configs/ultravox.py | 22 +++++++----- 4 files changed, 39 insertions(+), 24 deletions(-) diff --git a/tests/models/registry.py b/tests/models/registry.py index 1800262ced6..b41e432d738 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -221,6 +221,8 @@ def check_available_online( "fp8": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8"}), # noqa: E501 "LLaMAForCausalLM": _HfExamplesInfo("decapoda-research/llama-7b-hf", is_available_online=False), + "Llama4ForCausalLM": _HfExamplesInfo("meta-llama/Llama-4-Scout-17B-16E-Instruct", # noqa: E501 + is_available_online=False), "MambaForCausalLM": _HfExamplesInfo("state-spaces/mamba-130m-hf"), "Mamba2ForCausalLM": _HfExamplesInfo("mistralai/Mamba-Codestral-7B-v0.1"), "FalconMambaForCausalLM": _HfExamplesInfo("tiiuae/falcon-mamba-7b-instruct"), # noqa: E501 diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py index 
14a8ac7876f..9b204fdcbe1 100644 --- a/vllm/model_executor/models/registry.py +++ b/vllm/model_executor/models/registry.py @@ -89,6 +89,7 @@ "JAISLMHeadModel": ("jais", "JAISLMHeadModel"), "JambaForCausalLM": ("jamba", "JambaForCausalLM"), "LlamaForCausalLM": ("llama", "LlamaForCausalLM"), + "Llama4ForCausalLM": ("llama4", "Llama4ForCausalLM"), # noqa: E501 # For decapoda-research/llama-* "LLaMAForCausalLM": ("llama", "LlamaForCausalLM"), "MambaForCausalLM": ("mamba", "MambaForCausalLM"), diff --git a/vllm/model_executor/models/ultravox.py b/vllm/model_executor/models/ultravox.py index 3697e3fd0cf..a4569ccd5a8 100644 --- a/vllm/model_executor/models/ultravox.py +++ b/vllm/model_executor/models/ultravox.py @@ -39,9 +39,7 @@ merge_multimodal_embeddings, merge_multimodal_embeddings_from_map) -_AUDIO_PLACEHOLDER_OVERRIDE = "<|reserved_special_token_0|>" -_AUDIO_PLACEHOLDER_TOKEN = 128002 -_AUDIO_TOKENS_PER_SECOND = 6.25 +_AUDIO_PLACEHOLDER_OVERRIDE = "<|audio|>" _MAX_ENCODER_BATCH_SIZE = 16 @@ -80,14 +78,15 @@ def get_hf_processor( sampling_rate: Optional[int] = None, **kwargs: object, ) -> ProcessorMixin: + config = self.ctx.model_config.hf_config hf_processor = self.ctx.get_hf_processor(**kwargs) # NOTE: Ultravox processing definition uses '<|eot_id|>' as the # placeholder that will cause confusion with the actual end of turn - # token, thus we override placeholder with a reserved special - # token. + # token, thus we override placeholder with a reserved token. hf_processor.audio_token_replacement = _AUDIO_PLACEHOLDER_OVERRIDE - hf_processor.audio_replacement_token_id = _AUDIO_PLACEHOLDER_TOKEN + hf_processor.audio_replacement_token_id = config.audio_token_index + return hf_processor def get_feature_extractor( @@ -274,7 +273,7 @@ def __init__(self, config: UltravoxConfig): else: self.act = get_act_fn(config.projector_act) - dim_out = config.text_config.hidden_size + dim_out = config.text_hidden_size self.linear_2 = nn.Linear(dim_mid, dim_out, bias=False) # Ultravox v0.4.1 and below use layer_norm after the second linear layer @@ -572,9 +571,14 @@ def get_input_embeddings( input_ids: torch.Tensor, multimodal_embeddings: Optional[MultiModalEmbeddings] = None, ) -> torch.Tensor: - inputs_embeds = self.language_model.get_input_embeddings(input_ids) - if multimodal_embeddings is not None \ - and len(multimodal_embeddings) != 0: + # The audio token index is not included in the embedding table + # We need to remove it before embedding lookup + safe_input_ids = input_ids.clone() + safe_input_ids[safe_input_ids == self.config.audio_token_index] = 0 + inputs_embeds = self.language_model.get_input_embeddings( + safe_input_ids) + if multimodal_embeddings is not None and len( + multimodal_embeddings) > 0: # TODO(ywang96): remove this block after v0 is deprecated. 
if not envs.VLLM_USE_V1: @@ -585,7 +589,7 @@ def get_input_embeddings( else: inputs_embeds = merge_multimodal_embeddings( input_ids, inputs_embeds, multimodal_embeddings, - _AUDIO_PLACEHOLDER_TOKEN) + self.config.audio_token_index) return inputs_embeds def forward(self, @@ -623,10 +627,14 @@ def forward(self, multimodal_embeddings) input_ids = None - hidden_states = self.language_model.model(input_ids, - positions, - intermediate_tensors, - inputs_embeds=inputs_embeds) + language_model = self.language_model + if hasattr(language_model, "language_model"): + language_model = language_model.language_model + + hidden_states = language_model.model(input_ids, + positions, + intermediate_tensors, + inputs_embeds=inputs_embeds) return hidden_states def compute_logits(self, hidden_states: torch.Tensor, diff --git a/vllm/transformers_utils/configs/ultravox.py b/vllm/transformers_utils/configs/ultravox.py index 62f63b02d49..87064cc12de 100644 --- a/vllm/transformers_utils/configs/ultravox.py +++ b/vllm/transformers_utils/configs/ultravox.py @@ -45,6 +45,7 @@ class UltravoxConfig(transformers.PretrainedConfig): """ model_type = "ultravox" + audio_token = "<|audio|>" is_composition = False def __init__( @@ -80,29 +81,32 @@ def __init__( # Avoid circular import from vllm.transformers_utils.config import get_config - self.text_config = get_config(text_model_id, - trust_remote_code=False) + text_config_obj = get_config(text_model_id, + trust_remote_code=False) else: text_config = text_config or {} - self.text_config = transformers.CONFIG_MAPPING[text_config.get( + text_config_obj = transformers.CONFIG_MAPPING[text_config.get( "model_type", "llama")](**text_config) + inner_text_config = text_config_obj.get_text_config() + if audio_model_id is not None: # Avoid circular import from vllm.transformers_utils.config import get_config - self.audio_config = get_config(audio_model_id, - trust_remote_code=False) + audio_config = get_config(audio_model_id, trust_remote_code=False) else: audio_config = audio_config or {} - self.audio_config = transformers.CONFIG_MAPPING[audio_config.get( + audio_config = transformers.CONFIG_MAPPING[audio_config.get( "model_type", "whisper")](**audio_config) + self.text_config = text_config_obj + self.audio_config = audio_config self.text_model_lora_config = text_model_lora_config or {} self.audio_model_lora_config = audio_model_lora_config or {} - self.vocab_size = self.text_config.vocab_size - - self.initializer_range = self.text_config.initializer_range + self.vocab_size = inner_text_config.vocab_size + self.initializer_range = inner_text_config.initializer_range + self.text_hidden_size = inner_text_config.hidden_size super().__init__(**kwargs) From 7f67a541193d33fd76a0f54802eeca05579bfd0a Mon Sep 17 00:00:00 2001 From: WeiQing Chen <40507679+david6666666@users.noreply.github.com> Date: Sat, 26 Jul 2025 09:37:32 +0800 Subject: [PATCH 379/552] [Docs] add offline serving multi-modal video input expamle Qwen2.5-VL (#21530) Signed-off-by: David Chen <530634352@qq.com> Signed-off-by: x22x22 --- docs/features/multimodal_inputs.md | 64 ++++++++++++++++++++++++++++++ 1 file changed, 64 insertions(+) diff --git a/docs/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md index e820ace4f8f..e83dfdb11da 100644 --- a/docs/features/multimodal_inputs.md +++ b/docs/features/multimodal_inputs.md @@ -177,6 +177,70 @@ Multi-image input can be extended to perform video captioning. 
We show this with You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary instead of using multi-image input. +Instead of NumPy arrays, you can also pass `'torch.Tensor'` instances, as shown in this example using Qwen2.5-VL: + +??? code + + ```python + from transformers import AutoProcessor + from vllm import LLM, SamplingParams + from qwen_vl_utils import process_vision_info + + model_path = "Qwen/Qwen2.5-VL-3B-Instruct/" + video_path = "https://content.pexels.com/videos/free-videos.mp4" + + llm = LLM( + model=model_path, + gpu_memory_utilization=0.8, + enforce_eager=True, + limit_mm_per_prompt={"video": 1}, + ) + + sampling_params = SamplingParams( + max_tokens=1024, + ) + + video_messages = [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": [ + {"type": "text", "text": "describe this video."}, + { + "type": "video", + "video": video_path, + "total_pixels": 20480 * 28 * 28, + "min_pixels": 16 * 28 * 28 + } + ] + }, + ] + + messages = video_messages + processor = AutoProcessor.from_pretrained(model_path) + prompt = processor.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + + image_inputs, video_inputs = process_vision_info(messages) + mm_data = {} + if video_inputs is not None: + mm_data["video"] = video_inputs + + llm_inputs = { + "prompt": prompt, + "multi_modal_data": mm_data, + } + + outputs = llm.generate([llm_inputs], sampling_params=sampling_params) + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + ``` + + !!! note + 'process_vision_info' is only applicable to Qwen2.5-VL and similar models. + Full example: ### Audio Inputs From f2629edfe748b66587d90b7eb418f9ac7585e67f Mon Sep 17 00:00:00 2001 From: Huy Do Date: Fri, 25 Jul 2025 19:06:21 -0700 Subject: [PATCH 380/552] Correctly kill vLLM processes after finishing serving benchmarks (#21641) Signed-off-by: Huy Do Signed-off-by: x22x22 --- .../scripts/run-nightly-benchmarks.sh | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh b/.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh index 4d01a314adc..4162905bb3c 100644 --- a/.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh +++ b/.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh @@ -95,12 +95,14 @@ json2args() { } kill_gpu_processes() { - pkill -f python - pkill -f python3 - pkill -f tritonserver - pkill -f pt_main_thread - pkill -f text-generation - pkill -f lmdeploy + pkill -f '[p]ython' + pkill -f '[p]ython3' + pkill -f '[t]ritonserver' + pkill -f '[p]t_main_thread' + pkill -f '[t]ext-generation' + pkill -f '[l]mdeploy' + # vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445 + pkill -f '[V]LLM' while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do sleep 1 From 3339fbbbed2213802a9e11d80b6bb211f1779fe4 Mon Sep 17 00:00:00 2001 From: Alexandre JUAN Date: Sat, 26 Jul 2025 05:11:10 +0200 Subject: [PATCH 381/552] [Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison (#21612) Signed-off-by: Alexandre Juan Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_engine.py | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/vllm/entrypoints/openai/serving_engine.py b/vllm/entrypoints/openai/serving_engine.py index fb4598b2f73..d74231d7e9d 
100644 --- a/vllm/entrypoints/openai/serving_engine.py +++ b/vllm/entrypoints/openai/serving_engine.py @@ -981,9 +981,11 @@ def _load_prompt_embeds( def _load_and_validate_embed(embed: bytes) -> EmbedsPrompt: tensor = torch.load(io.BytesIO(base64.b64decode(embed)), weights_only=True) - assert isinstance( - tensor, - (torch.FloatTensor, torch.BFloat16Tensor, torch.HalfTensor)) + assert isinstance(tensor, torch.Tensor) and tensor.dtype in ( + torch.float32, + torch.bfloat16, + torch.float16, + ) if tensor.dim() > 2: tensor = tensor.squeeze(0) assert tensor.dim() == 2 From b539c9a39db8d85f5042693900de22db8f70ce0e Mon Sep 17 00:00:00 2001 From: QiliangCui Date: Fri, 25 Jul 2025 23:20:30 -0700 Subject: [PATCH 382/552] [TPU][Test] Divide TPU v1 Test into 2 parts. (#21431) Signed-off-by: x22x22 --- .../hardware_ci/run-tpu-v1-test-part2.sh | 166 ++++++++++++++++++ .../scripts/hardware_ci/run-tpu-v1-test.sh | 12 -- 2 files changed, 166 insertions(+), 12 deletions(-) create mode 100755 .buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh diff --git a/.buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh b/.buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh new file mode 100755 index 00000000000..d998c1f73b5 --- /dev/null +++ b/.buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh @@ -0,0 +1,166 @@ +#!/bin/bash + +set -xu + + +remove_docker_container() { + docker rm -f tpu-test || true; + docker rm -f vllm-tpu || true; +} + +trap remove_docker_container EXIT + +# Remove the container that might not be cleaned up in the previous run. +remove_docker_container + +# Build the docker image. +docker build -f docker/Dockerfile.tpu -t vllm-tpu . + +# Set up cleanup. +cleanup_docker() { + # Get Docker's root directory + docker_root=$(docker info -f '{{.DockerRootDir}}') + if [ -z "$docker_root" ]; then + echo "Failed to determine Docker root directory." + exit 1 + fi + echo "Docker root directory: $docker_root" + # Check disk usage of the filesystem where Docker's root directory is located + disk_usage=$(df "$docker_root" | tail -1 | awk '{print $5}' | sed 's/%//') + # Define the threshold + threshold=70 + if [ "$disk_usage" -gt "$threshold" ]; then + echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..." + # Remove dangling images (those that are not tagged and not used by any container) + docker image prune -f + # Remove unused volumes / force the system prune for old images as well. + docker volume prune -f && docker system prune --force --filter "until=72h" --all + echo "Docker images and volumes cleanup completed." + else + echo "Disk usage is below $threshold%. No cleanup needed." + fi +} +cleanup_docker + +# For HF_TOKEN. +source /etc/environment + +docker run --privileged --net host --shm-size=16G -it \ + -e "HF_TOKEN=$HF_TOKEN" --name tpu-test \ + vllm-tpu /bin/bash -c ' +set -e # Exit immediately if a command exits with a non-zero status. +set -u # Treat unset variables as an error. + +echo "--- Starting script inside Docker container ---" + +# Create results directory +RESULTS_DIR=$(mktemp -d) +# If mktemp fails, set -e will cause the script to exit. 
+echo "Results will be stored in: $RESULTS_DIR" + +# Install dependencies +echo "--- Installing Python dependencies ---" +python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \ + && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \ + && python3 -m pip install --progress-bar off lm_eval[api]==0.4.4 \ + && python3 -m pip install --progress-bar off hf-transfer +echo "--- Python dependencies installed ---" +export VLLM_USE_V1=1 +export VLLM_XLA_CHECK_RECOMPILATION=1 +export VLLM_XLA_CACHE_PATH= +echo "Using VLLM V1" + +echo "--- Hardware Information ---" +# tpu-info +echo "--- Starting Tests ---" +set +e +overall_script_exit_code=0 + +# --- Test Definitions --- +# If a test fails, this function will print logs and will not cause the main script to exit. +run_test() { + local test_num=$1 + local test_name=$2 + local test_command=$3 + local log_file="$RESULTS_DIR/test_${test_num}.log" + local actual_exit_code + + echo "--- TEST_$test_num: Running $test_name ---" + + # Execute the test command. + eval "$test_command" > >(tee -a "$log_file") 2> >(tee -a "$log_file" >&2) + actual_exit_code=$? + + echo "TEST_${test_num}_COMMAND_EXIT_CODE: $actual_exit_code" # This goes to main log + echo "TEST_${test_num}_COMMAND_EXIT_CODE: $actual_exit_code" >> "$log_file" # Also to per-test log + + if [ "$actual_exit_code" -ne 0 ]; then + echo "TEST_$test_num ($test_name) FAILED with exit code $actual_exit_code." >&2 + echo "--- Log for failed TEST_$test_num ($test_name) ---" >&2 + if [ -f "$log_file" ]; then + cat "$log_file" >&2 + else + echo "Log file $log_file not found for TEST_$test_num ($test_name)." >&2 + fi + echo "--- End of log for TEST_$test_num ($test_name) ---" >&2 + return "$actual_exit_code" # Return the failure code + else + echo "TEST_$test_num ($test_name) PASSED." + return 0 # Return success + fi +} + +# Helper function to call run_test and update the overall script exit code +run_and_track_test() { + local test_num_arg="$1" + local test_name_arg="$2" + local test_command_arg="$3" + + # Run the test + run_test "$test_num_arg" "$test_name_arg" "$test_command_arg" + local test_specific_exit_code=$? + + # If the test failed, set the overall script exit code to 1 + if [ "$test_specific_exit_code" -ne 0 ]; then + # No need for extra echo here, run_test already logged the failure. + overall_script_exit_code=1 + fi +} + +# --- Actual Test Execution --- +run_and_track_test 1 "test_struct_output_generate.py" \ + "HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\"" +run_and_track_test 2 "test_moe_pallas.py" \ + "python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py" +run_and_track_test 3 "test_lora.py" \ + "VLLM_XLA_CHECK_RECOMPILATION=0 python3 -m pytest -s -v /workspace/vllm/tests/tpu/lora/test_lora.py" +run_and_track_test 4 "test_tpu_qkv_linear.py" \ + "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py" +run_and_track_test 5 "test_spmd_model_weight_loading.py" \ + "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py" +run_and_track_test 6 "test_kv_cache_update_kernel.py" \ + "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py" + +# After all tests have been attempted, exit with the overall status. +if [ "$overall_script_exit_code" -ne 0 ]; then + echo "--- One or more tests FAILED. 
Overall script exiting with failure code 1. ---" +else + echo "--- All tests have completed and PASSED. Overall script exiting with success code 0. ---" +fi +exit "$overall_script_exit_code" +' # IMPORTANT: This is the closing single quote for the bash -c "..." command. Ensure it is present and correct. + +# Capture the exit code of the docker run command +DOCKER_RUN_EXIT_CODE=$? + +# The trap will run for cleanup. +# Exit the main script with the Docker run command's exit code. +if [ "$DOCKER_RUN_EXIT_CODE" -ne 0 ]; then + echo "Docker run command failed with exit code $DOCKER_RUN_EXIT_CODE." + exit "$DOCKER_RUN_EXIT_CODE" +else + echo "Docker run command completed successfully." + exit 0 +fi +# TODO: This test fails because it uses RANDOM_SEED sampling +# pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \ diff --git a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh index 5514d7770cf..e565d4b2469 100755 --- a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh +++ b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh @@ -150,18 +150,6 @@ run_and_track_test 9 "test_multimodal.py" \ "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py" run_and_track_test 10 "test_pallas.py" \ "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py" -run_and_track_test 11 "test_struct_output_generate.py" \ - "HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\"" -run_and_track_test 12 "test_moe_pallas.py" \ - "python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py" -run_and_track_test 13 "test_lora.py" \ - "VLLM_XLA_CHECK_RECOMPILATION=0 python3 -m pytest -s -v /workspace/vllm/tests/tpu/lora/test_lora.py" -run_and_track_test 14 "test_tpu_qkv_linear.py" \ - "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py" -run_and_track_test 15 "test_spmd_model_weight_loading.py" \ - "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py" -run_and_track_test 16 "test_kv_cache_update_kernel.py" \ - "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py" # After all tests have been attempted, exit with the overall status. if [ "$overall_script_exit_code" -ne 0 ]; then From 1e436552b97e11f1d09aedaa483a1d141a4b5b51 Mon Sep 17 00:00:00 2001 From: Lyu Han Date: Sat, 26 Jul 2025 19:14:04 +0800 Subject: [PATCH 383/552] Support Intern-S1 (#21628) Signed-off-by: Roger Wang Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Isotr0py Co-authored-by: Your Name Co-authored-by: Roger Wang Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py Signed-off-by: x22x22 --- docs/models/supported_models.md | 1 + examples/offline_inference/vision_language.py | 32 + .../vision_language_multi_image.py | 28 + tests/models/registry.py | 2 + vllm/model_executor/models/interns1.py | 711 ++++++++++++++++++ vllm/model_executor/models/interns1_vit.py | 421 +++++++++++ vllm/model_executor/models/registry.py | 1 + 7 files changed, 1196 insertions(+) create mode 100644 vllm/model_executor/models/interns1.py create mode 100644 vllm/model_executor/models/interns1_vit.py diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index a8d442a1ae7..faffa08d41b 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -596,6 +596,7 @@ Specified using `--task generate`. 
| `GraniteSpeechForConditionalGeneration` | Granite Speech | T + A | `ibm-granite/granite-speech-3.3-8b` | ✅︎ | ✅︎ | ✅︎ | | `H2OVLChatModel` | H2OVL | T + IE+ | `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc. | | ✅︎ | ✅︎ | | `Idefics3ForConditionalGeneration` | Idefics3 | T + I | `HuggingFaceM4/Idefics3-8B-Llama3`, etc. | ✅︎ | | ✅︎ | +| `InternS1ForConditionalGeneration` | Intern-S1 | T + IE+ | `internlm/Intern-S1`, etc. | | ✅︎ | ✅︎ | | `InternVLChatModel` | InternVL 3.0, InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0 | T + IE+ + (VE+) | `OpenGVLab/InternVL3-9B`, `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `KeyeForConditionalGeneration` | Keye-VL-8B-Preview | T + IE+ + VE+ | `Kwai-Keye/Keye-VL-8B-Preview` | | | ✅︎ | | `KimiVLForConditionalGeneration` | Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking | T + I+ | `moonshotai/Kimi-VL-A3B-Instruct`, `moonshotai/Kimi-VL-A3B-Thinking` | | | ✅︎ | diff --git a/examples/offline_inference/vision_language.py b/examples/offline_inference/vision_language.py index eb6b4108485..61f5525c6d7 100644 --- a/examples/offline_inference/vision_language.py +++ b/examples/offline_inference/vision_language.py @@ -468,6 +468,37 @@ def run_tarsier(questions: list[str], modality: str) -> ModelRequestData: ) +# Intern-S1 +def run_interns1(questions: list[str], modality: str) -> ModelRequestData: + assert modality == "image" + + model_name = "internlm/Intern-S1" + + engine_args = EngineArgs( + model=model_name, + trust_remote_code=True, + max_model_len=8192, + max_num_seqs=2, + limit_mm_per_prompt={modality: 1}, + enforce_eager=True, + ) + + placeholder = "" + tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) + messages = [ + [{"role": "user", "content": f"{placeholder}\n{question}"}] + for question in questions + ] + prompts = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + + return ModelRequestData( + engine_args=engine_args, + prompts=prompts, + ) + + # InternVL def run_internvl(questions: list[str], modality: str) -> ModelRequestData: model_name = "OpenGVLab/InternVL3-2B" @@ -1303,6 +1334,7 @@ def run_skyworkr1v(questions: list[str], modality: str) -> ModelRequestData: "h2ovl_chat": run_h2ovl, "hyperclovax_seed_vision": run_hyperclovax_seed_vision, "idefics3": run_idefics3, + "interns1": run_interns1, "internvl_chat": run_internvl, "nemotron_vl": run_nemotron_vl, "keye_vl": run_keye_vl, diff --git a/examples/offline_inference/vision_language_multi_image.py b/examples/offline_inference/vision_language_multi_image.py index 2e14fc807e1..e312a0953e9 100644 --- a/examples/offline_inference/vision_language_multi_image.py +++ b/examples/offline_inference/vision_language_multi_image.py @@ -253,6 +253,33 @@ def load_smolvlm(question: str, image_urls: list[str]) -> ModelRequestData: ) +def load_interns1(question: str, image_urls: list[str]) -> ModelRequestData: + model_name = "internlm/Intern-S1" + + engine_args = EngineArgs( + model=model_name, + trust_remote_code=True, + max_model_len=4096, + limit_mm_per_prompt={"image": len(image_urls)}, + ) + + placeholders = "\n".join( + f"Image-{i}: \n" for i, _ in enumerate(image_urls, start=1) + ) + messages = [{"role": "user", "content": f"{placeholders}\n{question}"}] + + tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) + prompt = tokenizer.apply_chat_template( + messages, tokenize=False, 
add_generation_prompt=True + ) + + return ModelRequestData( + engine_args=engine_args, + prompt=prompt, + image_data=[fetch_image(url) for url in image_urls], + ) + + def load_internvl(question: str, image_urls: list[str]) -> ModelRequestData: model_name = "OpenGVLab/InternVL2-2B" @@ -946,6 +973,7 @@ def load_tarsier2(question: str, image_urls: list[str]) -> ModelRequestData: "gemma3": load_gemma3, "h2ovl_chat": load_h2ovl, "idefics3": load_idefics3, + "interns1": load_interns1, "internvl_chat": load_internvl, "hyperclovax_seed_vision": load_hyperclovax_seed_vision, "keye_vl": load_keye_vl, diff --git a/tests/models/registry.py b/tests/models/registry.py index b41e432d738..0dc5aec8db1 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -381,6 +381,8 @@ def check_available_online( extras={"2B": "OpenGVLab/InternVL2-2B", "3.0": "OpenGVLab/InternVL3-1B"}, # noqa: E501 trust_remote_code=True), + "InternS1ForConditionalGeneration": _HfExamplesInfo("internlm/Intern-S1", + trust_remote_code=True), "Idefics3ForConditionalGeneration": _HfExamplesInfo("HuggingFaceM4/Idefics3-8B-Llama3", # noqa: E501 {"tiny": "HuggingFaceTB/SmolVLM-256M-Instruct"}), # noqa: E501 "KeyeForConditionalGeneration": _HfExamplesInfo("Kwai-Keye/Keye-VL-8B-Preview", # noqa: E501 diff --git a/vllm/model_executor/models/interns1.py b/vllm/model_executor/models/interns1.py new file mode 100644 index 00000000000..36204e4c595 --- /dev/null +++ b/vllm/model_executor/models/interns1.py @@ -0,0 +1,711 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +# -------------------------------------------------------- +# InternS1 +# Copyright (c) 2025 Shanghai AI Lab +# Licensed under The MIT License [see LICENSE for details] +# -------------------------------------------------------- +from collections.abc import Iterable, Mapping, Sequence +from typing import Literal, Optional, TypedDict, Union + +import torch +import torch.nn as nn +from transformers import InternVLProcessor, PretrainedConfig +from transformers.activations import ACT2FN +from transformers.models.got_ocr2.image_processing_got_ocr2_fast import ( + GotOcr2ImageProcessorFast) + +from vllm.config import VllmConfig +from vllm.model_executor.layers.quantization import QuantizationConfig +from vllm.model_executor.models.interns1_vit import InternS1VisionModel +from vllm.model_executor.models.module_mapping import MultiModelKeys +from vllm.model_executor.sampling_metadata import SamplingMetadata +from vllm.multimodal import MULTIMODAL_REGISTRY +from vllm.multimodal.inputs import (MultiModalDataDict, MultiModalFieldConfig, + MultiModalKwargs, NestedTensors) +from vllm.multimodal.parse import (ImageEmbeddingItems, ImageProcessorItems, + ImageSize, MultiModalDataItems) +from vllm.multimodal.processing import (BaseMultiModalProcessor, + BaseProcessingInfo, PromptReplacement, + PromptUpdate, PromptUpdateDetails) +from vllm.multimodal.profiling import BaseDummyInputsBuilder +from vllm.sequence import IntermediateTensors + +from .interfaces import (MultiModalEmbeddings, SupportsLoRA, + SupportsMultiModal, SupportsPP) +from .utils import (AutoWeightsLoader, WeightsMapper, flatten_bn, + init_vllm_registered_model, maybe_prefix, + merge_multimodal_embeddings) + + +class InternS1MultiModalProjector(nn.Module): + + def __init__(self, config): + super().__init__() + self.layer_norm = nn.LayerNorm(config.vision_config.hidden_size * + int(1 / config.downsample_ratio)**2) + self.linear_1 = nn.Linear( + 
config.vision_config.hidden_size * + int(1 / config.downsample_ratio)**2, + config.text_config.hidden_size) + self.act = ACT2FN[config.projector_hidden_act] + self.linear_2 = nn.Linear(config.text_config.hidden_size, + config.text_config.hidden_size) + + def forward(self, image_features): + hidden_states = self.layer_norm(image_features) + hidden_states = self.linear_1(hidden_states) + hidden_states = self.act(hidden_states) + hidden_states = self.linear_2(hidden_states) + return hidden_states + + +class InternS1ImagePixelInputs(TypedDict): + type: Literal["pixel_values"] + pixel_values: torch.Tensor + """ + Shape: + `(batch_size * num_images * (1 + num_patches), num_channels, height, width)` + """ + + +class InternS1ImageEmbeddingInputs(TypedDict): + type: Literal["image_embeds"] + data: Union[torch.Tensor, list[torch.Tensor]] + """ + A tensor of shape `(num_images, total_image_feature_size, hidden_size)` + or a list of tensors of shape `(total_image_feature_size, hidden_size)` + + `hidden_size` must match the hidden size of language model backbone. + """ + + +InternS1ImageInputs = Union[InternS1ImagePixelInputs, + InternS1ImageEmbeddingInputs] + + +class InternS1VideoPixelInputs(TypedDict): + type: Literal["pixel_values_videos"] + pixel_values: torch.Tensor + """ + Shape: + `(batch_size * num_video * num_frames, num_channels, height, width)` + """ + + num_patches: torch.Tensor + """Shape: `(batch_size * num_images)`""" + + +class InternS1VideoEmbeddingInputs(TypedDict): + type: Literal["video_embeds"] + data: Union[torch.Tensor, list[torch.Tensor]] + """ + A tensor of shape `(num_videos, total_video_feature_size, hidden_size)` + or a list of tensors of shape `(total_video_feature_size, hidden_size)` + + `hidden_size` must match the hidden size of language model backbone. 
+ """ + + +InternS1VideoInputs = Union[InternS1VideoPixelInputs, + InternS1VideoEmbeddingInputs] + + +def resolve_interns1_min_max_num( + min_dynamic_patch: int, + max_dynamic_patch: int, + dynamic_image_size: bool, + use_thumbnail: bool, +) -> tuple[int, int]: + min_dynamic_patch = min_dynamic_patch if dynamic_image_size else 1 + max_dynamic_patch = max_dynamic_patch if dynamic_image_size else 1 + + if use_thumbnail and max_dynamic_patch != 1: + max_dynamic_patch += 1 + + return min_dynamic_patch, max_dynamic_patch + + +def get_interns1_target_ratios( + min_num: int, + max_num: int, +) -> list[tuple[int, int]]: + target_ratios = {(i, j) + for n in range(min_num, max_num + 1) + for i in range(1, n + 1) + for j in range(1, n + 1) if min_num <= i * j <= max_num} + return sorted(target_ratios, key=lambda x: x[0] * x[1]) + + +class InternS1ProcessingInfo(BaseProcessingInfo): + """Basic image-only ProcessingInfo for InternS1-style models.""" + + def get_hf_processor(self, **kwargs: object) -> InternVLProcessor: + return self.ctx.get_hf_processor(InternVLProcessor, **kwargs) + + def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]: + return {"image": None} + + def get_num_image_tokens( + self, + *, + image_width: int, + image_height: int, + processor: Optional['GotOcr2ImageProcessorFast'] = None, + ) -> int: + if processor is None: + processor = self.get_hf_processor().image_processor + + if not isinstance(processor, GotOcr2ImageProcessorFast): + raise ValueError(f'GotOcr2ImageProcessorFast is expected but got ' + f'{type(processor)}') + num_image_patches = processor.get_number_of_image_tokens( + image_height, image_width, images_kwargs=dict()) + num_image_tokens = self.get_hf_processor( + ).image_seq_length * num_image_patches + return num_image_tokens + + def resolve_target_ratios(self, use_thumbnail: Optional[bool] = None): + image_processor = self.get_hf_processor().image_processor + min_dynamic_patch = image_processor.min_patches + max_dynamic_patch = image_processor.max_patches + # HF format's InternVL processor uses `crop_to_patches` which is + # equivalent to `use_thumbnail` in original format. 
+ use_thumbnail = image_processor.crop_to_patches + dynamic_image_size = True + min_num, max_num = resolve_interns1_min_max_num( + min_dynamic_patch, + max_dynamic_patch, + dynamic_image_size, + use_thumbnail=use_thumbnail) + + return get_interns1_target_ratios(min_num, max_num) + + def get_image_size_with_most_features(self) -> ImageSize: + processor = self.get_hf_processor() + + hf_config = self.ctx.get_hf_config() + base_height, base_width = hf_config.vision_config.image_size + target_ratios = self.resolve_target_ratios() + + largest_feature_size, largest_feature_pinpoint = 0, None + for wr, hr in target_ratios: + width, height = base_width * wr, base_height * hr + + feat_size = self.get_num_image_tokens( + image_width=width, + image_height=height, + processor=processor.image_processor, + ) + if feat_size > largest_feature_size: + largest_feature_size = feat_size + largest_feature_pinpoint = ImageSize(width=width, + height=height) + + assert not (largest_feature_size == 0 or largest_feature_pinpoint + is None), ("Cannot have a largest feature size of 0!") + + return largest_feature_pinpoint + + def get_max_image_tokens(self) -> int: + processor = self.get_hf_processor() + target_width, target_height = self.get_image_size_with_most_features() + + return self.get_num_image_tokens( + image_width=target_width, + image_height=target_height, + processor=processor.image_processor, + ) + + +class InternS1DummyInputsBuilder(BaseDummyInputsBuilder[InternS1ProcessingInfo] + ): + """Basic image-only DummyInputsBuilder for InternS1-style models.""" + + def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str: + num_images = mm_counts.get("image", 0) + image_token = self.info.get_hf_processor().image_token + + return image_token * num_images + + def get_dummy_mm_data( + self, + seq_len: int, + mm_counts: Mapping[str, int], + ) -> MultiModalDataDict: + target_width, target_height = \ + self.info.get_image_size_with_most_features() + num_images = mm_counts.get("image", 0) + + return { + "image": + self._get_dummy_images(width=target_width, + height=target_height, + num_images=num_images) + } + + +class InternS1MultiModalProcessor( + BaseMultiModalProcessor[InternS1ProcessingInfo]): + """ Basic image-only MultiModalProcessor for InternS1-style models.""" + + def _call_hf_processor( + self, + prompt: str, + mm_data: Mapping[str, object], + mm_kwargs: Mapping[str, object], + tok_kwargs: Mapping[str, object], + ) -> Mapping[str, NestedTensors]: + processed_outputs = super()._call_hf_processor( + prompt=prompt, + mm_data=mm_data, + mm_kwargs=mm_kwargs, + tok_kwargs=tok_kwargs, + ) + + hf_processor = self.info.get_hf_processor(**mm_kwargs) + image_token_id = hf_processor.image_token_id + + # Since there may be extra tokens in the feature placeholders, + # we need to pass the image token ID to the model to select the + # tokens to merge from the vision encoder outputs + processed_outputs["image_token_id"] = torch.tensor(image_token_id) + images = mm_data.get('images', None) + image_processor = self.info.get_hf_processor().image_processor + if images is not None: + image_inputs = image_processor(images=images) + image_num_patches = image_inputs.pop("num_patches") + if not isinstance(image_num_patches, list): + raise ValueError( + f'num_patches is supposed to be list, but got ' + f'{type(image_num_patches)}') + image_num_patches = torch.tensor(image_num_patches) + processed_outputs['image_num_patches'] = image_num_patches + + return processed_outputs + + def _get_mm_fields_config( + self, + hf_inputs: 
Mapping[str, NestedTensors], + hf_processor_mm_kwargs: Mapping[str, object], + ) -> Mapping[str, MultiModalFieldConfig]: + + image_num_patches = hf_inputs.get("image_num_patches", torch.empty(0)) + num_images = len(image_num_patches) + + return dict( + pixel_values=MultiModalFieldConfig.flat_from_sizes( + "image", image_num_patches), + image_num_patches=MultiModalFieldConfig.batched("image"), + image_embeds=MultiModalFieldConfig.batched("image"), + image_token_id=MultiModalFieldConfig.shared("image", num_images), + ) + + def _get_prompt_updates( + self, + mm_items: MultiModalDataItems, + hf_processor_mm_kwargs: Mapping[str, object], + out_mm_kwargs: MultiModalKwargs, + ) -> Sequence[PromptUpdate]: + hf_processor = self.info.get_hf_processor(**hf_processor_mm_kwargs) + img_context_token = hf_processor.image_token + start_image_token = hf_processor.start_image_token + end_image_token = hf_processor.end_image_token + + def get_replacement(item_idx: int): + images = mm_items.get_items( + "image", (ImageEmbeddingItems, ImageProcessorItems)) + + if isinstance(images, ImageEmbeddingItems): + feature_size = images.get_feature_size(item_idx) + else: + image_size = images.get_image_size(item_idx) + feature_size = self.info.get_num_image_tokens( + image_width=image_size.width, + image_height=image_size.height, + processor=hf_processor.image_processor, + ) + + repl_features = img_context_token * feature_size + repl_full = start_image_token + repl_features + end_image_token + return PromptUpdateDetails.select_text(repl_full, + img_context_token) + + return [ + PromptReplacement( + modality="image", + target=img_context_token, + replacement=get_replacement, + ) + ] + + +@MULTIMODAL_REGISTRY.register_processor( + InternS1MultiModalProcessor, + info=InternS1ProcessingInfo, + dummy_inputs=InternS1DummyInputsBuilder) +class InternS1ForConditionalGeneration(nn.Module, SupportsMultiModal, + SupportsPP, SupportsLoRA): + + # To ensure correct weight loading and mapping. + hf_to_vllm_mapper = WeightsMapper( + orig_to_new_prefix={ + "lm_head.": "language_model.lm_head.", + "model.language_model.": "language_model.model.", + "model.vision_tower.": "vision_tower.", + "model.multi_modal_projector.": "multi_modal_projector.", + }) + + @classmethod + def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]: + # transformers InternVLProcessor uses as the seperator + # refer to https://github.com/huggingface/transformers/blob/f90de364c2484c7c325bbe05befdcf487bd75b63/src/transformers/models/internvl/processing_internvl.py#L116 + if modality.startswith("image"): + return '' + if modality.startswith("video"): + return "
[GIT binary patch data omitted (binary image asset, likely the most_model_len.png referenced below)]

diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md
new file mode 100644
index 00000000000..005b7f78f44
--- /dev/null
+++ b/docs/configuration/tpu.md
@@ -0,0 +1,104 @@
+# TPU Optimization Tips
+
+This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload.
+
+## Get started
+
+Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md).
+
+### TPU workload sizing
+
+When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed.
+
+The following Colab [calculator](https://colab.research.google.com/github/ericehanley/rightsize-vllm/blob/main/HBM_Calculator.ipynb) will tell you:
+
+- KV cache size requirement per token and per request
+- TPU/GPU memory consumed by the model weights
+- TPU/GPU memory allocated for the KV cache
+- Maximum \# of requests you can approximately set (`--max-num-seqs`)
+
+This approach serves as a general rule of thumb.
+
+#### Latency-throughput tradeoff
+
+As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` and/or increasing the number of chips can help reduce latency.
+
+`--max-num-seqs` defines the number of concurrent decode slots, effectively limiting the number of requests the server can process tokens for simultaneously. Increasing this value allows the server to pre-allocate more HBM to handle a higher number of concurrent requests, which can maximize overall throughput. However, this often increases the end-to-end (e2e) latency per request.
+
+Therefore, carefully tuning `--max-num-seqs` is crucial to achieving the desired balance between latency and throughput for your specific workload.
+
+In a similar way, `--max-num-batched-tokens` can be adjusted down to improve latency, or adjusted up to improve throughput.
+
+#### Compilation and Caching
+
+Coming from a GPU background, one of the key differences you'll notice with TPUs is an initial compilation step. TPUs are specialized accelerators (ASICs) that achieve maximum performance by executing pre-compiled, static computation graphs via the XLA compiler. Unlike GPUs, which can handle dynamic input shapes more flexibly, TPUs require a specific compiled graph for each tensor shape (e.g., batch size and sequence length) they process.
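+
+As a rough illustration (the model name and flag values below are placeholder assumptions, not tuned recommendations), a minimal launch sketch shows which serving flags pin down the shapes that have to be compiled:
+
+```bash
+# Minimal sketch with a placeholder model: --max-model-len and --max-num-seqs
+# bound the sequence-length and batch-size shapes that XLA pre-compiles, so
+# they directly affect both serving performance and the warmup time described
+# below.
+vllm serve Qwen/Qwen2.5-7B-Instruct \
+    --max-model-len 2048 \
+    --max-num-seqs 128
+```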
+
+To manage this, vLLM performs a one-time "warmup" process when you first launch the server. During this phase, it pre-compiles the model for various common input shapes and saves these compiled graphs to a cache on disk or remote storage (located at `~/.cache/vllm/xla_cache` by default). This process can take anywhere from a few minutes to an hour, depending on the size of the model and the context length used.
+
+Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs.
+
+Use the `VLLM_XLA_CACHE_PATH` environment variable to write the cache to shareable storage for future deployed nodes (for example, when autoscaling).
+
+#### Reducing compilation time
+
+This initial compilation time varies significantly and is affected by many of the arguments discussed in this optimization doc. Factors that influence compilation time include the model size and `--max-num-batched-tokens`. Other knobs you can tune include the `VLLM_TPU_MOST_MODEL_LEN` environment variable.
+
+### Optimize based on your data
+
+#### max model len vs. most model len
+
+![most_model_len](../assets/design/v1/tpu/most_model_len.png)
+
+If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable.
+
+For example, if 1% of requests are 32k tokens long and 99% are 2k tokens long, you can pass 32k via `--max-model-len 32768` and set `VLLM_TPU_MOST_MODEL_LEN=2048`.
+
+Requests are subdivided into max-model-len and most-model-len categories; for the latter category, the server can process more requests at a time, which improves performance.
+
+#### Padding
+
+For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128: 128, 256, etc.
+
+The server pads the requests to fixed lengths before sending them to the model to avoid recompilation. To read more about TPU padding, see [here](https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies). Currently, there are two ways to pad the requests:
+
+1) the default exponential padding (pad to the nearest power of 2)
+2) bucket padding (pad to the nearest linearly increasing bucket).
+
+When using bucket padding, the buckets start from 16, end at max_model_len, and increment by `VLLM_TPU_BUCKET_PADDING_GAP`.
+
+For example, with max_model_len=512 and padding_gap=64, the buckets will be [16, 32, 64, 128, 192, 256, 320, 384, 448, 512].
+
+The fewer padding tokens we add, the less unnecessary computation the TPU does and the better the performance. For example, if num_tokens=300, exponential padding pads to 512, while the bucket padding above pads to 320.
+
+However, you need to choose the padding gap carefully. If the gap is too small, the number of buckets is large, leading to increased warmup (precompile) time and more memory to store the compiled graphs. Too many compiled graphs may lead to HBM OOM.
Conversely, an overly large gap yields no performance improvement compared to the default exponential padding. + +**If possible, use the precision that matches the chip’s hardware acceleration** + +- v5e has int4/int8 hardware acceleration in the MXU +- v6e has int4/int8 hardware acceleration in the MXU + +Supported quantized formats and features in vLLM on TPU [Jul '25] +- INT8 W8A8 +- INT8 W8A16 +- FP8 KV cache +- [WIP] FP8 W8A8 +- [WIP] AWQ +- [WIP] FP4 W4A8 + +**Don't set TP to be less than the number of chips on a single-host deployment** + +Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types). + +### Tune your workloads! + +Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case. + +### Future Topics We'll Cover + +#### Profiling + +The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance. + +#### SPMD +More details to come. + +**Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.** From 11e30424c62ad5cb49401123425ec47fbb1fc43b Mon Sep 17 00:00:00 2001 From: Wenhua Cheng Date: Tue, 29 Jul 2025 22:26:31 +0800 Subject: [PATCH 477/552] [Bugfix]fix mixed bits and visual language model quantization in AutoRound (#21802) Signed-off-by: Wenhua Cheng Signed-off-by: x22x22 --- .../layers/quantization/auto_round.py | 153 +++++++++++++----- 1 file changed, 115 insertions(+), 38 deletions(-) diff --git a/vllm/model_executor/layers/quantization/auto_round.py b/vllm/model_executor/layers/quantization/auto_round.py index ea17cd56c98..a9e967e608e 100644 --- a/vllm/model_executor/layers/quantization/auto_round.py +++ b/vllm/model_executor/layers/quantization/auto_round.py @@ -2,7 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from fractions import Fraction -from typing import Any, Optional, Union +from typing import TYPE_CHECKING, Any, Optional, Union import torch @@ -16,6 +16,9 @@ from vllm.platforms import current_platform from vllm.scalar_type import scalar_types +if TYPE_CHECKING: + from vllm.model_executor.models.utils import WeightsMapper + logger = init_logger(__name__) @@ -28,7 +31,13 @@ class AutoRoundConfig(QuantizationConfig): SUPPORTED_DTYPES = {"int"} SUPPORTED_FORMATS = {"auto_round:auto_gptq", "auto_round:auto_awq"} SUPPORTED_BACKENDS = { - "auto", "gptq", "gptq:marlin", "awq", "awq:marlin", "marlin", "ipex" + "auto", + "gptq", + "gptq:marlin", + "awq", + "awq:marlin", + "marlin", + "ipex", } def __init__( @@ -109,26 +118,70 @@ def from_config(cls, config: dict[str, Any]) -> "AutoRoundConfig": ) def get_layer_config(self, layer, layer_name: str): - # Priority: extra_config > block_name_to_quantize > type fallback + + def get_config(name: str, quantized: bool = True): + cfg = self.extra_config.get(name, {}) if self.extra_config else {} + return ( + cfg.get("bits", self.weight_bits 
if quantized else 16), + cfg.get("group_size", self.group_size if quantized else -1), + cfg.get("sym", self.sym if quantized else True), + ) + + # 1. Exact match from config if self.extra_config and layer_name in self.extra_config: - cfg = self.extra_config[layer_name] - return cfg.get("bits", self.weight_bits), cfg.get( - "group_size", self.group_size), cfg.get("sym", self.sym) + return get_config(layer_name) - quantized = True + # 2. Determine whether layer should be quantized + quantized = not isinstance(layer, ParallelLMHead) if self.block_name_to_quantize: quantized = any( layer_name.startswith(name) for name in self.block_name_to_quantize) - elif isinstance(layer, ParallelLMHead): - quantized = False - return (self.weight_bits, self.group_size, - self.sym) if quantized else (16, -1, True) + # 3. Handle fused MoE + if self.extra_config and "fusedmoe" in layer.__class__.__name__.lower( + ): + moe_configs = [ + get_config(name, quantized) for name in self.extra_config + if name.startswith(layer_name) + ] + if moe_configs: + if len(set(moe_configs)) == 1: + return moe_configs[0] + raise ValueError(f"Fused MoE layer '{layer_name}' requires " + f"consistent quant config for all sub-layers") + + # 4. Handle fused QKV or other patterns + if self.extra_config: + for fusion_key, sub_keys in self.packed_modules_mapping.items(): + if fusion_key in layer_name and layer_name.count( + fusion_key) == 1: + sub_names = [ + layer_name.replace(fusion_key, sub_key) + for sub_key in sub_keys + ] + sub_configs = [ + get_config(name, quantized) for name in sub_names + ] + if len(set(sub_configs)) == 1: + return sub_configs[0] + raise ValueError( + f"Fused module '{layer_name}' requires " + f"consistent quant config for {sub_names}") + + # 5. Fallback + return get_config(layer_name, quantized) def check_quantized(self, weight_bits: int) -> bool: return weight_bits < 16 + def apply_vllm_mapper(self, hf_to_vllm_mapper: "WeightsMapper"): + if self.block_name_to_quantize is not None: + self.block_name_to_quantize = hf_to_vllm_mapper.apply_list( + self.block_name_to_quantize) + if self.extra_config is not None: + self.extra_config = hf_to_vllm_mapper.apply_dict(self.extra_config) + def apply_awq_quant_layer(self, layer, prefix: str, backend: str = "auto"): from vllm.model_executor.layers.fused_moe import FusedMoE from vllm.model_executor.layers.quantization.utils.marlin_utils import ( @@ -141,9 +194,14 @@ def apply_awq_quant_layer(self, layer, prefix: str, backend: str = "auto"): else: return None - logger.debug("[%s] Type: %s, Bits: %s, Group Size: %s, Sym: %s", - prefix, layer.__class__.__name__, weight_bits, group_size, - sym) + logger.debug( + "[%s] Type: %s, Bits: %s, Group Size: %s, Sym: %s", + prefix, + layer.__class__.__name__, + weight_bits, + group_size, + sym, + ) if backend == "auto" or "marlin" in backend: AWQ_TYPE_MAP = { 4: scalar_types.uint4, @@ -162,15 +220,19 @@ def apply_awq_quant_layer(self, layer, prefix: str, backend: str = "auto"): if use_marlin: from vllm.model_executor.layers.quantization.awq_marlin import ( AWQMarlinConfig, AWQMarlinLinearMethod, AWQMoEMethod) - quant_args_marlin = AWQMarlinConfig(weight_bits=weight_bits, - group_size=group_size, - zero_point=not sym, - lm_head_quantized=False, - full_config={}, - modules_to_not_convert=[]) + + quant_args_marlin = AWQMarlinConfig( + weight_bits=weight_bits, + group_size=group_size, + zero_point=not sym, + lm_head_quantized=False, + full_config={}, + modules_to_not_convert=[], + ) else: from vllm.model_executor.layers.quantization.awq 
import ( AWQConfig, AWQLinearMethod) + quant_args = AWQConfig( weight_bits=weight_bits, group_size=group_size, @@ -182,6 +244,7 @@ def apply_awq_quant_layer(self, layer, prefix: str, backend: str = "auto"): return AWQMoEMethod(quant_args_marlin) from vllm.model_executor.layers.quantization.moe_wna16 import ( MoeWNA16Config) + config = { "quant_method": "awq", "bits": weight_bits, @@ -206,6 +269,7 @@ def apply_gptq_quant_layer(self, from vllm.model_executor.layers.fused_moe import FusedMoE from vllm.model_executor.layers.quantization.utils.marlin_utils import ( check_marlin_supported, check_moe_marlin_supports_layer) + weight_bits, group_size, sym = self.get_layer_config(layer, prefix) if not self.check_quantized(weight_bits): if isinstance(layer, (LinearBase, ParallelLMHead)): @@ -213,19 +277,24 @@ def apply_gptq_quant_layer(self, else: return None - logger.debug("[%s] Type: %s, Bits: %s, Group Size: %s, Sym: %s", - prefix, layer.__class__.__name__, weight_bits, group_size, - sym) + logger.debug( + "[%s] Type: %s, Bits: %s, Group Size: %s, Sym: %s", + prefix, + layer.__class__.__name__, + weight_bits, + group_size, + sym, + ) if backend == "auto" or "marlin" in backend: GPTQ_TYPE_MAP = { (4, True): scalar_types.uint4b8, (8, True): scalar_types.uint8b128, } - use_marlin = ((weight_bits, sym) in GPTQ_TYPE_MAP - and check_marlin_supported( + use_marlin = (weight_bits, + sym) in GPTQ_TYPE_MAP and check_marlin_supported( GPTQ_TYPE_MAP[(weight_bits, sym)], group_size, - has_zp=not sym)) + has_zp=not sym) if isinstance(layer, FusedMoE): use_marlin = use_marlin and check_moe_marlin_supports_layer( layer, group_size) @@ -234,26 +303,33 @@ def apply_gptq_quant_layer(self, if use_marlin: from vllm.model_executor.layers.quantization.gptq_marlin import ( GPTQMarlinConfig, GPTQMarlinLinearMethod, GPTQMarlinMoEMethod) - quant_args_marlin = GPTQMarlinConfig(weight_bits=weight_bits, - group_size=group_size, - is_sym=sym, - lm_head_quantized=False, - desc_act=False, - dynamic={}, - full_config={}) + + quant_args_marlin = GPTQMarlinConfig( + weight_bits=weight_bits, + group_size=group_size, + is_sym=sym, + lm_head_quantized=False, + desc_act=False, + dynamic={}, + full_config={}, + ) else: from vllm.model_executor.layers.quantization.gptq import ( GPTQConfig, GPTQLinearMethod) - quant_args = GPTQConfig(weight_bits=weight_bits, - group_size=group_size, - lm_head_quantized=False, - desc_act=False, - dynamic={}) + + quant_args = GPTQConfig( + weight_bits=weight_bits, + group_size=group_size, + lm_head_quantized=False, + desc_act=False, + dynamic={}, + ) if isinstance(layer, FusedMoE): if use_marlin: from vllm.model_executor.layers.quantization.moe_wna16 import ( MoeWNA16Config) + config = { "quant_method": "gptq", "bits": weight_bits, @@ -282,6 +358,7 @@ def apply_ipex_quant_layer(self, layer, prefix: str): return None from vllm.model_executor.layers.quantization.ipex_quant import ( IPEXAWQLinearMethod, IPEXConfig, IPEXGPTQLinearMethod) + if isinstance(layer, (LinearBase, ParallelLMHead)): if "awq" in self.packing_format: config = IPEXConfig(method="awq", From e21f1cdcb50777675b6d5d7353fec467e64fecc5 Mon Sep 17 00:00:00 2001 From: elvischenv <219235043+elvischenv@users.noreply.github.com> Date: Tue, 29 Jul 2025 22:34:00 +0800 Subject: [PATCH 478/552] [Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend (#21525) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: x22x22 --- .../kernels/benchmark_trtllm_attention.py | 42 ++++++++++++------- 
...test_flashinfer_trtllm_decode_attention.py | 16 ++++--- vllm/attention/backends/flashinfer.py | 15 +++++-- vllm/v1/attention/backends/flashinfer.py | 30 ++++++------- 4 files changed, 61 insertions(+), 42 deletions(-) diff --git a/benchmarks/kernels/benchmark_trtllm_attention.py b/benchmarks/kernels/benchmark_trtllm_attention.py index 8c980f93036..68c48858e61 100644 --- a/benchmarks/kernels/benchmark_trtllm_attention.py +++ b/benchmarks/kernels/benchmark_trtllm_attention.py @@ -71,22 +71,20 @@ def benchmark_decode( if kv_cache_dtype.startswith("fp8"): kv_cache, _ = to_float8(kv_cache) + output_trtllm = torch.empty(q.shape, dtype=dtype) + # Benchmark TRT decode def trt_decode(): return flashinfer.decode.trtllm_batch_decode_with_kv_cache( q, kv_cache, workspace_buffer, - num_qo_heads, - num_kv_heads, - sm_scale, block_tables, kv_lens_tensor, - page_size, max_kv_len, - kv_cache_dtype, - k_scale, - v_scale, + bmm1_scale=k_scale * sm_scale, + bmm2_scale=v_scale, + out=output_trtllm, ) def time_fn(fn, warmup=10, trials=20): @@ -125,6 +123,8 @@ def time_fn(fn, warmup=10, trials=20): kv_indices = torch.tensor(kv_indices, dtype=torch.int32) kv_last_page_lens = torch.tensor(kv_last_page_lens, dtype=torch.int32) + output_baseline = torch.empty(q.shape, dtype=dtype) + wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper( workspace_buffer, kv_layout, @@ -145,7 +145,7 @@ def time_fn(fn, warmup=10, trials=20): ) def baseline_decode(): - return wrapper.run(q, kv_cache, sm_scale, k_scale, v_scale) + return wrapper.run(q, kv_cache, sm_scale, k_scale, v_scale, output_baseline) baseline_mean, baseline_std = time_fn(baseline_decode) @@ -214,25 +214,39 @@ def write_results_to_csv(results, filename=None): max_seq_lens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072] all_results = [] - print("Running benchmark for kv_cache_dtype: bfloat16") print( - "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\tbaseline_std\tspeedup_percent" + "Running benchmark for q_dtype = bfloat16, kv_cache_dtype: bfloat16, " + "output_dtype: bfloat16" + ) + print( + "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t" + "baseline_std\tspeedup_percent" ) for max_seq_len in max_seq_lens: for bs in num_seqs: result = benchmark_decode( - bs, max_seq_len, dtype=torch.bfloat16, kv_cache_dtype="auto" + bs, + max_seq_len, + dtype=torch.bfloat16, + kv_cache_dtype="auto", ) all_results.append(result) - print("Running benchmark for q_dtype = bfloat16, kv_cache_dtype: fp8") print( - "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\tbaseline_std\tspeedup_percent" + "Running benchmark for q_dtype = bfloat16, kv_cache_dtype: fp8, " + "output_dtype: bfloat16" + ) + print( + "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t" + "baseline_std\tspeedup_percent" ) for max_seq_len in max_seq_lens: for bs in num_seqs: result = benchmark_decode( - bs, max_seq_len, dtype=torch.bfloat16, kv_cache_dtype="fp8" + bs, + max_seq_len, + dtype=torch.bfloat16, + kv_cache_dtype="fp8", ) all_results.append(result) diff --git a/tests/kernels/attention/test_flashinfer_trtllm_decode_attention.py b/tests/kernels/attention/test_flashinfer_trtllm_decode_attention.py index 96eee13695a..2e2130fab6a 100644 --- a/tests/kernels/attention/test_flashinfer_trtllm_decode_attention.py +++ b/tests/kernels/attention/test_flashinfer_trtllm_decode_attention.py @@ -113,27 +113,25 @@ def test_flashinfer_trtllm_decode_with_baseline( kv_data_type=dtype, logits_soft_cap=soft_cap) - output = wrapper.run(query, key_value_cache, scale) + output = 
torch.empty(query.shape, dtype=dtype) + wrapper.run(query, key_value_cache, scale, out=output) # TRTLLM Decode max_kv_len = max(kv_lens) kv_lens_tensor = torch.tensor(kv_lens, dtype=torch.int, device=query.device) - output_trtllm = flashinfer.decode.trtllm_batch_decode_with_kv_cache( + output_trtllm = torch.empty(query.shape, dtype=dtype) + flashinfer.decode.trtllm_batch_decode_with_kv_cache( query.contiguous(), key_value_cache, workspace_buffer, - num_query_heads, - num_kv_heads, - scale, block_tables, kv_lens_tensor, - block_size, max_kv_len, - "auto", - k_scale, - v_scale, + bmm1_scale=k_scale * scale, + bmm2_scale=v_scale, + out=output_trtllm, ) torch.testing.assert_close(output, output_trtllm, atol=1e-2, rtol=1e-2), \ diff --git a/vllm/attention/backends/flashinfer.py b/vllm/attention/backends/flashinfer.py index e6e60e75624..824ff8cca20 100644 --- a/vllm/attention/backends/flashinfer.py +++ b/vllm/attention/backends/flashinfer.py @@ -1104,7 +1104,12 @@ def forward( window_left = window_size[0] if window_size is not None else -1 prefill_output: Optional[torch.Tensor] = None - decode_output: Optional[torch.Tensor] = None + if num_decode_tokens > 0: + decode_output = torch.empty(decode_query.shape, + dtype=decode_query.dtype, + device=decode_query.device) + else: + decode_output = None stride_order = FlashInferBackend.get_kv_cache_stride_order() if prefill_meta := attn_metadata.prefill_metadata: # We will use flash attention for prefill @@ -1155,17 +1160,18 @@ def forward( num_decode_tokens, attn_metadata.max_decode_seq_len, kv_cache_dtype, attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, attn_metadata.head_dim): - decode_output = decode_meta.decode_wrapper.run( + decode_meta.decode_wrapper.run( decode_query, kv_cache.permute(*stride_order), k_scale=layer._k_scale_float, v_scale=layer._v_scale_float, + out=decode_output, ) else: workspace_buffer = ( - decode_meta.decode_wrapper._int_workspace_buffer) + decode_meta.decode_wrapper._float_workspace_buffer) assert FlashInferState.get_kv_cache_layout() == "HND" - decode_output = trtllm_batch_decode_with_kv_cache( + trtllm_batch_decode_with_kv_cache( query=decode_query, kv_cache=kv_cache.permute(*stride_order), workspace_buffer=workspace_buffer, @@ -1174,6 +1180,7 @@ def forward( max_seq_len=attn_metadata.max_decode_seq_len, bmm1_scale=layer._k_scale_float * softmax_scale, bmm2_scale=layer._v_scale_float, + out=decode_output, ) if prefill_output is None and decode_output is not None: diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index b72745ef156..775780807ea 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -194,7 +194,6 @@ class FlashInferMetadata: max_seq_len: int seq_lens: torch.Tensor block_table_tensor: torch.Tensor - workspace_buffer: torch.Tensor # For handling prefill decode split num_decodes: int @@ -473,7 +472,6 @@ def build(self, max_seq_len=max_seq_len, seq_lens=seq_lens, block_table_tensor=block_table_tensor, - workspace_buffer=self._get_workspace_buffer(), ) self._plan(num_prefills, num_decodes, attn_metadata) @@ -641,11 +639,11 @@ def forward( if decode_wrapper := attn_metadata.decode_wrapper: decode_query = query[:num_decode_tokens] assert decode_query.shape[0] == num_decode_tokens + assert decode_wrapper is not None if not FlashInferBackend.use_trtllm_decode_attention( attn_metadata.num_decodes, attn_metadata.max_seq_len, self.kv_cache_dtype, attn_metadata.num_qo_heads, attn_metadata.num_kv_heads, 
attn_metadata.head_dim): - assert decode_wrapper is not None assert decode_wrapper._window_left == window_left assert decode_wrapper._logits_soft_cap == (self.logits_soft_cap or 0.0) @@ -666,22 +664,24 @@ def forward( num_decode_tokens] seq_lens_decode = attn_metadata.seq_lens[: num_decode_tokens] + workspace_buffer = decode_wrapper._float_workspace_buffer assert get_kv_cache_layout() == "HND" assert decode_query.is_contiguous() assert kv_cache_permute.is_contiguous() assert block_tables_decode.is_contiguous() assert seq_lens_decode.is_contiguous() - - output[:num_decode_tokens] = ( - trtllm_batch_decode_with_kv_cache( - query=decode_query, - kv_cache=kv_cache_permute, - workspace_buffer=attn_metadata.workspace_buffer, - block_tables=block_tables_decode, - seq_lens=seq_lens_decode, - max_seq_len=attn_metadata.max_seq_len, - bmm1_scale=layer._k_scale_float * self.scale, - bmm2_scale=layer._v_scale_float, - )) + assert workspace_buffer.is_contiguous() + + trtllm_batch_decode_with_kv_cache( + query=decode_query, + kv_cache=kv_cache_permute, + workspace_buffer=workspace_buffer, + block_tables=block_tables_decode, + seq_lens=seq_lens_decode, + max_seq_len=attn_metadata.max_seq_len, + bmm1_scale=layer._k_scale_float * self.scale, + bmm2_scale=layer._v_scale_float, + out=output[:num_decode_tokens], + ) return output_padded From e84d4da3f743dc85ee5a0be0900698b52db9e6de Mon Sep 17 00:00:00 2001 From: David Xia Date: Tue, 29 Jul 2025 13:32:06 -0400 Subject: [PATCH 479/552] [Docs] use `uv` in GPU installation docs (#20277) Signed-off-by: David Xia Signed-off-by: x22x22 --- .../installation/gpu/cuda.inc.md | 84 ++++++++++--------- 1 file changed, 44 insertions(+), 40 deletions(-) diff --git a/docs/getting_started/installation/gpu/cuda.inc.md b/docs/getting_started/installation/gpu/cuda.inc.md index 5298c22c843..69a9842e471 100644 --- a/docs/getting_started/installation/gpu/cuda.inc.md +++ b/docs/getting_started/installation/gpu/cuda.inc.md @@ -20,16 +20,16 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I # --8<-- [end:set-up-using-python] # --8<-- [start:pre-built-wheels] -You can install vLLM using either `pip` or `uv pip`: - ```bash -# Install vLLM with CUDA 12.8. -# If you are using pip. -pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128 -# If you are using uv. uv pip install vllm --torch-backend=auto ``` +??? console "pip" + ```bash + # Install vLLM with CUDA 12.8. + pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128 + ``` + We recommend leveraging `uv` to [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`). If this doesn't work, try running `uv self update` to update `uv` first. !!! note @@ -50,36 +50,22 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on a x86 platform with CUDA 12 for every commit since `v0.5.3`. 
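The FlashInfer diffs above (benchmark, kernel test, V0 and V1 backends) all converge on one calling convention for `trtllm_batch_decode_with_kv_cache`: the caller allocates the output buffer and passes it via `out=`, and the K/V dequantization scales are folded into the `bmm1_scale`/`bmm2_scale` keywords instead of being passed positionally. A minimal sketch of that pattern as a standalone helper — the function name, shape comments, and layout assumptions are illustrative, not part of the patch:

```python
import torch
import flashinfer


def trtllm_decode(
    query: torch.Tensor,             # [num_tokens, num_qo_heads, head_dim]
    kv_cache: torch.Tensor,          # paged KV cache, HND layout assumed
    workspace_buffer: torch.Tensor,  # the decode wrapper's float workspace buffer
    block_tables: torch.Tensor,      # int32 [num_seqs, max_blocks_per_seq]
    seq_lens: torch.Tensor,          # int32 [num_seqs]
    max_seq_len: int,
    k_scale: float,
    v_scale: float,
    softmax_scale: float,
) -> torch.Tensor:
    # New convention: pre-allocate the output and let the kernel write into it
    # via `out=`. bmm1_scale fuses the K dequant scale with the softmax scale;
    # bmm2_scale carries the V dequant scale.
    out = torch.empty_like(query)
    flashinfer.decode.trtllm_batch_decode_with_kv_cache(
        query=query,
        kv_cache=kv_cache,
        workspace_buffer=workspace_buffer,
        block_tables=block_tables,
        seq_lens=seq_lens,
        max_seq_len=max_seq_len,
        bmm1_scale=k_scale * softmax_scale,
        bmm2_scale=v_scale,
        out=out,
    )
    return out
```

Passing `out=` also lets the backend write decode results directly into a slice of a larger tensor (as the V1 backend does with `output[:num_decode_tokens]`), avoiding an extra copy.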
-##### Install the latest code using `pip` - -```bash -pip install -U vllm \ - --pre \ - --extra-index-url https://wheels.vllm.ai/nightly -``` - -`--pre` is required for `pip` to consider pre-released versions. - -Another way to install the latest code is to use `uv`: - ```bash uv pip install -U vllm \ --torch-backend=auto \ --extra-index-url https://wheels.vllm.ai/nightly ``` -##### Install specific revisions using `pip` +??? console "pip" + ```bash + pip install -U vllm \ + --pre \ + --extra-index-url https://wheels.vllm.ai/nightly + ``` -If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL: + `--pre` is required for `pip` to consider pre-released versions. -```bash -export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch -pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl -``` - -Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before. - -##### Install specific revisions using `uv` +##### Install specific revisions If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL: @@ -92,17 +78,35 @@ uv pip install vllm \ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version. +??? note "pip" + If you want to access the wheels for previous commits (e.g. to bisect the behavior change, + performance regression), due to the limitation of `pip`, you have to specify the full URL of the + wheel file by embedding the commit hash in the URL: + + ```bash + export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch + pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl + ``` + + Note that the wheels are built with Python 3.8 ABI (see [PEP + 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible + with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a + placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in + the wheel metadata (the wheels listed in the extra index url have correct versions). 
Although we + don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the + wheels are still built with Python 3.8 ABI to keep the same wheel name as before. + # --8<-- [end:pre-built-wheels] # --8<-- [start:build-wheel-from-source] #### Set up using Python-only build (without compilation) -If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM: +If you only need to change Python code, you can build and install vLLM without compilation. Using `uv pip`'s [`--editable` flag](https://docs.astral.sh/uv/pip/packages/#editable-packages), changes you make to the code will be reflected when you run vLLM: ```bash git clone https://github.com/vllm-project/vllm.git cd vllm -VLLM_USE_PRECOMPILED=1 pip install --editable . +VLLM_USE_PRECOMPILED=1 uv pip install --editable . ``` This command will do the following: @@ -121,7 +125,7 @@ In case you see an error about wheel not found when running the above command, i ```bash export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl -pip install --editable . +uv pip install --editable . ``` You can find more information about vLLM's wheels in [install-the-latest-code][install-the-latest-code]. @@ -137,7 +141,7 @@ If you want to modify C++ or CUDA code, you'll need to build vLLM from source. T ```bash git clone https://github.com/vllm-project/vllm.git cd vllm -pip install -e . +uv pip install -e . ``` !!! tip @@ -152,14 +156,14 @@ pip install -e . The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`. !!! note "Faster Kernel Development" - For frequent C++/CUDA kernel changes, after the initial `pip install -e .` setup, consider using the [Incremental Compilation Workflow](../../contributing/incremental_build.md) for significantly faster rebuilds of only the modified kernel code. + For frequent C++/CUDA kernel changes, after the initial `uv pip install -e .` setup, consider using the [Incremental Compilation Workflow](../../contributing/incremental_build.md) for significantly faster rebuilds of only the modified kernel code. ##### Use an existing PyTorch installation -There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.: +There are scenarios where the PyTorch dependency cannot be easily installed with `uv`, e.g.: - Building vLLM with PyTorch nightly or a custom PyTorch build. -- Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run `pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124` to [install PyTorch nightly](https://pytorch.org/get-started/locally/), and then build vLLM on top of it. +- Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. 
You can run `uv pip install --index-url https://download.pytorch.org/whl/nightly/cu128 torch torchvision torchaudio` to [install PyTorch nightly](https://pytorch.org/get-started/locally/) and then build vLLM on top of it. To build vLLM using an existing PyTorch installation: @@ -167,8 +171,8 @@ To build vLLM using an existing PyTorch installation: git clone https://github.com/vllm-project/vllm.git cd vllm python use_existing_torch.py -pip install -r requirements/build.txt -pip install --no-build-isolation -e . +uv pip install -r requirements/build.txt +uv pip install --no-build-isolation -e . ``` ##### Use the local cutlass for compilation @@ -179,7 +183,7 @@ To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to po ```bash git clone https://github.com/vllm-project/vllm.git cd vllm -VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e . +VLLM_CUTLASS_SRC_DIR=/path/to/cutlass uv pip install -e . ``` ##### Troubleshooting @@ -189,7 +193,7 @@ to be run simultaneously, via the environment variable `MAX_JOBS`. For example: ```bash export MAX_JOBS=6 -pip install -e . +uv pip install -e . ``` This is especially useful when you are building on less powerful machines. For example, when you use WSL it only [assigns 50% of the total memory by default](https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings), so using `export MAX_JOBS=1` can avoid compiling multiple files simultaneously and running out of memory. @@ -228,7 +232,7 @@ Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing: ```bash export VLLM_TARGET_DEVICE=empty -pip install -e . +uv pip install -e . ``` # --8<-- [end:build-wheel-from-source] From 4e37a043835a7583103870a8d25d9145b19bdabf Mon Sep 17 00:00:00 2001 From: Varun Sundar Rabindranath Date: Tue, 29 Jul 2025 23:02:30 +0530 Subject: [PATCH 480/552] [Doc] Add FusedMoE Modular Kernel Documentation (#21623) Signed-off-by: Varun Sundar Rabindranath Co-authored-by: Varun Sundar Rabindranath Signed-off-by: x22x22 --- .../fused_experts_blocks.png | Bin 0 -> 191037 bytes .../fused_moe_batched.png | Bin 0 -> 193655 bytes .../fused_moe_non_batched.png | Bin 0 -> 232056 bytes .../prepare_and_finalize_blocks.png | Bin 0 -> 130810 bytes docs/design/fused_moe_modular_kernel.md | 236 ++++++++++++++++++ 5 files changed, 236 insertions(+) create mode 100644 docs/assets/design/fused_moe_modular_kernel/fused_experts_blocks.png create mode 100644 docs/assets/design/fused_moe_modular_kernel/fused_moe_batched.png create mode 100644 docs/assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png create mode 100644 docs/assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png create mode 100644 docs/design/fused_moe_modular_kernel.md diff --git a/docs/assets/design/fused_moe_modular_kernel/fused_experts_blocks.png b/docs/assets/design/fused_moe_modular_kernel/fused_experts_blocks.png new file mode 100644 index 0000000000000000000000000000000000000000..5721d5582c7f14d89e1bcd7defc58fe1669442e0 GIT binary patch literal 191037 zcmeEv1zc3w`?n$(pjaRlimgZwjf5y*0E!3*5>kWYkTdiMiUA5X3?V8?hdA^QVt@hz zN)N3FNDPQ{zvs@(MO1Wm-`$_YLc=TiWqh8QX(P;J&_zu@$X>63W@s!a{$WfV>b7AGozsN#D@a${uBRdK*F( zd_QPqZ)^cR!DaCCh${GT0{r9Wu6 zZ)$6dL^*kXkX^$QSOXQ|KZ)!w+_QXLGZvKOErkZm559rIO8Xi(4h zqb0AX#tnEAVo$Wbg#(S`)C+d@PA~`Cq8zM@jG-Ok2f~6mZEA0KR`xTlBPsuZ>vs0GD0AZ(cMyAp1QF0cw22wp z+8e`+OTA>R|Ct^X68(5nK5vbJG0M`|-qs1Mm)0Xfgho)BAOKmXUBG0a_PS4~i zn5Fb-ylgVvIYiIjv)`Z%ja`1c07xf(#Z|LzC5>JFuWqHV2#*kzL1>#P3~%lidno|7 
znz@?@c-#M$-IPUWD4U;lP(ENUuZM6vp`wa*M8SH0h8!9j0VPNiOO&nsX_N`dO5Z~6 z^Htf;k_t?U?B_O$C=?J=iujrt+uJ+QG`hZnJqom&A-zD)*gBm6P5A(o5VDl7n$TDH z%YI03K7WHK4SoBLU<&Zrp&V=}ngX<#b(u8%W3O*(0t`DvkI&kUX0ad|P+R@~O1`bJ zg}%KhdS)*GOZ*)|JnK4V)5pRVfHJa1rs!<~!Y0tK;FEF*QmR1NeC82h8wbcpQM4_$ z9mQ@#5QAUH8tfbN80|jX(f*@);6Al8ppK~heK`iWJj=+lT>8_K0Ta=;1bixMWnc$= z?38z~Gd5B}$$=G^T3I_lOWkV-_73bH%E}G^wLdg*NG*S+lmM{&)({f_Yoc#qVQhhd z;3XK<+St|s{AKR248ATDnFO8xA{C$QV}|2pD?piXGUzW*z=C`~*5rWU2S zC;xgaN~6aYQ0OmGnJNzYR`#aO6dm+k4fK_+`O+@`K?KdGFUBwON0f!Yei31SPJfcJ zfWa{hbP2*pYNoP)f!G(y;%^1ev)W!7e4K@A{ukzsYI{MfLeI#0o4`JxHV^+vlLia` zWc+8~IeiOL6DT}k2#_3P(PSYrVG11Sy|f#arbb4PpEs*T`;u*chUpQ7Ifcee)Bf2^ zZbHmrdf;onj^z>lzhQZ37UefB4?oR2{fVaM_aGn`2ko`B)raB&F#lMiK>W}S+QQ%8 z3*48b<5B2o&>JnvFddkpaka20je~zXNLcfpQRTN z5&kl~A_Q~sOx+B-3NtM|+zytlpBgEfm32g^x_K7X`Cl1&8^B9nS$fFg!)#?>3x4~; zwg20_aR{F1QS%sIOr3{^=a0e^D)OF%w6iYzi*{cD>3;#j`3f%!!N_H%6u}ZRQ;J~8 z`gNp;mNWWEQnZ&ULZ6ZHFD^ebg@-!n|0L&k!YK&iOv8$46gBM&z$iy-I^aK3c0fj# z5`&s4JFtxXI3s4fOeqfk*fIQ@wr8a?Vt6{J$PL;0GHA?cI<09Oi>R@3QHcnCQVMTOSnu zOh?pyd7JiIUqW=vp(V!MmK?Rxi zwx*z{l8>5=2Z4seRK=}49b=eP;e65kSDNKtj1WYGzEU~Dv_SGql>>+AzQ6~+Ua}4L z(|*i(Fq8dh`=zABuZNk3G{fl37O;jz{}&SJ@8^B~w0+sDG^=@`aE&|^XA7Q>1_@^ThG_Dbawf+pQ;fGn|7q+{F~oF2 z??(Zh`#V3+XwYoIZyODoKlSf78Z?T`qB3Tfgs($t5IFoF zh18V40HL1~Bhoe1I!D0WI`Uf_7zmVbjeqi@c8!ja!f!Pe# z56aiS(uDtf|3*kulm}*n9~mnlejb5;L_rEX%fHFV`FZb~(#+6r+bdeI;NNerXcYOe zy;8IKTsksck^?Q1D?XV{0x40Als(GY{6BLAS?}2^ z;y}R&{4zTnWf7b9t7e)kVOm6Urpbb}*DqwUXtVsgRM1b0v(PllZ>k_TDe~`ELEjI| z&QK=DQPv4rP3Zag;5gMxnHQW*=4mOMUr6Tv-q`9-gB`Fa|F+E2;s*bEnTILzW7u)b z*vJ8#A3_uFpX#grA_6b+)e9rwh$5P6H&fdAW^pn^X{_|?NIN*!>2C;%QOEq|)F@DN z^jnA1AZQf)ZK?ggk`aL^GDB)Vzbv^AWock)1*G*WcmL0rT|Pcha9YZ2W|xmo2%L%n z?m+R383%|>A1(IXeg7YxmGU|7Ia4}e=lTo${ma>6#P14Rg7*SH*HKx2?Vf$@5q`x9 z{~fyXJzq`)93Kh0WH7E1fK~rYUJ{&6lFa0$UoQX&pD{T@Q^02Y3y4mQLjUQQ1XIT0hqbqKqvft23rciFSN~MN*Ln(LI8vYlYSCYVkmXa#zz0~ zBn1-ug?3j+2=R@&oYf76GdRCEU;eHbEUZ8Ow6c`@Y{O?U;MDGa+S@$;Xx%6Z>pj?J z^TQWrD97n@@qTe8{I>4(8z^sVWdu%$p`4g(XrXUsXKD!S(u@PAzJ&V5j;8k1(?t0M z1VKaSyCyh}3Y@zQ3Tvp}KfiJeZtjQ32pLGIC3pvR`Xol^+IMZ;Z~Aexh5h-M%8X5g zvDwT$g|GeYGe&=W5)KsBga4o9*q`Ya&S<e;l9To&3ae#3CzCa%! z+1`!*QaryaP=BSNhB*Z0m|4K%Y-PZ(D8Y>N_a7RBnC-liulDPgRFD5QwI@syRT@VL z^9k|@(DnxAmjAuxn{Cv8tn!0PD!*`x{$p*RA7M50lPo&??k3FI{9-?Ywc+xBAB_6X zW$oDtxc?k$PoI$w;Z*30CI!+?ADI6c(Ly(;A?>Sqe=Y-mzvljbk@AwaML`F&p?-gKVfH*1 z_3A7${KMro|AKIZ2vibAv7@kp6@b&SGr$v!B4!5vU}FHY@!ub@fJ*?sKm>m*-zp&R zoxdCeN3aaQFO-9e;Xl>`qPEZpgL0H}{_LjHvVc)#El`G_o_m@Jryreh?aSQ5SIYm- zu-892%OD1;Wg1Tk!9aNiOA5l*X7VJg-+v*VoGps>T@CbmL$Baq%h~_H70g=Gy!M-H z`Ulu^y6G3X^DCzP9CC#+9m&Y!5!tWL=$Za&1T&av`mEiV%mgdFUx%55 zW^?X;A~PxJJ5i1~{g0`{U^g1JnzL1@{%u@ENiKco@3+EA(IF zDSM{rawx;BeS(MPf8Cza$HMJTeDSB0_jt`R+EszMp7@*GqGfWkjJ9yQjP}Feg`&Z^s?kStfG;2ksTD z3U1&0Ejbs`ujCMAYJILc+hlfudxuG$2%T@JDKp386rgo(up{GalQ9POcAc~i_nC!m z%M{NvaRsbAvT2sdJb>;QGZot}pJfX@7cC2BuzR;_w#nQC_a0j0;&>ps$$X-3Ejf9L ztg-q=<|Y2H{fwEysd3zTSz4Rh@*dE3gu@9;@zqIHo?;h34T=&9uj*go`ja9BM(-Fcg}wxHpDMN{&`K7m<WlGm2+B`&+ zNSVtFk&$bfUmSk(Iw}%({{=Fdx`wUL&gwH2*i#-V-6kVPw2#ARn;e!Srp2#smZ!0; zC&GOm^?~On&Ugp%>t?bn3GoixrDYNBP2nY(r$7i+q})|p0P!7%4&Cu`#VY4~hm_;x zmpcpit-f8t@0-vTFroffhhURKazxRD4TcKePUtu^A)U0FOM<|JrkVSz2}wc|D$&Y! 
z3I`MF>)=O!J0V4gLS*lD5Py4etEVn`>Q5&ofk~# zeS*yj<=cS;~0Krwh4_u6q;PSL;pPZe99@X;9Ck6 z(E|$M84kBw0}7RTnbgcip}%ZGBxpi|quY(q;IrV6z|wCgbQzjZ*o#BGN5O>Ny)fyR zbwb>LLPiOwV@L%#;)B?&eX~&L07N0JSeB)?0EKj2!+mF?&|fy8-B$;_12sy7PR5Vk z@02jVPkN3ac1yqNig<;~F|Vn#c;ezS#XLDq?5a&I>J}@u81+w|P+ptLziwE+HKf*$ z&&AHjswbk|l+!HSG#R&>Y9wE=0V_4R7O@W^lgGJ*53=sJeq_LT1&`Xah9<8ItcVUB z$-z!EbQSm_%klz~PopY_)bKfNIJY6q)*kb6p_nq!J}h?1i_|n=H>R?lug>BmTUwOc zh-^5oRt4CL_x>>2IJg1~Rca={Mj2{96BsJo6Hh`AmB)5)l^G;t-6AE1X`iWl()DZ# zcjZF{pXgY}HvH}sz2O_L@jIkYDF#dMKD9WdwJXQl1nkdi5}k2vQ+ImMp+dI$_5Dp? zX$OC6X-KtP+->xPGBPXlRTp1p8~V18Z9ZpmbF8g(xXA)Pk(gBU_EV~_S3R(Ff227i zq4Rn$o0KVy(64W93y|a&eOcS?a^5?z{iGd|CY{$A78YO6%qUtyQK%|(rU{Yh51JHx zTzv_6ivb7XdCN|^q?{P$V&Uo+UBf$yS_x<3a%z(Y^AmF2H)Mr6yscA4E*&7c5^5if z`?}2UUTDsQ4gP=!_s}<<6tdQ(0sj?Cz#F%_UJ$wsoiT*9z7Am*3 zLo86`j5tYQfdEE`+e$qVD2-|sd^$EDF-hXw+dWW(gcfE3v327fIPM(8b#f1(>#x zu6xgW|F|wr92spr`6-0dzJ$%K&tS+TCSD)iHy3HBJ(YBM4y(|}HX3M_DFv4DddmGG zs-?UJZEX7j6LEo&-pHaA3GdPVLL@;uEGhO9a-WFEn1ge#Cw@zhusP-~iAf_s()!bV z>6m%vgDhQwNjv<4bgIOzNvCISmy;uARy!`DfxgXL0Or1Zp?{jXm|vG>LgL;049(i zO)hB^$lqtWW-h1Mr^7UDC6<_U>G3sn(5*Xv1o;28g1IKQW z)C44TvDJ$k)TwYRTMFR0W9-VX{g5R;kO)ZQlQUc!db%>Iu{0;f8^>63Iv3KSpHVu8 zE`WBuTDFJKg(DrbHz3Zt+v0?_2;v*Fn;RTac}pS-IygnJ4H#w^_Wjz-Xe-Jp|Cqh&69 zVDR~$!R|$W0O2z$ecn(Y{DY%+>sW;}bSo^b3=p?xuC#Q$(&wiaCYiU@&(}Nfpsr3U z)p?R}2C`jB7%W;3bsXRU z6|R0z51(4^NYN!Jq#{&Nu~p)*WAXU&*thaXX(fJ%u!E0!bw|TUoDR3gyU6v#hD^)e zoE+zP(t{!5d8_5Ip|YXT&8=nj2z-Jrm(EpSo@F(z4ABUnIwYZS+DwR39)!wu&*a)qNz32`B5)(kR_yVLnG8D#$nzl1sYLhLdcSO^Qr<$Y8>|M1)Y{rRCmfdyP~)Phje{x%J?XD?U5fE-2ae zYV@s+)uv*e2u`!s!JsBnsw!M|(GP5@Yo+%!cvFK}g;~km?R-2`IwX*xWv1jV?WH@?kWymVJlhd-ljUdkMaNkwtZE`OBK7h^aYwK zcr^w5%215LMxABo{A;+}(OiNLo_ToY^j(SD)oE8qy%nn_C)RAl#kD_8 zZQ!|jOs1kaTk-V6IZ-%?r;&!YHm22Eil7VW&Bh4LcNnu+wpT=(s5dekV7zgDTwSt*3#vNlDonJV=}e`G-g{?8mz;|pasXCGMo^17JuTVpdauwD@S{E zcvzriMgV)Cb?d^8>K!^C-;ZkNd%4zZ3x9odVy9?K_Wd#Sh!9V?;9mdSX$OKWdY8A>@dTwTIb?Ws*-82DwUr|nT zY;^i=!J>wW+Y^ca-Z1Wryy}_f9?wKH6XKK5AMwh?bst*y*7s#J`+K^AseVD1iR~hx zs<$@;>onxXu2&F>id@r#E!yKr!LD=Y5?O%eZF$Upnxc6H7v3b?Z(WW_Hw#>VtACVA zQt&f+KGCE|NM&&~PF)j!{o$P$0Z}^zrm8v)d~<9vQ(+Pzrp6X&=Z+>kZ4ekMb?a#{ zN6am~n}E=p_~hlKr{}U8&#L}vy=l3XpdEoYnr=gSC6-& zeM8h6$s5Lm?e^H;Wg^Wd7+!I=Mz?H47N$2v4>2;0m5chZ-xAv{?wX242gfCSa5J4O z!<+DoY@Uqv#Lwj^2Nt=Nd_%8bJ)$C%+$Jkd7A+J;y=kbAX+_^?7wsHyk>eE&yN_Hu znX}xEV&h~uvVdxB*Kt8oK>RWr4_l4$J>84&SF}$(c+}dgfH<+YTQr7Kw2bGmntEYZ z^fA?CDajI(>33@PR^BykEy>HW7 zDIe!4j@`g{`mW(+#=A*&N)j)}n-t$xZtPAo)Dw}-o37wG@R-Pso(aEYOx)Wt)fTSWggSGYV<2ejCrlo?0Gah}qDB_DY{~5Oa0c__2lI z;S%NcuK4ICMK5~h9JjaP^=_9*FtUj2|u#`yfR-cMVW$V^eco*T6QO zSH@a>Qs@*tb)Cm|9WC|_T>DzMTlvwxx-2c!L?~&B{mvSLtL!F>Q7r21vJ*$@;*%?_ zQg3%}>^qB4dscpJznYKh)LhK|QZl2e(82nZXe0uEXDQC|u^-cg?E!8deC>s8{NHu% z4D;PLh*1zDq;r@rz_eBCt;ESEY z0Uv=&GaC*Nsw@K{`;?Q9<|RZg+i4=Fv_wH|7jGa^c;JSj+=1kfu^@-^YAzRr<1P0@ zjt?keNI6dhgtGK%$*o#Zw>3kt(R!nV(s2x{XFVB>$s)^&FlZ|fWGp5^R=I0R#9GkM z4?h$dJ=`whb-Lzl>ji!P$$+Rc=?_}jc1Mm6X`~kfFs7Z)uue!|%4Ydw-1rck8S^eZ zw;0Eco@6(P?v7%-W5&o5n*FkCTWX)8W#3t%eMHbx0R{_!H!3B(+6?6}!8r7A`*4JB zUiLdQvEHukdcOAMoO}0?{468$X<&1|8t@OIcRx#`I8(=XAzqQn#!6SeAQx><-x826 zNTyeP6jrYVHih^wL^UG~>nRc4uGilq9kV5vbLZW9z0t7(C2Y=liJGVojY+w&^vem+ zCTgO0FEK6hFE_c@s8>0r&=9hV`R?6ggy!R1`UfTQSguYq(LF1$BDo)atXC;&JK_R> z%l#L47+Q+}%bXUsx|tFYJ>VsizT0f{C1dLRv{N017_o_k7}x8o;$_g*!Y}G;R}IL^ z5hD|pHB*D2b39}LCS_!OPqR_7%OL#MwX1zH!M9sEQMI1`#Wu1PqX);W(0;$0X8D@y zA=DK9$)w-`6+j-i1++r+?Vwi_(DYmmLc+Tx+8g*~@gK30ww1j_$JQru5)&jSt0uPu0XKxh}C7!qu}hTS{^$=q=wu5()M^m+>%OSmX_(I^K6mp0E9Pqw+Tjj ziD2TF+5GNpfgFdesvMl4%P2zgE+nVGqb3v{5fn5bC^lRblp++9b)FP)1liC&7D&=M 
z88D!`X;|cTzfs6`q>>0?tm8!yyOj=)K&{gml0yhl$|>_>4=a*GIeq(~1+xYZpo)rhLD?Fv(1oSJ-`&G&u* zKF3`|qR{J-jXBWawch-?v^Bp}0NlG#s9`TP0KVTN5)5h|O%xH4Ad|iHaFe%4cLzMN z$sl&I#dmkC!C+^9Q{%$5!R_g48ob4+XD?gzY(j=8BM5yv@l`11PxAJ6fOyzfdR~K_ zTt~2~x1>+T)B!)OIG`m*JgX3pEM3!eTU~+t9(u7 z9D8BWrHb@R*D)0rCsmT?1B^zx=?4X!+?qM?^6PMt4*IP3uJMHK^ZGmU+ks7uCCH| zT@&~E!&9ZALD4FU8*T!Gn54o(mSt{4U-u^#4SF~}rM<^P@{lVV9X3@JBByNTRv$_( zl5)xGx>iPH&dz1foDyEwBpb~XVU+!*#7j>+FRI>PloH%{hYk-Btnorh7}3RCGOW8+ z!d;K(&Wg}!5`M;@P6#npuVTLu{Lc5fZ}w)p;G{4nqJ2+wRYE|BM6seo9ZSl@NDyf> zL}1e$Y!W_)DPM4v5QA|Hp>R?o%CNO3TH>P?o|$L=*rXjE*BF>={Wc&QGj$2cs~h5k z`f#~OVj}oj>oR|!P10gnnyH8fXcOR~F}`W0WvFDA9XKXOUGT6lpP4EmG(opxYlj~I}go4b`$p#4z5D@<~w|~ zr$z7!bb9BKSJ$TJ)Ow*4NkLnPRv3x=;^AtG$vVYGvq`V}OrVB1@CwGJ@D!(Zx!(o9bRzu?0%6nraaCGHA;Pi%ec?)P1URNWE1mMa9Y_ z=&3mQR2j1%gUwPxVsupXULVV%Lk*5gM}DevwtDsP(aYe>zb9XwKg^tx7yQ^L_?>HhyoQ}kR5baaQ0*Jh4qOvfH630Q{_iBYm=e|h`(YI^k z+NxDdejc5D9q7P79Rh&_g1H6PcT{N$ik9V!s0sQfH7k~UIE}Q#h}2x^ei+e?j3(K zmcMU;4>&5xc2hd}^_}5{_suePkX0~>$H_8mw;e=<@%xk|THd_|L81$42A5%pTMQ)5 zVCSbLRF&hs_38ufz>J%E{yo^?N2GR5NZR#6o4IwjdntjiI8uEwOJZjC-ys}ez|Wmy*8ej>`(-wYGzDU{q$l0hg# z7u3y-pj&8j8ffEZoc!`MZM^e5ut1-&h1?2ASo*Y2sil(MDX`16G@1{-JX7>ySb~5x z!;ls%4U9{4Z&gOHpFF;>wWsNrnzAIet{KNP;*gIE^wmpWE1U<{0(9smniU#H8w@%q z?5Q}eY4^khi8smc8P++GqO}FHwG=oH(P9<#ZsBxu#Dz1)NH~dn`E+(US_@$ z_D$r06Wg20y3?XUjiS$*EvRLe6v*fHXAL$LA%*J@rLa79`|;`PFcKP?<&C}JHqp`H z(U%6JR32I)_%!eE4);Xz6PsY*j z`jBzrfm!G8D6#I2!P#f4lvGS+zsbks%{A(lYy7hNNN016o|x?KVf zz7;5@&FNf=DK!loSLjURBXM~HzMa8fT~)UX>uBq`3dQZD%rYrR>Ws`aC;)79REo_T zfW&?JEn&1~%k+VwSjfw~kAi#;_<_4+_)42QyI6&qGo3H9Q)V-l{?l>}(L#kQ7%IQ+ zd<-GY#RYA$%kSPt?u>P1p*8z5)ZPx=b*>aIMu4RRYxpHnKqa7`6u?6xk-f~+h0Co1 z&93bo(cTHI)a#ZXywY>fN;5?lQNBy>fOVBA-K!lxqe9VmBq=j98h0sT`zkL9}S&r^XqGZfM{3~81+3_|*D3~Qum{PBJZD9Q0#+e1Z_ zN4J7zs}@}^ImZd2@kgCTL}|_Tp8&e$sn4_5l)d_1zSqMQ@bD*j)&o*2qakK% zX=$FF*|Ndi8##Fg12%!rN8UwL5&VYPJRxQ~^KyqQ8{m@<&+Vxxjj#8>hzaZgsWa#% z2{`S$cE8#~WmOrDul4|J6@KAG*Uk3O9w;j=n0;y~(A2WB58~Lt=i}=|^JbTWzjSJJ zkce_{B>3+KOt(!xeC4d1yHx|+eWg>om3SFsQM8m=;eg^-oVycjDBWDAtgVndEJB39 zzQ~tCl@9JIcGqk-MS@Q_XAIW*l~Bn*yteRV=Qc>94+gZ;&noY)SixP}D&d0&5c0dO zej^9&EZ2(}Su~EL+kpfN(VcH1rK2igW0UCyxc@ZJ{|n=Xt9Q`) zXOZJ5&LxR=#u_%~`{_NuXIYz48Kt^IqAl3Y;+C3v@1)gW7fC%vJ?=%Ox}KXuXX!&9 z9oOC$>Zw-s!F#Y%6POrHeE3jtM5w#jI;%AOyF*Jwo93KEWes#?_7&r^25Sm8xb>^l zqH-p>vd6klc1>)Mbgt%3tur8WJ@%0tC|C8!)~nd3)oh-SRp~s?acfBAv0khWA%&bX zHHj`$5>XrP)x%C0*Jn8NwU*^zv9+#!L~<9tBxR}e_-iIIE&t7N74i*?Vz=Po-1Qkc zo2A7p)$3il5*sW`h=Inu*~9m*tU=_}q@sGflo*TL$H*cl9V*l;W8zXC7l%l=lysEz zcLkxcO&4r*eN^w%7BcsRfM2~`i6r{<-UUumsdV@Eo}JI#H&LaBtzTDlbC+tYN|=1Z zi4E9sn;uEBY}x9Ga2mkL zcScNVS*-k(wc+(9E4M2?YPo&VDynDqPE}F#P(8+_#CU;8jZHXT0qTg<$OAIisfHYR zG0_|M(0TRfTFXkj$q~>+C+B(|LYKfgar;Q6J9|j#C-l9Zr0ls!ldm^`)qT2a_ylOe zvT!elX)f;Kf~J>1nuc6YNY$|;3JF`*2%ZF=^A>dm5d($#(1B5Y zh>|^Jy6z*!ov-gbThTbw_(nw@TNN}J<(@ni=g?U>_+;MFbtwUulKcPgZoWFQ=6B%TF z@JocNpGf#d)~g>j6TcpHS5d#T#yo*68IIMs8@{X~7GL~06gx>`IW&?lctpH?t4nu! 
zTV~>dMaA|;wl|c76B7;do%dE2>`v(N(lT?mZf%SSmmt~49-Gj!eqK=HF)N2If=b&xd^5ggP}!erVtpA{QNd@*q6yMvVjO1MU-#o-oyRlt&r%WDVUJ@_rhP${xwfo|+u<_N2d+ z^U&I?>4`^!74OtzS7}kgP=ia>N8-i{SjE1xOQk;2rQZb!#nojGC&YzE8mA_!)_I~m zcBMtH4>n%lSR)o;z|fp`r4N0PJ7$+?-@Jp8F}7*8qyKz2_qm{1CON& z8)4AB7#Kf;M+X+emdSg^)$%6Ad{#AX)fe-U&K`d2%W2h}RHn@8nBj`8wMbg5+jRWiOrO3nRng&-@BTki`tR&37#ljuEBAC>Oi z%D!h9n2zpS(bSEcQYMEbaQfwq{xZvwxl`}Ee z5T6)=X>h`=a<>lG!9)*7xx70rr)n7QdL-D_+jqTKDT`<+j0GkkTc@`V37bc;g}iRn#aDamu%Ijh;==VSP<^5;1J>OYZRZ_;Ovt*+2XI~d=KZD5FJP-PPQ%U zRn7r|?62z_rQvZ5;~NKL7TpmQ;f#rFSrN?cP_Cl;u&_&Zf%N2X80LY1OFj;L=G6;S zv{Xi|6+v@w!u&8^#X7v{EaZ45FgX|9Gl|fy$HQms|V<*{k+6 zUq*PZ3XXCP_jNPFR7R?Vtx_4>ZB!h<|E!}TCbgPzz3t0Q4aVbY>=os{2=;+;6@_~C z!KSlIo#zI1Zz|j-+SjA&LQbx-05v4Nd3SwA01x%ZdE(jhT;Cvxfoe8J$2Wy=oC$@cm#7XYV;vpd{_9`tK(V6{F^$J6AP65 z`^|^c1|T?Lk&8X(%Sm2QcwDh)H_2Tv)PKAg&JYIQZ&l~lVmO?!V$dWsj$z~OgCnv^ShVHb~ zve>s@G*pwD!y+0bJrL@)xU}c6{&l*nQ8P62;}`mu~D<%P3zhq%Oo@GHUgj z{p2JwB1eyv?X}9Qd;2apd(AIz19k=_WLD0{&a0;SGV)1+CLY;fQ#v}(=`s`>^+-ld zDHrFx?D*uEM0Xl6I9+@wg*$X&Qvs)WUa)=AS=Rs)zZ5`@TX{HLnATP z#JXV4a>vDGek$0W;2P8|raE^H`!-^7i+~h^S50C#hoZt^e2^zX?fLOX552<}StexZ zChanc^e}PlpQdDMmWb3f)z=Ce7qFmqtfe^A-viL|JAr0Py(8(uqt;C_2^|2OmO8>J zwnHh<2w0_8?T8m^_0ZPp6?2m8lvxEFE}R5_$9ApZ;IqC~VqZg*O-VzROWOpyE{wrW zgUUELbIH|9X#8{zD#sJBeps>*JKme}&U<$au*-*Pd!@6ucZ&u|W3!1kM1Jih^#hPf zwEWnV+rWRke%~~E8UYGzGi1lXp)`6>N5QmE?LmIh%FcwG$xq#@cPo^!BgF?vclVw> zN~l`T#&zP8z84S* zR;!n6u=C-O1g7!O_#qC)hER%Ud>MRuF1;r}o(#9%iP3sIa1M;P`S!Jg({Ohifc?8? z_+`g|dcOa7#bwyjTy`-VJknovaXB0gSq|{B_r+J!uj_c61C!e3>VexpqbwIE2q5uN z14xkj!4V01IJ)Z&%v}2+(iyD5(bL*<99;pB;B~&TMYIby58s8b5-{}DAKyNUJDF<5 zswTA2v1u;g+UBlD&zSLrJcL;7XJL9Uq2_!UAONDs!ovrc%)Io$rHUxk{6HadaHgz=5K(Ykye)TK&)L`v9Rs?muZVb;b0$F(#^fV0bBgR9(J!D zvPEWgUam(z)SKr_<8t7PLN8OLvKnT>b4WeYExwN5;2gWt`)hTUw^KLs(l&AqT|3zO z!zEe>RVo>e@iz5{8>VV@U>WC3^M?nRD;%dKnC;g)Z$N-gS6rG~alT-ehK|HyyBf3uj&*b$G-phUK|do)rVaSpmbKZ5N|Zx8flSI; zWE4QP$0+e>b#f6rE$8a7n!@xl8Q_`HtVtW3WUl7S2yHx|NV<3L9xC;=0Ufb>ldw(; zH*kYR*x`+kc@Cc5oGr7M5}4(v2hX(YPNfyyY2xD9{X!IW$TfhFXuojWK=J53R)7`X zS{4-l2~eak$d?Un1-;d`T#w^C#f7GM`B-h_5$jd{w+w<|ricNAi8t6oi$^R1;~%=H zq~QeC)S{H{Lu<9n28?)-ciA%vgCBs1nNcOpgWnHlLdSYiLiUNpP%AA7-I=pUlJ zAoRR0E-^$+&x&CTS-D`aFceXm6|m?IP_v$F{!!G|Vz1f)J#}+xPJk4&&oW62?^Xv_ z=tgCS^oLI2lRK;o3e`?#(>Yf5sFEu&V>LS(vyL9?zfx6~i<}w=V{3@2&ulab`q;cf zcsN~$X+wUFy6I`+R7H(c=w!wOe~*YnSaVqA{RU)Z^W=60tc3DLU_{Ey_|<8e?WG*h zMr_=7sBgU9eNJb?RJ^v_!YTseTFtU^LaYZfW}6ozjGEM|d_38j(s5{@tB0Yls}eI> z6VupmQsC9ep{e5S6M~h+?%~bSxg?gJ#uB#_)m+1B>wxUSd6 zTm^%W_%td1J164X*t&fTbE8<;aB%Co!GcY~1OB{{iML{^9#7Uve;~__nHFQvpTdG} zD?SEgMTwE77f#%Z1HYjX(#8v-ByJzFR{y|z>ASBZTvP?MNP+%6BiYW!6BnIg_}8LX@sP?a6f;v06XtI#cBMqw^7Ev~KhWWd%c4%LMjULXan-1uoycLLj%xB12r?Tv-pshL-T zdhh5A$(P)xcX>FvaXHAiK4_;V-saLD0=7zh#dTB{Tnk`kGfG8|z*7_jZ@C|i?McDZ z-VBvaMT9nF7iexER1BmTaypnjb>Pb2Wv1Zn$)@!^BgCNY_6d?O^^0WUtqkNZC_b ztA=^*W*+GIbdyVFJ2_?S-31ZtWx}=N28$-&C|{9N%5A7ZG~yjPlI@0`s-MKr}47Ivdg)cTO0MU~XBCTvh|sT;C% z%Sb(Jo>Z$cII=W>8zI`6IKrCJ8(2%Y8taq#Oc4`dsu6^(D_dHe=RWmV=TqsVZfJys zi+{2%K1^po3*&+2WvY%~fV8-|28&do3)DkPT9oTyu+ZR}g)LQlo z_HF1)ZxKpy=(9rS2Y58Z<3Wvf38}r^yQ%AS3R^k{d7+Ep(|N-0;#Ou3P2nr&RvP;3 z*0J&Q%aW)RTt(hrR-L@>h(z~%^m?`4A%>|lhx&mUTd}tcc2I8um=pW{y*(AZ%2-1T z!g!N~>&nC=D;qsoqh%{$YV25T{EK^oFC+YGZt7*9;1aF9->zF}-npC=1U#-u3X8sB zy4ew5`jXwtq8ojtLYqsJ&>PnQj9P)@rtc-uP zVQ>P~Os+FQ$BS-{Vm~brYnP@Wrg~3jB2Xzy%9Gw?9(p)5+3ZoW&PWHAd^W1&xMSVp z#^YmllRcH4_ii^-+aK@i-%+bEKA}{0j8T|BvOSKkeK6Jy;u8Ssw220JPn z*7A(Dqdd@#nuaW~{8$`jIjARH8|9ZAvoO^Dc%srX-@(BLDh*cr3`fNUqQ){gH&z_? 
zIz)c=$?wo${on>vJ}Edor&?ZhV3x=4uwA zU8PJ?*GxPxa*H=(xmkRg)(igU#I-$UJ5n~9qLf^#!yZ>V_0ISp+;6seeJ98cXJNc+Jv)fBP%S*s|&Nd zZ-X~Gh8CEd%prH%F)-ZjL*v{=o}FYhTrX{YIaR_UM;O#XP31ek#rgVJl}6qeDk9w2 zh!1M)HYCJ17Vg%)TR0_@a8^Vz;hisn;S3|4L+_MAnY@eKN_$1MzE=%P@_q4FC-a}d zX-rV<0HO`7bFUqtdbG=K%hY)y#A0$WUqpB8joEZAV{=UN9U^RvKEwJ z9qW_K71em8CnmS`tf{|L88;w9cuQ=~luBxjbL*SPJfk?|t%Dyj5q_hV>k5ofy=X$b zZF`=gQ|UUx9CAVK$2Mcls`!-5DGSI7Q-b=~)IB>u`qedRa zA*_b&8NH^u4;6)o6PHR7dO6Y2?NSK#=FC(5cUYU#;yXoy2F3&>nxmNYCt|%Pb-G1& zi|^Vvjs5^hEqcx`+Shh3MBlu^vKbv3ce>1dq<}!LCsB~J&wRL3L9K3E zc!*AhL{ZLySk4Pe)TjK0vb!gbm0cq?*1tKz-nh85Jm0%8Vg2w(7b-eIO(WVwkLXxC zX6P1Ubw$1*#KhhFDBkIXE2p1A$!@Z)n1K#)c3{E_7 zA8~AFx?wk05a`O+r$_jx$Q}-hJp~RvQ)9peTKwzo9^g2(*hFYhp&A`@CW^tkdji1C zz2!9b1J#qBOD_wZ>T>1wQkvIEE5zF53WoT2r5&yYDF9_J>Kic5w3AJs_iZ?~Z}@Lw z3ptMdN3s8Sj1oNkPb&XY7yo_*d&U#7md<7w6#f=ne1m^FoGu3JOUv^YG&sw#3_~jy zJGaM1Phq;UbKy-;bREl5&=0?Ebqz$}eTt6pEu#e#WH>$nvvo~x^L#jN{Q`;`hi`A3 zKCkRL6!YvFFh0vZwDd`EenSPn^8ZD(+a5 zk~@{DsK(%ZVNf{~sHsB5BxGkO;RLp|4f*$%i8h^?2bEah`-&w>25ZcDUHacBSc0ku z{8%>um8jv||3N>s&elgHHR6slXOCWOH=)N0B(S3&`?*R*i@h`{eQ?kx(;XCH)I!BQ z?Be#1KuuYNR*HowC{s&KuF{nVwW5_EgYyZt0yMSVF@uVx=6HNs2rww2?T$;=-LuRB zUJ8u5ttFwTQ1{U&$&vRr20==-W@4xj1=W-#MA=}n-zGn1}VK%N@qJ-z#D&0^xZz?xEOlFePrBYHJh8CxaT zrgP;VL8jxWNH}~F*2Swb!DsOrdTtf9BQ2?*1gHk&*-zQE7A1m$pJKD}NR{Qml7rQT zDft^*j_;8$%v)R*kB5q7n0PanIm)@y>hmDVho6ab8R~9ej+PDb6!0>{G`gh@MRkF9 z$giWpOG9(-(eWlnE-(>kbj3r3B9$qYCtW5-+TxT$#2eBY-N!6zZ40@{&3@7;S40}p zj!My9N0{TW0w}6m3j?iSR$5DUkg<5Hxa+XB#M^T#yFe+%qkYU`Mo{6GWkvL@)P;Jk zy>d-2K4J)_5hOV_9iuP>zZ5V4k*REa@QuPoN*x>6h-|h;yB6S_MX48)H`aVX~RNj_YZV`usF+@cgvIOH@G z8e&_pE4)`5l)f~yu}1EqkEyfYs=8NmFhp+@!(dipimuc&A%UcE7rRZrpXnX@(u0F_ zb|p@Pa?cGFdq9m{R^kU9QF1u})E;CuU!RmTo6Is3FOB}FGO>Go9oTHPMcucNE#MTQ z0>F40C|6KPm5fPBvQ@=Cj&~%!FuDiowX;IZp=RjfH-Oa&$ex~jpa~Ghy67%OX7mp{J zY~-$}8z2|tYNF3z21zC^x$n|!nk_;cDm7PK+a(&~iX!b>DrqzZYC#Qpm{;>XIlp+# zyEr*eLFPYVxdAQ%{RFvZ*d?ps!k?Q*p`t{dQa#J$icA^#{B09Wv4{P6Sm)weV+}nS zGm|8^Peb*My98BFwKNt!Y`#(!ab(C>Ek3<9Fa=aOxqvrw!E0v7C~LEWxwxm>8T3gC z@<4|?TPmHTBuskF6vrad9uVG*F2i^1{s5v1E})JtBGpb2oYzp9WUAUD!AMA&M=2A_ zK^*ZP?EzJ?g%3WRMbu}7*IU+*!X?@SG34ikRyB#3=gSqU1NaRs@I>L0t*K6wH+jL2 zf-g@o;5aJl@qod^{lG&){_&ls4k`%QZL18?=02Iy@2hGy6xev2OGY9b%ztf}%7OZz z@T;AdcHD_{l*jn*x~X-{I)$84TT_HXWEI?XFJvy}$?8AWS(OlqX{*lYj5i0`Iofo` z2_8_k=PCh;CAw~D=q65$w@I^3M2dRmbuKW0iXXDZKNYfKr^;22kgN#gl$TH;83^X< z*f1{D&~<8I#Hy8-W0KdsMoqfeaCY1cK~df?0U&>ta**pR!R;R&baRaTKla`-s>-cx z8>YLv1qA69Sad2V-Jp~xE!{1Rba#kIBOxH&T?-VEkWMM-M)>C1TfFb*$M^hv$NS^m zV+|eS+Sj_)yyiUnJdV??(u!uU>YAf6-HvG>Xq(7dDMr2xZdBuTH(4(+Y5*!sY_I$i z4E6QH|MRTlkVj0G-_6Cmn5hB=RI^($y2x>BBPT%!8xfy zk@Zg-2HiOVIe!8te>_;hQF86?*ee5n2&#D_siH_y4W*(~5(3SxvnQ~iEtvZg;oxbc z+ojBsd5@z=^u}IB3-vyZBzrLUOdwt?^_No?(uq~9O{U{a_}!&>tEb#;vQRx5v6%kR@4^Lc{v?sr0O#uPe6obKa z+;iXk3L1A%7pw64PGqY!L6m9f_I&MWnlD|z4r_++gT@|`=TpGSZ}^k|?9Y63>$|2O z0mEtLccq+q5EJ*DtXltxc)`FH@Pa)LCT!WMN%cV~YA8wj8B3up*fP1R(;s#%BBRb} zZ3so+Y%VLMh35EOw?Edce~Y03o}o=?Y5r^aFUs*}*5xy6`SyyxH%G4v7I#2cbmE%6 zS#J%xBV{kTLL$7iNiVm{`ETpZp@F|QkayEH{E>HGjsG^c0+`m6THwl@1}8)o9QN;! 
zPphzL#Tle3cUae95E=|OxmiCEDXM7(F0B-zoI2`w)ij?>0sxCemWA3N{BzuTET`dh zy3_E@E!*c*Z!fzhzjkM_tYpJkQ={n{z<+D0%6B)EXMvn0d5%(hxhz;^RsN8 zmB6uSc|Qg2HQ@3QF~(+~+g-hl3KYZa^)FcOoGFO2*{_4Y@I2M3{1$u$0#_rGP~peL zE8)VA^`Z+zhMjk+0rWoCwE|B0!CJa`QNqd}@)ZLNys?BieEjENjn!a?Df~Y#0gC8S zV9x@c9#H-HB?cHp6@RFo;m`eW{_&uCnG}?!d?)IxKr+Zr}ASX)x!xwZGZ6~JZrf0$+19B!S)D{xl9B5qBO zYN}me@oIqH{Sd$2@5-OQuM=JM~jCvH{GR4)6oM*K;L<-#WSFd^tkE zY?#b?p|u4@(l9gzYS1=aQd(~H+@?0k<=5=Am$qMl`GWV6lF0dD;4V~TPk)^XOlJ5K zbsyfyOAP@_;OHqa916g4Y8s|P06#qlBKR~n9T*(}MEir~&4*9DHnQAaM{s=JZ_RJ8 zf==(kpzEzZSDt=E8(=07@xu+@-`pg8(3EA|D-K|CD7o8 zb@)|*t9C@-$M6+!2z)z<8p>55;+lfNf&U@f1Q9B82Wwlw!u&a}y%qt_V+##1h)vZ$ zH#$ZeSmo)ZuFxL(pIJ`T@uu93q06rdSr(qrelWw%DlSv~ne(H&`=9Sl%$D8wNy-@-) z{qycCQSFU0wF1WNaCU;pu4y)v9Vqivt$@|Rxi1Knxb+sKt2t7FkBio3qp07j1{6~mTq-0{VCWCqu)7nwqYZC`*)VBMV9AQaluL{ zDvi7GVqQAN$mxQg9~aX0d`r{W%~aLPmRfZX00lDTcz-(EGtYDgAywZJ7v>Lsenai# z{{>OJ$RJ{ftF0N9V$R!z-H|eSKm~%WmBr|;w}OT#%KQAc~Y?ds0Ys3FDad~ zSw}vvVD34FcR&p?CW-rt=Ku@7({eUHaC|OthXpt?RO>$(5nPx8AnSYemmhj_uWE>b zU`XrvE5TX&z&b)jx@?4eu>5w+-fxeisVteY-QRC^`tBlmU2Yccet&Lv!YWw;Bm-n5 zO2P1-0;_@Z@Hm<@{oQz^q|_1Xn#B@U!pi-v#R1@u<4M&|_7)br`f#D(t~F%+MoOqe zd2y_g$TbnoFHkk`59Fh+)-nQV-bYeROZ}2XA#-RZ7+DwWa^vIZYvB&48Pk5ZR|~WX z$&-BPzzF`XnTq??EZ(c|$Cg*7!gLpZhv`c+{PTtv*}SB_VO&{6t4tvf z1^YF4+{gN3@%{i7?itVRk1h8DZ)`cJ1Hb)C^Qg;XlW$rKz-+#sO_9BSz8Pfb&G+-A0;>Tg{7Xh@^dPLNLM-!hy1I~b z*mMt)A?#8@T2#DARvztyLj@UoMW!vX9Cp9C@)K)L<>hMAzi<8l#v4^C9nt^eJllU< zXl}(yi6paIsPyx%-`mw5)Vt($Wuhs|v4IJrx22mJP#IRxBbDQJv5*9Q8*&Pk| z3@al5P7GagCs(iq#@mO=CsiK;kcGvI@`^AJ+-4o*_KJ$4`L~$OLE?Hj56V6p$UZaV zDRVVujXEO>wd=Q8B~e?@hjTpN^A$+EUQBQ{jkEqIwOOi-F-FBQnQ%R7!E z6l{diaa}-do9^XmpS-(&G%DfM^t+^Fhy0sZzf}s=VTV?_dH@o;{3@~dgr0%A5&@7Q3pbqhCPEEVQuqD6z+iZhVi-9RH za&;i7)1I%sQ5Kk!E659P{3&r2??cTrfl~`8F#~KU?;M%|QHeN7lsKJ*tduHqvp*k$ zTyuUpe@;`lyu_DPa4zlhQV6BzSqV$G21tfLhOHv9Rg||AOlr6<%^n;^kyDa^f(-@T zVPZDoPE(yn6LA`IyUEm==C*BayakEv}m!#=3WiOj-+K0CLpXyFpx{OFgg zLH}eRd5Mh+N`1@|DX#O_)WY9>JOQcM%vzF9hu^8Y%Va774>g?@nx2;Zdg!Z55n>8e zv{aC6oLz?(JmZ?{D3C@XHE{l7(gRP#d6^_gyvm|Gj;}@ivgM$cQfY$g#8yj7%_UQF zLPhJ9q^W5lA8a5#(5#7m7^jN;@8m}u`g zdLbM8YCsAQ;H?^wjNFPYw4_|HH0h$lF@9c!QaOgfF!>-#DU4Nuf8WVGmr`Kx0>&?k z{^ScNkMl>M?o|GB8|eF)v^X(ZQOV{7D6cfXhwB7n!EIt9RWcUT-Sly&aFj|bAi)$B zou;*D*jnNUyB7P_Xs9U3X`Srd;6G_ADTt0>rP8n-FNiYmSWB8Om*5XJgr%yaA2xrQ zudRoAU~J%s9gl@CNXO8Kv9gA2L51~LV>OQUjE3?px<7r-XuE^3{15W5WC_wBekL)Y zZc~|a%J;A7B}Uqs`Q~V;^dhN5-l{H@)oJSj5YL91^4Oa;A$hDiR%dAeZ94@A zzMybB&!f1DqoE;L0H;8Ds$;ul4T??k)K(a&*F&c%50!7T`TQoIm49a)xsANEoNKIglgo_GC&^=J-T^fNQ0?u( zJo^9l?(MhLZY2Banl@y9`Ay(It*5eT>T%rg#DrA|qnxjM5QvfRrJms-CC01bS$=5B zP%uol%MLUb6LE9fgU;1cj$Zw~_L-eaBfNDobK6K0d~X1ealbFx>P;?ve<+b<<%ldy z1g+)Rbl(#cVV#gF*4J`D8ue^2g{wleIg>*Aky_Sx23^e&$a0RS%UCj{ERJ3QHccaj z$0I3!W3uk2VNPf(L1#2kNb+y~`OrQ@V$KW>VSCO9PeA+6CrlOL3uopd=I@%+B%&Rl zKAkTu5JAL3a9VYKKXZ4X&1g+n0fcj^y@HOb^8RF@`i}G6jqjYhAKyI@imnQYIKpn< zOn^g=1s};~{CD;w8Enx3m}vg87MAtckEml&v2#n7c!PZDy&vkn$XTKgyrTPe0sa>E zC70M|v?Evkt_Up*F1Y&5{=ut1D+KEdV?!WHWQiVLpq3*fS>IP6;6V=B74 zYm#{2B{At01s>f!T@ryu`Aw%@Bo}w#-(Oq=SIFIv?!E+`|E3MB;X)z2FYfO5x{<)M zd8#5a|LeA(VDQ3~#%4@^7aAuGp8cT=sZZ)ZzZa(huB}$!%gNte!;=MX6M&YW^!Mqp zl97a9^H(s3H~(I%2UmKj|LZ{vi(Ft0H}T63e}9ih3Orkp{XWxw-S&bG91MHvH2U59 zNLt~$1FGdKK*&DqAvKZ*S)VMX`ODtOI1rZW0`4^D1u!)l*6uqgcj5+wjy3=$AB#Op zV}AtZ4H5!VrS&8yi*{M+>R^(>(?n)EHtn)lQ2yvY-TlExMlVI{fFGaOfWp!1flyWf}av7?ks7ZaZ?`Y{>}m^zZ3OB7Aua zcD?2!S)an)iN=~B^t+sbEMW*F#a**zECv09PyW?h*jD?|Qo!W>Gd&gV zpKC~O;ex6gT`*CK`2`5Kkx01anTVh!K1o6ne}AM9?2iy*@5V0LLCwj z8@$N<_q6H5fpmvG!@&vpv+DtJ;I;R4du0@WPQOSJlKVgT(16FrljR17pbIkqq?W8k zV0z3Cyo!S$1c>`EItR_)v%uqmN!!pT89O#6SvH4i&_nHsCR- 
zh)k|MiHJ3MbQ&-asHUfX6$sA;eErdnaEIo9zTShR*uif7z07g*E8_)d1%3eSGd0oD zern<1>VAU^c`EYYYI%L)=(m4Gvli%$!Y;A#E%-i<)FKeNPcOPqSPYx%i8!7;oEFNG z@H6BzY-T3seI?YRyfQG7?<~+7Z7&>TK{hkxV=ta zHvnd*DWHp%=e`1E9z&lKEf3&?4JvGA>(#(C0_G?Re)S4awo@oj&SK3Fu))|QGW6VJ z^f($3Y&ksu^wiQScad_?ZaV}VfBCD^T||lvP{_!Y?F2pC#+fHM;w}31_EKN-6`0%? zu5nSPjk*?!Pdu$ocb%c@y5<#W%ngnPBtviZ(hE5wEFKc<%L03QhWsBvW?%%`-4p=Z zLUnoL!_|SWpvULpy3p!_Oh5kp<8!U|S0~$(L!ha`VB~v|?|BQvASv#?+#smNd|Ox| zfG)0`fGW_?{TNYtypkQ*aE>SA1xiVRWYM6`9P?@*{*i~5)Hg~I*D@*zUw!RzeU^z5 zEd@Fjd^;L*Vph#qFux(c-XQyD2Am9kgG?qKa7jTVG5^_9QkWYv%@bLynB21pPAU=9?^VNz z*&e{!%R3+<07EY5VFdeT&@E!sJOZuWhpAy5hY04(rVFv27W5i4Y62_BSs*as3> z5@tb#gY`8*?z>9BPkss{WUc!xAulL*q=~qdC$ee{+OR*vQz*7a!EKNZ#(1(f-!efJ zU|R5|wTwhI#Hw$_Ne(NJNj;TIOHNf&h-2T{S>72be}qbXq@h*=lMLSLPSP z32XgvG9aMM6)mH62W=w7Ug)RqnivwOedgQ>Rzu}rGDj)S z$*M!h^-GWkmwVdH)L1(HKAA-I`N*>uBYr9Gkc*rHOy443)EWs(X3yc4qwQb6#R6M; z{;E3-h?n#UC2cFw2_K-9yabh;@Ao6X()z(hCDyOA^pOWu^ILzH&wrZq?E(}*)El~E zkG4T2Bmq<*l|j#+OUJ+^$gPjAXF!YnK8*DL z^b~ig3}_9n!?J8ea}Dh&Xrg)^LjvWfhS}-gEW>f5nXV0`q=0l+;d`0heP1O=`v;gY z*7@I!u@vJQyk1y8c6}s+38efk@t*}>PrVbZbv9M4O3PfuyNMq4Ua)VdiO!* zrt-5o+mGS6k`Nh2l~ws8Gyiokv*@7!i7R;B&S=pRkq`DH6GMF+f8Bi$EC|K(H?0Oo zR;6SNvZ_JKQV^5y0+#eZEM1^=XFoQuD1zkT-u3K8t@>+1POOI4@Y_q(aKF?tOJUM9@0vtSjd^_+yTCF?`UnY~qyj-=GJ^^hhpNw5?b6Dx! zvU4x>0e%2fwcF~^&GBX~&-V5Py+S^#nQA;wV%5^bc}PvM71NhSa@O22zML)ot3U33 zJey8=3$`#~BNv6G15-VSwGQkEmKMzMnEelvlFR`9+s(9<1 zy(U74^Uv4?`AVbq%+q`ReCE)it(}=lx2n)xGsoUq%ZgUzf;c7GhEe)5A>Z-BCkeeP zvisDDsi{TtAP3W!IwD^=y4TZhOwMO^TrYRu5*hgQuU7W#Kr+;xz}>t2@!2K;knUM4 zstFjMx@V&Lb}wRL%$wX$*g{l22nEr#O0}bk2b1xJ7nM+5PcxYOF7_b{FfkAPSIj@Q zM?cb9Gi>p^m_xG-+6{5R;_Jk9gf+nJ`U~Z|!OU~2;E;fyV$f1|B18%Jgfp04j53U|ZWZ)Uq^ryzZ z`1PYs1%sG%^{iMvjR!z!+G1k9A>voPW6L6CR&8hGE_~H|I!e0Z^nsKId)aQj*~5Sf z%(t>W1ZP|LK_q*{A5L}vG+ayF9M^|grazft3Ks(NiY9Iy&nIf%2Kwx)V6ciSVIhF5 zz@JFcggoR1ZbimX3HiiFV>kCpOKDd`$>L7+ zk5l=UMB)uJGhp;#@^he>trPk7>w1tQ=c5p5QRe=a&pu$9dzS470q%b-5t}xSY`2+-VZJ}eZ&%hO__f{3aoGVyg%XXa09GjZWZZ#JH>oZIky*MI zB8gKmb+)I1PzDs1*YfL!cB#kqnC@_=%Mn~9_#w|B5y-8sORs@V zwVP$O5Qr+DWC(m@3+jJsAJVaCR;gL6Zl&MfkpR3F!^`6bnNqXTGeOvt0%+Z3Pm@@S zo2{~ku_jGyCrf10;T%Y(OW6c7OT=0?Na1i%FW53Ks&jT|%!^c`xvT9d9KXNRznvHUG%^ zNo96@w1W~;U7w7HlBDd{t9fx8?+5gI4BxpMcrN)Zw`u)?t0mW8lR|f{P-bDn;b1i* z>#69S=G~6=YEeCJxLq};>IlJ866pp?~;1`2l&*_Oq|9dzE=+t;p$i#odxO6sb?-9NlR$CNjL5<5F9SiwxwF4-Md=PutR#EU|F9V z+K0cMfA!`koZ2Snm?{eNksltvo^RIUzW{L_>k_&d$#*nckWTQ2&)HmI$t&wXGJn>YMe_%ZdthE(B#qcJ>JWZe z=m%KfQx!l;+5B;s9{s$WTHtl+_3^k$xr^laHfWZ<0!G6T%++E&kfFpN7Q6hJ$5{op z&^N;eQU+lVWtdD>>FQK-&Bm^XP|=uyF2BQ%=di#U#2byR3t2mbi5TX_!B%TCwDq+% zKeT_pJvl#Li80uts~G=SH_r;|`zU4zD78cfDB>j_+zOvjBfSSz0A{Ew;s!wa3@Xv~ z9N@ih;7ENBSHHP!0>bRA#`f%PYgju*F%$BIV}Yien!!a5M{;$<#S#ygR) z*S!UiNdq?JB%XNQ0j)uX;1X8m@b(D+CaUIm726#D%_AA`E4D#2WOObRM5P@&YKI^Su?emABWx>55tg3R`l6{er#W zTDX1yq$b(IH1xRQ31Yz)i$1eRz(-Yl1#nRp(dIAq8XxQwebkP%so(?`!Nfkb!sUX` zMs2=Z3-`SD{n=0-$lgPf1TYMSQxFzG0O)4#K96WJl4(4;3{izx%w^T5j zaz@l$0vgaw--$bMZ@xPT*gdYbUyljJ`Yu=Zas%-I&IA|B0 z_we2FVdmL8AU9H{PaOZ9sJ?_X&p}o#fc6m&oo*%#{af4{sx}ckSL^VR*3*wI7RVT0 zi~M#Wzp)&Yz2efWgIcW1SHKArcPpzDBwUKLgdh?_6>~>PN%fELl6SoNCWW#@+>*g$ z#0DJnQ+{|ZMFSt2ti_qLV?C55A|EeacLXp@QmMuCR~#@;Ltt#=mN4>>U`xZ|Qt(0p z43RrkeyZ#uY;Gr-KnPuYBhNk957@?k?}|qvyO$-cArHFj=sprLcUPKFqc9ZT>saR#xuP`e(1fq()ad4Q{qtJ!$EH=a750R<_LtkM-*9a}QCzgi!tl z*<7i_?gZyjRWPNchP%IcFPI|ttzJ~CrF#UYxo2lx2@<%0V=GPP4B)vk$y`R|%8&S! 
zYH)E;-7PKE`iP*5{3oznps`blyz|5U8#9m`*jD5jP@9-llvsLtPWMH*jy)LGDS9w3 z$NUP!)^nmwu?27*TTheOH(d#F@!@knKHKlUz*7ff_q--jSTd*kR+S)K*K|c&*H70y z7DIdrTHfquv&0G!LurpY#?f{ydCFCUORmAaYEiQ;??)B(nG4awF~u?--5qq0$84dA(@uRtjT*jmJ7F z>S&d;xZfPOm}MHWdg+9X`W({L{QUFl;?&d&l#y3Q>+@f{a5k1^_|F9A1DU92x@rk_ ztUEa~b0xfK=$yyIahfMfwAhGSZ9PpohcI%qxC=YUXZ%xWL_Zz-w#%S+N1GVni-YRM zskPCq4JovJ<`{#2<3oiDn$9KPScZLwHA)WgOyw=zdDoGeMDy4l7+tF5@klzedv@3w z4%f4?Ze()*_YnxZ85%hEJ%Sz53@6UnGX(9jCu5!M^rcTgaY;4KveXN(3+uUzB%t?M4fTyk3TJwe@ip&o1^+Yk2D zrWm%ZL43p2ez!MP`1UeUl#=d{Tx##ST%{)akUfzgGs_r)*`i#D-_&S!tU=*cfrXg1 zhA&)~h9|eI<1qTBg89&7g|P@5-s}Vpj&yy+8nSO5bc z8}{tB&G^2-9_jC~g0Q@Er0siV!~qt*Tmawm3Y15R+JHCNbVZDub++g$wlC zf6B?!o)GmuuiWmX5IjZ0KW3QO62q;M%m7cs7PH-X;Yx>={dgbJX(zaAfDVt-30Dqs zX*{`_N~9_j!Qb5AOA(T40l|j|3u2DEq2s+f! z7**ZJvy!Q~I4QDTXdYlRX9Ca-N?qgbbTlUfmVD@oj>oD`AdUBG;_z4NKU8T&PPjTbJ->KMCZ95 zpBJW-0<;V0!AvZHDN+lA+@QLpBDZMMiW1K@J&|5Jh>qI9oZ3E42Uw=s&iZImz zIk?tPMJW{QG{F}@gM&=0IuHfd_octn)AYtriZ|H{*v>-y7js7px#24^Xr$M>ST!Jl z#;EvX882|2Ays;1dWfLrDl==pQGJxUnU+Fjj22wq@K8*G_L_DwW^On`1x+FyRT^_B zg=g=Vvn zk}3m(Q5qb17-5sgS~5*Q|5ZXv(i+-}@IORvNbkL-v zL2BPdCwaQW&X|KZUk|L&Dm9Xqjqc0+o{gf@Ro=u?z z;q6yrXie=JmJEWu8hTVQT$0z4oiU_9J49*hW4>$Uj^SoMUJ1Q-__3Xs+Q|1N@p;S- zmV>WwN$|=1#oFQ0&kv9zbA3q?VI=ZR-WSsQnwSeTjfYVNq<^~I$Pkj(ySp5Fy zuOS)7tJ3_y74PIElAv;X%*YrW&>$gs(a~!_s+olQgbMooEBRL8YxcN|w!WCuuH&o%)^qHOH-PEgf<7SV&-7zN=2QpiBt5(Aie;)0;XZ*!OOXW!b-G65kYP#$l z^!c;h!n_UI6kC1hmyl8O9WDbFG9jE=w~rOuT`BeUZ;D}l zE)_JWW`U|wbnuc?C2&y~cMpNV{ZTpLHM1-Y!n4<)V@;B=O;Mt!2Einr{;*ph3kgIG zB@eOijGMA+R$5X#i>_o?^$pmdkmf#7iuS#?fQc6OP3Rq-au_OiE0$omSrJofo4rOp zw$I{YwG6H$EL=zXK$KBlGTY4ktgukI{rW9MwLN) zO>)tKhW%#+rRN@m9XlK8ooOxabHoKwfp)Aq>n3k11-0%31W9y~cami7FdY*X*`xSl zwb|^1?6-sY&kezi;2U84&&D!*o?ar{#WLQOdHB<8S#Fi>J)5ZGDv+lrm%qrWP~EWV zYRDFq+7-ead5J&_-67WFL2*ynoOKdz<|3Mu~(i8aNA?Uyq22xkof~A0WvZMD`p{b$RUj z2D?=X1c}Mu@l{}Y&%cw$rg>*^w2HhH^$22E_+N@>v4kZ-@BGOmOy^RrnNuyMkpG(zU|v96-rEFSE_$K8&3XY~m|F8B`{iF3Eb)GDD{~Y`1^JWZaR@C9``!;f!h0Nk z;STiF#3BUR!G_l0>^#peLaNy)<4cpul1aDojp$l*7sYu9ut$l*qA|hCHf~&uw?%y2-Y$yTHc5M&ex($Fig(XG zioScY*xu|DNS9^abt39jFA~UFpoPyQptZ@+p`|Sg$v)z61c|gic@V*MNbq!5$kIoQ zy-rnqLf{ec9$<}XlZneX%)O6A6!)V2nt|1VZoVEEnK(TCW4;d49EKZzUi6@x(Yrh& z7i#v%1Vn9$c0&5k-R1LS-XCxy2S2OXyuv^WE70MCEQuqTEp%DnP6l}xEDCqFV4l9y z!~Rwluze&$i)9Cj%ZRUBL4y{r4GEm^J?lmYT=wP45Vnz-*?W`OO39y{UhapuFE9mJ zi&z6iDeGTI84`HoTXm5V`wn;P&iqP`!6OwV{#ww6cF+@3Nw8|rfy*?DHfE9_Srk+) zN9o5@8&MPgo9T1EQ(L{nv|qtAbFt>4RfC(e+NSCnOGEJt-{Iyp%_rwcP2>ibe7O&= z=NseizZ4g;cgz@`=GFiH>-PQt0@Bye>8XfGW$oW-u7|G!UaRObN>tp)e&1=+KPj{7 zFgtXvj^nlY4R_%%T}ror*T6@b9N$n{+K?meM?Zao;xfDb^aX)Dhs)198Ej=a#D{Lh7)` zG(X)6H4fpCq#836^t4fQ=rfdR29?LbF0|}p<+FtGWjL~!)`rK+ZtlUBf#!@zYSmwh z_4LLJwFq}yq9J_Wuqy70p>!iQy+hgaWw;LA|3Fxf=f}sCrLlx{MLJux0E#!8KPf_( z@Q05Rzlw5ehl%5TIri=HI(_x40}l!Y=C`j9=1`mYHL5;hE`Mtbf&d<*Vf0RJjIwuQ zK^GojXTb2+e^--5*bb*>GdsZW6GxpD$BU9sSEp(puG&z7RQutqIKeHYxLX8_+Bs2edV*lerK8*E;MbM&wRQ$KI2IxJpFzD2;me}Xd5z7?2@%$jLZ;9X|% z_eo0La#8>8Mv)+JAPnSE4>Da2eGFQqTAMon)?4Rvo_ikgn8P5s&aEI(Ek=j?YX+ymftEP;YgQ^ubKMiaMevGTJ{L1#q z$J{0M1eK(5zr}t8sW`SRSM*_dW=;8ZSWW45{qgOQ?}c@r>=%>NflE-=`)X$>sMl)V zxV%XlFl*MW@uRNi@lJGRd=E|`Hhk(*~Hn6Zai|>XB>ncj-+7!`lR!j#*ni@>2?p*)Nupilhc`| zVZtII4yv7=)n32dDg8mG^*$;t;t_glq0{h~)Cc%>q1IvVpMe=Hv+@rnQD+pCA=Ns( z-9o};{U9CK(+EOgoNOamJZ|pf$lSchpwDCKEhtn5@rZ%J*uXR3t_BZT%hcW`YooRJqG0mUQk|xYT@;!u6r76{IelBwe5p*Yj z68eR{X6v`HT7w=FI$4j7mSyjZJ;9SPFU53NMXA5@fn{&!o{a>XepG47$^q#uUT1 zy4IecRxiK(X~RC@9>o6L;9h{YIo{g-H`KRYF7*}=5`+NqZt+N5X&R0iBx$+KmtK2yN_;)*y>LAE>Op>Z#^<=agKufH_Jk#=DdaVzUUfcu zx?|$;)2lZSb3u;GmTPOw>A1>gE!cK-eHmB@Up=gzNG}_DEN?t3pNz>8BAa=1Q3o=X 
z7adepS?yhD3-B|djJ0PE-B+chPo6XhVzEloOQb<_P1i#@HaKiAU4B(EOokDYgpHc| z@CBb8wCsb-nDKx&fc6bNpP4Rmw2HjEDqZZU_+l2K9bqK4*#qOgD7?)HO-7fW8XaQ=a{b7;cJFSx%&7vkYT#VZtrV5 zWy*Ks;Hzh(9bqS7jfg(FBm?Or(cEH#;pf+SI?v6NMZNgHf{s^{=mBm0B7COaE*=(- zN?95SUQH;PaQm5%Z8|D8#p;#{ra3TGEi-u%`f3pL_i*cFBW3g@8U{bloX;y*GM6aY z1D^!prLVI{kgZ3wynY}*xv}U*Dta>_jWPede+PA$m@M5;@m!EKuV9tp`{WuR?{l5+ z{(N%Y(WMb4bV{#ACPGu*xq~dhZO1|HSz7Y8ww)itPNLdqKR}Sad0r87o&UwC11g>N z=xF3{<;H&31~rm;Y6~lKv$Yy+N3&T6&J_lIi~xt^irvnmpbzQ9QA&H7Oirh+*v3Yw z0fmJtixZHF|dGsthzb7Or!-{+hyGuJM1BhOVK36-L`^)zRS z)$Nvgw_~k%CLl`l|WlWjKerTlrP2;TU0iEUZ z3m#bG%4Ob`P4t-UV&bww?QOKS#rAkSn}*a7nAcu5E27X_@k#tX!5Nw;bpIA@5~+@j zA*!qnsSn;y{HIP)p@c{jjromgc;w5)e5@C6CZx8R&Nxa&d0l7dodeM{VkNfO_A;``wtc!8Np^H}rHpNP3xcST zd?ew5>(F|`YbC=7MI*em(vIqOjj2OJ?aA(aC-B&Drr|Q$UAdyZ!a8cl{^XGmd6fC9 z(0HWhxd@06mq-M!&#{A1lcyx6j#^JeqI(Rxx}IG3cKSN0yNru-qT%9ke&IZQwA2|q zcd0VKA1OY@8AOG)+~0zthR`s3t*7BP*?x>RF=36LTn`#S6B0qzM#KbpOX+XW>U^$T zlO&)-k9~LoNIAY^U!j>~4bW1oNA6MV*m;u}<4mobL!bO=d{amhWHd~^G;=zN;CH6Q zKrOe#9kS_1dj>1M5TJ8FrhsuG|)$3VHd}vn+cAEm`EL1rj<)d{5fk{9K% zM!;Bt#H4lw^4CzpB`lx0=eLBG`j^^im6tNaC>OdbxmMm?$<4I=6WaO48pI*t8SS4u zw&l~=z7Ak&lqU}&Qz+Cm?v)(S2-j-hWKNa$W{KN8zVID5;&ocV^Z0zByIo&8G5eWD zkIwYyTO3u|;m|674r^Sw&jgcJ|4$qR)hn0ZlgZ~U>YfJts?g_4d6YOejKQB$@K>eA zm(uW{#%MNkF0ZprS{aQ$3@kp7MQO|fc-W`08vr09O9-#hsrN8SBEEO)2_)jffde}O z9y5S&_!F#Nz)H;kmyD(+DNed-*r6dMy6G=RmTil?^5mgxEof7*aKRnp4>C#0V?-7z zxm$h2{=JM|QeHzS>->5U(Z4O`hW_I{4zh;=&nyDWCn_0zR?CI=_J(X| zPsCJEnXgNhvn^-{Yx@~`_$nD*rA6vFG`!d6zLNCHmgYZ!%;*`uq%IX>auT?uRd7>r zB81kco^07o4t&1Pwzj2i@Zr%kO>+OKKchRbPUp0L!jH8zOOC`ooh|~wfxsw~{_GXX z{L#eTL$lK9r$}8F++iX6^)ljo?)B3${Gf|fKs2iLD1_4&dgy+ooOR@Iwe~dC+g;yd zbHO2xc->QKKFy_?;CpNxwo})X^F4pghoEJQe&`K+xCN}G>t6Z@{J}o0{t8wANZamI z`=`M0jc|X}28Glm2!*Z>nd2DA6WjLwhzP0*wj2EQ za(gbQI^~&i9IuDf5aYI=SF}w{Cq1cdl2l=TxR~}@;itJIdjX#=+KU&PxB7EFE<+?k zY#bkF1q*@>No3?7baP*e_?A88ETm|OxT7>>AahZP~|vI&w7W0>1s7jDdY2A(Wl1&I%Lx9)}$>Rr2vvxS4~Q==3kXR z7sPiBA0wcm#gFPTUtAIl(#ejVt_J`gWs)rk_;mW4T&iKYpfMY5+e zUsoP+!9tYFW zPksjzU=lGdCh7f#wgrM?A5WVeE$?w7^#78RT{nPbcZt7Fz$QnFUC&rS87Taw>S_w;>UpWl)LNe%BFviM|})Exb! 
zxLA(K{#x`(4wkQ%eqGPubu~3Aw#-^6P`B$jwHuzTHVgHQ`Rf54F<-`sojBsu6)m25?d7*e|X` zzL?Ehn`25b+g3om;Nv}72~IC9t?M2ABgTO^o}6vXRaY;k$CNtJe9it;7CO4OA}_Ui z^y_~!kzb!_Y`JS^hRFBHMgnYn5MwM}lQ(MyRj45;89VXxa!n-aqnl{63U zpxB9qx#G<(K0%SGdV|RHx#=cEFehyTdt;GI6N*&_s-%Jrr+?Wtc-XMcR2Qp5B+Tc4 zPb>oXS#`LP7!5_BEHA4}z#>c;g>dhHmj1>td#jlK`VQqt55stlu|GcIb$fE=xmOyi z9OT*?t_F!nN?H~Di!u&)0=l8xmY?o`(_ysWXQb=MPVt7uPKa8QLGe9?-;4e)f|dsS z@Au{(A?kmA*B+qM)9&j#_;CkU*DoL-{kvH3=OXCJ^Vwqlxh+f$d=CJ$Wgc>1nA0H) znf+l_yg_=O4whMDD2|i;N7R~OjGnorAn#38EU5={ezn&2onMOw`@Crm=5;&zu9B{ zWH7AtW76dDyEB4O$CSa#I7vtJ{pIL%V}c(z`1CcC>7Q)^6&bKcZVEJGK`j512-FP= zb$A*pGfDpJrz&KH#vTy1(g>f4Cf`{DT0_`wr6jG3{<}@f(O{42pC7(?pt!_N>n-}% zY|uW2@uR*3uhra19N;v(Ap|dzIksGNw*v?VFh*E&*~5(||7;VU8*Go};!z*Iz%Ykj zS{e9zBhOQCV$4c^xiv1h*yz0(zdFwX+oA#X$%9dmb$)&!u@LH=S%4?l2c!$WTfGDMyG#BM*h-*+aD^&=P0Szlaq9f5H~+=RxMKmzG67Bh z(qK!)|JQP|0u!5zF*eP3cLmV>=)qmZXgLOdxdXDKuAU%kr^uODKH%Qljt~jisDjNBpOC~_~s>$^Fzh(B$BYM5M`en1# zM(y=^FytfU>d%Awj+euhLN1qynN>Hp!_$1lLb9hyf8JV(Zoe-z-%+b@M#?e*7M8<4 zl9gy56f%@~&H9gbW+_FF}p%oLxjXULIRMfw;;JCNeWc zSs60AluRVdXl)f=2~YrHtsOY{dYNnbO4Ggv7dG>nX~O287@irv5^`Eke7N~l9{iKe zjDVWzTiX%%p9KJkIwf8`hgoD=Uvx=*5r^d-Vn*hsDmB zl9d1mRY3e1;3{f@f{=s~+_j?_sFHh6(x3Ec8+NGsx!GaAuAvk3 zUp9CzjdgA}{zWwSx{u79!CdMiUt(#k;k;nTl^0e77@3b)?xGwJTUG~Cc zifOQ#T~2wOW99R|{(%-=1t_J1(ZAx=8D5=6Q;DcvEB zv^1NRE-C4fu5Y&Z`=6V0bH3YiF5K4UeP?FPteW*a>xxP(4H69i1Y9)j|DnfyU0{Oc zy_eaq@CQvpH8!))`E52|ojH1wFXRqgMEimft;jLoPO= zp^h@i6iXh2Dvug*cpNETQZt--`b@&{Tx+7$z_avdSq;_#Pt$Kb6n3q+P}9$8FKPwc z9ahmpU;GEW$HqH18{xSy>eI6a{8Z9;T0Am#U~?&0On+R$dy619xi{7(D<=nRPy%P}o!2MfU&6>9EJ7WOoP>6bJbZnll6i{o{P zQz34idy8Q{mt8?D=%AeJ^yfNqrKjzMePDJ)XO%*>o!@iHH}EVlzjc72;3|N6DEZ~X zARk)sQ2Z#sP}OyA_0`24IR7rYT=w^mGDF~CSfu8i+>2Y2QRCu`lDOYh+0Xovt1Tr* zqyw|H2ct>5%iWGmd(#^;(7jz;3&W#Gz=bkUx1c5j<-N=F){oJvYUFGHjEn}D3Y4cf z!As%5!tG()ZxUB>Uy%o21yaZXuF25Vl#iZk16J9`hJal&o%Mr#G{*%w0`Vd9Th}!F zSBemW6qc$jNC-91RBz7OZ$+-&iFV3ept`d>1 zr_~k-agKoK{=}m&`K6uB)1{o1Bv`1a9L*f2>3HO5+z=c|auT4U>~EFVD#F4G>4t)- z)A^x!#)Q&3a&CK;JPlgLkBd5FQ(nWgn9M`8n07IX8wW!7>x3J^0gs1#L#yX?Tm8*q zf_5eddI&R%LUVYD9M;KvN9P^(MrQ=N8X5YnH4L-alR6PHVDoqKv%!a80ErtKk<6PO zwqT{lTQ}B9rd_)k+BsKn=R9nrU$$zZg7Awh({rdLR^0~x5Bvl`0>yOtS7X{fD0wlSz=+Kw@xT9V2K|5iy zhxxtg8If8Fr7JxErIA@D#m=414qH5!P@aC1`%NBt1 z=Ko!FxjFl)(x`*6eto@5sl@NhI8vTR;M-X_`aY-pp2sF5x9OwzFIvu}QoXN?NT_Tn z&Y0h!Scxu@6)f!k(dT`oU8!HGQ=(a?YI^fpp+RtVqU5L_I_P11%^5@GP%|((ZDBb& zte0cZ-=I|LzG*y*J^=>;Z1MQrd&fu&7ucz&Z_kL)U=9lx&5{_G==Rm=PJ(KD)x9g4!s8cF%|pudV1=;=^BMD z^m7eXV_BwDvORJ))Ya%}8P2{YJpySg8GRXsAx@CmoO)~ZFB>)rlinZvU*|A0*2)Q& z*REk`nAPXPu~qd+>7|dsFDk!$n6p5eFz7X0#911>P`1E?Pp1;=xam)JRCj*1cGNvC zu2gJ=GhJ_9>*&!CwmLwhUJdo)rO({ye`Mc{mm2E|>dUa}$(@rnrWS7W z*I~Is;lhX7nfl{bYB0MPxZPcNi%9{1BT#nAj44+c9^XgbeM#ea?skNeQ`r6=;8AX3wp7Z+%5yEyd1#VBQolVg>}V})kru0_VbG%Ck(qU) z4%4=pRkYXaSz|jpMdeyg>LBXZVtTyd=ZNzBP+8{c(o$-<84l@oi`fPe)(<_esi1MS zeCzGjpjuoG@=rk7NU#v;zHm8G=l~^KnX3b(#%hL^>C9VnYf{o(ke~^tk_s2)9}u{P zAffnDOZRq_`_((f*CG+no4B8-^;W~b>?>~{He&T|wM>2@$Z)orqY}hO&f@DID2!e7 z;nF<9hpvpQ>>M<|IO?D=$~hsKq-A1H#GDAQLlVvD*Ya2_jy}h_(T}yD5zR$^g$u1* z8=xiGtAD#Ld-+ZU$$EB|=atU%yy*V#Ort`w$F%uwDih@;nR_V5ck@SF+aeKF*^7iSd4hc%jt{v8;@V3mS&t@t8B4J`<{g!Ea-dpS5yYQls zN>AX-Q>Sgqts``h_BC#MjD|I?K_d3IrNA1u6OJTS&mfh|%I}Y9bz3J+LcCWf)ovjG zZm*Ma)>CIoEtAq%8pEo>h^C%vr<&#YmhgO?xxm~l*yLL$t7|>~xu#3e^Yz=u0D!%O zO@O*zXFwLNdNY4}&R|24MDryrwr&4&Bmqf^pNIY1}VQbCz3Ppvd8r~(dCOS75uzgqtJDLkSwH=3wrW( z6&g+pZM`nZgt7!yjmF9Je6DaXnOvWbPtbVNYyHTs$E#-kh(JavpPJS&ONqv+KY6z= zrNpk+zfe*La8QhIH`eZ{1Y+?B8b9)-y2X1SiZuBXB>(K9Oa^#ey zCz!BMyJlxxHb~1L!K|1z*?P~)d`-BY(q~XRi`K}!|ARA((&e%`lU5ebKZRylHq6*n 
z!6SD%u*!#^M1|CL(%H&F$ns<)7D1P-w@4mT_cYy;Cd8cIufYzKdx zu+GzxAG8+D$8pZ<(sj>dX~^WLu52kaSw-nH=|!J{)CVyO@4XKOI-LMFS09H90j9}O ze1$UFr>CpTOgb(xSdFI0;6D53MGLvY++C*qjFX=^E*`zfFT#w6yr*Mmr*gmdIpRF< zxH3YzwhENrdq)x59RZ2#?Z_>XPA@E4Eg@R#LeBEvjm@=k}{AX;sQN_mvB2 zY?R(LQx0uhWSAPXKxpo)owMd(G-GdU7}V|DKA_&`|4+(t!kc>liKf>Gl>G1E(J3g(Wu+#)!-6>-Js?^k%avS zRuc{*X8~js3uP1_=kU&AybiKOF}e%rSK`23V1WVPq57^=Whw_l*8%^n)A~1n$mT1T zYESQ00}ee60ODP8F9Fc#(MFg3$8v6fZ{`S~bkCh^0E1380DS%2FGedA6hob2@Vfk? zRH*z4hP-@j?auA)8y_nK_#sz+{Rr#LdZkth;R8Izy`)dVkYRXSQz)b$S2a(bq#n?d z%;hE}Ccabk1h|l|-@4eFJpL~I_S!(+jE!g7IV%Zfz z#>5Gjh>iG1OF*jvc!LKV02H}7Rq+*_h(i{}ELNKdn99440M@rrm)HGIJ|GV3-%SJz zx$@qCa^`H}UBD?q17k+2X#s(20e28B>ho#1fF)l*VKhXw z4kH;*s@+=~CzVI{{|e1XnRAm%ejNppN0ct=V&7W;^MYRp#>d}b{1DuaFLQ@>K?{`6 z&)EA)l3)ZqX?$+lWeggXViuF8N^U+H$VN=FSFnY=y>`>6Fbo{HH1HD&$%_)ohLN#t z(ybbJ&ZpO_luG(Y0S>*SJXNmTod`SUKL;N!4#z&&Z^WdkOclJ7;fu--#ars&%?To7 z?+|$QLanBdW~~@vIfMN?nR*c40B3M`V}az#WN!#9R#D==tdlyp-598{a|R?!{h)h< zcE14Nbugc<%$el~QC>#>==nP{coiT7KPOpxbd$LB94vQCc;?C`J%?eXSL%N>*wr}i zY9=HY0}hEbKp0{_fIm(i{F=9=H7`7D{3D~|D}*DS`Y_)%iN z0V=2+h4-mWs(^#0=7LkT`zUUR{^QRT=N1pMe&VbC=mRN;%N^86P#nEf9PA|SY z_&il*7V!+ryXu@ZV2uJn%46ta`va0KgKN^}qVs{hJLsb`-*5!z1~R5AjDpFRIofH) z0PNp5B)6OJQnSIaVA8y`kp?(LQozGpr_g@vpxx|Qk9sps;rt$*)Z(-`wS4R$kHh+B za=xLx4|$7%AT?+V_-=YN$>*Ms>Vn)Tsw?jm9fy9&2#xFXAR`1&ILHACpzRMaU2Q{L6T1Qb$zfOU=ua7ahY4zhJpt)3?f`ZR}$ z0A_H6ELD(?nd>1I@Kh(c0=TvQ!|$^Xs!RM6EwI;?+1;M z)_F_;yNm@%1PO00A2!8Y>t6GfO&lKf#TtN{ttD>&BJ3;?5*FtW2#gM@0ZF!FMzna; z1Am_a+!f=N!dZhd(geUpEp+pV*^s$=3Y+$m;(}BRb%Bp%$4SVs$DiL6*h$#W&jF0S zaRK>-wi%#%iU!K;=ZU*?U=^UjmOR^=SK*x^!Y-=Vb0;HzO-PY(5LC7Q(JamF4fz&e zmQ^N+sHc#9Y*?;?a06Jp)&L7UPB9t!qype;1o#~VvD)9?1umPjlyL(G)m>>ku znILcbvI0IRMHdBahP-KMo!Auk(SXTh6)>eAqCR;ZdV1uaKivHNY~Q%44Dpd`9l6vk zAZ+qtgya#M9Y~IGr`q8~%$e0?PX0o>y#+Wwl}3P^y%7C2+8&T$g%RDU3I2XU(aHQ-3r*r2)>OkKurhh0<#MY?T6ue>x=4ok^tGNVlgLtFjxyKB z^5ByrlXELTrup$Pz51|`jqp|dxojKhKr(xgJK8VGP6-91w?xlN7nmxV5_XTV;sF}m z_@Z)+b+?f_nmguBH$vb^s%^AT2rN^GGo{ErdsTh4_Wl=6+1_^M7U$?JzZ2iv(QG+^ z*RP3>TN8xZ+l-=l_2h8`Unqi#1tXKGd%{y83%mn0z?ZZ7Y8*X{3DE!J0f>4#8qQ@N z;=_>l2djRGCnu8$MGec;M9-hL7fB~D?&4A`g@4SxJt!aXfAl3#($9{Wax7Rs1H*8B zb1W}T0<>-lU{jvy4CiUmCG|WIUi%Ti+yfxW1@8`PPt(`{svA$!u0tJ0#unE@_A7$m zfSfebIw0Xvi@5dAn`0_S+X~hM#^6>)A{*@o9s{#b3~D-N@n3F@c1RSDNR6F8l3&7@ zramPAW`sQ4C#fiAC?f&EIGxnHa)C7z9BnD69*7HeGY7T@h{)e-L2C$6DkSdS52o(h z(}%jUp<^e<6|2fHYjF4B^Kk_#s9<+S%j{u znQLdd5zeANWSFY|9<80N6n{t56=iF(BG{mA-SK_dtM|w6gbYgItc-MV3F=TKPIjY5Z2W6(=_a}Xr|&AMIsAR>F72rtto(zVvypYi8$z$z!z9*REok6cyenr(C` z(^8)Z!FdrYLDf^eap&mox6Iw?O> zJOu~}937K)Y{L#6PunmSROP3V3?FudUx*Yxutj+WN{xka;0Z2+^s}vpGlZqbMNp^; zc`cTQQ-af(BG?#|ECkB#{;LH5(OHn9=B`@~KLQKpwN@~ONO^!u5-7OUeQWxjp^HVo z)t52>uALm93wPY-A46Lo(CY)Y+ujoEIb9Rh;I($HdhoCUgTDymO~_XOlH;TnVZgvDrE<-b+%$&F|DWVuGOumZ;41ffF(@(GR4H~ zx+WDLepG-J#r%MgtK;r>)aVju|qPD365_M+bzgTQ_^KUu=5M~Fvy-j=5Dc! 
z?pEfQ__C9IRgEp?NYr(dBBEzH;1mE^Fx#s35h=YU7kHc_*le%fJRuVLFbf!o&6 zk3JDl7ee_b4c)&S+8cBh)sHSxNa+5_3#p=SpbaEYd-_e?C52y(U0g)QU|h_WFIFd| zIbc(rRAAE_)&ka%j?9rDK&FF;S01s0v|l`H(BV~ z_4^*mzy}>@vYfgCP~~{Q7b(a7=|G#UC0#?90G>qbTz#Sc^aNljo%Qd8&2?LH`0TWp zY{^(WQxTr~gVP?Oj&z4qVLhqp_Iu4ETib02-;J-j(ri|MkZsBAm1<4M&fGUGH)oa7 zZSEF-pRXu;Z~Xx67zfdxpIG)z+aSFwOvTQ{09jsiX|YQ&?*G880JjrS5QZQ^L?7Fp zWU0-mxnkk!rrrFxUBf=${Jsc5h#niLewk7(A?W}%z4}^mbaF8?!$eM7GF~6o0`hT; zW%HSX$gyv$g%%ndoC0A0;qzcM=Fo>i zJ;Hm8Pwj=;t}rRQ4pc|WcnyN>B;Q$qr>3Qp8=;yxFLyouyo8f~L6J>BDqjv-w9sWz ztG5%L8h^s2x^6(PwVusw`X#$OkpAm|=?XRhg*)IkIM|$&#l)T^e`&#n_zI;Qu!$-2 zMYFly8kFYRt(&KVSbkXG^mW&EFZmmz-M2drkg()-Kjk_!ByCXq8GVfJ6otx`J;4}| zIsEbT{V^qEbJ9@|W#kQ%%lh{*;5U`!u+6L=GIV2_J@aA)=IKuSh#=uC?TWI#r1wmU zTm?kKGAh?lo;%A6c(?r&&KyaWp*0A9HOF?|MWw~^Gb(b7JzkMV?>Wz`cJo(sv`#fi zb+%1b`WT|AJP7Dkt3HpnfN^n@xY=WlhJ-w;lh8&pDm-rE`il)3e%4mXl&6fU>6TOe_+Z;MI0O{hw|!R z<|tN~H10Wt?q-U&Yj5juXAH#vTBdE!s~phGDHl1JaAsq01zp!sQWtaoIol?W2=nrq z@e`=5@Ix;H`58R46=wc$#Wfriv5ypx=^_hvLvVU~D>|BFE8u#s2Mr&B{fyEr9w_^c zOLR>pAO^6(ko`o=?eIOcRme;RITm}W@NSkVGXPZaBcRqHM8&R52H=gUkhblDyM9sa z`oY|ANDji+#`M~za{Jqjz_lWh09Q;epX}HYRn0yXI;sPGEszm3AFYWRZ)ixy9n3{f zr!Xmsk5-D4?pdGEG6C7o(MxFjRyCR6m6SX&@2lz(qblPet55HX)>`*Ij>0mq4nJjW z+6i^L55WypWz#p>B0gi8_2Nj)@FJ-gjy;pblf(^$3mNh>Ad!QwP4Z*UDu+<}n_QQ( z9NiBppH>gyuDtQ$77pJ{Fc=eMgfOJENUzIllJloEs#IQ0`UIr94N z;`R+Z06+rv=l3d4hxSE$hq*}Pjj5rN?Sx`4UCG-bUCA<&utl)55X+bt6ovTO_k5y>;T(&Q8ir4@orc| zb ztwrp0?9n8^876W7@@gtt*mLDRZBIQa)XMJK4_^!Jc#Pp;iW5c#=*@!Aw8Z{_h3FyX z>I|9~*TZ1IiP(S|DWv}s-^u92S?cgbxh0MAVcSIy13;8vT2+f!Rx*2W9MI zV_o1ZQ?8;^jOX&(F5#O2M$8^^J^W6gr=8@hb|QfauKT}H{hp$UXtEGAN`w>YS}ZDI z8<8iU7Zz?>6ZwC|b6&>UPqxobNjYbkXBn{{S9AREaA#A3%u9LA=;QnBGPoUa6m9tcHtlx@<6>7_dTO(s0-_^u0X`y$Y{wzy zLo6PNlrs$=b)am3;_8`%FSs0a0or=J^z=h9<7xppoWhKU>t$%}OHSpj)nU(Z57ftl z3^At4Ty>scFqc)p{EM4sAfyrby_ed!wqctqaZyVYOtuXQoWw^^Ba)83eoo`iMkP2> z;QEMN?R6t>K9`$WD35^k8s+W5?}n8}bWzP@&Reco=y#3~$~4K?b}BSY6}m8~PpINK ziihy_dpmA~2YuM-qz2fdHJb3Fr3@aciEH13_$*PTcaF-bAo!dCY~T*c%nONDfvXC4 zZTaSr1!nDOv&PYr-W20bz3-Q)sA$$Rd(x90np~nk2K;O>8;MKJ13j{mO*i|f*0zX< z@;?N#b`v5w{268`V(xg!UX!e*aw3(|oxL?!4Uzc1WP;C;axfRKlCqbsDf~3=4?y zDbs?!Dyx2HBm;|u$y}-%b<&^gf{s(m(Q>9THXP9X-pb?xFn< z+W^q~?dV$nCTb^X*B&g4N$WBmGBG{lId;ShkU&{#u&~N#Ps$T$CD_hDV@~`*%|}ue z7Ps^rCeeCeWjaJ7F}68-Y(+ONuK89{C_w?qUr;#LZ=c`iwkD(df*OqDy=Z1A4#d|` zlwJr!l@|UWTiy^Y&Q6U4uXJV_a~>QFU+aI;5~qD(PrJ|6I8j$w zxH{laZy!~Irh3KIP1;ugIGdWn-5t`4OO5^8FA2xF+wH`U51>5CLW{7uFD^+`o zr_;$T53~igir!xWWW|F^fC1f4Sj&+IT!$tH%Dx%C%&)CcSVTv?U87qE=49(A-h||C znq&`&?&X2I3$afG%*u9R?`e*Nrp0kX9`+E$3wXJ!wH#Q8JWa)1=_KNJ|D(vc1ZbWr zcrbNs>TrU)vH8sO4~m8HWl&gotT4+u`jIYbmxPh`S=9l*pD|z(D;0>kder{3P&ye2 z4qKOdrUP@o%>lU=dc#6eXnM9)wLyOTGak=5cS>KiTWl6aNV>Lj1RzD*g)s>!3qHAa zA;xD1^sfWo1^gEA^e;#FD1mc|jBYLRU^I^U@L2#Giu7!)Z30Nqby+Vid+m@EGw_Y+ z!x+HAzMlZj*1}^JGgm(f1i}RkP3Crwg8&gd5a2SsRAy&CBkxpml}hG5AkS)V#G#1F zHMlcm+KMz|qaXhBhjobMIA+=DGZ{WZg>6M`wM&W{W~l-suV`xP3?%onIxQ47S^*n} zr7676SlBozffZ_cohnYKc@WOZj2H`|jwHO1uM2T0F9|es5-@DP0_1;bO@PAFrnsC# zokj=aiEY=Bu5(J$i3D(owQ2Hy@oaR9pW2*rJ`#HjVp5!H;$~#F_v31;PPao_G6LM7 zZBbKy!)Ze=Zs;xDqwpp@*v;=jSHo7&-Q3Ub)wNz7jKV*~_o7i1@dUhAPrj zMZ8hesEW&FNoXHcSnW*4WWP^pr+ti9-#Ihqlp0%LU#~enlS8yt`?e~^9ilNIfOF~= zh`;AH`7TwbWBR*OnN>AHqkBAa0>z4tpyCPK0T-RITe}-7xbd%nZtbfT39h~u)jCrb z$7O`HJSA90zX`vwh<5bO2bb)JdIjzZpk)$O?bnO{6%SEE5Lr1xPhn>xTzc{xsMkEs+CrO}$vg-o2?| zl$Rin9R+$?&<=49{E(GSpS#F(UwqW47~ZzP*bYyjDExj*SG@T$=soHog4@oH2im^D zcZ#JfMKOC3+n4L_QO$UU$WGx)AECIJ-tt%ZgrqF_ZX=1KGkg2Ex+XtFPz~VwD!1dN zV}S1?dj9amQ+r&I*wJN_-_!@TB-G$eVfe@`ID^C&}KM^3g{pWmYPBDji~VH1=t2Z0P=pP+=s4BF_`n 
zO4CDkVI|L|VA=%iFLpL-JZmO*r4Lz+=J?hQf7Z;ND9zAPGq6STetvj|=)D$=1(=t$d6rI_STbq3$-8tJN;jZa6@-2zf?4bM#+_i5p>?Y-k$~xs! z2egS-#cS@XCH%Tmz52#P(d9IAq}{<~aN07-I89EZY7=kC%b_BUt*_7GH9+jPG_gl2 zE?9*H(kcz_r|z*+LSi^5fp!FM@mBRE$v-3TJu)-c5h!dpqJ)S~AUuCx4zAkYvv9?h z!|FLW#{FDY0eSfMz9xLQbIUGV4KA-ws9FJL>h?$x46ZijqUSVxO_2`h&&LN29aO62 zpSHg$(z@t$JcM&iXzNEuil{IpDOJr(Mlout9U@_Elin)>%Cl%N{W5_*rM7S43178M z(~V}CiqPA9{amUp{sU>w-pE%Uc@)$p#2DG0HZ5RVpdMnl%{W%Sb)9QFw&_o{Rvigg zT)ku*%*mp{bgvw_p($k9-by`fd2(H&PB5rVJDR}cU4zYbxBQ6L+kuB`%nRSgo$$LV}9ECrPdWnk4R%HfSVtjs7>bwUrY zXTNF-d}bKO*%53w<7hrH+l?pRQ}KdEx%nHg5_N^TTMi28JYVfUt$l0i*a;#wJ7;pa zh0uD!d!_MtOIL0Tm?8qOJV`e-F&}MGm8nvib+fETtj=V8l5%C!jFQEe#2upwk+M>| zCNsLj_0K*QwT-)-RAg6WmCG(c^2{?`5hj1fSE&BJFU6HXffuqiQ+@nsN$MSQS;#oeQB2r5g$Y9U5^}p2g<{N4u%S! z{MqiR`?e(!&D>sPmq~nuM}=y?F3LnQ>)$AiIc8;~1vrDmzqsnZ zPji!NnwNbp!ryXdU!m=BHnH)b?Rw!l7esgh?Yy?PV@|kw*K>1LX*aB7GrtJ?;h?1h z`&1`u366XF{2)D1nRP(dD|1LoWYMl<4x=p$;NegWTYG(tOvf&L0 zcFI}V(t!AJ>C(WNam5VHjwz@w7!hIl8t`GE(D(8g#J@^}#hM4@ehz-jUE z0q_1Zh#dxg$YG?I4ty9Raw#L4!J*~`7s2EO2CD=fT?*kK`3C-y5Bokp_SoO=|M~a* zasJu2#QM~s59cy}z`@7LW2wQ`SP0)P>aQ?0c$EK+x#?xb7d9`g&aA)xy8Zuqc;jMR z(D?Qb9AYG9aI|^>_?ezAq5mU<3G&Q}q>R{y7G)Z&#~7z5J$UJWcomSKc+6V2 zVgxl=j6HsT@(X;(1w|rh*1r?r4*FWc8m$2y2dA6wzcK!93T#VM2)~CatV7TlKQE7Q zZL{GIR1Gczlb*}5FUo#)V1)FW39rp3m_@#h^%vgDF&{otO@9Cl&tx~*NeSeFhFwbe|NU2h3Jly~ zK3@|zchGGq;!J{QMiHaLiwZl8?JsTe8&&NHEW}^TF8*|yHCaEeS3*#jFZx+b@0elo zAS0DKm4d22Hdc$2lwcYTaBcltRX=RHg9#6|wQ#jt8gOlo@nL8t^{;3Oh3fDVS&V|_ z&X)K`_j7eA8e{#dBBcsQ0^ha3`J+Z`XlgpR9-#K^e1b3^;BRr61D zq%;Y7v!e5+QR$>4e*c~qg0Eit{_E?)5!nT|t9k6R;`xI6IR|L@jnqa&TVHX?Y?ls- zI=}kgPi&D)$IOA6A@UC%O%?LH3Jb*gL#)X+-gZ3j|2BvUc_(T%sP$gaOL`7mlBh7o zhL@e+EnWT{s%0g8K!R_-ne%_T9ZC*|bAlRM5^npUNVoRcrxrmANs^vgfZM-D+9xbciq zeyO1W8Xu8C``<7Adp?EY66gybWt8_WXSP+I;x~iY6I7@OV}bKiB%rbbg~#oj3YO@# z``$F`t@Fr8NyiCHo+;OAtmcVI!1Mxm3brZY=$E|yLJh&*&2Q)zi$u2OQ>KY0KllD_ ziJk~}^r5EXQeMhtWyrVvc_rygUyb|}|0+eg!ka)*Ao^uRc3Cf4F0`2PzznShX#9`{ zL&qMGHp-i?qJOnEYznAv{7(8g`?GB0dl^8lf`?_KFH1D5qruf^QWp2jU$q0ir0NHx z)CZX3O+Zx1>a?ZocK7it=PK?=uiGwBur)c@*3>1=#3HbXDGu1Vs7$p^(|I5VT*X)W zsE{wBYOC8+IWjgXxFBxZhWgQ|T}x z2YkqnzJP77|2e0B-}q1M)+7>rgDEf2yZ(t>JSKY<&@^HB*X|$w+nN95+R6mxazPbC z`q%*cLEIG{zP`!7N@#r4$OvW{D)_G4_y5i0-)}Z=!JxEEX=}O*U}gtqdt!1#!(kf! zIcZ(F*8uUZW3yt^35oR=C z`#}rr%=P%B(A^KTz^+{XH5QwSFA9DAt3vw9p8PWs|0CIJB`_X6)=gm?*z*Et$Rqo~ z78hgS=ZMD#TfD!j?ChTp@o&2=5rCQ2o8%gLKw!4y?yo`N84m$Um_F-h`^SHZ=szvr z0vk_)th-_W?0L(NP*#snq3~~IBEn6;gMJ+CY=8e}^8W5-VEM$sOl^`1rbzQ(Mx^Ec ziQxY#5wz3YfpTCfs?B`mLRE^UC-#55PX{!jo<9eM8{;yW_^zxa^f4WQr(QkA`kw-H zDZwblHQIO#_klE$rb-u~o~v<86+eCay0V{>ZO)k+|MpgAA~;}%R_!|Ojj#>-;eU3m z;?vgY(0WVbk!v2MBiy3cp9(1ehL**oVI zZhgxYP;}*Rt<4s2l}2d2q~<*{j%6z8dC0XXnC0RvKlA}Q_bO~PVX$#ES5PN?zd6^3 zN9%NNj;Z$6#Py57PSd1~;ycF1b=-^Jfrs|T>7ZOIu5ay1b3vYxt7t4!RPt&ZI*s1? 
z>AzY4Wk3DLlG+s_k0FR(Gt!D+=Um7NcHlTR=B-^yfDU2H{U1ewgKTEU=JltYON!fb z{I^~A3a8Yk=pQc%rSciQX>Vei?&k9Br>pt1KlCG2TC{0u9GobP-7L)_ONxBelxhZE|9G*Q4${KU#nw)ZbRVWX`83Q&2M(x3 zOZ3Y7?{{gGfto~%i=K)2tBWdffXAs4_oTPk*6;qt|uk5Cq)$VCQ2{taR z7Ee6#Y0m=HM%o9z@TW5z?@B+i-k!W-{pFXbl=CJ>QSKucN>c31+Yp$geoj$M^jCEo z$^zxW!m-TwT?f&OF_{bYK~G{Kh9B~pD}~K1vskM^*alJC{cc(TLx_m)GSv4qg`?*6 zuKmM>M48V^*8LHJGDXlqIiG>C>aDBlN|RmZPhm{d;Y+tX4m_gLv9uJ{4UqlGYS(bY zMfx}$Exa7V3Dn&E^as8(4}95qTAC*X1TH7J%;yR-wdY!@+~)nOOn0C4f5B*3MVb6& zknCxFyk+gp3w5)QYfS3T4*=19@cQvt;uM4cZbTXyg z9OQqUdR5{qXKQ6DY})7Gu+B>IoLDf4M@Z26AlQA+~hFzSZWOK*7I+*<}j;LTUTKh|Q( z`MdERzXBUCl$tHnR=h1Kj9Lx$Art)UE#mV>P}T=!3+J=Q)AfUUyo01#^-w|A2o|?3WOz}UTRCDEp#hBBqmCy*p3rwEG9?x*Hlxf*H|G-G- z!9du4B<|hUQmke(_ z^*$fJss*KTOuVm`87s8Ma5bh#F~=Xh(Khl?RJCuJR1p&O6T*{C?=|t!y4R+y;@Pf<3?oSQh&pzatAQ*)dfzT9Kdaj)FI*ywrvDIuPVag-PM*0lbzb`l>P z+q5(yOknQA;1I;J?YYpBVI7jxDs*euqFa zUs^XT#O(7-M`=tTiArs8Mq!InhB*)W8ao~bf+NNfz_YKoIy8=6u~h^S#7mmqx-IXT zZ(U8lU{9xePrVcmeH?Yno0fn9wQ>FZtnPv~+0v8EIlg?6=+3T0@SX8`qQkCLKyZ`K zP|f>GtV+X$XDdSVS_^`%$&=)dA!|92&AP)tgV4SIapL~hT@nihT06^#^y>{H@HyTG z;e(G#$9x+^gYxQK^SAo0 z!>=@}-^~OLA3@dS9H6U+tXm^dC_eNi(2Av<)4RsE@k2Tcf()}Sug^?{s8q)&>3_N1 z`d5(<`S`ut)v7l%w#CD>AToFsvo~9qO!GEde|si=$*D&fr}QKcwu>JmNoD{<;30Trb1Ev6zu-6U_#par)ab;%8d5596m zkxb~V;Tsb+ifo`wov6BbAnh7D7q2I|hHDu@2!_`9!`Na5daD z$TKxgvxEC6Y=rR^4iJ`aqa5vW>p5r$klnd&{qKeAF3?)%W6yF4>p1^>p_5}|pDm7w zSc-kX{#dR}Y+jg8pIYyHoCu|ESm2L^WQl?totw{`+`d1m2;VKf`HeVKN z$kvNFE>`Gu}P=fZfI4g5a65jeV&IG1zjnUf|N4+3T z(f`&ngF6(veC>f5J#!?!Y2V}9;*Z(EY ztU5+)*+@?J8)S9b(1)LWyh3~Lx~{F!6HK6Nq`XUA{WtcIktmmE$y0omX%1_}h30xM zGG36Iv<3Rro=AkeCpjK{IQje60pte=!$Rt zx}Mmnux$o)3$JNZvRlt0RIwwd@6~kiBDq*o5ArNOOnfY=tku0mwR%Zyvf3fny;_kY z0ai9$S65Ej5F1nyy&T?d*aWFxfzT9)8xpY*|*TT-E0G_+(>M^w?~{^|L>lPu%DV zsx)xaGS!l?aFwFJWQ9lt4!i7q?baK*r8TSg`NSb0W#jA@v0wwLl%89-9~RH3LblnH z&%N8OUOKRJMFg-Ls6*PnB|SmDc0ha|6VWUxh2Fmk@oIF<3{;{>90Rh@TY=WO?QNyC0yH|Q!1FBoMgM_q{_u2}Wlz#8#n~gJ$0YJI-Xf%$ zuIi*6>`3EGCZ!d)H-b&)<;1+Oi1AFz2*IRQeg8R($@It^pFv6a_ACMmW7xvU01YoKQ7bGg46oVzcbBw4 ztOY7U=a25*ln9?CJV=G$I`>zjt$0DKnCG4^kfvO}&pkt?K7Q^`a}P-S1?^woXg zE5@?3zo*AQ#rR6)L5hieuZm8k14oe7i&??RCqWsrkuG|ZCI)dBbet(C9Rk@)XKlN# zip!hY3n9pe^gZqC3IbAklg52;1P+Isi$*rdZv8Zb`CT>GXe6D*l)F|O`_vsv8jpj- zjs4xoaehzzaRq)qN;udLB459r;_AYDFtx?V)6ymSY~}{fb0$k_qgC>+ zE+q0oS@G>YXIS(`&WI?p)3E5c)5ONw`|7I6%`&;v(j(`qXnsi(pxd|MqGo^_VN`fY zwT+q1x}5f8lHR=QZ~}@qnx*uiK#x)WnP-zs*Pl}AIO)f6aEm*fWf9+tlViIA_0&^o znH)Q1cvvsp7XCvHAMcEf zI1;mxn@gaU!)_e^fzp~iaXZban;#nikG_*aL*8#iwCuw<;A!`r!~$QnUUKzESh&k`;iC{hG$t9lU6)NrDjgor zu*{a*-CDH~9t*JEy&w|C%8a)5bZwmr(Z(W--h0$mg2siL5tzAF%*Xd^d8hdq83Hy! z#T444=Ee=y8}T&+t+(7o5Az5{P@t|Uj*oa-YIlfduA>lgAhudw+>m{jT`h2^Me=}N z9X2oE4^#lEVR?*Wl^a)FFB<)mj$or>jvLW$;^gLF+lgCd~qRmuud$Q9bWZ<>U zVlGt@CYOjUA7KA1(>1UEeav5Z)=B7gH(F><*s5`L=pk$KZJu)??!sftvhG2I3rJ^Ik6(mp3;PJ8C5o))l?S*`?D!*0hZ z4h1ZcZ%t0#TgSN{T_UCTVm_BlE56x~;oER-%xtu}NJrgIjDZ%nLGe?TP6~5_@{%mi zB|TjeeQdJX9b$$BpIToM*S%$7onb^6(?zMHOR;wdZ|3N-Zyh}A07h8W=6O;k!JiZVDPY&TbE!lBslcbjW0W(46@|)k$gkSK>`m?HvmHnh>}eBS&b&b zZRVz-D|=qHlvv(Sft(E8j#4|SMfs8nLp<05s02La+>4GkoS;xKa?Sy)T#r~%t^tR! 
zOQ_$br>NST>F?h4lke$nH_X+mY0Z+)D5nLwg)$?T891NIZ0pB(j(pEju1uc8wM}b@ z!AZQd$9e1U9~3YU!fRWByjlo@s8Id1ImbkS9!vED7LsUEYR>M*>BoxKB3cQJ)YF{H zFXmVGSU+-Ip{$SFBSUm^l;F!^pCQstZ$PhaUmKUz@Y(&t0Gu>1X4F;S*$aTV`iJzX zFs@yf!@P>p_mpUfm?oMaU@Wb1v*kyl;P2d1`+JzM5BFfJ&7Y-IiuH(OKKq$=`E@JA z|4=mdY;Z9WEwKI*^`TTxG&fv-kTN%!gXALG$(R!Ab5qh;_WUfCd2@&Lcqevm7`=>uHlLLVuzF{gN#w&^y ziUmEVnR;vOCxD)lMn-=1OYU+q#3_lg|4sZ~6mZ|yII4V#sQ8(uPda}XE`jmrCrN)%qKINd-T2DFXATG34w(7CG ztiU{bwwWq64Qh={^)1PFlUec6C+`Fmb|C3klM|zq4z-?CB1s%)9~{8n%8KQMi`W@4 zsLIi|Brs$cf0>Sst+iZNzO`t^`WpFoIqXs+t6JG_qTi`+c|KhnVT!nT7iMbFDs6^^ z(wtK#1S@MP#6JmiNVqRZv+L%k_^IEfp|5#mIp{k9Lj>Q;8AbkmO%yg(3Xs^JwjAAj z5z7*9;L0Wh8W1@zO=54kjGP_pe0CzH$@cwm!d^t>lyz(nB`}myfpB~7j?0$ zX)Q>S+Nv~Uu4vbGsH>G37w2>eqvXKOcqvq+N}+>Tt7W~i+*qRr)NvD-nEG*VKT{hQ zgNkoI==yr|3+Bl)YB1-03X3jML+fmA}hu5}TdI#-p)1 zPX+Pru-vr&6CUtm0kJM5^iaQNBd(yyiC4a(i6LOA5Jt>O*ucHhAgU!(16d?Q4>hGDQxwTDxxbkJRch4{9T zSa2rK`kZ?Qo7p;eZpsBMWn7s-Z&j7J?{O(7O?7f-It$Mx;B!W-K-dHD?qI|tHM8-J z+qc-(Ijd#tMc3W<(`Y-UZh^iB@b}3Be`6Mis z@=_UJ<^i?d>65PducQjtnJ)01mxu-5lbK90e81EkBdcHK?hL7ZeW;eKq|Hk1`El{Z zsYwB|ySo4wwNVcZ3?9u)^JYSz~xo%)Kz5c6kuJMKkkkO7mA&5Eq+ z^iOfi!F#{*(nX7ULl1LM?^V#saJbddMgONgBs90vQq#8CAYNFXc<7(G^K_vxj3y7K zc}CB>eWujQjrUrdi;20?wm?~CQPV@PC1;sK$qc$hl%-1YYZmQ0gdQ3ji=5uP*nIG* z$U0pqD+9|HKLJ;F>)hS8C*Ugn-@x_HN_0oWB&$BnmxF4+b7Jk%@Jk6B6*S+zn+*^Y zr3>pRAnvBjZ7XD#4coXu)U3y&fbZE7$Ruald8Zgi;m9&sxee#k5%)i690k5+=E|@% z4+|GSTk?Qv!xa@Q8XF%?A5mr&!h74SPKs_8z@yV+iN`CP*`Ug;FJVp@*yo4)c-t58 zv&-@lk<1*Uof4fGF%A_R6>k>Uv%oLSr6ZQJBSeyraO8jb4;lI**)H^KJKdeW7=%F6 zT%usZKw+9YRL1x02ptLLwbW7=rOXpZ%24oGedxSjt&V!Fst1uR^m}#bb6OvQ-^UNL zTj9B`DZ4(&Z5cm=BFT?#zX*+bv2701^-=UeZnVM&65yrY!mwfPF|R@N3B(PLbh?a{ zy=KhQXw8ErUo@Y6H6WBsefu9n$$xO;t4@ln7GP{E^Nws339nie#WT%jsgaJjLq7HW3zLfa3Wfv zl;{yX{shrkxR-fZ397=*a18z_N+p(YItRN7KfgfJPTh)+8cg;#r3Jnx+7RmO4UUz zK!y$j*}I}aPM8HCGBQ(~^iGx8oHY4_xKwYm6NsZ_HEw0A~_SsaRMXguggq(d|^Y=^8tEZq6xy1j9qd-|UissLdZx zLVie{`Y>g!%&P_1i|2caIY)UzI2qtPxnYYJsOB+eL z_{zMIo%WO0rUJT}xNf7a5p^fmy;5X`!e3TR_~_l9 zj&`BDP0X%8`Yom>Ofzju&|zMBbJ{EStD(;TeD$4DhjXuOX4dm|@^Xhmc6CTE*AT@T zydN)Azlt!0-)b@XvD^_0_5I~_Hj`8K4&Lq5`0c(R zG9x=z>`6<;sR*hCul6^-b==e9ZivAa2?;Dsi#RS2AW~kn_h*EE^AG zNyr<$H~FZDr0MOeUxw|vRfziBNpqxe@)V_Lq2mx{$if1YfXUd)8~xNR@Z~|2x1E=% z2k#(3K2s?rxmVu7fDzcbF1RRLY26HT(e3n71M@wo}g1qI)&p8wpTSk z*{thk;dbDUC@sLrmH*qk2P)_01esXp1XY_w0kC!n1KFE%O02U@2oL_+Hf}*txvAB9 z+-HrvvN5#H>()-l`|poNPo!cx=knSeZNUS~d5POq-yqD}Y5I0`Zu>2vxK&#L(GshC zIL70SHAilGJEXAm9Jk})M7)vmh*ypphzlzpPs(I`qOU(8v zgaw)z6NaW2JW+iVAr|WmBd@YyH9-^RbPL>=id>b7ryyG}RK8K8*F*L0-n$nXRZ&#p zmwk7Y(J8l*6nvXI6cdC0!mE5(9-w5`L6PkhheDd9gW2;p=sr=K)N0@>J6+=dY!QcDu4I`f!sAVOk9qZ&G8DOtD&*E;HwwIFeRkg7mfVf{7iVZL8q4vPTW(j0eu2fbuq0R$<|pTEqmtpx-aKAY z7Q;Iu`CCj1eaKC+DTyYXVhbSmv@|3*yRUUCOBw~G1*rty1U24nwz?1o<)VUN3$jzI ze}@sjO@mHPh4}$X9*-g8;@yV4>=QJ3(;c3MC(uP^3P3Jg<-fNfA4;qT+%xcH{KVpq z`0+AD4iXg`@v80GVZGke+XCkwDVQXvWSl=blwhQJh>nPE9ZkCJ;UlLH&oXzevkbeJ;+$)Kx^_;sc9=i)FN{a5{R z1vMutudmm>ccXgix0j!9Z%qDpQRIP_nMqN$s^+>gP?W`si0+i>uVB4@LDzQeO5@ZA zl)Ho=?%@b?KaaI8f*vu9@u3mp4QirAV<+*4P9e-A@beZcW~h}L%P}R2Ee>LBj9JvG zB#=m_M};4-baCtC6;Jaq$a7GG!vxe8F{&^|x2oP_5E$#Z^#jyp3BUUzx#}Qcs7L6P)cx5Q01x}Q~s?=KL?80>)-BNi_pxo&tnqL1PzUX80uvGY+uZLf$hvdsC zOYcyb1;xoqSNG3HxsA2;>nN7Di!MEu)8o9qa}1oX-`{?A?U{{0%fMHFix1a+9#;ITC89&XV;bP{DMm+8yr zD0xBhH)_xy`|I#wR_`daNSuFxgSYvcz8BHQ3=7>B4ZT4(dBglV-yb8$=#nh*xnJw= z|4n=8y5%i#^8&P{SCq)Y)DW#KW%}6ff(D;7&8C}B53X*+pwkCvQt&HSFW1`;zOL$) zxzD*aCkMP1F>iDH>!)TTcy&jZ)-Rrea*Kmj*+aGJ^qIf4uS9PWZV|2%yr-G-oRNH` z<8490yVSR+w#d{|^>s+3uhnr@E|u2?QTlc9^G#hKTM55*8`NQQ#RG{X`AD~U?Xn^h 
z6;BmG*Dj(~XgdAOGO1eLfa=JRY>50$Uoe9h*L^n&xqp+cCg7Z#r@#*%oT~>AH-Y(aA7HoFgVR$clXoeX=b&jI?PRmhPDUuqA|&C$QV(F zSIiZzoT&;i7WwGQyuB$3x<9i&eBx=I;s2YbS?~e%+4NM>4#MlteWU#eL9Ya}bajMv zI5~Jd@)V<#dh)mtCU{vj*YmscE%&S@ag-Uiwl%6l5#=+O4!DB`g<(_1rhOe5^@NT> z_#7M;Lv3h?NrVCCx3=x}6kq6lRiWJK2UVe8e`g!D;H0&gX6|VXoZYgOU9SGjt+G^b zs4lerVD3b%i&wK*WOHkUyAoIJy<5Ka>y3AwFG-_l&bN}D8O*J+5>tb=kMp*FsI^uo z)n#B2*mk5h@W)*wm_OPKd>b_HuW1dM1+~Z($dg}zC5sBlo6C!uZc193 zS5s6-$G zyp95PGbHp9>Ld*wRnpQpLx&P#aJN&)7~4LlBTIK5I;ZyHb{V(K=IZ!O-Po1#7}bLj zzTlkacd=vn`lAtl_W>Q$nEe|nAx5t#4cN`(m|qP{wy3_{kJ}r9lo_MS7aqW4O6eL= z>8&&NWEfYx^Nz!eD%C0t;kryME!tFzISLBH$ZrM-^q4#5 zKmC|;dnAN%R|*n^*YAekfrl4AHW~^&$|KWa_Qp~xb2F3!La|#H_hme8+~kO9c5w(T zo#WGPbPTI~RDYFi2Zpfh{0|!t5CYa9Ebdt?y0(cNOcF+#hSo@3qLD_X#~8TK-|bD# zsrO)&B1EB4goT!eiFdE5Oy5l5zK!41Wug*H1YiKtizf=mRC5%8%#2&pgv(K#9&fR` zdO@<0ZmQb(qtk$2%Kj$X0e$P8wr+Vi;q|*IVjhC4ROjk`# z^V=oGi*f)p-gq5{v&Mqn?Y*=wlP$RGqJ@$*ikpm-)s^QguELszMPlwRLY)@^Hge6* zg7nuF(lsG_z%xTDpW-RXj|(YEC_GXSq`EbOal;RM4rkS(b4%W4;r(2mg`_0Du@ca! ze(^0Q%5kjVA;{-_k*bRyPT%^Whr^=9N+u0EQ(zRcpyYacNR9AGMUileO^qNhh)QXiEo0Uk60TH!JfHH@xGS zOWnuY3XZbPYhI$rK;8=x&HZW`YOlIh`MX^b?&&_!0r>-A~Z4vJG}Tss4~od7q-wkmwYuNQV=e z+7HmmXeqr-Q_*Kvw>xnw(OmI>u1(7Q0X2CcaOy{YF(mii(d4VMe`RT7SuA4s5`a_@ z1@5scKJRB_YOCmNhC1~5$HpBMzn13lR8(D6-5-gZdW5?Ya z_C*SH%>HVdMCGBp+Cl+~X0>11k4P+LFd9uzfvChT2@%3!cZY{`O8aR1uIn#P`Q>Qa305O;K3f)lQO7lpx7r|}k5u{8PB}@D zj8t%>r^8|K>J?(r#NxMfv|A6FAXnQ=M6){XRhRP4RBdv7p#(?Pi_zW(%=oPXB@~of zj|J+r;i~dVBql`s3@e;GOVD>zFpfk<@S4MhTBZ!aq@99ic}})pMjxF}{8wYk)s6!h zo`c!}NX>aLRe7f3!sA5(UHb9j`7d^+gP{>AZjgbdd?sJNW=v0# zeW>PJi~sBGv_ff2P^SP+Wd=2)kL?jc)#7V$Vrv_o%UW7i^YGBbJ8?@BJ5bFPM7R7I z@eM=)PjahL-2f>nam6-MjY$AbF_bw#zLD%h@cMSI_PEfye`b}kWxSxLkks) zFmi(>89E~{gM6{P*YCDDdWTPT1XOZ6(>e}#U-BG6LS`|>sDllqn+0(DbV z3d$n@dL^?aG#U+Lh1^+qvqo+<(k&l!->$>@oRWAkWGs7>O=H4P$$R_8x{yCdD1Wd9 zI;wDj0B;8G&ljDQ`Z+bf$VO9nBg!fOyKJsVjCb!0M*rW)KJU@;k9K)XbYVmLp@nVw zXxcKab^@cR?RdhYZ7IIPrnnk-`IJYp@m6lfABLTKA2D#FR;ovYgaXb&zZI)bt%KK~ z2k0{v;via0B$#NRAr{;R&t*-hTlQXTdfIt6$?;9YmWJdX!dcgx^ma4t$y1rqAjblk z(t-;zJk+OJOdH_9`B|C!PwkyIFvv%s4p`uNt8WqQ0qLFpamqqt`N^@M$tr5#-hrhb zMe@0DH3lq2FAJ~4jw)GhsZzt`@b1NST#ISuO7y@1#f73LrPRXV_~`4iw-~bOy${p0 z;xR|RBANQ8`w#Kb;zuPIS`r?uJ44D5jxk1-WZU?VT7vr%dUsXi@c8HL74qvJ9BoXd zSV6nN75y(|>@bD;i^-WYJoc7>S54t8jFuMSRn5S}*aY@ItGy%(@xIMJAJ@*vpo_t6 z5zp-tg+_q{Txmd^950hQd1T>d?X;?)%8BEx z#zq6QVQD_5@63LwH=WUY)D}1imR3stmc^!Ft;d-xIGck6CHy!@v#)0GU+ zXv%)t^b1~G;SX3d-Au8+e3acX5qi-DHpEvKK>2(t(lNP?vu61PfvF!{K$l^J!hKKp zOp$DkOhNn+MOs9k;DGiHl&k?ekEV9dU|auHRsFZKB3Y z+*g)=@6`~yMY|;(oDqCSv|lN6`N0v1o^9Ltsfu|m&0`$8#jgFA@P-mUe1Bmb$HLU7 zPtvgF^=QRau$}-`kySA{{6}*;0yMXPEk4pP6iR7V0Uhj(LyTEyjQ4O!W-H>PwT!3q zfOkQr?P(Kt-)!0o;((*vr*r}p6?R8Mlwp{a`z(()x=N_)%!d&r-jTblvyGNlQ%`ZC zu85yC9k6J9BxPscIC^)zx52p{9#O`B(EPz3hm1XQV+OnD-r@QftS?}fHy{gMV%1j> z49rp2y~W5wYRbXOOPkYeARepb1zDZ{=l1>M{C~zy6mZ)*c*dG?Weomp*iTvDk)oUS z`c{mY$7OxuWD>udJ~ci|n?0kHq9b+xW8b$o3Zz7v)yLqF*SAi>t#xaDw@vkS8&jr^ z7IxuAmaO>jlHc2+z=`U?XanXg3xsD2Q}ykPULkp>6G{8*0A{&B$6(v+9w`#B;%Fa} zp+d0Z+EKkk;6YwRFt3O;eEhlSdkh9vkdJ`y4Boy9k9iVKYIJWemqK_sDI(icQauvR zJF6Zv9bR_<0PU1!(7vkxfI*Yl`Bd?S8X}--XXTO+&zKJOu9vNbP-pd0I635g|*f62c`67Ia#M6>OvP~IYO}aPOwFI?j%R(by1T1}&ZLSd^G3dDJ9@nP3L{_`5`l4o zi)^Z1=4B<~3(_P6mMdb`6j#uW;x2}-V!Ch^nW{>WXxg4?)DL!jN3#zNuzN>xN~=Z1 zu6(4{sjs(8I_J%xo}DciXoPLDZz2nL_IMH_XFZUv9%)FvfWy5U@O4ZK*&D9&n{h>1 z_I9V9K+5?*{>JaQmnvbK0R;pjy(|-l7e$eS6E-fl3o>v|&ssL-IQN2-HIZ`)t$1;d zn-bBF(meJhB}R>J-f5YMlb3u$%qXYY30nXCk@1J5RCxwr(@2G}j>M*Ug1}D^pm%#> z;xH{;7XnW!eCEsuTC?8mtRul8#^TrWt^nRj@NA9qycUZR1@lXeLVvksZLU9kjUZ&Y 
zo}ZhJ_5lIEKukz~^US+lWvZ*Xl)mG$i?XfNTj`9k;l!NQ6RzxCV&e_9h;8#YsjIx! z`SHoG_B8am@guUu>xR*RzP%#!?c2(_J=31;XUBt6jpXLJHicr@tgds1C$$qgO5aLW>HBKhm`q06~trnai4*x-3{`=+i{6Uh^e#7E#tisEFMccv5D{l@1yPAj|^0?ZG- z$Yxu(r1JV9&~j-G5ma|-63N^$dbh&>jXL-&I7P$nSle#lI)g`b7+wATz~Vnf=J&9f zrDZQ=6Vv3KOXo&^wPr774hWiof{{7{8s_3`HTRU>}D1KE_2>26YaZJ57kXv8a{ zx3GtsVtoq(gdHNN*jQbQ>F(5+oYne)&c=`DZJG@M65;M%u7fu1(~N=Bf};bOK<#AT zUvS?)a-rdih)MHGWVKto6lUpDNV!Qv^lu%r)BRu@mwa7{9 zB$Tm9*P$aUl=WE`za4pY!q%lpdZ~LozRec7pWe8s7FnujrU|r_V81In(&)s_-q}p-(8Yu79J0 z4U%7W@T+r}2&*a&e*wb*C~IYuWqZOUkceVukK$iP6LuS&BgL4KY4@4YF*ofO7 zq$&WBSTK9-;rp8yWjwF0Z2V4}c@vHB%LilAEtD}h-VPgq#Mr^~@7*owi-Az_$vlhm zzi9Mz1zJXw$m%@9o4G3TR_uJj!Q{h$L3ZP}a2~=^BsoOjyFP{DDi zd6NZ^z1$L0hyg4Sy2Kb&tiEQmJ& zN4%zDTm3XoQp)i`Z}BofG-ZqLSP-1L+0=8s<10yD#~%b%D^W;N<1B`|@3#;zTm(vG zVbWPOg&oQ)2ihiY(`GKv*(D^q?|#XP@GoPZJYEUj?ZpeaEe%q4Gp{6s-s9s-pBU6P zl)0)bgdgVN%B!y4=GXi{P_3oO+Pd`lw(M@V8RNcTsyAec`mLrA7vq%#P|%`F@`MaZ zAUw1Jpz9iESt{zt|5pUgV?7&@a@B~GaOf`!a1{;sF2h|w6IuKl%cq}IJ0a@&<%A_j zsErUswUUNFWuO?5h^U01KSi@4z`81|i|%#9#=HLO`VDj?8gOg`&9ou^ zne!Kwc{sHwQsa7)pMpDC6N>>4@g^zDOPl$1{8}t0x*CvY;Ka9r3(qpR$^cR}WQ^dT zg7C$@y&ZjTHQe!)^{Tbysho+-!r67Dp9eFoDW=uofMy;dsJsnCtrLnQ^am9Ob z+LJrPU2!_e1uQb>hAH-2`mLwVkoMw~2-mkRXeVhI1OfSOvKV(m{j387-s1M&h^niG z$70*Vtedz{+&-~8?7pQ>M73z$4M}x?$ z0XzxJwt(vvTS*hTyGMs$n6O&mBA^D5tEV#7Jx55ndXW`WK-)#d4zidTtDOm&f%5Lz z@j|V55w^A#Tm-@cOm-BYBh#XtvoA5sZ%`Qcrnn^|lH5(k8PrJ+IIv;X?~UAkI?MF^ zSRl2Y^>v6o8WO2oJ;}qT0lz9?k^x(WhsHX-gN4%E6Ec)6_<7fjD5$nl=SVW9b?GN< zpj6^v?6|CXTPc44v#w9&u*OQ3Q%XKJ4h#Inlz@i;J6we9c?&e%*sj!~jWz0$dgV@p zz#_@GHZN!krx|HjuH2zS3O*))S{g+vKx5lR{HlM1uABX^b2ANQ-sID0QrFKy(53Ff z464RQbukp7NCWtqk;K-A>e#1M2JX)E+JX5wX>_X5%ABlFpHN>E@Zp(sS+(dCd$`x{ z4fyO8eHgxESFI!A#XY+E?{M>z8>fqc=VErrJh9f{e@y}0;-3;}hGu3!7uqL4oq&nq zaEb;iX`Y42w=G#BW&#-s0_sFJ<9A7^vJAq;PA*d|Ah#n0$u;XjNOOq}tLAlpI|DNc z4bmH*hPZ60g9Ct9;BAwQatM3n_`9$zg+@9^q}f)icf@*nhRrPOg&L$+l(UqjZ>Z~@ zS!F3>*?jHSsR%fXWX|RS1uXDuoD9@VbU%jJlm9ThEX(z#o|J)bHQ8Y}uG=_ck6F{79 z>Dyi>Ow&0livj)~6`ZaZo;xQ8M#E-NCxUCPF{MKsDFgzXA&Sv&K5e>5O2YYEU=p3P zLxA=$4g{*eSWV!|^Z-raE~r>8r}AEBF7+%-CHI{AjYD2CWL<^lLezPx5bH(F#?lNc zg0_7W@Eu13mR&q5cxkG1bLIVcZQa6RRSOh9XtXvVMw=;h?aRuZd5OGtsUeQyVVF^@ zQ8{lsV~qK&TWV^AkTGkR!(x>z&fd*PrRxH<#C(2GSfKPE(ylQ4FonuxtY-Pd;>2?a zs9V#Q>v1yJ|No;k9xP>;yb*T2KVs*5#t?&I<)&6D z=4X%@R2{V{r<$DNUp!mEEWF4>Us5Opr9yGSUFCp;+ZsT6Pq)H9=14z$toY4RV0|gr z(ISDUI=_ORV$cgc zi%R4sBurR92la4k)6W6@v**21x^&p$XUGN7GJ0&C6j8K(sITNdTpWbwqV>P41}xyI zzZM@5xP@VNLaI;alat1_pX8~45Dhi|?cKw5zH#MOeZwd*&j$aC3q5A|SYBJrO3QX6 zg*=_YU%;WP9{Q3+K9+aXJ*vmVuatebyZ`U&JD3@sY}qn3S*N~@=(m2m#VYzZd~rB~ zAY*%=!E>C}wNNpj_;&lo-IC}%ywzq^jSs$ZVphMk(W6OtOu(nQYTdF21e>MpUcj>rb}KcA{38Yd&uF+gErssN z5TywKrsnid@_($$+7IJ(R zXqY`|=eywusXyHUKRM0DsneY%t^>?+(%+O!)?031Y}l1|_w-SJTU<%vD0oV}Gz`iU z(veVWJ*vitWWHY$Oc>_5k8&@~wUl@_kVhk*k%7*Sp}*fetN7mHaMxlQ(>_X`*s+CT z8gp%#)w}%F?Gj!O$nx!j+un`9a0}&Sy8j*>1=ODpgKr_dJ1Hga2YRt$V7&zL0#1{i z4K{z}B4W}-y7J5(xGJaJFNStq6Z#_UE)13qVzMElpK)AqMmhCM3PvNX{W1=k_Je{S zw8X0W4$%-!<>?$H<1%kZ#&+W|#ate~SoOiy|JaZh=le#{f4LmSDp;b-c!raIa_*CT z2|v93wfwXu?MmCuVcNl-nY64neZPO~_ps-6zjPQy{D6uJD^WF~(mU&}E~?<9aRsi* z=Yiz-WKcU>lU*H0JM8JlwIGNe!Iv!esI@{wc~_HNzgNq=I1)5rcM1FIoxJsjNs?uX zF*?SE@=Kv2W2E0O*YECVi>Y?*Sn@Z+G=f9)L>lopI+0by>5Xa*e;7qXG zg{huQTs1C%L_483XhnPCF@rsI$CrK?2yyhIx{NIru5*S{#;?3vKGBL@Z4?MKUw^BxS_kg zFbGO?#*>U924UC#u555DmGruMBg*%;J6D~rtAPRI5v&v(V&9a*p-_8n0Av!MY7@-I zO88(3!FZ4hPf5Lce*`7)tBh0aFFGIb*-XQWGaF4(o1Sm{(NB2Zoys`9gx)#*_S3JG z-6dK#VSwgkS3@w;-Sh2H`VhGXv#s;zuvjX)1OGep>vV?Hh(}2b>7Xvr- zr`9Ofal_|nowHelI#-Jae5WtV+4-=z7%miF;JyD~Tfz^F0uxvp_iq1p3Y}Hgx376` 
zm2N(}*^K8Dx@}J=E{Oz`{@EU5*l`pkoi0^?zttW_sO-c-&s-}3bYSSC#eUs+0CpkLs)2|D|7swcT2ehB z-pKp6Y)NhuC$VBd->HXtKwdY9$bDOA&A75jjv=R6s}qf3wsS_!bv%`vKnvsZ5)GK2 zb)Y=$!UzVURSt;`vo1tOVL4j3L==z%lk(sANHPF(-}D($Py!wVlf$C)WLbCJ4e92n0&~M?KB0tylet%L~WdJtz1T|d}#BSq4-GSy^rnADb zSl)X*Vp~AoCfV5@@3aCU%CM{xDk`{9jr6g=3|l+Iima8tOF#?WYc^CGVl~4-E7|9# zvCEbM%+*0SJ{yxRL^wSget6#o74|j=knMLb#x_*gd^-;&!~8S_IQLw5u*S2=vEap5 zQU&1y@s(%1VB`l5V6Z?D#aKYdJ79ZkhOfG>Z+^qM#^KX zkX-j&3GIg|wE{|KB04G4wHuVszOq7V_{9ULaC57D%p=Iyss za&*j;|0Z%Iz3u31W8O+@tataW{!AAORR{(~+6EWo^@YC;RxiuQv}h@FrpQ^_%L$7Z z5jLCu%TW8nYy7#(%ZH15#NGWa|7|9}lS@GLeC;=N)%advmCcIEYr* zh*@muR-c9iHjv<%KRN=VhA9)1uzI3=O38^!hyjv1a`#2@Z4=ng+reQrHs9A1rXmcM zA>DI#6?v3eul6FCOx)S1`f}zT5T$g?b6S+v=&02#-^^{nPbGePCdOe_R@gIkRQ~t1i;BkqWJB+t{aSM%9(hvXb2V=7QKCHm_1!+ z^-#2luT>rQmkdgbUY{!MeVJCajQD9AnalM7-0UwL!y)OV4m3ErhRHRS+9`)f6q;K1 zk6rq@inFD-yQ>J5zi)X>;Cf!OOxqH@tEs|KrmrE|oz$|u!%7_4=Td5dOD11||9Yx( zvJ=;}0)a!~5Ze((b|_&Zu)mB`VBSH|j=wEUg`Z)Y7@N?}A@<&dZez-0S!=pwI7GA_ z;ak=0pr7BCJ8|M89Khjar$TMWcMe#vJ8h*n?0lpgBN4H_0rx8T(yzbI5*CF_ZhZ42 zyU?-k?g+{VuXxe^CR5ty79twniJ&^1N%e=uSs1kfl$Emg^^(NrKx>m3x>lD)F zYf>lPwagf*x%Od-JlDPdv9NwL(!-6?3DMz9HKXc)gR76 zW2AQ0=d4FRPDi8S$oH8p%Q;NDU-rLfve$GTO~^hyLC*JU!x!!QC4omLbtR+Hz=CH{ z{YaTv=h$`Ef*rwFn@eRyFEH=X^56WF9vI~`AQ#}N4qtU0jfA)#QN;CyC@IM5$?8%{ z7#j6${qX``#LODfXUUz`;!mf_q$7 z#ts z7DL2oo<$+dW8{0`iU(T;x=kPB^CXFE!lM|9Xo|@M^F|Hn_3bU}Kj6A-;{K8WgHfyn;=2*B$ShhU zXoOzY(@Sv(jx9txQfJa<*$XA@kIkU4FUx;m%^+%9LWlWT`AM+eYmj-rg~q@MnZ&kBIn;CViBB{N{w> z7DjPdripBC!~Y1VqK0HF=^YN5BQ{_K?LRMB#hFx={AJgjVjjtZ&XRV;%0d4N;LjEjNZ&41GUmzHyQ-P!iMim@9N6kqbnXwy@Dj; zd(PwRzPV=j;c7$H%kI)3-6g8Fin@ z{GP>t*{=8om7>*v3rIIdv*aKbOz@1$o+-lRw}7VtU;d!iua>I4?=Ys`vEcrlY24=; zq~MZ_FSK^%E&p%f0Vrb^j&Ozs6S`DTWgb7gWajclll))Iy=7e0UAHz$cS<)1f~0ge zg0xD5bT^Aq=?0M&>5}f2Zj^?Fv@}SIgmm-H<$XV~pLg$b_W5?cto0LQt@$5w&Jov$ z>wOykF3Vf$Ggr(?(Jr#`ZQy8g9}JUhUn{|J3a8#7K9Xja9IRi8LcmhJhFE#TYX&B;i9nQ0G*k7{a{gU7%x!H10DZ?bDlEKR(E=q3ye z?tZD}H2GyN_-2Qh*|SwxGX*Nx<8f|Ci^WC~7DqtgzKmkcMCMDX}boVM!o-Z|VmKk4p;?6D4{aVyPa-u!Neu^35yRrjQdc4mL*c@4v65pV_jxMHRUw-W2J>N~c8=Mp`dJrM zAyCuix$QibpwoDhwJ7*obg0X#zVz%=sQU*5*H$DeBxU{QjTKdUoqDs_dV&-)PJuka zJDI)b4-D+7Rwg%6M|`KhnApzg8b3Iv_n^KkiQeZ@v*F+ueZacJf2XxOFZajC43I+q3GjNkase= zKYCBztNE+}B?ATP)&o_wR0OYgfZ=*x?KuU6pWHZ_bIehkY*X>@>-CxaLe&wqWp|xq zm23mi!D#K+R3GO{E3>)+FNj~yST%e0K05U64bJ+v?`Sj0Dhx{$=OZtQ?P42(_S=-I zJo2?v@G+FI?em$07aC_RYpWtSq&EkhK|{CoHBD!-d%A}g4X1ZMe=}ZX4c;{4 zPfN*Knrd`SAR1ZaFmi5WgvA^D_a|irioS5(n|iKX#`vsV_)kaRl+CDRDtyY&p7pQI zZZoV2%Rzcaw_b^rq^1kGrPScl{>}~tpZ{tm`GDyOq#jLGBh&)7D|Phrz60{19)-4M zg{{GT7>w?8|FGs7$aB|<9bTto%$_$92{6{)-33&K#Evnubc{)P;xrTY3M}EcV+0a8Yjn^ z@5u*cugV{|bOnp`f@)zpfmCJFkyuWx;F^`;`D_2nW&5>sNUh^g%c~@+40`9P0vH_6z-BN6U;Jvem&ZSUV+@v6~7TH1w>ocTXse4&CC%@YdZyBV!ktL19z5 z8EB-hh^NdVNaw=JG>(kA>~2gHp_myeCI9$A7j5EKa9xtBd2>tiYIc`^j#I|69dSB7 z?bTH#tzZjgbA4wJ?xxA-gv;dD{`)^#r=6msh>}NXHI^!xpkOq{+pg2}1Wr_Gdf{Fd zJ5@;U=|?+p8og>=2tQthU;$~B#PxCS6n32i%MMqZ7A+_*9A9I%p8?g`CTVfoq%ncD z+sS>_9hk6iV$~YaO1bF%LAPv0PeEV>V|Tuz{M_ixlVKYzY*`eY{C7C$fjot-!#t}c z*i#H>50ks@5C-nm^D$gXZ?V%j0!c$JH&D~LvU#P*jtk3Wsq2_th=mJvH+tBkLS!X3 z-5O^bKDw$8igMvT%)(nlUWgo5i30`rkqbNq@WpRc7;_4E7>Moq_)BhjcbI((!;3rJwn+;g6S@i__d`p^p1&EcCV(jnLD%oj zW(`;s(a3PF2eet_0&Z9d`DuHKH?Lr66)uogAw+XKwhG`uH8$b9zn!QOF6bN>@}Ip& zN@j|@$AG-soos>hKlpmg(|4MZ9S9PjqZg8a*-1_hy{jd}ul1h;8v-iJ&vFZ~i`JU& zE6ZMZvZ8_6s2azix~L;u`VwS6pKtAY72!RD^j38(OmhjpOnDVIaMd*Jg=j_dWrGi- zHEQs8=NtS3^j|q}C~bK*!EL!xd6Ia1bL?prL(1L)IrUIQa9$co?Su@x-GoPboBMxo zp5h~Tt!{aRoM+5jjW}~zy?QSA1v*r+xrJoqoq?>8Tsxsf^N7+u5S8!DB=k56{P?CV zB>$!X*}@H5BKZB-DPj?6ImORdpVqouP8K{+*Wa7nSYvma-r_Qcar=oVo#={r93D0= 
zhr^Si5gueLSiG^bzG$ZW-& z);gQ@{lIM#U6W<6lPcP7Bt(@e>&h=$1*fuy;dJy9Ig`wTS+rrt-q2EReWYO9Ts+SK zq!E7oUg;;oc-xM)iFI6;I8eS&^(FMMIepPRs~^6hf;Fb*N(`%3bwbf8Ya0_+&D0+C zZykiIqgTX})3z|yWz60Zz(0_&6|Hny)B#oQu@lvg4WJKx*tg*QaFeKzeJ4*seN0%b z_L!oy4__nU%Vtv4<(tq?kbv5UjDFA0JEy+GsS{~ovB%)4HD9BBN*}oJ_?1OgkHKCR znjSH_N=|_sJFs^DeuO83VR-gS?++ZBvkKb}&b3@j_z^UC@423aS72{OgU%0n(E@L%kjX^_-lVP#t4rMq%@OvGZpPSg&R9`!%d0GP zrH*jAq{+l~&?`UG-+C~}7c=FH4!xiaT8YdO2t`@Kgqy_3@ht_qNE`7e{ zAZ_Jdbjm`ojY24GhXs0f`TIPV9?VKDUF*i<4KGC2l&Y@J%GWAgZfHNtWfj>-9DT+Q(5m-K}rL$ZyDp;bho>u;s0rOBRalU-ZNzu&4} zsu)<)F>w*_k7pu0vWi@arrWMDv$%g0K{Qgqzj@Gz(8e#o9jxsUvQ%XxEL+4R6XVV2 zsG}>nLITIe?`XT%(Uu`Y`&x9ZY^LeZT~xp|)9jNtQWDhe@>m>V7__R66iyNA$6k33 zuvN)r8_(8xLbUVE(TeP0yUjA+j-z#nf7AllxIZkjI$i9!o*1$65~tLQVH1k<4($ke zcRl_w%fiX$N(2>Hi>|XhO2kTaTqviw_wNRe>&(Wro0@(ZE#mCd@J%|zfDEf4yW34~ zwvK{mWw_2YnIO>ChBL8hAA!E^-g@kAXg5{o9+~!yc=q>}_u`A|+z-?QHpty=ujQ`p zx_x|Z)5Bh^e*b0ErE3puncvp~O3spvyemaQvR_s`PX=2Oo=Fh%^cuv~a0J<5&1wPi zPIrvCo=+AnZ+?wq_x?=xS<%9n3c}DYTDON^rrr0j2x|wmYruPIa6b?YNy)-oA*&%+#m{EK!mU>##i!`d5eTXMY~? zo(>{92~el~VTrEe7U6G zT|vN)g=iTvs5%v~g%&k@V1YK;OCM&z8Qs=e;X@>SMaDk_xsptEEJB3QntVivGZdkB zU(Rl>@P)!k$8%xD;*RRDVsT40z0Q;fhMDYU*ibNz3_+^;Wl#`4>QsEviJ%4=DUcaH z8!|qIm6|JxkNvaFsqeqkum5u>50K;2jhP4%e}v{sR=Bv+2k^|GEc~y6WYBHbgB+GQjsGen1j|e7q2uEAE+4HrM9iiTBwh}`Q84Umy$Ad+^7gczxu`bV31Y5 z=NKu#xQ?UyOWEHm)m_}UoD6#~YYU=WrzaFvQ)QSKu#kK}eKGmTw-}*DL`c)HhtzUU z+~LfxP*b5e%Hx-i$I|fb5ca?i@nz3P$3xG*`>s{Hf;nA3$j((1+N{q!-l)u77+I_7 zH)}F8kQO+HiPdh)hd#tm=DoGKR#PjJG}Mg|GKf)pJ)_OWP!_C(b1dqz8R(6~E=L>i z+3wn@MQyvGib1NfL7Qnn9nMKVhG*+Z>TM!D{>dkgu#xI*Yp;9{=VE;ZIh;)KCs~sB zzjbKNHx-ttnAAcB@((IsJ+&nna{5fs&NUmkm2cUT?<@_k&_-*ym_#E=|LTKUKI-(l zsAiv5s0dZf)lV?p{ac7iSSl;>$?6mBnjKYm4N00J#K#q{_XLnn=62FPUWuKnNGt-E zEE5q5>*)xZhVHD`TP#0iR)3uJ3q{|Ph@-HOf+V5D-BeoCN1kDc8c=%bF%AUdxh((LSYuNSx3l$2`yAT8T7{HCxwlD5b=G zfx6N^x}=lBVCE~qjOobygn2t=YWm{Z6V9lufV%$$(cW3A^ARx|4gBox_XDNQ?jOj< zsdr}C5~to|yW9wPPR^B21Gu2Wb?QAj;iFmF&Ch3@q(y5dlJv%A>rt=pLk+{gTy#19 zTgL|PZ}lGoymzD1`QT#u%-2A2@TFvlv7_c{j#r6uwY@v2$o>?Ogfn%3vHc}$`Oy@P z4{YCj2w%LxeUIR|ccNl>n|WQ3wGVo6M&V(^gTqpGB%l&S>7dpA? zFa0GHAWA-PePb(U=;QZ&K}V(&KYCL>)8G{xoFJme-P&~Fd#nt;Ek^o&@{rP22`P{o#&u48cWapT!E|e=os@hB1Nv*BX_OnT zYi8BBrA1Y=_wl&HxUZ|X@X;J2f@o8Sd2^`k9sZ}iQ8<-0_`xDsX?tC|=@&}W%i7D;D#%?I8OU{a_U4R8+ zbF$Ragdp`wKi(8Y&XCVGhB7eTVa6cAd}(RI1@+Sg9Nf47M1Rwl!r6zc(yjm3ZfP_R zssVt*n$#p|?pz+F?Gs8Op9Bc||FE6$RhO*HXjdp+c3nQ1XcJ@oq{(+W3<(4&-}EZQ zq+U|=75iQ&p@-$~muugz@b>AutA)_!&2uqJ<*G@WHm}=iPNDA^Z;m)R2@#Uh@J!|L z+$5wQ?fSG}a+K?MRJtoL!}n>hbc@c0m9h?TuwL7v9)5j|SHUA(Qs14hi2>dIM!wjj z+`RUS*LZXfPRKrHGD095u(_O(N`qiWftD8w%#525#N7qW9U0qIwZ|I=i(Rwq9MM%q zj+Mr6YBPgwH3q(F(`41GGK~eSf?k-A@+S-C6%Yq{+V8w@U-~v_px;tN1SwFTeK$&? 
z|(h?p!Y+ z4;jQhzikcZleTl!J?#6YNRZ=k;W<*HwF zCB>?x9d7bOn=M^R__rQQNNTmPh0MCJ8GgVM_;A5OIy8#>`3Sm{d*pR9RW1>u=sv1+ z=VcqXaM`1+xYIWpywSCN@SCscKAvVPX^N5Y4~kkVZywC)cg@Nq&n&MD*JhVYcbzrl zKU0Bn&LnC&!(l?}A2N|0=W{j8s|kKL#rrVV@j6|(M~8$ic(Q)@#D~DlhsIH4llK~l zqb?DxFE3~$u$KE<(fm5E4?6S?8zMK8%_>L{^ALM!I~GjI0Xm$+UU*WU`4g^R(+8C| z7@rWXB&j5w>+Cw1Kh3Bv=S?0oIbH26Wx`s>twIT4Q;mZ)n|XWX5vOVLJx;9}mOz}wYy5urh238eOaibEc5q|DA7(=`7kU3P=|md1i4C|MIx>C8=OT^BH*<`Gcnon@oH;8AeH?6XRhi`j*#aQh->ND~D9JKvz z`MUzCZhoqB6&8FoCxTDQT<*e15kJ+jloL$YNSM}tqyJ`9)LQyMIeipfl;p4bszot> zP5#!n=*lFxlt$KUB05r6@1}-gfrOCVGXG0HSKGGshXb8CUGkC3$Q-=FbM^30&7jf< zp~+0Sd)=?>|TW($`Zpo`35vTJOa5T(~v)!tZV2xTUDQ$ji3l-2GJ^C#Hh?ekG61z~}!s!qO=nVYcSF$3Qh!jwoq61cb)5tf#G6CHDGxB{c- z%#qmpAs-28!ET|M_puA1*R}XXO#;!S4?g^;e0JnL)63aEcp*lYb<5~V=aFzSf2 zu!V;4urY?NmdlHrNh&u3H(>tC{$!<9Y-Cn_S)U@*(X!6R(rjblEM1Y$PDae^c;obE zW5wE`7V3F5V4K+=o*2!j0FNN#gG}#q18$>-9eFsjh7b*3vA1}H-@-a*-<5^+#{pH6LEq%IhWAI61ke1XNGPQUF_^ zmS=z$cCFY@Pp6EBHAmlOhgd>Ub`DLvNCCxf3$HMz1WY6PK2U45{lX`QC@L!jGJD@wo%$1|<)EX519u=BZ{%z&3Lk2<&X?5EKm zfoA@90@c0X&VxlkXBoD)jcM)w^atxJ=kvE7VOn7MQbvwPReRT5jZxk;F&l zM`wi0uO4Sf^^-9Qq%_=T!&N z;LOmgFU=>h^Ea#nhxL+T5)1~43r6#5N!ItK6pl7R*AGGeAH6z+*u3i$BBc#nn?Q~x zYEBB^+UPyU2SyjjK}+GU+w@;g66R>n!)?6<)|VWnZ);y?C8`pcJregb+!43vz$($! za4b`kzx%xM4y%Ob$p%KBeAsWVjriC3c{V$NHNYIqe*DPtW}4$6K||(;KDh64Z#d{x zh6rPAMAY*f!x7YSts`68xfKfb1s}P3B?>$&gBMK;9L;JfNoJFP^m?~r-GljhV0Xm` z9vT4k8*xJPdg&@U@T8V0CZ@v$hq1j93_%EFt+lt zB&M|$aEiA~Pf<@ik5eC@7)IUR=4k})+S?~_AB{Dy9d*sE)vP2wWhP3S!Mbbed1kTQ zYfd&(7pV~*6IUv?$JKFUdq9H0_1LAJut18LaRCY@0u17+%Rqm!C0Y(!cw;=^6t#_qdspNf8oxf`bXB%urf zCT}CW9WgKv-NShCtw*u0vj_QVsuR8PM4N-?YgH7exOwO7r=~Cq(QW~vbPHc`Fq(1n z;C=LTMbw>bSM-ziG4 z$>zv6#m!8Ytm!h(xvY9D{Pp9FRyL_+BrDueAvOtnZAjwOEZCd>QAptKkG0*6344L} z<|Ak-xZ|khfMDuSzf2Atc^$QrLz+jckzxDi%etx?lO>a;b~U=y7XZg^@x+^ogj{ zVH{$_)**F%elV|ox2P*L@~13tG}mo?7X{q$uHFGvWp%s3FhP1cAQ`WH~vF9PoD`TVhfO9ilL zf2&qd(#M$pny?2fBjQI-%#UIqy&pYqzD)w@SyvOnUTVlnkgX6csUy=BLdKNA{Twsh zeu0n!r0U;al5qu_RzdT1!ay@X3HB`NbmTBGN0sik3IR&zZ>fK*n+4CLh;8ox**ZVP zv6tQlKZiRQeBpOGxzzoKBd39JO#C4@LB1&4H@Df~6a9*cd<0d%I#pj)Z&Sj;lJXXq z#Q+r7dEjd>i*Gy+X<>uFE3z+11VFry{d(p!clmmR?`)@22s0IGZ z%z;{LU?a_7>&Zkc>3js6_Eo%s1m1*95;&q>{@?Z$Jl@~Nxl1>CeLPANEYlMQMvM&% zYwRDxs`K9%*1+1otuhAG1(@_Qn6z9aVe8%U4}#tDPd%W=`#&w}|FBM&mPMt^18+bk z{~N);7WV@7Y(guvuxI;wbKQ#QU}a?T8#6(uQt*L14keO}%!fLc3ak7KR52_okC8u@ z6yzo&-d6}U>n|KSIVP~{5RS8Lzqb7nrnnB z%fN#bY=!+ZZjpm2Przd<5BnRn%hQ#Ak&ee=f^}_3^Dg%M2BLAjc_7z=1Y2mzpQHG1 z3;h=hwS+u!>{DakuI{3zKDn`F7=6fg9NZ||a96BoqdFTyla>Q@^ z#yJpqxZElsM}095Clp@E&8=Zn&)%>_v%s52eZQxfkeNKG^lqAc zmbSKcBuTN3MSk?U*>c-#=D^G1msNb;xr~5<>6A;#S%^qKWw2U}rpE(>X1dO~Vho-p4Z`R@$ zZLPmTuZtYx=Ufc3>?V+1LR5Mbdxs4C+S!BNF8gw-;_M3%Ca&1ut4emn^ideA*C_wS zT49TIMYn>;X|MVhUMWh!vGY51&W2!;tk-IXDu%t9li!o$`+93NLlBq+=)OsM+vL#F zNLEYST1TimBvU`zF_bs+60~sKW^S|R4<`0cCB5v@tuUX+2rL7dh8A;wXcZjOoXmX+ z6`&X*3Z|x;N`Mmjb3fy<;ktAqb}RTf9)^D1`0SuxD_h_rG9($jQ9DB#dvrK-E`Nt$ zbJ+=qX%`zq?e*Ja_az%GyYv50S4j+oul+hVY8X1wEmgS>*3sr~ldfp4&mO>hee{(% zy~s`oGKQyVtZFJOafs$89sC#fmK1ThQ0XUqx`%*l^OXI40Jc8skAV zo5_+)=M_-kGQHcCmM6_OnhmPk%D-no!#@T@=Rn2$5gwZI-=N&pv)ouq_Hgk{r z$IgBmb#Trta39LWMLaZW7m~IGJ+KbVb%(Y%`xMNALkafH>aj4zY~BUf>;id=CMSGy zVa~$X=Q&~Re`*45LLFh(e0%-l7I;NsjdgQNEDO=bt*QZsdfy(#OPgWQrqwf?Ufp$! 
zvFWWZ)5eB~o$*^M?L~C>H;zkD?&XZtA(*K#(bUQxOb(*?cYVG%Pz@jWUU)VO z9qjv@MW?4%?98+LuLbNvn9S=I8o)1vY{mbw1v}fU=CfO4+D^%fZ(jRY^k8c@H z7d+G#a5ulQBxiNpMWsJO1MSvSI!cxm zAL`+}>pw7b5ghxAo%KozAy=IVb*9?AV>2N(aG%j%W}r%U|23ucldq+Fcx_9zV=x^Z z6JF%n!)>l9Yqnx#xUnrZy~E8ka6x$6a<3mHbp1`AXsW6HYC4gC7gZ~H9tAxD0R6J7_ ztx@_)=$ROv2EGoy_pH9U(&|l)($BlCYo*l-X|hf-k*#V+zhA^_MLCF(=u0VZORpfk zNXPe=*LY1O4u}NN*7MV+^JC{&h>&E-C2h!S^ZS=%cT49kX7{RUxxRHnohcHVE{q$I z-tQNTDzamKofmQ)TM*FZmX~;Jp6C*UdKjOZ2o6RMK~-8t-h!deXccHf?{nY5XIpIO z5#>L0fmVCwyQj~JIxJ*Hb5#TtvJ_^j!f8W#2h-~Z85e!94n{E2slG>*l+Z*gq3Z8T zSiSDf2n~vMln>(8 zNb{>n5Zbc+Ow?y8!zlVoZ9WY~XhN`p;P~90$rF&vSD8>mAb$&=-rv+3ui8ieOp_Mt zOF8uAFL$pPn?%3x()eCF&bHk8&#bUrUMO51Md34kJu2bH9*=49Sky}4_AV&iCq->s z!vPdbqO`dZna0w5ZGfg+RKKFBUVqW`E*bO7gRmRdywATDG*Vw%r;C$x%XhPY%Hi_R z`&|)4M09?oL}ohF*cpf0L6w~qz|`gd7dG!ZA0lckg4OuBLIM*?YN@c@Q;$8bO!>hJ z7@CFT=S(RtW?qZw4Mo1J<#A8P_jOoqRJsB~eTC|;M>E?cmG30*jbkcAgzqYlCN1`b zXDnTJn-IFvy~&8uH=eRD@@v4th=iji!HU3I{dWZ|PxFCqm*X<)KK@UabdBf&9xvYz zQxN*|?SbHYSnSYPma-(;D&Kgc%B>rAa!>?hcyk4qNF)iKj;yftL$4KXR;B@^8H4Pi}I`|4H6GBAe z$LoV0HmZ`sgUK}Zg91D}!GPcoD)=JpO(e9d)(UEY0VUNMTCmstLf5}}S9_fl7Trzf z6!k&6-|6*Gr(+_2;uj-k=Bi1ILSORA0Zbpnr2_5bwk7t z27ZyKzoyk%a!qWcsjh&wYA4ybnJ;(fAy=3#aGqir?qZpVC6jxX@u_dmH|YF1La@KD zJ<&<__wNg=4Uwu4TvNd-?ha zHAMwYT^dYvfF6m~bV^P5T1VyMu@%L|jEKsq+NNM>D8@(a1_$dEvaNb#98u}7dys4k zRH$I6Yf-Pw){gJPmN477(PEmIo=xVyln}c&;$_zWDmuMyfF|4AQq55X8QsP4%4QjElQ+d{(ZGLRCrvPylEDKfx z?qVQY%+NrH_-3l#9n&fbQUFCvLp7bzAYWu3Z@Ii}Zs z+h^7ecj~>?^AIBuTu(m9T8W??f{xbtg_?8EO1NfJXCOnAdTx~8GLpLv?ST=%9GuN9 z7{n~ei-+S9R?D@$Px|pVex5?h2`xTJhJ91PoH(^oS4s@R&;5VuvA-f& zNm5qM)uG~oaSNj0ZkwSZxYAJH>jSQCQ3Ge$CvWOx0}b6Oi6j%$nLZQreb^`!=tyX} zcS$2;$IsdY^G%29+?vhsV9w#wY9FzWz&RwEy!}U51pn7|1RTP!glW4koc-RHM;=(` zSj(DG!*SK7KQE8BR2)T(QvdQPgYtoiGG^WjTLUr*dN&vHuTcAsg~BVqh`cvtc>tYB z8myaiCN5CQ`*KtZj>HXH(Q6VsSgiVVy3?&>g^jb>w(>t8Q}8k8T?BK1ni(*&-6n#< zD4Mh@Fv&E2^@NjvA%D0JQA$CrcbA%8OzuBS;@_4L$q0D6T{HSIq7f?4*1#0MYzq5zq^uuO6 zQ@ZK>ofi5}c~Qb(s<6PF&1l%@*R>>07Dli>si89P5|J;X`5!06rKq#9D?VQ#tXHjA z&J^c-azqo}3RX{&vP+nE>4SrKc%c?hv?j`*`P_(gDCx*?1z z4 z$!a&^bJE4vBOJAtVdU5t>ECO$e=>rp(!|&iRQaxe!8RY027Utk8$O~){jdeWEo0=y zH#1$#M+<{+_(|~=bO=Psc~5stUAE$VZhb}fUi{nF0QjL(2%HfF;+KVZgs_7zYbhp= z1{1OL@6&JMu&f2srP~&omhrYo?#B1*yc4osVvfgDjyAcDwP=Q}8G5PU1fD2^x|g4V zuH7sKfDie9f>Hm8>B{-TRtYS)YL>8>rMhT#GS*wgMN9mu&%z=0FhH;H-QR#746xJ$ zDL~jU?=~TF!wP^wD6~^k?TE$TBuSQ(`iK|=tRDD0d3HvVkc`0P;CR042fV~H@6AG74lY~)MV z7nHQBu4$J_8hcE?V8S?4%0mqPbJfWIzH0xTO$D}uwubs4Ek>}2ppsY;OV~M80zx%% z-rM4w4@^#({@W{+s?Rgz<619|Fpf=-^9=}v{O(%Uhmvrgk;q@7_0^MN5-va^ZKzFcg}UWtYnv1|e5spg8-1tODUH+zo56UqYUjZij|2 zt%Dio7GI?SLyvlX@zp@)xQiEyLKh%jVOYodVtyIJX`jDqFnX`_z<&^MTEfb5(y1eh z&YazR6VoQnNp;7-l=0DK8TvtN{>yu zTRBFlzc2VG!_$a0!A>(4?bO=UR`{9uc_yhodTuc)^fiAs}3Ib*_Q3I^)2MMe(T1eJun^?SYf2PE@LD;*va zl_VqqZ+6hBR3Ob^2n6?Hs7uzy`kdqj=2;!$`V}2@ZIewUmIYz@Tvk!`I!y{!5@UUE zl9`7aHI8^1mh=z)h)Ku|GJwJzDf*RO}%Sudy4VZmJlqlh#gls8}iM;?pK? 
zNOmaGx|l7MneeM$u?Jv>1pqzr`g{zcT!-pm0PQbyOpC=2r{neWf8q&$>mN(S0xOY@ zwWCH99OX{-bf;-ETn0LjUccq*1uC^XYkkid%t8)=Kz@*Xl+u+GLnHil%_WB23d>mj zPu{xIWRB-$uIUj7ZA??>T|Tr2*6OYilg7$&o@`LiV@b9kJ92`4`9I*;s}6clX@)(q zZbhN(#1&n;)~DJZ>laOUw;wvs9qLmvT3a|em~>A{7xKe&qe5bNb1F4mL5TjC=}AI4 z#p`|#xC$4|LB(B;kw@9yde&fyc;dme~G@<-c5N0FmBzJ3+>s~*QJzV zZ2X0rY1|O8JFKDf)1$xbDyVh%*IB>~m%`J<#m7VThPQIl^*Q>Xz0)*O5S4Br`K@iV z+dz|T#K59!HtLO`6&#@q8!LQ8HJ5UHMXe$qYky6a78&FWU|ifb+gLkumGP7IFGkkV z3+b8Wocg2A`dQD-({_J#hhTCa1v$IiY@@HC48<&1m&U%?Tgj9`X~hr44plF+ye(lQ zNpZL_$mr$E2bi<@;y5-89W3^8XJ0EQT_JWH^tljPCEIeO=pvI*L0?Q}DhGi4$g}G= z50#JCew*_B22+-TPj?%hrQl7W_(OnVRKRtu6M`n*G$I%~Bn0?l+1%dF?4c~UcU+?s z>lJ<&^Y|S}G@fl`*93>UTsb75IKIk2hPLwNYTYJ&EwWs1aj+Y2!YJ{Uu@7!#-}6d~ zIrInC5`P<`XQ>9HVdA=a<>%gHE=URf9F(1Dge7*`{*~Qvn*3XHY|EJ9J)M?8dJ1pU zNft0Xi7$~alTr@(czby2wr1Q6&>0>>OejkGH>1!F2|3%zQrk~a*b{jKyBaQvpp>f? zk}(_-RC16jpwy&jAqw>vJK3o5p0;h;N|}Gu{x*)yJ-hb(K9@8Lbs9UezxClgcD!8q zPZ$_)33`#;?C6$)ZKJPE-xoifU0dSVS)&uo#et;{vU-P=P~T%s?RWRpL~Y%RKe<>nc7l4^&PfC&&@$Zwt87T9G#U1Gt}AtwAR~EdVgo> zeB`nIR;-wU6!#k%v(?T?c!R)PJ>QnhJuWd%`GGRr7|0XK$&uBj=NsS^aleGN-D{hQ z^_!uwI=U9Slz$)+OeGPVSYQ%AzHaJe-iTTomgb+qQ#^Y{s!;`z+6f6n))j)bB3s@2 z*t~Pj#s4~M*+s@``Cu{yqO$G*O08_&OXYVfMm8k;?g#B4y!@ead_J&?XukK<%W4{5M|GeX|**Ay{dczwNHi*I)4 zCd;imX=5He5VcJ+S5zxD>BCjbEQ+v1puEklFDi|a59Nv3A*fctS_JK>jKg)W_f*(b zk{JW2jw3dZ(qA6@K=0vRk2Onp8QQAXM|NDsW&Y7)gtSDrL?0c1X_hD?E13@0!TXZZ z&xHrk#`eA1vvhHVq0I|pW9yb?YsowBL19pArO`K$sq5-bO8fnYpmAWmMs^W&BMMm$ z{QG?hq?d5LWS|lcsQP5uB}5zDjjB69Pev{8>II3Wu|yZtzB`Ac&9~Rg#SBzG&l1jX zc&DuzUB1G`WV=(pnkVVlH9OcOLGfnRA;=Q}?orE`l5 z4SCL{2H}PAUX4#~$X2a5SYB^FqZhUkKua1gx(g023_K`KU1TxO`^X}2-eOEK2%4$#Z(8OKMXdZ37u&A#VCp6;&H`RbK;ui1s@>Sg8|(CxB%g_W zVTJ3e&Sfq-9@ZGHa~zu1WJv0z-a;DoeX)Lv>D8A}kKi6op`-Q;)WcX;!$Af{`P}K# z8Ws6HFeMl-71#4jimA4pZ2zg&xg)E+#ZlsIPR;vL|V5@>kEnSV0 zVh59_QgIE$oDP+IQJamrGYcKbgQ|KZGC}nI`eK&H~^*1#m!6%2S;aNO4 zA-H57c-(}M{BM3djK}Ri4H&*{q@?)rwqkc4%Fy{boc z3`L3&tE$j{Bdu1E&*Yz)#y1uAb)l%>IyeZS3Et}LYF7`b)15=fs^a4#34jZ}Lp`VD z$Gew;^JXlU_)Idec`qiXZ}g%ui04dJ@{7Jz+4lCF&X|h;9FZ%e?qGJ1wSl$r8kR5d z-g0ooezs29=Y1t}we`#E;LL|{A+t-sr_80ouGoM6*qC_T8k|a(;pFf{TVIoBBKs#F z7iIfmQVbp5x!J0aI#t3G5F8xKiY*H0z@mi2abLmcKpu&A$+9Fd(@$7nRduO84bRRp zg|pf>(la+Kv|UP0?tdE_qppN4MXZm~BifS0jlEJG$ey@Q3!-U$tw93rFPH`fQq@MA zYFB4L&ToxH%-xQVpbMFaHd+WcfXC!f0P+x;P3ffswX6(WhcHD*J z%!q-Ca`*I$=Gd#)ZCI&R5n(^JAm*k{%9A}bX(qFyEgn^#O?7(joQU_G1Hlni<2sMV zLwS#`M(~u|4<85k{yNyJ7`E%hh1In(>vev zB7W^ze7^bY56I2k0dlNZQ3h)S?{|-a5q1uCLTb2uz=!vkH`Z3a!JF%2{29H!rC2Aw za1)Q6r^;}6PS-p?5PKIz8#VLa()Oe@0zv#{gs0S!c9*U(vR|3hOOMSG({bBhhO^EO z`EHFdo!|vUJPaK(d}*r?4`mKU1_?h_8w|>KkfB=aFHQJvkEIwO{g4B#4RcX;{c^#n z3jv%7!7h)YB>9ymd_nUIX%4JrfkBzWW~qH%4s|?QrA4Jk5lb3(N6<{#hB1*EVDkAd z^vh-^5t8hL1^DPhmUuUhbfXn5DS7CY6I3c=OYv;KppsI@D^xqR9ygzu*+1*YYdG*7 z#bz81zyB>2ZW8LI@D_u>6o(>oOUEh**=o|5qFdeI3!g)B!Bihe<)Mm%igGU`l-p(u z)_Cn2sz;3oH(YQk>C_S7#0Ul(YnG*0m-G{XMEtK>uU+w3e~4N$4F>(2mwR$k*=w2~ zNc6tH7%83O=YGPYlCyf4N=%w!m@i zwLl?lsx%~Ut-1Pt->OlZbg{PWcUU{gv>nu9Oe%^bu{Vth8LjlPcl9Sl&K{?wR7|KV z8g>|O>m)W>Tbpl^hvza-Dh>bYfHWV^EfaOq%9JvyehZ;-m$A7dlW5?eVZGOjXTa4b zd(uJkJOS25Nk-b+>mZhC2$arL@i}6QNsc$iOIzHTmX6>p00K*aAue z(7IAWfNb$=JnB-|7qy3XrTd@8{=k5Kh{@m;f|8P(!#2?p7*va`&BYHi+1`LUfk@c% zFqxCLuihxhiPyxJ%BxPc=o`^&iINr4|6%Mb!?KFHZcTT0w;-6h@9jgk)? 
z(%l`>@X+1TT_PZLw(s|T*LBXn^UoiOy7$^^%{9k8#vF77fo%2;c`sG0es<96!d}AIHI3!?+0rs%RusYEaMq zo@P@Ppkr99N-`P)XB&va=-cDJGvLmm=fK(P;G!3(@K}@Vl2y*LfBWtPb2^?>U{(?{ zc9D3m1fK=t&wZxNoq!pE+bqqJJ3O=XXPO}ZRc}E&F!wXuz~WcGFv_wmEb%(N$pM;tp!^9Pb~r6%&| z_+@`Bo++d(*Dc_m6MS4J>T7%m6k#!jnyi1Vd;QkLg3tH?>5x01X5grM z=7_Er#FnT0&k~BtHXKOw0f$X=-g2p1&$PYB3KaMB`mGBUgO{*3a&`!!>;EFs`HdDa zoJ9VY1)yma+bGj@@tQZxNq)DFJ%`_?qA)eju$sk^RuW0OF;;3v0ePM*W8%*DU`w>F z#oI-p&UfwI7;)Hdym+`A_RxCBV`i}jDF@MfZ}D+x(CaL?isL~ie9b#f%^l5({&Awz zcJ&kp;pvy;Y}H}`2R=?2XVFN7IPH8_Qp%$nF$ZXmLe!w3xDc;!c^;)kpb@c;BZg0z z%bUQNZL-eYi(WwG*|c-cMoI2=)dag!M(MZ) z*7@C-NEwRW#v-W;w)*I-W0v%`o z3Jx)3a_(R4giyb#;b-Atx=n76EP*63gRwRYT4A+M1bx5Mglw;2{;;gFL1*mA#Jr>V z&&A0Z64b&(?+CdmkRbaA+4m2MKNy3R1wj87xXFOdr(z*g-qp$}#nEohm{I1>YHFu_ zgt}{jYD**{jN~Jz||ML`&o-w9coL|wEP*ISJJ|jH$hHS6OjI426 z&JWBbtMn@o0 zsYo=_g^*9sZ70LE4zrG3*JH%Uachq#XkNG+TaPNu8cj^Z0Y*8XpD@V- zW8GW5W&sLrQTvjxdH+S%)(KcT3!Y-I-LnueN6G6^_LoNDsg#HrJ^ylD#IYmFdpUfu7Qm1HckN z`a2@eCXWB2w$_iqwmwq^{kqw=u&9&n&K?uOosTwuYu|%6oTh*2I6M>B#r#6}q7F12iHfgU{=ph3YNI3GEVGoQec#ioV1Kh@gT_1aU zIHOA|7ps{*Ongg%9zAOzEBFCp?NIeLcCC3Rwhl$QQYFFzkWgs1oE>*d%({8hE;4;% zT^>p?QzUNTKf?XGAsI3ir&{9eTQbPz`I+}3|nvbT<(1pY4|!I4yeVDM$f;FKmNVPw#h&EMn~a!mk&@cDa`N;b>LSOClbbX?&&7e4v>#hEd#I) zwu3dXAzKl^4pGH|e10%V6_-(lDgKv*8+z2Awj2%T#9AJ%gYfXZZ9eIN>q|?Jv$`xi zvSwZXT!RLO4=npLFY}x&cyOLs=O0fBEG$UW*NZYT9tT{yfsg24y%O5L4vG@mr zN20uS7gjWXTOEP>=P>55sxqw}ZI>&m3Qay$*)7m>+_(W?sgf&dUShGIK)Gm!jLQ{$ zJpr^5c2fZ9iuyh|jEN4QUcwqWK32wM-QDfmLNhs*m7-2`U9|_Ez@i)UaB5oV5c|ySX(7~IqqV&@Xd$qRg*OHPAJ}A zXSYrSP&V94Z~l&1Ay0=};e-g<@fl)6^C}+QlSDnUAC>1*gFgYf)hc~)UgUrEur89Y zP)^pb;|Es$s=BMKp&-d9GZXwA(cFNO)~JYeKQZVAZ^) zSpD_f4F_rtWjqJ`U~cEu&VNYj?RbDZzm>2aut( z?Cx4SaqE~VnENx1QZ80O*0IntL(;tXcOkQ5hih2l>0ICW8|^1mh5?J0MxVtC`~yTC zm94NTQP9YX;WmXslU6 z=9ir-sWN@8kjD2OuAewu^{k%_n&01gMG0h2nLie%XzqC?v3{V8_FBA(RJb z@r`!M^$+!ilm4~pZwwNw2 zc!viAvle`jf5xwc7)Dy`%kC9FZ{|SeFd$cE2K+fyy;J=S=pPv|##KrDmv`^D@FrwU z*a-3aVMxS`0$4wrPNDEMJGt<2O^Mu#ZYZPB)%{v7?%ITEo-LXKNzm_XR_mmk^Z`7o zeYXXT7A3WY3yL~|y5R%F@CPkM>tU<@Oi#a?e`M6+zLmS<`QF`fZ!)9WYwDhvJ|M%& zPrm;{s3}OSs}9HPoP}Bf`X&(h%=4Jw*lsFqO9g0&E3JozhqAid!h@N3#4uzC#$+hQ zFmyV=2qPJi;Et2?M_uvMJyU?_gdJpRL|WXJJUjWunW;5E!&hW<`JKa&5_*{-8-9M)fAlvdKH5bJ%Ey?qedvh@Oc>8ds={cZc~{r1#s3y$4= z(*`}+6J4KgeT~pr<&KFC@7t$Vx_ej>Rh~H=`zEGhvVU>#&{-`O2rB1!;Llt7pHIS; zi*hjMk-w99z;$)qt{XPgX&2%!CTSmD2{UV6xBg&wGrycl^)BbAl=Vqfx%1oaLa~em zQa|mnWBc18oErO(42OUeku#0U$X{##oX-|CdUEKvR8UVZv$y6g{DKyw;@!P}f3;H3 zWKjB|KX}GvEQ2C>LjcA15FqIz@_4qg>!SR7*?_7%4>%j8Q$z&Pxd2)GB=i4vSm|El z#TLW@7Chy`Am_C^glLlB$ea#puOd}3$i(N_b}a~)31eJlLf^qd`frhWtK0Lq>a9|c zpL6DoO+~kCu*My!?q%W=%B%Ex{GVvh8SEO^w+RajXH#yl@QoiEHZwHdQG%2Fp!_;1 zkx`3t1@mlm&~sc-R=bW^TYr#q(%D~L7(CX=rfUTTkC}adycXI=Rub^v5DMm4CM3Sg zDjB$!PA4WtrebR)CM6icjv*R7aWwyjeI@0(Bs|R}^p)_Eum~?-YZ?DVzL)$JuZc7U zgs~`rrF&@V;uFexq5pP8LNSOIWL=dJD@;t6{aPnt{gB)HO+T4({{Z445#r51{8cue z0%>qPlkQdkN5gb)8v4fF`sD==sRHm7qqBavC8ArbMX2=0qCED9xbb~fxcj9}h-NI% zD5CpifG=qExa%}TVGtj}w?974FJtFCT1NVS%%IjeJ5UwxToKvWe6LMs<-n`U!(_X= za)eUFq6-H;$}HmrSutUJxY&JEUl*96NzgFOxT90_{6$f{f*)@^ojB#d9LMy=OYq^& zgiPWwXwVgI;Jd@7#hP(UNi`x&x+}$qLBlAyx;(_uqKyQ0RqDE#HZ@mlB+P|Ql$n+21hmrx5(Nm^VSN5Cyu{0z|Y{qZMoL0_*v7Fk;)dl$*faATn>={9Q_W@WPt zn{Xvqf80R!P+2U~=9Y~_X>GUQenfQXVizTrbPcebL`-+%7z&U2TlDoWJwiZb$=JIzQX(buRwE8-E;ow@MGLj9bOC3D= z@<-3C#+Xm5qJy6Lgr6#14JR@-lEMwLA<8bp_w1-P8mF20sbY8Q3bVcNC=7qZ)MF)L z4R1nz8{awTPbbB;Ack165-U5~88xHkdv*pgtJo6?e%oGh_89C0wz!J493vq&REhd( zb!f0K$p?Q1qLg@EjS}CMc83(OHioJvhO!Lj-8Av!+#pwk@Mbn>PRCAE%BpUmT5_i>$Phe97P9!0V*U@>(J1!MYk5F!F#$ zQ|-@$#%D%~{0|OS^Y3-WqJIKeV+3<8aE;5)a 
zMKj&Qs(Qvmq|V0OPt_Z*>FXSxj|WZfs-iR;U(Ctd1{}b3y}ZmU=zWw{#FTTQkN}>q zg8&T13v$SS8}x<##w>m*EnuEmdoypWLNKBM)JSZn6^f>5?_zCK?mTbMuf5ShoOA~| z;pAHXHK${471fnpDy5^q z0S0@jZ5#jrRRh;P5q;bGtx=G!B(7SzF4VIqrhCfcD8B`6ePbB@WvoBHDEj{ z9?UW8qQo;o|HtEc{g2-gQ6%U>1AsLZd*PJE|GcVda`y`M#PpU)+%^O`_qUfPlCM!# zKdj5hu?Uy=5cvDysKj42Xez8(Mmo~5E;xOCYK&ca9Bo=^#F1ORjxJ}bN+S_>#o6a@6>&8fzcCoyRB;yj; z@Pa8F%==*^+K;ku?(763g4SeK9$3ZXi6h3g8_UF>e@Zl<)FtnxS3`oC2(sZamI!z2 z7b8jz28z$Fh{G+G7}B=;x<(F3<-ergz)p{XONMG>y^eJ9MG|FPmU?9Ev1R1pSCqea zMyyx5wInnGiraW-`NV#p2C5Z;$l!2qpr3}X^kLP~kV zNn!9-8NWq=M@G9A#R7VCxa_;rX9a}v-0p9kajCYDKb9U`0@_14&vMP`u{-LHGtAJ~ zCCDPkur!7!HRb?8gmQ_rh65R#;ePv9@o9PsR)@;uFFeQ!##@DYo3Ijoq%UOgUmntC zW@ysSLbnfMI@sYU4c$PX&c@d+D;Vo(6Y`E)gnnA1@yKDvG1CBN!G z_o*jSS7U8BcqG=F-RM^>4FN^3DuI8X0 znYw9=hhTcSE2M_|Q_lxY>|S*ba_b8RH9$ZhFR-$j8TTTYrPCU+=V$n*$#%sh_Xs#=d z@8!9ZBuOpx-Y5kd=FTXSQ@39)M1aR0QO!n4ma&!i)f zn4r|1$XWVn*f^VEM+Nq`%v7G8o$+p&7@rUgSe$@1M;oQt0L^8QhkYqrvVozV&P7JG z1t&iammI!Ug=ZbBk6lHNv8^B@(VoW5tD4_`DVU1jo@1ZST+WIia4cy$U{rD#3R0A! zDtYGIegE3*k@nfb&3}MetGJjkzI~K{YTOkQgbQbq72wXFYJWkB!_F6lG_t9B9{fGV zeCOO1)`!V}4c;_fY0emZ7o;qEh4_)L5%Cp}3t4!0kO`igwe$Z>T;M9fDoY4P3^mmC z(?KM7WGTZ7H92l3yAM~(Z0fKTjF%wg6bRG;Pgc!Y-=|&vS*F4ItVgMzFsJJuKJr^k zEoHo!A*001#u^~y+@|i8ISe~+8<$4`2wnB5J*DY#Z9s#_%eirwISFtfOh~eBl2&}C zpcp}2;7;D1r*Bj7gx!G56btwIx6O3a1?B>VcTwbuO;{;ln}JN`X3LSm;-luuUTy?+ z1@k@wxh4pyXPvZ}rJ?>G*S~`ps%Y^pM<(FOep7Eio5)l8USX!Q>} zS)kLf_KEwy{g`i!A|Ziw`N0 z&l9l8oAGN_`=Qa%;!w^)1~#8zL1kk3WLWNnu#&`tS~}#2Z>xO;F5UdpWylop5Jhi@ z@OaqE+li-1^DH4G15C@;3bf61Wxu#B6D@rD@KmsWmY}ZDOJ8f^cZl zavnR`5M1Jxc_Nq^k|2Vnwb}O6xPpN>Gyz{EGk@3D>p&tytbiQCm8yovFqB*7kNu%G ztVto>x8Rh_FY8Q!o10%7c;zwl=>GeNQs~?C8lBvUgiriL?8U7BHds((^BsLVa<*s!| zO#bWqJebwK;&!V4OsBqox14-+KAT-)P{Khr()wS2b5m#gcP_W& zo=$pi*C{#}aC7Hqiv}Ygpk&GWvV0%;+JKjU9K_>%G~rTMQzmmHz^k&i5-ud`5y!UL zDynD2YX@WX&-X5?wpn#|L0W9}x}%Z*dj!;kuGyyv86-;cw~&QYO=(g^JC!${^%KIV zPW9ZT>!g9LY?90N{R50O9oR_9=718LTv)n*;YOo6m>ixoF zLx^d9rXe+dZbKQ`scqHPBscFm+sp63E)SbHq}(?*i_b-x0gR(TJ7^YiwjZ;BZOEj> z?2<^gaR)y-KK48Oo3Hi9$m3RXKPc-TDmlrN?=M=%43&Ua97t;~euL=;q<-gNi5VHX z)z@j@9FRy6<`$SkyAn)+oLjv()??J5scv9GMl)D#^71YD)T_+%ll18>Yg8zHlC%wq z6)F(8_WFG)$x1JBoEXVcXhA`L_+p;%n0Ngvb@yazOVE-BZrP8a!`H`nG##hF2I#{T zblfGB?@azDSm+<0shNo5G8{?UqZWjlf6Te@(;6O}DLm*X{Q{t#loD+|i#B64C^vkSEecMwoWH^O(Wc~TQft!K2Ewwe+;3YuQx^7Qv; zkB#k9l(_kN?rXIGdcQudPn$H`r!R`rujKUiea2 z(TCp|e;%a15{w9!IStG3N?Ho-lm3+4p*#Czkmfv$3vwKhB*hkK)E$9(zd!>wtT~}n zjK^OZ47V{4_j{U321e|l(pUbpT+FCEUi%q3L!a2L_A*ge#GmDCg`VT#$@cSPSpHd= z!}(Mffl@=8BAU;TI!L0Vo6e1jotKgUf* z;TwyU+vLCfsv=mF9Ba@J8X2sE%AvQ06y6jVY5$iPLFHbSI_^wU7|hoxxoOa!$;cK< zk74c*w!FQ@VrFDl9MY(F;xk_L=OvrV@=*!f*Jq-Sn3HMjzlQ&D&w!fmMCnGCN<;#Z zzhd!|?K@k4xA~VJsUM``_l|qmJ_~Iaj%(91T3(4LJ}H?dTkn2khQmKIzngFILESy*>NSNm-Fn_gJnd|DogVa4 z6wvVz9U+Ng@@;rq;kPxc$%ou~kNE7#YlIcF*7FV0z@NSE+TQPq3kSR$BmI_MlrZ9p z{Gc!yGI~0U;!dYnV*VZt8+kyg0ru?vp#X&Ca=7%oxxQhxllc(&Fn^3pC#|x0!AwY9 zY}3J9T`t|3)P^lG2QYuYoA4>G1UT%gO^Sm#-*(@yC?;t?g3&;)qzRRTn-iJe&8SYj zAoM&)*=*hOh`t`EX?PXQ2N#HS-&n;{@-mN{E35qOa@i(?2J8KzwLGlBF>lO^WmDzZ zvgwM<#!HBd=T)c@8aCiNcgGbuS{D!)cwD8~pKm8NiF;+cf(5XyQO^w75?^ZDTShFX{vX%61_xCWL zNSJ#puX*=P(oc{$D~q62)q7@ht81#K;BXIuz$+B|sIcd+o2&WZ*Mlj4a$ZP88Zs9Q-by ze`K9P#nx}j2OtU=J08{6%vRr1uiYQPy<}K}&Q|G(QOir5qmIL8_Dx(Jz9$}g+oVBU zBBPwFpw=3(AQH|-5d`pMEHQIzDXV14U*$I1kt};I{hM$zjve}da?Fpf%gn13$fZol zern~Ma}|`&XIZ284bBFisa}nyJ(&?oyc6f}lQF_ngTo(UcwpQfJxi3GhV)CxFa&!3 z>^u+uqF&x5tF!YDwEes3bYff?C2!o1Ujbv>r6L$>zpngoQ8sK8u5qyxG$}R<12SZH%0Pd00}>hjX?m={ctDFALRjd?Kz8FD+A8F&A z;Rwei*{m-fYI#^GC(-+}$we%fE^=&AG=rjuQzpf*oTVJftUiSQ>59%~k;t!aD=z-eFNJ%WSx?i}P$Nb%ZvaxBNTKC!g=&yB3uq0Iw>$Ujxd+ 
z?u`~S##)lj;r($~Hh-vy^I^88O$8t~B}Df7ZkKLG#kcivZG|4De!~8fnXN=zk-6$k z(6i&X7q2&)*ZaM@Xe<5hquB1+Ek*gpt11Zezbt^X73p#*5}$6!$cVyM+DtmcFSiRG z=gHL&{h2S%c&VLd{tQf8=|p!{#?Qsr{2mtG!D_v}dkePf z>NVezfohrTBfO%fyW+ES}%LNqZH-~t{RO0(`eEct?6zka7$1tGrhmsWPSVA=bMrb{i`@qc6D+wIrv4Q zT7%*vzgXhZMjZ@#B4Qc&?#^_{>T~G%VpmGy%cB29V7URWlsaRV{AF#s_yd0rvGa_L z>zC=FA<^RrbddC7BMo$mL#XW`$aKJV4P5(?%sf9;IYUYRMMZc#lMS4n^_>l|Xj315 z|0N5FSOlB(E@Cj0lla%DzdiciVewiC&1B=uHeOdhb(w_CcqjSw@ajqrW(y6-AAY&xB769SsT<(g z=n2|{8p%B5b)Q*Oi9D?d?%lwxsA9lyfCb!~4ny+^z!dXs%ah6PZX@J=@>^Zrl?0|! z#>-Ly1}7_KTu}3}Z9I4lzg|g=!9BsM%0MQ98BN#G{wwiSj=+HFl`rxs5U)&Q-gwEQ z^-LPvM4cJJ*@I){n)u-((VVrO-q!Y~<9EyAJ`)!GbpoGcLO%W2P_+Q|)(RF@%BMm% zqF^FwqN>5E00Vw`puYH~?t*7(+ytR4+pilOzUcG{#9MLpjuWncsbE1jj-=abx_~~M zH0#Lys4hndIl-~-ZJGT`v+PTZj_W!YM~sa~n({uOSLz>7Bd+V=|zeM)|U zf2qksJoO>=6Q=+;qchZ$4G*;X28=3csr`G+RR=uo8Ht3iJLlrRWn;U$cY&XUs@fdK zPm>|!MZcvrOz|WYw1I&c%1$!J13ZD&fT~ZeXBHLvL)BV$R}Z6lG=IOq187QC;#0qL;uHCDZUg2S4(~zRK_a0L&Vl~PVfTBREB8C^an}~_ zzi#}=9_s4Tr3zyWeDHb47x6+J@eCMgNRMZf>`Zh&!Z?twoeiy|Q3o`sI&?k_a;JS) z91%@KmSuB=n5o^$2|-9)fE#dn zf!W;WUsaui7Owea;+N5PjQIFj#7%FUL>9!DU0?MGn{XbD>Vb z<=a~ql_<>ff_&qON@qZbcer`oQk4IUoW&b4;!-qtJ+$}u9snT$kh;?prai9xls<3E zP2+<+)2fo8jTUh}P)-h8BgEq;qwDw35X1$9gXo#X&G1`zda^(n{f9DtS`|1h$#xV4baFgMRGdhE?x{MhfI=T zWoqO;xmPvb!TGo9{tP!2K%e~N7u{Dp&jd0v#`Dfv7Yj$d5wB3dqnumXohW{7p@==Z zk7vb~apw2TJo}FB?~&b*G!cYJYq4(VLf)Hj_3kVOn!>oOB6?4#+d;n5=-{QjJ_H0+ z)1D^oRxLJGy@pDp^9gp2_4FqSQOSnu((McGOLMK&Mpnmj5kWW&4i?M99S`R}u&Jns zd$BR$D_sCR5!X)jj_%7lFp(Sfv1+jW(~wV!rg<4*>UYkO)BsEi$#tHcf>+)bRQl)* zMcsqq$*PJ{Sevg&cMdz-R7Dk56kybg`9`QWDC_f_BfjbT)zQy_a6^UX!t|nho?mz4 zC($RmI-F}o5=bn3zAh_4Rb9DUP=4I&KGNF=>W0!Ls02L#qUVCRY4Icku5sPm#Z}Md zfYE_+0lg7p(BPN2_!I63svkteSInInEEYzO98KQ}GHM<)$SGp_5Bc<@_qG9#Z8omLKXGZ*6NeViWrnts6rlJoj@yUk zxQU9u54+}rR?2?Zpv%S!8$1kgC7(#b^(U<@i?Sc6uO3%>6uEHIZ)2$RPOVA8R!{^#bBv>62C^ zsd^PB4{QcqGxoC}V#kWkj1_)QUU^yvXELf0vT#~sh$}CTe{BqxxpXqr4yj7bSeAS^ z8=TCZ$Q;BNaF^tIpyWfxr0et7M=o!}FuMa^0wI5}|wRTD;>&NikE zkv_lY_AKqKV-#M;bhr!JsIKCI*)Vf^FHNs7(;MNqv%2j#*@D~sI!As@Rel-O1WM`M zFcOE`>wh-(Za2^FzI;Qe&d3MVoG+_g{540X0AHHNKCgJbB9pAWbPlSrMkL)_A>>Sx z@TR8u+2!_SZYF5*Xiw=g0=34I}h&zDJ2YWC+G(d@K~?SGt73vgnSL zh}RpVKbBG^E}boHXt#lj-8UbYubv1tRWW$8Jo@WT1!}RC)?KNMROM~0_A^p z(;I7XE<5+?xt?sU;x}KiDj1BE@Pi8<6>*Epjl;3Mrn0y>9@Z_z1q#Vza7X_|-ManI zm;3YAX}p!gca2wlX?G{?SNxgYa9d>+;{cLE8D69ywwBTFQughi5N$Y*E@BY-wxS{% zxPA3P{McRm5zjm*(x!Lc9zp=M_HHB$nk6@xb3!U9-(4yCrVK8EH7tjpT$2;WbD^mSSqH<<1IdDK zLU8JsTC?B1V9*Uw*nd5|0a4BIMIq>30B~$d(luE=*z*`XdUl(={OxC$nMj-Z=PBr4 zG-3fAQYUbw&mo%II&oiF5p~+V5eL$|4IV9{xL@bwkwrWe+ zbQXtk8g=HuX0#KD9kC7b5VL+Fn42CS^tin>O|C)fI6Y>liH4VQ`iw62M#m@Z?XP7T zh=iU3;xz_eOCxTo)+}3eaA#z6ijWC>*@K;^43-TC2Hc*fFAuqSUt1g)cbGj<$Ol~3 z>i_&bqWZS$E#5!FBrzW7l0ZVSFTsw}Q&s>>lwOUz87|9GisYWhxRy@zYs5_^TpK#8 zhqc}S)4~Bn_$*}igGPnVt|Y3?&6S=|xO-I#rbxn&@2jPno++U~9C~SS4t^%Lwu_9| zgnKv(aR=H3bc_8izRIfwEO_8a@P|3FGwcf@*uZ)`9Z=f4BKIF&#X)yUL#D*Fi8b4d z&wfsw@Y~{ntd9Zl(l^M?m$2>Ct#vW$4dX(_9t}4K%(XvqFU!x{ylaQjd(PCPH`n(b z8*Y6?0f)hCy|bNNg~M8#u9d$qs{n0Pqem66xJa5jQms|K)eZ;J9*hgFyUL6a7yRaS zK8Np@7pVX@sXY?YlCs2rs|MJ|yn{v_FuHZfTZ zB)JKgB3kE4l<)~e>j-C{!9m^eYLli8bz{XdhMN5NH{AvP)l7QR{<_Xa@{0~k31!Wy zD8z3AWSxOX&JJJQ>+qep_=03qg#%s|WT-s&YZv+Dfeb~;)Ee|S7;6(XT9SA2bdt|J z;d$T^32z^X5klj-x*5cgNmaUzfD;b|7!31Ur%4-hPig0h#^j1Ol}kIJ^xX~F`-l#>!IDy-016P9Sq@)T?AOs zLGIKUTXZs2#NL`PC(XUPLDZIY1N%&oU{)X3r?L}I8WU4%&oFY595(>e8aFOKEtI!8eb>)Ttb1wQY<6qjsV6Z76JSpGR zA)k|9Rjj=q6u2uBVS;X9KZ}FrT=vC5(pNa(47deK?QFYhWsgrh$6C{iKq*1mj^mYA~}02}fvTIk5(Sl+l|h+S12A%(QEptA zeuq{4%Uz@Phk7$Jowzl?Gl`cvBQ8}8nAIeEv{N7|J5n*u+BRBbKNoozP<`Mivz^w61 
z?IwRIHfrv*<3Q==%`U6LO4Im;wsYRiE$>p@AMJ?fN7CPD_x;82VGu#~lzB!#$M+OQF$$b%O zKI*Ol=or-)C+-%gxGmj9DU?$yLO)?TP)2S? zaUwBme;pX;NyuluID^ju$pV}mZBlH*!~Uk>`Ze*(wU?U6V@lz+JYn;3PRz6`PS6i) z78G+UDJu2d`iiNqxMGX#P5#T5 zbPryr&|r!(3JMB5%J`2Ap)nybkaE$*6!dZm)KS3*>T{Hvh7)u3$M0Q{?}S%qoPv$) zjbE&`uYCMBa@rq2S6xS!mY2>w1jPcl51O9y7_dBsIp#4>MW5ECHK9@WbR54jFri&r z+QsYk6DCOdI+y~JnQV?wz)CQiBgD1Ghl1pc!g*?U6!#0OGb9k@t;b6wYF{kQe85wX zaXnFdk~q{Tm=0xhM`%l!AlOsGk*z>rxV3xGqt9gSvl(42i*XOr>fF({x9*6QG3LJ# zuw?4NJ50eWNFJo+TZLWU=s(MlsiE`LJBC1k*D9UgMb*)n+G+EmaL4e8&nye=m>gATT3g&A+$YkcrzcJ!$w8Y+oWf^*f20ki;M3G@BADK?ZN7WCFkQL7s zG@99cPk5+?`tYh5Lv5$Qtnol6##t;3wGZK&56z|y3kg>JfkaA{?8~oE!n&kAYsbAP z+@Do(0)$Yud#T6@6g+2heiM)9EHaen zLP>?28ZAIe+MNXA-bWn5d|X1$KSTm4YpEIamOXD8dA=K#noty8&Qz~vHAG!b*e^Vl z!SZ}s6t^!b9LU;NB%|b$JOa==F4`5Jr(Az6%J`aoxKSN$MKe;|2)Rn!3L_ou#Z``0 zg^+5l=QxwivM&*LZWS`f&fk=nIdQq5SDJ|IS_;y13(9xxL7Cswj=5D%fWN<>zE2GH zH0Hh6{FL0z(f->faL#wgl|N-xC^j9zXZ-}HM!&Z6F_HW!iNNhBp`3h!K5Y>SmG)zO zn#IGvwj(#3-ljhA-7OkYIwdN)SWUkk9kKge#^!W&xV#lERH@1ew4B?~wH_nmddoqE zHbQiM%<`yVJ>e`#FP2Xo?l=M#fB+Pz9KMev^#vWAlyz6`TP9&@>K1QQKb!<3O)ZY{1JRfZp8#ClF6PpzNeS!H$Tgw zs@s+rvdrq!_m%!f@hP-UF>?R%;2)7vIKI=Rb{U!9#2=wt646qk4%e>VzX|oxh|zx^ zE1RdB#l_yK$~onTex}`lr%E%@hA3-;x`MDdX5-fiX(@kpMkp*&f>#my$VJ9x43WVK z13CjiTA7k~taCkBFUU!rlFv!p{rE)DtofI!FpY>I30}-Y4Dzo=g{!y~_m{7_KRo;W z!~&$8xd9{Hviq}?=SFyeAkAfww5EoZB`)fps)~XLsK<=fI)-uMgZCVvski(?eNJ3L zI8>){jj0m`RNRwH0_)*|?AwwUys(1L$K#({_l0o-S{15^k&5EQQ8Rs{43lg)bQgjEW-)8@=Q5pDi3A#8fr7hjsHz^30RT{H+t z4wV}$0fs?624{H(I|&2PKi z6)U8BQ7-0^E;4KiivWglxa$m96PIxQDwoA&Yqsew7}^N=9+aW3ylQcOB1j~a3D4=l z0bM*0Ho0@5O_~Ic&XHIi$p~{(peFv%oW2kyE0JRqCctTl<1hdXN#?HV_dC>tZ6WKQlm zgY+pZF7siHatAI*Q`=TJM{|l&jJHoTH8(m-0K&$5ry&g3en`DSL>|>cGW_MFeq(kB z71m@{K0{n4$7(FBb-1^F$O&a{tL^T6YdF8R!h%a^6#c!P>wBlU40m@V z%ShLE1hptCNF>7k8gN|?)i~qYKj#bl^N@a4K*3noGX(9^k+WNtRc$p$a9EDxJf)gH zqI8CE-Z2aJ1Wcz1^?c$VhnxH&N?>1YyJos7TmB2otDsZ~{((X%IU8u6mGXgI z6B)r)p^VbJkC3X_tpdKuOjKBIS!$C+ih&}Fn#v^temDdvsvm!$S^Ov5>3Mz`XCaB6 zNS1`O=u2hqL0+LhYwhWMl_tFayhbaZ6fKi#GL5+=0~YYpFghwd6g?9Cv>~|{Zd|S5 zK#&Z-LeTHhv;}l7puZ!nOG9!OdrBWs@|Uux1**ovqh)TGze=~*tSon%*n>F3@37YED z6rdsG?2+sojPhUd@v=|doq|5@Oh-ydnR9Eg6QX(xe@K&?izbQlD%go`g;UNmS?2TTN6%9|#*(>+N^ z?BhAHGb4^a_AxI|OcYV1vKM<0xh$DwGN@J;wwRRRHr{L)j(uawO*nMpIU0B?%jDE4 z;e)KB449Jw?LI^|bBtdx{6+P2$(_p4FjAE+fo!6J2CJGM#Xy-HC%|(W-a!mzF(!Sj zaZuJoe^tzuzg;K7K<-ap%R{1*fx8*R^zAQUa~mFM?*~B*-(*1- zo)T~vkuikUV}$=|qprs*8#Gu+R^YRRPuAZ*qOlbRumsxnTe^s!DTn1ctBU4CXNAE; zguxnW3(@T61TKGm|8fs4g+$eeISTL;nUhT(f2zz9+63> zXqd;`Uf~=eV8ix;mh=#pMI3zD*jqEI$Pj>?OapwKm?Z}MyzpXSP*N+DzfjcN*&4N~ zP)`i%9f8%eF4*@JPz^zj6;RDK^6(wlt$NgXoDwJe3WXo?@Du4br+ z!?#{-7t;=H9h)1NPuVd_*^C_zqe}8uAGSXU2{8(eVu>k;qr7^DAYr0FY}K!#k%nUJ z>W7oPn6{E92BRpae*miok$!cIO;OPrn_@g?-v=(-zQ;PRN|sYv7&uV%=kmg10$$5^ zPx0T`i!&%YhOpH$LufO|n>(~+O>>rZIXcwp3bXy#+|d}gkx>n7u-L9-bJjByeO&Nj z7p>(LW$y~jaK$=>*lU4{YK~EQwTqyB!Y%!>9erEe>SaaG^k=oF=$mAySTuhxwm{X! 
zL6#YKV}^f=r(&W$pd45{tc&rT=6As2>HeQmjd9=;@81_>_*QI!WA~+BwW~K(L0O83GCw6mptek|l2vo6 zma1ZS6&Y3?A%>7n3xi&xgdED`y9UO&-y$YTJXBnPe4qdK(8p4R*Kp&g`9*yY`(90>EHUo#n2R)hVm!rug%L9;)*KVd zNNto@Q+8#x5d-gISY<5U>;FawhIL`~fR>y9cx+RWeN)blgnY=yRXGhk4oYnEMGD z^TT(f?&tN-K!KMrmoL*ogXV&_nPr%rH%;4!ww+X~u~y6rM-lmLb}Jc1XQ>{QPZFuL zHKG(UKy!b?71tA=5SqlHP~cKC<34{W>;Lb`W33x)cKd0gor_ZRDnD_y81BAk^w|Sd zXII*|5qxg9i;U6vmQ+moM>6{aSWZ641zloX`ac&5s8U*P@cXvypbR#JZqJVUFccX-KiPD_{QqtYh z@Qt?NNji6`Bs; zLMNrf>PI>>Vsfv>_jzNs**yzwi%UZej{)aZh{+T8p-I>SrEN#t?{R15?E>=C^QY_fW3gygxck7QCS0ZbfM%q3XjaidW5w zic;i(^5BKk0!JE?IA};$`>?DzrFdMJub4J{7>S*JW#FW=RRDjRy-}bdz-)}SEdHj$ z-XPsQeVKXK2wn)=|3V_GmuO$~%7c0CDbfN&8iO<;s@aDHPfiP?sqU?4Cv2$xhlfoS zHtV~pS#k$0jdyKNfzCjF_(`jVKlx+}ZfG!KIhbFxpA})JkvQo;uH-xD;Eyu5aCGtn z>4l>t^}b~jE%zrd#i~5*YpC)=D|LO#_EsGDbc&yw!{l`dqk5FFl!1QywolU$JAdHh z%Y&JvSa`)UkORVm2iqf9N*mDR{O3>Fg@o4urf}4(v(cM@vb!6M-e3da1nYI1?}@1C zNhlIJ;GXpI7FjqKB)@1Du3e(VKBCILfAr*b189$y@;4-6U<0ONc!r4yMZAoD=BOtX z`~dO{LRLg0`VOyLR*i`JodY@i)TY99Z32?0GcL3s?BNVv*7H)X=&wV3NL@}1LC{jLdRTV1bj=~bz+sz6Dx`PBJ6J!1!^wJL685n?!O1(Pds^OS%UAIBbjgm0i5-zyO* zh`%X#6N>Q1CDQOpKR_iRH|3@88^W7))2xvnPm29n@hG6D1x;UG{wXf$TAQhBm-&-E z>n+6JnKDAN@kkYTQ0CBY$F=^d;2Yzwp^;D1kBPG3HT+S&%OX|=872%QiJ6K$SjDm^ zUPDXZ2uKCMrz4IS@@b8G8DsLgU5DfkBo7yrfh4@7g+zb|{8mz2Ad?6ScG-9CFO^Mr$wGn|O`USISL z(qT^`Q-6J1;5=Tq85Yatj3k@T&vOJ`P$|QjN%0CF%(pF75>^iEP9pG{z$8T$c&`-^ z!A*~p0yv6>(mLC?}syiAv4dO#&-P#Ng>C)Jv!-WHq30X8AT9_MUEJ;oY*9)&H5ScJTWj9C7sOP zs)6e%mi|aGZT1~8h|^e9lEgya{(ffg0^F~^uk#Hoqk4uLCq4@<^#j zrUXl)5&+Ry9?cPtrF4}_xg0u!oMp182ZVr?4nxrHy_{VdnDoG*@rvAIOa1!9@vCCL z>~CC#RsT@^Hxvwxo7Y?nFVbf31){kl+N9*b=g9Q^2(-%wktTvWuu=sM4Ec==oJ6W1 z)v-qYELLACgC1H$|LbCZAFvT>Mhq_}2h2eGanymNY=BXn*BNj1O(;q=Q=^V7Mef0#77YSw? z9b_ok*d0GklcAFP$o8CmRgYyo|S)pEMwl? z>`i-CeivQU++X-;vA-gifST>@(HBwRBRt|mkpFrXc(H)pnW}-=35x>Jp`erNGTTR3 zSZPvpdF5c|#TTFs;y5fc(g?O6un>q1%LXZI(-^kA3uug{I89wkcPe{rvPxyUY8UpGydy|K}cm-_YL%(SD(cW^=?K zT49@g*VBA^JikaogJPH-1S*TKbLV0gq>D^}zL&Zz`a)H^nFTU2RFO1@#VHShg(Q^AqLq96U@gAe1#*L1u zG&=8Ae$a0$e?E{pX;Z1rVFfO{EdrjS3aJVg>hwDr$x{B|6*H)?V@>B7q_LFB4Mv%bxu85by6Uj2eNT)o*rFJowf2JV{(E-}cyPyx2{>M7@G6 zArI`fzuhQRMaS_(kLhma(8=b;RJF+=cjLo|58_y??CqmtYSmncSZdj)ve6U)-@^!$ zfA>TkUi#(Jg^JUG4V*Kyq4qy#OazSyLhtr)?{l{#{BWE15Op>%U+*CQv-M@8`(dWO zW(Tu-q=1@gfw30LU2;5wuHjQ!g=bxvj!;G{Z&wTN8 z-|DLgSgZ>B$@;b7R{rZkMdvD!pRG65X2Ji8cr^kwy@Gp?^1-|FWN-|Lw6uZD5|Rj&#%We@@ijCV&Mq z%?}2-QKBHf(HV7MBar!zrG#Av2YWpR#Gc3iLkJco9h4iX{$X&By}>Wx{V*HfQ>cKE zJ?4&)|2I?quW_jaWUGT;ZhehMMUjAA|8{1J;~&QVs2K#t&bBTg!=>383mTXeuo&=> zkKjVP9!^Juot8V`7+!tO7C7mBZpf?TxRIbB5b~cp5;jJEBfQV(&@^GcHcsnFnM4)? 
zM%UeGbh>Zg?Blh@%}nR3OrxojPE%w|viL9pR=M^51fQ3llR+)OSZ=R2v($=JDEne* z!sbQo7h6iA2zl(&7da3Gt^m6#VVw* z_s!MWoY(y-QYN>WE>F>R1`&h?2lM~7FJKVM+K9FY!)PRW)ux{W&L`DZB$D~vD>?0d zLFqeZi5*%fCyJDloHvKBTvwl`qN$qd)R-p#J*rjj+VJ)HZc)^1olR1ZIa0c2m7#%F zl_BW?C**S|9@EZD+w~5U$6@P5i!Q6b4j30_df1aR0gvjI>zy+Att1dbWH5QHC+X1< zF-iHTuTIt{HWAphDs(l~3KbYYkW0#y3`5!29?J_R;dPRTrIFuwOZJ{4>gh4$AB-j& z-WQnh;9+cV)=QO6!Va9=v9_xX;_cC#K*M;h7uYdoVENsd@w3sg(eV|)dleI3sMVM& zwpMc4Ov`|kIUfnJ97yM_L3{R6gwJhn>zEPbfb`gy|6qAw%G?p*pQclypp-;}jEmK3 zy?T`#DI<P3%D(u|YfFRC*r ztILPLT|ww^4()eZ!&$-HE zMOoB(Uu``VyMi&3d~W{8eni1rI|mjxD^J;lfX2HT)++d3{F@W= zN4Ycz@PA|h-kLPSRfx3^SldYMdnp=7mVa7jH?Kv+ZCjj!t+oNlP)dj9yjGkjROF)P zII`RrOo4BJgMMt%mnoZXaxF9V7AYjC@;;VCZ-}-I@qm+o>~u$wSNR2ZMxAX> z*r-fb=<)cp1aMAoG39fEk;=_*bzSEX9{*JhTg#u50oH*YazEzc)sIA`M>W8w zM20Z#{(47_g*Hco07dxRWe8EAf=B)Oc2%lGqa@Z4VdES>T{H~Ly3ESAk6dfwtuK{r zPHQi;9No^gR9m?vhT)vV5R}b3W`P0Kar$~B;FFfd-Gx$gW?Q0Gt^MfBjqS6t*;X{ ziC}6rn5-&vdnE3Y4rc)o50QEgSNz%#Lhlm$=mWyw_F`*ukorpqu1FquHjhUZ5W8s& z#kl(rdRmFX&d=Qbf`h?qf`owPZY$3oDj$NhhI9}@GI_X$N6{!i=87S_LIB8^BS;fg ziQDbv+rAWzqR3YW`TTBs$&pH@`01||037cyyK?td!YS8tWbe_U^FIn44V95k>#T$jwYd}gO05o z@)aDG(>PgapohSWYa)G?$R4{9> zbKTjRl=J71c#<<;#1HK96tg+iT?EM&Hs44P^U z(-`uj*j7$61-)GQ^XPULn(LJi5t+L#yFZhOFuC}JKXD9FN(yNZGtFQ_7{-bSI*9d8 z4#^bfQQVV#xW98PwbcN9|1e{aIYfkV{JFbW+O5@h81m=n(2yu;z1m3gd>H!$@2rO& z{bx*?e2NH8SIShz1CjcB+t@$uEfqK~dN3d6d^k<-(E^ds$72-IcpRp!Vj-{2ccxI2 ztRIiY(GDr%Kn0rOiL8L1%RRm!@bwOm@dSm)H?G@@$f|BW!J1lR1f9FHT_A)ugzIh~!ZS~R9R6H1C$|@D&7Lp!@0X;Rh z^Vi1S7xadQiWBXna9EgL94u739~v6y4HI)LqdC@s80mC*{L3gr#My31 zOpZadD*gm8~SRF`Y@8d{i$|>I0cXF5+;D4snN$gjZCP4aCCo$bV9`gHhjTy7yZM=T~lI zk<&-djF~qi^C>pCXGo?aCJmTgx$)LpZxj!gp){leVfj6+OohY{E5dL#rgb;vH`MKz%G@t1HK@^# zOQO2Vm(ijpBu-OWyL@$@4?`{>OvgfMjs_{RI+AltM5GXAsakIJ)xm{De)lN}0wv3^ znK2T$vDY%q5_2r6NYFH}CMJ$UxpBoMF~Kw09%N=Dfy>lE-wvm5i+XeRGM_ZJ@aNG=7ilou=e40U?Ly_O$~-?v2}JkL;Zju;oNZ?V zZ30FH88-Sy7rs~h@eHnWIVJAvDCxW-HdLYIV+O3PXR$Ef(bP?da$_Rs|KgDG-L^`vxQA!lQQ=JB!rP&6^J`I=+-qU z7plGFz7z>~FVT_6N4ape%cos5IVd3N!5E|@s{Zd5T&5nK1?I4j?o7P4l_<)&f}6HT zWvA#bQ_YjYuw1vc`)Sf4HK~A*e_G3}?BzhU0+P}KV)lHW6#padyj2EU9gc;(wuU8AAn=bCJ_ zgidNpqR{kCO|S2cdSX^;r7(<1%g$sOoHSG>i>O}M4t?+2%{Z+ z%NLpx-&sC`FUgZZKmU54TBZ4UVDY);GuP7=uF2ox?#lTX6aQX;)#60%lc_@yv|1f+J#65tb zq}=_XmJ;N1KQ1p{xDCE`s7r5tC&gma$ec4%`F)Fog!vjpzu8+n(w?7&#u_jD-k2od z|C)$pX+*Umg%?)pMM;B%)iQM_T=~|GJJ%LOc#o056uo~{IJFlug>qY@sn4l`CXpC(v{AgSML>5EVp`U zhb~rY!GEj*g%YFV;i6z8$Sdy;hxePEu1;wB(m2{{ZN_vM6H1@bajoPAN(1zEc~-tP z)}$!1x8tqRz{)BzjX}^^EUn@Wh2HBUW{cyff%VJ#IJHK5kG^=pz;eCoB!PQdyxD5c zV3MD2UlG+iP|Tj1qh}&I)BNW&ekVk%)PIsNK)@>7rjNLmF|!!=YxcA|rOr906G$X< zu=Y2z^rmI9C=cbk2MK9>uFphktC#irQ&u-k4}Q`-74{Xl(td&cfZ6PBbP&9dG|0=* zf0{uik@XL`Q$T44s7Tgh;VrY=9(2i=^j@eWiMFqP238_keKmovkoaXnT5edMmqxQx zL$tbd*{HjG40ZimNf=7kd!8rxn*yOR3FabqP5->akV0I>q7Le z-)#t#$a$GiX<6c$-4Ax0pVBJPr#PF&ckdIu6GXxUBco`*gFuOA>n$8v**mmnuDb5U z9Pxc|+#TlV{8(n`L0MWYuvD9Z%W9}!Z9c?!{QC#$(L&)3YfE&A=zc38+g_4dCKE86!Hz3RW2p7pY+ zfgZyf2MR$jdO!@ZAx>8rv6Aw;M}LsTkx7~RL2Y$H?W+zs+MDcCXFu3wDEHXlnc`J- zfe{jw@4tBaj`%LDPySdfx56K|!OGZHCgTIgEiflFm)AQ=H$fT9<` zVl@vz;9RbHX(sc^m`T}*zBw+N{r0zK(XhRV)q_{8)tai`jKj#4dU+v_F)1*RC=`sg zBPbAg2uUKRhe3(`N{vU(U4L9pu0CVZ7zKF4wV*U%Aeu+%(pxB{$0aTh$zL;X)1%;J zzu442`fn`&Fqs^f%V0%%UxV!BVtK>H+jvoKyVTTmpC*O*w^I^fQYlrC?K6ai;nF9} zSA~)Yy?Z5eFs;p)So(X$gO?C5BotVzs_GE!rpw-fu6>X}5`Z7(zCUlm)!u* zY_}fi`#Z={T%#}>G-((fE=umym*U-ZQn7ve=uA*xwSa%(eMdm`79X^DdU|6LY8b7Mo5u!&XVF5p^Ku>!O9ZtFYaJ}LvmXCuOIVanA`Je^rC%xp z_}yPg1N5AAq0M@;hF2fmd$k+A(PRIs0x^3})u|IZBbE+W7pk&k#mK6q!osz_Bx!(+ zTjE5)I8SWRWs3h|EUjQyy+~P7rhs}uhG8&&yUoax)y%(0mBm)CV7+&D*XISVyMk5> 
z)7Z^I%MF^R0hxOMa%dmMXB)p7-sWZI88o}8Rh!CY4fk}Gh$t`~^*)cfBu!svc2|xh z6JnI_t_tldaAao0T7`kv4f*$kpRH@E)hhLQ9~AJ1bbw+>5zeCtXahJ5(cY?p|J@vd z9x1SGs>YX(kG970V6T{UMD_eLYhebCVO%jRd-Ww31Z;+1HX)^OBu5FY0Lx_eI58Ex z{nhAP0Mg^O)X-J6fM?ni+#Q7YUp!9Db@-j-!>fW3fY2AlPSZ}Et-g9>;ftH5J}U)2 zaNZV?^f(V}GhbZ(P<8Bd>~zWF@f9mol@2_pd6kDqIFN0ZzV79o$Eo z3~#|-)Z4!$C3^$hKCmPcHPH4O^Ri;g&n951U1LsRo zz-c#EOBS^bc&$R1DOpMK8omT zgR~5j2p2)*h^7#Etm5%9gkP6XE5wWJw7=r-H<8$oaJa*0CZLT&9*&<=Bu~-0K=5mcxgJY+P0f<`^1CzQx=7< zsnS%>!Q5wUtzm=b6`o(CHLTS6hJ`&Y!*6fzsAN;AN;r4V%9+z^6&R-&=n&;TRCVEO zKbd;+m#1&v!~!OGbrBjiUjsUUjB$4|D8HeNq8*_HNx`AwGe-jo;s;#yQ4^fSbE)uX z5Vf1|PWJ1U5&;IfvjIOkBni_KqssO%<}0(WyJ=eW?aZNhYqU1`$~3uPU%RVeAM z_vLleGM7s{_luEVY`T_E(kl|Qm+UA@CIlfC7LgfoI(hgqx!<1@=p;$SvZxbv3fSN~ zL{Fo9mNX+P6Mkf|c%ZaLD(xuci0`Fs^eb(aFda^d@(rr0E~u}^9nq{}LQceR335Ba z2_3acH5dS<4ybVRleIWy>q~Q#o9EYP+X+x10M$koa{ z8;Ni<_y%=!0dTmZ5qa}tY8^8m><#}ga}2Vqd24h_uSu*C`x_6{W*ZIKHDb5uz`&qJ z<1TZ($(r`)FV1A|+jbbc#UNE{Wg+7@vv?wM%wr6n9VQ<}!BM(W@T{8ax9(&x@A**4 zJ-miCj3pTKnB!-OdXY;UM1A)!#^tN6POv59bqJ9-fzqo#d$X<*1h!yX!d!UImz=dgEI3^ z@5e-k0hN|C5`#`b@|xCm;%9Z~Ve1C<&8r@o6uHJ`pVP>4b!>y@Jh-JyV4C(u7lJc#Ou9_t4>C6 zmCTJC@>4~OazW*o*%;%hH5>iR1!Z{dd&q_ZnLK4{OsVq)OT)zVem9gl*AsG94`WP` z*X9t5sKmh{uF|n1wgS5`7}jEZcY}pxRcc*03&*ck6SRG)TxGtP*v9^^WacF% ze>OUAayGD=SG;vw`G%-dYZ((MFyHK61-42z_WH+FxMj5>rHUo#4@77thJaE-V4LuU zU|SacNx@6^xma*t85afHjS=|tnN^B87%8x_lLXiQ+WT4Vkir?UZ*UbM`-cafpdFkl7KuR| zs)>=N$!+UZXZ<&bjlU(tA}6WJ%2@5Qh7j-9hx{PhcjT-4a}C}LzLH~n{khRoK_ZYw zW7%>y)c}Rsz$Kp`QZwcaV(8`*?KJc%JQf2H#%Y$-afi)_t|rLHT+7V(F@A^6Y>&M4 ztH&ajEukInGx&4HIi&|D_z;3TaNfFI{88VR8=IP+r>pVs%`W7I-1ntTk zR(0)*Ws;fOs=Fpi^khdEhP?61sMEDKXfMc~T$3^3q7r6 z`tl$I+=R}a(Z{zv2o{cRzrR0$@j6OHAfOjT(u-Hr0AhInBvf6JL4=OnprMFV5I=?x zps`q;=BbY%pI?BqTIk`{*@rLm;kg}-IidwB0ZV*QmhT)JVs<7k219z8Rz)mGojmH^ zf+%!){i+ISO;n@fBgnH_n2Y^k`aAfu9hf`(+Y@S*ZYZNXrBP1M$h6Z`mJZ%m88d?@ z`V;CnTf`m;r4n0abqA!2ZAS`P1{`FRiM}@W3}~J5l+Epc(afOr^+$cKbIaUB~7ZKqeghOp~lq^I`UFU*w9F25{gumVpS;M0`1|4ZrnU!bJOO@Pq2bl62 z(qqyjy1$=^Fo|^65>E7n^7>g8BSAiKDt3r)8a&JU(*=mERRnL;qS z$2${(Q`L!)-@{5lcB;wykO2#A8|`Hg9F6c6KxCTu35RaSMaG|F5=UIRpG?qG;L$La zs8*AURd6(G+hs1N<){dwW+`=)&~@>uD-#ri{H&IXMX&2tOyOIVjghRw;utqjMM!I; z;ic!)Y&IXZ?LHfEx&Dp{R0HEfsIhbpRr#GD?_aR5#8-l}vrr+(!fpulqUHG%Z{78K2`&xkCv8StjT}I#b-F&G82l0f~|VRjZ5; z3}+-^q!zebYUOjIo!(5v8)}PyJw}v@Y$YzcBQ?7|quX}-aDOYENr6Sc&yd~MIq5av z%C)+!Yrjc2n9aYNV1pl)x=1O8%aZ-{g9Ir($As26TWLmvA$F@UrC!FqM7KRR{MuA3 z9>(PbV#;9u_|^S6YS(XpIYY>YdjryK%~b+kdMajUlmzY z%yU!;mCRUj#RP-@$e$2?U2B zcm{F?An}|f>Cz3LP&gX_P){7-iJVtU{6}CpNaOc(3je=ykfHyRgWy3q7KpI9K5UV> zIR7fzu{7ZG(m7!a`KnnkP};{*aU=>2B!bbV7fG<#?!7uA^?gY?iLOnSNn5#}zQ+5MScj22S5sQez{PK@Ba0v;>AlOhNs;;s&<~qzg00sm|JpE@u)=v=ocWoq?o1DV3>P zxljC&n1K4pUd*4qRx}W)ud4mN)W`u8J$V6etU%`?4;-a;2}=)#3(RA>AOq~fWp&G~ z7S|DR7~<3I%`F>HFxeDIV4X&)DS1rraG7TR)c^fih!8tgS3VPwv`Rgi2pU7sG?7K{ z1=3;`**Pmwx`G=(Q zS7e*%O_`ng{uCaAw^%Gk7cy3|v_*S~6Y(sX&)PcO^}4oBzuC#amqb8bGIU)H+vGv; zQoS0hC{2c^=^nT zy)m_mXdS5`csO-)B+Gp_lI!VE+z9BSm4IR2Tg5%jfd=3k$ii#b7ibgd3q<_@4zz|h z!$cXGpK>2U1URxEAL`h|!+BCkq;JFFb>c$>3>f56HRpgGdB*f8&RktPzi! zn}lUn=X9WJfVP-gN~rps5an0Yy-RGo%62GTCjE;iI5g3+j3-Pbaw!rS91{L^56)5c zd`IbwN>Lkd{sQ&5y+Z$#8Rm?L?kpkH78K(=A&~% zH{dQwKi-d!@NYsahJE!W^?cyQ(V`jU^>^@{CcH>(7Fwd zC{ZiMo>!APyN35Bz|%}dFT&EUkhcS*klXL?{2X|*o8xnX^w;SI1Sd5o@~@}u4AoH* z7_zILZy7Xy+9^*xSuuyJ>>#|A+UL*zo$7H!#A#GX^w?!@LQWb-YU=TDZfW|%FgVf! 
zdb%{#of1hd%+lyAEE6#j^E^147c!WN@=0S5AWMBfau1ZD4^%CR&Jdecwsh3Tf2OpJ zfF4ZVrAYA^5qkF5lrJ?FBYaKnzv}?W*;l{hOKdMbS8MegtmtSJI#7$XJ=}ZEwm*oP zP(Ir!S>duNQZ7fX8&$ASa<(1Il{5p_!atVDgTIFrwd9P>ks>g-QN%s+Uq!9^GFrCD z*qV`RDrIh=F-OqI!~!_ZVp{iU@o?PEpEEU&ENU{Qu82o&`j$y#3 zecvMGKK$rXWS=?zHs#HsjGu&Eb!x~O8VX%8`T`c&GEKLSF;d<`-(nB>w zG{2>srs#vo3{CX{A1qb#Is@WAjQ&jLfT4AJtTPwI&?TE72qObWqVdS|4% z9TUl%*G8_Z?WX6)~$2+CA}$4qc=|CcmDG{w+=3yvbtP< zNKIFKD_BO@0o_Z`wb#}|b^l9^`C??wgt5p&Z~S~(Y%PmF8gur;|ruBcXE zh}frnCfb_P`%H~twIrk=`Mr5Ef73D=*IqOcF&>uXbfm7WI?|9@Q+yJ9sj8?8h!#yK z*Z1DAXt88umQMK<*AdBr{-E7)MNBrjGK;{PY&g z#|?{4-i#NxA?S|gVIL%tW7{c1N~OI+3ZCpDf?Q;ysACk62Lk2mLFdT5}HcyciZQ*9Ek z=sUJ(0qMy$NU`D*UA9Ikfy&OIh3^9B!xAkrJ){7GQne|LY-}M98l{0WN4h_Wl@%cG z4ce>SKLTO2KhRi2^%M}$tBAmazEM4ht8jo1jR(}(RzboGJuRTMCAUwhw_k__v1k5R z_i(Y5MmmzDlU}l;{L|k}Zr$%;^GV^Upa9@_97d;b5tuoRx`p3&e*nK>%gJ@u{v7nMgIf784VrHfVB*oB>K>DzpChPyHHiPtubc5;H!6tsj^4lwiLAJfR_q zS$!9vd%r$kyb8qBy9|2zV#q26Fl)>wsmI|R+7v%KX4`kjE~ACH;!Mf*YcRV>J-Fv%mQs($OAs?}rGJT`6SnX+C^4?R)C6Q&L3L z`qdcO!t3I|RYF#)SVffE{?{|B2uFC(er0-koeu{01oRT8A*;br0(>D1e6~(m;raSt zNgRv2$<|Nu0-5990(-RE-zG2_zl+Z%g5ZW79>)D~^TOKNaUT1-LF&GAnlxWQ1)try zY)>7J&90GxMGnli^?TQ2Z^9!T2STXR`*`}(<+NVeCCsEHUmbcJ3)Z(@KU$_%G)tC> zOD27qV&+Z0+Ul|{dEe(KWVGMrnA|VciHp|VsFmm6ILvo@-$V?i@~O5FKg8iOt1;_p zJB-ZC%T_d@uRS?ms^Mbl{h$*@trw{uLpHI)GktO1(`wosyYyTzk_U9lc;r2edcuh~ z%{)h8R+<#ORubFjvH&}rj)$)|F=F@Im3Fb^TdRpHMFzbbr+jK7s6~+N;$@Nd$$7L) zhQ($em>y({khZ?W{m%Bz;Y2PQE!o7u^QD0L8^85_yiPw5zVRiLB}`wxe(eO#Ox%y6 ziwos1ec%uS1EwXXRFWrEU*qv04dlTG=m9rt>Zd%KMvC!x&9i6#J^44Op9W)wI zE2f_1w2CybQneg4vNUbHu>UoWFr_&4mL*nPdEq696Z(%efNH@-n5iBT^SgkA{*<2~udF%!XsU3gWxxSbz znUI6mbbii`1Wka-s%`(VcR(K!@+>M#7j~CwBV;pVHy?g#PcOiMw3qbXS^zBb{#eb$ z`(^bz&vv0h8<{_H&BHtGEZ24J&9w=8py~1ALg?*ocjuj7872cEwE@xv0|RydhX~uP zI9iqgp-?_|g`Q&jxkB8=`2G3Aj+lfLu`_eWERO5$fcgk$qQQKcBeT0jPG=?1ZAFk17gjjnxn&h_HO zLD^~$X5x~c9wVyngZxuErM)`vmPA8&JQZXNFTe7ERGs0HREQ|{6x%lh;qiH6!2V$p z?7w%Bk`USUL)NUQ%n=XX$vQrY!JtFEFz%pVP)M>W?09k2)uUbkwZRsugNI6IB|5&FxnINl=bgP&i-0sOH7MYpNhaXeZU zMSgao#lJTKxWaU< z_|~uBA5`Yi{XsXuU*u&!P#>MY0v$DRg^FoAHV8e*yF;m^jeZZl9%p0H&*Sd^!WIY0 zAKpk9tiIS&rd_r3Si+z96_B{PD2hxb@w@KCn`8h(9xI7FB8{#wYWKZ%=&=6j)xqdo z@mp{bzpL&;JDkZt5^GW>L?X)U;ZR~e}6Wie9}jk_4p5FPzUgL+eZ5|D>GFAEZmV>B-ldg~)6z_}#_I0NIT|rOzAcYjywW*3Up0&~Y z)@qAQw~}lTy_x!@IMY7$tzs7c8n>!;8uIqjCjBRk%y(_XOxxoY?<{+Yl0zC z`i|^uR{5ezioXO^yI&{~{^{bB)|f-ix=2b@o=)=da_hH5Y*buQtJBu_F4#_T^5HInZw2vfT$v3R6Xu}XZ{{{-(iD5SiIh&yIo9$zN4cAGQ2 zIeU8ANV2ih&~`#px8-CBnPl~-(ACC;T0d>rq4zRoslWCk(^-A>t9gsVcA6>c>6nks zg}YsB+6C5Jrn`mc)h9DQcIF%Nt9NE-V)-3>J!)opVxPN{u`QP?-B}c9*Z(=`w79FQ z^ecZwyDII&G&2=klj@oo>jqA=ty6hAkX5G5I7GMQnpAd{+%kRsVz!90`r4q-nS_+HKG;S*w^FUX2qi&F(}P@x3EAQlghgu zJXY;AOLa~o$=w&t7Uy}{#L_BsCAYXgr*(Zs&bZBCsun}fQ(!SOlYf|v9m}VvTYo|2 ze)@f5n&bE2O$dkb@*WWR5@~fDjiP=wMq5V2+ln*-R20)k5!JnrP2{u(je-w0L zZWk}+MII4&Y!>=hzssYQ$E>W{T*Fu~=GJ^Glw@@yMaCjg4#u1suyF!h~S2 zuXqcxL76&h%r+U&&4Ki_fp0`2!UNvI>(5@yn;-XK);!Cs>~kyL;Rxbg$Q``TM4l1R znvP@&iYdaq1m5U;l+9nN*=>KEVp3Nf!&5oyewhOc&H0%M;$!59h0Yfeoub4LDiLDs zAWw%H^PyNE1=;ed!%N>^Y}JFTP>qw!uyL}z#LIR#_|-`C>)B&OM}(gtGGEepo%tev z>{eDHi%epW@@-i~kqMT%etoZ1CQR;g7Gpv-7>4`i*)8|-U}*SIz_{GlA>irYtrXOh!h-_XA+2| z*NiHh;sLdoxR3jp#M|#RGHvc-Iqi?Dkt|-nzTmW-ud@rgg&xiXQwT+qKVz4AM+Z{ zakTi;dksYhTfVoOXJZL2y@o`ohMyRDAaA!~R#mq!-64aFW}Uem={DC*QQ!OE{tMw< zy64Cj>Xeu61zp5~`h^Os-iVDsxo@5P8aG*NpgeMXVv~{Y_$BtFt2ZZ>&Ir!x-YLz|H%n6&cD!2i z!{NdlokWp$f&YrOxU~GSn%TKGuf`*CH7o2dCj+;<)sE*zM5B_Xpwjj1%AF^}OBQe} z)Nj0`_)Q^OP3?5EVaK3$pjx6<|Ngd(p)GuGeq!}mn*b!cRiyJ<0SIO6E z%RG(vo>S|s#kM~f?Kh9tuLUDO%b6e)0oN(PI;b}Z)?P)0I(+nzK 
zK&!s8zzXnPhf$B?8+lJEZ-H5XlAT1G+49}g!K;YR0+}kwZ!V4(g-Cnz?s zwUO0O7HeV=bfhgWe9*(RAGY&ol!p#!rug|*U&cQ5yXQMlRuQVwuaT?A-Qc7B!~B}YM$>^SE#GGp(^Yi&zT3JxY`4b)wC0k7;bsAO_m<}fl5?#Je7eEiS$gwj0EV?J^T@)nbX6`$-NC!S#F5`^lag+Pt5Se9z^p?G4n= zoc+7G&XchSGR?Zwzlw8m`%e56asi+BQ%GTEtyH%u)@-~xU?VZ0G6IOkg~pY!1GxQ- zl&5EEpPoE@z4Fug^=k_S*v8YoJsfjX^}813C>m>wR1%C%A!PgPY_w<4?|pB6VtOk3RvFD@>Y zXBi5|<8SLm!?SpF%UBc!N=r*k-T9zMoiTQH_F(vQks+MFw+0G@R=DpxvqqFKZ6Zo< z4h^(V?Ck6?VPgl$MUqU=?qCBvYjj;NoQEhP7JZ0}quC%@OeNYYdi8>;e88kl^c3`HRcDPUJ1Y<#%0h?`QRv z<_{(c)PxCufQBU*Os~z)CSyin-0>Ix35q?RKyx~RFK`23gbKyR^%>f}-ts+$)z%y{ zwXUz<7Z5C15TVffpz!oq4)sA)N4e28Dz0V~(VBI%%tKqEXTj+1=x8pB>o3+bn;7O3 zkG?Q{7jZf1;B+U}PACK`vS=8VF2nL!8idvo7XxPd7RUSQV&-%jVJ+E5shfWPn z^X@3u&qg_=8{lI^U(WE{tGq7^u7;-!X*6cyV-aksj_SEi>;#?6Bc~fY!cBRXhpN=w zuF|aBw_V@ne+An8wKyhtyL^x=r+YK{mA+D47882i$;A>lC9ShG!`eqk9^*R#-|5iu z;@Ax%?!-6oaZ&e%CkqRV=yuN95*YNHy^~7rL{*b}pvl*177DJ69cV;t0JtzoSZu=L zZt6|HHZTBvPtU<_mz+bgV|0!R4Cug5>6DaFbWlK^CI?wnzA_6M@w(^StbkS0aR~bG zo%DWlztwmU59>PR3Oj^`2_*JIuC)mX!W|3Sa*b~4VAl1jB)kz)nAQuM+ujDG2W=aMh(a+%o^&pUj$Iki@JD&;c+a~;W2 zrcH?r3c?P*A)Ycl1BMlsnN1Tff+s3WTd^7rbFU=N5W2##Nm9D86j z^ugQL#TazQxFHdD^Ykh?Mcse+!Tb1a#@@bq{n}&QVFo?I8eQ9<<1Mvr4*+S0f$lNa zw-w_e@2mkXKvaG{*G#=v{6P13@DTUeSN0X7*xOdocOrvq0GgAB8Xg=x{P-vxD=u_6 z>gnwM#%3a5Q!{F0-PYxJzI+=W0w$Y&mUpJ){Gb)>Km5HmEC)=9!dEC**La4wf*tps zG=H?@O35&tf?fg-n*Kzn=4!R^c1bId!{SFLE00<@JCz>6`p~eyi~7!kCs%$9l|f!$ zxzybF-P?YZ_s#X>cnGK6ZvwQ-!O&GFTA>s$p0`JW)3=%V0g4a(IibeV#M$EO<7+yl za)m;{wh1HY{#&1OoDL>`o@2~x(qT*E0B|-tbvHki6wi&pgb;%fi;+l0_)j@uRD*@q z*WOg+eELoK@YArw@uKVXMPOJ5elKeJDgCRkQP{i|p1m&`vRfl3dx ztp~YzctS4wZFe>leoB#+-MI6?SKM`4F2w?b*2r4$?I^E{ zZUUI&iniVzXp~H&3P?CKRH&8yN0iN&#!ej5v<09J{Pz_#PD|HDiwyNMtLw8nBX8Vy z0nCpZ-MFz@nUhwf76S}eN)Ay3H^ipDZw{JJPJ5pnZa!zzKiZm6ZV-p@R$!J~`E(_A zc=|N}g$YFjH!ILs?1nOMFRufruU~n&_oEfyos47;#JdcAh~V7B{_4lTAIqbhcPA37 z!b+z$z-bKeuJE&S&cM5f@hnqUEC{Gj5?n?8G*`)L6!&=+L_hsZZt68HVG*Mr9jH41 zZjV(2Et*6l#m3rNj4Y5}8nyi`M1nez$i&M*EfawtzhI?-4aZe7y4)i`i@H|a`FPl70RC1frR&>c)H zP%k)eh-5DrZbmRM`5>FG`HaKx6#u7ClXiO^A2TMDq>t=50DES#C?MS*b(CxKhBA~@ zL5a(>A8p#sPezY!63#WfQDQD3ScArQhkEv6m48rz#xBU*)e3aISc|Q(v)}PM3Xy z`NyN~wn*e@)@97ci{$$Q^>?DW&p#xN?-eV%494y)_f~8jU9!pBjM`Y9r_y{Z;|+bK zxcyK8O964hf7?MzW8m&*A4HX_v&2_Bm$E0D2X}d4>^dPe^U#?WD_6}7*{Jxk48JURCQ&o9Xo#_8@p3nZ5am<4z3_+(8Rz6?-Cln=Eu^NoPl~&;{qd9_p~kNrjh8!Cdndxi0pHRcGJ=GcGZPZZobN9Hk&;D_7*LAED&dGXq3NC!cDY z@%}B?Mp5wChhfi#pSz0sb?%Pu4<8s1rTVI~!P8DywX#Nh)7f2PzCN|Pdn2<#m{^q0 z?Jzr{OW%LeDD(@>S<8KYn4Z2mW^-%uq;AN~2>c9~m?d|jp-FZM_&R~RzJ5F;_e+&5 z#zCETK(Z>wXw64l5t+5ztz?$OAxh0F%64rrR}7@5q|hg_U>2mRFyfo_I3k%n3fyw2$TyRDRdJeq z>B(Arw4}Ajl|iMcE3S*Nlm)AjneKTJ9}>($rLFX;+px0R%=uLh;s!Ixj1U~2JI!A9 z)#KT{I7#Q18QIJ&+q#3l{;Ek{RbI4#W20F#o6R#YyHz( zrrG2Uyo{R0+nE+AP8Q$G$~S7mv^Bh*Dm(4YOA!(}h1gda0dQ?}lJ!bJKS*J9*vSERA1MyjvQTalWP0qEU?vF~SuK2vb-kh?j-?pot zOIg_`!ht7|e)suDCD8m05ka~m`s4GhxWH)#VBkPZ5 zZ9;WoG4W*(Ye9A)GBVPf!cUo>l4|sgm+VoA%S~F%tZNVYJaavoxs>UUP8U7-9uu(% zn84FQp->x7Z6f%7(lGM4Og2-fqR~JyH@L63wpnI$72Ac+__;EP0B7$d;_Hf`!i=xg z`*BR>Xa*eVF9iWbXh}?4ly_a#L7~d7D1FYOESNJXR0{jyx^;doL-J|E z!mDBA+MqoGxi*aP?y36bpoGSU3`ER11Ait<9*=$`_Eu?&Oc>1bnIh#&DPn5DHeQ@IUW--+p;GocY z7pM?OKT>Ql@BDsFrs=@_lO60VhQd^vISOOS{)?PG{50FIrkfdfF*YlX1dj zZ)a$8aM{1Hhm74T*>2ub*w$K@_mSUz$a$*KqEbm8$H4P0KS$P3E_(K5E`u3+iwc#Z ztwH11-E*H6d*}**wqBh2MZO}3f8pgEtkGt>srLMfAtAG=o7ueh@wD@v>sFFt5|M9u zs`5v{+1cc~bD3`00+1goPo9=Jn=6&~VED>@Ke@cLBrj?_68k_c*UYQ+MR>;3wFc&L z&WLOYjp_f7V{yXs*TEC%VwA~~$K)^^&2Hnj{MzJW_>LmfK)sL8Fz?f(Tk{|-^=3>X z@)-zscup0Wm8tGM+-W0&UY=uT6=#mdV-*?8rZ zG~c01T5FU46?N+)4mFj&qa)yAbU-3;LC<*#23iK-8n9=?sd(!LKD 
zI{BU$5ctx>*Q`mM=QQWsPN2p`JF{LPh2FT~m12mdA=l_vwRMZQbv^^LS%I)dCp8x3 zyOhD6F>hlYetm&G@RngaB?eP%(sDhEd`!|m?3SnKv2h>g%yHhS9;{Actuoq|R8>Kq z+aM#y?pnq*D1OE@PS?G(L$|B~`KLCp;73oNto8NCE}jVGG&OvWob;d)aO9{ZrG`EE z^nAWUCp9Hsjzvaf?OCW1(d-IO{1f$S=0^>$4Z<7kF{U&g_fu* zpM68swWUvB)OA?7AXi^+Aa`IR(I_f9uK?z1 zhe4{rk*ufcdG;BPsvnfHfn$WGbyf{md7AAJu z_Hs|bXQLoh@kc`}&d8F|#={Ge<4Ar|sJ-ul&SYmp%ye2k*~aq*5m~Fv__27sB<{0A z+KpNJuP^m;g^0$LyU;zdvi1y@Ka{B;PUIAP4CAYj(>sZk4fV-nfuto0JDM?Vm)~C> z8PH31weJHD(|k(DFT7~rMz$Xixw7R((ut1EO>2Dp@`7zF=dld@F7X6y|CG-w@ z(;)g~$n41VI4*7I0&i^Aq-Q9PdS2f8HU~`gUeU|*(<+;B9<~kgZ+w%ChlTVg=qO8EQpe3 zTcX;b%UNbaBVW|4Zu)l;>7|Vhi*;mji+)9aoy*E;X-W~ z*VQ?aPDLKQGa+>$9frG|GkmZ&ihpx``Ve{2!(XIt(+;8>Z*F*Rhy1zEXqh@*1Gu?d2sCc;f4%H zK^s&J$dnK$pmN)xeGjNl zd04vWgT+UTW*A0@Q9E~#J_Kw$;zg<%*O3-oEM}$(7*kiKmW0K7ym-MNiWOcV5ad$* zT;ayMxbb^Fl&0|<%nYckI{p0WY_!QE>GbDZqpUsFE|lq`i(E^!}Nx)5&`bExJg3;kFg#{X=sUlty5;QLkn&Z zR;tJy&yOQ%SPaQ!gnQzTQk_AQ!AKW@UeK6m*p=xep|ZjK@Bv z-#i?x?%7Oh=G6=0AD8j6?07Bl1#>p>N7iV8zWCDPZ&Akht*`f1`l0EUv6BZaU-f@N z)28ex?A#`Gu`hz+j*|~A%@4;@8Vj8-O{@0nsanJg0}K)GhM4}n9|;H}F^FA*xkfeA zV?RRfYQ~~u!Y03wkB_cJS22vPV)kW^Jq>TGr=r+wtlCBcE&C%?<|XNku{DqZs6M1~2R@JR@)Qm@8-9+wNbP36K&TGsU z;E1GN#Gc+>%VB?HIQC9U-S82vSm_g5 zQO^o^jpV_ui)j$g#uLoc4zZ(Y=b8Hpg|s&3?KFcIy%(@OH<9wC^TRf&LDkOiLp^FD z3}QJWjQlaXifRl#WR}<*FEUIT5lGKLFZm}W2Ez3n3uBNNda)a1lA@rgnUKxg_DWrp z)jjrEUTn|2##jL|Ffv_n+X*)lvVJ%UdXnIBHN6SaGlI}ov6i>Cw&6Tw~x2mQf z&ea2X90{96@GXVf5y?T9&^-0w!UBNnx^a{$EX7u6xTH`)Js(FaI1LEMSAlLa*r|tA z<#xkae~92{*B+Jr>N5g)J{d%=mC|T@jD)&bT0%fFGmVd#i8;Qvbh58XyjIS2 zsw9u@mLUY7jHn}C<~#}7v{(WZXiHmeo<&|v*aPRelhg&>@xMqTpv;>z|cJ4 zSnOZ?j>0>*%$`txWom*6t zEaFx~9#am?J8zgO|M17x!78r-ap;9rSMn|uSHoL8bR6lFrZ0|U1$}W;Q-Rs%#R1Ya zk7W>nVJzo=aS@adaAJh8Bgr4VALK4p(A9+5nJZHx zSTV*e2X?mL+tA!>K$D6%{0gb-1#-cxdC6<=O!=LEUJxQ%^@a^-nNk`)(2vXu|1Jd{qLV{2ji<|&T&q{0 z;Cb>%LWwei+_R^R;O6>$tU+=!CTvuIj$Y!-5bYaFp8DO(HkE zt(SVUek*!6GO`e3;Ql!W1E2?zdb|K*?H$(=jRn4EW{G9&7&knNzpg zscRHpZiF*YUjY!8L*uSMyDp*_79JX-j)-CDcw`gt=o3F=2Eaq9NIWkWT(7RC+d}I?blnw(YAhPMKe0l0eDcR<1Vll39_C89W7<)0H$mth` zSoqaK#e}75tf4`*N9S-UgbX=vvedt}kcA69z_4uV_C_d$e@C@jc{%N9={kk#+fnlC zRH%)-!MvrKXeV=RxqSAhDngJWz_5+v{=et&lekAJfnZ%MD$CL#{UEq`pLS=`Yb>p= zvNQ?nIcgcXn5JdBM!kXNfr*Ov8`VxP6l9V}TX+_NssG)B1S2&rBQn}iTUWgRw_@vF zQv{FtWWV(jt}%5}ZF;kw9u6Ws4!_}@e|~($l!|Wz|G%22g!+Yw z!~)h7oQH?g8;7lV2hhzIRBgi*AMgN>8iu=1&EE~1bFELFDls2nXGEsIN$~$CR6)ax z4gmuH5gJq3M(f|Rg@kCTigoovw;7738aeTb7_ldYz@TeLRZOqI zAWp^BZVd&OUK@|-%7}EL@E$Ntq;H$bi|gzNtToE8Ar1Gk?#m+D$;Ds)SD$G7#|rhQ zzIVmg0bZh3x-_oMbWLeE4Ik&=`(|&Zs3(}vpAAz9b9*Im9GXLh(vMz)OC`p|u&=)i zq4)gK%wOR(CFhcwDMkX-ZE0`)&;DwYCDz$R(l?yWLQKvFnM+(wtl`xI4*7a%58put zU-b(Te__SE_-PXvo+vm9Y^427uSz#6#da7I@>%_rw?zLD69-I<9?17E@1T2q0_t#{ zZ_r3wwtMKK?_n#0->TfU#o$G2fdI&aD75o@S z-NFN@dKK~?Ys7L5{l*Jf7sGVq^*tqi3|R29*@f-R(;nWkd$O@|RZFD|K!O!8DSgjA zhtpi9xYnNQ*?3IUKBeKSFeQejIWBPgbsdaRHn)H?W1MryU*O9A}q6^#4KJSCClzNW8l)D5c?8uAgES6q@4LmjJG zm2m#iu{J7nT=-;VkReHx0g!fZoDWR2|d> zl;QX5_D7IOAlhFNky>Hlz@;Y(zb9-wAE<|Xo;&%<(>SjrQvRs$yB=M>KN_Z!v_y4N zgZTOVv)zqtH_k)+QOKd}7s zlfm@Bbs!OJ3tzx4rz^iF?gd1ePvI?Q!vd4Fs^5$jY27`c>^LN4hdA2QEP|}Vn=C9vp?GA92;*(fFEkdcLNZ5z!66cR#T36KNeHe|4(uDtZVEOo^C&gHQboSg#5|I_+$)i?$;LVaL?qmm+La zD&2a{1v|q~MkI4cUaP#jYhvSVH8@-ibIC@exZO8_xvBKbp6R~g z_$1xABG|D~zD9$ALEruSm`8{gan}d_3JkF`Mss?RN%N~JKtJ+2S&A=|38TE5s(()f zgeh;~G_`z8NaoJ>Db6Y;N2$J7A;C#D+e&A>ls?rhoLDbEiKk89)%tc^2KAo!>?DRJ z3RIAU?Z=V$C88|OX0tQS#*1oOtDn~*?%E>m2*mHFmxQ1&It6zt2E!*h&wKQK@QTKs+FTf~KHE9aiAdI@ zY<5oGnPK75LW;|0``!>cySV&LPk>=X0^&&?(zvE#5#Q=K8{ew94gSepKHBNxQ~~?y z8Q%*l_~miW`GqW_r2Y1+FPUpdFD-_n>ik!!Jsp4Vpz|XxU$LWn+mAPq8BVhLCr+{x 
zJDssrWZmy^<@Ti(Uvdj2Gq&(#V|x4vVxy!mw(E~l_#o4WgP~$$@tA47O6xedgeedN z=t{ENOp@`S{!MGimZ?ZKoND2+L-k7Q`uC$vPWBH<14WNz*YKx?A>2v$j zQbeL=SFe#f4BAD9*w-Jt-<+-;Zn!wL$p^n%6_QD1s+}fz6?`_}M%fH6uep+6td5GD z7KnDrB*QHJqzqQT;yK>?6HeV_`C&2NI2O;T+1-Z%=(TdyGGixf=Ne}EMYbR6*Epp0 zrQTqVHq2J(4V(eJ!Q7w;d5v_h1u|7-r&Z^^6T2dIar|5a1Pg^=#tm@C)pTU>HR{F# zVrKi`vZ(}$+QZdA5 z-X9y88E;eufTgLA7AaN(5BJ{Y?__1IQ0(ygoVoy^OEvV~Y3C8uG zK*-Itxx+yaxHOhE& zC;YlX#z|o625%V!-M;}AM|28QAoa6dfAgt-HQS;wtY(j=3e?9|Os zW!6P1uM@AFd32(gADE#$dsHF$d*S?Ym5@*W|cN7iop{Ak98HeaO*ILEh~ z0b1r{VEGL=PCW5pvn`U}`PuKi>!J3}+$Z>F?+4OiE>5wV zt{IfCudqhQBYci#+_Xxc$sD}|GeZmt1xH&~ z6c5w*cMtZNdFqw-1wv0(TsYoX&#YnT%ZrT`t0&h2WKURLPVqLdT5XfXIs+iS-k>-6 zm4Oh;VS&FjL&%K%P62ThB?MaY4)q@AW`VM>D~TiM15i>&gj3(|0Fdgt7Kydeo3aru z#x}EcwaOpvG=pTmA$FTn6sN5RqFpj*vD01~3AEPZsJsChU^w=1=Z|k^H5e?FRhKboU#M4E=D64~vCtl7NDnW#CK4jn1qUjl^YPvbgSot; ztX1LBeDzcuc<&KMwYYyt&n6qA&4w)EML3n}A5S^N>{qgA@F&E4rt}H)`Wym3r3WCF zLNID(3j-UMKeL-NS(9(|^k{qAJXi77{(ypVp?Y@2jMsMk)_Vj%B{K?`1zWblcpFLp zKTsdovKctcJ00|@f!Q0gLJPq+%UF=0AX1g8vHyDgz3ynC;>J+O^~D#*|3}A9BfAbj z%6zu$9%^qcgPIa}X{GDN7>Duj5=nM4G~ui~eJ{m(S3dFHi*8i0e;e2U`COaq23MN` zdsGRCZ_E1{gq;Lx#F6hta*ed4(!qK$6L;kKmcbA z$FP0{@`K=V{^QQgBqap$ij3+fwF>o?(ve^yNc=J_?7>2$_=UruV_FrM)>E5k=IU$% z=5$4`XK959%MgXmfawa7GA+)!jD96O7`GYt0_cPO#Gp~mK-`n=XJG#g%AZL{q!$5e zS_PRAruvA2pW@Y&{*SgM!FU^!WnF{I9>g14@p^#v44PW)OxIFX2wA`j)-VeMi8EJ) zJLPW?j^YRM%!UGYP=Shn0Miq|p3twd{m8+M5SS^*r*(*1R}91V*PT->O_w2ZPL98w z+G%Z|%YN_5pp6=kS=M&`GMwL*{4Th;w*r3O{^C+N$r5^lu7K6y!gmS8fyz_%;YU^s zw?>!TWcP(j9ulFfW#8QhEdL51D$x4%H^eV6NZo)8Iv;Gfq`k?Q zsjwO@0EN4!De_y_BVrNLr32seSX+1uSk|zA{RgM}U>A^CpE1JU8JL^{hS7APLG4hU zRv}GIveTni#q46)ktJaRJ|W|}OvnXSiw{)4Ll#o-Kbbcvoqq+#{sxMl?B^P`(gQ~} zKRk~pRphB{!y=@!47?VYU4#JO^yV8{NQwBxaqBLq;1Tz-ZeihHH#w~bqs(ycu1F`a zwD7!xaveI$pi+wu=)YGG11l_SGjaXCYy*i;s;Oc;S1Z6Y859|%dO_6d5CI0LZ2}~> z6etk_6eCF~Nx?|2QXps(e|)d$T8s*5B6y3XN4%5evIv<(!Np(6%s=4JtzYj$P8gJh z><}f*Gr{Xqpag*&Qsk{Xn)96-b>af>W5jqwNzdJuj13S6yXeylUIlS*Xng;#m&X_n z-t1CzxcSFl?;r|+IYb848wD1}9{7^7N}KTnH8Y^qc7R6XN^0L3K@XkvSg9o+ckUPJ ze02Sxd&&9nV&oVYql13@=lT7oAsl6?**s_drtHxQ+t!RA%@y^kGi5G85#d;KD~J@FwH z0O{<+?YahDs+Q%>?cV_mqa6m0h{|7YfsW&j3daIt!x&lmSe8fL-`0HwCK8l4US3QU z)gS7?m)`3e7>e+PQ~=7~bi?riUcP=cB(&u0aLOLVRoeOpcyKu|Mx8- z3|$6gv@#Vav~Lu-VF*wE6b4R8t}KUPmfu(MoqkekEb=z2+Vv zmS*cMmVT?07`WIUr5OO^LMa(=GVZ41U}Q(EBAZLt26$K2<>q}(R;3DxnlTkICc{VM z=xk|Kpu0U$z&-`URs*{wKtPkez8H^;WS^6jxT+RY&i3NzM$HST){3FK)hi42M>UXaRJdXo{_y+)T(6*`fXD9lobAN6&*@jJ-mwCUetXVrgWfdWR;1hW zcQix^?yK}u|A#-u=r-w>)u9l;EgLq(0&J>Fun#Zzgiw)~r4Pyn}SN!Kx zD>SF5rV8SUCqoPI)Mk6EeVffhQcm7zitWCQ<=wnE6VR`;3klZ+hG(|6C4y6}`PAUS za7xnX5T*T3!Gc7mtq_4J)hWu!Zo9Cd&7<5bMZ3ei@y+n|`W9iRmWb4U)92qlvR1U~_656}-BowE?2fdip%eG>rhn7I>Y?Cjo!+g|U2X z-i?X!Y%c|^bd+~&wg>kRMK)DjX`rCUCj{(bhRQL$yUe0OQ7v%S>0!k*a2 zRTkT+V-?-!hh4i*(VJ_83levc6kMC1Ukpqf7oTNKZI_tOVF%3E98RVOwf?`|k{_l4 zKuZln^8$1oXayZU#aO@ywC)vwQGO~#xofv{GGUyoTG5KzicYy5hMpXyAi8oD=*W9) z)Zl%B1cMbT)K!Gq>+^9$>Zd?f$kyu(@O4-)vzl_c%ElaE<(`3#(K{|XSlqiP`AKQ; z{#ban=_OKXW|&yN%je~I;R9*5!iAq~+|pespRKYGZ|o5>-m40$+Hl*UWBbaoH&Ql~ z>f1^7zJx2GI9BxdBA4Uqz{9l6%A&mhpAy>upC7H|&~2;Fn{mak$^OPFy|8t|E0Wt9 zXN%kP{EOLl4`L7oH#JdwmN#RQVCdtu;}Ua~KuJ`*f=Q z@}NT%IVvMMt+zopjDT*Q^xg8_`@lW3pF3mV<{-h4Vmk#buKvhnTpH=?-OPf7Zcada zx6gY*mw2D|RQFetIBz>h-c0~iQRT_ESARKxE(PGQ3PgHfzcp5y>b3VXS5LDXQ3UHDSF3ky zKbGv(k4Rp)8nFYBg9e)03OIqUD*%B={P8M%>C`naxSb=p%b ze`Zqj>i1?YE{n3Hp3UZmc}@nQGYs0|z%7T&f9KtDTBrIG3b=4{{c?2l<38Mg_nRrj*wh>~yo145?3qs^?3JfhG}WBnvVw~B9b;=7a* zuftHyLtTltK4s4|fGb~_6CJxnJfqES?>Kd~J$InYnh8SNy z6lDz;)WgOzrQ#lXd*!!o{H?wIIikE=Qy3hCMtUITuxoyGIfIrE&=VfU+AfLU%by%f 
z&}tBWQf}v$zmf$#-5G}o+aJ}b3pseS#b6rLZE@b?A>05% z18?sS0K-Qi0F3&mIHsjT@wZ<_$7xIOYfFZ7j&A4m$%fJS>)^mUc|LAXj(8}v#s(%k z{&X#RKZOV+(rs4@2Nh~@+l$*#l)^(YV;rs{0yD%@w-`IhCXMx4wmziCVXWmuVl!Xv zA{^WEbzA-X^lt=vZX7o2s@ysSqkn83+BRNR_RXHcQWgE}G|%p`9k9@AwyHQef^a}5 zru%53*w+?cnqby4((tsFdqS`;%Uk?GRKHiTCYO)Od}>gwh3tdNt_Gq4)p)u61_#(V zeEUWsT}_*$z^E3Ft0&`THB>^Y|1m;l>$$|hpRgkbghVnF0IGmM%JcMtvF^gt64@xNdkn?jDb|hTHC~_$zY5rVjmF_V%>a^C7S5i_GoghtBL7 zuR>4W^A^tSXYsIVwWt^mA5|gsY;KA*wrbfLnxI0IgKj?8wMJM}dQ)29mFjxHP1qj8 zpOzN+nintBIg9tDj-Zj-SzwsK^NZ=$vW_9=-L)>coD1a?nM8aK|^C*)OT(?Ap53Zcg0>GP!=mjxFrxIQtA z(%`ZwBip@c@*{TV$v3;LYVlc*fa|V`MEn{rmxI zzawW!R5kv4)m^LKku<8dDEce5)}n7{ydvI8NPGRcDk_gsTo-(XoKNk$3Cc^?Zzny=dLnX)UCs^$h`=HoX#iaB*< zDG^h3xcuLm2b%vXz_?{1UGKdOC{cW4Xh!hsT8VAI-gLIn7XN)t5MY8~djz57YakQ0 zmHGe*Qb6BWSis7X5Zai3DHMY@!GjDNX4i2|*{H}U2U{ja0Q$$^yvrulCX{4!Ih-K5Kc2Svxa?f`lMHS`_d?+Yye*Zn`s zrwd8oaQy=8{aHaJVdVs-!oQ9Z6$Mttt+i94`9~_GBf*271u>>90fzUG%r?JsY#r2% zd17Dxmr>Ft1P+&E|0GG7$dt;0w#DW5QJBEWq)t;G9{jdAATbJN01xi#u(itl7m6&s z1uo)%@Xo))tzZ;*BRKa-Fp=T^KsHuORo=V<+=r5|0`y;df2IsB;ve7>wQJ%Pa5%?? zM~?iWV5fFC!@r71{s92LQT$@}eCXE={4$H4;K9M?$@n*!kkO!e_KSG|Y$LdcF_EU@ zf1m3IhgdQe)iXE zF^9(=(BC~U2RnX?X2&d=PJD3QwV<*0{yy8maTMhPzR|X78rB{nn#H1 z=Kv$l+jH9OW7G%bS$bY2rL(w(Qw&-PiUNSn+@`cr<+P*#{~17H3)B^EgMuq&r{1j8 zOo*IdL)Y;qhd_{)f4d#{3IINDVC&gd^F*Vw1pr;k4pxUh-Q_T%tZTn+S~;QCt;vE0 zc$PH~hvF_drvJ&~__LG2>J@-t+65})H4u6kfqe{DIn906LvZ#c$2ErEB8AoqD38cP z2myZl9MCx`5{x|i07y#}LU*{a=Z`~O0c;y#H5F$kD;cFqNnCuM`<58SAas!R?)8>~ z34Ikaz=r4uX&9JNeSBdCDh&0YwuBlKIr=_89ocupfG%s)9;KPp<+1un+_cnQZey1l*Ks$dUk%3jfDP`Kuj7f>`owXt(WIJ5Z*nsU8&S ze?E}-voj7`OGO{pJ(dDD{KNI)^5vT1;MT~Bej&Yp8UF!dWt9BDTE`7(uUWTgyTbwQ zA=F148uVSh7eB->)nJUG9Z)mAme53Ddt5h3pncHA^S8|rGuO<-XbS^pY|2}oe>3tb z4J!e2BEts^>LIum8y>PU4eZC2)>4bkV3qD`xd}~UO~9))_r8A(sHs2FE=yIcjcL%( z;L!rb2m_Z%K!4nVdb78mJ8jIhw=1@v{rYrgb6A$97F;a@&wV>FR;W=5N;!NUyJi}v zMnKb*_Ec^CC$QWxpl}#U7n%9aa3A!css=^b=?)e&J0?vJWu*eTyFvdOxXh#kOvFO~ zI!}^m8$^$ULy=?ci3S)J!I!VHE2W?Y_?D2$8YF{l#WVSG(xub+^3NGNR7&37p|J;; z{ToGHO`b8JM>9i8A=v?VUQZ|aw&4Wacrdp)%KOb=0opoMYh!^C*oS2i_p>}H(gE&( zhl0W{p8Da4rz~_;4`v#n>4r*x~CIo#4X1tN^u!6QowC{&bHn zh!WY+SuVZNimvLHnCuAfmMN?d!Uq>3nM(|2{4aa?b8~v7ivXfmTs*AYG6hZf4kngC zd~*2$Fr5eGNkZQ7+DxD@%$dQ!oF}!`(1@bzm3Lt^afv-vO% zp_5F~R=}g0l&2Yw)-W3aGh=gBeQGJ3f7TLIIxE{RPaiF90EY(GU3n+lmBBQ7dxf7d zX3?)GAN!vBBH3CcQ}s}w4}1<6V^oF28BDT#XJ0N594W2^AQ^zBE?jR8qdw2cfo1=p zJ?Hm3$Ue1C&VYJok|+nBSF@lybrG_rY8Q%X7;Zvr6|)DkAmhh^3#37zZgrmgT`S{G zjFQoQ4S?$PV22bGq+bT?iYg2Yy!lM#!p6>Ms^D&BH^z~n*gelD`j>iVn$#4g^wItG zJIo6v(MeyuoU|_3*9+>A^bB?y9hxoU`OB@q?uW4$)&?8tY&2^1j0icDb5i9Jq-O{I zu|qEu=KQW&V`7Fxj#7(T3w*Ri;5%Bk3J=A3XJK{>Z7>Z`Q-Ng+<_APcRRyFBzS3C* z3v&G(Q>gOZH=U2SJGHQsv-DQrK=Ww(>&HiX2T3bpH&0{U2fTT( zT>#0x5G!p0&Yj=F%iD1b3PY3LKfsVJBrb`sgS)Yu=VeVFX##l`Z-{WM*H&Xs z2tLPv+vb!xxPQtiS~kH1M;@z3N%yca~SW?=_ieTONG1|sZ&lrGY znDV0w#F$t+v<3N8Bj#(&j}tJ13?A!3+=v~Jz8HSXu%orFXy|$}+`TM}DVAs?7Y8YY zLGZa`w=E+8Z%fYmaMEJt%-4@$3YN^ZF^3O#yh|eeCRZgSt}{+xq%9p3w%DBz63WAY zpz!II=NI=Oepii z^+FqzmFU#AL)%F=r`jGujaVJ2ZXsEv;~cdQ%qZ((My-+ap;}{}w@8^P=FWuP6!|c6 zGRp(47w={ycAiwzVC10si&ANkVkSgdy&4ceE!H_GXeTr%7t3TU&X>1eu3H~7*hr!n zHU_KM|8bbp@`t3HU$vF37N`XpR*b2|#vm9O)?09+6QiOKkc$DmiVBUXtj_5i z`Ir*#-LD z`(+MDtqv znNn#1xre;5%YkVC57R1HH5#U5n1L~j*BD7_p7zRzDB_v>7^gCPBCS2_BvDN~d_>Ql zIl!!gl7FgbUub-=oiZ3QQQR88N?=qD-7)f5B0DNLls8o#TIF6ynaF9?+@Ar&fV7I3 z9L=VZZp;L~bE<&jv8EkpH49M6ZBI6Nvye6pqqT@-LZYhWVx4^iJw+9;s%oy)y; zW-0L4fu<{;i7)radH`5^QNK`tZNO4VyB_|`*EIQ}IQ(=H@XK`FmXj=m?myHe`A)jT zzz3{u6$|67i~RC-##$<3)HG(q=Q<4_5qh%EFBC(W875?h3L|US)6}MF4%MPxA!Dt5 zza(7*&IcO4V5u=f9)8Mb9P3btekX)O{6;U*tlCVcqCzoJ@%i5H(VM^-jDFiJC31Fe 
z+sjJCW`b2aTV+K*BYRW@{w3GTW-B=(OJO{yEx&hsOj+G!>JeV^)0+x@T85%m_y3^K zt(PB4?P(~A4MuJGC!a3f(%FCNzr8Wki>?UwQ&Q z!S`#-m4%>n0cq{$P=OJJwQ_~McW)~+uoD&|tvi^B2l@NBq`l-RnAs3EznF`VF31>S zQ%`NzLD-XIUYlj3BlS0+n?((EJAN0?Q;SwbW`-V*7+D#}^GGW)v-Hb;@-SKHDqhY$ z(o|H+enNDYOMpQO0=zzuwW8sAR&z)XA^1l~eyDAXcg*>?$7eaz zT6O+VF)Sr{g3*diu*v7_@Pa*D&cWw$1TUAmcn!^>r@GSBFMUEs2sBuo6}9d9ojC?IkZDZ)qiB zI=>NbZlkvV^P^p2aD}NX1sDW2dtRs8QrEC+l8WffLA6 zUIN6veGd&^?)1_}HVAnjlZ0s2^$oUKkBkyLi#c6?1|mu{>H!BY_suCjDbL%i5tz=C zW>sQ@v%203PlMMEk5wG*c)O%Pmo(ujEey&GadErF)y#sALh`whr|~)=Jn|gl38PI6 zT@1%H#!WiXSlWfF$a!LSD^A>q3XO2#&GNx`SluNnbIFnGA#4ve%)rTOuLo?OM!8SD zi|(XBLY(|2!`5%EL+_1eX^vWmVIg?;k+}dP1bQc&e@yYH6g*g3#h$tg@uN$pkRJqx z9UuGe;D!BAq=cbi&VckIO2sMQBwJXEq+-e2PRaTd` zCVspde$R^2qhF2OASu)^ee$N?m@<0!;}aL8)HKG+N-7lk9%c$-@rluizO5b*Yu&y^ zj3a%gNIpv5;H1fFT~b0tnOMpu?rDrBA~v=&y!+M@6tk1caH!}P%?n`VmB-?G1PD0p zSUulv?7NP`gr4-McBmv+t=1uU$De7$#tRgikF~ESui09bzv(KT-GjtZNJ2u8QbPD` z%u}l5%yPshay365ym@of*J6c8GwyXu6NuZXQgJbWv)R*$jkLW1GIvz%D-3 zlFAv$zp#KF=UGZ8DVRN4=hjVz;vWcpT?zhA&GK)1e~3KoT7aQjo%rM4azIrYuhGzq z3Xld{`h{X0a}j&7x?ocp;+g#7QqEc(BKIQqTkPy3=GdmqndWbJEMFvf*Fx{Id4AwV zlAq>|4U8%CqDdi4Dio5=YEwBQsuu*=$QWeaecth;(LH2z5hP!?F%luHQ=PdtrG6$2IpdjagHVGjE6Z?Z=WctooK z{;mYs!9*thophiuk=kmS_pb+yU}{Ehs&_z|*7yM7jL>{c82^!!HmVqRRZD*+^nrub zPLbgX$9nIR2OXyW0tjFb{|0h5a)THCYg)q~PUah(md$SDzz}vpI{e)TbzRYY!azcO&-gh9Eq1oAi^1iZPX9Z|b za}v+}lyzKR5Utz$6dm1Dc%)xfQfoEZ?Rkj4|9QdjN(o$XWnkC>=;Phv1`s#LS}LWR z{|v^&>To-2EA*?Mhf{f>EA35yvtJRQg2VO0-gPPhB&M2}K|vs797r>5?>G)ilMAkK zn)OE69zoW~vzgJeK1YT?(bA6w9RC^;RbkE%#W0<#sZmUHWAyT8U~o93Kf>?w#Dbhm z=udb^BKdQ)EVUKDnes&dPDUp}vwcUNB?BCk+1AkeH>2 z8Q?c#T-Tah)=U9?JOTr^I5IeP(4jSLiZE6_ss&`T&{t`u{rSiGq3j8_3)gcr^mzvR(q}uLXuE zh7t8l0P1K|5_+n(Ki&7#$_of0Hv!^P4|-?{^neLS>Ro(nC#$-Ew*TpN%U%Q-o8A}( z>9PTGRE#;u)>(PmY0;UuK~)9F?cBiFO#?u1+gD6g5w$k?^Hb0?*d~5-BaB1eE0!}) z$yo)4k#o33DrJbyBY_Pb@JtX|> z--e0@y|L>rzZtq?_)Ciud*Jv)0maYaAtP{dbld47Z{8W31D0GfXPy?|7_xsH_Ze=| zfqmq&dG)}d#RQ;ZE@zq5$DmiI1Idk62Tg&Rhie8aZ4?j}n`2*ubDDo9t^bz%W~7U3 z34o7&P1Tu0y#T&TT|4msxu=0~D0)8whjUaIka|;0JP*OpeD^-Tob%i$RVjI-+yYYF zJP~mD=;IOq$vs@#8q&2k$>+w%jgptMhTLURMXn!kyruE}Tz)c3fbzf%mE1N=WZucUCgs`FbN?ESqj|5FM5@01n$4hhybVcel(rtAAXu($9p zE&v^!%b2>XF5uG|*l}|pxU!i&hFiVd8K1)(Fd`%J+5KJrk()`OB@q&-?9y&bPC209z%Y^yJo0DcYlA%rd&`kyUw)h79$_SaI8X+ zpe4Jfti&&{=hXrMT!RJR6j@`mOF&|&TWQ*rs&hN{20px0ej|E+M2G>WOZT;6)hEyA~5L^GMOt)mRbDtD=;Lx&=NHQ{dj9<-soC3OeKQ1%1vk(*m~C zt(LINfIO7&!I*{{K#j}<7%IQ6AS^VwbeMlCT#bv$9m`88cq>pBJEWOt6kR4j z0MBRIiQ5c5j(dqUY3070sNzk9fV;Q)AN$*Q=c?{aLKelp9fIDSsWY$ z3j==sKiw%X`54g?EXPM^m@rHS_6;UQ`Z1U07n$@EfLlh(Fc{l9^bm=PgN;{DWdJa(;o`|2vD%7cmCT9;d5|2Iwd(k5 z4b$<+!Jy9g_6AAylu%R*_FxI1)af}E-Oqnmm{;2ZI$4EBLBNKi*>z+2H1<{h>Uq%} z=>QfA^%3QSd^x>PKPCxbR|2@7+nXD;qRmaxPn_d$V=G1*C)(u3ZYY)w2q>~EpyWy0 zfd@{6zje==twM#uy!pO|Hjfc3c`Rsf~GE1@UAXtUwfVGRJY!7%v)>W(*Sa*aQo>bE*pp;2g(XLPJKV z8({`C$`937q;s>bU^f_g#<+yqeT7R^SESEaY*Cq|8VAaUMhJe|BCk^9 zAl+GLHB>`$`mUJSnwx&`v@IYl%gpCi$w&%=8N+AT0M}_tP*omUG}|?hf-Px=vuYO* zPyQ$A`EPO0n>bj}WQE>uU$Qj;(veT|i{UQN6RpEEnUn#{Rfx?-o+)LgEDvXy+{<5)&VJN?^@r+n6nZa z+M9F%tny9h?>;0rY4^wb0XFE(aHM&pk~+Sx+Z$XBl{!emeNM|D#C@`ZY7!@J~8 z|EyJ)!dJeDS{inY(jBWf-eg_c#vcnK6zDaj(xF|Nk&43>q3As)*(wzY_S%Y=ui)%9 zP*~5%NI=*!kn^gn=Ow>H5)}n8dKAPbRcP=;FB4H(+A`GlsolqqbuT6gk zy^KQ{Q~BG@9iTr|78K*;ck8LZgUBe>f&1F1W73wPhsna@pkxFJxMot_o1y*)wNZbfS#rM;f(75pMnBkN0 zS4K6bu{yaEEM0uD@p`C}p*+!NO|;T%rLj7OshPMkG?ObD-?c@-3MlM>?dx7J(Tc@ig0Gu#ak zPh@oB6pCt8FMemcmcPnj+^+Ik8A}X;2-QNwD5;&Mvm>GcG^8>R!VEev514)IVNtBY zcntnsUJ4taM=e`U$_j*ulFH}kIX30cx@wg2jJF|KcU`}5k+>;jK9P#` z10cFlVW87C>XC?60jlhDlH^c{m`)9|oCL+mr6M_2!}QF3CBwXE-MuWiVLhZ6u~?B{ 
zeynK-oFLD6zxi!-83eh!7Tl|!uvw%_xBX^ibdZK+W%s!MDASPwVe4{Y{50p zdMzWmA||STb&uVuyj>%95dGMp@LR#@D=?vMPI2UFKH5cOJrO=gav09)@;!syFRs4( z*-xtXomE3y`Gsb<%ta6G|B-O|H;}j)eaLZd46x50yOWh^koaB{<_Mehj){#)RUcJO zMk=o>Dnn$DLJA8p-lL_F#KD(dNHrwU=|^L$Sfm%RFaTy`J&jyHduuEr5hSC{nTVkc zC)^pHLV(0W8(h=_lF3z!JmpvljCuV?N|v)~<}lKqMtSNSm5}jVBBZo+=vNY)FM_ej z<{uMOm^g^IYWum^|65A4b!QlN)?fbCQ{t=Dmu5Zwa{X z_#ZR3PAMAqv4rU6YAJRod^mA_0Td`r_*Ca@H@Lrf@0T4u*|wW&^J#bfv9iH|zr%li zim!Zr55xp=yep)Eg}7WHTz=19{Ay~g*h_e{>eXhXzYbPJ1L1ux*yvudGoxc6k4~v9 zWi9o1tHu`@!!o4q=t&&*QG!*$Y?gf|fzh)Uw~9M+p+8}j<1p@n$(;i-?)lWa*E5KV z$J}OOQzlnL1e~}^YfqZueJ_ajgr<`*{&E`Zhq$wh#?KJLiP23cXtj^rg%41jh3p9b z1J3)+|CpZ0y@Op-N-V5=Juugv!Up&Wv0E#2e9BkLR+tb3Foz%@4q3V!Rf3E&Iuc&I|MVp<|<6 zkjq`%Qgtph3mKc+ONj6vj#Zxok1yVdZX-X0;>HH!Yd3>rF75qo704a!xyel&lLSMj z7Mwq;$A8vLXnk>EC7`;Lll5lArpPeAA2Q#%Hjjn(P+D5Lc`aT!_m;(2EH#DqbsL_Q zd*LF7;JABVX&p+6VT6PsV;4hRHDaywqJ4dhLUS~WUS#q5OpxjpwLj<>CJkZySbRnM zes%h#Rj8QOSRtar>i>v{h_z4*54Wycl+v8qfM6yI=}lo0zDkT0!qn_o^5EdR}T>ksI~3lM)M!QEcdxcEgSsg|K4{m$u{)tW`{^wf(30(s4tZp&Y zrLIKsuO7aMAuHa&G(#THvJ#|H`Vjy8cmvEkZR*nl0VTUnPOPv3D1d7~-UwUR)gk%k zEBy_`!EYbcB}oHdt`WmztVi=~{BsD9N#M^(?g_P^Ga6ri00pHIfautl;x-#pF0m5- z`F=guU7s;7LX$?rtXG;e84?ppJ3ZiD6-M0YCCM z%`s0}gQ&4J0|wVL5RHeFEhPT43yI{^z&K#f2;^bNp}>zRvK&pHC_s~vk9uV*uE~%@qK;u#{I~R73jR{zR`1mqHe<#(6 zAuxTmR9D3Wen<+sM2%@jDF-|mDjo0pcW^lB7(6AUJ*Wf>YVSiZntUI3^mH&h#@Ep* ze_vfuvEYaDjis!>W)BzB22urK5@zsZx#-+~6eeMRljT=o+NJ!W0bz&3OHDvqNI zmH@rz-RpmjeHMnuX*d54EWJ(w@XYVF(q*Y~!np8@_r-^x&@~OHgpG5e=;`_);4NVR ztQ@K@FV6GbMScgBDN_Opp57PV0mows*t*z*rBVyPbm1cB+pZOV$IT1M&CVM8Vi@(o1ushH z05{Lr^fkSh2MsHgi-x)Z;JC+=1#SZwrnm=52^u6?wrQU z;@U&=@ynmGP0fm3So)Cb5AYv!9e(Pt7rzkQaHvO>qS)N(QU1Xpru1|pGrt}meXMS~ zUCmuxT^-ZzY&f^nvjyK4-1Zqnr#BvXq(8&pWCZtq|KPU?+<5Bj>Sx@NzM+8$@gcBL zI4SSC1uvVkZ{ECJ7*x~$91gb1O@O-9?=18%qU(S&f|2L<7oP|wDUAY8Kj&3oh?=&M zB8L?jzNdX!fQNc3oI8X4xk^3C6#oH_rJ5+gqi4@fKo0J$xqAJM@KaD@S6*uaJjz4l zdAGhgIdCSNZ zJiP)AZu8uX*SBYGP7iJdcyYo792bsFU3FVswqM)iXwo8M@?natDTaku+PIwT(M-XB zW)bBjcI)SU1T(4MC~Kw|PS2DMq*8CYjzd^eT-Nki7F>g&? 
zi=SrcvhU=iguDVuBU+RE8&E8$S`~K98vA77P#DfokTQPh1x4lZRa|EO2@e}z&Zk`wBa#3rW;>4d^Qte2oS%Rd)<#VLm6Zda+!jI( zWOJ0;eJ{{p9n^zV7K1yyv!sODgCfsb&tq_nB>cy#>*o!iNrue;=idZV`mnG83{@hi zRMkh|=M_n(1DeL<e$~{1ybMB4&d>aU2I)h84Vw~i8m|$cJc-!IG5kjA3Klxr=99@d+g1W-7=er!f zJ)_%0_a32NmbZYxd3$@oXXtDR_>muOU!T|j6l!z^B*Ooiy4~;Io3-CCSM&H9Qu!YL z^lbn>6uX6o-dEr#uK?~RUfmCDE5s4q2EVm~s@!R#i5&)vz7({o?1*TJ8Q|W&F=6wB zf0L>xxne6LSWFa6#Ulx&#hRq2HAxVeSeM6^q8-Au98I$zL70B(1Jexo=FW^aim~3E-$$S( z0fD(iO)?+=xBZt16{IJ~n}XRT=f&W{B1#jqeDpLoZ(!|Zxl%sMa}+V((8vEPRa z=Cr}!aQzgwwyW}m;K*T@WiL}clU@GhK=C>%SSo!fCq`u#!5Md#%&iZ_YS8et;S!YF zc-DRC)Ps|AeD6mG4AAU#QSLon4sL|+4RBw$l99Oq1BJbjNRNjikR++aQLFs4<!1xv%R5N)0c?!_nDSW^B0d zQ~X+-!rb1L?pyFnfe&XSy5Lvt51U;Qw5FQ#WA9ZEc&cb zkOS!)4FLms7Bt2J*f=%gEs74*ITATVLs@j&y)(S%iK0nyDY1J1MSuOyulJBI0jwCO z91Xp0bzmq*fOU86UMGZ#w)e-!YY(K}L#P7fZdcvD7MLgweWHnieC!s{y)M-=++k2D z6AYC~uYCTa*He6z?w#d36A&*RJ$plLgDVnA(;he z@1HKuq=u^aASES^iS+LvpQKg;3uKnlEL$*xI14wS+N};{YSl66X(j5CvUT$Le_6G_ zTHRZKQ7BlBE(<9@b$7T6$Guasq=McwXkREx_xP%pOC^XQEA-A)60w`qDUmvfXh9Hz zL=$XSS@P}e9h%FSkZQ9I!opVp!J(x@CbBi8g=swNI}rKcR9bT$%qYcJy&3!QG!m#k zrt(!V2hnA^ls7~WkK~~a;Azj#W~_b&lAUqcBFwc}e;{`Gh=81=L(%CogaEFf?M7iDCni`Kdba?XP*#ui(amGr4l9 zBgUfJCbB#~o5^PJ5ux%LuD3~`g{9De%QJ4#4VvXi9szIm@#w8eA# zW6_R?g2pJUFjmRlw98NCz zRAfH|*W@*sc?|LQM!_xLN{L}t$%=KbdzFmG;0BAOL%!z0haTp1zKyw)vPFvy)|)g> zFtq&~%X}H2Xb9BSox&Lc@{74LWHGgS-H$Xxm=wkpsCe_-s-|tyh<4 zZPbP(0@=P4&+vImk$CX+?{v0A*Mi0LfXMZe0ES{aR48mhKFzTpfuQBI6h}dndyXyF zZo%&{K9igg&YFzk-x(o-l;wyatHL^=1jt6%=c+7-8=V5DTm`D$xgdp{Ors+6ra{48 zwljhD9CKC|@k~1)9vh(`f3&pC3fK66Zne(E<~|HKYI` zS6!DB?8*Y8+MwHgdB7$#LFIg3q`==2KKm$`C-J69&y{Q&$rF4FitQV-BC1If_8idtC$Aw(l=eogXJ5OF$e6rH8CYy?R^Y1=V;{LElBmIVe!n zWVNj}9^iTvFK!EeO_ z==zC7mhj+NR>7V0WCWZtRBX3)ED)%`DNWKAS#*F*CzZOU_i?`H-Pg<%J3N+OX*0GG ziYoBen#Szt17qTqP&?;+r=0}e@xcUzi4~!Fe7n5Rj9?YBt*EDCy!a#UETG^z zs}92F!W>u(6l}pYKaDVK5^_E|!#zN7{sqyKdcVUD$q<)hkIL&0ntiQ>jI)$*u0WoNCWI_7=#GH;GsM$){>gmJPh$ik09<+ zJqUf`pGq#&yJ2$dbv?KL;sWTV$Gt&Vo(t19MeyiA4@?884z`H$z~HD>4-Q>2iBZM>qK89DVUu?Vuhv1@>8G&!Dw+-+MQ#5g#Vp=ag_Rtj8hk> zI=h--8FO_(lRmne(FDC&YTsCsD(D0GblaaWqwiU8@rkoT#AQkE5y*e~?)VZ1`9@PN z3-Ns~eLn)uJ?HQCnModB$7ey2d6|6r?tA;RiL`mwsyc6xDKp>uF zYtkvw@c2B5(YllFGcg`KfMz7BKollmnrNw-#VYB-{2IwK{mVDZr~73`1n=jw^k5M^ zP1KIC+z%p##CN*pZQ#^4UtOz>W5Ar+pGDsoabQ8CguqN6(_K>lNedK{v_vag`io#@ z`X}bfv7+q2sV8*MFdh->+Q3iu1epBC#$LSoZNvgl2ac>V=&Ax;x?kpZI(fpZJjp~D z;z{{MgXuLTvLBDk1?Od{pJ$M8S&hw0-iy+c*Bk~MrY-t`I*^53`9MpEtSZA_pS_FL zV73B#i>?$49UCrVz)c_-ZqG4=`U;f1mRW>CZ)LtF${gPQnko6n;1l6+^sNUfOy)+A zL>@wEG3(#?5H4#OHFLj27O7v~&XkzpJdW#Oji_Tx-S$C2pThGlM?Ruw>< zJn5uL`3VIxws5l;?nL>`V0c(%$qA^0XW67|bIax&ay{O;7Z9t1s1Rf( z*X^gI4Um+)OaIJjxVP`+5Sc2P}VVBZW&YL!^&cbztw5mdD&4YEaSy6uItLA~sVdnCNjYzb1m{U$S8 zfzp^>rmQT}ji6x;#$U);rC2())LjLty01?Y3h2hLj)0KOQ~mnJi>IKZMxwju1nXq@>N8qkR>y5m_>)mci_wmyhKV9+q;d)8FFu6tWds!0(8l@M^S{ z3Ux%?8cqssF{tLi`z?&@t)Au@Apib7u<5NW{(j#Y7#NIA$9Psb-hvV)luQ+Leu;uK4kd&pzPir%Ou{0*>|L z14ftG=fR21^>!wu>zefvd<`lTP} zja6OJsMIm~>K>lHHV})Wtl^u9!`6}J$hS8JsqmvzQ61?|%!Sj42L$SOmAlxOfAd7$ zBB3=;9Mhkzp0qw`aqqo*{U8R8H3eUcX!X47Z-Dv!8?Fb;VCy5t-t`y-cQ518%O)9r{Po2xUkP&zc9)st!8)TR#;mWWZAF9w#3?T;sARP>)p4HBS-2PPcU zr|+^FnQ}C&Ha==MpdkBJtYHXuospTSc+N|{Q~$Nj%Gg5Cq-UdjroENG`Pmd4N}w+w zF7Prx{&3iM4~{dxd68pjfPL78|R)+YHhBgtix@lRPEo%wsH5Z zH~JzP?dQMm+T<)b(OHkCDW0zDV}oFFgg$D>D`-DmuABu1dWCev)~W=m=7B}*7H3fa z2QU^XlfU<@3qAU@DY#~%J$-9wnnh)7v&mG%YbSEj4IL$4ml5Nk4KI{2n;!EB+hj`c z^xUW)-E6JgLnl)!`%tVi7%Lqh!)!t)(oU!bv3 z>?)nDYi6_2+e$Tz?zZL+YKzRvF9ds<_)3?~+H6rgUO#&7ABK}SRvY?mow2jkTbwm% zm3QCj$7}wQCgttj6MtyTc$0=&aI`$4?w}~46nUS0%6|DZ!_wj2V4~pR+n5-h^n^3s 
z(pF6(-w&V93CZ&0H3y|%M~yw>We-k|gg0iJ>g?8`=q6FVZa=DzGT7h!W>cSL zY>y7k-V^B=bn7V1+t_PN)1Rc-`Vj0TJ5ZxvXKjzuZ1eVA+D0IB;hl1bG?CFp)|JX} zejSx*RB+qZmw7V=EcPB3sY|EQog#;OtniEMzPsxu(er1G56&!nrY5x`uGjuTG;T35 zGVT4n!LI_+Cd4j@zs}P6t;o=m_!rTFEt5>sRTUg2uW6&UJJ|Uc4)+fpoVkoi&Ka@1 zXGhO=YB{ZCXF?$4WNtol$xp`;n-|XRyX|~6&mJJM+}!w9L1ld0-+q2rnsR0-gs`NU zZCi5M;~&W#Bux8#j!>_(&GqFE)vJr~G$wDaAwDpQrBf4?vIbPGUVe#ID~)H-P{mlv z@F_O}15G18#`(`1E7;ogcOdRrXWP~6rAO@#cl~$xXN|)iZtNBI)-nvJF7Y z@7u!8*AXGtYI0V$SG!#2>t1g}TGi0r&UtJCLA6?|$i1I&_31 zb9-hwO-ID3+qZLiG|~_!#@iNb+Y2R@N+ACL=UF~)R6BN!d+%K1aIe&R7I5A_x$$I1 z1G5F=E;d)tFU8{W>XEzu-bwxFoAXF$IUpJ!h6B z+y*7iPHXnof(6(o8_lR%+3HWyRoOTDl?+SsPLz`#(K z;`Mg-v~eZqKC#=49l{Y>X)*Az`?!mHvzhzPv-i_=I?DE!4O9)LuwiHgugimdsM-i- zH>O&z3&hh~Rk6AFB3(Q4gD{g;dB3-eV3U=6Q)_xWF@?{svYfNteoxQsts<}8GW7fL zteSRngy)Mp_TNUP8YF3gZM{<7Fc`Q@8&20$#Jd*jZfjp28h&pvch;}AQr=jD&bv9i zzVmb9D-}X~;b~~AG1c~+(D!^)yz7?P7HlJ7C@^bgIm zc14{Lj@${<&3zir`Zx0pTPZ}x`Nlp!_<`_;Ad4SY@h}`JllvU?@84i&aPw67MzyEv zV$#OuX4kZ5C&NPxA-V?n9b?yUust^iDgiX)vZC8mT!MaMb-AfG&O&8pXx5OmpPG4o z<$rRv_Gz-yoQt@8UrJ@rNLY1KqIrCLXy#e2=iz&JZ>W#5NBCnGqb7Ncjr|`ZBUD|e z8h55{-Er;^u@lgnH9}yMSBMwObZCk;1*b97^q3RzMuQ(=;SVp|y01dby>;z|dN%tu zeGh$G@7CHzH226QOi|eFOHOU@$HX{%zK>J8Bkvtfo<+`=7Yx~S0V5peHY4KX)J5!HWaTK5(+LB z2^%SROy3&rEY+ALbYF;`-B{lm5x^7m?0_4uoCQl_n|csf_0kA83kx+#oTb&=OwA?- zVJ`izbuG*tKppI|KgLVJd$;gnJM!R1vAnN$0jykFO9xg4)`{#>P@{9}gvg?R8^7H| ziEe*LGXe$QH!iQl9M+Mc+Y?oFu3k>IlQc6-pByLNDu1WhDmr<5Z*~=Wb;qc+t=wI) z^YIdofD20<-FJIbR^yc~k0-kU*yzj%Y`4~zMkHDdE+2Pn@o5{oN5OY|Uy-F87lv~~zi_0KUoi02xw|f@7}iIp zZLNRaS|V(a_zbyl3xK~`kE(x&N;s=iO(a%KX2a2T{52lt$Etr6&rASu3;vjGIHR{d zZusj&^T8UzPH?rT9{y6XV^GNt-9K{YH&=Qr4+$_|R3A!w!hDh_2E*ajj=UoNnx|hZ zjtPEtW5k$&hx?s`EaI|9&sHAbh*J*dW>Qt?`D`42*bNej)&8{QMot%m3bSi(n5s)v zq5*K&*oTeTs(4q&0|yQozHyq+K-!j`+}3Uqx0KpKJJl@|aOFHX(HbcL&u5Vr>nKD1 zcHjg$s{_Ok30KZNH``PWte2>Dq6-;mzXWO6R!_pvRr2~ie}o>^*aZLozw~AB%Y_G#;mF+6Pcc*d289mg;U=cNcDtWuCc>c&G5V?{G!CshjZ6QoVOggU6+h zt`-ZLH&nuPFHiA}Uj#XLC#B00R*ljdWu<19pvtO-WepWi$dUbfoGd^Q56Tdr^PBQy zv1xy#y{c8FH9%d?>X!i$8{U=n^N0LXOi-UAKDP6lo~?}Cta~B|9XWHstycNNj_v;Z z%78GB^sQ2xX%Tx3n&}Ai)Yrl3f{@bVAG>~Wsyx|axhKBsSp2h-Z4rZQTRBaUGa;@# zw@fweo#ANjCu^qdP6^H=qa1B`YtDRAtyX-6a?w@VGn6liwz?vgroHNFeT?;!=VMAA znk2A|tjl8N1%i=CB_y28^(W!v6g)?Ce!#GCj^!X6xA{NGN756wg~%g&o7$q%A4iO6 zJkfKTGF1%`^Ot^lH%VwTe-8(fVG#o6t8nsd0%=e=YsRjsU6V1R zTP?UDseJD;^nB(T)xf6YW)(-?RO#H8Uw%Hm(_~J0&--;V&L72@hNLd9_uO{(OY*fC z`0ivfGwobhk^R)!*2^U*-uJqar+m%k)-Bn&x+l(8xou^4l^gurwfmRvYxlo(+r^1N zTQTDI+q~Q4Vf-r8747t(X+J&Y^^u6@|oPP zFakcl0Yf7gayLMyOVN5^W3F)kGEk1Ng0M|NeXk(2$WFV+4o-XN^hD)2_5D_*qn}4i zXSZ@@>9&rz3#+o9v%7HF6}9quMqFi6K-}vaGz>W05o~pI=;Q1K69?7(Z#_2Ir*8XJ z74?likoG@-LJKknW#tg5a5 z8a7Y~fxStk*?@on($WZ?} zJn!{eE*h{Qdx z1b)RRuUows&bEa~OAbynSyjk>Y=8e%b_1M+HbfZHMTvB!c4+Qfj1aG%;O;zT{BUu) zW5}8fv$E!Y;eRF;zQ3hCLsEZwwxz5hrL}tgZN?)?wr_=6$8elXUbHM-==i*|+IwP> zAZF=8$_o0nyu2Y;nO91_aq-b(O?U>~ZRs}lVc>l!zy78~wd1nO1?xCwMS;IUMLmnf zc8_I(*1YDBg-?(ZdUk!aiFIKv)T+fjHpqhR+olF-w&N%A8J;MTqf=UplF7pSup@Na z6yx~x_SCRPQuXIhH{*<%%3Jz1MrghZU1SUG1VJa?7O9X)w{6>2%_>geNRX`fYt>O& zqSTo>`+lb8XDrf|^&HYFhX5ppd2PTtmZoX>6;yY0k6P$l--Xk{w%$c2uE>Xcm&jz* ztUl>#P49M>EX^nE zTIW+OcoZ#hnC{W?tZUh6oLwKPTJqw{Y#t&an-g zFqYT+{^8STwVi0t@M2$|Llj#0{kkh-kB39EIceP1Nluj8Vdw|bkv$VLuH-Uwt9ZTO z`W7Kr+A(WGam_|G!yd^6oi}GUJ+n!)MohvivNAm;bCccqJc`#{n<(eZ+RwPJA9Ii0 zv>NUXi}_U8t6MwD0>GOvGuMfba}~yW(gB-4v7dt;!kPHm*)fnrB%r_PH=Lu3J8@9= zMri!e_=Sz9spQQ_G7vk__h!8+&&adRoQF(+JNvHWD4YD2FDApcuus#HzLY#CQ~QT| zB^9yocxLL7lgem ztm?yTnBz%#-wd{AXGD4&M-XZ#)Sj^7ThH`M2I#CU468CuJ2;y*xUczW$f+A!sUKdN zs`J7nRL5jkhBz_N;U#)_B>iN^> 
zAb$qvtd9)g-y6k-gaWp7QQ*sosrALc2XPvAfO4!>3_Ji|BiKnI+&ojsI_@!t+JVb< zEbLYajN|V+PCAKACHQD(tQ?i}kWKIx67J|xzu;3tYsVFzpx}u?%B{!Wv4qxY`ofKp znn-+fTp9Zu&X#FzV(5?LR9Toa<^v|paOhl6YdpNnPA7VjU}8$Vnex{X25T?OL5#7x zW?hGGyGvdlpRSR>qFU$2bWq~=Qmi=xYf>de*>jqu_j;ErAEnbS&wse3*`rJ zeZ@$yXgM-Te|A~%0Uhw;8vI|#Xvaa0Yo08t<-EL>J;Sn_A%(p<%tWjHT5qLy@*T10 z5S!eup(#}zp?BH)Jl!*6t8i!AJ56buts0lA3aii~_Y@w5l+N4ud9VbsVxm{l35nqj zMA1v7{yLLa;mDoF^)@k?HF6eqT>YA-LE}x+RE*WfbG_u;WtQ3ztGkGeIV^s}MxMe* zzxayt*ov}2YH*T7`$%{~7M6d4M<1C3AVe`3%A-s^BOj6Jx7){Aj}M;kc}dRO;36nY zDNfiNmJp{PQlof{CkPC?oW>z3H0<4k9S762zZKH>N!S6HgYA64=>X4r0^X58j)4bY z&k6Jqspgp)EM9N46--sDZxQaZZb!Ee_BhxNEwO06DyYYNJTAuayu!cWhUESGN`_f^ z@{X5NZOJdmd?-g~d$vyBdTO<)A?hmFk`rg)0%;HSS z<1OTA3#AF%N2m9|{+xq`l)p~X@p8Feo0P9iGMb|K*KHoPCjXaNemVmB{wN$Jzsrfz zvLT20>65WfvCI^d2!fKgk-IrkxM&-hhX$^F;^#%%h^32lWwZP!$+cf%EL8_p!f}G` zENY{)xBWN9hH-kqezBTCDgC{1GLG zw-Y4t6~YjL~&9;I6;%)2;5x%N3fujSBu8BgMaB|w_Liw|*CO=;67bj|G*)$>2g$Sif+kW)TXY#6q3nf6R*Zbw`=4J&hEw& z$L0tAf;R6YI7XOn$=tR(TwOw3;T5?)wjWODzHe6e(wINMA)jBrI;sFI$=}~8Uo5gBcvh-U)b@EiuUe>DExun~ zUxEqnQB_K)LqBdwzg-rkPiqvT?MY~Rcx@GY8p)<6>jGR^@h8Tv`@qzYbhReSNpO^s zfmUkIz7AzwcApVv#&CUAp;4j`7t3lPOh2}8ORGGwa9%#-ye*kYc3vc?E{G&RJCzKO@|61ayM6m8F?E z#H-8BoGwU!I0xvB@Hzw7QJmiI^gbv$F+zmay1vFCw=J^zWMa>k`zz&1X?m1>(xi)0 zY(H-Wmz-X0y{ETlvQZpUvYc*4@_y066{6shM3e zE&LEKJ~NGf5#l-O5G5YsG(2FFw{6jBgf^tBUZOK^w+G9L6^I$It@C8#+!u0RIE!#q zh^=hNXIgIJHu~3=kwc308wqjOU)xyirFFZeUq0u!e4s3Ik18eKSzNE(WUUwi(nxrYSKd}J+Uu@R~_IVu&`$UJdhrkvb*L&GM1bK4%yuE zF1paj!n}ypG`@foVh9)*eeh!j%#rNXNEV1@l?Mljue>F`yQ%~ZzU(SwxzNIZ63GGc z#V9aqiw-OT*q~4u{n-6raO)c^sZmq?g-GyC@t^4i`#Te*mt%r+ zHX8yWIG1%49)b2j!7O{Vt03a_nz#t~JLbKzchpati{CLSh(X&XnEXND(KJW*5TJen zfLo|%5SFxqz2R|G>A?LC25`T2zxoWgAN=Y7#ubONwg$n^ilAY{Yu%=k;O{Z_z%(*%@F0SCPfcd~x~p2`)pr!Li7naayYhq<>YMF)D@ zWum}?nD(GkG{NVgAaq;$sdm)oCz`s;$f`VIq>nXIPVu3pZ;xTan|N76z4~z#r$JSe%A_sbI0Q3aA zoa!O;;qcF&Mkru^NPMCBngun{UaemF-$fy5g?D75{_$=0h89a!fIF-2dlK3YtR+Klh;6h)BS?t49tL`~S~;`CnIh zZ9o%;#d;r_IHS0~^h*vV0li_$vLqC8+IUnO^v6o}M$szKQ73H~y$#F6!wr_e5& zc3B?-X*NwjB)WCX;RT2k9tR;iR>GHOc4H0CopYc6_yXmI2DYr_T*Dff*jp*5S}Y)2 zvXkDf8# zmq^Bl99SEEz25{1kON>kGxo8dSd$As+8529t7TEFdmTQw+66{h8)%}_%4oqfzW9!&>2N~pFA$irCL=JfcIJS+YsoLehBf_ zF@Kh6Xng{UGW}!M1B(K^PfeB;-?`&+2Dc>HySfJ4&Op1m@8d`YnFPtc!`I zpzXbG7}Fqx-55v{UNcBPpXvgllODBZE9cj~AYt5;y*f1|spkFWEl8N5WD7iUcKhB) z@n*aR{!#Lfx+C@~!Sw0n>CEK@=tCZl9!O8(00bY$9;?FIVnkcO!YBDHTxOI(JPQYU zkJ5sn^~n3r{t32FaJ-KEY!D{+`WBT_T3F${_vVE4CrglkdXFHC#AgahJrO3bi-H7u zLSLgv`t@55?;|#(jG=s|o4}*_SJk4_0R8jO%1Mx*`EK76IQ zc?jfFwg&mD_uw2^18`~52^2S?D4cEJFU<*LfPQjGdj;mKuKmWB)V1@3!Sbknz4-T` zn9VduSPt2m^MlxXOVFhK2tS}O%Kx;yKcZ)is#Vo2ECtc~%V$R~!{`|ZFx9HehIVTb zpS;QNynvFff;^2Ir$XRm=tAHE1A8UvH?7(TRiJdz66RjwOvdDY@+HJr3y4oZn58gC zM7ogqVW~IOdT$Jv{R;CG`U6MwKA?f(hRl@W=vw^o|Kz;PrHhPp0@3v3*L)7ZD?R~a zLz~O~drLsmHg?#4j(Rt@->?LNt9O4(V{=ESxTQKXimriK_51e@FxWoXM*{=2F~}Ny zM5X*@Z{y+gM5xe-$xJ&K-wl=4yVtjIlw+hz4CQofI~jmk372OVG6}9Y8Pw5(%RTy8 z51*s$7|>S?&wB6aMD6$-3$Nv6-rnH^wpS1_ct+6^&(WTns^`2IW3>KIZ4_Adg2+L4 z!woep%b0Mxm!Hq|kHCyL?%*6A0;X15?Fv&hTGow2o{$LT*S6pt|7F>>yB=Ai96W`% z8HUdCVCsb)@|NTFNzCKlF0`EX22SESIj&p1WOVjTAmRIk>(d`>{>W8dl;H5M&tBJ~ zaA!)?bu+;t4{g^;U}LbAs2Bb37DYfH;zVPZrz5$CWqSnU%~tuCNIiEz)dK~$*ex!cnb9|H6Baww57 zPJvTZclh%TACMS*V&;0^XMekK&&yN2R1#A9Szz#ld#m%8Jir2~7d5V@t*!*g4k#kRvuN3Q)OGA*k{!lPFL=dp*_!<6 zpKg_ghSPOhhucc=q}=lId5Jz=N)(h`&;;4}+`l2rz3SfhB z+B_E|Vp;CXmL(O&^Ra{RU@{`v3qRmwLFyGNu-=VP*j^@D)pvG@0^?#ayb6|^_*5Hp z_vX9F7|nv5y=b?YM<<=;Lmc|085t-D9)^b%Dj7ajD#^NJ$Hx?=zdJ|h+}$YVVhVF; zkevN|8l`$@5XyUx**MGe<>ynZyQp9D>m9t)dW9_@DR6HuNBKdD>();GP)8F@6rrF7 
z%%R)d>65Z-gtMIZC>!gqmyZTCSW%aXy)5HA_Um_eHbe~?I;=gc!ewOV1;6)FOe=JP zwUb@+qzPJh&cJa-{x6ykT9^Qt_w*fp{-BTOZJ3b+C-!?pLNN(|Zj7$_ZPWkOVuPfeffc#H1&Q)0jR~m^kj};-;!9kBl5wGq;)r@Mp z11k*1&989Uc}2bHyYe*B_R()b9e({szbMEpD1F->9>c+_km1+WO@~C4}Fo!K%-=7J{%>2*IpGtliVc&`Hx zDwz)q?N9!QM_V)AEUuBLX}@sTq@a?uvlvf9R+ zygVZ<&U#EJF1d%m?ezLFJ0@xnI_vvGU?E)0&E-LaZ93fEUv-hI@CKNMZhhvj%ts-i ztk8`+yd4!=wez7uYb*x~Kb(HUd~{S7{7l%9IL1ivO}fR~+l$?&Mj#6y-4UIQU1o?LiUJc?sRQUmTwqWn_Y`rc49N;yMX zdO$*B_aX9b^M^zCkGZUXtT4d#R}5#7qc_YDW0Ybu7W5G&fo1!5%iSACJzhND`}UyZ zNY#Al$^E79DFJJZu7D+pW|o+k0&{45UNVCtThZT+Y*JBI*l*7I4L-H?@pGFMvyplc zjLon7DBv?cP#rdA(NU7DiejjV`%o$eb9R{Y9mDedh-mIJt0gH3TLRa^eP9HR9Q8d> zLtQZ>eFP9`e>b6y?eZ>`MQrq_2i&2flm-ttuu8;aKiCLIqDYkQN+06juA&~h0`a6f z<f4So*-@{_AvKSh&y=p{E+n+FT% zrXr;tBjI+$F>lhTmwcB+6ya=lfv%#$zfW)uOK| zCmcG%947rS6)@PFN$v(WcKe{gqdzv-2 z^E81eHm`L4C8qKgYAIcUoZ3ekmpIsG)D@-y(AOZ^WP6H zVnY%af`W|Z&uThwPpOxX(O5$gptsU4&CJODRf-KO@p5@vM(V7e~0};&%83p9F*xFp7?gqJ5@GoZ*I;b%4GOk?}Go$Zwdphdd@dN{oz$k-R z4$91%?}@!7dGt4p3!?x`7kOa;M0L6ia9_XH%zK*Qvd9$p4GFRS@K~upHYNKk(MUcV zo-JS4{GiHGfYd_-k#sM8GuQ+&P$rLBv9F;L>@am&bw1o$xT9n@*pRtR^G&po8&4_C zfwF2c?*ZzJ_M9(iQksZ<{i<~9h_j$lFlIL-KamzYqM?s4p1khm{cK)0ORgoT>hp;g zV_$mNBaj0=m_w96zx&G_WEvtg;$sJ$Sw2|*W8p9S?<|JjT9ZSx!3~k>&Ee7VYuC#i zGILuA^gGUWsSu0N<}mlgHxe2Q1P++ONdAaY0i3_oF9Eh@X@?{#`FaEx`m+A({LGw_ly9*M(P&YnVpm-{S!u7shH7|Z8bpe51dvY24R zdm!(~h-Amir4fwW4~nKadBn{SD@H3?i67x5#ff2fpQ{T5o`(BLWs#LLbJFA9y6Q7- z^h;un?G~pGLwRssCiqnErRKa62Pt4|{mWDwZ|}&clJ(hjlKl=zn@9Hl2_JFr~`d1Q1`F^~sRgi0aJVJ97Hs+5}ksg&DppRf>9S@MWjlqTue)!8+vdu8)W|U-Y zG@DE=*6b7UQz7G?B%e6CqB%iwY1Fv0l-M8f3tXv9`h!#`OovF`IU7&Q zggE}e6FM`|`nMl%hf{RCK?i|W*8lt22k7p@tvYYb?vnMfQ+y0BlJKEfO++plqThVG2Nwt+ z>l5lD)TrQXv*>;fbI9^u`Uq#1h-T`Gmnzog$)KGVKk+HAykv2@4H~EXwQ(sjRfi-*9+hk~wW%^eke-%iB0}hDgS6=9VD~NEfX19`j zS6zGh-}B=CT;&Onyv6NF6GN~L97D3(%OxZo6zlT$pH~JH7xz}><1Glr69o|Ab4>M` zeI{JSigYh3`VAIN|j?llO3**!Aq0D%5SQ2;{*Ps}1`;}Q?>QE9Ml zKIk+Z5rYoofB%U%fcuT&YFq%g%VYt%%U95otnr`MfKkB!9r6CdJzRiC($c z&cB~au^Y58(Ai<%0{lxnK#k)}GwM_ST^8!IE%ZIW1dT6^J*Y9iqCe)}MbTU{1y3wX zn4}9~z>(kv8QbywAOF5ATnF4gwpcU`3d2qW-KF1~D)DEd`1_SNo`RvH$&vO3cytFL zxWkzpPUiRT=OW?&ZG4rYBM4#UIZ&@8ye|qAMT2?!L4BBi)lgl2h&YZ)Yr#BE;s)<+d+H8t}&)KWlRAJ6<{AM0HKKh zH0}a+!TOLN5cpgb#Fs3j{S(mY0fFRU&OiQi{omi^v=y+6qDOD3yw&NGtD*w1ElmsP za`xM*HUB$HaTNC%z?839n~pN%5;!{(c&q{PRh<_uf4;%* zSoYi1K7TamZ54F&-(H))zbN2+jJndiH9k8~q>rv1+fZ}oy~fcNT5++?6ZzMX1Z29v zLKmj1N&b7p!=Y&By|L#9F+j-0kT(L}K~-?@)B!ns{x?>m<0`n8LgX@67BGip>b(fX zkI>||U)qOwyubf$h=bSLaezTTj{jio1%^xvC^5gMA?1&f?7#h)mIl49R`r1SZ`TzJ z63K^Ff}!{kyML_&L`0zCNx3*5ovqOVV>r`u(AC*|zg+Oqs6Mmy1oBe78Mo2W83oR03OkxBt<+;6}(U}Yu)R0|01EV1Hof&<;a4jYz$Rk zM&Acdp`7e2Sc4gT|BH`CTrhRw~?7^fLxouU4}u{ z2a``{yFt_i2~a1AylnQts&2-90J(?Q;vroVC*_`&om=%#N`t9|2!|$vs`M%!*yi2? 
z{k!ECx%iuB_+A#COdo)Y$WV#i%;5!Ccg7L`Bhsz?>uHPwaCRC8R6tGlA)s$wCG**r z16pbV!1MZ-=i8oLO`ZVap13@pxf~0=%kixVN_mS0Ze4}Oy(uq}w?X!Sv3v4KCIBTh zZT8r%oyP+x@#zL=N6*@k@+Rcu0aPp;XyDMX87_t=y>DcyclaMJ?w6KP%=7*xg<`!U z003(N`Xx}U6%h^jB2E}a$rS>AL;4M5G}Y$113>?8C$4O(KQHQD0*s>uEa8Bz-C*Zm zFRX}l9)q%^0y?kh5n%AZO0Kl^s8`_kvr`}x-ULo^`y;Iv{3p%8{p{~d+#h-nWaTIa zsa1PZT-KN`D>eW~DKo0~u~#jA9Zccdjo%u%C%_!$(QE{SaoQb-{cU-;b{EX@IiSN( z!^u>{GCURg@*j?;{i}aJpyO=rzI{4sS+^K_0FtCW`Ynj;Xd}1Y_~dZB7&mcwaVUIw z*~`BzfL@4{J@cOieKpPJuwepnnP%&_tY+)}6x{8gGje}zkaF}Ba`>5mK=w^5K$v+X zVWI)4!B5`u(S|yTQK1^r+<-rr`%^k-`0DBPkv9Bl#;FFWT1Fv9L13b0Ca#ylGeXE# z_mxD;iB=^TH@&h~==5ks5vg5%08X|pS5HCe-*J$OH*RA$$xRrOpD)av#X~~J=F1H% zU!5a^X5Qr_15jY$SI~os$lwKU@hRpsd>fac^1u=EQnq>J=1;aAou00y21l>j%r z@t3>Yfgr|IjJodXB_+U5@sw(SJ=7?$KU?1>Cp)*O9(#R}3QbM+nF5)Hd`vVrM2tUI z*MaEdPG=%HXOQHxUkiL#ciBGR{9pER*@%ti-gJ1!dwr6pl_S=l&WDFJ&5IUut};ukZ2*4U>=uQMh|ItqDB^fU0C9q%Sp#{F zlRZvWJ-P?m>B#(1iQ#7afC%)oZrg*1P$(rWNPSu3$%4S1s4}n7{)D912U%-m`N!=? zbe8G5p!9`$VI2$mSvPCUhVGkw0#8#1B_tWvb+t_7waYOcthUsG*kc8)1?SpJ?g%wv-6d0kOq@8%Ler-Lt&{2@Uz_ZYpVP!Y2FrJiPbA7tkBaEOTYKdVp>diVW^xJ~5yu zG#Gwj`c*;bI4hNK>&v);cx$+(9LBx%Bq1o%Q*Q`cBQWO^I*Hhm8KzgtdFt`IyHuYF z$h%&HdW-aMfLGtab_r6Opivrv#-Z#!Ar!p*zwhZpFVl<19(DkH{H6W+@P7XHthjOq z1797ODl!cuV4Au(k3iYI6luMCThDcqi#ond8-ytU=eew@$X7jwFE05vE2DJmzH>EQ zs*;_7OwVh2?c+3okC9C|h>f4`EX3OSUR_dBUOzMriO{528@v(FPC<}O`eDa|ZjDo2 z&NWdxf~RX~dudWI+XZ}$bufZ``k0fPEC_7l5?CfGiAqWJFG#Qqe~Vr1dxMOD9s6ml zj8_#X9M4_1FhRkDKG1dL-Sxea=8IqH7B2DjhQ5E6SPzZG99PpcAiP*d!znt+%6;+I zP|1x4<6R>Wb7Gv-rEsxXZ%#xg{xPTWiqz~v3s;7pN4Q7_b8&jgi%Q>d8+@UowDV=* z%bgj~YqZNwg72?5zk4gz4COGAK3xP^phJ#?o}>z-g=z<#$UvlFQHr#V+pgA7amtXkIj>jOrF3_GZYlQV8CB zyIl69lL#(_sLv7AdBG3CvKeg9?H6Ma3zK#bQ~n+g3tPo-8pOJ|QPzeh;MD*(CHgH= zJ-1n3I=9F-q3mhGS=`&6Sx@F^GvKtHZoewLxq4k=Rx~rLRDeAUTd@u1?JA;{Q??f} z4}aUPr2D(HRgQW}wu?;hwNc7(d;;9*8THMdfuD67oXS9wnyFB}H6}FZ4i#sg`*jko ziTA%G^vhKoqj|Hhu0la>8Q^zPWp#x z5D;_RaR*^}^g}@`1BT`E>t~CZ6tmbKtqDqCp2mk}qn1ivY(c3(q7s+7Ugi8}m%lxB z7hxHx4<2`IW1!jxM~3(|mG>e$115#M=`#0csY9BEgOgX^?IC<{9TlP-(! z6=9zWTwgL;4WvC1_@uZ%-RT9J8*BMZ+Y4((8qj`&S*3d}If77$!(eCJ5`X*%RG~DY z9hDH2qvBt6g?yNkfaqxe2E9uN32+yUk>(XPE|vk;kt8B(<}TF7AYkXpa=G_k-#Q^9d6TEu>Y*EO6ch+k1r%w6V>92vGEB*(6Of0l zNIgHXx34af3q3Qgy}$wfod?2>N36Ino~be!e0x4HeNTOX7<`oMLLViNRBN@TF-03A zFa_)@$82980lO5S{T@RxW_~mnOmAzHFAvZfoc1y_3?#+{5-l&{i1VN@7K=qwb zN6%XQcnu0sK~A`|ejQJ9*jPIrXOx5^CTjq_($NlLNoI2W{k)}C2$g*IkU4!YS4_H$ z9~!R6K6o9`Fc~6@i-hdGf3#(acZbvx`RZM=!yNoO1kb^KITvhF{{z%9=C6Su3oiS^ z2yRApltaBsyBjz{Bf)EOLy6Rkn_bQ4IVy=>lsE#*c!6X-WcamgE5leVaP>OlU5)0& z3Sr{!7n}_+2}(I3C${b;++~LwPtRlWXZ#Sweu%(1CYO&|dW=I)c`bI?heOHnX~sR* z96%S1+hE>^D3=-mSqYL9zms9VdUPWavalwjxG;dYk3M4mGh5{Y2D+-N?KoF|pO5(VZJHKWX@kMe&{%?{AErX>|X6k41 zPcm`MieI{`TfbDeV4t|xqeFB!4&(}zHHQ`kkT=D4aD3tccAZ63PPJ*``0($#PqFBB zzVBvZxLTYQLx~R`^)&u)SG#=O@5VoSF3k-*fhtcQ`jWn07ki3YnZ0gbNvn(^Kxm=> z*-zkq@;WPQ8h8uq+Q8VvFn}%O&KHWisI!kC_G+vsLJm3StICm~5iTD<7NL*whV2j+ z<3^Zuh$Gq|i?r`Bqx5Oja!meyQ8=oPObyuk%5o<7Rf;N~ej{`!J7;Gxgb{dydfo@` z^z{=vWNF+p5R>@{f%WD^U(NUA3=y{1I>j0V4K?_hSp>)l@MuXa3u(7HJu_cnb0c5* z@mag2t}tWhm>{nryy2Ee%NtjTP6bdx8RGNF8RV`j7`g8vKL!;d9r|94XgII4|HoJ_ zf)G@uNCi^H32~7{h_~J&2p}4Q>BpgD%Trwe(Pb}5-RDi_B)qGe?ttJrt-TqGjNMH! 
zq|&&B^g#>}yveyqhj{Sq-62K<@CAHB?Zdhobcv8Mzz>t!(=Kbqd(%UG4fV`$la4Cj z8N&&8_0F@7FR<^Vr`t~FeyOqoJa*TEMkUDjcax33KYy^-C|6u-lrdugKFL}&?15&* zYYv(R6u@mKM%aB~?R>xAC&%g124w%)NT3pzQtrs`d=Gsm?^tiD4kf;qki8 zZn_Z&&2!Wr@UMhm9Zc6Au@{>UBpkGDS#JEOVI0@WT+V9rD4rzJS8uQOaF}F2r1@wi zwau+I<$mM=jBU6LH?+07_MIokLpFNjaV&$3V;hAwuM(}|RP3bFWU<=*yxM2h2dsb5 zcYQaQABLTwu+b1gH)4~CM_PV6U4mXgb=22fl#k}{o*54B(^gVuTNG)(&Aj_)Xeh&A zNT6I91GXR2{e{j9@u0KDOt=C_-Lz|;t4!?B5G#g}fdJG;ki@93vCN?Lb8NI29&f8; z4#O+4aW=b7T1?vV)o2y@om<$C@w1TrC)PIjePOGZXq$J^7J@=%gYc=ZNqwUAOK?ew zS#dbr#4V&v)4-C}P)zKC*0oMp-~~sY2@=5zSE0E5?z8m04b+Ml(=FzX6LCB^u3s=- zpU2|4leFY0JfbbZ-!(j{@I}fasVby)|v7L&<;bnh^A+6Y0nwV!Bkq$XX~tIERSXbL7*rd zJOcgD<4nA*>x#|hbQO2KjxWF)#YD~w*e?q$4={7vj`v3oC)QY4kyqv!agaQ$ zUXt(r?vQR**1SbP>g+g5hHG&-!Z_;hdHw*mcf7QW`tYj!Jpv zRk_+e(?76&Z^W76SoX%OAG1xKwmbv+m(|!cdsCM5q{fHj)YvGidBbmtM8YSfgA)kx z&^#I+Xo^+x6ys+w4z6P}mtk_!U!nd`hfn=J0E5m5<|UEY^PIDY4Q_!!sO)b-)h`u6 zmGukpQEFEFCTy(5|1CY$WsaWEk6VDNp9FBltnZ`*hy+Sv%91B0@QMNhGr2-*8d^G4L2M>#s*yUp#fbs=R2O}t(!^<1xyNyb!W@x3G|Wn1k_p zRAi6{eHA}E&T#U&ItEi;S%>6OjI&_~&9i_Uq^^=(q%o$yywBkC3)v}dpXcm7J9m)2 zlo838GX29ZRHXRZ-fJvWp`4~O8&bRYSLjkoy(za{k#ykg^Sa$egzan{_26~<8V~uWL=G;sDPSCs_5^}A+6C5_A$SAEv>;~* zb4a}u?8Pc^9v%&-*c=}z!vg~2gWk)<#W4lo3solnt??(~SM|1HyPBEC05_RHThI3N zAf9$rli>Nf1^HgfZP#J(iIDiJ4qdxGm&W9WV#CP?vxcl2CcU-m!=}0eYW8zsgTqV3 z)00vB)#Z9gdE{(HO@zMt774QLI@?%|+aU>~O6J^^y_2`zG9i0#UKv|7 zTIyG{_cImDOW%W=oo)W_Y1A_OWijMwt1P=7)$HvuBoPGpq|gJ&X<@_LbLN;BKGa+vo$$!u3RzgOaRiAnkSipERAx3j28F`z6L#DD>5YHD z_#;QTg(c}VdF!kD7AOmBe?C~ekI3R@yrtkbv^?R7oUJ#wW?}Um+K7;!q+w(1gg`S2 z6BWR3X5w7BaV(2Lx>^TCtoh3*u7_NImdD?oUoxmg=YkdB`x(bTl*}&edc-SCvf(?7 z^?nj&Y}Y=Zsx*XV$x@=CUku+gr&5xnH!*Rei;NXjAt-Ms;zwErPi|5^vHe|?O(WhGV&XY&l<>iuM=r>dm0gp(XJN%cz zDWzgf%6c4wc){n64NXo8yVJoV&AnBn)koepnZMSV`8L#a5?4IiPYjv)9#2;AN1a3g zQ0+his(m5=H@(3m+Rfp{MDwz^Sk@D&RedaZDQC_~tlQqm(=#&MNYWfQoVX`z1Nm{v}@=7!oA z)dU$%%*X6-_XIRuPdX$2x%D(~^9DIjS^V#~qwPMSnOvPoOf>$XnfY%LR^|fB!`hC`V}@wLwMCJs+>+Hh)GbWgV5Rc2_2%O!7oo_?uXPT20m02SA=;Pu5`!Y zDxIz-T{Ft$eydc21egp(p8$~BU)iy;9<`siHZDdM+590-iT@!oUA#oL?WXiMJKARk zi}=r;Fc6uNC||9Mbup$*j%=!BQJUARr7IF8Qg^m{KJEU;0`o0OY*?}q@0Anf>UiOU zrO1w2RkDaQ98#8t{5w%>yTTb9vjp)M!Oa7$r7*{uIL{di<>?^T6nxVg} zq0N;k<`8oio8>W2^=BRD$D$(hQg}@g%Cy#Q#pufsE1xUE+BLH=OTr5BXR%Rg_s6ox8Q5LsE6xK9Q&57Vuj!=AV^7yH1c)Cj@l|g_OEFEUU3(S5+6CkD4Ub zy$lFzHDB#FUM2N%nl@H(b1`r52>0t~kI^QYs=lOnz2ndSZa&94XM&4`_)e#!CX4NE zp;d@VJxkW_B^zF~ymjRhMUi?VXT}fLLRRkJUGI<(qel>Zi#;J}>L8wuZ{p&ju0lWg* zgos1!CLSc~JiR;5pjP+L2b}*|?`xtxc#H$%|GYN508xsW&sB$dkPSMA02iP%k7+D) zOp-SG*YErb3^lXisU@>I%OzzA-YW?1W-Ck5A&2S$@{tzGU6Jp#mO7f7kXg?}Q{vX5 z-jlb8`x~%V+U<=b^-}SnS+jg>QkxBqQ=R41wFz(A@!}VZFXZCXvrHMf+%>Jwu`N?cn&?#UWwrIk63tvfp8$hG-xU9#CKX*I1hcf`P^!(rzS^A=l3R5iy!0w!vzc*AR{@=I3ANB{{HnuJ@qdhxQno(b$Urtefqsv z-cgIz(fL{0DJ8XhU@xcR0p66>NVTg(cPSm3dvD)Z{p~Hu>;b&({?C4Ogj=6%FAYk5 z^laLTa(eI`7Ic^cF{`NT2nZoDd=F0Z_n6kGKn>2)G=z!c4Q{0!X=;wmRvR0>x*y>& z>bWjlo>JF~djkFO6e9&;pXJwF0zd59%)NTF30(n~U2U4znK9LuJaCyt-FvxgLpVj+cz zFqHh$5(S*gOHv0*^n{^2-9ePzDGGlg6KMQIj1IVw^^N6P=8)5>QI$!{=BLS$Wd4w* zdf|s_o-8hnWe>F8L_SD;ihdiFQ_R_pJv>7xB);b+>=?=K`FJ7h9kNSB)cd5Iv1Fh_ z;+f9N#G3rg%O8%JL4nls@`<195|?xfLq3#$9?8reY-3>!skx=Y>FIw{HuIMldOYUD z={Stnp)u&%#_Zbfdhbkk&Z&HA`Esvfp{-(;Zyw`kUz_Nq&?j7z)N}>M*DD-ssvLT4 zxm+h|QT*%#R&UPf$wml2b$|Rd9$8xOn7#2YRvQdDmLe;Vw07uSiy0eS+Ey_&%ocfv zf1}YXq#FNbiXlkZRUsV}>tTXz`T5<|k7a7uI8S@QXCBQpHjL@yfpI^su^(jB=34sP z$bOxzwHd$bxfwhnew*8PLF0$}pEiA}$atOAiyjW`0q7_*h0wvZ@uUVm%k6N0aQzZf zJK7{h>8R`~H(~PYj%ONiesT1YNG*gNEV1g&DO6F{zTj|tqGpM98|BnmX=<~loLt?{ zt?fX^vP7mje0VVR)WG7q&J2a_{$NRk#m6XX$)d3zR`$0n&X+J~^{Cg@j&zr{?9aw= 
z-H*dsrp$%j8VTklp87=j?^VS{MY*DPtX9~s&4rjxwYRH&NcGY2jaa<_nid18Ql;Oy z)PY6y0xT$9kg+ELq13$%AWmZi3fDg|E(}sHJ0`hJanKj*T;&%mTvcI&FDP`0EpPG1 zx9h8&PsgvQ7QZq{jFfJUWTz|0E~+}q$4Lg$?h82{-ouTJngzy+!LCnZ1PPv^hr;I(OQK3)8bQ8%WEsir>jJI1rA2)^7FimH%4qC+<2_Ux`k}H-0o@t zQ_Ja;M-8Q}MuQamE&IeNL6I~qI;}sGXL?I_~n@}&hZ|H)vWfCsSXFHSyuk`>{bcish`Q3cb8RY=C>Rp3v*?KLy`ijf=lUa zLrF4@fp13P>E(VGK)FVz>z~`5zC9C|Sgy*xC1<(3<*9#okHw`#VhpBtYA0XUyp<@8 zW1)9w@jx7}{LU)ks-7KUTT5M-=RRiUJhr`uiVh(NAB*g_2jdnG#icMu)`x*{n}+9g z=WuURM&5)^y(Zk}{$x!)j;3q`!pels_OY*o)5@#>gR*}vE*>32;i&TeVk||MF zlZA(9{h-%P+>@3p;`SWs2BCQkCLp@UhjQN8YI(SLX%aCpoy<**nAWwoKNv@!yk$g_ zxgggYx7wbaqk-Q}%CUWQoGmb^T|Oj6PPsQJ@bd!)Yqf%LY#?whE#^XkZT$$0QYAb1 zlnsm(G2*Vu7d>va#LOS*U;nyu!N6Y^HS)wOWG^^#I(g-j#G6Gs?Rv?E1h({4I-PpV z#P7s(o|K`K*KefEOjm^LyPLmnaMA8BksNwuiYJ>@U#=6z-fd>s zIvzIH?Yh_f!jR~L$Iy4#3<(*0@zVU+-(|9&(tU|Cs3-k>sdFEdzAmor@v?q6q42Rc zdCflHBCwa12x9RmT~^Y$ZGL`Mwa#A$T?6&bI}xKp?z7N&$0@WG_0FDk#UxdB1Z|JW z85Lv59vP|cNPBj@E{9uv0h{{w(M+12o9LN+OumfYP4Nw%S!|VT9@NjY-}=P+rJghY zMe!Z}QSs4;ax&K`N!OYxd0exj#?G)U$A^}+6UG87W=nnYZl00*YpNqsnGC*h(dvk0 zp$qfkNH;*l6|VN zIhmiLb0U7+|KQ9>AmwO%`eU`=KtT-U0*oJ%+cwJ8bcSqu&%63PK zRe9S2gHN@H9^fL(GCMC}-qxxsm7FjewZzH@C!G3aP$PXgn*GUUZn6tqU2UJNdfGJ< zKh|McY>L?dFy5aysxxyJfz%W7Hm3ZH#REU;h3YF!_8f0j)}W{SF8Sh?2sMuscdN+B zeM@-EUQDpqxmkzMU1>3g1q`-gQS?s!yc|Q5J|QnS51SXR^P99~z71EV3kVJOQI=v* ziuixn`|^LNyZ8SRrO~1yl`Pd=S+cYsWJytBY-8+eCE2r$>v=7w`K9ytb9$PV zxTi-dM~KRuH-}YAJdfJWG__RBCVF=2+QPu)v5&Uk;8v{oy(Moi&3?JI z;Z0hwXqWb_3~sK zOMJ~!Mfb<`wmFQk>|2I3c6YRSp|1l};wEY;q1g?Vp3S}W&y!?snIiWyLxG9SO`vY&9I;+}`1li5WYv4UQ5xuTtK zJ+yey>CK9}Uz8VgX4T{yDinvU5-7*Y`fHU858s@oWxJ$T>!SBhy%lxts<1CxsE82b zUy?4GfLCcwO`gkMS|5f}(zf?>R0rfgGxw(W88|w zchT}mk+PhlhIH^k;xe8`a;^0st^7-~7P_>x$hUD~*Tz(fL%i2iXSEh`kZPga)yYwW z@b%xQCZJRL_e|Ti%k{LaC<;#BDAyR$~h$(MYnzfwrWE7Oo%}dYj z!qZLRE?9`Zl*ypn#U5ELnEjQI2Z@IHV|x%&xmAN62qYEN&lNwjM~}8%PoPTSgUxwo z>$cS-{1#Cv4Kmlz^08;jBe;-y;b(U* zY~DxgLz-JGJ{DtP7vvNr%lyY+y0>-Fx#=aGlVifC^*-}-`y<%^r z;3aomrCbeeh3|P&TQAg<8+Y@*nxtHvEqH~g#oEQ=6st1qyu)6Q&_pucxW3po;NZHP zthe53cft{|NLA)bv7D3;m6z^lcdu@wk3uoJsFWB7j&g^Bg$RpQoe5Gl3Kc4Bsl1kk6 zqK4VLoyx!77SKEIJDyZVH{_vY+Ciq8Zt-|+H)jt{o6O!Q+UiDsgJ zAtU@-mi6(N68m%+cqHE>qo7QL_P`Tszf`f1*&hG>TL;IOzMTP;<9p%lC0mO}OUWxf zY>y#X0h;SXXn{WkC!#O8Jil(~RB0$yT*Gl3ZMhM?L3V!EH$1a?l=AHH_Q(E$0OMl! 
zgm@Nt|0hb(^Rf7B*AM0+-hQ)e-tXX)$z5yX3fNv{7blc)zo%#77+k~1wEa{bl?LCc zsFLaqiM`a>iCQV&p3#Q4J=A1WYgLsee2=(p)rAf$uR|iwv+X=j-Z)Y!|0@#5>`nS%s=)VVxjVS8h_x=i2+T7&={QKM%7yIbr#WWwe$BLxlAesB6c`?3VjyQ;+_LSpk%`snl+_$Tw%?*%=+X^8#|B8HHuj%bv|MA(6u^lEs zDL%Y&8rL)al0C>g4yG(a>yAfE?Nds51EfP6UKRj}onv7|{9@L1kU(3Y6kcrY3#{vyT zub}vaE)bvX4G1w;a=Famlr$LXy({*wp0}2;kNatg%cK{?cu8uxD$4dTlKL9A><#xh ztGA2LTdug>i|Z5}%08Z&tgg{}|0tybHHV_&H)oqm!9m~N*bucZ^R(4BPKOW{p~Po+ z{s_z=+!~Ua)~|MO{^e_xJJ}Jjpp{|#t0|{W`-%!XtjUHp@g6<^s>XX)^Py?CY5MqW z9{&FC+4AWZo3U%$h*|HQiR+Rr9Qsm>6-^gShTS!j747M5MlXlDRHCfXi7KvbdT?3}nF{lQhU6lVT>QMH(u|0Rz zdu;pPe=#55cI*+lj(kT7UHill8RpIXxJT9r$rB*8Oe%s|rE_X-0pj;zH$0j10HOgspGiX(E=^av_f> zUb3>~f_1R6N{@!tpyNsDL=WmBRGupGXo38SV;{{U3&W6J#LS9FY6?*7ps3u6u*w=V z4KJM3WR<&ZUcR&x@9VIT=F;P#0t-VgV??9oL=FerP0A$B&EDchU|;FO>XS(0Ef@aU zHzwE3S0};Nz4wPdxO9^rL7q+*K!-Y8*K^jozrc_?ScZ-Y=5Z8XCEwVA%-j2P`bIYC z7O}_E9vh8XsT&}k$Gh8kyPOz%C?V<7X-d{|c;7l{)*9m3?!OV<`}z8i_I4D4Ap9Oe zD-@4M=yY+BZObz=63eH`i~1KktEFg{AT(*ld?@Y;k#0X(C!B4ao|%KaO}q#>f=8G) z%F5HcUG|u=A>#W>J8^!d6Kt@16%mro^VnUU)ck%AbLf!yIDE~H7#`^B508%3wUVjF zVNAcK#b;ADLMw(tqOq4wj%N?Y24JzIYNCnf`SJwBet9@0Gc&NDBH#fR#jY-_u`T0% z#8&Yz{g#F0+sBL(i&geY+e3PJD-p$+&!pRu&9A~?DqyAWUpKjWlL)PCih=Au<;i}>!y?%S`JD-bf-&sQDBie^8|}n#TX)48 zHxnRY`ar}&cl{Kx1fbYbu6s(9$hV=^r;7jX_xKXo@1Y6L&0ZbQ*9LrwD9WS;R;B?tzEHzV!Z57u7j4*jjRF#+`eI}|MEOV|4 zJDWASjle9)kGMQefmDh0>%&8y-|2sLqz6iJ>G}NLEBK>04K^`e+*py96X`X; zQz2^wR>_gjMe$ohTWuVZ9=QWP)>GsCT1AkHyHIr97;lC0xkugFf!b*J+2LOEt^Sf5 z4*NuC4m9Z5wuk{mi9IDbA4>_=Om{$v-v zKLme1jl9j!!A<2JsSwndCfO9WOuV)bk|N{2nHHVt7V%DbgZ;c)OnVVw4Q)2CJY`^k z5Qh&M_Guj9{*WomwYV*xJH9NRyRn_+yWh9fG*+7z&UF)ZwX_|zj0)J(OIv?mnM>2H zgDZ(Fe#vc+U|T!e;(bVY2Xf-*^UikZj_ea=vuoWa6OP05SIhc- z>BdZ!7Tnx3QCe4L357r@kb8AK35I8RM4>2+eRCrFwNXFrO+7d;R4xNQy0sEsk!gKD zkBmSe9uwW*o^j|1AZoB!cc!Ryh;n8FZDzQ+?c3HqW^_!Iz7sidSTHX<=g0eEcERXf z$cKOhBE{hqIvGDY>YWMXVdlf19y^$#6$e$7!XTo-SVKy@zgp5Xzy z5*|%vem!59`SrWjS5~`Ea-@D%5acum-0hIESUkFu)lINEwCALi>%*IS!Hslt|DMF_ zK)s$;d(tk=Cd0MqD1Oy9=Y_I0w_u)V)|qzyGdk?7t2(CayJgi}URpx#dRUc;G41^* z7QdpVGgkZ-fc|TPKlH#+iP?cP&!Id%#KL|vc!z*}s`ZaG^2Q(zh1T!RHr{+$bRZUNHc zJELMxKT*yB8{$^q5Do}90m$OrpL$c^^5MZ)mLv{9)DGke)>TK?qjokW6h~W<=ouX& zHg+0h`m_50I~U7N#m=Ka*CRVjM`DAH@fkUD08#v$3TT>Vj|+`mD3Cq)xSR{{=d7|- ztf;ibQ?N@+vK}5gyB_y)M-|tcnIIsF{r+Fowx}Rq@zrj#N6Ti%k+mYqR)$8mk~Y+m zIoan@ia7VVz34w(#r0{T+Rc<5T z4aicf6Y6LLq^CU@X8}Yh3!d2tOsqcln)b1vg7Ic{E(0*WR&Kseh9v|*iTDt!=zEJt z*U|*l_Pp&?0lPOZ1ZcX(GBz~2>Dac5Ujo>|W_vj)qhS#|K#H#bvZPO_{jK=muWAXl zYdnCc9W2N9&UIr3lQ;sTc@h*1)spidoD^=-%kj!XLaZ!)8Dgx#CsT_*OR!y(%j-MJ z+`U_XrjbLE2PJ@RE0jQ%2eL179*P^4dxSAb3O~2oCGwCa$a~EdV~_f`5Fc`NCyyc{h});VDl4YpPel7Nqd>p?qsbMRkq@>n61Vg zqHzC@T>^`_#{AHs=vd~F2OD$)&~P#+WbP3Qf%C4L2rDN$bAN=+9bifb=WPkVzso-l z;uGb^;8@g7^F5YgA^cAP`?=FBgljDGZog=76H2ajp4i08`f-Pbs#`rwe8A^AXK zo#OX%<^w-`4hR49kAM4n62QFD9Y{XyJwJEyKg#Og#e%sV0q;Ma{l0^_e4ww`X^+MK zU>Epp^N?U(i-2KU7$ELbIhf?j!HY7_|D=@dFL08S=gfdM=X{O;t9>YNeLd>W5ibGx zJeg;1R52%E2Q1@Y>;Mhpv<=?h+>6uC)SJ+5JAO!6@B<5YOgk8!saLNaE6(?A!R^vN zo1)BQ`L#Vi*;jh_vxAk_PzX;CXjEncUQy2jZ%PZ0LP2?0LsH;>C=U5ZO1~RqL z-AY?O4tn*0=c;W$)s$O28I-DZL@I4PUcskn#G2p?vikx9=bO$igQ~0EsV-bL=m|wC zu+??|P5E-@vN96upwx+38dh{f>uOD;>{y(pw(DT^POmTS?FzOa>D}2`JZ3ic-DTTI zX8BPOs1wnj-YlOGPEWWC8l&OCE6bzfHuFn;@?P{pm&&yUEhUDBC>*O_X@N`D+5%6J zo0PhZ_AT&gPJ>dJY;yt$RJ0xbGGwg;3VEbR8MmFBo?tjrdt zM#_oc(V%v~BSw^2qz779)-dj7=_G?bkMuY|fA?TbFtLve*6a*=8=*Ghf!`^_>?PMn z2AyWCrb*fE+u4JYn+xsjJh6-C;az>87o!8SP}m$anR{ok@bVVFYon0S8>hddL%1L& z^PTtkNO! 

@@ -16,6 +17,7 @@ Easy, fast, and cheap LLM serving for everyone --- *Latest News* 🔥 + - [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing). - [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/). - [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing). @@ -46,6 +48,7 @@ Easy, fast, and cheap LLM serving for everyone --- + ## About vLLM is a fast and easy-to-use library for LLM inference and serving. @@ -75,6 +78,7 @@ vLLM is flexible and easy to use with: - Multi-LoRA support vLLM seamlessly supports most popular open-source models on HuggingFace, including: + - Transformer-like LLMs (e.g., Llama) - Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3) - Embedding Models (e.g., E5-Mistral) @@ -91,6 +95,7 @@ pip install vllm ``` Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more. + - [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html) - [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html) - [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) @@ -107,6 +112,7 @@ vLLM is a community project. Our compute resources for development and testing a Cash Donations: + - a16z - Dropbox - Sequoia Capital @@ -114,6 +120,7 @@ Cash Donations: - ZhenFund Compute Resources: + - AMD - Anyscale - AWS diff --git a/RELEASE.md b/RELEASE.md index 9352e7ef706..db0d51afc7b 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -60,9 +60,10 @@ Please note: **No feature work allowed for cherry picks**. All PRs that are cons Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI. 
**Current Coverage:** + * Models: Llama3, Llama4, and Mixtral * Hardware: NVIDIA H100 and AMD MI300x -* *Note: Coverage may change based on new model releases and hardware availability* +* _Note: Coverage may change based on new model releases and hardware availability_ **Performance Validation Process:** @@ -71,11 +72,13 @@ Request write access to the [pytorch/pytorch-integration-testing](https://github **Step 2: Review Benchmark Setup** Familiarize yourself with the benchmark configurations: + * [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda) * [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm) **Step 3: Run the Benchmark** Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure: + * **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`) * **vLLM commit**: Set to the RC commit hash diff --git a/benchmarks/README.md b/benchmarks/README.md index 3b10963c3e0..644517235b1 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -4,7 +4,7 @@ This README guides you through running benchmark tests with the extensive datasets supported on vLLM. It’s a living document, updated as new features and datasets become available. -**Dataset Overview** +## Dataset Overview @@ -81,9 +81,10 @@ become available. **Note**: HuggingFace dataset's `dataset-name` should be set to `hf` ---- +## 🚀 Example - Online Benchmark +
-<summary>🚀 Example - Online Benchmark</summary>
+<summary>Show more</summary>
@@ -109,7 +110,7 @@ vllm bench serve \ If successful, you will see the following output -``` +```text ============ Serving Benchmark Result ============ Successful requests: 10 Benchmark duration (s): 5.78 @@ -133,11 +134,11 @@ P99 ITL (ms): 8.39 ================================================== ``` -**Custom Dataset** +### Custom Dataset If the dataset you want to benchmark is not supported yet in vLLM, even then you can benchmark on it using `CustomDataset`. Your data needs to be in `.jsonl` format and needs to have "prompt" field per entry, e.g., data.jsonl -``` +```json {"prompt": "What is the capital of India?"} {"prompt": "What is the capital of Iran?"} {"prompt": "What is the capital of China?"} @@ -166,7 +167,7 @@ vllm bench serve --port 9001 --save-result --save-detailed \ You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`. -**VisionArena Benchmark for Vision Language Models** +### VisionArena Benchmark for Vision Language Models ```bash # need a model with vision capability here @@ -184,7 +185,7 @@ vllm bench serve \ --num-prompts 1000 ``` -**InstructCoder Benchmark with Speculative Decoding** +### InstructCoder Benchmark with Speculative Decoding ``` bash VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \ @@ -201,13 +202,13 @@ vllm bench serve \ --num-prompts 2048 ``` -**Other HuggingFaceDataset Examples** +### Other HuggingFaceDataset Examples ```bash vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests ``` -**`lmms-lab/LLaVA-OneVision-Data`** +`lmms-lab/LLaVA-OneVision-Data`: ```bash vllm bench serve \ @@ -221,7 +222,7 @@ vllm bench serve \ --num-prompts 10 ``` -**`Aeala/ShareGPT_Vicuna_unfiltered`** +`Aeala/ShareGPT_Vicuna_unfiltered`: ```bash vllm bench serve \ @@ -234,7 +235,7 @@ vllm bench serve \ --num-prompts 10 ``` -**`AI-MO/aimo-validation-aime`** +`AI-MO/aimo-validation-aime`: ``` bash vllm bench serve \ @@ -245,7 +246,7 @@ vllm bench serve \ --seed 42 ``` -**`philschmid/mt-bench`** +`philschmid/mt-bench`: ``` bash vllm bench serve \ @@ -255,7 +256,7 @@ vllm bench serve \ --num-prompts 80 ``` -**Running With Sampling Parameters** +### Running With Sampling Parameters When using OpenAI-compatible backends such as `vllm`, optional sampling parameters can be specified. Example client command: @@ -273,25 +274,29 @@ vllm bench serve \ --num-prompts 10 ``` -**Running With Ramp-Up Request Rate** +### Running With Ramp-Up Request Rate The benchmark tool also supports ramping up the request rate over the duration of the benchmark run. This can be useful for stress testing the server or finding the maximum throughput that it can handle, given some latency budget. Two ramp-up strategies are supported: + - `linear`: Increases the request rate linearly from a start value to an end value. - `exponential`: Increases the request rate exponentially. The following arguments can be used to control the ramp-up: + - `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`). - `--ramp-up-start-rps`: The request rate at the beginning of the benchmark. - `--ramp-up-end-rps`: The request rate at the end of the benchmark.
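The ramp-up flags above only define the start and end request rates; the shape of the schedule comes from the chosen strategy. The sketch below is purely illustrative (it is not vLLM's internal implementation) and assumes linear interpolation for `linear` and a constant growth factor per unit of elapsed time for `exponential` between `--ramp-up-start-rps` and `--ramp-up-end-rps`:

```python
# Illustrative sketch only (not vLLM's implementation): how the two
# ramp-up strategies could map elapsed benchmark time to a target
# request rate between --ramp-up-start-rps and --ramp-up-end-rps.


def target_rps(elapsed_s: float, total_s: float, start_rps: float,
               end_rps: float, strategy: str = "linear") -> float:
    # Fraction of the benchmark that has elapsed, clamped to [0, 1].
    frac = min(max(elapsed_s / total_s, 0.0), 1.0)
    if strategy == "linear":
        return start_rps + (end_rps - start_rps) * frac
    if strategy == "exponential":
        # Geometric interpolation: the rate grows by a constant factor
        # per unit of elapsed time.
        return start_rps * (end_rps / start_rps) ** frac
    raise ValueError(f"unknown ramp-up strategy: {strategy}")


if __name__ == "__main__":
    for t in (0, 30, 60, 90, 120):
        lin = target_rps(t, 120, start_rps=1, end_rps=16, strategy="linear")
        exp = target_rps(t, 120, start_rps=1, end_rps=16, strategy="exponential")
        print(f"t={t:>3}s  linear={lin:5.2f} rps  exponential={exp:5.2f} rps")
```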
+## 📈 Example - Offline Throughput Benchmark +
-<summary>📈 Example - Offline Throughput Benchmark</summary>
+<summary>Show more</summary>
@@ -305,15 +310,15 @@ vllm bench throughput \ If successful, you will see the following output -``` +```text Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s Total num prompt tokens: 5014 Total num output tokens: 1500 ``` -**VisionArena Benchmark for Vision Language Models** +### VisionArena Benchmark for Vision Language Models -``` bash +```bash vllm bench throughput \ --model Qwen/Qwen2-VL-7B-Instruct \ --backend vllm-chat \ @@ -325,13 +330,13 @@ vllm bench throughput \ The `num prompt tokens` now includes image token counts -``` +```text Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s Total num prompt tokens: 14527 Total num output tokens: 1280 ``` -**InstructCoder Benchmark with Speculative Decoding** +### InstructCoder Benchmark with Speculative Decoding ``` bash VLLM_WORKER_MULTIPROC_METHOD=spawn \ @@ -349,15 +354,15 @@ vllm bench throughput \ "prompt_lookup_min": 2}' ``` -``` +```text Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s Total num prompt tokens: 261136 Total num output tokens: 204800 ``` -**Other HuggingFaceDataset Examples** +### Other HuggingFaceDataset Examples -**`lmms-lab/LLaVA-OneVision-Data`** +`lmms-lab/LLaVA-OneVision-Data`: ```bash vllm bench throughput \ @@ -370,7 +375,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**`Aeala/ShareGPT_Vicuna_unfiltered`** +`Aeala/ShareGPT_Vicuna_unfiltered`: ```bash vllm bench throughput \ @@ -382,7 +387,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**`AI-MO/aimo-validation-aime`** +`AI-MO/aimo-validation-aime`: ```bash vllm bench throughput \ @@ -394,7 +399,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**Benchmark with LoRA Adapters** +Benchmark with LoRA adapters: ``` bash # download dataset @@ -413,20 +418,22 @@ vllm bench throughput \
+## 🛠️ Example - Structured Output Benchmark +
-<summary>🛠️ Example - Structured Output Benchmark</summary>
+<summary>Show more</summary>
Benchmark the performance of structured output generation (JSON, grammar, regex). -**Server Setup** +### Server Setup ```bash vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests ``` -**JSON Schema Benchmark** +### JSON Schema Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -438,7 +445,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Grammar-based Generation Benchmark** +### Grammar-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -450,7 +457,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Regex-based Generation Benchmark** +### Regex-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -461,7 +468,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Choice-based Generation Benchmark** +### Choice-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -472,7 +479,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**XGrammar Benchmark Dataset** +### XGrammar Benchmark Dataset ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -485,14 +492,16 @@ python3 benchmarks/benchmark_serving_structured_output.py \
+## 📚 Example - Long Document QA Benchmark +
-<summary>📚 Example - Long Document QA Benchmark</summary>
+<summary>Show more</summary>
Benchmark the performance of long document question-answering with prefix caching. -**Basic Long Document QA Test** +### Basic Long Document QA Test ```bash python3 benchmarks/benchmark_long_document_qa_throughput.py \ @@ -504,7 +513,7 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \ --repeat-count 5 ``` -**Different Repeat Modes** +### Different Repeat Modes ```bash # Random mode (default) - shuffle prompts randomly @@ -537,14 +546,16 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
+## 🗂️ Example - Prefix Caching Benchmark +
-<summary>🗂️ Example - Prefix Caching Benchmark</summary>
+<summary>Show more</summary>
Benchmark the efficiency of automatic prefix caching. -**Fixed Prompt with Prefix Caching** +### Fixed Prompt with Prefix Caching ```bash python3 benchmarks/benchmark_prefix_caching.py \ @@ -555,7 +566,7 @@ python3 benchmarks/benchmark_prefix_caching.py \ --input-length-range 128:256 ``` -**ShareGPT Dataset with Prefix Caching** +### ShareGPT Dataset with Prefix Caching ```bash # download dataset @@ -572,14 +583,16 @@ python3 benchmarks/benchmark_prefix_caching.py \
+## ⚡ Example - Request Prioritization Benchmark +
-<summary>⚡ Example - Request Prioritization Benchmark</summary>
+<summary>Show more</summary>
Benchmark the performance of request prioritization in vLLM. -**Basic Prioritization Test** +### Basic Prioritization Test ```bash python3 benchmarks/benchmark_prioritization.py \ @@ -590,7 +603,7 @@ python3 benchmarks/benchmark_prioritization.py \ --scheduling-policy priority ``` -**Multiple Sequences per Prompt** +### Multiple Sequences per Prompt ```bash python3 benchmarks/benchmark_prioritization.py \ diff --git a/benchmarks/auto_tune/README.md b/benchmarks/auto_tune/README.md index c479ff1aa29..9aad51df6e0 100644 --- a/benchmarks/auto_tune/README.md +++ b/benchmarks/auto_tune/README.md @@ -3,6 +3,7 @@ This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate. ## Table of Contents + - [Prerequisites](#prerequisites) - [Configuration](#configuration) - [How to Run](#how-to-run) @@ -52,7 +53,7 @@ You must set the following variables at the top of the script before execution. 1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section. 2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost. -``` +```bash cd bash auto_tune.sh ``` @@ -64,6 +65,7 @@ bash auto_tune.sh Here are a few examples of how to configure the script for different goals: ### 1. Maximize Throughput (No Latency Constraint) + - **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens. - **Configuration**: @@ -76,6 +78,7 @@ MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number ``` #### 2. Maximize Throughput with a Latency Requirement + - **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms. - **Configuration**: @@ -88,6 +91,7 @@ MAX_LATENCY_ALLOWED_MS=500 ``` #### 3. Maximize Throughput with Prefix Caching and Latency Requirements + - **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms. - **Configuration**: @@ -109,7 +113,7 @@ After the script finishes, you will find the results in a new, timestamped direc - **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found. -``` +```text # Example result.txt content hash:a1b2c3d4... max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8 diff --git a/benchmarks/kernels/deepgemm/README.md b/benchmarks/kernels/deepgemm/README.md index 917e814010f..41e68e047be 100644 --- a/benchmarks/kernels/deepgemm/README.md +++ b/benchmarks/kernels/deepgemm/README.md @@ -8,7 +8,7 @@ Currently this just includes dense GEMMs and only works on Hopper GPUs. You need to install vLLM in your usual fashion, then install DeepGEMM from source in its own directory: -``` +```bash git clone --recursive https://github.com/deepseek-ai/DeepGEMM cd DeepGEMM python setup.py install @@ -17,7 +17,7 @@ uv pip install -e . ## Usage -``` +```console python benchmark_fp8_block_dense_gemm.py INFO 02-26 21:55:13 [__init__.py:207] Automatically detected platform cuda. 
===== STARTING FP8 GEMM BENCHMARK ===== diff --git a/csrc/quantization/cutlass_w8a8/Epilogues.md b/csrc/quantization/cutlass_w8a8/Epilogues.md index a30e1fdf3ac..15a66913e97 100644 --- a/csrc/quantization/cutlass_w8a8/Epilogues.md +++ b/csrc/quantization/cutlass_w8a8/Epilogues.md @@ -86,6 +86,7 @@ D = s_a s_b \widehat A \widehat B ``` Epilogue parameters: + - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). @@ -135,7 +136,7 @@ That is precomputed and stored in `azp_with_adj` as a row-vector. Epilogue parameters: - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - - Generally this will be per-tensor as the zero-points are per-tensor. + - Generally this will be per-tensor as the zero-points are per-tensor. - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `azp_with_adj` is the precomputed zero-point term ($` z_a J_a \widehat B `$), is per-channel (row-vector). - `bias` is the bias, is always per-channel (row-vector). @@ -152,7 +153,7 @@ That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product Epilogue parameters: - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - - Generally this will be per-token as the zero-points are per-token. + - Generally this will be per-token as the zero-points are per-token. - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `azp_adj` is the precomputed zero-point adjustment term ($` \mathbf 1 \widehat B `$), is per-channel (row-vector). - `azp` is the zero-point (`z_a`), is per-token (column-vector). diff --git a/docs/cli/README.md b/docs/cli/README.md index dfb6051a8c8..b1371c82a4c 100644 --- a/docs/cli/README.md +++ b/docs/cli/README.md @@ -6,13 +6,13 @@ toc_depth: 4 The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: -``` +```bash vllm --help ``` Available Commands: -``` +```bash vllm {chat,complete,serve,bench,collect-env,run-batch} ``` diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 005b7f78f44..0ff0cdda380 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -40,6 +40,7 @@ Although the first compilation can take some time, for all subsequent server lau Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (like when using autoscaling). #### Reducing compilation time + This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batch-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. ### Optimize based on your data @@ -71,12 +72,15 @@ The fewer tokens we pad, the less unnecessary computation TPU does, the better p However, you need to be careful to choose the padding gap. If the gap is too small, it means the number of buckets is large, leading to increased warmup (precompile) time and higher memory to store the compiled graph. Too many compilaed graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement compared to the default exponential padding. 
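To make the padding trade-off concrete, here is a small, self-contained sketch comparing a fixed padding gap against power-of-two style bucketing. The bucket boundaries and token counts below are assumptions chosen for illustration, not the exact buckets vLLM compiles on TPU; the point is only that a smaller gap wastes fewer padded tokens while creating more distinct buckets to precompile.

```python
# Illustration only: the bucket boundaries below are assumed for the
# example and are NOT the exact buckets vLLM compiles on TPU. The goal
# is to show that a smaller padding gap wastes fewer padded tokens but
# produces more distinct buckets (and therefore more graphs to compile).


def pad_to_power_of_two(n: int) -> int:
    """Round a token count up to the next power of two."""
    bucket = 1
    while bucket < n:
        bucket *= 2
    return bucket


def pad_to_fixed_gap(n: int, gap: int) -> int:
    """Round a token count up to the next multiple of `gap`."""
    return ((n + gap - 1) // gap) * gap


if __name__ == "__main__":
    token_counts = [130, 700, 1500, 3100, 6000]  # hypothetical batch token counts

    for gap in (128, 512):
        padded = [pad_to_fixed_gap(n, gap) for n in token_counts]
        waste = sum(p - n for p, n in zip(padded, token_counts))
        print(f"fixed gap {gap:>4}: buckets={sorted(set(padded))}, padded tokens wasted={waste}")

    padded = [pad_to_power_of_two(n) for n in token_counts]
    waste = sum(p - n for p, n in zip(padded, token_counts))
    print(f"power of two : buckets={sorted(set(padded))}, padded tokens wasted={waste}")
```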
-**If possible, use the precision that matches the chip’s hardware acceleration** +#### Quantization + +If possible, use the precision that matches the chip’s hardware acceleration: - v5e has int4/int8 hardware acceleration in the MXU - v6e has int4/int8 hardware acceleration in the MXU -Supported quantized formats and features in vLLM on TPU [Jul '25] +Supported quantized formats and features in vLLM on TPU [Jul '25]: + - INT8 W8A8 - INT8 W8A16 - FP8 KV cache @@ -84,11 +88,13 @@ Supported quantized formats and features in vLLM on TPU [Jul '25] - [WIP] AWQ - [WIP] FP4 W4A8 -**Don't set TP to be less than the number of chips on a single-host deployment** +#### Parallelization + +Don't set TP to be less than the number of chips on a single-host deployment. Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types). -### Tune your workloads! +### Tune your workloads Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case. @@ -99,6 +105,7 @@ Although we try to have great default configs, we strongly recommend you check o The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance. #### SPMD + More details to come. **Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.** diff --git a/docs/contributing/ci/failures.md b/docs/contributing/ci/failures.md index 573efb3b05f..d7e2dfbca87 100644 --- a/docs/contributing/ci/failures.md +++ b/docs/contributing/ci/failures.md @@ -20,19 +20,19 @@ the failure? - **Use this title format:** - ``` + ```text [CI Failure]: failing-test-job - regex/matching/failing:test ``` - **For the environment field:** - ``` - Still failing on main as of commit abcdef123 + ```text + Still failing on main as of commit abcdef123 ``` - **In the description, include failing tests:** - ``` + ```text FAILED failing/test.py:failing_test1 - Failure description FAILED failing/test.py:failing_test2 - Failure description https://github.com/orgs/vllm-project/projects/20 diff --git a/docs/contributing/ci/update_pytorch_version.md b/docs/contributing/ci/update_pytorch_version.md index 699d0531ac7..3a6026d450a 100644 --- a/docs/contributing/ci/update_pytorch_version.md +++ b/docs/contributing/ci/update_pytorch_version.md @@ -106,6 +106,7 @@ releases (which would take too much time), they can be built from source to unblock the update process. ### FlashInfer + Here is how to build and install it from source with `torch2.7.0+cu128` in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271): ```bash @@ -121,6 +122,7 @@ public location for immediate installation, such as [this FlashInfer wheel link] team if you want to get the package published there. 
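Once the wheel is installed, a quick import check can confirm that the build is usable in the current environment. This is just a convenience sketch; the `flashinfer` module name and the optional `__version__` attribute are assumptions about the installed package rather than anything mandated by vLLM.

```python
# Optional post-install check (sketch): confirm the environment still has
# the expected torch / CUDA versions and that the freshly built FlashInfer
# wheel imports cleanly. The `flashinfer` module name and `__version__`
# attribute are assumptions about the installed package.
import torch

print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)

try:
    import flashinfer
    print("flashinfer:", getattr(flashinfer, "__version__", "installed"))
except ImportError as exc:
    print("flashinfer not importable:", exc)
```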
### xFormers + Similar to FlashInfer, here is how to build and install xFormers from source: ```bash @@ -138,7 +140,7 @@ uv pip install --system \ ### causal-conv1d -``` +```bash uv pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8' ``` diff --git a/docs/contributing/deprecation_policy.md b/docs/contributing/deprecation_policy.md index ff69cbae08b..904ef4ca058 100644 --- a/docs/contributing/deprecation_policy.md +++ b/docs/contributing/deprecation_policy.md @@ -31,7 +31,7 @@ Features that fall under this policy include (at a minimum) the following: The deprecation process consists of several clearly defined stages that span multiple Y releases: -**1. Deprecated (Still On By Default)** +### 1. Deprecated (Still On By Default) - **Action**: Feature is marked as deprecated. - **Timeline**: A removal version is explicitly stated in the deprecation @@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0"). - GitHub Issue (RFC) for feedback - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs -**2.Deprecated (Off By Default)** +### 2.Deprecated (Off By Default) - **Action**: Feature is disabled by default, but can still be re-enabled via a CLI flag or environment variable. Feature throws an error when used without @@ -55,7 +55,7 @@ re-enabling. while signaling imminent removal. Ensures any remaining usage is clearly surfaced and blocks silent breakage before full removal. -**3. Removed** +### 3. Removed - **Action**: Feature is completely removed from the codebase. - **Note**: Only features that have passed through the previous deprecation diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md index 13c3bc2c7e0..7c18b464b57 100644 --- a/docs/contributing/profiling.md +++ b/docs/contributing/profiling.md @@ -112,13 +112,13 @@ vllm bench serve \ In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run: -``` +```bash nsys sessions list ``` to get the session id in the form of `profile-XXXXX`, then run: -``` +```bash nsys stop --session=profile-XXXXX ``` diff --git a/docs/contributing/vulnerability_management.md b/docs/contributing/vulnerability_management.md index e20b10f8f7b..847883f7429 100644 --- a/docs/contributing/vulnerability_management.md +++ b/docs/contributing/vulnerability_management.md @@ -32,9 +32,9 @@ We prefer to keep all vulnerability-related communication on the security report on GitHub. However, if you need to contact the VMT directly for an urgent issue, you may contact the following individuals: -- Simon Mo - simon.mo@hey.com -- Russell Bryant - rbryant@redhat.com -- Huzaifa Sidhpurwala - huzaifas@redhat.com +- Simon Mo - +- Russell Bryant - +- Huzaifa Sidhpurwala - ## Slack Discussion diff --git a/docs/deployment/frameworks/anything-llm.md b/docs/deployment/frameworks/anything-llm.md index d6b28a358cc..e62a33b2085 100644 --- a/docs/deployment/frameworks/anything-llm.md +++ b/docs/deployment/frameworks/anything-llm.md @@ -19,9 +19,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 - Download and install [Anything LLM desktop](https://anythingllm.com/desktop). 
- On the bottom left of open settings, AI Prooviders --> LLM: - - LLM Provider: Generic OpenAI - - Base URL: http://{vllm server host}:{vllm server port}/v1 - - Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ` + - LLM Provider: Generic OpenAI + - Base URL: http://{vllm server host}:{vllm server port}/v1 + - Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ` ![](../../assets/deployment/anything-llm-provider.png) @@ -30,9 +30,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 ![](../../assets/deployment/anything-llm-chat-without-doc.png) - Click the upload button: - - upload the doc - - select the doc and move to the workspace - - save and embed + - upload the doc + - select the doc and move to the workspace + - save and embed ![](../../assets/deployment/anything-llm-upload-doc.png) diff --git a/docs/deployment/frameworks/chatbox.md b/docs/deployment/frameworks/chatbox.md index 15f92ed1e34..cbca6e6282f 100644 --- a/docs/deployment/frameworks/chatbox.md +++ b/docs/deployment/frameworks/chatbox.md @@ -19,11 +19,11 @@ vllm serve qwen/Qwen1.5-0.5B-Chat - Download and install [Chatbox desktop](https://chatboxai.app/en#download). - On the bottom left of settings, Add Custom Provider - - API Mode: `OpenAI API Compatible` - - Name: vllm - - API Host: `http://{vllm server host}:{vllm server port}/v1` - - API Path: `/chat/completions` - - Model: `qwen/Qwen1.5-0.5B-Chat` + - API Mode: `OpenAI API Compatible` + - Name: vllm + - API Host: `http://{vllm server host}:{vllm server port}/v1` + - API Path: `/chat/completions` + - Model: `qwen/Qwen1.5-0.5B-Chat` ![](../../assets/deployment/chatbox-settings.png) diff --git a/docs/deployment/frameworks/dify.md b/docs/deployment/frameworks/dify.md index a3063194fb5..35f02c33cb0 100644 --- a/docs/deployment/frameworks/dify.md +++ b/docs/deployment/frameworks/dify.md @@ -34,11 +34,11 @@ docker compose up -d - In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it. - Fill in the model provider details as follows: - - **Model Type**: `LLM` - - **Model Name**: `Qwen/Qwen1.5-7B-Chat` - - **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1` - - **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat` - - **Completion Mode**: `Completion` + - **Model Type**: `LLM` + - **Model Name**: `Qwen/Qwen1.5-7B-Chat` + - **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1` + - **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat` + - **Completion Mode**: `Completion` ![](../../assets/deployment/dify-settings.png) diff --git a/docs/deployment/frameworks/haystack.md b/docs/deployment/frameworks/haystack.md index a18d68142ca..70b4b48d454 100644 --- a/docs/deployment/frameworks/haystack.md +++ b/docs/deployment/frameworks/haystack.md @@ -1,7 +1,5 @@ # Haystack -# Haystack - [Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case. It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. 
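All of the integrations above (Anything LLM, Chatbox, Dify, Haystack) simply point an OpenAI-compatible provider at the vLLM server's `/v1` base URL. Before wiring up a framework, it can save time to confirm the endpoint responds with a plain client call. The snippet below is a minimal sketch that assumes a local server on port 8000 serving `qwen/Qwen1.5-0.5B-Chat`, as in the Chatbox example; adjust the base URL and model name to your deployment.

```python
# Minimal sanity check of a vLLM OpenAI-compatible endpoint before
# configuring a GUI framework against it. Assumes a server reachable at
# http://localhost:8000/v1 serving qwen/Qwen1.5-0.5B-Chat; adjust the
# base_url and model name to match your deployment.
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # any placeholder works unless the server sets --api-key
    base_url="http://localhost:8000/v1",
)

resp = client.chat.completions.create(
    model="qwen/Qwen1.5-0.5B-Chat",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```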
diff --git a/docs/deployment/frameworks/retrieval_augmented_generation.md b/docs/deployment/frameworks/retrieval_augmented_generation.md index 96dd99e7118..d5f2ec302b6 100644 --- a/docs/deployment/frameworks/retrieval_augmented_generation.md +++ b/docs/deployment/frameworks/retrieval_augmented_generation.md @@ -3,6 +3,7 @@ [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources. Here are the integrations: + - vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus) - vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus) diff --git a/docs/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md index 497f9f1a92a..fae392589c0 100644 --- a/docs/deployment/integrations/production-stack.md +++ b/docs/deployment/integrations/production-stack.md @@ -140,11 +140,12 @@ The core vLLM production stack configuration is managed with YAML. Here is the e ``` In this YAML configuration: + * **`modelSpec`** includes: - * `name`: A nickname that you prefer to call the model. - * `repository`: Docker repository of vLLM. - * `tag`: Docker image tag. - * `modelURL`: The LLM model that you want to use. + * `name`: A nickname that you prefer to call the model. + * `repository`: Docker repository of vLLM. + * `tag`: Docker image tag. + * `modelURL`: The LLM model that you want to use. * **`replicaCount`**: Number of replicas. * **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod. * **`requestGPU`**: Specifies the number of GPUs required. diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md index f244b0858eb..cad801a4312 100644 --- a/docs/deployment/k8s.md +++ b/docs/deployment/k8s.md @@ -5,7 +5,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le - [Deployment with CPUs](#deployment-with-cpus) - [Deployment with GPUs](#deployment-with-gpus) - [Troubleshooting](#troubleshooting) - - [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated) + - [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated) - [Conclusion](#conclusion) Alternatively, you can deploy vLLM to Kubernetes using any of the following: diff --git a/docs/design/metrics.md b/docs/design/metrics.md index 52cd320dd4e..ba34c7dca00 100644 --- a/docs/design/metrics.md +++ b/docs/design/metrics.md @@ -361,7 +361,7 @@ instances in Prometheus. 
We use this concept for the `vllm:cache_config_info` metric: -``` +```text # HELP vllm:cache_config_info Information of the LLMEngine CacheConfig # TYPE vllm:cache_config_info gauge vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0 @@ -686,7 +686,7 @@ documentation for this option states: The metrics were added by and who up in an OpenTelemetry trace as: -``` +```text -> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117) -> gen_ai.latency.time_in_model_forward: Double(3.151565277099609) -> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676) diff --git a/docs/design/p2p_nccl_connector.md b/docs/design/p2p_nccl_connector.md index 082dff15ef2..94af8bedd24 100644 --- a/docs/design/p2p_nccl_connector.md +++ b/docs/design/p2p_nccl_connector.md @@ -5,6 +5,7 @@ An implementation of xPyD with dynamic scaling based on point-to-point communica ## Detailed Design ### Overall Process + As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow: 1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface. @@ -23,7 +24,7 @@ A simple HTTP service acts as the entry point for client requests and starts a b The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and generating a corresponding `request_id`, for example: -``` +```text cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_93923d63113b4b338973f24d19d4bf11-0 ``` @@ -70,6 +71,7 @@ pip install "vllm>=0.9.2" ## Run xPyD ### Instructions + - The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model. - Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput. - For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance. diff --git a/docs/design/prefix_caching.md b/docs/design/prefix_caching.md index 2d3c8412894..fcc014cf851 100644 --- a/docs/design/prefix_caching.md +++ b/docs/design/prefix_caching.md @@ -18,10 +18,12 @@ In the example above, the KV cache in the first block can be uniquely identified * Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision. * Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments. -> **Note 1:** We only cache full blocks. +!!! note "Note 1" + We only cache full blocks. -> **Note 2:** The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value. 
To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash. -SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context). +!!! note "Note 2" + The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash. + SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context). **A hashing example with multi-modality inputs** In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages: @@ -92,7 +94,8 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others. -> **Note:** Cache isolation is not supported in engine V0. +!!! note + Cache isolation is not supported in engine V0. ## Data Structure diff --git a/docs/design/torch_compile.md b/docs/design/torch_compile.md index ea5d8ac212f..2d76e7f3adc 100644 --- a/docs/design/torch_compile.md +++ b/docs/design/torch_compile.md @@ -8,7 +8,7 @@ Throughout the example, we will run a common Llama model using v1, and turn on d In the very verbose logs, we can see: -``` +```console INFO 03-07 03:06:55 [backends.py:409] Using cache directory: ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0 for vLLM's torch.compile ``` @@ -75,7 +75,7 @@ Every submodule can be identified by its index, and will be processed individual In the very verbose logs, we can also see: -``` +```console DEBUG 03-07 03:52:37 [backends.py:134] store the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py') DEBUG 03-07 03:52:39 [backends.py:134] store the 1-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py') ... @@ -93,7 +93,7 @@ One more detail: you can see that the 1-th graph and the 15-th graph have the sa If we already have the cache directory (e.g. 
run the same code for the second time), we will see the following logs: -``` +```console DEBUG 03-07 04:00:45 [backends.py:86] Directly load the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py') ``` diff --git a/docs/features/compatibility_matrix.md b/docs/features/compatibility_matrix.md index 259a447984c..930265b8f98 100644 --- a/docs/features/compatibility_matrix.md +++ b/docs/features/compatibility_matrix.md @@ -36,9 +36,9 @@ th:not(:first-child) { | Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| -| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | | -| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | | -| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | | +| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | +| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | +| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | | [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | [pooling](../models/pooling_models.md) | ✅\* | ✅\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | diff --git a/docs/features/lora.md b/docs/features/lora.md index ea1b495138c..a4e05dae11c 100644 --- a/docs/features/lora.md +++ b/docs/features/lora.md @@ -119,6 +119,7 @@ export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True ``` ### Using API Endpoints + Loading a LoRA Adapter: To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary @@ -156,6 +157,7 @@ curl -X POST http://localhost:8000/v1/unload_lora_adapter \ ``` ### Using Plugins + Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources such as local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter. You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds. diff --git a/docs/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md index d4c8852206b..b8677f11a1d 100644 --- a/docs/features/multimodal_inputs.md +++ b/docs/features/multimodal_inputs.md @@ -588,7 +588,9 @@ Full example: /bin/bash`. + If Ray is running inside containers, run the commands in the remainder of this guide *inside the containers*, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it /bin/bash`. Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient. 
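Before launching vLLM on the cluster, it can be worth confirming that every node has joined and that all GPUs are visible to Ray. The snippet below is an optional convenience check using Ray's public Python API (run it inside the container if you are using the containerized setup); it is not required by vLLM.

```python
# Optional pre-flight check: confirm the Ray cluster sees all nodes and
# GPUs before running `vllm` on it. Execute inside the same environment
# (or container) that vLLM will use. Uses Ray's public API only.
import ray

# address="auto" attaches to the cluster started on the head node.
ray.init(address="auto")

resources = ray.cluster_resources()
print("nodes:", len(ray.nodes()))
print("CPUs :", resources.get("CPU", 0))
print("GPUs :", resources.get("GPU", 0))
```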
diff --git a/docs/serving/expert_parallel_deployment.md b/docs/serving/expert_parallel_deployment.md index d79b6fc5901..280b3322b11 100644 --- a/docs/serving/expert_parallel_deployment.md +++ b/docs/serving/expert_parallel_deployment.md @@ -31,11 +31,12 @@ vLLM provides three communication backends for EP: Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as: -``` +```text EP_SIZE = TP_SIZE × DP_SIZE ``` Where: + - `TP_SIZE`: Tensor parallel size (always 1 for now) - `DP_SIZE`: Data parallel size - `EP_SIZE`: Expert parallel size (computed automatically) diff --git a/docs/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md index 4eb2ea27318..dfed15d4ace 100644 --- a/docs/serving/openai_compatible_server.md +++ b/docs/serving/openai_compatible_server.md @@ -206,6 +206,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai We support both [Vision](https://platform.openai.com/docs/guides/vision)- and [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters; see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information. + - *Note: `image_url.detail` parameter is not supported.* Code example: diff --git a/docs/usage/security.md b/docs/usage/security.md index 76140434dcb..d54e2bb37ec 100644 --- a/docs/usage/security.md +++ b/docs/usage/security.md @@ -13,15 +13,18 @@ All communications between nodes in a multi-node vLLM deployment are **insecure The following options control inter-node communications in vLLM: #### 1. **Environment Variables:** - - `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on + +- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on #### 2. **KV Cache Transfer Configuration:** - - `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1) - - `--kv-port`: The port for KV cache transfer communications (default: 14579) + +- `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1) +- `--kv-port`: The port for KV cache transfer communications (default: 14579) #### 3. **Data Parallel Configuration:** - - `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1) - - `data_parallel_master_port`: Port of the data parallel master (default: 29500) + +- `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1) +- `data_parallel_master_port`: Port of the data parallel master (default: 29500) ### Notes on PyTorch Distributed @@ -41,18 +44,21 @@ Key points from the PyTorch security guide: ### Security Recommendations #### 1. **Network Isolation:** - - Deploy vLLM nodes on a dedicated, isolated network - - Use network segmentation to prevent unauthorized access - - Implement appropriate firewall rules + +- Deploy vLLM nodes on a dedicated, isolated network +- Use network segmentation to prevent unauthorized access +- Implement appropriate firewall rules #### 2. **Configuration Best Practices:** - - Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults - - Configure firewalls to only allow necessary ports between nodes + +- Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults +- Configure firewalls to only allow necessary ports between nodes #### 3. 
**Access Control:** - - Restrict physical and network access to the deployment environment - - Implement proper authentication and authorization for management interfaces - - Follow the principle of least privilege for all system components + +- Restrict physical and network access to the deployment environment +- Implement proper authentication and authorization for management interfaces +- Follow the principle of least privilege for all system components ## Security and Firewalls: Protecting Exposed vLLM Systems diff --git a/docs/usage/v1_guide.md b/docs/usage/v1_guide.md index 498ff3da0ca..38399c6633b 100644 --- a/docs/usage/v1_guide.md +++ b/docs/usage/v1_guide.md @@ -148,7 +148,7 @@ are not yet supported. vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic differences compared to V0: -**Logprobs Calculation** +##### Logprobs Calculation Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e. before applying any logits post-processing such as temperature scaling or penalty @@ -157,7 +157,7 @@ probabilities used during sampling. Support for logprobs with post-sampling adjustments is in progress and will be added in future updates. -**Prompt Logprobs with Prefix Caching** +##### Prompt Logprobs with Prefix Caching Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414). @@ -165,7 +165,7 @@ Currently prompt logprobs are only supported when prefix caching is turned off v As part of the major architectural rework in vLLM V1, several legacy features have been deprecated. -**Sampling features** +##### Sampling features - **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361). - **Per-Request Logits Processors**: In V0, users could pass custom @@ -173,11 +173,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha feature has been deprecated. Instead, the design is moving toward supporting **global logits processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360). -**KV Cache features** +##### KV Cache features - **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping to handle request preemptions. -**Structured Output features** +##### Structured Output features - **Request-level Structured Output Backend**: Deprecated, alternative backends (outlines, guidance) with fallbacks is supported now. diff --git a/examples/offline_inference/disaggregated-prefill-v1/README.md b/examples/offline_inference/disaggregated-prefill-v1/README.md index 9cbdb19820f..abf6883f8d3 100644 --- a/examples/offline_inference/disaggregated-prefill-v1/README.md +++ b/examples/offline_inference/disaggregated-prefill-v1/README.md @@ -5,6 +5,6 @@ This example contains scripts that demonstrate disaggregated prefill in the offl ## Files - `run.sh` - A helper script that will run `prefill_example.py` and `decode_example.py` sequentially. - - Make sure you are in the `examples/offline_inference/disaggregated-prefill-v1` directory before running `run.sh`. 
+ - Make sure you are in the `examples/offline_inference/disaggregated-prefill-v1` directory before running `run.sh`. - `prefill_example.py` - A script which performs prefill only, saving the KV state to the `local_storage` directory and the prompts to `output.txt`. - `decode_example.py` - A script which performs decode only, loading the KV state from the `local_storage` directory and the prompts from `output.txt`. diff --git a/examples/offline_inference/openai_batch/README.md b/examples/offline_inference/openai_batch/README.md index 631fde91fcd..3c6f6c7a6c5 100644 --- a/examples/offline_inference/openai_batch/README.md +++ b/examples/offline_inference/openai_batch/README.md @@ -19,9 +19,9 @@ We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` e ## Pre-requisites * The examples in this document use `meta-llama/Meta-Llama-3-8B-Instruct`. - - Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens) - - Install the token on your machine (Run `huggingface-cli login`). - - Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions. + * Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens) + * Install the token on your machine (Run `huggingface-cli login`). + * Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions. ## Example 1: Running with a local file @@ -105,7 +105,7 @@ To integrate with cloud blob storage, we recommend using presigned urls. * [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html). * The `awscli` package (Run `pip install awscli`) to configure your credentials and interactively use s3. - - [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html). + * [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html). * The `boto3` python package (Run `pip install boto3`) to generate presigned urls. ### Step 1: Upload your input script diff --git a/examples/others/lmcache/README.md b/examples/others/lmcache/README.md index 95a6bf995b2..759be55d6f1 100644 --- a/examples/others/lmcache/README.md +++ b/examples/others/lmcache/README.md @@ -28,16 +28,20 @@ to run disaggregated prefill and benchmark the performance. ### Components #### Server Scripts + - `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server. 
- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between prefiller and decoder - `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example #### Configuration + - `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for prefiller server - `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for decoder server #### Log Files + The main script generates several log files: + - `prefiller.log` - Logs from the prefill server - `decoder.log` - Logs from the decode server - `proxy.log` - Logs from the proxy server diff --git a/examples/others/logging_configuration.md b/examples/others/logging_configuration.md index 916ab5fd1c8..7c8bdd199a7 100644 --- a/examples/others/logging_configuration.md +++ b/examples/others/logging_configuration.md @@ -8,11 +8,11 @@ of logging configurations that range from simple-and-inflexible to more-complex-and-more-flexible. - No vLLM logging (simple and inflexible) - - Set `VLLM_CONFIGURE_LOGGING=0` (leaving `VLLM_LOGGING_CONFIG_PATH` unset) + - Set `VLLM_CONFIGURE_LOGGING=0` (leaving `VLLM_LOGGING_CONFIG_PATH` unset) - vLLM's default logging configuration (simple and inflexible) - - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` + - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` - Fine-grained custom logging configuration (more complex, more flexible) - - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` and + - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` and set `VLLM_LOGGING_CONFIG_PATH=` ## Logging Configuration Environment Variables diff --git a/pyproject.toml b/pyproject.toml index a65267942d4..dfad5d2cdf3 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -156,16 +156,6 @@ markers = [ "optional: optional tests that are automatically skipped, include --optional to run them", ] -[tool.pymarkdown] -plugins.md004.style = "sublist" # ul-style -plugins.md007.indent = 4 # ul-indent -plugins.md007.start_indented = true # ul-indent -plugins.md013.enabled = false # line-length -plugins.md041.enabled = false # first-line-h1 -plugins.md033.enabled = false # inline-html -plugins.md046.enabled = false # code-block-style -plugins.md024.allow_different_nesting = true # no-duplicate-headers - [tool.ty.src] root = "./vllm" respect-ignore-files = true diff --git a/tools/ep_kernels/README.md b/tools/ep_kernels/README.md index f1479146f05..273e0f378e3 100644 --- a/tools/ep_kernels/README.md +++ b/tools/ep_kernels/README.md @@ -1,6 +1,9 @@ +# Expert parallel kernels + Large-scale cluster-level expert parallel, as described in the [DeepSeek-V3 Technical Report](http://arxiv.org/abs/2412.19437), is an efficient way to deploy sparse MoE models with many experts. However, such deployment requires many components beyond a normal Python package, including system package support and system driver support. It is impossible to bundle all these components into a Python package. Here we break down the requirements in 2 steps: + 1. Build and install the Python libraries (both [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) and [DeepEP](https://github.com/deepseek-ai/DeepEP)), including necessary dependencies like NVSHMEM. This step does not require any privileged access. Any user can do this. 2. Configure NVIDIA driver to enable IBGDA. This step requires root access, and must be done on the host machine. 
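Tying back to the logging-configuration excerpt a little further up: the three modes it lists are all driven by `VLLM_CONFIGURE_LOGGING` and `VLLM_LOGGING_CONFIG_PATH`. A hedged sketch of the corresponding environment setups follows; the JSON path is a placeholder, not a file shipped with vLLM.

```bash
# 1. Disable vLLM's logging configuration entirely.
export VLLM_CONFIGURE_LOGGING=0

# 2. Use vLLM's default logging configuration
#    (equivalent to leaving the variable unset).
export VLLM_CONFIGURE_LOGGING=1

# 3. Fine-grained custom configuration: keep logging enabled and
#    point vLLM at a custom logging config (placeholder path).
export VLLM_CONFIGURE_LOGGING=1
export VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json
```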
@@ -8,15 +11,15 @@ Here we break down the requirements in 2 steps: All scripts accept a positional argument as workspace path for staging the build, defaulting to `$(pwd)/ep_kernels_workspace`. -# Usage +## Usage -## Single-node +### Single-node ```bash bash install_python_libraries.sh ``` -## Multi-node +### Multi-node ```bash bash install_python_libraries.sh diff --git a/vllm/plugins/lora_resolvers/README.md b/vllm/plugins/lora_resolvers/README.md index 7e7c55f5c69..48f27dddea0 100644 --- a/vllm/plugins/lora_resolvers/README.md +++ b/vllm/plugins/lora_resolvers/README.md @@ -6,7 +6,8 @@ via the LoRAResolver plugin framework. Note that `VLLM_ALLOW_RUNTIME_LORA_UPDATING` must be set to true to allow LoRA resolver plugins to work, and `VLLM_PLUGINS` must be set to include the desired resolver plugins. -# lora_filesystem_resolver +## lora_filesystem_resolver + This LoRA Resolver is installed with vLLM by default. To use, set `VLLM_PLUGIN_LORA_CACHE_DIR` to a local directory. When vLLM receives a request for a LoRA adapter `foobar` it doesn't currently recognize, it will look in that local directory From 94efcfe2a7f5d8366d2e2cd4ea7ad6625c044186 Mon Sep 17 00:00:00 2001 From: Chen Zhang Date: Tue, 29 Jul 2025 19:45:18 -0700 Subject: [PATCH 491/552] [DOC] Fix path of v1 related figures (#21868) Signed-off-by: Chen Zhang Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 --- .../design/{v1 => }/metrics/intervals-1.png | Bin .../design/{v1 => }/metrics/intervals-2.png | Bin .../design/{v1 => }/metrics/intervals-3.png | Bin .../{v1 => }/prefix_caching/example-time-1.png | Bin .../{v1 => }/prefix_caching/example-time-3.png | Bin .../{v1 => }/prefix_caching/example-time-4.png | Bin .../{v1 => }/prefix_caching/example-time-5.png | Bin .../{v1 => }/prefix_caching/example-time-6.png | Bin .../{v1 => }/prefix_caching/example-time-7.png | Bin .../design/{v1 => }/prefix_caching/free.png | Bin .../design/{v1 => }/prefix_caching/overview.png | Bin .../design/{v1 => }/tpu/most_model_len.png | Bin docs/configuration/tpu.md | 2 +- docs/design/metrics.md | 6 +++--- docs/design/prefix_caching.md | 16 ++++++++-------- 15 files changed, 12 insertions(+), 12 deletions(-) rename docs/assets/design/{v1 => }/metrics/intervals-1.png (100%) rename docs/assets/design/{v1 => }/metrics/intervals-2.png (100%) rename docs/assets/design/{v1 => }/metrics/intervals-3.png (100%) rename docs/assets/design/{v1 => }/prefix_caching/example-time-1.png (100%) rename docs/assets/design/{v1 => }/prefix_caching/example-time-3.png (100%) rename docs/assets/design/{v1 => }/prefix_caching/example-time-4.png (100%) rename docs/assets/design/{v1 => }/prefix_caching/example-time-5.png (100%) rename docs/assets/design/{v1 => }/prefix_caching/example-time-6.png (100%) rename docs/assets/design/{v1 => }/prefix_caching/example-time-7.png (100%) rename docs/assets/design/{v1 => }/prefix_caching/free.png (100%) rename docs/assets/design/{v1 => }/prefix_caching/overview.png (100%) rename docs/assets/design/{v1 => }/tpu/most_model_len.png (100%) diff --git a/docs/assets/design/v1/metrics/intervals-1.png b/docs/assets/design/metrics/intervals-1.png similarity index 100% rename from docs/assets/design/v1/metrics/intervals-1.png rename to docs/assets/design/metrics/intervals-1.png diff --git a/docs/assets/design/v1/metrics/intervals-2.png b/docs/assets/design/metrics/intervals-2.png similarity index 100% rename from docs/assets/design/v1/metrics/intervals-2.png rename to 
docs/assets/design/metrics/intervals-2.png diff --git a/docs/assets/design/v1/metrics/intervals-3.png b/docs/assets/design/metrics/intervals-3.png similarity index 100% rename from docs/assets/design/v1/metrics/intervals-3.png rename to docs/assets/design/metrics/intervals-3.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-1.png b/docs/assets/design/prefix_caching/example-time-1.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-1.png rename to docs/assets/design/prefix_caching/example-time-1.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-3.png b/docs/assets/design/prefix_caching/example-time-3.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-3.png rename to docs/assets/design/prefix_caching/example-time-3.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-4.png b/docs/assets/design/prefix_caching/example-time-4.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-4.png rename to docs/assets/design/prefix_caching/example-time-4.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-5.png b/docs/assets/design/prefix_caching/example-time-5.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-5.png rename to docs/assets/design/prefix_caching/example-time-5.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-6.png b/docs/assets/design/prefix_caching/example-time-6.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-6.png rename to docs/assets/design/prefix_caching/example-time-6.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-7.png b/docs/assets/design/prefix_caching/example-time-7.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-7.png rename to docs/assets/design/prefix_caching/example-time-7.png diff --git a/docs/assets/design/v1/prefix_caching/free.png b/docs/assets/design/prefix_caching/free.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/free.png rename to docs/assets/design/prefix_caching/free.png diff --git a/docs/assets/design/v1/prefix_caching/overview.png b/docs/assets/design/prefix_caching/overview.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/overview.png rename to docs/assets/design/prefix_caching/overview.png diff --git a/docs/assets/design/v1/tpu/most_model_len.png b/docs/assets/design/tpu/most_model_len.png similarity index 100% rename from docs/assets/design/v1/tpu/most_model_len.png rename to docs/assets/design/tpu/most_model_len.png diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 0ff0cdda380..a2941c80bd2 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -47,7 +47,7 @@ This initial compilation time ranges significantly and is impacted by many of th #### max model len vs. most model len -![most_model_len](../assets/design/v1/tpu/most_model_len.png) +![most_model_len](../assets/design/tpu/most_model_len.png) If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. 
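The TPU note above mentions `VLLM_TPU_MOST_MODEL_LEN` without showing an invocation; here is a hedged sketch of how it might be combined with a large `--max-model-len`. The model and both length values are illustrative assumptions, not values from the docs.

```bash
# Illustrative values only: keep a generous hard limit of 8192 tokens,
# while hinting that most requests stay within 2048 tokens.
VLLM_TPU_MOST_MODEL_LEN=2048 \
    vllm serve Qwen/Qwen3-0.6B --max-model-len 8192
```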
diff --git a/docs/design/metrics.md b/docs/design/metrics.md index ba34c7dca00..1f65331d3c0 100644 --- a/docs/design/metrics.md +++ b/docs/design/metrics.md @@ -223,7 +223,7 @@ And the calculated intervals are: Put another way: -![Interval calculations - common case](../../assets/design/v1/metrics/intervals-1.png) +![Interval calculations - common case](../assets/design/metrics/intervals-1.png) We explored the possibility of having the frontend calculate these intervals using the timing of events visible by the frontend. However, @@ -238,13 +238,13 @@ When a preemption occurs during decode, since any already generated tokens are reused, we consider the preemption as affecting the inter-token, decode, and inference intervals. -![Interval calculations - preempted decode](../../assets/design/v1/metrics/intervals-2.png) +![Interval calculations - preempted decode](../assets/design/metrics/intervals-2.png) When a preemption occurs during prefill (assuming such an event is possible), we consider the preemption as affecting the time-to-first-token and prefill intervals. -![Interval calculations - preempted prefill](../../assets/design/v1/metrics/intervals-3.png) +![Interval calculations - preempted prefill](../assets/design/metrics/intervals-3.png) ### Frontend Stats Collection diff --git a/docs/design/prefix_caching.md b/docs/design/prefix_caching.md index fcc014cf851..9941837bf16 100644 --- a/docs/design/prefix_caching.md +++ b/docs/design/prefix_caching.md @@ -125,7 +125,7 @@ There are two design points to highlight: As a result, we will have the following components when the KV cache manager is initialized: -![Component Overview](../../assets/design/v1/prefix_caching/overview.png) +![Component Overview](../assets/design/prefix_caching/overview.png) * Block Pool: A list of KVCacheBlock. * Free Block Queue: Only store the pointers of head and tail blocks for manipulations. @@ -195,7 +195,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and block 2, 3, 4, 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in the *reverse* order. This is because the last block of a request must hash more tokens and is less likely to be reused by other requests. As a result, it should be evicted first. -![Free queue after a request us freed](../../assets/design/v1/prefix_caching/free.png) +![Free queue after a request us freed](../assets/design/prefix_caching/free.png) ### Eviction (LRU) @@ -211,24 +211,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens), **Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens. -![Example Time 1](../../assets/design/v1/prefix_caching/example-time-1.png) +![Example Time 1](../assets/design/prefix_caching/example-time-1.png) **Time 3: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4. 
-![Example Time 3](../../assets/design/v1/prefix_caching/example-time-3.png) +![Example Time 3](../assets/design/prefix_caching/example-time-3.png) **Time 4: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens. -![Example Time 4](../../assets/design/v1/prefix_caching/example-time-4.png) +![Example Time 4](../assets/design/prefix_caching/example-time-4.png) **Time 5: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but block 2 and 3 are still cached). Block 0 and 1 are not added to the free queue because they are being used by Request 1. -![Example Time 5](../../assets/design/v1/prefix_caching/example-time-5.png) +![Example Time 5](../assets/design/prefix_caching/example-time-5.png) **Time 6: Request 1 is finished and free.** -![Example Time 6](../../assets/design/v1/prefix_caching/example-time-6.png) +![Example Time 6](../assets/design/prefix_caching/example-time-6.png) **Time 7: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0\.** Note that even the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted). -![Example Time 7](../../assets/design/v1/prefix_caching/example-time-7.png) +![Example Time 7](../assets/design/prefix_caching/example-time-7.png) From 8b68315e1095d8230d3fbbfc07e02dad10f8e6eb Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Tue, 29 Jul 2025 22:45:41 -0400 Subject: [PATCH 492/552] [Docs] Update docker.md with HF_TOKEN, new model, and podman fix (#21856) Signed-off-by: x22x22 --- docs/deployment/docker.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md index e500751896b..5f6cfcb00a3 100644 --- a/docs/deployment/docker.md +++ b/docs/deployment/docker.md @@ -10,23 +10,23 @@ The image can be used to run OpenAI compatible server and is available on Docker ```bash docker run --runtime nvidia --gpus all \ -v ~/.cache/huggingface:/root/.cache/huggingface \ - --env "HUGGING_FACE_HUB_TOKEN=" \ + --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ -p 8000:8000 \ --ipc=host \ vllm/vllm-openai:latest \ - --model mistralai/Mistral-7B-v0.1 + --model Qwen/Qwen3-0.6B ``` This image can also be used with other container engines such as [Podman](https://podman.io/). ```bash -podman run --gpus all \ +podman run --device nvidia.com/gpu=all \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ -p 8000:8000 \ --ipc=host \ - vllm/vllm-openai:latest \ - --model mistralai/Mistral-7B-v0.1 + docker.io/vllm/vllm-openai:latest \ + --model Qwen/Qwen3-0.6B ``` You can add any other [engine-args](../configuration/engine_args.md) you need after the image tag (`vllm/vllm-openai:latest`). 
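The Docker and Podman commands in the patch above start an OpenAI-compatible server for `Qwen/Qwen3-0.6B` on port 8000. A minimal smoke test against that server might look like the following; the prompt and token count are arbitrary.

```bash
# Assumes the container from the example above is running and
# listening on localhost:8000 with Qwen/Qwen3-0.6B loaded.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-0.6B",
          "prompt": "Hello, my name is",
          "max_tokens": 16
        }'
```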
From 55cde0aa484c3f8a01ff0ad1c5396b839784cd3d Mon Sep 17 00:00:00 2001 From: Csrayz <33659823+Csrayz@users.noreply.github.com> Date: Wed, 30 Jul 2025 10:46:31 +0800 Subject: [PATCH 493/552] Expose PyTorch profiler configuration to environment variables (#21803) Signed-off-by: Csrayz <33659823+Csrayz@users.noreply.github.com> Signed-off-by: x22x22 --- docs/contributing/profiling.md | 7 ++++++- vllm/envs.py | 29 +++++++++++++++++++++++++++++ vllm/v1/worker/gpu_worker.py | 15 +++++++++++++-- vllm/v1/worker/xpu_worker.py | 13 ++++++++++++- 4 files changed, 60 insertions(+), 4 deletions(-) diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md index 7c18b464b57..74627e90621 100644 --- a/docs/contributing/profiling.md +++ b/docs/contributing/profiling.md @@ -5,7 +5,12 @@ ## Profile with PyTorch Profiler -We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/` +We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/`. Additionally, you can control the profiling content by specifying the following environment variables: + +- `VLLM_TORCH_PROFILER_RECORD_SHAPES=1` to enable recording Tensor Shapes, off by default +- `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY=1` to record memory, off by default +- `VLLM_TORCH_PROFILER_WITH_STACK=1` to enable recording stack information, on by default +- `VLLM_TORCH_PROFILER_WITH_FLOPS=1` to enable recording FLOPs, off by default The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set. diff --git a/vllm/envs.py b/vllm/envs.py index 9b6d8c8be24..50cb3b7d1b7 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -80,6 +80,10 @@ VLLM_PLUGINS: Optional[list[str]] = None VLLM_LORA_RESOLVER_CACHE_DIR: Optional[str] = None VLLM_TORCH_PROFILER_DIR: Optional[str] = None + VLLM_TORCH_PROFILER_RECORD_SHAPES: bool = False + VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY: bool = False + VLLM_TORCH_PROFILER_WITH_STACK: bool = True + VLLM_TORCH_PROFILER_WITH_FLOPS: bool = False VLLM_USE_TRITON_AWQ: bool = False VLLM_ALLOW_RUNTIME_LORA_UPDATING: bool = False VLLM_SKIP_P2P_CHECK: bool = False @@ -629,6 +633,31 @@ def get_vllm_port() -> Optional[int]: lambda: (None if os.getenv("VLLM_TORCH_PROFILER_DIR", None) is None else os .path.expanduser(os.getenv("VLLM_TORCH_PROFILER_DIR", "."))), + # Enable torch profiler to record shapes if set + # VLLM_TORCH_PROFILER_RECORD_SHAPES=1. If not set, torch profiler will + # not record shapes. + "VLLM_TORCH_PROFILER_RECORD_SHAPES": + lambda: bool(os.getenv("VLLM_TORCH_PROFILER_RECORD_SHAPES", "0") != "0"), + + # Enable torch profiler to profile memory if set + # VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY=1. If not set, torch profiler + # will not profile memory. + "VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY": + lambda: bool( + os.getenv("VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY", "0") != "0"), + + # Enable torch profiler to profile stack if set + # VLLM_TORCH_PROFILER_WITH_STACK=1. If not set, torch profiler WILL + # profile stack by default. + "VLLM_TORCH_PROFILER_WITH_STACK": + lambda: bool(os.getenv("VLLM_TORCH_PROFILER_WITH_STACK", "1") != "0"), + + # Enable torch profiler to profile flops if set + # VLLM_TORCH_PROFILER_WITH_FLOPS=1. 
If not set, torch profiler will + # not profile flops. + "VLLM_TORCH_PROFILER_WITH_FLOPS": + lambda: bool(os.getenv("VLLM_TORCH_PROFILER_WITH_FLOPS", "0") != "0"), + # If set, vLLM will use Triton implementations of AWQ. "VLLM_USE_TRITON_AWQ": lambda: bool(int(os.getenv("VLLM_USE_TRITON_AWQ", "0"))), diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py index d9d1f14f055..0f46ed223ab 100644 --- a/vllm/v1/worker/gpu_worker.py +++ b/vllm/v1/worker/gpu_worker.py @@ -71,12 +71,23 @@ def __init__( torch_profiler_trace_dir = envs.VLLM_TORCH_PROFILER_DIR logger.info("Profiling enabled. Traces will be saved to: %s", torch_profiler_trace_dir) + logger.debug( + "Profiler config: record_shapes=%s," + "profile_memory=%s,with_stack=%s,with_flops=%s", + envs.VLLM_TORCH_PROFILER_RECORD_SHAPES, + envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY, + envs.VLLM_TORCH_PROFILER_WITH_STACK, + envs.VLLM_TORCH_PROFILER_WITH_FLOPS, + ) self.profiler = torch.profiler.profile( activities=[ torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, ], - with_stack=True, + record_shapes=envs.VLLM_TORCH_PROFILER_RECORD_SHAPES, + profile_memory=envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY, + with_stack=envs.VLLM_TORCH_PROFILER_WITH_STACK, + with_flops=envs.VLLM_TORCH_PROFILER_WITH_FLOPS, on_trace_ready=torch.profiler.tensorboard_trace_handler( torch_profiler_trace_dir, use_gzip=True)) else: @@ -209,7 +220,7 @@ def reload_weights(self) -> None: @torch.inference_mode() def determine_available_memory(self) -> int: - """Profiles the peak memory usage of the model to determine how much + """Profiles the peak memory usage of the model to determine how much memory can be used for KV cache without OOMs. The engine will first conduct a profiling of the existing memory usage. diff --git a/vllm/v1/worker/xpu_worker.py b/vllm/v1/worker/xpu_worker.py index c7885694f7a..2a7e0625b2f 100644 --- a/vllm/v1/worker/xpu_worker.py +++ b/vllm/v1/worker/xpu_worker.py @@ -41,12 +41,23 @@ def __init__( torch_profiler_trace_dir = envs.VLLM_TORCH_PROFILER_DIR logger.info("Profiling enabled. 
Traces will be saved to: %s", torch_profiler_trace_dir) + logger.debug( + "Profiler config: record_shapes=%s," + "profile_memory=%s,with_stack=%s,with_flops=%s", + envs.VLLM_TORCH_PROFILER_RECORD_SHAPES, + envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY, + envs.VLLM_TORCH_PROFILER_WITH_STACK, + envs.VLLM_TORCH_PROFILER_WITH_FLOPS, + ) self.profiler = torch.profiler.profile( activities=[ torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.XPU, ], - with_stack=True, + record_shapes=envs.VLLM_TORCH_PROFILER_RECORD_SHAPES, + profile_memory=envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY, + with_stack=envs.VLLM_TORCH_PROFILER_WITH_STACK, + with_flops=envs.VLLM_TORCH_PROFILER_WITH_FLOPS, on_trace_ready=torch.profiler.tensorboard_trace_handler( torch_profiler_trace_dir, use_gzip=True)) else: From 3846108c1c1e448406cb963b1b6a11e25aa31604 Mon Sep 17 00:00:00 2001 From: Areeb Syed Date: Wed, 30 Jul 2025 09:05:21 +0530 Subject: [PATCH 494/552] [Bugfix] Fix shape mismatch assertion error when loading Gemma3n model with BitsAndBytes quantization (#21808) Signed-off-by: sydarb Signed-off-by: x22x22 --- vllm/model_executor/models/gemma3n.py | 31 +++++++++++++++++++++------ 1 file changed, 24 insertions(+), 7 deletions(-) diff --git a/vllm/model_executor/models/gemma3n.py b/vllm/model_executor/models/gemma3n.py index 168665cc296..d0880103d4e 100644 --- a/vllm/model_executor/models/gemma3n.py +++ b/vllm/model_executor/models/gemma3n.py @@ -167,22 +167,33 @@ def correct(self, predictions: torch.Tensor, class Gemma3nLaurelBlock(nn.Module): """Learned Augmented Residual Layer""" - def __init__(self, hidden_size: int, laurel_rank: int, rms_norm_eps: float, - prefix: str): + def __init__( + self, + hidden_size: int, + laurel_rank: int, + rms_norm_eps: float, + *, + quant_config: Optional[QuantizationConfig] = None, + prefix: str, + ) -> None: super().__init__() self.linear_left = ColumnParallelLinear( hidden_size, laurel_rank, bias=False, + quant_config=quant_config, prefix=f"{prefix}.linear_left", return_bias=False, ) - self.linear_right = RowParallelLinear(laurel_rank, - hidden_size, - bias=False, - prefix=f"{prefix}.linear_right", - return_bias=False) + self.linear_right = RowParallelLinear( + laurel_rank, + hidden_size, + bias=False, + quant_config=quant_config, + prefix=f"{prefix}.linear_right", + return_bias=False, + ) self.post_laurel_norm = RMSNorm( hidden_size=hidden_size, eps=rms_norm_eps, @@ -417,6 +428,7 @@ def __init__( hidden_size=config.hidden_size, laurel_rank=config.laurel_rank, rms_norm_eps=config.rms_norm_eps, + quant_config=quant_config, prefix=f"{prefix}.laurel", ) @@ -427,6 +439,7 @@ def __init__( config.hidden_size, config.hidden_size_per_layer_input, bias=False, + quant_config=quant_config, prefix=f"{prefix}.per_layer_input_gate", return_bias=False, ) @@ -434,6 +447,7 @@ def __init__( config.hidden_size_per_layer_input, config.hidden_size, bias=False, + quant_config=quant_config, prefix=f"{prefix}.per_layer_projection", return_bias=False, ) @@ -547,6 +561,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): bias=False, gather_output=True, return_bias=False, + quant_config=quant_config, prefix=f"{prefix}.per_layer_model_projection", ) self.per_layer_projection_norm = RMSNorm( @@ -566,6 +581,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): bias=False, gather_output=True, return_bias=False, + quant_config=quant_config, prefix=f"{prefix}.{idx-1}.altup_projections", ) for idx in range(1, self.config.altup_num_inputs) ]) @@ -576,6 +592,7 @@ 
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): bias=False, gather_output=True, return_bias=False, + quant_config=quant_config, prefix=f"{prefix}.{idx-1}.altup_unembed_projections", ) for idx in range(1, self.config.altup_num_inputs) ]) From 6defc83f9b576ebebf9f8273c356d87cc23791c1 Mon Sep 17 00:00:00 2001 From: MingzhenHan Date: Wed, 30 Jul 2025 11:35:33 +0800 Subject: [PATCH 495/552] [Bugfix] Fix comment typo of get_num_common_prefix_blocks() (#21827) Signed-off-by: MingzhenHan Signed-off-by: x22x22 --- vllm/v1/core/kv_cache_coordinator.py | 4 ++-- vllm/v1/core/single_type_kv_cache_manager.py | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/vllm/v1/core/kv_cache_coordinator.py b/vllm/v1/core/kv_cache_coordinator.py index 0cce2ec81e0..258805843e2 100644 --- a/vllm/v1/core/kv_cache_coordinator.py +++ b/vllm/v1/core/kv_cache_coordinator.py @@ -130,10 +130,10 @@ def get_num_common_prefix_blocks(self, request_id: str, Args: request_id: The request ID. - block_hashes: The block hashes of the request. + num_running_requests: The number of requests in the RUNNING state. Returns: - The number of common prefix blocks. + list[int]: The number of common prefix blocks. """ num_blocks_per_group = [ manager.get_num_common_prefix_blocks(request_id, diff --git a/vllm/v1/core/single_type_kv_cache_manager.py b/vllm/v1/core/single_type_kv_cache_manager.py index e8a44c7773a..714f49494c9 100644 --- a/vllm/v1/core/single_type_kv_cache_manager.py +++ b/vllm/v1/core/single_type_kv_cache_manager.py @@ -181,7 +181,7 @@ def get_num_common_prefix_blocks(self, request_id: str, Args: request_id: The request ID. - block_hashes: The block hashes of the request. + num_running_requests: The number of requests in the RUNNING state. Returns: The number of common prefix blocks. 
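Pulling together the profiler switches documented a few patches above (`VLLM_TORCH_PROFILER_DIR` plus the new `VLLM_TORCH_PROFILER_*` toggles), a hedged end-to-end invocation could look like this. The trace directory and model are placeholders; the variable names come from the profiling docs and `envs.py` changes.

```bash
# Save PyTorch profiler traces under /mnt/traces and opt in to
# shape, memory, stack, and FLOPs recording (values are examples).
VLLM_TORCH_PROFILER_DIR=/mnt/traces \
VLLM_TORCH_PROFILER_RECORD_SHAPES=1 \
VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY=1 \
VLLM_TORCH_PROFILER_WITH_STACK=1 \
VLLM_TORCH_PROFILER_WITH_FLOPS=1 \
    vllm serve Qwen/Qwen3-0.6B
```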
From c604c014f3a663ee97b55df753e72045b52514a1 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 30 Jul 2025 11:36:04 +0800 Subject: [PATCH 496/552] [Bugfix] Actually disable processing cache when API server is scaled out (#21839) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/entrypoints/cli/serve.py | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/vllm/entrypoints/cli/serve.py b/vllm/entrypoints/cli/serve.py index a69363e3d98..7dcba2cccdb 100644 --- a/vllm/entrypoints/cli/serve.py +++ b/vllm/entrypoints/cli/serve.py @@ -140,11 +140,16 @@ def run_multi_api_server(args: argparse.Namespace): num_api_servers = args.api_server_count assert num_api_servers > 0 + orig_disable_mm_preprocessor_cache = args.disable_mm_preprocessor_cache + # set_process_title("ProcManager") if num_api_servers > 1: setup_multiprocess_prometheus() + # Not compatible with API server scale-out + args.disable_mm_preprocessor_cache = True + listen_address, sock = setup_server(args) engine_args = vllm.AsyncEngineArgs.from_cli_args(args) @@ -161,11 +166,9 @@ def run_multi_api_server(args: argparse.Namespace): "with api_server_count > 1") if model_config.is_multimodal_model and not ( - model_config.disable_mm_preprocessor_cache): - logger.warning( - "Multi-model preprocessor cache will be disabled for" - " api_server_count > 1") - model_config.disable_mm_preprocessor_cache = True + orig_disable_mm_preprocessor_cache): + logger.warning("Multi-model preprocessor cache will be disabled " + "for api_server_count > 1") executor_class = Executor.get_class(vllm_config) log_stats = not engine_args.disable_log_stats From b083e075a40ddbf25a3ccc2aeb0957a9dda4b7ca Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Tue, 29 Jul 2025 23:50:46 -0400 Subject: [PATCH 497/552] [Perf] Using `__nv_fp8_e4m3` instead of `c10::e4m3` for `per_token_group_quant` (#21867) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- csrc/quantization/fp8/per_token_group_quant.cu | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/csrc/quantization/fp8/per_token_group_quant.cu b/csrc/quantization/fp8/per_token_group_quant.cu index 2609054f207..f5b40e35b6e 100644 --- a/csrc/quantization/fp8/per_token_group_quant.cu +++ b/csrc/quantization/fp8/per_token_group_quant.cu @@ -1,12 +1,10 @@ #include -#include #include "../per_token_group_quant_8bit.h" #include -#include -#include +#include #include @@ -199,7 +197,7 @@ void per_token_group_quant_8bit(const torch::Tensor& input, VLLM_DISPATCH_FLOATING_TYPES( input.scalar_type(), "per_token_group_quant_8bit", ([&] { if (dst_type == at::ScalarType::Float8_e4m3fn) { - LAUNCH_KERNEL(scalar_t, c10::Float8_e4m3fn); + LAUNCH_KERNEL(scalar_t, __nv_fp8_e4m3); } else if (dst_type == at::ScalarType::Char) { LAUNCH_KERNEL(scalar_t, int8_t); } From 61f843bc3d3644257403d086792fe8822440e3ba Mon Sep 17 00:00:00 2001 From: "wang.yuqi" Date: Wed, 30 Jul 2025 11:56:03 +0800 Subject: [PATCH 498/552] [Frontend] Add LLM.reward specific to reward models (#21720) Signed-off-by: wang.yuqi Signed-off-by: x22x22 --- examples/offline_inference/basic/embed.py | 3 +- examples/offline_inference/basic/reward.py | 53 ++++++++++++++++ tests/conftest.py | 4 ++ tests/models/language/pooling/test_reward.py | 2 +- .../pooling/test_truncation_control.py | 6 +- vllm/entrypoints/llm.py | 60 ++++++++++++++++++- 6 files changed, 121 insertions(+), 7 deletions(-) create mode 100644 examples/offline_inference/basic/reward.py diff --git 
a/examples/offline_inference/basic/embed.py b/examples/offline_inference/basic/embed.py index 526753bcef2..158836728be 100644 --- a/examples/offline_inference/basic/embed.py +++ b/examples/offline_inference/basic/embed.py @@ -12,10 +12,9 @@ def parse_args(): parser = EngineArgs.add_cli_args(parser) # Set example specific arguments parser.set_defaults( - model="intfloat/e5-mistral-7b-instruct", + model="intfloat/e5-small", runner="pooling", enforce_eager=True, - max_model_len=1024, ) return parser.parse_args() diff --git a/examples/offline_inference/basic/reward.py b/examples/offline_inference/basic/reward.py new file mode 100644 index 00000000000..aa173cf96f5 --- /dev/null +++ b/examples/offline_inference/basic/reward.py @@ -0,0 +1,53 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +from argparse import Namespace + +from vllm import LLM, EngineArgs +from vllm.utils import FlexibleArgumentParser + + +def parse_args(): + parser = FlexibleArgumentParser() + parser = EngineArgs.add_cli_args(parser) + # Set example specific arguments + parser.set_defaults( + model="internlm/internlm2-1_8b-reward", + runner="pooling", + enforce_eager=True, + max_model_len=1024, + trust_remote_code=True, + ) + return parser.parse_args() + + +def main(args: Namespace): + # Sample prompts. + prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", + ] + + # Create an LLM. + # You should pass runner="pooling" for reward models + llm = LLM(**vars(args)) + + # Generate rewards. The output is a list of PoolingRequestOutput. + outputs = llm.reward(prompts) + + # Print the outputs. + print("\nGenerated Outputs:\n" + "-" * 60) + for prompt, output in zip(prompts, outputs): + rewards = output.outputs.data + rewards_trimmed = ( + (str(rewards[:16])[:-1] + ", ...]") if len(rewards) > 16 else rewards + ) + print(f"Prompt: {prompt!r} \nReward: {rewards_trimmed} (size={len(rewards)})") + print("-" * 60) + + +if __name__ == "__main__": + args = parse_args() + main(args) diff --git a/tests/conftest.py b/tests/conftest.py index e4df6ebf2c2..67f0e742403 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -1053,6 +1053,10 @@ def encode(self, prompts: list[str]) -> list[list[float]]: req_outputs = self.llm.encode(prompts) return [req_output.outputs.data for req_output in req_outputs] + def reward(self, prompts: list[str]) -> list[list[float]]: + req_outputs = self.llm.reward(prompts) + return [req_output.outputs.data for req_output in req_outputs] + def score( self, text_1: Union[str, list[str]], diff --git a/tests/models/language/pooling/test_reward.py b/tests/models/language/pooling/test_reward.py index 3b7fab3ba5c..a5f7dca76d8 100644 --- a/tests/models/language/pooling/test_reward.py +++ b/tests/models/language/pooling/test_reward.py @@ -95,7 +95,7 @@ def test_prm_models( monkeypatch.setenv("VLLM_USE_TRITON_FLASH_ATTN", "False") with vllm_runner(model, max_model_len=1024, dtype=dtype) as vllm_model: - vllm_outputs = vllm_model.encode(math_step_prompts) + vllm_outputs = vllm_model.reward(math_step_prompts) with hf_runner(model, dtype=dtype, auto_cls=AutoModel) as hf_model: hf_model = step_reward_patch_hf_model(hf_model) diff --git a/tests/models/language/pooling/test_truncation_control.py b/tests/models/language/pooling/test_truncation_control.py index dc2bf21ef63..c6ef899958a 100644 --- a/tests/models/language/pooling/test_truncation_control.py +++ 
b/tests/models/language/pooling/test_truncation_control.py @@ -28,7 +28,7 @@ def test_smaller_truncation_size(vllm_runner, with vllm_runner(model_name, runner="pooling", max_model_len=max_model_len) as vllm_model: - vllm_output = vllm_model.llm.encode( + vllm_output = vllm_model.llm.embed( input_str, truncate_prompt_tokens=truncate_prompt_tokens) prompt_tokens = vllm_output[0].prompt_token_ids @@ -43,7 +43,7 @@ def test_max_truncation_size(vllm_runner, with vllm_runner(model_name, runner="pooling", max_model_len=max_model_len) as vllm_model: - vllm_output = vllm_model.llm.encode( + vllm_output = vllm_model.llm.embed( input_str, truncate_prompt_tokens=truncate_prompt_tokens) prompt_tokens = vllm_output[0].prompt_token_ids @@ -61,7 +61,7 @@ def test_bigger_truncation_size(vllm_runner, model_name, runner="pooling", max_model_len=max_model_len) as vllm_model: - llm_output = vllm_model.llm.encode( + llm_output = vllm_model.llm.embed( input_str, truncate_prompt_tokens=truncate_prompt_tokens) assert llm_output == f"""truncate_prompt_tokens value diff --git a/vllm/entrypoints/llm.py b/vllm/entrypoints/llm.py index adef350931f..842a22cceba 100644 --- a/vllm/entrypoints/llm.py +++ b/vllm/entrypoints/llm.py @@ -1037,7 +1037,7 @@ def encode( truncate_prompt_tokens: Optional[int] = None, use_tqdm: Union[bool, Callable[..., tqdm]] = True, lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, - pooling_task: PoolingTask = "encode", + pooling_task: Optional[PoolingTask] = None, tokenization_kwargs: Optional[dict[str, Any]] = None, ) -> list[PoolingRequestOutput]: """Apply pooling to the hidden states corresponding to the input @@ -1069,6 +1069,25 @@ def encode( considered legacy and may be deprecated in the future. You should instead pass them via the `inputs` parameter. """ + if pooling_task is None: + if "embed" in self.supported_tasks: + pooling_task = "embed" + else: + pooling_task = "encode" + + logger.warning_once( + "`LLM.encode` is currently using `pooling_task = %s`.\n" + "Please use one of the more specific methods or set the " + "task directly when using `LLM.encode`:\n" + " - For embeddings, use `LLM.embed(...)` " + "or `pooling_task=\"embed\"`.\n" + " - For classification logits, use `LLM.classify(...)` " + "or `pooling_task=\"classify\"`.\n" + " - For rewards, use `LLM.reward(...)` " + "or `pooling_task=\"reward\"`\n" + " - For similarity scores, use `LLM.score(...)`.", + pooling_task) + model_config = self.llm_engine.model_config runner_type = model_config.runner_type if runner_type != "pooling": @@ -1207,6 +1226,45 @@ def classify( return [ClassificationRequestOutput.from_base(item) for item in items] + def reward( + self, + prompts: Union[PromptType, Sequence[PromptType]], + /, + *, + truncate_prompt_tokens: Optional[int] = None, + use_tqdm: Union[bool, Callable[..., tqdm]] = True, + pooling_params: Optional[Union[PoolingParams, + Sequence[PoolingParams]]] = None, + lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None, + ) -> list[PoolingRequestOutput]: + """ + Generate rewards for each prompt. + + Args: + prompts: The prompts to the LLM. You may pass a sequence of prompts + for batch inference. See [PromptType][vllm.inputs.PromptType] + for more details about the format of each prompts. + use_tqdm: If `True`, shows a tqdm progress bar. + If a callable (e.g., `functools.partial(tqdm, leave=False)`), + it is used to create the progress bar. + If `False`, no progress bar is created. + lora_request: LoRA request to use for generation, if any. 
+ pooling_params: The pooling parameters for pooling. If None, we + use the default pooling parameters. + Returns: + A list of `PoolingRequestOutput` objects containing the + pooled hidden states in the same order as the input prompts. + """ + + return self.encode( + prompts, + use_tqdm=use_tqdm, + lora_request=lora_request, + pooling_params=pooling_params, + truncate_prompt_tokens=truncate_prompt_tokens, + pooling_task="encode", + ) + def _embedding_score( self, tokenizer: AnyTokenizer, From 3da602d4016beec78c6d413593c000c777d320d2 Mon Sep 17 00:00:00 2001 From: Kunshang Ji Date: Wed, 30 Jul 2025 11:56:14 +0800 Subject: [PATCH 499/552] [XPU] use `ZE_AFFINITY_MASK` for device select on xpu (#21815) Signed-off-by: Kunshang Ji Signed-off-by: x22x22 --- vllm/platforms/xpu.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/platforms/xpu.py b/vllm/platforms/xpu.py index 1d0bb365492..d8a663f2f0c 100644 --- a/vllm/platforms/xpu.py +++ b/vllm/platforms/xpu.py @@ -30,7 +30,7 @@ class XPUPlatform(Platform): # see https://github.com/ray-project/ray/blob/6a5eb5865eeb9ccf058a79b44f107e327e360673/python/ray/_private/accelerators/intel_gpu.py#L20 # noqa: E501 ray_device_key: str = "GPU" dist_backend: str = "ccl" # ccl | xccl - device_control_env_var: str = "ONEAPI_DEVICE_SELECTOR" + device_control_env_var: str = "ZE_AFFINITY_MASK" @classmethod def get_attn_backend_cls(cls, selected_backend: _Backend, head_size: int, From e725b72255aab840d06fae2fb0648680d0f078af Mon Sep 17 00:00:00 2001 From: Tao He Date: Wed, 30 Jul 2025 12:30:44 +0800 Subject: [PATCH 500/552] Add @sighingnow as maintainer of qwen's related files. (#21895) Signed-off-by: Tao He Signed-off-by: x22x22 --- .github/CODEOWNERS | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index a3b2713430e..fb9f44353ce 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -61,3 +61,7 @@ mkdocs.yaml @hmellor /vllm/v1/worker/^xpu @jikunshang /vllm/platforms/xpu.py @jikunshang /docker/Dockerfile.xpu @jikunshang + +# Qwen-specific files +/vllm/attention/backends/dual_chunk_flash_attn.py @sighingnow +/vllm/model_executor/models/qwen* @sighingnow From a63c8b0bc259e63c4eaaa20be26e9d283735ebbb Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 30 Jul 2025 12:53:08 +0800 Subject: [PATCH 501/552] [CI/Build] Fix pre-commit failure in docs (#21897) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- docs/design/fused_moe_modular_kernel.md | 63 +++++++++++++++++-------- 1 file changed, 43 insertions(+), 20 deletions(-) diff --git a/docs/design/fused_moe_modular_kernel.md b/docs/design/fused_moe_modular_kernel.md index 0943454d642..3ef1232051b 100644 --- a/docs/design/fused_moe_modular_kernel.md +++ b/docs/design/fused_moe_modular_kernel.md @@ -1,6 +1,7 @@ # Fused MoE Modular Kernel ## Introduction + FusedMoEModularKernel is implemented [here](gh-file:/vllm/model_executor/layers/fused_moe/modular_kernel.py) Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types. @@ -31,7 +32,8 @@ As can be seen from the diagrams, there are a lot of operations and there can be The rest of the document will focus on the Contiguous / Non-Batched case. Extrapolating to the Batched case should be straight-forward. -## ModularKernel Components: +## ModularKernel Components + FusedMoEModularKernel splits the FusedMoE operation into 3 parts, 1. TopKWeightAndReduce @@ -39,6 +41,7 @@ FusedMoEModularKernel splits the FusedMoE operation into 3 parts, 3. 
FusedMoEPermuteExpertsUnpermute ### TopKWeightAndReduce + The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that the `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEPermuteExpertsUnpermute`. But some implementations choose to do it `FusedMoEPrepareAndFinalize`. In order to enable this flexibility, we have a TopKWeightAndReduce abstract class. Please find the implementations of TopKWeightAndReduce [here](gh-file:vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py). @@ -50,12 +53,14 @@ The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExperts * `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceContiguous` / `TopKWeightAndReduceNaiveBatched` / `TopKWeightAndReduceDelegate` if the `FusedMoEPermuteExpertsUnpermute` implementation needs the `FusedMoEPrepareAndFinalize::finalize()` to do the weight application and reduction. ### FusedMoEPrepareAndFinalize + The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare` and `finalize` functions. The `prepare` function is responsible for input activation Quantization and All2All Dispatch. The `finalize` function is responsible for invoking the All2All Combine. Additionally the `finalize` function may or may not do the TopK weight application and reduction (Please refer to the TopKWeightAndReduce section) ![](../assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png "FusedMoEPrepareAndFinalize Blocks") ### FusedMoEPermuteExpertsUnpermute + The `FusedMoEPermuteExpertsUnpermute` class is where the crux of the MoE operations happen. The `FusedMoEPermuteExpertsUnpermute` abstract class exposes a few important functions, * apply() @@ -63,6 +68,7 @@ The `FusedMoEPermuteExpertsUnpermute` class is where the crux of the MoE operati * finalize_weight_and_reduce_impl() #### apply() + The `apply` method is where the implementations perform * Permute @@ -74,50 +80,56 @@ The `apply` method is where the implementations perform * Maybe TopK Weight Application + Reduction #### workspace_shapes() + The core FusedMoE implementation performs a series of operations. It would be inefficient to create output memory for each of these operations separately. To that effect, implementations are required to declare 2 workspace shapes, the workspace datatype and the FusedMoE output shape as outputs of the workspace_shapes() method. This information is used to allocate the workspace tensors and the output tensor in `FusedMoEModularKernel::forward()` and passed on to the `FusedMoEPermuteExpertsUnpermute::apply()` method. The workspaces could then be used as intermediate buffers in the FusedMoE implementation. #### finalize_weight_and_reduce_impl() + It is sometimes efficient to perform TopK weight application and Reduction inside the `FusedMoEPermuteExpertsUnpermute::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section. `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalize::finalize()` to use. 
![](../assets/design/fused_moe_modular_kernel/fused_experts_blocks.png "FusedMoEPermuteExpertsUnpermute Blocks") ### FusedMoEModularKernel + `FusedMoEModularKernel` is composed of the `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` objects. `FusedMoEModularKernel` pseudocode/sketch, -``` -FusedMoEModularKernel::__init__(self, - prepare_finalize: FusedMoEPrepareAndFinalize, - fused_experts: FusedMoEPermuteExpertsUnpermute): +```py +class FusedMoEModularKernel: + def __init__(self, + prepare_finalize: FusedMoEPrepareAndFinalize, + fused_experts: FusedMoEPermuteExpertsUnpermute): + + self.prepare_finalize = prepare_finalize + self.fused_experts = fused_experts - self.prepare_finalize = prepare_finalize - self.fused_experts = fused_experts + def forward(self, DP_A): -FusedMoEModularKernel::forward(self, DP_A): + Aq, A_scale, _, _, _ = self.prepare_finalize.prepare(DP_A, ...) - Aq, A_scale, _, _, _ = self.prepare_finalize.prepare(DP_A, ...) + workspace13_shape, workspace2_shape, _, _ = self.fused_experts.workspace_shapes(...) - workspace13_shape, workspace2_shape, _, _ = self.fused_experts.workspace_shapes(...) + # allocate workspaces + workspace_13 = torch.empty(workspace13_shape, ...) + workspace_2 = torch.empty(workspace2_shape, ...) - # allocate workspaces - workspace_13 = torch.empty(workspace13_shape, ...) - workspace_2 = torch.empty(workspace2_shape, ...) + # execute fused_experts + fe_out = self.fused_experts.apply(Aq, A_scale, workspace13, workspace2, ...) - # execute fused_experts - fe_out = self.fused_experts.apply(Aq, A_scale, workspace13, workspace2, ...) + # war_impl is an object of type TopKWeightAndReduceNoOp if the fused_experts implementations + # performs the TopK Weight Application and Reduction. + war_impl = self.fused_experts.finalize_weight_and_reduce_impl() - # war_impl is an object of type TopKWeightAndReduceNoOp if the fused_experts implementations performs the TopK Weight Application and Reduction. - war_impl = self.fused_experts.finalize_weight_and_reduce_impl() + output = self.prepare_finalize.finalize(fe_out, war_impl,...) - output = self.prepare_finalize.finalize(fe_out, war_impl,...) - - return output + return output ``` ## How-To ### How To Add a FusedMoEPrepareAndFinalize Type + Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & Combine implementation / kernel. For example, * PplxPrepareAndFinalize type is backed by Pplx All2All kernels, @@ -125,9 +137,11 @@ Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & C * DeepEPLLPrepareAndFinalize type is backed by DeepEP Low-Latency All2All kernels. #### Step 1: Add an All2All manager + The purpose of the All2All Manager is to setup the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](gh-file:vllm/distributed/device_communicators/all2all.py). #### Step 2: Add a FusedMoEPrepareAndFinalize Type + This section describes the significance of the various functions exposed by the `FusedMoEPrepareAndFinalize` abstract class. `FusedMoEPrepareAndFinalize::prepare()`: The prepare method implements the Quantization and All2All Dispatch. Typically the Dispatch function from the relevant All2All Manager is invoked. 
@@ -145,6 +159,7 @@ This section describes the significance of the various functions exposed by the We suggest picking an already existing `FusedMoEPrepareAndFinalize` implementation that matches your All2All implementation closely and using it as a reference. ### How To Add a FusedMoEPermuteExpertsUnpermute Type + FusedMoEPermuteExpertsUnpermute performs the core of the FusedMoE operations. The various functions exposed by the abstract class and their significance are as follows, `FusedMoEPermuteExpertsUnpermute::activation_formats()`: Return the supported Input and Output activation formats, i.e. Contiguous / Batched format. @@ -159,12 +174,14 @@ implementations that input `FusedMoEActivationFormat.Standard` support chunking `FusedMoEPermuteExpertsUnpermute::apply`: Refer to `FusedMoEPermuteExpertsUnpermute` section above. ### FusedMoEModularKernel Initialization + The `FusedMoEMethodBase` class has 2 methods that are collectively responsible for creating the `FusedMoEModularKernel` object. They are, * select_gemm_impl, and * init_prepare_finalize #### select_gemm_impl + The `select_gemm_impl` method is undefined in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEPermuteExpertsUnpermute` object. Please refer to the implementations in, @@ -176,12 +193,14 @@ Please refer to the implementations in, derived classes. #### init_prepare_finalize + Based on the input and env settings, the `init_prepare_finalize` method creates the appropriate `FusedMoEPrepareAndFinalize` object. The method then queries `select_gemm_impl` for the appropriate `FusedMoEPermuteExpertsUnpermute` object and builds the `FusedMoEModularKernel` object. Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vllm/blob/1cbf951ba272c230823b947631065b826409fa62/vllm/model_executor/layers/fused_moe/layer.py#L188). **Important**: The `FusedMoEMethodBase` derived classes use the `FusedMoEMethodBase::fused_experts` object in their `apply` methods. When settings permit the construction of a valid `FusedMoEModularKernel` object, we override `FusedMoEMethodBase::fused_experts` with it. This essentially makes the derived classes agnostic to what FusedMoE implementation is used. ### How To Unit Test + We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py). The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types and if they are @@ -196,18 +215,21 @@ If you are adding some `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnp Doing this will add the new implementation to the test suite. ### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility + The unit test file [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script. Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts` As a side-effect, this script can be used to test `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked with incompatible types, the script will error.
### How To Profile + Please take a look at [profile_modular_kernel.py](gh-file:tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py) The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types. Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts` ## FusedMoEPrepareAndFinalize Implementations + The following table lists the `FusedMoEPrepareAndFinalize` implementations at the time of writing, | Implementation | Type | Comments | @@ -220,6 +242,7 @@ The following table lists the `FusedMoEPrepareAndFinalize` implementations at th | BatchedPrepareAndFinalize | Batched | A reference prepare/finalize class that reorganizes the tokens into expert batched format, i.e. E x max_num_tokens x K. (Doesn’t use any all2all kernels. This is primarily used in unit testing) | ## FusedMoEPermuteExpertsUnpermute + The following table lists the `FusedMoEPermuteExpertsUnpermute` implementations at the time of writing, | Implementation | Type | Comment | From 0d9cc64d90c8128ae657b83ce4753ac29a457f8f Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Tue, 29 Jul 2025 22:07:28 -0700 Subject: [PATCH 502/552] [Docs] Expand introduction to Ray in Multi-node deployment section (#21584) Signed-off-by: Ricardo Decal Signed-off-by: x22x22 --- docs/serving/distributed_serving.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index 93049765727..08d889a00d2 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -58,7 +58,17 @@ vllm serve gpt2 \ ## Multi-node deployment -If a single node lacks sufficient GPUs to hold the model, deploy vLLM across multiple nodes. Multi-node deployments require Ray as the runtime engine. Ensure that every node provides an identical execution environment, including the model path and Python packages. Using container images is recommended because they provide a convenient way to keep environments consistent and to hide host heterogeneity. +If a single node lacks sufficient GPUs to hold the model, deploy vLLM across multiple nodes. Ensure that every node provides an identical execution environment, including the model path and Python packages. Using container images is recommended because they provide a convenient way to keep environments consistent and to hide host heterogeneity. + +### What is Ray? + +Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments require Ray as the runtime engine. + +vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens. + +Ray also offers high-level APIs for large-scale [offline batch inference](https://docs.ray.io/en/latest/data/working-with-llms.html) and [online serving](https://docs.ray.io/en/latest/serve/llm/serving-llms.html) that can leverage vLLM as the engine. These APIs add production-grade fault tolerance, scaling, and distributed observability to vLLM workloads. + +For details, see the [Ray documentation](https://docs.ray.io/en/latest/index.html). 
### Ray cluster setup with containers From 7ac1bce79280d0541fdf01d889d09fc079ff433d Mon Sep 17 00:00:00 2001 From: Louie Tsai Date: Tue, 29 Jul 2025 22:57:03 -0700 Subject: [PATCH 503/552] Update vLLM Benchmark Suite for Xeon based on 0.9.2 release (#21486) Signed-off-by: Tsai, Louie Signed-off-by: x22x22 --- .../convert-results-json-to-markdown.py | 1 + .../scripts/run-performance-benchmarks.sh | 2 +- .../tests/serving-tests-cpu-snc2.json | 209 +++++++++++++++++ .../tests/serving-tests-cpu-snc3.json | 211 ++++++++++++++++++ .../tests/serving-tests-cpu.json | 15 ++ 5 files changed, 437 insertions(+), 1 deletion(-) create mode 100644 .buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc2.json create mode 100644 .buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc3.json diff --git a/.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py b/.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py index 05623879c0c..554256b4bdb 100644 --- a/.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py +++ b/.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py @@ -44,6 +44,7 @@ "test_name": "Test name", "gpu_type": "GPU", "completed": "# of req.", + "max_concurrency": "# of max concurrency.", "request_throughput": "Tput (req/s)", "total_token_throughput": "Total Token Tput (tok/s)", "output_throughput": "Output Tput (tok/s)", diff --git a/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh b/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh index b515ee43934..2c57666a81a 100644 --- a/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh +++ b/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh @@ -33,7 +33,7 @@ check_gpus() { check_cpus() { # check the number of CPUs and NUMA Node and GPU type. - declare -g numa_count=$(python3 -c "from numa import info;numa_size = info.get_num_configured_nodes(); print(numa_size)") + declare -g numa_count=$(lscpu | grep "NUMA node(s):" | awk '{print $3}') if [[ $numa_count -gt 0 ]]; then echo "NUMA found." 
echo $numa_count diff --git a/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc2.json b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc2.json new file mode 100644 index 00000000000..a144b4420fb --- /dev/null +++ b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc2.json @@ -0,0 +1,209 @@ +[ + { + "test_name": "serving_llama8B_tp1_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 1, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_tp2_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 2, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_tp4_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 4, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_tp1_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 1, + "dtype": "bfloat16", + 
"distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + }, + { + "test_name": "serving_llama8B_tp2_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 2, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + }, + { + "test_name": "serving_llama8B_tp4_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 4, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + } +] diff --git a/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc3.json b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc3.json new file mode 100644 index 00000000000..e6e69b63b74 --- /dev/null +++ b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc3.json @@ -0,0 +1,211 @@ +[ + { + "test_name": "serving_llama8B_pp1_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "pipeline_parallel_size": 1, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + 
"backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_pp3_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "pipeline_parallel_size": 3, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_tp2pp6_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 2, + "pipeline_parallel_size": 3, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_pp1_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "pipeline_parallel_size": 1, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + }, + { + "test_name": "serving_llama8B_pp3_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL:": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "pipeline_parallel_size": 3, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", 
+ "enable_chunked_prefill": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + }, + { + "test_name": "serving_llama8B_tp2pp3_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 2, + "pipeline_parallel_size": 3, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "disable_log_requests": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + } +] diff --git a/.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json index 22f71c993ff..ce1f924de38 100644 --- a/.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json +++ b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json @@ -6,6 +6,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -18,6 +19,8 @@ "disable_log_stats": "", "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { @@ -36,6 +39,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -48,6 +52,8 @@ "disable_log_stats": "", "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { @@ -66,6 +72,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -78,6 +85,8 @@ "disable_log_stats": "", "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { @@ -96,6 +105,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -109,6 +119,8 @@ "disable_log_stats": "", "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { @@ -129,6 +141,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, 
"VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -142,6 +155,8 @@ "disable_log_stats": "", "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { From d56483b81afe3107695354e18c3647da30e30df5 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 30 Jul 2025 14:54:18 +0800 Subject: [PATCH 504/552] [Misc] Remove redundant config definitions (#21891) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/model_executor/models/aimv2.py | 22 +- vllm/model_executor/models/dbrx.py | 14 +- vllm/model_executor/models/exaone.py | 8 +- vllm/model_executor/models/exaone4.py | 6 +- vllm/model_executor/models/keye.py | 3 - vllm/model_executor/models/minimax_vl_01.py | 7 +- vllm/model_executor/models/mpt.py | 8 +- vllm/model_executor/models/ovis.py | 13 +- vllm/transformers_utils/config.py | 28 +- vllm/transformers_utils/configs/__init__.py | 30 +- vllm/transformers_utils/configs/cohere2.py | 195 ------------ vllm/transformers_utils/configs/dbrx.py | 280 ------------------ vllm/transformers_utils/configs/exaone.py | 190 ------------ vllm/transformers_utils/configs/exaone4.py | 252 ---------------- .../configs/minimax_text_01.py | 70 ----- .../configs/minimax_vl_01.py | 71 ----- vllm/transformers_utils/configs/mpt.py | 180 ----------- vllm/transformers_utils/configs/nvlm_d.py | 31 -- vllm/transformers_utils/configs/ovis.py | 184 ------------ vllm/transformers_utils/configs/skyworkr1v.py | 54 ---- vllm/transformers_utils/configs/solar.py | 247 --------------- vllm/transformers_utils/configs/telechat2.py | 64 ---- .../transformers_utils/processors/__init__.py | 7 + 23 files changed, 54 insertions(+), 1910 deletions(-) delete mode 100644 vllm/transformers_utils/configs/cohere2.py delete mode 100644 vllm/transformers_utils/configs/dbrx.py delete mode 100644 vllm/transformers_utils/configs/exaone.py delete mode 100644 vllm/transformers_utils/configs/exaone4.py delete mode 100644 vllm/transformers_utils/configs/minimax_text_01.py delete mode 100644 vllm/transformers_utils/configs/minimax_vl_01.py delete mode 100644 vllm/transformers_utils/configs/mpt.py delete mode 100644 vllm/transformers_utils/configs/nvlm_d.py delete mode 100644 vllm/transformers_utils/configs/ovis.py delete mode 100644 vllm/transformers_utils/configs/skyworkr1v.py delete mode 100644 vllm/transformers_utils/configs/solar.py delete mode 100644 vllm/transformers_utils/configs/telechat2.py diff --git a/vllm/model_executor/models/aimv2.py b/vllm/model_executor/models/aimv2.py index b13d863ebb7..d2307bb464b 100644 --- a/vllm/model_executor/models/aimv2.py +++ b/vllm/model_executor/models/aimv2.py @@ -8,6 +8,7 @@ import torch import torch.nn as nn +from transformers import PretrainedConfig from vllm.attention.layer import MultiHeadAttention from vllm.distributed import get_tensor_model_parallel_world_size @@ -20,13 +21,12 @@ from vllm.model_executor.layers.quantization.base_config import ( QuantizationConfig) from vllm.model_executor.model_loader.weight_utils import default_weight_loader -from vllm.transformers_utils.configs.ovis import AIMv2Config class AIMv2SwiGLUFFN(nn.Module): - def __init__(self, config: AIMv2Config, quant_config: QuantizationConfig, - prefix: str): + def __init__(self, config: PretrainedConfig, + quant_config: QuantizationConfig, prefix: str): super().__init__() hidden_features = config.intermediate_size in_features = config.hidden_size @@ -57,7 +57,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class 
AIMv2PatchEmbed(nn.Module): - def __init__(self, config: AIMv2Config): + def __init__(self, config: PretrainedConfig): super().__init__() self.proj = nn.Conv2d( config.num_channels, @@ -75,7 +75,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class AIMv2ViTPreprocessor(nn.Module): - def __init__(self, config: AIMv2Config): + def __init__(self, config: PretrainedConfig): super().__init__() num_patches = (config.image_size // config.patch_size)**2 @@ -93,8 +93,8 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class AIMv2Attention(nn.Module): - def __init__(self, config: AIMv2Config, quant_config: QuantizationConfig, - prefix: str): + def __init__(self, config: PretrainedConfig, + quant_config: QuantizationConfig, prefix: str): super().__init__() self.config = config self.embed_dim = config.hidden_size @@ -141,8 +141,8 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class AIMv2Block(nn.Module): - def __init__(self, config: AIMv2Config, quant_config: QuantizationConfig, - prefix: str): + def __init__(self, config: PretrainedConfig, + quant_config: QuantizationConfig, prefix: str): super().__init__() self.attn = AIMv2Attention(config, quant_config=quant_config, @@ -163,7 +163,7 @@ class AIMv2Transformer(nn.Module): def __init__( self, - config: AIMv2Config, + config: PretrainedConfig, quant_config: QuantizationConfig, *, require_post_norm: Optional[bool] = None, @@ -193,7 +193,7 @@ def forward(self, tokens: torch.Tensor) -> torch.Tensor: class AIMv2Model(torch.nn.Module): def __init__(self, - config: AIMv2Config, + config: PretrainedConfig, quant_config: QuantizationConfig, *, require_post_norm: Optional[bool] = None, diff --git a/vllm/model_executor/models/dbrx.py b/vllm/model_executor/models/dbrx.py index 7a4dd69443a..360c7e66bf5 100644 --- a/vllm/model_executor/models/dbrx.py +++ b/vllm/model_executor/models/dbrx.py @@ -6,6 +6,7 @@ import torch import torch.nn as nn +from transformers import PretrainedConfig from vllm.attention import Attention from vllm.config import CacheConfig, VllmConfig @@ -24,7 +25,6 @@ default_weight_loader, maybe_remap_kv_scale_name) from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from vllm.transformers_utils.configs.dbrx import DbrxConfig from .interfaces import SupportsPP from .utils import (AutoWeightsLoader, is_pp_missing_parameter, @@ -39,7 +39,7 @@ class DbrxRouter(nn.Module): def __init__( self, - config: DbrxConfig, + config: PretrainedConfig, params_dtype: Optional[torch.dtype] = None, ): super().__init__() @@ -63,7 +63,7 @@ class DbrxExperts(FusedMoE): def __init__( self, - config: DbrxConfig, + config: PretrainedConfig, quant_config: Optional[QuantizationConfig] = None, params_dtype: Optional[torch.dtype] = None, prefix: str = "", @@ -138,7 +138,7 @@ class DbrxMoE(nn.Module): def __init__( self, - config: DbrxConfig, + config: PretrainedConfig, quant_config: Optional[QuantizationConfig] = None, params_dtype: Optional[torch.dtype] = None, prefix: str = "", @@ -169,7 +169,7 @@ class DbrxAttention(nn.Module): def __init__( self, - config: DbrxConfig, + config: PretrainedConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", @@ -249,7 +249,7 @@ class DbrxFusedNormAttention(nn.Module): def __init__( self, - config: DbrxConfig, + config: PretrainedConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", @@ -284,7 +284,7 @@ class 
DbrxBlock(nn.Module): def __init__( self, - config: DbrxConfig, + config: PretrainedConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", diff --git a/vllm/model_executor/models/exaone.py b/vllm/model_executor/models/exaone.py index aaf105ec255..8052b6bb823 100644 --- a/vllm/model_executor/models/exaone.py +++ b/vllm/model_executor/models/exaone.py @@ -30,6 +30,7 @@ import torch from torch import nn +from transformers import PretrainedConfig from vllm.attention import Attention from vllm.compilation.decorators import support_torch_compile @@ -49,7 +50,6 @@ default_weight_loader, maybe_remap_kv_scale_name) from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from vllm.transformers_utils.configs.exaone import ExaoneConfig from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, PPMissingLayer, is_pp_missing_parameter, @@ -99,7 +99,7 @@ class ExaoneAttention(nn.Module): def __init__( self, - config: ExaoneConfig, + config: PretrainedConfig, hidden_size: int, num_heads: int, num_kv_heads: int, @@ -194,7 +194,7 @@ class ExaoneBlockAttention(nn.Module): def __init__( self, - config: ExaoneConfig, + config: PretrainedConfig, hidden_size: int, num_heads: int, num_kv_heads: int, @@ -236,7 +236,7 @@ class ExaoneDecoderLayer(nn.Module): def __init__( self, - config: ExaoneConfig, + config: PretrainedConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", diff --git a/vllm/model_executor/models/exaone4.py b/vllm/model_executor/models/exaone4.py index 97aeb6fd7b1..3d6ce3e8895 100644 --- a/vllm/model_executor/models/exaone4.py +++ b/vllm/model_executor/models/exaone4.py @@ -26,6 +26,7 @@ import torch from torch import nn +from transformers import PretrainedConfig from vllm.attention import Attention from vllm.compilation.decorators import support_torch_compile @@ -45,7 +46,6 @@ default_weight_loader, maybe_remap_kv_scale_name) from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from vllm.transformers_utils.configs.exaone4 import Exaone4Config from .interfaces import SupportsLoRA, SupportsPP from .utils import (AutoWeightsLoader, PPMissingLayer, extract_layer_index, @@ -96,7 +96,7 @@ class Exaone4Attention(nn.Module): def __init__( self, - config: Exaone4Config, + config: PretrainedConfig, hidden_size: int, num_heads: int, num_kv_heads: int, @@ -224,7 +224,7 @@ class Exaone4DecoderLayer(nn.Module): def __init__( self, - config: Exaone4Config, + config: PretrainedConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", diff --git a/vllm/model_executor/models/keye.py b/vllm/model_executor/models/keye.py index 36e57b5e4f4..892d970aaad 100644 --- a/vllm/model_executor/models/keye.py +++ b/vllm/model_executor/models/keye.py @@ -980,9 +980,6 @@ def _parse_video_data( class KeyeProcessingInfo(BaseProcessingInfo): - def get_hf_config(self): - return self.ctx.get_hf_config(PretrainedConfig) - def get_hf_processor( self, *, diff --git a/vllm/model_executor/models/minimax_vl_01.py b/vllm/model_executor/models/minimax_vl_01.py index 9aba82cb115..62a7d37ec9d 100644 --- a/vllm/model_executor/models/minimax_vl_01.py +++ b/vllm/model_executor/models/minimax_vl_01.py @@ -5,7 +5,7 @@ import torch import torch.nn as nn -from transformers import BatchFeature +from transformers 
import BatchFeature, PretrainedConfig from vllm.config import VllmConfig from vllm.jsontree import json_map_leaves @@ -17,7 +17,6 @@ from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.inputs import MultiModalFieldConfig from vllm.sequence import IntermediateTensors -from vllm.transformers_utils.configs.minimax_vl_01 import MiniMaxVL01Config from .clip import CLIPVisionModel from .interfaces import MultiModalEmbeddings, SupportsMultiModal, SupportsPP @@ -90,8 +89,8 @@ class MiniMaxVL01DummyInputsBuilder(LlavaDummyInputsBuilder): class MiniMaxVL01ProcessingInfo(LlavaNextProcessingInfo): - def get_hf_config(self): - return self.ctx.get_hf_config(MiniMaxVL01Config) + def get_hf_config(self): # Need to override the config type + return self.ctx.get_hf_config(PretrainedConfig) def get_hf_processor(self, **kwargs: object): hf_processor = self.ctx.get_hf_processor(**kwargs) diff --git a/vllm/model_executor/models/mpt.py b/vllm/model_executor/models/mpt.py index 0878ada34d1..c243f575ae5 100644 --- a/vllm/model_executor/models/mpt.py +++ b/vllm/model_executor/models/mpt.py @@ -8,6 +8,7 @@ import torch import torch.nn as nn +from transformers import PretrainedConfig from vllm.attention import Attention from vllm.compilation.decorators import support_torch_compile @@ -25,7 +26,6 @@ from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from vllm.transformers_utils.configs.mpt import MPTConfig from .interfaces import SupportsPP from .utils import (AutoWeightsLoader, is_pp_missing_parameter, @@ -50,7 +50,7 @@ class MPTAttention(nn.Module): def __init__( self, - config: MPTConfig, + config: PretrainedConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", @@ -144,7 +144,7 @@ class MPTMLP(nn.Module): def __init__( self, - config: MPTConfig, + config: PretrainedConfig, quant_config: Optional[QuantizationConfig] = None, ): super().__init__() @@ -176,7 +176,7 @@ class MPTBlock(nn.Module): def __init__( self, - config: MPTConfig, + config: PretrainedConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", diff --git a/vllm/model_executor/models/ovis.py b/vllm/model_executor/models/ovis.py index 111628d8d18..c8b528048b5 100644 --- a/vllm/model_executor/models/ovis.py +++ b/vllm/model_executor/models/ovis.py @@ -25,7 +25,7 @@ import torch.nn as nn from torch import Tensor from torch.nn.functional import gumbel_softmax, pad, softmax -from transformers import BaseImageProcessor, BatchFeature +from transformers import BaseImageProcessor, BatchFeature, PretrainedConfig from vllm.config import VllmConfig from vllm.model_executor.layers.linear import ReplicatedLinear @@ -48,8 +48,6 @@ BaseProcessingInfo, PromptReplacement) from vllm.multimodal.profiling import BaseDummyInputsBuilder from vllm.sequence import IntermediateTensors -from vllm.transformers_utils.configs.ovis import (BaseVisualTokenizerConfig, - OvisConfig) from vllm.transformers_utils.processors.ovis import OvisProcessor from .interfaces import MultiModalEmbeddings, SupportsMultiModal, SupportsPP @@ -83,7 +81,7 @@ class VisualTokenizer(torch.nn.Module): def __init__( self, - config: BaseVisualTokenizerConfig, + config: PretrainedConfig, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", ): @@ -107,7 +105,7 @@ def __init__( def _init_backbone( self, - 
config: BaseVisualTokenizerConfig, + config: PretrainedConfig, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", ) -> nn.Module: @@ -247,9 +245,6 @@ def dtype(self): class OvisProcessingInfo(BaseProcessingInfo): - def get_hf_config(self): - return self.ctx.get_hf_config(OvisConfig) - def get_hf_processor(self, **kwargs): return self.ctx.get_hf_processor( OvisProcessor, @@ -417,7 +412,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): config = vllm_config.model_config.hf_config quant_config = vllm_config.quant_config - self.config: OvisConfig = config + self.config: PretrainedConfig = config self.llm = init_vllm_registered_model( vllm_config=vllm_config.with_hf_config(config.get_text_config()), prefix=maybe_prefix(prefix, "llm"), diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py index 04ff08825bb..40a6a9118e5 100644 --- a/vllm/transformers_utils/config.py +++ b/vllm/transformers_utils/config.py @@ -29,19 +29,13 @@ from vllm.logger import init_logger # yapf conflicts with isort for this block # yapf: disable -from vllm.transformers_utils.configs import (ChatGLMConfig, Cohere2Config, - DbrxConfig, DeepseekVLV2Config, - EAGLEConfig, Exaone4Config, - ExaoneConfig, JAISConfig, +from vllm.transformers_utils.configs import (ChatGLMConfig, DeepseekVLV2Config, + EAGLEConfig, JAISConfig, KimiVLConfig, MedusaConfig, - MiniMaxText01Config, - MiniMaxVL01Config, MllamaConfig, - MLPSpeculatorConfig, MPTConfig, + MllamaConfig, MLPSpeculatorConfig, Nemotron_Nano_VL_Config, - NemotronConfig, NVLM_D_Config, - OvisConfig, RWConfig, - SkyworkR1VChatConfig, SolarConfig, - Telechat2Config, UltravoxConfig) + NemotronConfig, RWConfig, + UltravoxConfig) # yapf: enable from vllm.transformers_utils.configs.mistral import adapt_config_dict from vllm.transformers_utils.utils import check_gguf_file @@ -77,28 +71,16 @@ def _get_hf_token() -> Optional[str]: _CONFIG_REGISTRY: dict[str, type[PretrainedConfig]] = { "chatglm": ChatGLMConfig, - "cohere2": Cohere2Config, - "dbrx": DbrxConfig, "deepseek_vl_v2": DeepseekVLV2Config, "kimi_vl": KimiVLConfig, "Llama_Nemotron_Nano_VL": Nemotron_Nano_VL_Config, - "mpt": MPTConfig, "RefinedWeb": RWConfig, # For tiiuae/falcon-40b(-instruct) "RefinedWebModel": RWConfig, # For tiiuae/falcon-7b(-instruct) "jais": JAISConfig, "mlp_speculator": MLPSpeculatorConfig, "medusa": MedusaConfig, "eagle": EAGLEConfig, - "exaone": ExaoneConfig, - "exaone4": Exaone4Config, - "minimax_text_01": MiniMaxText01Config, - "minimax_vl_01": MiniMaxVL01Config, "nemotron": NemotronConfig, - "NVLM_D": NVLM_D_Config, - "ovis": OvisConfig, - "solar": SolarConfig, - "skywork_chat": SkyworkR1VChatConfig, - "telechat": Telechat2Config, "ultravox": UltravoxConfig, **_CONFIG_REGISTRY_OVERRIDE_HF } diff --git a/vllm/transformers_utils/configs/__init__.py b/vllm/transformers_utils/configs/__init__.py index 89303213a27..0fcb2beb8c7 100644 --- a/vllm/transformers_utils/configs/__init__.py +++ b/vllm/transformers_utils/configs/__init__.py @@ -1,13 +1,15 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +Model configs may be defined in this directory for the following reasons: + +- There is no configuration file defined by HF Hub or Transformers library. +- There is a need to override the existing config to support vLLM. 
+""" from vllm.transformers_utils.configs.chatglm import ChatGLMConfig -from vllm.transformers_utils.configs.cohere2 import Cohere2Config -from vllm.transformers_utils.configs.dbrx import DbrxConfig from vllm.transformers_utils.configs.deepseek_vl2 import DeepseekVLV2Config from vllm.transformers_utils.configs.eagle import EAGLEConfig -from vllm.transformers_utils.configs.exaone import ExaoneConfig -from vllm.transformers_utils.configs.exaone4 import Exaone4Config # RWConfig is for the original tiiuae/falcon-40b(-instruct) and # tiiuae/falcon-7b(-instruct) models. Newer Falcon models will use the # `FalconConfig` class from the official HuggingFace transformers library. @@ -15,36 +17,21 @@ from vllm.transformers_utils.configs.jais import JAISConfig from vllm.transformers_utils.configs.kimi_vl import KimiVLConfig from vllm.transformers_utils.configs.medusa import MedusaConfig -from vllm.transformers_utils.configs.minimax_text_01 import MiniMaxText01Config -from vllm.transformers_utils.configs.minimax_vl_01 import MiniMaxVL01Config from vllm.transformers_utils.configs.mllama import MllamaConfig from vllm.transformers_utils.configs.mlp_speculator import MLPSpeculatorConfig from vllm.transformers_utils.configs.moonvit import MoonViTConfig -from vllm.transformers_utils.configs.mpt import MPTConfig from vllm.transformers_utils.configs.nemotron import NemotronConfig from vllm.transformers_utils.configs.nemotron_h import NemotronHConfig from vllm.transformers_utils.configs.nemotron_vl import Nemotron_Nano_VL_Config -from vllm.transformers_utils.configs.nvlm_d import NVLM_D_Config -from vllm.transformers_utils.configs.ovis import OvisConfig -from vllm.transformers_utils.configs.skyworkr1v import SkyworkR1VChatConfig -from vllm.transformers_utils.configs.solar import SolarConfig -from vllm.transformers_utils.configs.telechat2 import Telechat2Config from vllm.transformers_utils.configs.ultravox import UltravoxConfig __all__ = [ "ChatGLMConfig", - "Cohere2Config", - "DbrxConfig", "DeepseekVLV2Config", - "MPTConfig", "RWConfig", "JAISConfig", "MedusaConfig", "EAGLEConfig", - "ExaoneConfig", - "Exaone4Config", - "MiniMaxText01Config", - "MiniMaxVL01Config", "MllamaConfig", "MLPSpeculatorConfig", "MoonViTConfig", @@ -52,10 +39,5 @@ "NemotronConfig", "NemotronHConfig", "Nemotron_Nano_VL_Config", - "NVLM_D_Config", - "OvisConfig", - "SkyworkR1VChatConfig", - "SolarConfig", - "Telechat2Config", "UltravoxConfig", ] diff --git a/vllm/transformers_utils/configs/cohere2.py b/vllm/transformers_utils/configs/cohere2.py deleted file mode 100644 index e547a9c281c..00000000000 --- a/vllm/transformers_utils/configs/cohere2.py +++ /dev/null @@ -1,195 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# ruff: noqa - -# Adapted from -# https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere2/configuration_cohere2.py -from transformers import PretrainedConfig -from transformers.modeling_rope_utils import rope_config_validation - - -class Cohere2Config(PretrainedConfig): - r""" - This is the configuration class to store the configuration of a [`CohereModel`]. It is used to instantiate an Cohere - model according to the specified arguments, defining the model architecture. - - Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the - documentation from [`PretrainedConfig`] for more information. 
Instantiating a configuration - with the defaults will yield a similar configuration to that of the [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) model. - - - Args: - vocab_size (`int`, *optional*, defaults to 256000): - Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the - `inputs_ids` passed when calling [`CohereModel`] - hidden_size (`int`, *optional*, defaults to 8192): - Dimension of the hidden representations. - intermediate_size (`int`, *optional*, defaults to 22528): - Dimension of the MLP representations. - logit_scale (`float`, *optional*, defaults to 0.0625): - The scaling factor for the output logits. - num_hidden_layers (`int`, *optional*, defaults to 40): - Number of hidden layers in the Transformer decoder. - num_attention_heads (`int`, *optional*, defaults to 64): - Number of attention heads for each attention layer in the Transformer decoder. - num_key_value_heads (`int`, *optional*): - This is the number of key_value heads that should be used to implement Grouped Query Attention. If - `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if - `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When - converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed - by meanpooling all the original heads within that group. For more details checkout [this - paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to - `num_attention_heads`. - hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): - The non-linear activation function (function or string) in the decoder. - max_position_embeddings (`int`, *optional*, defaults to 8192): - The maximum sequence length that this model might ever be used with. - initializer_range (`float`, *optional*, defaults to 0.02): - The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - layer_norm_eps (`float`, *optional*, defaults to 1e-05): - The epsilon used by the layer normalization. - use_cache (`bool`, *optional*, defaults to `True`): - Whether or not the model should return the last key/values attentions (not used by all models). Only - relevant if `config.is_decoder=True`. - pad_token_id (`int`, *optional*, defaults to 0): - Padding token id. - bos_token_id (`int`, *optional*, defaults to 5): - Beginning of stream token id. - eos_token_id (`int`, *optional*, defaults to 255001): - End of stream token id. - tie_word_embeddings (`bool`, *optional*, defaults to `True`): - Whether to tie weight embeddings - rope_theta (`float`, *optional*, defaults to 10000.0): - The base period of the RoPE embeddings. - rope_scaling (`dict`, *optional*): - Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type - and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value - accordingly. - Expected contents: - `rope_type` (`str`): - The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', - 'llama3'], with 'default' being the original RoPE implementation. - `factor` (`float`, *optional*): - Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In - most scaling types, a `factor` of x will enable the model to handle sequences of length x * - original maximum pre-trained length. 
- `original_max_position_embeddings` (`int`, *optional*): - Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during - pretraining. - `attention_factor` (`float`, *optional*): - Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention - computation. If unspecified, it defaults to value recommended by the implementation, using the - `factor` field to infer the suggested value. - `beta_fast` (`float`, *optional*): - Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear - ramp function. If unspecified, it defaults to 32. - `beta_slow` (`float`, *optional*): - Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear - ramp function. If unspecified, it defaults to 1. - `short_factor` (`list[float]`, *optional*): - Only used with 'longrope'. The scaling factor to be applied to short contexts (< - `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden - size divided by the number of attention heads divided by 2 - `long_factor` (`list[float]`, *optional*): - Only used with 'longrope'. The scaling factor to be applied to long contexts (< - `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden - size divided by the number of attention heads divided by 2 - `low_freq_factor` (`float`, *optional*): - Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE - `high_freq_factor` (`float`, *optional*): - Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE - attention_bias (`bool`, defaults to `False`, *optional*, defaults to `False`): - Whether to use a bias in the query, key, value and output projection layers during self-attention. - attention_dropout (`float`, *optional*, defaults to 0.0): - The dropout ratio for the attention probabilities. - sliding_window (`int`, *optional*, defaults to 4096): - Size of the sliding window attention context. - sliding_window_pattern (`int`, *optional*, defaults to 4): - Pattern for the sliding window attention. - cache_implementation (`str`, *optional*, defaults to `"hybrid"`): the cache type to be used with `generate`. 
- - ```python - >>> from transformers import Cohere2Model, Cohere2Config - - >>> # Initializing a Cohere Nextmodel configuration - >>> configuration = Cohere2Config() - - >>> # Initializing a model from the Cohere2 configuration - >>> model = Cohere2Model(configuration) # doctest: +SKIP - - >>> # Accessing the model configuration - >>> configuration = model.config # doctest: +SKIP - ``` - """ - - model_type = "cohere2" - keys_to_ignore_at_inference = ["past_key_values"] - - def __init__( - self, - vocab_size=256000, - hidden_size=8192, - intermediate_size=22528, - logit_scale=0.0625, - num_hidden_layers=40, - num_attention_heads=64, - num_key_value_heads=None, - hidden_act="silu", - max_position_embeddings=8192, - initializer_range=0.02, - layer_norm_eps=1e-5, - use_cache=True, - pad_token_id=0, - bos_token_id=5, - eos_token_id=255001, - tie_word_embeddings=True, - rope_theta=10000.0, - rope_scaling=None, - attention_bias=False, - attention_dropout=0.0, - sliding_window=4096, - sliding_window_pattern=4, - cache_implementation="hybrid", - **kwargs, - ): - self.vocab_size = vocab_size - self.max_position_embeddings = max_position_embeddings - self.hidden_size = hidden_size - self.logit_scale = logit_scale - self.intermediate_size = intermediate_size - self.num_hidden_layers = num_hidden_layers - self.num_attention_heads = num_attention_heads - - # for backward compatibility - if num_key_value_heads is None: - num_key_value_heads = num_attention_heads - - self.num_key_value_heads = num_key_value_heads - self.hidden_act = hidden_act - self.initializer_range = initializer_range - self.layer_norm_eps = layer_norm_eps - self.use_cache = use_cache - self.rope_theta = rope_theta - self.rope_scaling = rope_scaling - self.attention_bias = attention_bias - self.attention_dropout = attention_dropout - self.sliding_window = sliding_window - self.sliding_window_pattern = sliding_window_pattern - # Need to specify head_dim in the config so it can be used in the attention forward functions - self.head_dim = hidden_size // num_attention_heads - self.cache_implementation = cache_implementation - - # Validate the correctness of rotary position embeddings parameters - rope_config_validation(self) - - super().__init__( - pad_token_id=pad_token_id, - bos_token_id=bos_token_id, - eos_token_id=eos_token_id, - tie_word_embeddings=tie_word_embeddings, - **kwargs, - ) - - -__all__ = ["Cohere2Config"] diff --git a/vllm/transformers_utils/configs/dbrx.py b/vllm/transformers_utils/configs/dbrx.py deleted file mode 100644 index 7dbda99f85a..00000000000 --- a/vllm/transformers_utils/configs/dbrx.py +++ /dev/null @@ -1,280 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# yapf: disable -# ruff: noqa: E501 -# coding=utf-8 -# Copied from -# https://huggingface.co/databricks/dbrx-base/blob/main/configuration_dbrx.py -"""Dbrx configuration.""" - -from typing import Any, Optional - -from transformers.configuration_utils import PretrainedConfig -from transformers.utils import logging - -logger = logging.get_logger(__name__) - -DBRX_PRETRAINED_CONFIG_ARCHIVE_MAP = {} # type: ignore - - -class DbrxAttentionConfig(PretrainedConfig): - """Configuration class for Dbrx Attention. - - [`DbrxAttention`] class. It is used to instantiate attention layers - according to the specified arguments, defining the layers architecture. - - Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. 
Read the - documentation from [`PretrainedConfig`] for more information. - - Args: - attn_pdrop (`float`, *optional*, defaults to 0.0): - The dropout probability for the attention layers. - clip_qkv (`float`, *optional*, defaults to None): - If not `None`, clip the queries, keys, and values in the attention layer to this value. - kv_n_heads (Optional[int]): For grouped_query_attention only, allow user to specify number of kv heads. - rope_theta (float): The base frequency for rope. - """ - - def __init__( - self, - attn_pdrop: float = 0, - clip_qkv: Optional[float] = None, - kv_n_heads: int = 1, - rope_theta: float = 10000.0, - **kwargs: Any, - ): - super().__init__(**kwargs) - self.attn_pdrop = attn_pdrop - self.clip_qkv = clip_qkv - self.kv_n_heads = kv_n_heads - self.rope_theta = rope_theta - - for k in ["model_type"]: - if k in kwargs: - kwargs.pop(k) - if len(kwargs) != 0: - raise ValueError(f"Found unknown {kwargs=}") - - @classmethod - def from_pretrained( - cls, pretrained_model_name_or_path: str, **kwargs: Any - ) -> "PretrainedConfig": - cls._set_token_in_kwargs(kwargs) - - config_dict, kwargs = cls.get_config_dict( - pretrained_model_name_or_path, **kwargs - ) - - if config_dict.get("model_type") == "dbrx": - config_dict = config_dict["attn_config"] - - if ( - "model_type" in config_dict - and hasattr(cls, "model_type") - and config_dict["model_type"] != cls.model_type - ): - logger.warning( - "You are using a model of type %s to instantiate a model of " - "type %s. This is not supported for all configurations of " - "models and can yield errors.", - config_dict["model_type"], cls.model_type) - - return cls.from_dict(config_dict, **kwargs) - - -class DbrxFFNConfig(PretrainedConfig): - """Configuration class for Dbrx FFN. - - [`DbrxFFN`] class. It is used to instantiate feedforward layers according to - the specified arguments, defining the layers architecture. - - Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the - documentation from [`PretrainedConfig`] for more information. - - Args: - ffn_act_fn (dict, optional): A dict specifying activation function for the FFN. - The dict should have a key 'name' with the value being the name of - the activation function along with any additional keyword arguments. - ffn_hidden_size (int, optional): The hidden size of the feedforward network. - moe_num_experts (int, optional): The number of experts in the mixture of experts layer. - moe_top_k (int, optional): The number of experts to use in the mixture of experts layer. - moe_jitter_eps (float, optional): The jitter epsilon for the mixture of experts layer. - moe_loss_weight (float, optional): The loss weight for the mixture of experts layer. - moe_normalize_expert_weights (float, optional): The normalization factor for the expert weights. - uniform_expert_assignment (bool, optional): Whether to use uniform expert assignment. - This should only be used for benchmarking purposes. 
- """ - - def __init__( - self, - ffn_act_fn: Optional[dict] = None, - ffn_hidden_size: int = 3584, - moe_num_experts: int = 4, - moe_top_k: int = 1, - moe_jitter_eps: Optional[float] = None, - moe_loss_weight: float = 0.01, - moe_normalize_expert_weights: Optional[float] = 1, - uniform_expert_assignment: bool = False, - **kwargs: Any, - ): - super().__init__() - if ffn_act_fn is None: - ffn_act_fn = {"name": "silu"} - self.ffn_act_fn = ffn_act_fn - self.ffn_hidden_size = ffn_hidden_size - self.moe_num_experts = moe_num_experts - self.moe_top_k = moe_top_k - self.moe_jitter_eps = moe_jitter_eps - self.moe_loss_weight = moe_loss_weight - self.moe_normalize_expert_weights = moe_normalize_expert_weights - self.uniform_expert_assignment = uniform_expert_assignment - - for k in ["model_type"]: - if k in kwargs: - kwargs.pop(k) - if len(kwargs) != 0: - raise ValueError(f"Found unknown {kwargs=}") - - @classmethod - def from_pretrained( - cls, pretrained_model_name_or_path: str, **kwargs: Any - ) -> "PretrainedConfig": - cls._set_token_in_kwargs(kwargs) - - config_dict, kwargs = cls.get_config_dict( - pretrained_model_name_or_path, **kwargs - ) - - if config_dict.get("model_type") == "dbrx": - config_dict = config_dict["ffn_config"] - - if ( - "model_type" in config_dict - and hasattr(cls, "model_type") - and config_dict["model_type"] != cls.model_type - ): - logger.warning( - "You are using a model of type %s to instantiate a model of " - "type %s. This is not supported for all " - "configurations of models and can yield errors.", config_dict["model_type"], cls.model_type) - - return cls.from_dict(config_dict, **kwargs) - - -class DbrxConfig(PretrainedConfig): - """Configuration class for Dbrx. - - [`DbrxModel`]. It is used to instantiate a Dbrx model according to the - specified arguments, defining the model architecture. - - Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the - documentation from [`PretrainedConfig`] for more information. - - - Args: - d_model (`int`, *optional*, defaults to 6144): - Dimensionality of the embeddings and hidden states. - n_heads (`int`, *optional*, defaults to 48): - Number of attention heads for each attention layer in the Transformer encoder. - n_layers (`int`, *optional*, defaults to 40): - Number of hidden layers in the Transformer encoder. - max_seq_len (`int`, *optional*, defaults to 32768): - The maximum sequence length of the model. - vocab_size (`int`, *optional*, defaults to 100352): - Vocabulary size of the Dbrx model. Defines the maximum number of different tokens that can be represented by - the `inputs_ids` passed when calling [`DbrxModel`]. - resid_pdrop (`float`, *optional*, defaults to 0.0): - The dropout probability applied to the attention output before combining with residual. - emb_pdrop (`float`, *optional*, defaults to 0.0): - The dropout probability for the embedding layer. - attn_config (`dict`, *optional*): - A dictionary used to configure the model's attention module. - ffn_config (`dict`, *optional*): - A dictionary used to configure the model's FFN module. - use_cache (`bool`, *optional*, defaults to `False`): - Whether or not the model should return the last key/values attentions (not used by all models). - initializer_range (`float`, *optional*, defaults to 0.02): - The standard deviation of the truncated_normal_initializer for initializing all weight matrices. 
- output_router_logits (`bool`, *optional*, defaults to `False`): - Whether or not the router logits should be returned by the model. Enabling this will also allow the model to output the auxiliary loss. - router_aux_loss_coef (`float`, *optional*, defaults to 0.001): - The aux loss factor for the total loss. - - - Example: - ```python - >>> from transformers import DbrxConfig, DbrxModel - - >>> # Initializing a Dbrx configuration - >>> configuration = DbrxConfig() - - >>> # Initializing a model (with random weights) from the configuration - >>> model = DbrxModel(configuration) - - >>> # Accessing the model configuration - >>> configuration = model.config - ``` - """ - - model_type = "dbrx" - attribute_map = { - "num_attention_heads": "n_heads", - "hidden_size": "d_model", - "num_hidden_layers": "n_layers", - "max_position_embeddings": "max_seq_len", - } - - def __init__( - self, - d_model: int = 2048, - n_heads: int = 16, - n_layers: int = 24, - max_seq_len: int = 2048, - vocab_size: int = 32000, - resid_pdrop: float = 0.0, - emb_pdrop: float = 0.0, - attn_config: Optional[DbrxAttentionConfig] = None, - ffn_config: Optional[DbrxFFNConfig] = None, - use_cache: bool = True, - initializer_range: float = 0.02, - output_router_logits: bool = False, - router_aux_loss_coef: float = 0.05, - **kwargs: Any, - ): - if attn_config is None: - self.attn_config = DbrxAttentionConfig() - elif isinstance(attn_config, dict): - self.attn_config = DbrxAttentionConfig(**attn_config) - else: - self.attn_config = attn_config - - if ffn_config is None: - self.ffn_config = DbrxFFNConfig() - elif isinstance(ffn_config, dict): - self.ffn_config = DbrxFFNConfig(**ffn_config) - else: - self.ffn_config = ffn_config - - self.d_model = d_model - self.n_heads = n_heads - self.n_layers = n_layers - self.max_seq_len = max_seq_len - self.vocab_size = vocab_size - self.resid_pdrop = resid_pdrop - self.emb_pdrop = emb_pdrop - self.use_cache = use_cache - self.initializer_range = initializer_range - self.output_router_logits = output_router_logits - self.router_aux_loss_coef = router_aux_loss_coef - - tie_word_embeddings = kwargs.pop("tie_word_embeddings", False) - if tie_word_embeddings: - raise ValueError( - "tie_word_embeddings is not supported for Dbrx models." - ) - - super().__init__( - tie_word_embeddings=tie_word_embeddings, - **kwargs, - ) diff --git a/vllm/transformers_utils/configs/exaone.py b/vllm/transformers_utils/configs/exaone.py deleted file mode 100644 index 7450904a15c..00000000000 --- a/vllm/transformers_utils/configs/exaone.py +++ /dev/null @@ -1,190 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# Copied from -# https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct/blob/main/configuration_exaone.py -# Copyright 2021 The LG AI Research EXAONE Lab. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-"""Exaone model configuration""" - -from transformers.configuration_utils import PretrainedConfig -from transformers.utils import logging - -logger = logging.get_logger(__name__) - -EXAONE_PRETRAINED_CONFIG_ARCHIVE_MAP: dict[str, str] = {} - - -class ExaoneConfig(PretrainedConfig): - r""" - This is the configuration class to store the configuration of a :class: - `~transformers.ExaoneModel`. It is used to instantiate a GPT Lingvo model - according to the specified arguments, defining the model architecture. - Instantiating a configuration with the defaults will yield a similar - configuration to that of the Exaone - - Configuration objects inherit from {class}`~transformers.PretrainedConfig` - and can be used to control the model outputs. Read the documentation from : - class:`~transformers.PretrainedConfig` for more information. - - Args: - vocab_size ({obj}`int`, `optional`, defaults to 50257): - Vocabulary size of the GPT Lingvo model. Defines the number of - different tokens that can be represented by the {obj}`inputs_ids` - passed when calling {class}`~transformers.ExaoneModel`. Vocabulary - size of the model. - Defines the different tokens that can be represented by the - `inputs_ids` passed to the forward method of :class: - `~transformers.EXAONEModel`. - hidden_size ({obj}`int`, `optional`, defaults to 2048): - Dimensionality of the encoder layers and the pooler layer. - num_layers ({obj}`int`, `optional`, defaults to 24): - Number of hidden layers in the Transformer encoder. - num_attention_heads (`int`, *optional*, defaults to 32): - Number of attention heads for each attention layer in the - Transformer decoder. - num_key_value_heads (`int`, *optional*): - This is the number of key_value heads that should be used to - implement Grouped Query Attention. If - `num_key_value_heads=num_attention_heads`, the model will use Multi - Head Attention (MHA), if `num_key_value_heads=1 the model will use - Multi Query Attention (MQA) otherwise GQA is used. When - converting a multi-head checkpoint to a GQA checkpoint, - each group key and value head should be constructed by meanpooling - all the original heads within that group. For more details checkout - [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not - specified, will default to `num_attention_heads`. - rotary_pct (`float`, *optional*, defaults to 0.25): - percentage of hidden dimensions to allocate to rotary embeddings - intermediate_size ({obj}`int`, `optional`, defaults to 8192): - Dimensionality of the "intermediate" (i.e., feed-forward) layer in - the Transformer encoder. - activation_function ({obj}`str` or {obj}`function`, `optional`, - defaults to {obj}`"gelu_new"`): - The non-linear activation function (function or string) in the - encoder and pooler. If string, {obj}`"gelu"`, {obj}`"relu"`, - {obj}`"selu"` and {obj}`"gelu_new"` are supported. - embed_dropout ({obj}`float`, `optional`, defaults to 0.0): - The dropout probabilitiy for all fully connected layers in the - embeddings, encoder, and pooler. - attention_dropout ({obj}`float`, `optional`, defaults to 0.0): - The dropout ratio for the attention probabilities. - max_position_embeddings ({obj}`int`, `optional`, defaults to 2048): - The maximum sequence length that this model might ever be used with. - Typically set this to something large just in case - (e.g., 512 or 1024 or 2048). - type_vocab_size ({obj}`int`, `optional`, defaults to 2): - The vocabulary size of the {obj}`token_type_ids` passed when calling - {class}`~transformers.EXAONEModel`. 
- initializer_range ({obj}`float`, `optional`, defaults to 0.02): - The standard deviation of the truncated_normal_initializer for - initializing all weight matrices. - layer_norm_epsilon ({obj}`float`, `optional`, defaults to 1e-5): - The epsilon used by the layer normalization layers. - use_cache ({obj}`bool`, `optional`, defaults to {obj}`True`): - Whether or not the model should return the last key/values - attentions (not used by all models). - Only relevant if ``config.is_decoder=True``. - gradient_checkpointing ({obj}`bool`, `optional`, - defaults to {obj}`False`): - If True, use gradient checkpointing to save memory at the expense - of slower backward pass. - Example:: - - >>> from transformers import ExoneModel, ExaoneConfig - - >>> # Initializing a EXAONE configuration - >>> configuration = ExaoneConfig() - - >>> # Initializing a model from configuration - >>> model = ExoneModel(configuration) - - >>> # Accessing the model configuration - >>> configuration = model.config - """ - - model_type = "exaone" - keys_to_ignore_at_inference = ["past_key_values"] - attribute_map = {"num_hidden_layers": "num_layers"} - - def __init__( - self, - vocab_size=102400, - max_position_embeddings=2048, - hidden_size=2048, - num_layers=32, - num_attention_heads=32, - num_key_value_heads=None, - intermediate_size=None, - activation_function="silu", - rotary_pct=0.25, - resid_dropout=0.0, - embed_dropout=0.0, - attention_dropout=0.0, - layer_norm_epsilon=1e-6, - initializer_range=0.02, - use_cache=True, - bos_token_id=0, - eos_token_id=2, - tie_word_embeddings=True, - **kwargs, - ): - super().__init__( - bos_token_id=bos_token_id, - eos_token_id=eos_token_id, - tie_word_embeddings=tie_word_embeddings, - **kwargs, - ) - - self.vocab_size = vocab_size - self.max_position_embeddings = max_position_embeddings - self.hidden_size = hidden_size - self.num_layers = num_layers - self.num_attention_heads = num_attention_heads - self.num_hidden_layers = num_layers - if num_key_value_heads is None: - num_key_value_heads = num_attention_heads - self.num_key_value_heads = num_key_value_heads - if intermediate_size: - self.intermediate_size = intermediate_size - else: - self.intermediate_size = hidden_size * 4 - self.activation_function = activation_function - self.resid_dropout = resid_dropout - self.embed_dropout = embed_dropout - self.attention_dropout = attention_dropout - self.layer_norm_epsilon = layer_norm_epsilon - self.initializer_range = initializer_range - self.use_cache = use_cache - self.rotary_pct = rotary_pct - - self.bos_token_id = bos_token_id - self.eos_token_id = eos_token_id - - self.use_logit_cap = kwargs.pop("use_logit_cap", False) - self.ln_no_scale = kwargs.pop("ln_no_scale", False) - self.use_gated = kwargs.pop("use_gated", False) - self.use_emb_norm = kwargs.pop("use_emb_norm", False) - self.use_rotary_pos = kwargs.pop("use_rotary_pos", False) - self.rotary_type = kwargs.pop("rotary_type", None) - self.scaling_factor = kwargs.pop("scaling_factor", 1) - self.use_absolute_pos = kwargs.pop("use_absolute_pos", True) - self.use_extra_logit = kwargs.pop("use_extra_logit", True) - self.rotary_expand_length = kwargs.pop("rotary_expand_length", None) - self.rotary_base = kwargs.pop("rotary_base", 10000.0) - self.use_qkv_fuse = kwargs.pop("use_qkv_fuse", False) - self.rescale_before_lm_head = kwargs.pop("rescale_before_lm_head", - (rotary_pct == 0.25)) - if self.use_rotary_pos: - self.use_absolute_pos = False diff --git a/vllm/transformers_utils/configs/exaone4.py 
b/vllm/transformers_utils/configs/exaone4.py deleted file mode 100644 index a22ebaa6bd6..00000000000 --- a/vllm/transformers_utils/configs/exaone4.py +++ /dev/null @@ -1,252 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -# ruff: noqa: E501 - -# Copied from -# https://github.com/lgai-exaone/transformers/blob/add-exaone4/src/transformers/models/exaone4/configuration_exaone4.py -# Copyright 2025 The LG CNS Gen AI Solution Delivery Team. -# Copyright 2025 The LG AI Research and HuggingFace Inc. team. All rights reserved. -# -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -from transformers.configuration_utils import (PretrainedConfig, - layer_type_validation) -from transformers.utils import logging - -logger = logging.get_logger(__name__) - - -def check_is_sliding(config, layer_idx): - """ - Check if the current layer is a sliding window attention (local attention) layer. - """ - if config.sliding_window is None: - return False - if config.layer_types is not None: - return config.layer_types[layer_idx] == "sliding_attention" - if isinstance(config.sliding_window_pattern, int): - return ((layer_idx + 1) % config.sliding_window_pattern) != 0 - elif isinstance(config.sliding_window_pattern, str): - assert isinstance(config.sliding_window, int), ( - f"Sliding window must be positive integer, but got {config.sliding_window}" - ) - return (layer_idx != config.num_hidden_layers - 1 - and config.sliding_window_pattern[layer_idx % len( - config.sliding_window_pattern)] == "L") - else: - logger.warning_once( - "Sliding window is set, but none of `sliding_window_pattern` or `layer_types` is set. " - "Defaulting to use 'full_attention' for all layers.") - return False - - -class Exaone4Config(PretrainedConfig): - r""" - This is the configuration class to store the configuration of a [`Exaone4Model`]. It is used to - instantiate a EXAONE 4.0 model according to the specified arguments, defining the model architecture. Instantiating a - configuration with the defaults will yield a similar configuration to that of the EXAONE-4.0-Instruct [LGAI-EXAONE/EXAONE-4.0-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-Instruct) - NOTE: `EXAONE-4.0-Instruct` is a placeholder model ID. The exact model ID will be updated in the future. - - Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model - outputs. Read the documentation from [`PretrainedConfig`] for more information. - - Args: - vocab_size (`int`, *optional*, defaults to 102400): - Vocabulary size of the EXAONE 4.0 model. Defines the number of different tokens that can be represented by the - `inputs_ids` passed when calling [`Exaone4Model`]. - hidden_size (`int`, *optional*, defaults to 4096): - Dimension of the hidden representations. - intermediate_size (`int`, *optional*, defaults to `hidden_size * 4`): - Dimensionality of the MLP representations. 
- num_hidden_layers (`int`, *optional*, defaults to 32): - Number of hidden layers in the Transformer encoder. - num_attention_heads (`int`, *optional*, defaults to 32): - Number of attention heads for each attention layer in the Transformer decoder. - num_key_value_heads (`int`, *optional*): - This is the number of key_value heads that should be used to implement Grouped Query Attention. If - `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if - `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When - converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed - by meanpooling all the original heads within that group. For more details checkout [this - paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to - `num_attention_heads`. - hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): - The non-linear activation function (function or string) in the decoder. - max_position_embeddings (`int`, *optional*, defaults to 2048): - The maximum sequence length that this model might ever be used with. Typically set this to something large - just in case (e.g., 32768 for EXAONE 3.5). - initializer_range (`float`, *optional*, defaults to 0.02): - The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - rms_norm_eps (`float`, *optional*, defaults to 1e-05): - The epsilon used by the layer normalization layers. - use_cache (`bool`, *optional*, defaults to `True`): - Whether or not the model should return the last key/values attentions (not used by all models). Only - relevant if ``config.is_decoder=True``. - bos_token_id (`int`, *optional*, defaults to 0): - Beginning of stream token id. - eos_token_id (`int`, *optional*, defaults to 2): - End of stream token id. - tie_word_embeddings (`bool`, *optional*, defaults to `False`): - Whether to tie weight embeddings - rope_theta (`float`, *optional*, defaults to 10000.0): - The base period of the RoPE embeddings. - rope_scaling (`Dict`, *optional*): - Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type - and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value - accordingly. - Expected contents: - `rope_type` (`str`): - The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', - 'llama3'], with 'default' being the original RoPE implementation. - `factor` (`float`, *optional*): - Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In - most scaling types, a `factor` of x will enable the model to handle sequences of length x * - original maximum pre-trained length. - `original_max_position_embeddings` (`int`, *optional*): - Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during - pretraining. - `attention_factor` (`float`, *optional*): - Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention - computation. If unspecified, it defaults to value recommended by the implementation, using the - `factor` field to infer the suggested value. - `beta_fast` (`float`, *optional*): - Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear - ramp function. If unspecified, it defaults to 32. - `beta_slow` (`float`, *optional*): - Only used with 'yarn'. 
Parameter to set the boundary for interpolation (only) in the linear - ramp function. If unspecified, it defaults to 1. - `short_factor` (`List[float]`, *optional*): - Only used with 'longrope'. The scaling factor to be applied to short contexts (< - `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden - size divided by the number of attention heads divided by 2 - `long_factor` (`List[float]`, *optional*): - Only used with 'longrope'. The scaling factor to be applied to long contexts (< - `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden - size divided by the number of attention heads divided by 2 - `low_freq_factor` (`float`, *optional*): - Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE - `high_freq_factor` (`float`, *optional*): - Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE - attention_dropout (`float`, *optional*, defaults to 0.0): - The dropout ratio for the attention probabilities. - sliding_window (`int`, *optional*): - The size of the sliding window for the sliding window attention. - sliding_window_pattern (`str`, *optional*): - The pattern to use for sliding window attention. Can be one of: - - `None`: No sliding window attention is used - - `int`: Every `sliding_window` layers, use global attention, else use local attention. - - `str`: A sequence of "L" (local attention) and "G" (global attention) characters that defines the - attention pattern. The pattern starts from layer 0 and repeats every `sliding_window` layers. The - final layer always uses global attention regardless of the pattern. - For instance, sliding_window_pattern="LLLG" same as sliding_window=4, which means: - - Layer 0, 1, 2: local attention, - - Layer 3: global attention, - ...(repeated) - layer_types (`list`, *optional*): - Attention pattern for each layer. Prioritized over `sliding_window_pattern`. 
- - Example: - - ```python - >>> from transformers import Exaone4Model, Exaone4Config - - >>> # Initializing a EXAONE configuration - >>> configuration = Exaone4Config() - - >>> # Initializing a model from configuration - >>> model = Exaone4Model(configuration) - - >>> # Accessing the model configuration - >>> configuration = model.config - ```""" - - model_type = "exaone4" - keys_to_ignore_at_inference = ["past_key_values"] - # Default tensor parallel plan for base model `LlamaModel` - base_model_tp_plan = { - "layers.*.self_attn.q_proj": "colwise", - "layers.*.self_attn.k_proj": "colwise", - "layers.*.self_attn.v_proj": "colwise", - "layers.*.self_attn.o_proj": "rowwise", - "layers.*.mlp.gate_proj": "colwise", - "layers.*.mlp.up_proj": "colwise", - "layers.*.mlp.down_proj": "rowwise", - } - base_model_pp_plan = { - "embed_tokens": (["input_ids"], ["inputs_embeds"]), - "layers": (["hidden_states", "attention_mask"], ["hidden_states"]), - "norm": (["hidden_states"], ["hidden_states"]), - } - - def __init__( - self, - vocab_size=102400, - hidden_size=4096, - intermediate_size=None, - num_hidden_layers=32, - num_attention_heads=32, - num_key_value_heads=None, - hidden_act="silu", - max_position_embeddings=2048, - initializer_range=0.02, - rms_norm_eps=1e-5, - use_cache=True, - bos_token_id=0, - eos_token_id=2, - tie_word_embeddings=False, - rope_theta=10000.0, - rope_scaling=None, - attention_dropout=0.0, - sliding_window=None, - sliding_window_pattern=None, - layer_types=None, - **kwargs, - ): - self.vocab_size = vocab_size - self.hidden_size = hidden_size - self.num_hidden_layers = num_hidden_layers - self.num_attention_heads = num_attention_heads - if num_key_value_heads is None: - num_key_value_heads = num_attention_heads - self.num_key_value_heads = num_key_value_heads - if intermediate_size: - self.intermediate_size = intermediate_size - else: - self.intermediate_size = hidden_size * 4 - self.hidden_act = hidden_act - self.max_position_embeddings = max_position_embeddings - self.initializer_range = initializer_range - self.rms_norm_eps = rms_norm_eps - self.use_cache = use_cache - self.attention_dropout = attention_dropout - self.rope_theta = rope_theta - self.rope_scaling = rope_scaling - self.sliding_window = sliding_window - self.sliding_window_pattern = sliding_window_pattern - - self.layer_types = layer_types - if self.layer_types is None: - self.layer_types = [ - "sliding_attention" - if check_is_sliding(self, i) else "full_attention" - for i in range(self.num_hidden_layers) - ] - layer_type_validation(self.layer_types) - - super().__init__(bos_token_id=bos_token_id, - eos_token_id=eos_token_id, - tie_word_embeddings=tie_word_embeddings, - **kwargs) - - -__all__ = ["Exaone4Config"] diff --git a/vllm/transformers_utils/configs/minimax_text_01.py b/vllm/transformers_utils/configs/minimax_text_01.py deleted file mode 100644 index e3b63dfa003..00000000000 --- a/vllm/transformers_utils/configs/minimax_text_01.py +++ /dev/null @@ -1,70 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -""" MiniMaxText01 model configuration""" - -from transformers.configuration_utils import PretrainedConfig - - -class MiniMaxText01Config(PretrainedConfig): - model_type = "MiniMaxText01" - keys_to_ignore_at_inference = ["past_key_values"] - - def __init__( - self, - vocab_size=32000, - hidden_size=4096, - intermediate_size=14336, - num_hidden_layers=32, - num_attention_heads=32, - num_key_value_heads=8, - hidden_act="silu", - 
max_position_embeddings=4096 * 32, - initializer_range=0.02, - rms_norm_eps=1e-5, - use_cache=True, - pad_token_id=None, - bos_token_id=None, - eos_token_id=None, - tie_word_embeddings=False, - rope_theta=1e6, - sliding_window=None, - attention_dropout=0.0, - num_experts_per_tok=2, - num_local_experts=8, - output_router_logits=False, - router_aux_loss_coef=0.001, - router_jitter_noise=0.0, - **kwargs, - ): - self.vocab_size = vocab_size - self.max_position_embeddings = max_position_embeddings - self.hidden_size = hidden_size - self.intermediate_size = intermediate_size - self.num_hidden_layers = num_hidden_layers - self.num_attention_heads = num_attention_heads - self.sliding_window = sliding_window - - # for backward compatibility - if num_key_value_heads is None: - num_key_value_heads = num_attention_heads - - self.num_key_value_heads = num_key_value_heads - self.hidden_act = hidden_act - self.initializer_range = initializer_range - self.rms_norm_eps = rms_norm_eps - self.use_cache = use_cache - self.rope_theta = rope_theta - self.attention_dropout = attention_dropout - - self.num_experts_per_tok = num_experts_per_tok - self.num_local_experts = num_local_experts - self.output_router_logits = output_router_logits - self.router_aux_loss_coef = router_aux_loss_coef - self.router_jitter_noise = router_jitter_noise - super().__init__( - pad_token_id=pad_token_id, - bos_token_id=bos_token_id, - eos_token_id=eos_token_id, - tie_word_embeddings=tie_word_embeddings, - **kwargs, - ) diff --git a/vllm/transformers_utils/configs/minimax_vl_01.py b/vllm/transformers_utils/configs/minimax_vl_01.py deleted file mode 100644 index c62497192cc..00000000000 --- a/vllm/transformers_utils/configs/minimax_vl_01.py +++ /dev/null @@ -1,71 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -"""MiniMaxVL01 model configuration""" - -from transformers.configuration_utils import PretrainedConfig -from transformers.models.auto import CONFIG_MAPPING - -from .minimax_text_01 import MiniMaxText01Config - - -class MiniMaxVL01Config(PretrainedConfig): - model_type = "minimax_vl_01" - - def __init__( - self, - vision_config=None, - text_config=None, - ignore_index=-100, - image_token_index=32000, - projector_hidden_act="gelu", - vision_feature_select_strategy="default", - vision_feature_layer=-2, - image_grid_pinpoints=None, - tie_word_embeddings=False, - image_seq_length=576, - **kwargs, - ): - self.ignore_index = ignore_index - self.image_token_index = image_token_index - self.projector_hidden_act = projector_hidden_act - self.image_seq_length = image_seq_length - - if vision_feature_select_strategy not in ["default", "full"]: - raise ValueError("vision_feature_select_strategy should " + - "be one of 'default', 'full'." 
+ - f"Got: {vision_feature_select_strategy}") - - self.vision_feature_select_strategy = vision_feature_select_strategy - self.vision_feature_layer = vision_feature_layer - image_grid_pinpoints = ( - image_grid_pinpoints if image_grid_pinpoints is not None else - [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]) - self.image_grid_pinpoints = image_grid_pinpoints - - if isinstance(vision_config, dict): - if "model_type" not in vision_config: - vision_config["model_type"] = "clip_vision_model" - vision_config = CONFIG_MAPPING[vision_config["model_type"]]( - **vision_config) - elif vision_config is None: - vision_config = CONFIG_MAPPING["clip_vision_model"]( - intermediate_size=4096, - hidden_size=1024, - patch_size=14, - image_size=336, - num_hidden_layers=24, - num_attention_heads=16, - vocab_size=32000, - projection_dim=768, - ) - - self.vision_config = vision_config - - if text_config is not None: - text_config = MiniMaxText01Config(**text_config) - else: - text_config = MiniMaxText01Config() - - self.text_config = text_config - - super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs) diff --git a/vllm/transformers_utils/configs/mpt.py b/vllm/transformers_utils/configs/mpt.py deleted file mode 100644 index 91316408dcd..00000000000 --- a/vllm/transformers_utils/configs/mpt.py +++ /dev/null @@ -1,180 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# Copied from -# https://huggingface.co/mosaicml/mpt-7b/blob/main/configuration_mpt.py -"""A HuggingFace-style model configuration.""" -import warnings -from typing import Any, Optional, Union - -from transformers import PretrainedConfig - -attn_config_defaults: dict = { - 'attn_type': 'multihead_attention', - 'attn_pdrop': 0.0, - 'attn_impl': 'triton', - 'qk_ln': False, - 'clip_qkv': None, - 'softmax_scale': None, - 'prefix_lm': False, - 'attn_uses_sequence_id': False, - 'alibi': False, - 'alibi_bias_max': 8 -} -ffn_config_defaults: dict = {'ffn_type': 'mptmlp'} -init_config_defaults: dict = { - 'name': 'kaiming_normal_', - 'fan_mode': 'fan_in', - 'init_nonlinearity': 'relu', - 'init_div_is_residual': True, - 'emb_init_std': None, - 'emb_init_uniform_lim': None, - 'init_std': None, - 'init_gain': 0.0 -} - - -class MPTConfig(PretrainedConfig): - model_type = 'mpt' - attribute_map = { - 'num_attention_heads': 'n_heads', - 'hidden_size': 'd_model', - 'num_hidden_layers': 'n_layers', - } - - # pylint: disable=dangerous-default-value - def __init__(self, - d_model: int = 2048, - n_heads: int = 16, - n_layers: int = 24, - expansion_ratio: int = 4, - max_seq_len: int = 2048, - vocab_size: int = 50368, - resid_pdrop: float = 0.0, - emb_pdrop: float = 0.0, - learned_pos_emb: bool = True, - attn_config: dict = attn_config_defaults, - ffn_config: dict = ffn_config_defaults, - init_device: str = 'cpu', - logit_scale: Optional[Union[float, str]] = None, - no_bias: bool = False, - embedding_fraction: float = 1.0, - norm_type: str = 'low_precision_layernorm', - use_cache: bool = False, - init_config: dict = init_config_defaults, - fc_type: str = 'torch', - verbose: Optional[int] = None, - **kwargs: Any): - self.d_model = d_model - self.n_heads = n_heads - self.n_layers = n_layers - self.expansion_ratio = expansion_ratio - self.max_seq_len = max_seq_len - self.vocab_size = vocab_size - self.resid_pdrop = resid_pdrop - self.emb_pdrop = emb_pdrop - self.learned_pos_emb = learned_pos_emb - self.attn_config = attn_config - self.ffn_config = ffn_config - self.init_device 
= init_device - self.logit_scale = logit_scale - self.no_bias = no_bias - self.embedding_fraction = embedding_fraction - self.norm_type = norm_type - self.use_cache = use_cache - self.init_config = init_config - self.fc_type = fc_type - if verbose is not None: - warnings.warn(DeprecationWarning( - 'verbose argument for MPTConfig is now ignored and ' - 'will be removed. Use python_log_level instead.'), - stacklevel=2) - if 'name' in kwargs: - del kwargs['name'] - if 'loss_fn' in kwargs: - del kwargs['loss_fn'] - if self.attn_config.get('alibi', False): - self.learned_pos_emb = False - warnings.warn( - f'alibi is turned on, setting `learned_pos_emb` ' - f'to {self.learned_pos_emb}`', - stacklevel=2) - super().__init__(**kwargs) - self._validate_config() - - def _set_config_defaults( - self, config: dict[str, Any], - config_defaults: dict[str, Any]) -> dict[str, Any]: - for (k, v) in config_defaults.items(): - if k not in config: - config[k] = v - return config - - def _validate_config(self) -> None: - self.attn_config = self._set_config_defaults(self.attn_config, - attn_config_defaults) - self.ffn_config = self._set_config_defaults(self.ffn_config, - ffn_config_defaults) - self.init_config = self._set_config_defaults(self.init_config, - init_config_defaults) - if self.d_model % self.n_heads != 0: - raise ValueError('d_model must be divisible by n_heads') - if any( - prob < 0 or prob > 1 for prob in - [self.attn_config['attn_pdrop'], self.resid_pdrop, self.emb_pdrop - ]): - raise ValueError( - "self.attn_config['attn_pdrop'], resid_pdrop, emb_pdrop are " - "probabilities and must be between 0 and 1") - if self.attn_config['attn_impl'] not in ['torch', 'flash', 'triton']: - raise ValueError( - f"Unknown attn_impl={self.attn_config['attn_impl']}") - if self.attn_config['prefix_lm'] and self.attn_config[ - 'attn_impl'] not in ['torch', 'triton']: - raise NotImplementedError( - 'prefix_lm only implemented with torch and triton attention.') - if self.attn_config['alibi'] and self.attn_config['attn_impl'] not in [ - 'torch', 'triton' - ]: - raise NotImplementedError( - 'alibi only implemented with torch and triton attention.') - if self.attn_config['attn_uses_sequence_id'] and self.attn_config[ - 'attn_impl'] not in ['torch', 'triton']: - raise NotImplementedError( - 'attn_uses_sequence_id only implemented with torch ' - 'and triton attention.') - if self.embedding_fraction > 1 or self.embedding_fraction <= 0: - raise ValueError( - 'model.embedding_fraction must be between 0 (exclusive) ' - 'and 1 (inclusive)!') - if isinstance(self.logit_scale, - str) and self.logit_scale != 'inv_sqrt_d_model': - raise ValueError( - f"self.logit_scale={self.logit_scale!r} is not recognized as " - "an option; use numeric value or 'inv_sqrt_d_model'.") - if self.init_config.get('name', None) is None: - raise ValueError( - f"self.init_config={self.init_config!r} 'name' needs to be set." - ) - if not self.learned_pos_emb and (not self.attn_config['alibi']): - warnings.warn( - 'Positional information not being provided to the model.', - stacklevel=2) - if self.fc_type == 'te' or self.ffn_config['ffn_type'] == 'te_ln_mlp': - try: - # pylint: disable=import-outside-toplevel - import transformer_engine.pytorch as te - del te - except Exception as exc: - raise ImportError( - 'TransformerEngine import fail. `fc_type: te` requires ' - 'TransformerEngine be installed. 
' - 'The required version of transformer_engine also requires ' - 'FlashAttention v1.0.6 is installed:\n' - 'pip install flash-attn==1.0.6 --no-build-isolation \n' - 'pip install git+https://github.com/NVIDIA/TransformerEngine.git@144e4888b2cdd60bd52e706d5b7a79cb9c1a7156' - ) from exc - if self.ffn_config['ffn_type'] == 'mptmlp': - self.ffn_config['fc_type'] = self.fc_type - elif self.ffn_config['ffn_type'] == 'te_ln_mlp': - self.ffn_config['bias'] = not self.no_bias diff --git a/vllm/transformers_utils/configs/nvlm_d.py b/vllm/transformers_utils/configs/nvlm_d.py deleted file mode 100644 index edfc506882f..00000000000 --- a/vllm/transformers_utils/configs/nvlm_d.py +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# Adapted from -# https://huggingface.co/nvidia/NVLM-D-72B/blob/main/configuration_nvlm_d.py -# -------------------------------------------------------- -# NVLM-D -# Copyright (c) 2024 NVIDIA -# Licensed under Apache 2.0 License [see LICENSE for details] -# -------------------------------------------------------- -from transformers import Qwen2Config -from transformers.configuration_utils import PretrainedConfig - - -class NVLM_D_Config(PretrainedConfig): - model_type = 'NVLM_D' - is_composition = True - - def __init__(self, vision_config=None, llm_config=None, **kwargs): - super().__init__(**kwargs) - - # Handle vision_config initialization - if vision_config is None: - vision_config = {} - - # Handle llm_config initialization - if llm_config is None: - llm_config = {} - - self.vision_config = PretrainedConfig(**vision_config) - self.text_config = Qwen2Config(**llm_config) diff --git a/vllm/transformers_utils/configs/ovis.py b/vllm/transformers_utils/configs/ovis.py deleted file mode 100644 index 021d402a71f..00000000000 --- a/vllm/transformers_utils/configs/ovis.py +++ /dev/null @@ -1,184 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# yapf: disable -# ruff: noqa: E501 -# copied from https://huggingface.co/AIDC-AI/Ovis2-1B/blob/main/configuration_aimv2.py -# and https://huggingface.co/AIDC-AI/Ovis2-1B/blob/main/configuration_ovis.py -from typing import Any, Optional, Union - -from transformers import AutoConfig, PretrainedConfig - - -class AIMv2Config(PretrainedConfig): - """This is the configuration class to store the configuration of an [`AIMv2Model`]. - - Instantiating a configuration with the defaults will yield a similar configuration - to that of the [apple/aimv2-large-patch14-224](https://huggingface.co/apple/aimv2-large-patch14-224). - - Args: - hidden_size: Dimension of the hidden representations. - intermediate_size: Dimension of the SwiGLU representations. - num_hidden_layers: Number of hidden layers in the Transformer. - num_attention_heads: Number of attention heads for each attention layer - in the Transformer. - num_channels: Number of input channels. - image_size: Image size. - patch_size: Patch size. - rms_norm_eps: Epsilon value used for the RMS normalization layer. - attention_dropout: Dropout ratio for attention probabilities. - projection_dropout: Dropout ratio for the projection layer after the attention. - qkv_bias: Whether to add a bias to the queries, keys and values. - use_bias: Whether to add a bias in the feed-forward and projection layers. - kwargs: Keyword arguments for the [`PretrainedConfig`]. 
- """ - - model_type: str = "aimv2" - - def __init__( - self, - hidden_size: int = 1024, - intermediate_size: int = 2816, - num_hidden_layers: int = 24, - num_attention_heads: int = 8, - num_channels: int = 3, - image_size: int = 224, - patch_size: int = 14, - rms_norm_eps: float = 1e-5, - attention_dropout: float = 0.0, - projection_dropout: float = 0.0, - qkv_bias: bool = False, - use_bias: bool = False, - **kwargs: Any, - ): - super().__init__(**kwargs) - self.hidden_size = hidden_size - self.intermediate_size = intermediate_size - self.num_hidden_layers = num_hidden_layers - self.num_attention_heads = num_attention_heads - self.num_channels = num_channels - self.patch_size = patch_size - self.image_size = image_size - self.attention_dropout = attention_dropout - self.rms_norm_eps = rms_norm_eps - - self.projection_dropout = projection_dropout - self.qkv_bias = qkv_bias - self.use_bias = use_bias - - -IGNORE_ID = -100 -IMAGE_TOKEN_ID = -200 -IMAGE_TOKEN = "" -IMAGE_ATOM_ID = -300 -IMAGE_INDICATOR_IDS = [-301, -302, -303, -304, -305] - - -# ---------------------------------------------------------------------- -# Visual Tokenizer Configuration -# ---------------------------------------------------------------------- -class BaseVisualTokenizerConfig(PretrainedConfig): - - def __init__(self, - vocab_size=16384, - tokenize_function="softmax", - tau=1.0, - depths=None, - drop_cls_token=False, - backbone_config: Optional[Union[PretrainedConfig, - dict]] = None, - hidden_stride: int = 1, - **kwargs): - super().__init__(**kwargs) - self.vocab_size = vocab_size - self.tokenize_function = tokenize_function - self.tau = tau - if isinstance(depths, str): - depths = [int(x) for x in depths.split('|')] - self.depths = depths - self.backbone_kwargs = dict[str, Any]() - self.drop_cls_token = drop_cls_token - if backbone_config is not None: - assert isinstance(backbone_config, (PretrainedConfig, dict)), \ - f"expect `backbone_config` to be instance of PretrainedConfig or dict, but got {type(backbone_config)} type" - if not isinstance(backbone_config, PretrainedConfig): - model_type = backbone_config['model_type'] - if model_type != "aimv2": - backbone_config.pop('model_type') - backbone_config = AutoConfig.for_model(model_type, **backbone_config) - else: - backbone_config = AIMv2Config(**backbone_config) - self.backbone_config = backbone_config - self.hidden_stride = hidden_stride - - -class Aimv2VisualTokenizerConfig(BaseVisualTokenizerConfig): - model_type = "aimv2_visual_tokenizer" - - def __init__(self, **kwargs): - super().__init__(**kwargs) - if self.drop_cls_token: - self.drop_cls_token = False - if self.depths: - assert len(self.depths) == 1 - self.backbone_kwargs['num_hidden_layers'] = self.depths[0] - - -class SiglipVisualTokenizerConfig(BaseVisualTokenizerConfig): - model_type = "siglip_visual_tokenizer" - - def __init__(self, **kwargs): - super().__init__(**kwargs) - if self.drop_cls_token: - self.drop_cls_token = False - if self.depths: - assert len(self.depths) == 1 - self.backbone_kwargs['num_hidden_layers'] = self.depths[0] - - -AutoConfig.register("siglip_visual_tokenizer", SiglipVisualTokenizerConfig) -AutoConfig.register("aimv2_visual_tokenizer", Aimv2VisualTokenizerConfig) - - -# ---------------------------------------------------------------------- -# Ovis Configuration -# ---------------------------------------------------------------------- -class OvisConfig(PretrainedConfig): - model_type = "ovis" - - def __init__(self, - llm_config: Optional[Union[PretrainedConfig, dict]] = 
None, - visual_tokenizer_config: Optional[Union[PretrainedConfig, - dict]] = None, - multimodal_max_length=8192, - hidden_size=None, - conversation_formatter_class=None, - llm_attn_implementation=None, - disable_tie_weight=False, - **kwargs): - super().__init__(**kwargs) - if llm_config is not None: - assert isinstance(llm_config, (PretrainedConfig, dict)), \ - f"expect `llm_config` to be instance of PretrainedConfig or dict, but got {type(llm_config)} type" - if not isinstance(llm_config, PretrainedConfig): - model_type = llm_config['model_type'] - llm_config.pop('model_type') - llm_config = AutoConfig.for_model(model_type, **llm_config) - - # map llm_config to text_config - self.text_config = llm_config - if visual_tokenizer_config is not None: - assert isinstance(visual_tokenizer_config, (PretrainedConfig, dict)), \ - f"expect `visual_tokenizer_config` to be instance of PretrainedConfig or dict, but got {type(visual_tokenizer_config)} type" - if not isinstance(visual_tokenizer_config, PretrainedConfig): - model_type = visual_tokenizer_config['model_type'] - visual_tokenizer_config.pop('model_type') - visual_tokenizer_config = AutoConfig.for_model( - model_type, **visual_tokenizer_config) - - self.visual_tokenizer_config = visual_tokenizer_config - self.multimodal_max_length = multimodal_max_length - self.hidden_size = hidden_size - self.conversation_formatter_class = conversation_formatter_class - self.llm_attn_implementation = llm_attn_implementation - self.disable_tie_weight = disable_tie_weight diff --git a/vllm/transformers_utils/configs/skyworkr1v.py b/vllm/transformers_utils/configs/skyworkr1v.py deleted file mode 100644 index 33a45220e31..00000000000 --- a/vllm/transformers_utils/configs/skyworkr1v.py +++ /dev/null @@ -1,54 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# Adapted from -# https://huggingface.co/Skywork/Skywork-R1V-38B/blob/main/configuration_skywork_chat.py -# -------------------------------------------------------- -# SkyworkR1V -# Copyright (c) 2025 Skywork -# Licensed under The MIT License [see LICENSE for details] -# -------------------------------------------------------- -from transformers.configuration_utils import PretrainedConfig - - -class SkyworkR1VChatConfig(PretrainedConfig): - model_type = 'internvl_chat' - is_composition = True - - def __init__(self, - vision_config=None, - llm_config=None, - use_backbone_lora=0, - use_llm_lora=0, - select_layer=-1, - force_image_size=None, - downsample_ratio=0.5, - template=None, - dynamic_image_size=False, - use_thumbnail=False, - ps_version='v1', - min_dynamic_patch=1, - max_dynamic_patch=6, - **kwargs): - super().__init__(**kwargs) - - if vision_config is None: - vision_config = {} - - if llm_config is None: - llm_config = {} - - self.vision_config = PretrainedConfig(**vision_config) - self.text_config = PretrainedConfig(**llm_config) - - self.use_backbone_lora = use_backbone_lora - self.use_llm_lora = use_llm_lora - self.select_layer = select_layer - self.force_image_size = force_image_size - self.downsample_ratio = downsample_ratio - self.template = template - self.dynamic_image_size = dynamic_image_size - self.use_thumbnail = use_thumbnail - self.ps_version = ps_version # pixel shuffle version - self.min_dynamic_patch = min_dynamic_patch - self.max_dynamic_patch = max_dynamic_patch diff --git a/vllm/transformers_utils/configs/solar.py b/vllm/transformers_utils/configs/solar.py deleted file mode 100644 index a83dfa40b43..00000000000 --- 
a/vllm/transformers_utils/configs/solar.py +++ /dev/null @@ -1,247 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. -# -# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX -# and OPT implementations in this library. It has been modified from its -# original forms to accommodate minor architectural differences compared -# to GPT-NeoX and OPT used by the Meta AI team that trained the model. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Solar model configuration""" - -from transformers import PretrainedConfig -from transformers.utils import logging - -logger = logging.get_logger(__name__) - - -class SolarConfig(PretrainedConfig): - r""" - This is the configuration class to store - the configuration of a [`SolarModel`]. - It is used to instantiate an LLaMA model - according to the specified arguments, - defining the model architecture. - Instantiating a configuration with the - defaults will yield a similar - configuration to that of the LLaMA-7B. - Configuration objects inherit from [`PretrainedConfig`] - and can be used to control the model outputs. - Read the documentation from [`PretrainedConfig`] for more information. - Args: - vocab_size (`int`, *optional*, defaults to 32000): - Vocabulary size of the LLaMA model. - Defines the number of different tokens - that can be represented by the `inputs_ids` - passed when calling [`SolarModel`] - hidden_size (`int`, *optional*, defaults to 4096): - Dimension of the hidden representations. - intermediate_size (`int`, *optional*, defaults to 11008): - Dimension of the MLP representations. - num_hidden_layers (`int`, *optional*, defaults to 32): - Number of hidden layers in the Transformer decoder. - num_attention_heads (`int`, *optional*, defaults to 32): - Number of attention heads for each attention layer - in the Transformer decoder. - num_key_value_heads (`int`, *optional*): - This is the number of key_value heads that - should be used to implement Grouped Query Attention. If - `num_key_value_heads=num_attention_heads`, - the model will use Multi Head Attention (MHA), if - `num_key_value_heads=1` the model - will use Multi Query Attention (MQA) - otherwise GQA is used. When - converting a multi-head checkpoint to a GQA checkpoint, - each group key and value head should be constructed - by meanpooling all the original heads within that group. - For more details checkout [this paper] - (https://arxiv.org/pdf/2305.13245.pdf). - If it is not specified, will default to - `num_attention_heads`. - hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): - The non-linear activation function (function or string) - in the decoder. - max_position_embeddings (`int`, *optional*, defaults to 2048): - The maximum sequence length that this model might ever be used with. - Solar 1 supports up to 2048 tokens, - Solar 2 up to 4096, CodeSolar up to 16384. 
- initializer_range (`float`, *optional*, defaults to 0.02): - The standard deviation of - the truncated_normal_initializer for initializing - all weight matrices. - rms_norm_eps (`float`, *optional*, defaults to 1e-06): - The epsilon used by the rms normalization layers. - use_cache (`bool`, *optional*, defaults to `True`): - Whether or not the model should return - the last key/values attentions (not used by all models). Only - relevant if `config.is_decoder=True`. - pad_token_id (`int`, *optional*): - Padding token id. - bos_token_id (`int`, *optional*, defaults to 1): - Beginning of stream token id. - eos_token_id (`int`, *optional*, defaults to 2): - End of stream token id. - pretraining_tp (`int`, *optional*, defaults to 1): - Experimental feature. Tensor parallelism rank - used during pretraining. - Please refer to [this - document](https://huggingface.co/docs/ - transformers/main/ - perf_train_gpu_many#tensor-parallelism) - to understand more about it. This value is - necessary to ensure exact reproducibility - of the pretraining results. - Please refer to [this - issue](https://github.com/pytorch/pytorch/issues/76232). - tie_word_embeddings (`bool`, *optional*, defaults to `False`): - Whether to tie weight embeddings - rope_theta (`float`, *optional*, defaults to 10000.0): - The base period of the RoPE embeddings. - rope_scaling (`dict`, *optional*): - Dictionary containing the scaling configuration for - the RoPE embeddings. - Currently supports two scaling - strategies: linear and dynamic. - Their scaling factor must be a float greater than 1. - The expected format is - `{"type": strategy name, "factor": scaling factor}`. - When using this flag, don't update - `max_position_embeddings` to the expected new maximum. - See the following thread for more information on how - these scaling strategies behave: - https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/ - dynamically_scaled_rope_further_increases/. This is an - experimental feature, subject to breaking - API changes in future versions. - attention_bias (`bool`, *optional*, defaults to `False`): - Whether to use a bias in the query, key, value - and output projection layers during self-attention. - attention_dropout (`float`, *optional*, defaults to 0.0): - The dropout ratio for the attention probabilities. - mlp_bias (`bool`, *optional*, defaults to `False`): - Whether to use a bias in up_proj, down_proj and gate_proj - layers in the MLP layers. - sliding_window (`int`, *optional*, defaults to 2047): - Sliding window attention window size. If not specified, - will default to `2047`. 
- ```python - >>> from transformers import SolarModel, SolarConfig - >>> # Initializing a Solar-pro style configuration - >>> configuration = SolarConfig() - >>> # Initializing a model from the Solar-pro style configuration - >>> model = SolarModel(configuration) - >>> # Accessing the model configuration - >>> configuration = model.config - ```""" - - model_type = "solar" - keys_to_ignore_at_inference = ["past_key_values"] - - def __init__( - self, - vocab_size=32000, - hidden_size=4096, - intermediate_size=11008, - num_hidden_layers=32, - num_attention_heads=32, - num_key_value_heads=None, - hidden_act="silu", - max_position_embeddings=2048, - initializer_range=0.02, - rms_norm_eps=1e-6, - use_cache=True, - pad_token_id=None, - bos_token_id=1, - eos_token_id=2, - pretraining_tp=1, - tie_word_embeddings=False, - rope_theta=10000.0, - rope_scaling=None, - attention_bias=False, - attention_dropout=0.0, - mlp_bias=False, - sliding_window=2047, - bskcn_1=None, - bskcn_2=None, - bskcn_3=None, - bskcn_4=None, - bskcn_tv=None, - **kwargs, - ): - self.vocab_size = vocab_size - self.max_position_embeddings = max_position_embeddings - self.hidden_size = hidden_size - self.intermediate_size = intermediate_size - self.num_hidden_layers = num_hidden_layers - self.num_attention_heads = num_attention_heads - - # for backward compatibility - if num_key_value_heads is None: - num_key_value_heads = num_attention_heads - - self.num_key_value_heads = num_key_value_heads - self.hidden_act = hidden_act - self.initializer_range = initializer_range - self.rms_norm_eps = rms_norm_eps - self.pretraining_tp = pretraining_tp - self.use_cache = use_cache - self.rope_theta = rope_theta - self.rope_scaling = rope_scaling - self._rope_scaling_validation() - self.attention_bias = attention_bias - self.attention_dropout = attention_dropout - self.mlp_bias = mlp_bias - self.sliding_window = sliding_window - self.bskcn_1 = bskcn_1 if bskcn_1 is not None else [12, 20, 32, 44] - self.bskcn_2 = bskcn_2 if bskcn_2 is not None else [20, 32] - self.bskcn_3 = bskcn_3 if bskcn_3 is not None else [16, 24, 36, 48] - self.bskcn_4 = bskcn_4 if bskcn_4 is not None else [28, 40] - self.bskcn_tv = bskcn_tv if bskcn_tv is not None else [0.9, 0.8] - - super().__init__( - pad_token_id=pad_token_id, - bos_token_id=bos_token_id, - eos_token_id=eos_token_id, - tie_word_embeddings=tie_word_embeddings, - **kwargs, - ) - - def _rope_scaling_validation(self): - """ - Validate the `rope_scaling` configuration. 
- """ - if self.rope_scaling is None: - return - - if (not isinstance(self.rope_scaling, dict) - or len(self.rope_scaling) != 2): - raise ValueError( - "`rope_scaling` must be a dictionary with two fields," - " `type` and `factor`, " - f"got {self.rope_scaling}") - rope_scaling_type = self.rope_scaling.get("type", None) - rope_scaling_factor = self.rope_scaling.get("factor", None) - if rope_scaling_type is None or rope_scaling_type not in [ - "linear", - "dynamic", - ]: - raise ValueError(f"`rope_scaling`'s type field must be one of " - f"['linear', 'dynamic'], got {rope_scaling_type}") - if (rope_scaling_factor is None - or not isinstance(rope_scaling_factor, float) - or rope_scaling_factor <= 1.0): - raise ValueError( - f"`rope_scaling`'s factor field must be a float > 1," - f" got {rope_scaling_factor}") diff --git a/vllm/transformers_utils/configs/telechat2.py b/vllm/transformers_utils/configs/telechat2.py deleted file mode 100644 index 050a7851d14..00000000000 --- a/vllm/transformers_utils/configs/telechat2.py +++ /dev/null @@ -1,64 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -# adapted from https://www.modelscope.cn/models/TeleAI/TeleChat2-3B/resolve/master/configuration_telechat2.py -""" Telechat configuration compatible with LlamaConfig. """ - -from transformers.configuration_utils import PretrainedConfig - - -class Telechat2Config(PretrainedConfig): - - model_type = "telechat" - keys_to_ignore_at_inference = ["past_key_values"] - attribute_map = { - "num_hidden_layers": "n_layer", - "num_attention_heads": "n_head", - "intermediate_size": "ffn_hidden_size", - "rms_norm_eps": "layer_norm_epsilon" - } - - def __init__( - self, - vocab_size=160256, - hidden_size=4096, - n_layer=30, - n_head=32, - layer_norm_epsilon=1e-5, - initializer_range=0.02, - use_cache=True, - bos_token_id=1, - eos_token_id=2, - apply_residual_connection_post_layernorm=False, - hidden_dropout=0.0, - attention_dropout=0.0, - ffn_hidden_size=12288, - training_seqlen=8192, - logn=True, - embed_layernorm=False, - hidden_act="silu", - **kwargs, - ): - self.vocab_size = vocab_size - n_embed = kwargs.pop("n_embed", None) - self.hidden_size = hidden_size if n_embed is None else n_embed - self.n_layer = n_layer - self.n_head = n_head - self.layer_norm_epsilon = layer_norm_epsilon - self.initializer_range = initializer_range - self.use_cache = use_cache - self.apply_residual_connection_post_layernorm = ( - apply_residual_connection_post_layernorm) - self.hidden_dropout = hidden_dropout - self.attention_dropout = attention_dropout - self.bos_token_id = bos_token_id - self.eos_token_id = eos_token_id - self.logn = logn - self.training_seqlen = training_seqlen - self.embed_layernorm = embed_layernorm - self.num_key_value_heads = kwargs.pop("num_key_value_heads", None) - self.ffn_hidden_size = ffn_hidden_size - self.hidden_act = hidden_act - super().__init__(bos_token_id=bos_token_id, - eos_token_id=eos_token_id, - **kwargs) diff --git a/vllm/transformers_utils/processors/__init__.py b/vllm/transformers_utils/processors/__init__.py index 14d15f2bc16..eca4d7c884d 100644 --- a/vllm/transformers_utils/processors/__init__.py +++ b/vllm/transformers_utils/processors/__init__.py @@ -1,5 +1,12 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +Multi-modal processors may be defined in this directory for the following +reasons: + +- There is no processing file defined by HF Hub or Transformers 
library. +- There is a need to override the existing processor to support vLLM. +""" from vllm.transformers_utils.processors.deepseek_vl2 import ( DeepseekVLV2Processor) From f5b35033b01ddb21fe86b6b5b2e6c7982aa254fa Mon Sep 17 00:00:00 2001 From: Kebe Date: Wed, 30 Jul 2025 15:37:59 +0800 Subject: [PATCH 505/552] [CI] rollback lint-and-deploy pipeline using amd machine (#21912) Signed-off-by: Kebe Signed-off-by: x22x22 --- .github/workflows/lint-and-deploy.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/lint-and-deploy.yaml b/.github/workflows/lint-and-deploy.yaml index d5736c0aee2..74a7a3a3530 100644 --- a/.github/workflows/lint-and-deploy.yaml +++ b/.github/workflows/lint-and-deploy.yaml @@ -7,7 +7,7 @@ permissions: jobs: lint-and-deploy: - runs-on: ubuntu-24.04-arm + runs-on: ubuntu-latest steps: - name: Checkout uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 From 825959cb8c0438e6b0f90a73d1febe8c0a0d6bed Mon Sep 17 00:00:00 2001 From: Varun Vinayak Shenoy Date: Wed, 30 Jul 2025 00:44:15 -0700 Subject: [PATCH 506/552] [Tests] Fixing bug inside MultiModalProfiler. (#21842) Signed-off-by: Varun Shenoy Signed-off-by: x22x22 --- .../multimodal/processing/test_mllama4.py | 67 +++++++++++++++++++ tests/models/registry.py | 4 +- 2 files changed, 70 insertions(+), 1 deletion(-) create mode 100644 tests/models/multimodal/processing/test_mllama4.py diff --git a/tests/models/multimodal/processing/test_mllama4.py b/tests/models/multimodal/processing/test_mllama4.py new file mode 100644 index 00000000000..f3871b60c3f --- /dev/null +++ b/tests/models/multimodal/processing/test_mllama4.py @@ -0,0 +1,67 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Tests for mllama's multimodal preprocessing and profiling.""" +import pytest +from torch import prod +from transformers import Llama4Config + +from vllm.multimodal import MULTIMODAL_REGISTRY +from vllm.multimodal.profiling import MultiModalProfiler + +from ...utils import build_model_context + + +@pytest.mark.parametrize("model_id", ["meta-llama/Llama-Guard-4-12B"]) +@pytest.mark.parametrize("max_model_len", [4096, 8192, 25600, 131072]) +def test_profiling(model_id: str, max_model_len: int): + model_config_kwargs = { + "max_model_len": max_model_len, + } + ctx = build_model_context( + model_id, + model_config_kwargs=model_config_kwargs, + limit_mm_per_prompt={"image": 1}, + ) + + mm_config = ctx.get_mm_config() + processor = MULTIMODAL_REGISTRY.create_processor(ctx.model_config) + profiler = MultiModalProfiler(processor) + + decoder_dummy_data = profiler.get_decoder_dummy_data( + max_model_len, + mm_counts=mm_config.limit_per_prompt, + ) + dummy_mm_data = processor.dummy_inputs.get_dummy_processor_inputs( + max_model_len, + mm_counts=mm_config.limit_per_prompt, + ) + + hf_config = ctx.get_hf_config(Llama4Config) + + mm_kwargs = processor.apply( + prompt=dummy_mm_data.prompt, + mm_data=dummy_mm_data.mm_data, + hf_processor_mm_kwargs=dict(), + )["mm_kwargs"] + + image_size = hf_config.vision_config.image_size + patch_size = hf_config.vision_config.patch_size + downsample_ratio = int( + round(1.0 / (hf_config.vision_config.pixel_shuffle_ratio**2))) + tokens_per_patch = ((image_size // patch_size)**2) // downsample_ratio + chunks_per_image = prod(mm_kwargs["patches_per_image"]) + total_num_patches = chunks_per_image * tokens_per_patch + num_tiles = mm_kwargs["aspect_ratios"][0][0] * mm_kwargs["aspect_ratios"][ + 0][1] # x-y 
seperator tokens + total_tokens = total_num_patches.item() + num_tiles.item( + ) + 3 # image start, image, image end + + profiled_tokens = profiler.get_mm_max_contiguous_tokens( + max_model_len, + mm_counts=mm_config.limit_per_prompt, + ) + + assert total_tokens == profiled_tokens["image"] + assert total_tokens == sum( + placeholder.length for placeholder in + decoder_dummy_data.multi_modal_placeholders["image"]) diff --git a/tests/models/registry.py b/tests/models/registry.py index 4fcd02efb6d..caa691039fc 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -391,7 +391,9 @@ def check_available_online( extras={"thinking": "moonshotai/Kimi-VL-A3B-Thinking"}, # noqa: E501 trust_remote_code=True), "Llama4ForConditionalGeneration": _HfExamplesInfo("meta-llama/Llama-4-Scout-17B-16E-Instruct", # noqa: E501 - max_model_len=10240), + max_model_len=10240, + extras={"llama-guard-4": "meta-llama/Llama-Guard-4-12B"}, # noqa: E501 + ), "LlavaForConditionalGeneration": _HfExamplesInfo("llava-hf/llava-1.5-7b-hf", extras={"mistral": "mistral-community/pixtral-12b", # noqa: E501 "mistral-fp8": "nm-testing/pixtral-12b-FP8-dynamic"}), # noqa: E501 From fe78c9eed62c3e407bf9d437ce2a4bcbd7d48e78 Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Wed, 30 Jul 2025 15:55:03 +0800 Subject: [PATCH 507/552] [Model] Remove DSV2 unused code (#21903) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- vllm/model_executor/models/deepseek_v2.py | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/vllm/model_executor/models/deepseek_v2.py b/vllm/model_executor/models/deepseek_v2.py index 79ddd3d0f62..68a0a83d620 100644 --- a/vllm/model_executor/models/deepseek_v2.py +++ b/vllm/model_executor/models/deepseek_v2.py @@ -830,20 +830,6 @@ def compute_logits( sampling_metadata) return logits - def make_empty_intermediate_tensors( - self, batch_size: int, dtype: torch.dtype, - device: torch.device) -> IntermediateTensors: - return IntermediateTensors({ - "hidden_states": - torch.zeros((batch_size, self.config.hidden_size), - dtype=dtype, - device=device), - "residual": - torch.zeros((batch_size, self.config.hidden_size), - dtype=dtype, - device=device), - }) - def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: stacked_params_mapping = [ From f38d30b6f004489ffbd93427914af7963ac78bfb Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Wed, 30 Jul 2025 16:15:43 +0800 Subject: [PATCH 508/552] [benchmark] add max-concurrency in result table (#21095) Signed-off-by: Peter Pan Signed-off-by: x22x22 --- benchmarks/benchmark_serving.py | 4 ++++ benchmarks/benchmark_serving_structured_output.py | 4 ++++ vllm/benchmarks/serve.py | 6 ++++++ 3 files changed, 14 insertions(+) diff --git a/benchmarks/benchmark_serving.py b/benchmarks/benchmark_serving.py index 53bd3247afb..3affa18ae3a 100644 --- a/benchmarks/benchmark_serving.py +++ b/benchmarks/benchmark_serving.py @@ -413,6 +413,10 @@ async def limited_request_func(request_func_input, pbar): print("{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=50, c="=")) print("{:<40} {:<10}".format("Successful requests:", metrics.completed)) + if max_concurrency is not None: + print("{:<40} {:<10}".format("Maximum request concurrency:", max_concurrency)) + if request_rate != float("inf"): + print("{:<40} {:<10.2f}".format("Request rate configured (RPS):", request_rate)) print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration)) print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input)) print("{:<40} 
{:<10}".format("Total generated tokens:", metrics.total_output)) diff --git a/benchmarks/benchmark_serving_structured_output.py b/benchmarks/benchmark_serving_structured_output.py index d535cd5d7e1..2a22f122c78 100644 --- a/benchmarks/benchmark_serving_structured_output.py +++ b/benchmarks/benchmark_serving_structured_output.py @@ -555,6 +555,10 @@ async def limited_request_func(request_func_input, pbar): print("{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=50, c="=")) print("{:<40} {:<10}".format("Successful requests:", metrics.completed)) + if max_concurrency is not None: + print("{:<40} {:<10}".format("Maximum request concurrency:", max_concurrency)) + if request_rate != float("inf"): + print("{:<40} {:<10.2f}".format("Request rate configured (RPS):", request_rate)) print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration)) print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input)) print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output)) diff --git a/vllm/benchmarks/serve.py b/vllm/benchmarks/serve.py index 635363440c0..bd2b1e5990c 100644 --- a/vllm/benchmarks/serve.py +++ b/vllm/benchmarks/serve.py @@ -486,6 +486,12 @@ async def limited_request_func(request_func_input, pbar): print("{s:{c}^{n}}".format(s=' Serving Benchmark Result ', n=50, c='=')) print("{:<40} {:<10}".format("Successful requests:", metrics.completed)) + if max_concurrency is not None: + print("{:<40} {:<10}".format("Maximum request concurrency:", + max_concurrency)) + if request_rate != float('inf'): + print("{:<40} {:<10.2f}".format("Request rate configured (RPS):", + request_rate )) print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration)) print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input)) From 0a4aaa013e496dc486ebedc8f4ce1eb53b352b2e Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 30 Jul 2025 16:32:39 +0800 Subject: [PATCH 509/552] [Doc] Update partial support (#21916) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- docs/features/compatibility_matrix.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/features/compatibility_matrix.md b/docs/features/compatibility_matrix.md index 930265b8f98..5b08b381077 100644 --- a/docs/features/compatibility_matrix.md +++ b/docs/features/compatibility_matrix.md @@ -41,17 +41,18 @@ th:not(:first-child) { | [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | | [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | -| [pooling](../models/pooling_models.md) | ✅\* | ✅\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | +| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | enc-dec | ❌ | [❌](gh-issue:7366) | ❌ | [❌](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | | | logP | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | prmpt logP | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | | | async output | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | | multi-step | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | | -| [mm](multimodal_inputs.md) | ✅ | ✅ | [🟠](gh-pr:4194) | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | +| [mm](multimodal_inputs.md) | ✅ | ✅ | [🟠](gh-pr:4194)^ | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | | best-of | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ✅ | ✅ | | | beam-search | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ❔ | ✅ | ✅ | -\* Chunked prefill and prefix caching are only applicable to 
last-token pooling. +\* Chunked prefill and prefix caching are only applicable to last-token pooling. +^ LoRA is only applicable to the language backbone of multimodal models. [](){ #feature-x-hardware } From 12322576fc2ab0a13b5b713725ac4fc444f8b5ae Mon Sep 17 00:00:00 2001 From: Hongsheng Liu Date: Wed, 30 Jul 2025 20:11:58 +0800 Subject: [PATCH 510/552] [Docs] Fix the example code of streaming chat completions in reasoning (#21825) Signed-off-by: wangzi <3220100013@zju.edu.cn> Co-authored-by: wangzi <3220100013@zju.edu.cn> Co-authored-by: Zi Wang <66560864+BruceW-07@users.noreply.github.com> Signed-off-by: x22x22 --- docs/features/reasoning_outputs.md | 13 ++++++------- ...enai_chat_completion_with_reasoning_streaming.py | 13 ++++++------- 2 files changed, 12 insertions(+), 14 deletions(-) diff --git a/docs/features/reasoning_outputs.md b/docs/features/reasoning_outputs.md index 6b84eca2753..04b943efbbb 100644 --- a/docs/features/reasoning_outputs.md +++ b/docs/features/reasoning_outputs.md @@ -123,13 +123,12 @@ OpenAI Python client library does not officially support `reasoning_content` att printed_content = False for chunk in stream: - reasoning_content = None - content = None - # Check the content is reasoning_content or content - if hasattr(chunk.choices[0].delta, "reasoning_content"): - reasoning_content = chunk.choices[0].delta.reasoning_content - elif hasattr(chunk.choices[0].delta, "content"): - content = chunk.choices[0].delta.content + # Safely extract reasoning_content and content from delta, + # defaulting to None if attributes don't exist or are empty strings + reasoning_content = ( + getattr(chunk.choices[0].delta, "reasoning_content", None) or None + ) + content = getattr(chunk.choices[0].delta, "content", None) or None if reasoning_content is not None: if not printed_reasoning_content: diff --git a/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py b/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py index 5a919297709..7d1ea377145 100644 --- a/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py +++ b/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py @@ -51,13 +51,12 @@ def main(): printed_content = False for chunk in stream: - reasoning_content = None - content = None - # Check the content is reasoning_content or content - if hasattr(chunk.choices[0].delta, "reasoning_content"): - reasoning_content = chunk.choices[0].delta.reasoning_content - elif hasattr(chunk.choices[0].delta, "content"): - content = chunk.choices[0].delta.content + # Safely extract reasoning_content and content from delta, + # defaulting to None if attributes don't exist or are empty strings + reasoning_content = ( + getattr(chunk.choices[0].delta, "reasoning_content", None) or None + ) + content = getattr(chunk.choices[0].delta, "content", None) or None if reasoning_content is not None: if not printed_reasoning_content: From 1f303cc1b7b5fb436650375ac89d31f9e7acfb97 Mon Sep 17 00:00:00 2001 From: Patrick von Platen Date: Wed, 30 Jul 2025 14:42:51 +0200 Subject: [PATCH 511/552] Add @patrickvonplaten as maintainer of mistral's related files. 
(#21928) Signed-off-by: Patrick von Platen Signed-off-by: x22x22 --- .github/CODEOWNERS | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index fb9f44353ce..5bc94429676 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -65,3 +65,11 @@ mkdocs.yaml @hmellor # Qwen-specific files /vllm/attention/backends/dual_chunk_flash_attn.py @sighingnow /vllm/model_executor/models/qwen* @sighingnow + +# Mistral-specific files +/vllm/model_executor/models/mistral*.py @patrickvonplaten +/vllm/model_executor/models/mixtral*.py @patrickvonplaten +/vllm/model_executor/models/voxtral*.py @patrickvonplaten +/vllm/model_executor/models/pixtral*.py @patrickvonplaten +/vllm/transformers_utils/configs/mistral.py @patrickvonplaten +/vllm/transformers_utils/tokenizers/mistral.py @patrickvonplaten From 8f69358fb5f622afeaaaec325895b4e2c2494040 Mon Sep 17 00:00:00 2001 From: Eric Curtin Date: Wed, 30 Jul 2025 14:22:00 +0100 Subject: [PATCH 512/552] [Hardware][CPU] Build fix for ARM without BF16 (#21848) Signed-off-by: Eric Curtin Signed-off-by: x22x22 --- csrc/cpu/quant.cpp | 2 ++ 1 file changed, 2 insertions(+) diff --git a/csrc/cpu/quant.cpp b/csrc/cpu/quant.cpp index c1f7c64ea2f..6e120b8d20a 100644 --- a/csrc/cpu/quant.cpp +++ b/csrc/cpu/quant.cpp @@ -16,12 +16,14 @@ struct KernelVecType { using cvt_vec_type = vec_op::FP32Vec16; }; +#if !defined(__aarch64__) || defined(ARM_BF16_SUPPORT) template <> struct KernelVecType { using load_vec_type = vec_op::BF16Vec16; using azp_adj_load_vec_type = vec_op::INT32Vec16; using cvt_vec_type = vec_op::FP32Vec16; }; +#endif template <> struct KernelVecType { From ffd60734464b4b21b4afa0c32191ef5d493410e4 Mon Sep 17 00:00:00 2001 From: aladerran <108529629+aladerran@users.noreply.github.com> Date: Wed, 30 Jul 2025 21:27:57 +0800 Subject: [PATCH 513/552] [Feature][EPLB] Add eplb support for Qwen3 (#20815) Signed-off-by: aladerran Signed-off-by: x22x22 --- vllm/model_executor/models/qwen3_moe.py | 166 ++++++++++++++++++++---- 1 file changed, 142 insertions(+), 24 deletions(-) diff --git a/vllm/model_executor/models/qwen3_moe.py b/vllm/model_executor/models/qwen3_moe.py index 12899c28016..ca14fd06574 100644 --- a/vllm/model_executor/models/qwen3_moe.py +++ b/vllm/model_executor/models/qwen3_moe.py @@ -22,7 +22,8 @@ # See the License for the specific language governing permissions and # limitations under the License. 
"""Inference-only Qwen3MoE model compatible with HuggingFace weights.""" -from collections.abc import Iterable +import typing +from collections.abc import Callable, Iterable from typing import Any, Optional, Union import torch @@ -31,8 +32,9 @@ from vllm.attention import Attention from vllm.compilation.decorators import support_torch_compile -from vllm.config import CacheConfig, VllmConfig -from vllm.distributed import get_pp_group, get_tensor_model_parallel_world_size +from vllm.config import CacheConfig, VllmConfig, get_current_vllm_config +from vllm.distributed import (get_ep_group, get_pp_group, + get_tensor_model_parallel_world_size) from vllm.logger import init_logger from vllm.model_executor.layers.activation import SiluAndMul from vllm.model_executor.layers.fused_moe import FusedMoE @@ -50,8 +52,8 @@ from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.sequence import IntermediateTensors -from .interfaces import SupportsLoRA, SupportsPP -from .utils import (AutoWeightsLoader, extract_layer_index, +from .interfaces import MixtureOfExperts, SupportsLoRA, SupportsPP +from .utils import (AutoWeightsLoader, PPMissingLayer, extract_layer_index, is_pp_missing_parameter, make_empty_intermediate_tensors_factory, make_layers, maybe_prefix) @@ -101,23 +103,47 @@ def __init__( config: PretrainedConfig, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", + enable_eplb: bool = False, ): super().__init__() self.tp_size = get_tensor_model_parallel_world_size() + self.ep_group = get_ep_group().device_group + self.ep_rank = self.ep_group.rank() + self.ep_size = self.ep_group.size() + self.n_routed_experts = config.num_experts + if self.tp_size > config.num_experts: raise ValueError( f"Tensor parallel size {self.tp_size} is greater than " f"the number of experts {config.num_experts}.") - self.experts = FusedMoE(num_experts=config.num_experts, + # Load balancing settings. 
+ vllm_config = get_current_vllm_config() + parallel_config = vllm_config.parallel_config + self.enable_eplb = enable_eplb + + self.n_logical_experts = self.n_routed_experts + self.n_redundant_experts = parallel_config.num_redundant_experts + self.n_physical_experts = (self.n_logical_experts + + self.n_redundant_experts) + self.n_local_physical_experts = self.n_physical_experts // self.ep_size + + self.physical_expert_start = (self.ep_rank * + self.n_local_physical_experts) + self.physical_expert_end = (self.physical_expert_start + + self.n_local_physical_experts) + + self.experts = FusedMoE(num_experts=self.n_routed_experts, top_k=config.num_experts_per_tok, hidden_size=config.hidden_size, intermediate_size=config.moe_intermediate_size, reduce_results=False, renormalize=config.norm_topk_prob, quant_config=quant_config, - prefix=f"{prefix}.experts") + prefix=f"{prefix}.experts", + enable_eplb=self.enable_eplb, + num_redundant_experts=self.n_redundant_experts) self.gate = ReplicatedLinear(config.hidden_size, config.num_experts, @@ -246,6 +272,7 @@ def __init__( cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", + enable_eplb: bool = False, ) -> None: super().__init__() self.hidden_size = config.hidden_size @@ -277,7 +304,8 @@ def __init__( (layer_idx + 1) % config.decoder_sparse_step == 0): self.mlp = Qwen3MoeSparseMoeBlock(config=config, quant_config=quant_config, - prefix=f"{prefix}.mlp") + prefix=f"{prefix}.mlp", + enable_eplb=enable_eplb) else: self.mlp = Qwen3MoeMLP(hidden_size=config.hidden_size, intermediate_size=config.intermediate_size, @@ -323,6 +351,9 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): config = vllm_config.model_config.hf_config cache_config = vllm_config.cache_config quant_config = vllm_config.quant_config + parallel_config = vllm_config.parallel_config + enable_eplb = parallel_config.enable_eplb + self.num_redundant_experts = parallel_config.num_redundant_experts self.padding_idx = config.pad_token_id self.vocab_size = config.vocab_size @@ -336,7 +367,8 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): lambda prefix: Qwen3MoeDecoderLayer(config=config, cache_config=cache_config, quant_config=quant_config, - prefix=prefix), + prefix=prefix, + enable_eplb=enable_eplb), prefix=f"{prefix}.layers", ) self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) @@ -382,7 +414,8 @@ def get_expert_mapping(self) -> list[tuple[str, str, int, str]]: ckpt_gate_proj_name="gate_proj", ckpt_down_proj_name="down_proj", ckpt_up_proj_name="up_proj", - num_experts=self.config.num_experts) + num_experts=self.config.num_experts, + num_redundant_experts=self.num_redundant_experts) def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: @@ -433,27 +466,51 @@ def load_weights(self, weights: Iterable[tuple[str, weight_loader(param, loaded_weight, shard_id) break else: + is_expert_weight = False for mapping in expert_params_mapping: param_name, weight_name, expert_id, shard_id = mapping if weight_name not in name: continue - name = name.replace(weight_name, param_name) - # Skip layers on other devices. 
- if is_pp_missing_parameter(name, self): + + # Anyway, this is an expert weight and should not be + # attempted to load as other weights later + is_expert_weight = True + + # Do not modify `name` since the loop may continue here + # Instead, create a new variable + name_mapped = name.replace(weight_name, param_name) + + if is_pp_missing_parameter(name_mapped, self): continue + # Skip loading extra parameters for GPTQ/modelopt models. - if name.endswith( - ignore_suffixes) and name not in params_dict: + if name_mapped.endswith( + ignore_suffixes + ) and name_mapped not in params_dict: continue - param = params_dict[name] - weight_loader = param.weight_loader - weight_loader(param, - loaded_weight, - name, - shard_id=shard_id, - expert_id=expert_id) - break + + param = params_dict[name_mapped] + # We should ask the weight loader to return success or not + # here since otherwise we may skip experts with other + # available replicas. + weight_loader = typing.cast(Callable[..., bool], + param.weight_loader) + success = weight_loader(param, + loaded_weight, + name_mapped, + shard_id=shard_id, + expert_id=expert_id, + return_success=True) + if success: + name = name_mapped + break else: + if is_expert_weight: + # We've checked that this is an expert weight + # However it's not mapped locally to this rank + # So we simply skip it + continue + # Skip loading extra parameters for GPTQ/modelopt models. if name.endswith( ignore_suffixes) and name not in params_dict: @@ -482,7 +539,8 @@ def load_weights(self, weights: Iterable[tuple[str, return loaded_params -class Qwen3MoeForCausalLM(nn.Module, SupportsPP, SupportsLoRA): +class Qwen3MoeForCausalLM(nn.Module, SupportsPP, SupportsLoRA, + MixtureOfExperts): packed_modules_mapping = { "qkv_proj": [ "q_proj", @@ -514,6 +572,66 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self.make_empty_intermediate_tensors = ( self.model.make_empty_intermediate_tensors) + # Set MoE hyperparameters + self.expert_weights = [] + + self.moe_layers: list[FusedMoE] = [] + example_layer = None + for layer in self.model.layers: + if isinstance(layer, PPMissingLayer): + continue + + assert isinstance(layer, Qwen3MoeDecoderLayer) + if isinstance(layer.mlp, Qwen3MoeSparseMoeBlock): + example_layer = layer.mlp + self.moe_layers.append(layer.mlp.experts) + + if example_layer is None: + raise RuntimeError("No Qwen3MoE layer found in the model.layers.") + + self.num_moe_layers = len(self.moe_layers) + self.num_expert_groups = 1 + self.num_shared_experts = 0 + self.num_logical_experts = example_layer.n_logical_experts + self.num_physical_experts = example_layer.n_physical_experts + self.num_local_physical_experts = example_layer.n_local_physical_experts + self.num_routed_experts = example_layer.n_routed_experts + self.num_redundant_experts = example_layer.n_redundant_experts + + def set_eplb_state( + self, + expert_load_view: torch.Tensor, + logical_to_physical_map: torch.Tensor, + logical_replica_count: torch.Tensor, + ) -> None: + for layer_idx, layer in enumerate(self.moe_layers): + # Register the expert weights. 
+ self.expert_weights.append(layer.get_expert_weights()) + layer.set_eplb_state( + moe_layer_idx=layer_idx, + expert_load_view=expert_load_view, + logical_to_physical_map=logical_to_physical_map, + logical_replica_count=logical_replica_count, + ) + + def update_physical_experts_metadata( + self, + num_physical_experts: int, + num_local_physical_experts: int, + ) -> None: + assert self.num_local_physical_experts == num_local_physical_experts + self.num_physical_experts = num_physical_experts + self.num_local_physical_experts = num_local_physical_experts + self.num_redundant_experts = (num_physical_experts - + self.num_logical_experts) + for layer in self.model.layers: + if isinstance(layer.mlp, Qwen3MoeSparseMoeBlock): + moe = layer.mlp + moe.n_local_physical_experts = num_local_physical_experts + moe.n_physical_experts = num_physical_experts + moe.n_redundant_experts = self.num_redundant_experts + moe.experts.update_expert_map() + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: return self.model.get_input_embeddings(input_ids) From 3803e15d9a21d342296776cfcae156cef4653ca7 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 30 Jul 2025 21:36:34 +0800 Subject: [PATCH 514/552] [Doc] Remove vLLM prefix and add citation for PagedAttention (#21910) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- .../paged_attention}/k_vecs.png | Bin .../paged_attention}/key.png | Bin .../paged_attention}/logits_vec.png | Bin .../paged_attention}/q_vecs.png | Bin .../paged_attention}/query.png | Bin .../paged_attention}/v_vec.png | Bin .../paged_attention}/value.png | Bin docs/design/paged_attention.md | 29 ++++++++++++------ docs/design/plugin_system.md | 2 +- docs/design/torch_compile.md | 2 +- 10 files changed, 22 insertions(+), 11 deletions(-) rename docs/assets/{kernel => design/paged_attention}/k_vecs.png (100%) rename docs/assets/{kernel => design/paged_attention}/key.png (100%) rename docs/assets/{kernel => design/paged_attention}/logits_vec.png (100%) rename docs/assets/{kernel => design/paged_attention}/q_vecs.png (100%) rename docs/assets/{kernel => design/paged_attention}/query.png (100%) rename docs/assets/{kernel => design/paged_attention}/v_vec.png (100%) rename docs/assets/{kernel => design/paged_attention}/value.png (100%) diff --git a/docs/assets/kernel/k_vecs.png b/docs/assets/design/paged_attention/k_vecs.png similarity index 100% rename from docs/assets/kernel/k_vecs.png rename to docs/assets/design/paged_attention/k_vecs.png diff --git a/docs/assets/kernel/key.png b/docs/assets/design/paged_attention/key.png similarity index 100% rename from docs/assets/kernel/key.png rename to docs/assets/design/paged_attention/key.png diff --git a/docs/assets/kernel/logits_vec.png b/docs/assets/design/paged_attention/logits_vec.png similarity index 100% rename from docs/assets/kernel/logits_vec.png rename to docs/assets/design/paged_attention/logits_vec.png diff --git a/docs/assets/kernel/q_vecs.png b/docs/assets/design/paged_attention/q_vecs.png similarity index 100% rename from docs/assets/kernel/q_vecs.png rename to docs/assets/design/paged_attention/q_vecs.png diff --git a/docs/assets/kernel/query.png b/docs/assets/design/paged_attention/query.png similarity index 100% rename from docs/assets/kernel/query.png rename to docs/assets/design/paged_attention/query.png diff --git a/docs/assets/kernel/v_vec.png b/docs/assets/design/paged_attention/v_vec.png similarity index 100% rename from docs/assets/kernel/v_vec.png rename to docs/assets/design/paged_attention/v_vec.png 
diff --git a/docs/assets/kernel/value.png b/docs/assets/design/paged_attention/value.png
similarity index 100%
rename from docs/assets/kernel/value.png
rename to docs/assets/design/paged_attention/value.png
diff --git a/docs/design/paged_attention.md b/docs/design/paged_attention.md
index ef525e8c604..fb991a35caf 100644
--- a/docs/design/paged_attention.md
+++ b/docs/design/paged_attention.md
@@ -1,7 +1,7 @@
-# vLLM Paged Attention
+# Paged Attention

 !!! warning
-    This document is being kept in the vLLM documentation for historical purposes.
+    This is a historical document based on the [original paper for vLLM](https://arxiv.org/abs/2309.06180).
     It no longer describes the code used in vLLM today.

 Currently, vLLM utilizes its own implementation of a multi-head query
@@ -140,7 +140,7 @@ const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
 ```
- ![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
+ ![](../assets/design/paged_attention/query.png){ align="center" alt="query" width="70%" }
 Each thread defines its own `q_ptr` which points to the assigned
@@ -149,7 +149,7 @@ and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
 total of 128 elements divided into 128 / 4 = 32 vecs.
- ![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
+ ![](../assets/design/paged_attention/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
 ```cpp
@@ -188,7 +188,7 @@ points to key token data based on `k_cache` at assigned block, assigned head
 and assigned token.
- ![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
+ ![](../assets/design/paged_attention/key.png){ align="center" alt="key" width="70%" }
 The diagram above illustrates the memory layout for key data. It
@@ -203,7 +203,7 @@ elements for one token) that will be processed by 2 threads (one
 thread group) separately.
- ![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
+ ![](../assets/design/paged_attention/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
 ```cpp
@@ -362,15 +362,15 @@ later steps. Now, it should store the normalized softmax result of
 ## Value
- ![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" }
+ ![](../assets/design/paged_attention/value.png){ align="center" alt="value" width="70%" }
- ![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
+ ![](../assets/design/paged_attention/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
- ![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" }
+ ![](../assets/design/paged_attention/v_vec.png){ align="center" alt="v_vec" width="70%" }
Now we need to retrieve the value data and perform dot multiplication @@ -499,3 +499,14 @@ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) { Finally, we need to iterate over different assigned head positions and write out the corresponding accumulated result based on the `out_ptr`. + +## Citation + +```bibtex +@inproceedings{kwon2023efficient, + title={Efficient Memory Management for Large Language Model Serving with PagedAttention}, + author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica}, + booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles}, + year={2023} +} +``` diff --git a/docs/design/plugin_system.md b/docs/design/plugin_system.md index 23a05ac719c..ca1c2c2305d 100644 --- a/docs/design/plugin_system.md +++ b/docs/design/plugin_system.md @@ -1,4 +1,4 @@ -# vLLM's Plugin System +# Plugin System The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM. diff --git a/docs/design/torch_compile.md b/docs/design/torch_compile.md index 2d76e7f3adc..47ac4958dbf 100644 --- a/docs/design/torch_compile.md +++ b/docs/design/torch_compile.md @@ -1,4 +1,4 @@ -# vLLM's `torch.compile` integration +# `torch.compile` integration In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This document gives a simple walk-through example to show how to understand the `torch.compile` usage. From 5689504afe95f109a0deeab7ba689908f80f239a Mon Sep 17 00:00:00 2001 From: "rongfu.leng" Date: Wed, 30 Jul 2025 21:51:58 +0800 Subject: [PATCH 515/552] [Bugfix] we should use metavar is not choices (#21902) Signed-off-by: rongfu.leng Signed-off-by: x22x22 --- vllm/entrypoints/openai/cli_args.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index 2d19e16883a..282493e5435 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -194,7 +194,9 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: # Special case: Tool call parser shows built-in options. 
valid_tool_parsers = list(ToolParserManager.tool_parsers.keys()) - frontend_kwargs["tool_call_parser"]["choices"] = valid_tool_parsers + parsers_str = ",".join(valid_tool_parsers) + frontend_kwargs["tool_call_parser"]["metavar"] = ( + f"{{{parsers_str}}} or name registered in --tool-parser-plugin") frontend_group = parser.add_argument_group( title="Frontend", From 44827e4fdd05875b599f682dbc0d1afd73042a6f Mon Sep 17 00:00:00 2001 From: Yan Pashkovsky Date: Wed, 30 Jul 2025 15:03:23 +0100 Subject: [PATCH 516/552] [Feature] Support multiple api keys in server (#18548) Signed-off-by: Yan Pashkovsky Signed-off-by: x22x22 --- docs/getting_started/quickstart.md | 1 + vllm/entrypoints/openai/api_server.py | 12 +++---- vllm/entrypoints/openai/cli_args.py | 46 +++++++++++++-------------- 3 files changed, 30 insertions(+), 29 deletions(-) diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md index 74235db16a1..3a93497fab1 100644 --- a/docs/getting_started/quickstart.md +++ b/docs/getting_started/quickstart.md @@ -126,6 +126,7 @@ curl http://localhost:8000/v1/models ``` You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header. +You can pass multiple keys after `--api-key`, and the server will accept any of the keys passed, this can be useful for key rotation. ### OpenAI Completions API with vLLM diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index c375c875510..05d9a69a65f 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -1239,9 +1239,9 @@ class AuthenticationMiddleware: 2. The request path doesn't start with /v1 (e.g. /health). """ - def __init__(self, app: ASGIApp, api_token: str) -> None: + def __init__(self, app: ASGIApp, tokens: list[str]) -> None: self.app = app - self.api_token = api_token + self.api_tokens = {f"Bearer {token}" for token in tokens} def __call__(self, scope: Scope, receive: Receive, send: Send) -> Awaitable[None]: @@ -1255,7 +1255,7 @@ def __call__(self, scope: Scope, receive: Receive, headers = Headers(scope=scope) # Type narrow to satisfy mypy. if url_path.startswith("/v1") and headers.get( - "Authorization") != f"Bearer {self.api_token}": + "Authorization") not in self.api_tokens: response = JSONResponse(content={"error": "Unauthorized"}, status_code=401) return response(scope, receive, send) @@ -1303,7 +1303,7 @@ class ScalingMiddleware: """ Middleware that checks if the model is currently scaling and returns a 503 Service Unavailable response if it is. - + This middleware applies to all HTTP requests and prevents processing when the model is in a scaling state. 
""" @@ -1512,8 +1512,8 @@ async def validation_exception_handler(_: Request, status_code=HTTPStatus.BAD_REQUEST) # Ensure --api-key option from CLI takes precedence over VLLM_API_KEY - if token := args.api_key or envs.VLLM_API_KEY: - app.add_middleware(AuthenticationMiddleware, api_token=token) + if tokens := [key for key in (args.api_key or [envs.VLLM_API_KEY]) if key]: + app.add_middleware(AuthenticationMiddleware, tokens=tokens) if args.enable_request_id_headers: app.add_middleware(XRequestIdMiddleware) diff --git a/vllm/entrypoints/openai/cli_args.py b/vllm/entrypoints/openai/cli_args.py index 282493e5435..dfbc9cde3d5 100644 --- a/vllm/entrypoints/openai/cli_args.py +++ b/vllm/entrypoints/openai/cli_args.py @@ -85,22 +85,22 @@ class FrontendArgs: """Allowed methods.""" allowed_headers: list[str] = field(default_factory=lambda: ["*"]) """Allowed headers.""" - api_key: Optional[str] = None - """If provided, the server will require this key to be presented in the - header.""" + api_key: Optional[list[str]] = None + """If provided, the server will require one of these keys to be presented in + the header.""" lora_modules: Optional[list[LoRAModulePath]] = None """LoRA modules configurations in either 'name=path' format or JSON format - or JSON list format. Example (old format): `'name=path'` Example (new - format): `{\"name\": \"name\", \"path\": \"lora_path\", + or JSON list format. Example (old format): `'name=path'` Example (new + format): `{\"name\": \"name\", \"path\": \"lora_path\", \"base_model_name\": \"id\"}`""" chat_template: Optional[str] = None - """The file path to the chat template, or the template in single-line form + """The file path to the chat template, or the template in single-line form for the specified model.""" chat_template_content_format: ChatTemplateContentFormatOption = "auto" """The format to render message content within a chat template. * "string" will render the content as a string. Example: `"Hello World"` -* "openai" will render the content as a list of dictionaries, similar to OpenAI +* "openai" will render the content as a list of dictionaries, similar to OpenAI schema. Example: `[{"type": "text", "text": "Hello world!"}]`""" response_role: str = "assistant" """The role name to return if `request.add_generation_prompt=true`.""" @@ -117,40 +117,40 @@ class FrontendArgs: root_path: Optional[str] = None """FastAPI root_path when app is behind a path based routing proxy.""" middleware: list[str] = field(default_factory=lambda: []) - """Additional ASGI middleware to apply to the app. We accept multiple - --middleware arguments. The value should be an import path. If a function - is provided, vLLM will add it to the server using - `@app.middleware('http')`. If a class is provided, vLLM will + """Additional ASGI middleware to apply to the app. We accept multiple + --middleware arguments. The value should be an import path. If a function + is provided, vLLM will add it to the server using + `@app.middleware('http')`. 
If a class is provided, vLLM will add it to the server using `app.add_middleware()`.""" return_tokens_as_token_ids: bool = False - """When `--max-logprobs` is specified, represents single tokens as - strings of the form 'token_id:{token_id}' so that tokens that are not + """When `--max-logprobs` is specified, represents single tokens as + strings of the form 'token_id:{token_id}' so that tokens that are not JSON-encodable can be identified.""" disable_frontend_multiprocessing: bool = False - """If specified, will run the OpenAI frontend server in the same process as + """If specified, will run the OpenAI frontend server in the same process as the model serving engine.""" enable_request_id_headers: bool = False - """If specified, API server will add X-Request-Id header to responses. + """If specified, API server will add X-Request-Id header to responses. Caution: this hurts performance at high QPS.""" enable_auto_tool_choice: bool = False - """If specified, exclude tool definitions in prompts when + """If specified, exclude tool definitions in prompts when tool_choice='none'.""" exclude_tools_when_tool_choice_none: bool = False - """Enable auto tool choice for supported models. Use `--tool-call-parser` + """Enable auto tool choice for supported models. Use `--tool-call-parser` to specify which parser to use.""" tool_call_parser: Optional[str] = None - """Select the tool call parser depending on the model that you're using. - This is used to parse the model-generated tool call into OpenAI API format. - Required for `--enable-auto-tool-choice`. You can choose any option from + """Select the tool call parser depending on the model that you're using. + This is used to parse the model-generated tool call into OpenAI API format. + Required for `--enable-auto-tool-choice`. You can choose any option from the built-in parsers or register a plugin via `--tool-parser-plugin`.""" tool_parser_plugin: str = "" - """Special the tool parser plugin write to parse the model-generated tool - into OpenAI API format, the name register in this plugin can be used in + """Special the tool parser plugin write to parse the model-generated tool + into OpenAI API format, the name register in this plugin can be used in `--tool-call-parser`.""" log_config_file: Optional[str] = envs.VLLM_LOGGING_CONFIG_PATH """Path to logging config JSON file for both vllm and uvicorn""" max_log_len: Optional[int] = None - """Max number of prompt characters or prompt ID numbers being printed in + """Max number of prompt characters or prompt ID numbers being printed in log. The default of None means unlimited.""" disable_fastapi_docs: bool = False """Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoint.""" From c8183ec9f769b2db5dc001b5620f83297b94212f Mon Sep 17 00:00:00 2001 From: youkaichao Date: Wed, 30 Jul 2025 22:05:04 +0800 Subject: [PATCH 517/552] [misc] skip p2p check by default (#21904) Signed-off-by: x22x22 --- vllm/envs.py | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/vllm/envs.py b/vllm/envs.py index 50cb3b7d1b7..ec4b0888d0f 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -668,12 +668,14 @@ def get_vllm_port() -> Optional[int]: (os.environ.get("VLLM_ALLOW_RUNTIME_LORA_UPDATING", "0").strip().lower() in ("1", "true")), - # By default, vLLM will check the peer-to-peer capability itself, - # in case of broken drivers. 
See https://github.com/vllm-project/vllm/blob/a9b15c606fea67a072416ea0ea115261a2756058/vllm/distributed/device_communicators/custom_all_reduce_utils.py#L101-L108 for details. # noqa - # If this env var is set to 1, vLLM will skip the peer-to-peer check, - # and trust the driver's peer-to-peer capability report. + # We assume drivers can report p2p status correctly. + # If the program hangs when using custom allreduce, + # potantially caused by a bug in the driver (535 series), + # if might be helpful to set VLLM_SKIP_P2P_CHECK=0 + # so that vLLM can verify if p2p is actually working. + # See https://github.com/vllm-project/vllm/blob/a9b15c606fea67a072416ea0ea115261a2756058/vllm/distributed/device_communicators/custom_all_reduce_utils.py#L101-L108 for details. # noqa "VLLM_SKIP_P2P_CHECK": - lambda: os.getenv("VLLM_SKIP_P2P_CHECK", "0") == "1", + lambda: os.getenv("VLLM_SKIP_P2P_CHECK", "1") == "1", # List of quantization kernels that should be disabled, used for testing # and performance comparisons. Currently only affects MPLinearKernel From a03b93a40376727d466dcf4996c8a55fbc7ae268 Mon Sep 17 00:00:00 2001 From: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Wed, 30 Jul 2025 10:15:02 -0400 Subject: [PATCH 518/552] [Test] Add Benchmark and Unit Test for `per_token_group_quant` (#21860) Signed-off-by: yewentao256 Signed-off-by: x22x22 --- .../benchmark_per_token_group_quant.py | 159 ++++++++++++++++++ .../test_per_token_group_quant.py | 31 +++- 2 files changed, 189 insertions(+), 1 deletion(-) create mode 100644 benchmarks/kernels/benchmark_per_token_group_quant.py diff --git a/benchmarks/kernels/benchmark_per_token_group_quant.py b/benchmarks/kernels/benchmark_per_token_group_quant.py new file mode 100644 index 00000000000..1ccb5e08b3d --- /dev/null +++ b/benchmarks/kernels/benchmark_per_token_group_quant.py @@ -0,0 +1,159 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import argparse +import math +from contextlib import contextmanager +from typing import Callable +from unittest.mock import patch + +import torch + +from vllm.model_executor.layers.quantization.utils import fp8_utils, int8_utils +from vllm.platforms import current_platform + + +@contextmanager +def _triton_mode(): + """Temporarily force the Triton fallback path""" + with patch("vllm.platforms.current_platform.is_cuda", return_value=False): + yield + + +def _time_cuda( + fn: Callable[[], tuple[torch.Tensor, torch.Tensor]], + warmup_iters: int, + bench_iters: int, +) -> float: + # warmup + for _ in range(warmup_iters): + fn() + torch.cuda.synchronize() + + start = torch.cuda.Event(enable_timing=True) + end = torch.cuda.Event(enable_timing=True) + + start.record() + for _ in range(bench_iters): + fn() + end.record() + torch.cuda.synchronize() + + return start.elapsed_time(end) / bench_iters # ms/iter + + +def _run_single( + shape: tuple[int, int], + group_size: int, + dtype: str, + *, + column_major: bool = False, + scale_ue8m0: bool = False, + warmup_iters: int, + bench_iters: int, +) -> None: + num_tokens, hidden_dim = shape + + device = torch.device("cuda") + torch.manual_seed(42) + x = torch.randn(num_tokens, hidden_dim, device=device, dtype=torch.bfloat16) * 8 + + if dtype == "fp8": + + def cuda_impl(): + return fp8_utils.per_token_group_quant_fp8( + x, + group_size, + column_major_scales=column_major, + use_ue8m0=scale_ue8m0, + ) + + def triton_impl(): + with _triton_mode(): + return fp8_utils.per_token_group_quant_fp8( 
+ x, + group_size, + column_major_scales=column_major, + use_ue8m0=scale_ue8m0, + ) + elif dtype == "int8": + + def cuda_impl(): + return int8_utils.per_token_group_quant_int8(x, group_size) + + def triton_impl(): + with _triton_mode(): + return int8_utils.per_token_group_quant_int8(x, group_size) + else: + raise ValueError("dtype must be 'fp8' or 'int8'") + + cuda_ms = _time_cuda(cuda_impl, warmup_iters, bench_iters) + triton_ms = _time_cuda(triton_impl, warmup_iters, bench_iters) + + speedup = triton_ms / cuda_ms if cuda_ms else math.inf + + cfg_desc = ( + f"shape={shape} gs={group_size:<3} col_major={column_major:<5} " + f"ue8m0={scale_ue8m0:<5} dtype={dtype}" + ) + print( + f"{cfg_desc:55} | CUDA {cuda_ms:7.3f} ms | Triton {triton_ms:7.3f} ms | " + f"speed-up ×{speedup:5.2f}" + ) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--warmup-iters", type=int, default=10) + parser.add_argument("--bench-iters", type=int, default=100) + parser.add_argument("--dtype", choices=["fp8", "int8", "both"], default="both") + return parser.parse_args() + + +if __name__ == "__main__": + if not current_platform.is_cuda(): + raise RuntimeError("CUDA device is required to run this benchmark.") + + args = parse_args() + warmup_iters, bench_iters = args.warmup_iters, args.bench_iters + + shapes = [(32, 128), (64, 256), (16, 512)] + group_sizes = [64, 128] + + dtypes = ["fp8", "int8"] if args.dtype == "both" else [args.dtype] + + header = ( + "Configuration".ljust(55) + + " | " + + "CUDA (ms)".center(12) + + " | " + + "Triton (ms)".center(13) + + " | " + + "Speed-up" + ) + print(header) + print("-" * len(header)) + + for dtype in dtypes: + for shape in shapes: + for gs in group_sizes: + if dtype == "fp8": + for col_major in (False, True): + for ue8m0 in (False, True): + _run_single( + shape, + gs, + dtype, + column_major=col_major, + scale_ue8m0=ue8m0, + warmup_iters=warmup_iters, + bench_iters=bench_iters, + ) + else: # INT8 has no col-major / ue8m0 switches + _run_single( + shape, + gs, + dtype, + warmup_iters=warmup_iters, + bench_iters=bench_iters, + ) diff --git a/tests/kernels/quantization/test_per_token_group_quant.py b/tests/kernels/quantization/test_per_token_group_quant.py index f826983fe94..07f17d1efe6 100644 --- a/tests/kernels/quantization/test_per_token_group_quant.py +++ b/tests/kernels/quantization/test_per_token_group_quant.py @@ -5,7 +5,7 @@ import pytest import torch -from vllm.model_executor.layers.quantization.utils import fp8_utils +from vllm.model_executor.layers.quantization.utils import fp8_utils, int8_utils @pytest.mark.parametrize("shape", [(32, 128), (64, 256), (16, 512)]) @@ -42,3 +42,32 @@ def test_per_token_group_quant_fp8(shape, column_major: bool, assert torch.allclose(out_q.float(), ref_q.float(), atol=0.15, rtol=0.15) assert torch.allclose(scale, ref_s, atol=0.01, rtol=0.01) + + +@pytest.mark.parametrize("shape", [(32, 128), (64, 256), (16, 512)]) +@pytest.mark.parametrize("group_size", [64, 128]) +@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available") +def test_per_token_group_quant_int8(shape, group_size: int): + device = "cuda" + + torch.manual_seed(42) + num_tokens, hidden_dim = shape + + x = (torch.randn( + (num_tokens, hidden_dim), device=device, dtype=torch.bfloat16) * 8) + + # cuda path + out_q, scale = int8_utils.per_token_group_quant_int8( + x, + group_size, + ) + + # triton ref + with patch("vllm.platforms.current_platform.is_cuda", return_value=False): + ref_q, ref_s = 
int8_utils.per_token_group_quant_int8( + x, + group_size, + ) + + assert torch.allclose(out_q.float(), ref_q.float(), atol=0.15, rtol=0.15) + assert torch.allclose(scale, ref_s, atol=0.01, rtol=0.01) From ed8b20cda06648299895c4c4a32ed16b87de8bbc Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 30 Jul 2025 22:17:14 +0800 Subject: [PATCH 519/552] [CI/Build] Only run markdownlint in CI (#21892) Signed-off-by: DarkLight1337 Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- .github/workflows/matchers/markdownlint.json | 17 +++++++++++++++++ .github/workflows/pre-commit.yml | 1 + .pre-commit-config.yaml | 3 ++- 3 files changed, 20 insertions(+), 1 deletion(-) create mode 100644 .github/workflows/matchers/markdownlint.json diff --git a/.github/workflows/matchers/markdownlint.json b/.github/workflows/matchers/markdownlint.json new file mode 100644 index 00000000000..fe094a9badb --- /dev/null +++ b/.github/workflows/matchers/markdownlint.json @@ -0,0 +1,17 @@ +{ + "problemMatcher": [ + { + "owner": "markdownlint", + "pattern": [ + { + "regexp": "^([^:]*):(\\d+):?(\\d+)?\\s([\\w-\\/]*)\\s(.*)$", + "file": 1, + "line": 2, + "column": 3, + "code": 4, + "message": 5 + } + ] + } + ] +} \ No newline at end of file diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml index 8e694d18134..835e91d91ae 100644 --- a/.github/workflows/pre-commit.yml +++ b/.github/workflows/pre-commit.yml @@ -17,6 +17,7 @@ jobs: with: python-version: "3.12" - run: echo "::add-matcher::.github/workflows/matchers/actionlint.json" + - run: echo "::add-matcher::.github/workflows/matchers/markdownlint.json" - run: echo "::add-matcher::.github/workflows/matchers/mypy.json" - uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1 with: diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 045096cb863..612b290e88d 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -38,8 +38,9 @@ repos: - repo: https://github.com/igorshubovych/markdownlint-cli rev: v0.45.0 hooks: - - id: markdownlint-fix + - id: markdownlint exclude: '.*\.inc\.md' + stages: [manual] # Only run in CI - repo: https://github.com/rhysd/actionlint rev: v1.7.7 hooks: From 842736e7601859aa127fec1baab49e7287df19f3 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Wed, 30 Jul 2025 15:18:02 +0100 Subject: [PATCH 520/552] Reduce time wasted in GitHub Actions using `concurrency` (#21919) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- .github/workflows/lint-and-deploy.yaml | 4 ++++ .github/workflows/pre-commit.yml | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/.github/workflows/lint-and-deploy.yaml b/.github/workflows/lint-and-deploy.yaml index 74a7a3a3530..2b1086b7faf 100644 --- a/.github/workflows/lint-and-deploy.yaml +++ b/.github/workflows/lint-and-deploy.yaml @@ -2,6 +2,10 @@ name: Lint and Deploy Charts on: pull_request +concurrency: + group: ${{ github.workflow }}-${{ github.ref }} + cancel-in-progress: true + permissions: contents: read diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml index 835e91d91ae..195579f206a 100644 --- a/.github/workflows/pre-commit.yml +++ b/.github/workflows/pre-commit.yml @@ -5,6 +5,10 @@ on: push: branches: [main] +concurrency: + group: ${{ github.workflow }}-${{ github.ref }} + 
cancel-in-progress: ${{ github.event_name == 'pull_request' }} + permissions: contents: read From 4884f02396686f791bec32eb39768f1ffc3b116d Mon Sep 17 00:00:00 2001 From: Ruixiang Tan <819464715@qq.com> Date: Wed, 30 Jul 2025 22:20:43 +0800 Subject: [PATCH 521/552] [Misc] Improve code readability of KVCacheManager (#21673) Signed-off-by: tanruixiang Signed-off-by: Ruixiang Tan <819464715@qq.com> Signed-off-by: GitHub Signed-off-by: x22x22 --- tests/v1/core/test_kv_cache_utils.py | 4 ++-- vllm/v1/core/block_pool.py | 2 +- vllm/v1/core/kv_cache_coordinator.py | 9 ++++++--- vllm/v1/core/kv_cache_manager.py | 5 +---- vllm/v1/core/kv_cache_utils.py | 8 -------- vllm/v1/core/single_type_kv_cache_manager.py | 12 ++++++++---- 6 files changed, 18 insertions(+), 22 deletions(-) diff --git a/tests/v1/core/test_kv_cache_utils.py b/tests/v1/core/test_kv_cache_utils.py index e9c6f1f95cd..bff3724d95e 100644 --- a/tests/v1/core/test_kv_cache_utils.py +++ b/tests/v1/core/test_kv_cache_utils.py @@ -112,9 +112,9 @@ def test_kv_cache_block(): assert block.block_hash is None # Test reference count manipulation - block.incr_ref() + block.ref_cnt += 1 assert block.ref_cnt == 1 - block.decr_ref() + block.ref_cnt -= 1 assert block.ref_cnt == 0 # Test block hash setting and resetting diff --git a/vllm/v1/core/block_pool.py b/vllm/v1/core/block_pool.py index 5bf4d3a2acb..ad9854dd29c 100644 --- a/vllm/v1/core/block_pool.py +++ b/vllm/v1/core/block_pool.py @@ -276,7 +276,7 @@ def touch(self, blocks: tuple[list[KVCacheBlock], ...]) -> None: # candidate), so remove it. if block.ref_cnt == 0 and not block.is_null: self.free_block_queue.remove(block) - block.incr_ref() + block.ref_cnt += 1 def free_blocks(self, ordered_blocks: Iterable[KVCacheBlock]) -> None: """Free a list of blocks. The blocks should be ordered by their diff --git a/vllm/v1/core/kv_cache_coordinator.py b/vllm/v1/core/kv_cache_coordinator.py index 258805843e2..f3a16d64e19 100644 --- a/vllm/v1/core/kv_cache_coordinator.py +++ b/vllm/v1/core/kv_cache_coordinator.py @@ -126,14 +126,17 @@ def free(self, request_id: str) -> None: def get_num_common_prefix_blocks(self, request_id: str, num_running_requests: int) -> list[int]: """ - Get the number of common prefix blocks for a request. + Get the number of common prefix blocks for all requests in the RUNNING + state for each kv cache group. Args: request_id: The request ID. - num_running_requests: The number of requests in the RUNNING state. + num_running_requests: The total number of requests in the RUNNING + state. Returns: - list[int]: The number of common prefix blocks. + list[int]: The number of common prefix blocks for all requests in + the RUNNING state for each kv cache group. """ num_blocks_per_group = [ manager.get_num_common_prefix_blocks(request_id, diff --git a/vllm/v1/core/kv_cache_manager.py b/vllm/v1/core/kv_cache_manager.py index e820a0ad6d5..ce333dbe61a 100644 --- a/vllm/v1/core/kv_cache_manager.py +++ b/vllm/v1/core/kv_cache_manager.py @@ -170,10 +170,6 @@ def get_computed_blocks(self, self.block_size, request) self.req_to_block_hashes[request.request_id] = block_hashes - if self.log_stats: - assert self.prefix_cache_stats is not None - self.prefix_cache_stats.requests += 1 - # NOTE: When all tokens hit the cache, we must recompute the last token # to obtain logits. Thus, set max_cache_hit_length to prompt_length - 1. 
# This can trigger recomputation of an entire block, rather than just @@ -187,6 +183,7 @@ def get_computed_blocks(self, if self.log_stats: assert self.prefix_cache_stats is not None + self.prefix_cache_stats.requests += 1 self.prefix_cache_stats.queries += request.num_tokens self.prefix_cache_stats.hits += num_new_computed_tokens diff --git a/vllm/v1/core/kv_cache_utils.py b/vllm/v1/core/kv_cache_utils.py index 3a72ac271af..25520eb6551 100644 --- a/vllm/v1/core/kv_cache_utils.py +++ b/vllm/v1/core/kv_cache_utils.py @@ -154,14 +154,6 @@ class KVCacheBlock: # Whether the block is a null block that should never be cached. is_null: bool = False - # TODO(Jialin): For performance, let callers handle ref_cnt bumps to - # avoid function calls. - def incr_ref(self): - self.ref_cnt += 1 - - def decr_ref(self): - self.ref_cnt -= 1 - @property def block_hash(self) -> Optional[BlockHashWithGroupId]: return self._block_hash diff --git a/vllm/v1/core/single_type_kv_cache_manager.py b/vllm/v1/core/single_type_kv_cache_manager.py index 714f49494c9..8f310023a8c 100644 --- a/vllm/v1/core/single_type_kv_cache_manager.py +++ b/vllm/v1/core/single_type_kv_cache_manager.py @@ -1,5 +1,6 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import itertools from abc import ABC, abstractmethod from collections import defaultdict from typing import Callable @@ -177,14 +178,17 @@ def free(self, request_id: str) -> None: def get_num_common_prefix_blocks(self, request_id: str, num_running_requests: int) -> int: """ - Get the number of common prefix blocks for a request. + Get the number of common prefix blocks for all requests in the RUNNING + state. Args: request_id: The request ID. - num_running_requests: The number of requests in the RUNNING state. + num_running_requests: The total number of requests in the RUNNING + state. Returns: - The number of common prefix blocks. + The number of common prefix blocks for all requests in the RUNNING + state. """ raise NotImplementedError @@ -264,7 +268,7 @@ def find_longest_cache_hit( computed_blocks: tuple[list[KVCacheBlock], ...] = tuple( [] for _ in range(len(kv_cache_group_ids))) max_num_blocks = max_length // kv_cache_spec.block_size - for i, block_hash in zip(range(max_num_blocks), block_hashes): + for block_hash in itertools.islice(block_hashes, max_num_blocks): # block_hashes is a chain of block hashes. If a block hash is not # in the cached_block_hash_to_id, the following block hashes are # not computed yet for sure. 
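For readers unfamiliar with the `itertools.islice` idiom adopted in the hunk above, here is a small self-contained sketch of the same pattern; compared with `zip(range(n), items)`, `islice` states the cap on iterations directly and avoids an unused index. The data structures below are illustrative placeholders, not vLLM's actual classes.

```python
import itertools

# Illustrative stand-ins: a chain of block hashes and a lookup table of
# already-cached blocks (hash -> block id).
block_hashes = ["h0", "h1", "h2", "h3", "h4"]
cached = {"h0": 101, "h1": 102}

max_num_blocks = 3
hit_blocks = []

# islice caps the iteration at max_num_blocks without materializing a range
# or carrying an index variable, mirroring the refactored loop above.
for block_hash in itertools.islice(block_hashes, max_num_blocks):
    block_id = cached.get(block_hash)
    if block_id is None:
        # The first miss ends the longest-prefix hit chain.
        break
    hit_blocks.append(block_id)

print(hit_blocks)  # -> [101, 102]
```
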
From 88243b4fa31a59358a4d3046c516fc5260e9d39a Mon Sep 17 00:00:00 2001 From: "Po-Han Huang (NVIDIA)" <53919306+nvpohanh@users.noreply.github.com> Date: Wed, 30 Jul 2025 22:33:40 +0800 Subject: [PATCH 522/552] [NVIDIA] Fix Llama4 Scout FP4 functionality issues (#21499) Signed-off-by: Po-Han Huang Signed-off-by: x22x22 --- vllm/model_executor/layers/fused_moe/layer.py | 15 +- .../layers/quantization/modelopt.py | 2 - vllm/model_executor/models/llama4.py | 272 +++++++++++++----- 3 files changed, 219 insertions(+), 70 deletions(-) diff --git a/vllm/model_executor/layers/fused_moe/layer.py b/vllm/model_executor/layers/fused_moe/layer.py index 254cd2e10b8..e16fc13c945 100644 --- a/vllm/model_executor/layers/fused_moe/layer.py +++ b/vllm/model_executor/layers/fused_moe/layer.py @@ -874,6 +874,14 @@ def _load_per_tensor_weight_scale(self, shard_id: str, elif shard_id == "w2": param_data[expert_id] = loaded_weight + def _load_w13_weight_scale(self, shard_dim: int, + loaded_weight: torch.Tensor, + param: torch.Tensor, tp_rank: int): + shard_size = param.shape[shard_dim] + loaded_weight = loaded_weight.narrow(shard_dim, shard_size * tp_rank, + shard_size) + param.copy_(loaded_weight) + def _load_model_weight_or_group_weight_scale(self, shard_dim: int, expert_data: torch.Tensor, @@ -1123,7 +1131,12 @@ def weight_loader(self, "weight_scale_2" in weight_name if uses_weight_scale_2 else "weight_scale" in weight_name) or "input_scale" in weight_name - if per_tensor_conditions: + if "w13_weight_scale" in weight_name: + self._load_w13_weight_scale(shard_dim=shard_dim, + loaded_weight=loaded_weight, + param=param, + tp_rank=self.tp_rank) + elif per_tensor_conditions: self._load_per_tensor_weight_scale( shard_id=shard_id, param=param, diff --git a/vllm/model_executor/layers/quantization/modelopt.py b/vllm/model_executor/layers/quantization/modelopt.py index 38866586ae2..8fbc3231d86 100644 --- a/vllm/model_executor/layers/quantization/modelopt.py +++ b/vllm/model_executor/layers/quantization/modelopt.py @@ -778,8 +778,6 @@ def process_weights_after_loading(self, layer: Module) -> None: # Swizzle the weight blockscale. # contracting dimension is input dimension # block_size = 16; - assert (layer.weight_scale.shape[1] % 16 == 0), ( - "Expected weight_scale.dim(1) to be divisible by 16") assert (layer.weight_scale.dtype == torch.float8_e4m3fn), ( "Weight Block scale must be represented as FP8-E4M3") swizzled_weight_scale = swizzle_blockscale(layer.weight_scale) diff --git a/vllm/model_executor/models/llama4.py b/vllm/model_executor/models/llama4.py index fab1c163ac2..470e701d980 100644 --- a/vllm/model_executor/models/llama4.py +++ b/vllm/model_executor/models/llama4.py @@ -342,34 +342,94 @@ def load_moe_expert_weights( expert_params_mapping: list[tuple[str, str, int, str]], fused: bool = True, ) -> bool: + """ + Load MoE expert weights. + + Args: + name: The name of the weight to load. + loaded_weight: The weight to load. + params_dict: The dictionary of module parameters. + loaded_params: The set of already loaded parameters. + expert_params_mapping: The mapping of expert parameters. Must be + generated by FusedMoE.make_expert_params_mapping(). + fused: Whether the expert weights are fused into a single weight + tensor or are separate weight tensors for each expert. + When fused is True, loaded_weight should have shape of: + [num_experts, hidden_in, hidden_out] for gate/up/down proj and + [hidden_out, hidden_in] for the others like router. 
+ When fused is False, loaded_weight should have shape of: + [hidden_out, hidden_in]. + + Returns: + True if loaded_weight is one of MoE weights and the MoE expert + weights are loaded successfully, False otherwise. + """ + + # Whether the MoE expert weights are loaded successfully. expert_param_loaded = False - if "experts.gate_up_proj" in name: - loaded_weight = loaded_weight.chunk(2, dim=-1) + + # If fused is True, the loaded weight is in the layout of: + # [num_experts, hidden_in, hidden_out], so we must transpose the last + # two dimensions to match the expected layout of the parameters. + if fused and loaded_weight.ndim == 3: + loaded_weight = loaded_weight.transpose(-1, -2) + + # If the gate_proj and up_proj weights are fused into a single + # weight tensor, we need to split the weight tensor into a tuple + # of two weight tensors along the hidden_out dimension. + if "experts.gate_up_proj" in name: + loaded_weight = loaded_weight.chunk(2, dim=-2) + + # Iterate over all the expert parameters and load the weights if we find + # a match in weight name. for (param_name, weight_name, expert_id, shard_id) in expert_params_mapping: + + # Get a view of the loaded_weight to avoid modifying the original + # one across iterations. new_loaded_weight = loaded_weight + + # If expert weights are fused into a single weight tensor, remove + # the expert index from the expected weight name. if fused: + # The string between e_str and proj_str is the expert index. e_str, _, proj_str, _ = weight_name.split('.') weight_name = f"{e_str}.{proj_str}" param_name = f"{param_name}weight" + + # Skip if the current weight is not one of the MoE weights. if weight_name not in name: continue + + # Replace the weight name with the parameter name. full_param_name = name.replace(weight_name, param_name) - # Skip layers on other devices. + + # Skip if the current weight corresponds to a parameter that + # does not exist on the current PP (pipeline parallel) rank. if is_pp_missing_parameter(name, self): continue + + # Skip if the current weight is for the bias. if ((name.endswith(".bias") or name.endswith("_bias")) and name not in params_dict): continue + param = params_dict[full_param_name] weight_loader = param.weight_loader + if fused: + # If the parameter is for w13 together, the corresponding weight + # will be a tuple, so we must select the correct weight + # depending on the shard id, which is either "w1" or "w3". if "w13" in full_param_name: + assert shard_id in ["w1", "w3"] shard_idx = 0 if shard_id == "w1" else 1 new_loaded_weight = new_loaded_weight[shard_idx] - new_loaded_weight = new_loaded_weight.transpose(-1, -2) + + # If EP (expert parallel) is enabled, update expert_id to the + # starting expert index for the current EP rank and extract the + # corresponding expert weights. layer_idx = extract_layer_index(name) - # EP mapping expert_map = self.layers[ layer_idx].feed_forward.experts.expert_map if expert_map is not None: @@ -382,6 +442,9 @@ def load_moe_expert_weights( else: # TODO: add EP support for non fused weights pass + + # Load the weight into the module parameter with corresponding + # shard id and expert id. weight_loader(param, new_loaded_weight, full_param_name, @@ -390,10 +453,13 @@ def load_moe_expert_weights( loaded_params.add(full_param_name) expert_param_loaded = True + return expert_param_loaded def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: + # Name mapping from the parameter name to the shard name and + # corresponding shard id. 
stacked_params_mapping = [ # (param_name, shard_name, shard_id) (".qkv_proj", ".q_proj", "q"), @@ -402,26 +468,43 @@ def load_weights(self, weights: Iterable[tuple[str, (".gate_up_proj", ".gate_proj", 0), (".gate_up_proj", ".up_proj", 1), ] + # Indicate whether the expert weights are fused into a single weight + # tensor. fused_experts_params = False + # Expert parameter mapping for the case where the expert weights are + # not fused into a single weight tensor. expert_params_mapping = FusedMoE.make_expert_params_mapping( ckpt_gate_proj_name="gate_proj", ckpt_down_proj_name="down_proj", ckpt_up_proj_name="up_proj", num_experts=self.num_experts) + # Expert parameter mapping for the case where the expert weights are + # fused into a single weight tensor. expert_params_mapping_fused = FusedMoE.make_expert_params_mapping( ckpt_gate_proj_name="gate_up_proj", ckpt_down_proj_name="down_proj", ckpt_up_proj_name="gate_up_proj", num_experts=1) + # All the module parameters. params_dict = dict(self.named_parameters()) + # The module parameters that have been loaded. loaded_params: set[str] = set() + + # Iterate over all the weights and load them into module parameters. for name, loaded_weight in weights: + + # If the name contains "experts.gate_up_proj" or "experts.down_proj" + # without the expert indices, it means the expert weights are fused + # into a single weight tensor across all experts. if "experts.gate_up_proj" in name or "experts.down_proj" in name: fused_experts_params = True expert_params_mapping = expert_params_mapping_fused + + # If kv cache quantization scales exist and the weight name + # corresponds to one of the kv cache quantization scales, load + # them. if (self.quant_config is not None and (scale_name := self.quant_config.get_cache_scale(name))): - # Loading kv cache quantization scales param = params_dict[scale_name] weight_loader = getattr(param, "weight_loader", default_weight_loader) @@ -430,84 +513,119 @@ def load_weights(self, weights: Iterable[tuple[str, weight_loader(param, loaded_weight) loaded_params.add(scale_name) continue + + # Iterate over stacked_params_mapping to check if the current weight + # is one of the stacked parameters. If so, load the weight with the + # corresponding shard id. Note that MoE weights are handled + # separately in the else block. for param_name, weight_name, shard_id in stacked_params_mapping: + # Skip if the current weight is not one of the stacked + # parameters or if the current weight is a MoE weight. if weight_name not in name or "experts" in name: continue - # This check is for ModelOpt ckpts with kv cache quant enabled + + # For ModelOpt checkpoints, we need to rename the self_attn + # weight/weight_scale names except for kv cache scales. if not (name.endswith( (".k_scale", ".v_scale")) and "self_attn" in name): name = name.replace(weight_name, param_name) + + # Skip if the current weight corresponds to a parameter that + # does not exist on the current PP (pipeline parallel) rank. if is_pp_missing_parameter(name, self): continue - if name.endswith("scale") and "expert" not in name: - # Remapping the name of FP8 kv-scale. + + # Remap kv cache scale names for ModelOpt checkpoints. + # TODO: ModelOpt should implement get_cache_scale() such that + # kv cache scale name remapping can be done there. + if name.endswith("scale"): name = maybe_remap_kv_scale_name(name, params_dict) if name is None: continue + + # Load the weight into the module parameter with corresponding + # shard id and exit the for loop and the else block. 
param = params_dict[name] weight_loader = getattr(param, "weight_loader", default_weight_loader) + if weight_loader == default_weight_loader: weight_loader(param, loaded_weight) else: weight_loader(param, loaded_weight, shard_id) + loaded_params.add(name) break + + # Handle normal (non-stacked) weights and MoE weights. else: - moe_loaded = self.load_moe_expert_weights( - name, - loaded_weight, - params_dict, - loaded_params, - expert_params_mapping, - fused=fused_experts_params) - - if not moe_loaded: - if is_pp_missing_parameter(name, self): - continue + # First, try to load MoE weights using load_moe_expert_weights. + # If successful, move on to next loaded weight. + if self.load_moe_expert_weights(name, + loaded_weight, + params_dict, + loaded_params, + expert_params_mapping, + fused=fused_experts_params): + continue - # Handle flat expert scale parameters that - # don't match per-expert patterns - if ("experts." in name and ("w13_input_scale" in name - or "w13_weight_scale" in name - or "w2_input_scale" in name - or "w2_weight_scale" in name)): - # These are flat expert scales that apply to all experts - param = params_dict[name] - weight_loader = getattr(param, "weight_loader", - default_weight_loader) - - # Check for MoE-specific loading support via - # attribute instead of expensive runtime reflection - supports_moe = getattr(weight_loader, - 'supports_moe_loading', False) - - if supports_moe: - # This is a MoE weight loader - if "w13_" in name: - shard_id = "w1" - elif "w2_" in name: - shard_id = "w2" - else: - shard_id = "w1" - - weight_loader(param, - loaded_weight, - name, - shard_id=shard_id, - expert_id=0) - else: - # Regular weight loader (handles both - # param.weight_loader and default_weight_loader) - weight_loader(param, loaded_weight) - loaded_params.add(name) - continue + # Skip if the current weight corresponds to a parameter that + # does not exist on the current PP (pipeline parallel) rank. + if is_pp_missing_parameter(name, self): + continue + + # Handle flat expert scale parameters that don't match + # per-expert patterns, i.e. one weight scale tensor for all + # experts. + scale_names = [ + "w13_input_scale", "w13_weight_scale", "w2_input_scale", + "w2_weight_scale" + ] + if ("experts." in name and any(scale_name in name + for scale_name in scale_names)): param = params_dict[name] weight_loader = getattr(param, "weight_loader", default_weight_loader) - weight_loader(param, loaded_weight) + + # If weight loader supports special moe loading, use it to + # avoid expensive runtime reflection + if getattr(weight_loader, 'supports_moe_loading', False): + # Map the weight name to the corresponding shard id. + shard_id = "w2" if "w2_" in name else "w1" + + # Transpose if weight scales are FP8 block scales with + # three dimensions: + # [num_experts, hidden_in, hidden_out]. + if name.endswith("weight_scale") \ + and loaded_weight.dtype == torch.float8_e4m3fn \ + and loaded_weight.ndim == 3: + loaded_weight = loaded_weight.transpose(-1, -2) + + # Load the weight into the module parameter with + # corresponding shard id and expert id. + weight_loader(param, + loaded_weight, + name, + shard_id=shard_id, + expert_id=0) + + else: + # Regular weight loader (handles both + # param.weight_loader and default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + continue + + # Handle normal (non-stacked, non-MoE) weights. 
+ param = params_dict[name] + weight_loader = getattr(param, "weight_loader", + default_weight_loader) + weight_loader(param, loaded_weight) + loaded_params.add(name) + + # Finally, return the set of loaded parameters. return loaded_params @@ -560,23 +678,43 @@ def permute_qk_weight_for_rotary( loaded_weight: torch.Tensor, ) -> tuple[str, torch.Tensor]: - def permute(w: torch.Tensor, n_heads: int): + # Helper function to permute the weight's channels + def permute(w: torch.Tensor, n_heads: int, is_weight_scale: bool): + + # Calculate the expected shape of the weight. + # Do not rely on w's shape, as it may be in another layout. attn_in = self.config.head_dim * n_heads attn_out = self.config.hidden_size + # If the weight is FP4 packed as uint8, we need to divide attn_out + # by 2. + if w.dtype == torch.uint8 and w.shape[1] * 2 == attn_out: + attn_out = attn_out // 2 + + # If the weight is a weight scale, we need to divide attn_out by + # block size, which is currently 16. + elif w.dtype == torch.float8_e4m3fn and is_weight_scale \ + and w.shape[1] * 16 == attn_out: + attn_out = attn_out // 16 + return w.view(n_heads, attn_in // n_heads // 2, 2, attn_out).transpose(1, 2).reshape(attn_in, attn_out) modules = name.split(".") - # rotary embeds should be sliced - if ("wk" in modules or "k_proj" in modules) \ - and modules[-1] == "weight": - loaded_weight = permute(loaded_weight, - self.config.num_key_value_heads) - elif ("wq" in modules or "q_proj" in modules) \ - and modules[-1] == "weight": - loaded_weight = permute(loaded_weight, - self.config.num_attention_heads) + # Permute Q/K weights and weight block scales for rotary embedding + is_weight = modules[-1] == "weight" + is_nvfp4_weight_scale = (modules[-1] == "weight_scale" and + loaded_weight.dtype == torch.float8_e4m3fn) + + if is_weight or is_nvfp4_weight_scale: + if ("wk" in modules or "k_proj" in modules): + loaded_weight = permute(loaded_weight, + self.config.num_key_value_heads, + is_nvfp4_weight_scale) + elif ("wq" in modules or "q_proj" in modules): + loaded_weight = permute(loaded_weight, + self.config.num_attention_heads, + is_nvfp4_weight_scale) return name, loaded_weight From 5b559a6d8a55c399e37c19941ad38b06e2657b91 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Wed, 30 Jul 2025 15:35:08 +0100 Subject: [PATCH 523/552] [Docs] Reduce the size of the built docs (#21920) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- mkdocs.yaml | 7 +++++++ requirements/docs.txt | 1 + 2 files changed, 8 insertions(+) diff --git a/mkdocs.yaml b/mkdocs.yaml index 78f1c5b77cd..e5b74540033 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -67,6 +67,13 @@ plugins: exclude: - argparse/* - examples/* + - minify: + minify_html: true + minify_js: true + minify_css: true + cache_safe: true + js_files: [docs/mkdocs/javascript/*.js] + css_files: [docs/mkdocs/stylesheets/*.css] # For API reference generation - api-autonav: modules: ["vllm"] diff --git a/requirements/docs.txt b/requirements/docs.txt index 9e56c9573b3..4d4fc7da681 100644 --- a/requirements/docs.txt +++ b/requirements/docs.txt @@ -6,6 +6,7 @@ mkdocs-gen-files mkdocs-awesome-nav mkdocs-glightbox mkdocs-git-revision-date-localized-plugin +mkdocs-minify-plugin python-markdown-math regex ruff From 8b16774d39cb3f554260a0bf0a132ea36e22bc11 Mon Sep 17 00:00:00 2001 From: Isotr0py Date: Wed, 30 Jul 2025 22:35:47 +0800 Subject: [PATCH 524/552] [Bugfix] Fix OOM tests in initialization test (#21921) Signed-off-by: 
Isotr0py <2037008807@qq.com> Signed-off-by: x22x22 --- tests/models/test_initialization.py | 14 ++++++++------ vllm/model_executor/models/glm4_1v.py | 1 + 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/tests/models/test_initialization.py b/tests/models/test_initialization.py index d5441540176..4c7da24fca3 100644 --- a/tests/models/test_initialization.py +++ b/tests/models/test_initialization.py @@ -33,12 +33,6 @@ def can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch, model_info.check_available_online(on_fail="skip") model_info.check_transformers_version(on_fail="skip") - # FIXME: Possible memory leak in the previous tests? - if model_arch in ("Glm4vForConditionalGeneration", - "GraniteSpeechForConditionalGeneration", - "KimiVLForConditionalGeneration"): - pytest.skip("Avoid OOM") - if model_arch in ("Llama4ForCausalLM", "EagleLlama4ForCausalLM"): from vllm.model_executor.models.llama4 import Llama4ForCausalLM from vllm.model_executor.models.registry import ModelRegistry @@ -87,6 +81,14 @@ def hf_overrides(hf_config: PretrainedConfig) -> PretrainedConfig: "num_hidden_layers": 1, }) + # e.g.: Qwen/Qwen2-Audio-7B-Instruct + if hasattr(hf_config, "audio_config"): + hf_config.audio_config.update({ + "num_layers": 1, + "num_hidden_layers": 1, + "encoder_layers": 1, + }) + return hf_config # Avoid calling model.forward() diff --git a/vllm/model_executor/models/glm4_1v.py b/vllm/model_executor/models/glm4_1v.py index 1fd65cc9099..ae1bf22c704 100644 --- a/vllm/model_executor/models/glm4_1v.py +++ b/vllm/model_executor/models/glm4_1v.py @@ -1275,6 +1275,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): vllm_config=vllm_config, prefix=maybe_prefix(prefix, ""), architectures=["Glm4ForCausalLM"], + hf_config=self.config.get_text_config(), ) self.make_empty_intermediate_tensors = ( From c31cf241b01987f2742f4bf8e894e284fa5f7f77 Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Wed, 30 Jul 2025 23:42:05 +0800 Subject: [PATCH 525/552] [Bugfix] Fix multi-api server not working for text models (#21933) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/config.py | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/vllm/config.py b/vllm/config.py index 52985229ad7..9576cf2d322 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -856,7 +856,7 @@ def maybe_pull_model_tokenizer_for_s3(self, model: str, self.tokenizer = s3_tokenizer.dir def _init_multimodal_config(self) -> Optional["MultiModalConfig"]: - if self.registry.is_multimodal_model(self.architectures, self): + if self._model_info.supports_multimodal: return MultiModalConfig( limit_per_prompt=self.limit_mm_per_prompt, media_io_kwargs=self.media_io_kwargs, @@ -865,19 +865,6 @@ def _init_multimodal_config(self) -> Optional["MultiModalConfig"]: disable_mm_preprocessor_cache, interleave_mm_strings=self.interleave_mm_strings) - if self.limit_mm_per_prompt: - raise ValueError("`limit_mm_per_prompt` is only supported for " - "multimodal models.") - if self.mm_processor_kwargs: - raise ValueError("`mm_processor_kwargs` is only supported for " - "multimodal models.") - if self.disable_mm_preprocessor_cache: - raise ValueError("`disable_mm_preprocessor_cache` is only " - "supported for multimodal models.") - if self.interleave_mm_strings: - raise ValueError("`interleave_mm_strings` is only " - "supported for multimodal models.") - return None def _get_encoder_config(self): From b64872d2571386a5d8b8650c3a47c312b9cf569e Mon Sep 17 00:00:00 2001 From: Yong Hoon Shin 
<48474650+sarckk@users.noreply.github.com> Date: Wed, 30 Jul 2025 08:54:15 -0700 Subject: [PATCH 526/552] Override attention metadata for fast prefill in some KV sharing setups (#21590) Signed-off-by: Yong Hoon Shin Signed-off-by: x22x22 --- tests/v1/e2e/test_kv_sharing_fast_prefill.py | 143 +++++++++++++++++++ vllm/config.py | 15 ++ vllm/engine/arg_utils.py | 6 + vllm/model_executor/models/gemma3n.py | 1 + vllm/v1/attention/backends/utils.py | 35 ++++- vllm/v1/worker/gpu_model_runner.py | 113 +++++++++++---- 6 files changed, 287 insertions(+), 26 deletions(-) create mode 100644 tests/v1/e2e/test_kv_sharing_fast_prefill.py diff --git a/tests/v1/e2e/test_kv_sharing_fast_prefill.py b/tests/v1/e2e/test_kv_sharing_fast_prefill.py new file mode 100644 index 00000000000..616fc7a8605 --- /dev/null +++ b/tests/v1/e2e/test_kv_sharing_fast_prefill.py @@ -0,0 +1,143 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import gc +import random +from typing import Optional, Union + +import pytest +import torch + +from vllm import LLM, SamplingParams +from vllm.config import CompilationConfig, CompilationLevel +from vllm.forward_context import get_forward_context +from vllm.model_executor.models.gemma3n import Gemma3nForConditionalGeneration +from vllm.model_executor.models.registry import ModelRegistry +from vllm.model_executor.models.utils import extract_layer_index +from vllm.sequence import IntermediateTensors + +from ...utils import fork_new_process_for_each_test + + +class TestGemma3nForConditionalGeneration(Gemma3nForConditionalGeneration): + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + **kwargs, + ) -> Union[torch.Tensor, IntermediateTensors]: + hidden_states = self.model(input_ids, positions, intermediate_tensors, + inputs_embeds, **kwargs) + attn_metadata = get_forward_context().attn_metadata + # attn_metadata is None during dummy runs + if (attn_metadata is not None + and self.cache_config.kv_sharing_fast_prefill): + assert isinstance(attn_metadata, dict) # true in V1 + # Gemma3n-E2B has 30 layers, with last 20 layers being + # cross-decoder layers. 
Check attention metadata is correct + for layer_name, metadata in attn_metadata.items(): + layer_idx = extract_layer_index(layer_name) + if layer_idx >= 20: + assert hasattr(metadata, 'logits_indices_padded') + assert hasattr(metadata, 'num_logits_indices') + else: + assert not hasattr(metadata, 'logits_indices_padded') + assert not hasattr(metadata, 'num_logits_indices') + + # Last layer will be a KV sharing layer + layer_attn_metadata = attn_metadata[ + self.model.language_model.layers[-1].self_attn.attn.layer_name] + logits_indices_padded = (layer_attn_metadata.logits_indices_padded) + assert logits_indices_padded is not None + num_logits_indices = layer_attn_metadata.num_logits_indices + assert num_logits_indices > 0 + # Reset hidden states to random values and + # only set logits at logits_indices to valid values + # Because logits_indices are the only positions that are used + # for output token sampling, this still produces same outputs + logits_hs = hidden_states[logits_indices_padded] + hidden_states = torch.randn_like(hidden_states) + gen_indices = logits_indices_padded[:num_logits_indices] + hidden_states[gen_indices] = logits_hs[:num_logits_indices] + + return hidden_states + + +@pytest.fixture +def test_prompts(): + """ + Adapted from tests/v1/e2e/test_spec_decode.py + """ + prompt_types = ["repeat", "sentence"] + # Setting higher num prompts increases the chance of numerics mismatch + # due to matrix multiplication numerics depending on batch dimension + num_prompts = 10 + prompts = [] + + random.seed(0) + random_prompt_type_choices = random.choices(prompt_types, k=num_prompts) + + for kind in random_prompt_type_choices: + word_choices = ["test", "temp", "hello", "where"] + word = random.choice(word_choices) + if kind == "repeat": + prompt = f"""please repeat the word '{word}' 10 times.""" + elif kind == "sentence": + prompt = f"""please give a ten-word sentence that + uses the word {word} at least once.""" + else: + raise ValueError(f"Unknown prompt type: {kind}") + prompts.append(prompt) + + return prompts + + +@fork_new_process_for_each_test +@pytest.mark.parametrize("enforce_eager", [True, False]) +def test_kv_sharing_fast_prefill( + monkeypatch: pytest.MonkeyPatch, + enforce_eager: bool, + test_prompts: list[str], +): + ModelRegistry.register_model("Gemma3nForConditionalGeneration", + TestGemma3nForConditionalGeneration) + sampling_params = SamplingParams(temperature=0.0, max_tokens=100) + compilation_config = CompilationConfig( + # This allows vLLM compilation backend to handle allocating and + # managing buffers for cudagraph + cudagraph_copy_inputs=True, + level=CompilationLevel.PIECEWISE + if not enforce_eager else CompilationLevel.NO_COMPILATION) + + with monkeypatch.context() as m: + m.setenv("VLLM_USE_V1", "1") + + llm = LLM( + model="google/gemma-3n-E2B-it", + enforce_eager=enforce_eager, + compilation_config=compilation_config, + ) + ref_responses = llm.generate(test_prompts, sampling_params) + + del llm + gc.collect() + torch.cuda.empty_cache() + + llm = LLM(model="google/gemma-3n-E2B-it", + enforce_eager=enforce_eager, + compilation_config=compilation_config, + kv_sharing_fast_prefill=True) + optimized_responses = llm.generate(test_prompts, sampling_params) + + misses = 0 + + for ref_response, optimized_response in zip(ref_responses, + optimized_responses): + if ref_response.outputs[0].text != optimized_response.outputs[ + 0].text: + misses += 1 + + assert misses == 0 diff --git a/vllm/config.py b/vllm/config.py index 9576cf2d322..7c8ed575fb2 100644 --- 
a/vllm/config.py +++ b/vllm/config.py @@ -1795,6 +1795,16 @@ class CacheConfig: num_cpu_blocks: Optional[int] = field(default=None, init=False) """The number of blocks to allocate for CPU memory.""" + kv_sharing_fast_prefill: bool = False + """This feature is work in progress and no prefill optimization takes place + with this flag enabled currently. + + In some KV sharing setups, e.g. YOCO (https://arxiv.org/abs/2405.05254), + some layers can skip tokens corresponding to prefill. This flag enables + attention metadata for eligible layers to be overriden with metadata + necessary for implementating this optimization in some models (e.g. Gemma3n) + """ + def compute_hash(self) -> str: """ WARNING: Whenever a new field is added to this config, @@ -1836,6 +1846,11 @@ def _verify_args(self) -> Self: "GPU memory utilization must be less than 1.0. Got " f"{self.gpu_memory_utilization}.") + if self.kv_sharing_fast_prefill: + logger.warning_once( + "--kv-sharing-fast-prefill is currently work in progress " + "and not functional yet (i.e. no prefill savings)") + return self def _verify_cache_dtype(self) -> None: diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index 6bdc3c361af..ababa49a53a 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -445,6 +445,9 @@ class EngineArgs: # DEPRECATED enable_prompt_adapter: bool = False + kv_sharing_fast_prefill: bool = \ + CacheConfig.kv_sharing_fast_prefill + def __post_init__(self): # support `EngineArgs(compilation_config={...})` # without having to manually construct a @@ -697,6 +700,8 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser: **cache_kwargs["cpu_offload_gb"]) cache_group.add_argument("--calculate-kv-scales", **cache_kwargs["calculate_kv_scales"]) + cache_group.add_argument("--kv-sharing-fast-prefill", + **cache_kwargs["kv_sharing_fast_prefill"]) # Multimodal related configs multimodal_kwargs = get_kwargs(MultiModalConfig) @@ -1069,6 +1074,7 @@ def create_engine_config( prefix_caching_hash_algo=self.prefix_caching_hash_algo, cpu_offload_gb=self.cpu_offload_gb, calculate_kv_scales=self.calculate_kv_scales, + kv_sharing_fast_prefill=self.kv_sharing_fast_prefill, ) # Get the current placement group if Ray is initialized and diff --git a/vllm/model_executor/models/gemma3n.py b/vllm/model_executor/models/gemma3n.py index d0880103d4e..a58b32793db 100644 --- a/vllm/model_executor/models/gemma3n.py +++ b/vllm/model_executor/models/gemma3n.py @@ -793,6 +793,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): del lora_config # Unused. 
super().__init__() self.config = config + self.cache_config = vllm_config.cache_config self.model = Gemma3nModel(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model")) self.logits_processor = LogitsProcessor( diff --git a/vllm/v1/attention/backends/utils.py b/vllm/v1/attention/backends/utils.py index d1599ba10b6..36bacf0cb36 100644 --- a/vllm/v1/attention/backends/utils.py +++ b/vllm/v1/attention/backends/utils.py @@ -3,8 +3,8 @@ import abc import functools from abc import abstractmethod -from dataclasses import dataclass -from typing import TYPE_CHECKING, ClassVar, Generic, Optional, TypeVar +from dataclasses import dataclass, make_dataclass +from typing import TYPE_CHECKING, Any, ClassVar, Generic, Optional, TypeVar import numpy as np import torch @@ -508,3 +508,34 @@ def reorder_batch_to_split_decodes_and_prefills( modified_batch = True return modified_batch + + +KV_SHARING_FAST_PREFILL_METADATA_FIELDS = [ + ('logits_indices_padded', Optional[torch.Tensor], None), + ('num_logits_indices', int, 0), +] + + +def subclass_attention_metadata( + name_prefix: str, + metadata_cls: Any, + fields: list[tuple[str, Any, Any]], +) -> Any: + """ + Return a new subclass of `metadata_cls` with additional fields + """ + name: str = name_prefix + metadata_cls.__name__ # type: ignore + Wrapped = make_dataclass(name, fields, bases=(metadata_cls, )) + return Wrapped + + +def make_kv_sharing_fast_prefill_attention_metadata( + metadata_cls: Any, ) -> Any: + """ + Return a new subclass of `metadata_cls` for fast prefill + """ + return subclass_attention_metadata( + name_prefix="KVSharingFastPrefill", + metadata_cls=metadata_cls, + fields=KV_SHARING_FAST_PREFILL_METADATA_FIELDS, + ) diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 3befb6adf27..987ef22a1b7 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -1,6 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import dataclasses import gc import time from contextlib import contextmanager @@ -47,6 +48,7 @@ from vllm.v1.attention.backends.mamba_selectors import get_mamba_attn_backend from vllm.v1.attention.backends.utils import ( AttentionMetadataBuilder, CommonAttentionMetadata, + make_kv_sharing_fast_prefill_attention_metadata, make_local_attention_virtual_batches) from vllm.v1.core.encoder_cache_manager import compute_encoder_budget from vllm.v1.kv_cache_interface import (AttentionSpec, @@ -320,6 +322,12 @@ def __init__( # means this layer will perform attention using the keys and values # from the KV cache of `shared_kv_cache_layers[layer_name]`. self.shared_kv_cache_layers: dict[str, str] = {} + self.kv_sharing_fast_prefill_eligible_layers: set[str] = set() + + self.kv_sharing_fast_prefill_logits_indices = None + if self.cache_config.kv_sharing_fast_prefill: + self.kv_sharing_fast_prefill_logits_indices = torch.zeros( + self.max_num_tokens, dtype=torch.int32, device=self.device) def _may_reorder_batch(self, scheduler_output: "SchedulerOutput") -> None: """ @@ -735,6 +743,55 @@ def _prepare_inputs( spec_decode_common_attn_metadata = None + use_spec_decode = len( + scheduler_output.scheduled_spec_decode_tokens) > 0 + if not use_spec_decode: + # NOTE(woosuk): Due to chunked prefills, the batch may contain + # partial requests. While we should not sample any token + # from these partial requests, we do so for simplicity. + # We will ignore the sampled tokens from the partial requests. 
+ # TODO: Support prompt logprobs. + logits_indices = query_start_loc[1:] - 1 + spec_decode_metadata = None + else: + # Get the number of draft tokens for each request. + # Iterate over the dictionary rather than all requests since not all + # requests have draft tokens. + num_draft_tokens = np.zeros(num_reqs, dtype=np.int32) + for req_id, draft_token_ids in ( + scheduler_output.scheduled_spec_decode_tokens.items()): + req_idx = self.input_batch.req_id_to_index[req_id] + num_draft_tokens[req_idx] = len(draft_token_ids) + + spec_decode_metadata = self._calc_spec_decode_metadata( + num_draft_tokens, cu_num_tokens) + logits_indices = spec_decode_metadata.logits_indices + + logits_indices_padded = None + if self.cache_config.kv_sharing_fast_prefill: + assert self.kv_sharing_fast_prefill_logits_indices is not None + num_logits = logits_indices.shape[0] + assert num_logits > 0 + self.kv_sharing_fast_prefill_logits_indices[:num_logits].copy_( + logits_indices) + # There might have leftover indices in logits_indices[num_logits:] + # from previous iterations, whose values may be greater than the + # batch size in the current iteration. To ensure indices are always + # valid, we fill the padded indices with the last index. + self.kv_sharing_fast_prefill_logits_indices[num_logits:].fill_( + logits_indices[-1].item()) + if (self.use_cuda_graph + and num_logits <= self.cudagraph_batch_sizes[-1]): + # Use piecewise CUDA graphs. + # Add padding to the batch size. + num_logits_padded = self.vllm_config.pad_for_cudagraph( + num_logits) + else: + num_logits_padded = num_logits + logits_indices_padded = ( + self.kv_sharing_fast_prefill_logits_indices[:num_logits_padded] + ) + attn_metadata: dict[str, Any] = {} # Prepare encoder attention metadata separately @@ -806,7 +863,28 @@ def _prepare_inputs( common_attn_metadata=common_attn_metadata, )) + fast_prefill_metadata = attn_metadata_i + if (self.cache_config.kv_sharing_fast_prefill + and self.kv_sharing_fast_prefill_eligible_layers): + # Dynamically create a a dataclass type that inherits + # from attention metadata type but includes additional + # fields logits_indices_padded and num_logits_indices + # which are required for prefill truncation + fast_prefill_metadata_type = ( + make_kv_sharing_fast_prefill_attention_metadata( + metadata_cls=type(attn_metadata_i), )) + fast_prefill_metadata = fast_prefill_metadata_type( + **dataclasses.asdict(attn_metadata_i), + logits_indices_padded=logits_indices_padded, + num_logits_indices=logits_indices.size(0), + ) + for layer_name in kv_cache_group_spec.layer_names: + if (self.cache_config.kv_sharing_fast_prefill and layer_name + in self.kv_sharing_fast_prefill_eligible_layers): + attn_metadata[layer_name] = fast_prefill_metadata + continue + attn_metadata[layer_name] = attn_metadata_i # Hack for now to fix chunked local attention + no hybrid kv cache @@ -838,30 +916,6 @@ def _prepare_inputs( b.can_run_in_cudagraph(common_attn_metadata) for b in self.attn_metadata_builders) - use_spec_decode = len( - scheduler_output.scheduled_spec_decode_tokens) > 0 - if not use_spec_decode: - # NOTE(woosuk): Due to chunked prefills, the batch may contain - # partial requests. While we should not sample any token - # from these partial requests, we do so for simplicity. - # We will ignore the sampled tokens from the partial requests. - # TODO: Support prompt logprobs. - logits_indices = query_start_loc[1:] - 1 - spec_decode_metadata = None - else: - # Get the number of draft tokens for each request. 
- # Iterate over the dictionary rather than all requests since not all - # requests have draft tokens. - num_draft_tokens = np.zeros(num_reqs, dtype=np.int32) - for req_id, draft_token_ids in ( - scheduler_output.scheduled_spec_decode_tokens.items()): - req_idx = self.input_batch.req_id_to_index[req_id] - num_draft_tokens[req_idx] = len(draft_token_ids) - - spec_decode_metadata = self._calc_spec_decode_metadata( - num_draft_tokens, cu_num_tokens) - logits_indices = spec_decode_metadata.logits_indices - # Hot-Swap lora model if self.lora_config: self.set_active_loras(self.input_batch, num_scheduled_tokens) @@ -1433,6 +1487,7 @@ def execute_model( spec_decode_metadata, num_scheduled_tokens_np, spec_decode_common_attn_metadata) = ( self._prepare_inputs(scheduler_output)) + num_scheduled_tokens = scheduler_output.total_num_scheduled_tokens if (self.use_cuda_graph and num_scheduled_tokens <= self.cudagraph_batch_sizes[-1]): @@ -2814,6 +2869,16 @@ def initialize_kv_cache_tensors( kv_cache_config.kv_cache_groups, kv_caches, ) + attn_layers = get_layers_from_vllm_config(self.vllm_config, + Attention) + # Iterate in reversed order and add layers that re-use KV cache + # e.g. in YOCO-like KV sharing setups (e.g. Gemma3n) + for layer_name in reversed(attn_layers): + if layer_name in self.shared_kv_cache_layers: + self.kv_sharing_fast_prefill_eligible_layers.add( + layer_name) + else: + break bind_kv_cache(kv_caches, self.compilation_config.static_forward_context, From 731cbb8c831fe1dc57b88f16023b8cd882aa5c12 Mon Sep 17 00:00:00 2001 From: 633WHU Date: Wed, 30 Jul 2025 23:54:44 +0800 Subject: [PATCH 527/552] [Bugfix] Fix TypeError in scheduler when comparing mixed request_id types (#21816) Signed-off-by: chiliu Co-authored-by: chiliu Signed-off-by: x22x22 --- tests/v1/engine/test_engine_core.py | 72 +++++++++++++++++++++++------ vllm/v1/engine/core.py | 5 ++ 2 files changed, 64 insertions(+), 13 deletions(-) diff --git a/tests/v1/engine/test_engine_core.py b/tests/v1/engine/test_engine_core.py index bbdc73e9608..eb826bf0623 100644 --- a/tests/v1/engine/test_engine_core.py +++ b/tests/v1/engine/test_engine_core.py @@ -236,7 +236,7 @@ def test_engine_core_concurrent_batches(monkeypatch: pytest.MonkeyPatch): Test that the engine can handle multiple concurrent batches. """ - def make_request_with_max_tokens(req_id: int, + def make_request_with_max_tokens(req_id: str, max_tokens: int) -> EngineCoreRequest: request = make_request() request.request_id = req_id @@ -297,16 +297,16 @@ def shutdown(self): assert engine_core.batch_queue is not None # Add two requests in a row. Each request have 12 prompt tokens. - req0 = make_request_with_max_tokens(0, 5) + req0 = make_request_with_max_tokens("0", 5) engine_core.add_request(req0) - req1 = make_request_with_max_tokens(1, 5) + req1 = make_request_with_max_tokens("1", 5) engine_core.add_request(req1) # Schedule Batch 1: (10, req0) assert engine_core.step_with_batch_queue()[0] is None assert engine_core.batch_queue.qsize() == 1 scheduler_output = engine_core.batch_queue.queue[-1][1] - assert scheduler_output.num_scheduled_tokens[0] == 10 + assert scheduler_output.num_scheduled_tokens["0"] == 10 # num_computed_tokens should have been updated immediately. 
assert engine_core.scheduler.requests[ req0.request_id].num_computed_tokens == 10 @@ -315,11 +315,11 @@ def shutdown(self): assert engine_core.step_with_batch_queue()[0] is None assert engine_core.batch_queue.qsize() == 2 scheduler_output = engine_core.batch_queue.queue[-1][1] - assert scheduler_output.num_scheduled_tokens[0] == 2 - assert scheduler_output.num_scheduled_tokens[1] == 8 + assert scheduler_output.num_scheduled_tokens["0"] == 2 + assert scheduler_output.num_scheduled_tokens["1"] == 8 # num_computed_tokens should have been updated immediately. - assert engine_core.scheduler.requests[0].num_computed_tokens == 12 - assert engine_core.scheduler.requests[1].num_computed_tokens == 8 + assert engine_core.scheduler.requests["0"].num_computed_tokens == 12 + assert engine_core.scheduler.requests["1"].num_computed_tokens == 8 assert engine_core.scheduler.get_num_unfinished_requests() == 2 @@ -331,7 +331,7 @@ def shutdown(self): engine_core.step_with_batch_queue() assert engine_core.batch_queue.qsize() == 2 scheduler_output = engine_core.batch_queue.queue[-1][1] - assert scheduler_output.num_scheduled_tokens[1] == 4 + assert scheduler_output.num_scheduled_tokens["1"] == 4 # Batch queue is full. Finish Batch 2. Get first token of req0. output = engine_core.step_with_batch_queue()[0].get(0) @@ -343,7 +343,7 @@ def shutdown(self): engine_core.step_with_batch_queue() assert engine_core.batch_queue.qsize() == 2 scheduler_output = engine_core.batch_queue.queue[-1][1] - assert scheduler_output.num_scheduled_tokens[0] == 1 + assert scheduler_output.num_scheduled_tokens["0"] == 1 # Batch queue is full. Finish Batch 3. Get first token of req1. output = engine_core.step_with_batch_queue()[0].get(0) @@ -355,14 +355,14 @@ def shutdown(self): engine_core.step_with_batch_queue() assert engine_core.batch_queue.qsize() == 2 scheduler_output = engine_core.batch_queue.queue[-1][1] - assert scheduler_output.num_scheduled_tokens[1] == 1 + assert scheduler_output.num_scheduled_tokens["1"] == 1 # Loop until req0 is finished. 
step = 0 req_id = 0 expected_num_tokens = [ - engine_core.scheduler.requests[0].num_tokens + 1, - engine_core.scheduler.requests[1].num_tokens + 1, + engine_core.scheduler.requests["0"].num_tokens + 1, + engine_core.scheduler.requests["1"].num_tokens + 1, ] while engine_core.scheduler.get_num_unfinished_requests() == 2: output = engine_core.step_with_batch_queue()[0] @@ -413,3 +413,49 @@ def get_worker_cache_config_field(worker, key: str): get_worker_cache_config_field, args=("num_cpu_blocks", )) assert all(x is not None for x in num_gpu_blocks) assert all(x is not None for x in num_cpu_blocks) + + +@create_new_process_for_each_test() +def test_engine_core_invalid_request_id_type(monkeypatch: pytest.MonkeyPatch): + """Test that engine raises TypeError for non-string request_id.""" + with monkeypatch.context() as m: + m.setenv("VLLM_USE_V1", "1") + + engine_args = EngineArgs(model=MODEL_NAME) + vllm_config = engine_args.create_engine_config() + executor_class = Executor.get_class(vllm_config) + + with set_default_torch_num_threads(1): + engine_core = EngineCore(vllm_config=vllm_config, + executor_class=executor_class, + log_stats=True) + + # Test with UUID object (common mistake) + uuid_request = make_request() + uuid_request.request_id = uuid.uuid4() # UUID object instead of string + + with pytest.raises(TypeError, + match="request_id must be a string, got.*UUID"): + engine_core.add_request(uuid_request) + + # Test with integer + int_request = make_request() + int_request.request_id = 12345 + + with pytest.raises(TypeError, + match="request_id must be a string, got.*int"): + engine_core.add_request(int_request) + + # Test with None + none_request = make_request() + none_request.request_id = None + + with pytest.raises(TypeError, + match="request_id must be a string, got.*NoneType"): + engine_core.add_request(none_request) + + # Verify engine is still functional after errors + valid_request = make_request() + engine_core.add_request(valid_request) + assert len(engine_core.scheduler.waiting) == 1 + assert len(engine_core.scheduler.running) == 0 diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index cad93061e65..39fda521f36 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -207,6 +207,11 @@ def get_supported_tasks(self) -> tuple[SupportedTask, ...]: def add_request(self, request: EngineCoreRequest): """Add request to the scheduler.""" + # Validate the request_id type. 
+ if not isinstance(request.request_id, str): + raise TypeError( + f"request_id must be a string, got {type(request.request_id)}") + if pooling_params := request.pooling_params: supported_pooling_tasks = [ task for task in self.get_supported_tasks() From 13264ca578b8ab2a01066fa50f4cf9f9cddc3f0d Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Thu, 31 Jul 2025 00:10:41 +0800 Subject: [PATCH 528/552] [CI/Build] Fix registry tests (#21934) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- tests/models/registry.py | 16 +++++++---- vllm/model_executor/models/mpt.py | 20 ++++++------- vllm/model_executor/models/telechat2.py | 15 ++++++++-- vllm/transformers_utils/config.py | 5 ++-- vllm/transformers_utils/configs/__init__.py | 2 ++ vllm/transformers_utils/configs/nvlm_d.py | 31 +++++++++++++++++++++ 6 files changed, 70 insertions(+), 19 deletions(-) create mode 100644 vllm/transformers_utils/configs/nvlm_d.py diff --git a/tests/models/registry.py b/tests/models/registry.py index caa691039fc..8fcff5a8c51 100644 --- a/tests/models/registry.py +++ b/tests/models/registry.py @@ -170,8 +170,10 @@ def check_available_online( min_transformers_version="4.54"), "Ernie4_5_MoeForCausalLM": _HfExamplesInfo("baidu/ERNIE-4.5-21B-A3B-PT", min_transformers_version="4.54"), - "ExaoneForCausalLM": _HfExamplesInfo("LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct"), # noqa: E501 - "Exaone4ForCausalLM": _HfExamplesInfo("LGAI-EXAONE/EXAONE-4.0-32B"), # noqa: E501 + "ExaoneForCausalLM": _HfExamplesInfo("LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct", + trust_remote_code=True), + "Exaone4ForCausalLM": _HfExamplesInfo("LGAI-EXAONE/EXAONE-4.0-32B", + min_transformers_version="4.54"), "Fairseq2LlamaForCausalLM": _HfExamplesInfo("mgleize/fairseq2-dummy-Llama-3.2-1B"), # noqa: E501 "FalconForCausalLM": _HfExamplesInfo("tiiuae/falcon-7b"), "FalconH1ForCausalLM":_HfExamplesInfo("tiiuae/Falcon-H1-0.5B-Base", @@ -199,8 +201,10 @@ def check_available_online( trust_remote_code=True), "HunYuanMoEV1ForCausalLM": _HfExamplesInfo("tencent/Hunyuan-A13B-Instruct", trust_remote_code=True), + # TODO: Remove is_available_online once their config.json is fixed "HunYuanDenseV1ForCausalLM":_HfExamplesInfo("tencent/Hunyuan-7B-Instruct-0124", - trust_remote_code=True), + trust_remote_code=True, + is_available_online=False), "HCXVisionForCausalLM": _HfExamplesInfo( "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B", trust_remote_code=True), @@ -275,7 +279,8 @@ def check_available_online( "StableLMEpochForCausalLM": _HfExamplesInfo("stabilityai/stablelm-zephyr-3b"), # noqa: E501 "StableLmForCausalLM": _HfExamplesInfo("stabilityai/stablelm-3b-4e1t"), "Starcoder2ForCausalLM": _HfExamplesInfo("bigcode/starcoder2-3b"), - "SolarForCausalLM": _HfExamplesInfo("upstage/solar-pro-preview-instruct"), + "SolarForCausalLM": _HfExamplesInfo("upstage/solar-pro-preview-instruct", + trust_remote_code=True), "TeleChat2ForCausalLM": _HfExamplesInfo("Tele-AI/TeleChat2-3B", trust_remote_code=True), "TeleFLMForCausalLM": _HfExamplesInfo("CofeAI/FLM-2-52B-Instruct-2407", @@ -449,7 +454,8 @@ def check_available_online( max_model_len=4096), "Qwen2_5OmniModel": _HfExamplesInfo("Qwen/Qwen2.5-Omni-3B"), "Qwen2_5OmniForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2.5-Omni-7B-AWQ"), # noqa: E501 - "SkyworkR1VChatModel": _HfExamplesInfo("Skywork/Skywork-R1V-38B"), + "SkyworkR1VChatModel": _HfExamplesInfo("Skywork/Skywork-R1V-38B", + trust_remote_code=True), "SmolVLMForConditionalGeneration": _HfExamplesInfo("HuggingFaceTB/SmolVLM2-2.2B-Instruct"), # noqa: E501 "UltravoxModel": 
_HfExamplesInfo("fixie-ai/ultravox-v0_5-llama-3_2-1b", # noqa: E501 trust_remote_code=True), diff --git a/vllm/model_executor/models/mpt.py b/vllm/model_executor/models/mpt.py index c243f575ae5..8db52a69924 100644 --- a/vllm/model_executor/models/mpt.py +++ b/vllm/model_executor/models/mpt.py @@ -8,7 +8,7 @@ import torch import torch.nn as nn -from transformers import PretrainedConfig +from transformers import MptConfig from vllm.attention import Attention from vllm.compilation.decorators import support_torch_compile @@ -50,7 +50,7 @@ class MPTAttention(nn.Module): def __init__( self, - config: PretrainedConfig, + config: MptConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", @@ -59,15 +59,15 @@ def __init__( self.d_model = config.d_model self.total_num_heads = config.n_heads self.head_dim = self.d_model // self.total_num_heads - self.clip_qkv = config.attn_config["clip_qkv"] - self.qk_ln = config.attn_config["qk_ln"] - self.alibi_bias_max = config.attn_config["alibi_bias_max"] + self.clip_qkv = config.attn_config.clip_qkv + self.qk_ln = config.attn_config.qk_ln + self.alibi_bias_max = config.attn_config.alibi_bias_max if "kv_n_heads" in config.attn_config: - self.total_num_kv_heads = config.attn_config['kv_n_heads'] + self.total_num_kv_heads = config.attn_config.kv_n_heads else: self.total_num_kv_heads = self.total_num_heads - assert not config.attn_config["prefix_lm"] - assert config.attn_config["alibi"] + assert not config.attn_config.prefix_lm + assert config.attn_config.alibi # pylint: disable=invalid-name self.Wqkv = QKVParallelLinear( @@ -144,7 +144,7 @@ class MPTMLP(nn.Module): def __init__( self, - config: PretrainedConfig, + config: MptConfig, quant_config: Optional[QuantizationConfig] = None, ): super().__init__() @@ -176,7 +176,7 @@ class MPTBlock(nn.Module): def __init__( self, - config: PretrainedConfig, + config: MptConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", diff --git a/vllm/model_executor/models/telechat2.py b/vllm/model_executor/models/telechat2.py index f0b31b1332f..49a7677151a 100644 --- a/vllm/model_executor/models/telechat2.py +++ b/vllm/model_executor/models/telechat2.py @@ -37,9 +37,20 @@ class TeleChat2Model(LlamaModel): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + hf_config = vllm_config.model_config.hf_config + + vllm_config.model_config.hf_config.attribute_map = { + "num_hidden_layers": "n_layer", + "num_attention_heads": "n_head", + "intermediate_size": "ffn_hidden_size", + "rms_norm_eps": "layer_norm_epsilon" + } + vllm_config.model_config.hf_config.hidden_act = "silu" + # 1. Initialize the LlamaModel with bias - vllm_config.model_config.hf_config.bias = True - vllm_config.model_config.hf_config.mlp_bias = True + hf_config.bias = True + hf_config.mlp_bias = True + super().__init__(vllm_config=vllm_config, prefix=prefix) # 2. 
Remove the bias from the qkv_proj and gate_up_proj based on config # Telechat2's gate_up_proj and qkv_proj don't have bias diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py index 40a6a9118e5..4ce56cb3a6a 100644 --- a/vllm/transformers_utils/config.py +++ b/vllm/transformers_utils/config.py @@ -34,8 +34,8 @@ KimiVLConfig, MedusaConfig, MllamaConfig, MLPSpeculatorConfig, Nemotron_Nano_VL_Config, - NemotronConfig, RWConfig, - UltravoxConfig) + NemotronConfig, NVLM_D_Config, + RWConfig, UltravoxConfig) # yapf: enable from vllm.transformers_utils.configs.mistral import adapt_config_dict from vllm.transformers_utils.utils import check_gguf_file @@ -81,6 +81,7 @@ def _get_hf_token() -> Optional[str]: "medusa": MedusaConfig, "eagle": EAGLEConfig, "nemotron": NemotronConfig, + "NVLM_D": NVLM_D_Config, "ultravox": UltravoxConfig, **_CONFIG_REGISTRY_OVERRIDE_HF } diff --git a/vllm/transformers_utils/configs/__init__.py b/vllm/transformers_utils/configs/__init__.py index 0fcb2beb8c7..7c7d859e4a3 100644 --- a/vllm/transformers_utils/configs/__init__.py +++ b/vllm/transformers_utils/configs/__init__.py @@ -23,6 +23,7 @@ from vllm.transformers_utils.configs.nemotron import NemotronConfig from vllm.transformers_utils.configs.nemotron_h import NemotronHConfig from vllm.transformers_utils.configs.nemotron_vl import Nemotron_Nano_VL_Config +from vllm.transformers_utils.configs.nvlm_d import NVLM_D_Config from vllm.transformers_utils.configs.ultravox import UltravoxConfig __all__ = [ @@ -39,5 +40,6 @@ "NemotronConfig", "NemotronHConfig", "Nemotron_Nano_VL_Config", + "NVLM_D_Config", "UltravoxConfig", ] diff --git a/vllm/transformers_utils/configs/nvlm_d.py b/vllm/transformers_utils/configs/nvlm_d.py new file mode 100644 index 00000000000..edfc506882f --- /dev/null +++ b/vllm/transformers_utils/configs/nvlm_d.py @@ -0,0 +1,31 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +# Adapted from +# https://huggingface.co/nvidia/NVLM-D-72B/blob/main/configuration_nvlm_d.py +# -------------------------------------------------------- +# NVLM-D +# Copyright (c) 2024 NVIDIA +# Licensed under Apache 2.0 License [see LICENSE for details] +# -------------------------------------------------------- +from transformers import Qwen2Config +from transformers.configuration_utils import PretrainedConfig + + +class NVLM_D_Config(PretrainedConfig): + model_type = 'NVLM_D' + is_composition = True + + def __init__(self, vision_config=None, llm_config=None, **kwargs): + super().__init__(**kwargs) + + # Handle vision_config initialization + if vision_config is None: + vision_config = {} + + # Handle llm_config initialization + if llm_config is None: + llm_config = {} + + self.vision_config = PretrainedConfig(**vision_config) + self.text_config = Qwen2Config(**llm_config) From 04dc2e75b3c93d45efa281ad903f1ec9bb4c5a53 Mon Sep 17 00:00:00 2001 From: Chenguang Zheng <645327136@qq.com> Date: Thu, 31 Jul 2025 00:18:37 +0800 Subject: [PATCH 529/552] [Bugfix] SharedStorage Connector for V1 PD multimodal (#21611) Signed-off-by: fake0fan <645327136@qq.com> Signed-off-by: herotai214 Co-authored-by: herotai214 Signed-off-by: x22x22 --- .../unit/test_shared_storage_connector.py | 215 ++++++++++++++++++ .../v1/shared_storage_connector.py | 41 +++- 2 files changed, 244 insertions(+), 12 deletions(-) create mode 100644 tests/v1/kv_connector/unit/test_shared_storage_connector.py diff --git a/tests/v1/kv_connector/unit/test_shared_storage_connector.py 
b/tests/v1/kv_connector/unit/test_shared_storage_connector.py new file mode 100644 index 00000000000..ee3e71d3b84 --- /dev/null +++ b/tests/v1/kv_connector/unit/test_shared_storage_connector.py @@ -0,0 +1,215 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from dataclasses import asdict +from typing import NamedTuple + +from PIL import Image + +from vllm import LLM, EngineArgs, SamplingParams +from vllm.assets.image import ImageAsset +from vllm.config import KVTransferConfig +from vllm.multimodal.utils import encode_image_base64 + +MODEL_NAME = "Qwen/Qwen2.5-VL-3B-Instruct" + +SAMPLING_PARAMS = SamplingParams(temperature=0.0, top_k=1, max_tokens=128) + +TEXT_PROMPTS = [ + "What's in the image(s)? Around 30 words. What's special in 2nd image?", + "The future of AI is", +] + + +class InputCase(NamedTuple): + text: str + img: list[Image] + expected_len: int + info: str + + +def _check_path_len(path): + """Return the latest length in path""" + return len(list(path.iterdir())) + + +def _list_path(path): + """Return the list of foldername (hashes generatd) under the path""" + return list(path.iterdir()) + + +def run_test(tmp_path, processor, llm: LLM, question: str, + image_urls: list[Image], expected_len: int, info: str): + """ + One individual test to process the prompt and output base on 1 set of input + Then check if the length in the strorage path matches the expected length + `info` introduces details or purpose of the individual test + """ + print(f"***info: {info}***") + print( + f"**Expected storage path length after llm generate: {expected_len}**") + process_prompt(processor, llm, question, image_urls) + + print(f"Path matched expected length: {_check_path_len(tmp_path)}") + print(f"Hashes under the storage path: {_list_path(tmp_path)}") + + assert _check_path_len(tmp_path) == expected_len, ( + f"Expect storage path length {expected_len} ;", + f"but end up {_check_path_len(tmp_path)} instead. ", f"Info: {info}") + + +def process_prompt(processor, llm: LLM, question: str, + image_urls: list[Image]): + """ + Form the prompt based on the text and image input, then llm generate output + """ + placeholders = [{ + "type": "image_url", + "image_url": { + "url": f"data:image;base64,{encode_image_base64(image_pil)}" + } + } for image_pil in image_urls] + + messages = [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": [ + *placeholders, + { + "type": "text", + "text": question + }, + ], + }, + ] + + prompt = processor.apply_chat_template(messages, + tokenize=False, + add_generation_prompt=True) + + outputs = llm.generate( + { + "prompt": + prompt, + **({ + "multi_modal_data": { + "image": [*image_urls] + } + } if image_urls else {}) + }, + sampling_params=SAMPLING_PARAMS, + ) + + print("-" * 50) + print("Output:") + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + print("-" * 50) + + +def test_shared_storage_connector_hashes(tmp_path): + """ + Tests that SharedStorageConnector saves KV to the storage locations + with proper hashes; that are unique for inputs with identical text but + differnt images (same size), or same multiple images but different orders. 
+ """ + # Using tmp_path as the storage path to store KV + print(f"KV storage path at: {str(tmp_path)}") + + # Configure the SharedStorageConnector + kv_transfer_config = KVTransferConfig( + kv_connector="SharedStorageConnector", + kv_role="kv_both", + kv_connector_extra_config={"shared_storage_path": str(tmp_path)}) + + engine_args = EngineArgs( + model=MODEL_NAME, + max_model_len=8192, + max_num_seqs=1, + kv_transfer_config=kv_transfer_config, + limit_mm_per_prompt={"image": 2}, + ) + + # don't put this import at the top level + # it will call torch.cuda.device_count() + from transformers import AutoProcessor # noqa: F401 + + # Create processor to handle the chat prompt + processor = AutoProcessor.from_pretrained(MODEL_NAME) + + # Prepare images for the tests + # Resize to the same size to check hashes correctness + image_1 = ImageAsset("stop_sign").pil_image.resize((1280, 720)) + image_2 = ImageAsset("cherry_blossom").pil_image.resize((1280, 720)) + + # Make sure that they are not the same picture + assert image_1 != image_2, "The images should not be identical" + + # Create the LLM instance + engine_args = asdict(engine_args) + llm = LLM(**engine_args) + + # Prepare the input cases + input_cases = [ + InputCase(text=TEXT_PROMPTS[0], + img=[image_1], + expected_len=1, + info="image_1 single input the first time."), + InputCase(text=TEXT_PROMPTS[0], + img=[image_2], + expected_len=2, + info=("image_2 single input the first time. " + "It is in same pixel size with image_1, yet it " + "should be able to form a new unique hash.")), + InputCase(text=TEXT_PROMPTS[0], + img=[image_1], + expected_len=2, + info=("image_1 single input the 2nd time. " + "It should not form aother new hash.")), + InputCase(text=TEXT_PROMPTS[0], + img=[image_2], + expected_len=2, + info=("image_2 single input the 2nd time. " + "It should not form aother new hash.")), + InputCase(text=TEXT_PROMPTS[0], + img=[image_1, image_2], + expected_len=3, + info="image_1 with image_2 input the first time."), + InputCase(text=TEXT_PROMPTS[0], + img=[image_2, image_1], + expected_len=4, + info="The image order is swapped. Should form new hash."), + InputCase(text=TEXT_PROMPTS[0], + img=[image_1, image_2], + expected_len=4, + info=("[image_1, image_2] input the 2nd time. " + "It should not form aother new hash.")), + InputCase(text=TEXT_PROMPTS[0], + img=[image_2, image_1], + expected_len=4, + info=("[image_2, image_1] input the 2nd time. 
" + "It should not form aother new hash.")), + InputCase(text=TEXT_PROMPTS[0], + img=[], + expected_len=5, + info="Pure text input test as a case-control"), + InputCase(text=TEXT_PROMPTS[0], + img=[], + expected_len=5, + info="Identical pure text input as a case-control"), + InputCase(text=TEXT_PROMPTS[1], + img=[], + expected_len=6, + info="Another pure text input as a case-control"), + ] + + # Run tests + for case_id, (text, img, expected_len, info) in enumerate(input_cases): + print("\n", "=" * 25, f"Below running input case: {case_id}", "=" * 25) + run_test(tmp_path, processor, llm, text, img, expected_len, info) + + print("All tests passed successfully!") diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/shared_storage_connector.py b/vllm/distributed/kv_transfer/kv_connector/v1/shared_storage_connector.py index 048748e6b8e..fd79387269d 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/shared_storage_connector.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/shared_storage_connector.py @@ -32,10 +32,11 @@ class ReqMeta: slot_mapping: torch.Tensor # Is store or load is_store: bool + mm_hashes: list[str] @staticmethod def make_meta(token_ids: list[int], block_ids: list[int], block_size: int, - is_store: bool) -> "ReqMeta": + is_store: bool, mm_hashes: list[str]) -> "ReqMeta": valid_num_tokens = align_to_block_size(len(token_ids), block_size) token_ids_tensor = torch.tensor(token_ids)[:valid_num_tokens] block_ids_tensor = torch.tensor(block_ids) @@ -48,6 +49,7 @@ def make_meta(token_ids: list[int], block_ids: list[int], block_size: int, token_ids=token_ids_tensor, slot_mapping=slot_mapping, is_store=is_store, + mm_hashes=mm_hashes, ) @@ -64,9 +66,11 @@ def add_request( block_ids: list[int], block_size: int, is_store: bool, + mm_hashes: list[str], ) -> None: self.requests.append( - ReqMeta.make_meta(token_ids, block_ids, block_size, is_store)) + ReqMeta.make_meta(token_ids, block_ids, block_size, is_store, + mm_hashes)) class SharedStorageConnector(KVConnectorBase_V1): @@ -169,7 +173,7 @@ def inject_kv_into_layer( forward_context.virtual_engine] filename = self._generate_filename_debug( - layer_name, request.token_ids) + layer_name, request.token_ids, request.mm_hashes) kv_cache = safetensors.torch.load_file( filename)["kv_cache"].cuda() inject_kv_into_layer(kv_cache_layer, kv_cache, @@ -221,7 +225,7 @@ def extract_kv_from_layer( for request in connector_metadata.requests: if request.is_store: filename = self._generate_filename_debug( - layer_name, request.token_ids) + layer_name, request.token_ids, request.mm_hashes) kv_cache = extract_kv_from_layer(kv_layer, request.slot_mapping) tensors = {"kv_cache": kv_cache.detach().cpu()} @@ -299,7 +303,8 @@ def build_connector_meta( meta.add_request(token_ids=new_req.prompt_token_ids, block_ids=new_req.block_ids[0], block_size=self._block_size, - is_store=False) + is_store=False, + mm_hashes=new_req.mm_hashes) total_need_load += 1 else: # NOTE: here, we set the store and load being exclusive, @@ -310,7 +315,8 @@ def build_connector_meta( meta.add_request(token_ids=new_req.prompt_token_ids, block_ids=new_req.block_ids[0], block_size=self._block_size, - is_store=True) + is_store=True, + mm_hashes=new_req.mm_hashes) cached_reqs = scheduler_output.scheduled_cached_reqs for i, req_id in enumerate(cached_reqs.req_ids): @@ -338,7 +344,8 @@ def build_connector_meta( meta.add_request(token_ids=token_ids, block_ids=block_ids, block_size=self._block_size, - is_store=False) + is_store=False, + mm_hashes=request.mm_hashes) total_need_load 
+= 1 assert total_need_load == len(self._requests_need_load) @@ -359,20 +366,28 @@ def _found_match_for_request( len(request.prompt_token_ids) - 1, self._block_size) foldername = self._generate_foldername_debug(torch.tensor( request.prompt_token_ids)[:num_tokens_to_check], + request.mm_hashes, create_folder=False) return os.path.exists(foldername) def _generate_foldername_debug( self, - input_ids: torch.Tensor, + token_ids: torch.Tensor, + mm_hashes: list[str], create_folder=False, ) -> str: """Generate a folder name based on the hash of the bytes of the input ids. """ - input_ids_bytes = input_ids.numpy().tobytes() - input_ids_hash = hashlib.md5(input_ids_bytes, + token_bytes = token_ids.numpy().tobytes() + # Add mm_hashes to the bytes being hashed to avoid path traversal and + # to create a canonical key. + if mm_hashes: + mm_str = "-".join(mm_hashes) + token_bytes += mm_str.encode('utf-8') + input_ids_hash = hashlib.md5(token_bytes, usedforsecurity=False).hexdigest() + foldername = os.path.join(self._storage_path, input_ids_hash) if create_folder: os.makedirs(foldername, exist_ok=True) @@ -381,12 +396,14 @@ def _generate_foldername_debug( def _generate_filename_debug( self, layer_name: str, - input_ids: torch.Tensor, + token_ids: torch.Tensor, + mm_hashes: list[str], ) -> str: """Generate a file name based on the layer name and the hash of the bytes of the input ids. """ - foldername = self._generate_foldername_debug(input_ids, + foldername = self._generate_foldername_debug(token_ids, + mm_hashes=mm_hashes, create_folder=True) return os.path.join(foldername, f"{layer_name}.safetensors") From e45a9b2ca4c84c9b27368beeb9d6b1903fe751e7 Mon Sep 17 00:00:00 2001 From: wxsm Date: Thu, 31 Jul 2025 00:41:51 +0800 Subject: [PATCH 530/552] feat(distributed): add `get_required_kvcache_layout` class method to kv connector api (#20433) Signed-off-by: wxsm Signed-off-by: x22x22 --- tests/distributed/test_kvlayout.py | 72 +++++++++++++++++++ .../kv_transfer/kv_connector/base.py | 16 ++++- .../kv_transfer/kv_connector/factory.py | 37 +++++----- .../kv_transfer/kv_connector/utils.py | 19 ++--- .../kv_transfer/kv_connector/v1/base.py | 14 ++++ .../kv_connector/v1/multi_connector.py | 33 +++++++++ .../kv_connector/v1/nixl_connector.py | 23 +++++- 7 files changed, 186 insertions(+), 28 deletions(-) create mode 100644 tests/distributed/test_kvlayout.py diff --git a/tests/distributed/test_kvlayout.py b/tests/distributed/test_kvlayout.py new file mode 100644 index 00000000000..d447876f6cc --- /dev/null +++ b/tests/distributed/test_kvlayout.py @@ -0,0 +1,72 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +from vllm.config import (DeviceConfig, KVTransferConfig, ModelConfig, + VllmConfig, set_current_vllm_config) +from vllm.distributed.kv_transfer.kv_connector.utils import ( + get_kv_connector_cache_layout) +from vllm.logger import init_logger + +logger = init_logger("test_expert_parallel") + + +def test_get_kv_connector_cache_layout_without_kv_connector(): + vllm_config = VllmConfig(device_config=DeviceConfig("cpu")) + with set_current_vllm_config(vllm_config): + # Test with default settings + layout = get_kv_connector_cache_layout() + assert layout == "NHD" + + +def test_get_kv_connector_cache_layout_with_lmcache_connector(): + kv_transfer_config = KVTransferConfig( + kv_connector="LMCacheConnectorV1", + kv_role="kv_both", + ) + vllm_config = VllmConfig(device_config=DeviceConfig("cpu"), + kv_transfer_config=kv_transfer_config) + with 
set_current_vllm_config(vllm_config): + # Test with default settings + layout = get_kv_connector_cache_layout() + assert layout == "NHD" + + +def test_get_kv_connector_cache_layout_with_nixl_connector(): + kv_transfer_config = KVTransferConfig( + kv_connector="NixlConnector", + kv_role="kv_both", + ) + model_config = ModelConfig() + vllm_config = VllmConfig(device_config=DeviceConfig("cpu"), + model_config=model_config, + kv_transfer_config=kv_transfer_config) + with set_current_vllm_config(vllm_config): + # Test with default settings + layout = get_kv_connector_cache_layout() + assert layout == "HND" + + +def test_get_kv_connector_cache_layout_with_multi_connector(): + kv_transfer_config = KVTransferConfig(kv_connector="MultiConnector", + kv_role="kv_both", + kv_connector_extra_config={ + "connectors": [{ + "kv_connector": + "SharedStorageConnector", + "kv_role": + "kv_both" + }, { + "kv_connector": + "NixlConnector", + "kv_role": + "kv_both" + }] + }) + model_config = ModelConfig() + vllm_config = VllmConfig(device_config=DeviceConfig("cpu"), + model_config=model_config, + kv_transfer_config=kv_transfer_config) + with set_current_vllm_config(vllm_config): + # Test with default settings + layout = get_kv_connector_cache_layout() + assert layout == "HND" diff --git a/vllm/distributed/kv_transfer/kv_connector/base.py b/vllm/distributed/kv_transfer/kv_connector/base.py index 181c33925da..868b227fc89 100644 --- a/vllm/distributed/kv_transfer/kv_connector/base.py +++ b/vllm/distributed/kv_transfer/kv_connector/base.py @@ -9,7 +9,7 @@ """ from abc import ABC, abstractmethod -from typing import TYPE_CHECKING, Union +from typing import TYPE_CHECKING, Optional, Union import torch @@ -124,5 +124,19 @@ def recv_kv_caches_and_hidden_states( raise NotImplementedError + @classmethod + def get_required_kvcache_layout( + cls, vllm_config: "VllmConfig") -> Optional[str]: + """ + Get the required KV cache layout for this connector. + Args: + vllm_config (VllmConfig): the vllm config. + + Returns: + str: the required KV cache layout. e.g. HND, or NHD. + None if the connector does not require a specific layout. 
+ """ + return None + KVConnectorBaseType = Union[KVConnectorBase, KVConnectorBase_V1] diff --git a/vllm/distributed/kv_transfer/kv_connector/factory.py b/vllm/distributed/kv_transfer/kv_connector/factory.py index be9ce72dea6..cf7cde2c437 100644 --- a/vllm/distributed/kv_transfer/kv_connector/factory.py +++ b/vllm/distributed/kv_transfer/kv_connector/factory.py @@ -5,6 +5,7 @@ from typing import TYPE_CHECKING, Callable import vllm.envs as envs +from vllm.config import KVTransferConfig from vllm.distributed.kv_transfer.kv_connector.base import KVConnectorBaseType from vllm.distributed.kv_transfer.kv_connector.v1 import (KVConnectorBase_V1, KVConnectorRole) @@ -41,25 +42,15 @@ def create_connector_v0(cls, rank: int, local_rank: int, raise ValueError("Attempting to initialize a V0 Connector, " f"but found {envs.VLLM_USE_V1=}") - connector_name = config.kv_transfer_config.kv_connector - if connector_name not in cls._registry: - raise ValueError(f"Unsupported connector type: {connector_name}") - - connector_cls = cls._registry[connector_name]() + connector_cls = cls.get_connector_class(config.kv_transfer_config) assert issubclass(connector_cls, KVConnectorBase) return connector_cls(rank, local_rank, config) @classmethod - def create_connector_v1( - cls, - config: "VllmConfig", - role: KVConnectorRole, - ) -> KVConnectorBase_V1: - if not envs.VLLM_USE_V1: - raise ValueError("Attempting to initialize a V1 Connector, " - f"but found {envs.VLLM_USE_V1=}") - - kv_transfer_config = config.kv_transfer_config + def get_connector_class( + cls, kv_transfer_config: "KVTransferConfig" + ) -> type[KVConnectorBaseType]: + """Get the connector class by name.""" connector_name = kv_transfer_config.kv_connector if connector_name in cls._registry: connector_cls = cls._registry[connector_name]() @@ -70,9 +61,23 @@ def create_connector_v1( f"Unsupported connector type: {connector_name}") connector_module = importlib.import_module(connector_module_path) connector_cls = getattr(connector_module, connector_name) + return connector_cls + + @classmethod + def create_connector_v1( + cls, + config: "VllmConfig", + role: KVConnectorRole, + ) -> KVConnectorBase_V1: + if not envs.VLLM_USE_V1: + raise ValueError("Attempting to initialize a V1 Connector, " + f"but found {envs.VLLM_USE_V1=}") + + kv_transfer_config = config.kv_transfer_config + connector_cls = cls.get_connector_class(kv_transfer_config) assert issubclass(connector_cls, KVConnectorBase_V1) logger.info("Creating v1 connector with name: %s and engine_id: %s", - connector_name, kv_transfer_config.engine_id) + connector_cls.__name__, kv_transfer_config.engine_id) # NOTE(Kuntai): v1 connector is explicitly separated into two roles. # Scheduler connector: # - Co-locate with scheduler process diff --git a/vllm/distributed/kv_transfer/kv_connector/utils.py b/vllm/distributed/kv_transfer/kv_connector/utils.py index 459a5329891..559c233947c 100644 --- a/vllm/distributed/kv_transfer/kv_connector/utils.py +++ b/vllm/distributed/kv_transfer/kv_connector/utils.py @@ -13,6 +13,8 @@ import vllm.envs as envs from vllm import _custom_ops as ops from vllm.config import VllmConfig, get_current_vllm_config +from vllm.distributed.kv_transfer.kv_connector.factory import ( + KVConnectorFactory) from vllm.logger import init_logger from vllm.v1.outputs import ModelRunnerOutput @@ -103,15 +105,14 @@ def get_kv_connector_cache_layout(): # used for faster transfer. 
vllm_config = get_current_vllm_config() kv_config = vllm_config.kv_transfer_config - if kv_config is not None and vllm_config.model_config is None: - logger.warning_once("Unable to detect current VLLM config. " \ - "Defaulting to NHD kv cache layout.") - elif kv_config is not None: - use_mla = vllm_config.model_config.use_mla - if not use_mla and kv_config.kv_connector == "NixlConnector": - logger.info_once("NixlConnector detected. Setting KV cache " \ - "layout to HND for better xfer performance.") - return "HND" + if kv_config is not None: + connector_cls = KVConnectorFactory.get_connector_class(kv_config) + required_kvcache_layout = connector_cls.get_required_kvcache_layout( + vllm_config) + if required_kvcache_layout is not None: + return required_kvcache_layout + logger.info_once("Connectors do not specify a " \ + "kv cache layout, defaulting to NHD.") return "NHD" diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/base.py b/vllm/distributed/kv_transfer/kv_connector/v1/base.py index 8bbdd7e0621..7a2ccb58656 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/base.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/base.py @@ -299,3 +299,17 @@ def request_finished( returned by the engine. """ return False, None + + @classmethod + def get_required_kvcache_layout( + cls, vllm_config: "VllmConfig") -> Optional[str]: + """ + Get the required KV cache layout for this connector. + Args: + vllm_config (VllmConfig): the vllm config. + + Returns: + str: the required KV cache layout. e.g. HND, or NHD. + None if the connector does not require a specific layout. + """ + return None diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py b/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py index a2eaa004019..934a03a12ee 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py @@ -202,3 +202,36 @@ def request_finished( self._requests_to_connector.pop(request.request_id, None) return async_saves > 0, kv_txfer_params + + @classmethod + def get_required_kvcache_layout( + cls, vllm_config: "VllmConfig") -> Optional[str]: + """ + Get the required KV cache layout for this connector. + Args: + vllm_config (VllmConfig): the vllm config. + + Returns: + str: the required KV cache layout. e.g. HND, or NHD. + None if the connector does not require a specific layout. + """ + ktcs = vllm_config.kv_transfer_config.kv_connector_extra_config.get( + "connectors") + assert ktcs is not None + layouts: set[str] = set() + temp_vllm_config = copy.copy(vllm_config) + for ktc in ktcs: + kv_transfer_config = KVTransferConfig(**ktc) + temp_vllm_config.kv_transfer_config = kv_transfer_config + required_kvcache_layout = KVConnectorFactory.get_connector_class( + kv_transfer_config).get_required_kvcache_layout( + temp_vllm_config) + if required_kvcache_layout is not None: + layouts.add(required_kvcache_layout) + + if len(layouts) > 1: + raise ValueError(f"KV cache layout mismatch: " + f"found {len(layouts)} different layouts " + f"({', '.join(layouts) })." 
+ f"All connectors must use the same layout.") + return next(iter(layouts), None) diff --git a/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py b/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py index 6d86ab7f7a4..e7fc2b11814 100644 --- a/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py +++ b/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py @@ -133,6 +133,25 @@ def __init__(self, vllm_config: VllmConfig, role: KVConnectorRole): self.connector_worker = NixlConnectorWorker( vllm_config, self.engine_id) + ############################################################ + # Class Methods + ############################################################ + @classmethod + def get_required_kvcache_layout(cls, vllm_config: VllmConfig): + if vllm_config.model_config is None: + logger.warning_once("Unable to detect current VLLM config. " + "Fallback to default kv cache layout.") + return None + use_mla = vllm_config.model_config.use_mla + if use_mla: + # return None when we have mla + # as the layout should not matter in that case, + # which fallback to the default behavior. + return None + logger.info_once("NixlConnector setting KV cache " + "layout to HND for better xfer performance.") + return "HND" + ############################################################ # Scheduler Side Methods ############################################################ @@ -236,13 +255,13 @@ def get_num_new_matched_tokens( """ For remote prefill, pull all prompt blocks from remote asynchronously relative to engine execution. - + Args: request (Request): the request object. num_computed_tokens (int): the number of locally computed tokens for this request Returns: - * the number of tokens that can be loaded from the + * the number of tokens that can be loaded from the external KV cache beyond what is already computed. * true if the external KV cache tokens will be loaded asynchronously (between scheduler steps). From 931d776db60fb47c65429d6bfcdb974b79a637ed Mon Sep 17 00:00:00 2001 From: wenxindongwork <161090399+wenxindongwork@users.noreply.github.com> Date: Wed, 30 Jul 2025 10:02:12 -0700 Subject: [PATCH 531/552] [TPU] Support Pathways in vLLM (#21417) Signed-off-by: wenxindongwork Signed-off-by: x22x22 --- vllm/envs.py | 5 +++++ vllm/platforms/__init__.py | 18 ++++++++++++------ 2 files changed, 17 insertions(+), 6 deletions(-) diff --git a/vllm/envs.py b/vllm/envs.py index ec4b0888d0f..19bc9156b25 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -124,6 +124,7 @@ VLLM_V1_USE_OUTLINES_CACHE: bool = False VLLM_TPU_BUCKET_PADDING_GAP: int = 0 VLLM_TPU_MOST_MODEL_LEN: Optional[int] = None + VLLM_TPU_USING_PATHWAYS: bool = False VLLM_USE_DEEP_GEMM: bool = False VLLM_USE_FLASHINFER_MOE_FP8: bool = False VLLM_USE_FLASHINFER_MOE_FP4: bool = False @@ -900,6 +901,10 @@ def get_vllm_port() -> Optional[int]: "VLLM_TPU_MOST_MODEL_LEN": lambda: maybe_convert_int(os.environ.get("VLLM_TPU_MOST_MODEL_LEN", None)), + # Whether using Pathways + "VLLM_TPU_USING_PATHWAYS": + lambda: bool("proxy" in os.getenv("JAX_PLATFORMS", "").lower()), + # Allow use of DeepGemm kernels for fused moe ops. 
"VLLM_USE_DEEP_GEMM": lambda: bool(int(os.getenv("VLLM_USE_DEEP_GEMM", "0"))), diff --git a/vllm/platforms/__init__.py b/vllm/platforms/__init__.py index c13659f8a06..56edb8629e4 100644 --- a/vllm/platforms/__init__.py +++ b/vllm/platforms/__init__.py @@ -1,11 +1,11 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project - import logging import traceback from itertools import chain from typing import TYPE_CHECKING, Optional +from vllm import envs from vllm.plugins import load_plugins_by_group from vllm.utils import resolve_obj_by_qualname, supports_xccl @@ -31,20 +31,26 @@ def vllm_version_matches_substr(substr: str) -> bool: def tpu_platform_plugin() -> Optional[str]: - is_tpu = False logger.debug("Checking if TPU platform is available.") + + # Check for Pathways TPU proxy + if envs.VLLM_TPU_USING_PATHWAYS: + logger.debug("Confirmed TPU platform is available via Pathways proxy.") + return "tpu_commons.platforms.tpu_jax.TpuPlatform" + + # Check for libtpu installation try: # While it's technically possible to install libtpu on a # non-TPU machine, this is a very uncommon scenario. Therefore, - # we assume that libtpu is installed if and only if the machine + # we assume that libtpu is installed only if the machine # has TPUs. + import libtpu # noqa: F401 - is_tpu = True logger.debug("Confirmed TPU platform is available.") + return "vllm.platforms.tpu.TpuPlatform" except Exception as e: logger.debug("TPU platform is not available because: %s", str(e)) - - return "vllm.platforms.tpu.TpuPlatform" if is_tpu else None + return None def cuda_platform_plugin() -> Optional[str]: From 45447ab0618ae2b69290c781edd0a19cc2622a97 Mon Sep 17 00:00:00 2001 From: Nick Hill Date: Wed, 30 Jul 2025 18:20:20 +0100 Subject: [PATCH 532/552] [Misc] Support more collective_rpc return types (#21845) Signed-off-by: Nick Hill Signed-off-by: x22x22 --- tests/v1/engine/test_engine_core_client.py | 65 +++++++++++++++++++++- vllm/v1/engine/__init__.py | 9 ++- vllm/v1/engine/core.py | 6 +- vllm/v1/engine/core_client.py | 3 +- vllm/v1/serial_utils.py | 44 +++++++++++++++ 5 files changed, 121 insertions(+), 6 deletions(-) diff --git a/tests/v1/engine/test_engine_core_client.py b/tests/v1/engine/test_engine_core_client.py index 2ac6dc796bd..f648c38a63f 100644 --- a/tests/v1/engine/test_engine_core_client.py +++ b/tests/v1/engine/test_engine_core_client.py @@ -6,8 +6,9 @@ import signal import time import uuid +from dataclasses import dataclass from threading import Thread -from typing import Optional +from typing import Optional, Union from unittest.mock import MagicMock import pytest @@ -292,6 +293,68 @@ async def test_engine_core_client_asyncio(monkeypatch: pytest.MonkeyPatch): client.shutdown() +@dataclass +class MyDataclass: + message: str + + +# Dummy utility function to monkey-patch into engine core. +def echo_dc( + self, + msg: str, + return_list: bool = False, +) -> Union[MyDataclass, list[MyDataclass]]: + print(f"echo dc util function called: {msg}") + # Return dataclass to verify support for returning custom types + # (for which there is special handling to make it work with msgspec). + return [MyDataclass(msg) for _ in range(3)] if return_list \ + else MyDataclass(msg) + + +@pytest.mark.asyncio(loop_scope="function") +async def test_engine_core_client_util_method_custom_return( + monkeypatch: pytest.MonkeyPatch): + + with monkeypatch.context() as m: + m.setenv("VLLM_USE_V1", "1") + + # Must set insecure serialization to allow returning custom types. 
+ m.setenv("VLLM_ALLOW_INSECURE_SERIALIZATION", "1") + + # Monkey-patch core engine utility function to test. + m.setattr(EngineCore, "echo_dc", echo_dc, raising=False) + + engine_args = EngineArgs(model=MODEL_NAME, enforce_eager=True) + vllm_config = engine_args.create_engine_config( + usage_context=UsageContext.UNKNOWN_CONTEXT) + executor_class = Executor.get_class(vllm_config) + + with set_default_torch_num_threads(1): + client = EngineCoreClient.make_client( + multiprocess_mode=True, + asyncio_mode=True, + vllm_config=vllm_config, + executor_class=executor_class, + log_stats=True, + ) + + try: + # Test utility method returning custom / non-native data type. + core_client: AsyncMPClient = client + + result = await core_client.call_utility_async( + "echo_dc", "testarg2", False) + assert isinstance(result, + MyDataclass) and result.message == "testarg2" + result = await core_client.call_utility_async( + "echo_dc", "testarg2", True) + assert isinstance(result, list) and all( + isinstance(r, MyDataclass) and r.message == "testarg2" + for r in result) + finally: + client.shutdown() + + @pytest.mark.parametrize( "multiprocessing_mode,publisher_config", [(True, "tcp"), (False, "inproc")], diff --git a/vllm/v1/engine/__init__.py b/vllm/v1/engine/__init__.py index 79dc80d8fc5..810d03f32d7 100644 --- a/vllm/v1/engine/__init__.py +++ b/vllm/v1/engine/__init__.py @@ -123,6 +123,13 @@ def finished(self) -> bool: return self.finish_reason is not None +class UtilityResult: + """Wrapper for special handling when serializing/deserializing.""" + + def __init__(self, r: Any = None): + self.result = r + + class UtilityOutput( msgspec.Struct, array_like=True, # type: ignore[call-arg] @@ -132,7 +139,7 @@ class UtilityOutput( # Non-None implies the call failed, result should be None. 
failure_message: Optional[str] = None - result: Any = None + result: Optional[UtilityResult] = None class EngineCoreOutputs( diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index 39fda521f36..9f2fca69613 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -36,7 +36,7 @@ from vllm.v1.engine import (EngineCoreOutputs, EngineCoreRequest, EngineCoreRequestType, ReconfigureDistributedRequest, ReconfigureRankType, - UtilityOutput) + UtilityOutput, UtilityResult) from vllm.v1.engine.mm_input_cache import MirroredProcessingCache from vllm.v1.engine.utils import EngineHandshakeMetadata, EngineZmqAddresses from vllm.v1.executor.abstract import Executor @@ -715,8 +715,8 @@ def _handle_client_request(self, request_type: EngineCoreRequestType, output = UtilityOutput(call_id) try: method = getattr(self, method_name) - output.result = method( - *self._convert_msgspec_args(method, args)) + result = method(*self._convert_msgspec_args(method, args)) + output.result = UtilityResult(result) except BaseException as e: logger.exception("Invocation of %s method failed", method_name) output.failure_message = (f"Call to {method_name} method" diff --git a/vllm/v1/engine/core_client.py b/vllm/v1/engine/core_client.py index acff5bf6823..fdf5a5de191 100644 --- a/vllm/v1/engine/core_client.py +++ b/vllm/v1/engine/core_client.py @@ -552,7 +552,8 @@ def _process_utility_output(output: UtilityOutput, if output.failure_message is not None: future.set_exception(Exception(output.failure_message)) else: - future.set_result(output.result) + assert output.result is not None + future.set_result(output.result.result) class SyncMPClient(MPClient): diff --git a/vllm/v1/serial_utils.py b/vllm/v1/serial_utils.py index 03200c2c2f8..4b6a983252b 100644 --- a/vllm/v1/serial_utils.py +++ b/vllm/v1/serial_utils.py @@ -2,6 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import dataclasses +import importlib import pickle from collections.abc import Sequence from inspect import isclass @@ -9,6 +10,7 @@ from typing import Any, Optional, Union import cloudpickle +import msgspec import numpy as np import torch import zmq @@ -22,6 +24,7 @@ MultiModalFlatField, MultiModalKwargs, MultiModalKwargsItem, MultiModalSharedField, NestedTensors) +from vllm.v1.engine import UtilityResult logger = init_logger(__name__) @@ -46,6 +49,10 @@ def _log_insecure_serialization_warning(): "VLLM_ALLOW_INSECURE_SERIALIZATION=1") +def _typestr(t: type): + return t.__module__, t.__qualname__ + + class MsgpackEncoder: """Encoder with custom torch tensor and numpy array serialization. @@ -122,6 +129,18 @@ def enc_hook(self, obj: Any) -> Any: for itemlist in mm._items_by_modality.values() for item in itemlist] + if isinstance(obj, UtilityResult): + result = obj.result + if not envs.VLLM_ALLOW_INSECURE_SERIALIZATION or result is None: + return None, result + # Since utility results are not strongly typed, we also encode + # the type (or a list of types in the case it's a list) to + # help with correct msgspec deserialization. 
+ cls = result.__class__ + return _typestr(cls) if cls is not list else [ + _typestr(type(v)) for v in result + ], result + if not envs.VLLM_ALLOW_INSECURE_SERIALIZATION: raise TypeError(f"Object of type {type(obj)} is not serializable" "Set VLLM_ALLOW_INSECURE_SERIALIZATION=1 to allow " @@ -237,8 +256,33 @@ def dec_hook(self, t: type, obj: Any) -> Any: k: self._decode_nested_tensors(v) for k, v in obj.items() }) + if t is UtilityResult: + return self._decode_utility_result(obj) return obj + def _decode_utility_result(self, obj: Any) -> UtilityResult: + result_type, result = obj + if result_type is not None: + if not envs.VLLM_ALLOW_INSECURE_SERIALIZATION: + raise TypeError("VLLM_ALLOW_INSECURE_SERIALIZATION must " + "be set to use custom utility result types") + assert isinstance(result_type, list) + if len(result_type) == 2 and isinstance(result_type[0], str): + result = self._convert_result(result_type, result) + else: + assert isinstance(result, list) + result = [ + self._convert_result(rt, r) + for rt, r in zip(result_type, result) + ] + return UtilityResult(result) + + def _convert_result(self, result_type: Sequence[str], result: Any): + mod_name, name = result_type + mod = importlib.import_module(mod_name) + result_type = getattr(mod, name) + return msgspec.convert(result, result_type, dec_hook=self.dec_hook) + def _decode_ndarray(self, arr: Any) -> np.ndarray: dtype, shape, data = arr # zero-copy decode. We assume the ndarray will not be kept around, From e8e693a6b0ab1b8a71e9e02f6515e3f2b9ab3c8a Mon Sep 17 00:00:00 2001 From: Doug Smith Date: Wed, 30 Jul 2025 16:04:40 -0400 Subject: [PATCH 533/552] For VLLM_USE_PRECOMPILED, only compiled .so files should be extracted (#21964) Signed-off-by: x22x22 --- setup.py | 79 +++++++++++++++++++++++++++++++------------------------- 1 file changed, 44 insertions(+), 35 deletions(-) diff --git a/setup.py b/setup.py index 58e5833f16a..bf3391e2db1 100644 --- a/setup.py +++ b/setup.py @@ -371,40 +371,31 @@ def run(self) -> None: raise SetupError( f"Failed to get vLLM wheel from {wheel_location}") from e - # During a docker build: determine correct filename, copy wheel. 
- if envs.VLLM_DOCKER_BUILD_CONTEXT: - dist_dir = "/workspace/dist" - os.makedirs(dist_dir, exist_ok=True) - # Determine correct wheel filename from METADATA - with zipfile.ZipFile(wheel_path, "r") as z: - metadata_file = next( - (n for n in z.namelist() - if n.endswith(".dist-info/METADATA")), - None, - ) - if not metadata_file: - raise RuntimeError( - "Could not find METADATA in precompiled wheel.") - metadata = z.read(metadata_file).decode() - version_line = next((line for line in metadata.splitlines() - if line.startswith("Version: ")), None) - if not version_line: - raise RuntimeError( - "Could not determine version from METADATA.") - version = version_line.split(": ")[1].strip() - - # Build correct filename using internal version - arch_tag = "cp38-abi3-manylinux1_x86_64" - corrected_wheel_name = f"vllm-{version}-{arch_tag}.whl" - final_wheel_path = os.path.join(dist_dir, corrected_wheel_name) + # Set the dist_dir for Docker build context + dist_dir = ("/workspace/dist" + if envs.VLLM_DOCKER_BUILD_CONTEXT else "dist") + os.makedirs(dist_dir, exist_ok=True) - print(f"Docker build context detected, copying precompiled wheel " - f"({version}) to {final_wheel_path}") - shutil.copy2(wheel_path, final_wheel_path) - return - - # Unzip the wheel when not in Docker context + # Extract only necessary compiled .so files from precompiled wheel with zipfile.ZipFile(wheel_path) as wheel: + # Get version from METADATA (optional, mostly useful for logging) + metadata_file = next((n for n in wheel.namelist() + if n.endswith(".dist-info/METADATA")), None) + if not metadata_file: + raise RuntimeError( + "Could not find METADATA in precompiled wheel.") + metadata = wheel.read(metadata_file).decode() + version_line = next((line for line in metadata.splitlines() + if line.startswith("Version: ")), None) + if not version_line: + raise RuntimeError( + "Could not determine version from METADATA.") + version = version_line.split(": ")[1].strip() + + print(f"Extracting precompiled kernels from vLLM wheel version: " + f"{version}") + + # List of compiled shared objects to extract files_to_copy = [ "vllm/_C.abi3.so", "vllm/_moe_C.abi3.so", @@ -413,6 +404,7 @@ def run(self) -> None: "vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so", "vllm/cumem_allocator.abi3.so", ] + file_members = list( filter(lambda x: x.filename in files_to_copy, wheel.filelist)) compiled_regex = re.compile( @@ -430,9 +422,26 @@ def run(self) -> None: if package_name not in package_data: package_data[package_name] = [] - wheel.extract(file) - if not file_name.endswith(".py"): - package_data[package_name].append(file_name) + output_base = (dist_dir + if envs.VLLM_DOCKER_BUILD_CONTEXT else ".") + target_path = os.path.join(output_base, file.filename) + os.makedirs(os.path.dirname(target_path), exist_ok=True) + with wheel.open(file.filename) as src, open(target_path, + "wb") as dst: + shutil.copyfileobj(src, dst) + + package_data[package_name].append(file_name) + + # Copy wheel into dist dir for Docker to consume (e.g., via --mount) + if envs.VLLM_DOCKER_BUILD_CONTEXT: + arch_tag = "cp38-abi3-manylinux1_x86_64" + corrected_wheel_name = f"vllm-{version}-{arch_tag}.whl" + final_wheel_path = os.path.join(dist_dir, corrected_wheel_name) + + print( + "Docker build context detected, copying precompiled wheel to " + f"{final_wheel_path}") + shutil.copy2(wheel_path, final_wheel_path) def _no_device() -> bool: From ace708fd0e17dfa1521bee766dc743bddc6223e8 Mon Sep 17 00:00:00 2001 From: Ming Yang Date: Wed, 30 Jul 2025 13:15:06 -0700 Subject: [PATCH 534/552] 
[Misc] Use dracut on CentOS and skip clone if repo exists for EP kernel installation (#21635) Signed-off-by: Ming Yang Signed-off-by: x22x22 --- tools/ep_kernels/configure_system_drivers.sh | 12 +++++- tools/ep_kernels/install_python_libraries.sh | 40 +++++++++++++++++++- 2 files changed, 49 insertions(+), 3 deletions(-) diff --git a/tools/ep_kernels/configure_system_drivers.sh b/tools/ep_kernels/configure_system_drivers.sh index cf15c1dacca..b8bd8b8f6f5 100644 --- a/tools/ep_kernels/configure_system_drivers.sh +++ b/tools/ep_kernels/configure_system_drivers.sh @@ -2,6 +2,16 @@ set -ex # turn on IBGDA echo 'options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"' | tee -a /etc/modprobe.d/nvidia.conf -update-initramfs -u + +if command -v update-initramfs &> /dev/null; then + # for Debian/Ubuntu + sudo update-initramfs -u +elif command -v dracut &> /dev/null; then + # for Fedora/CentOS + sudo dracut --force +else + echo "No supported initramfs update tool found." + exit 1 +fi echo "Please reboot the system to apply the changes" diff --git a/tools/ep_kernels/install_python_libraries.sh b/tools/ep_kernels/install_python_libraries.sh index 83643c084bf..9d1b2da3b41 100644 --- a/tools/ep_kernels/install_python_libraries.sh +++ b/tools/ep_kernels/install_python_libraries.sh @@ -53,9 +53,45 @@ popd export CMAKE_PREFIX_PATH=$WORKSPACE/nvshmem_install:$CMAKE_PREFIX_PATH +is_git_dirty() { + local dir=$1 + pushd "$dir" > /dev/null + + if [ -d ".git" ] && [ -n "$(git status --porcelain 2>/dev/null)" ]; then + popd > /dev/null + return 0 # dirty (true) + else + popd > /dev/null + return 1 # clean (false) + fi +} + +# Function to handle git repository cloning with dirty/incomplete checks +clone_repo() { + local repo_url=$1 + local dir_name=$2 + local key_file=$3 + + if [ -d "$dir_name" ]; then + # Check if directory has uncommitted changes (dirty) + if is_git_dirty "$dir_name"; then + echo "$dir_name directory is dirty, skipping clone" + # Check if clone failed (directory exists but not a valid git repo or missing key files) + elif [ ! -d "$dir_name/.git" ] || [ ! -f "$dir_name/$key_file" ]; then + echo "$dir_name directory exists but clone appears incomplete, cleaning up and re-cloning" + rm -rf "$dir_name" + git clone "$repo_url" + else + echo "$dir_name directory exists and appears complete; manually update if needed" + fi + else + git clone "$repo_url" + fi +} + # build and install pplx, require pytorch installed pushd $WORKSPACE -git clone https://github.com/ppl-ai/pplx-kernels +clone_repo "https://github.com/ppl-ai/pplx-kernels" "pplx-kernels" "setup.py" cd pplx-kernels # see https://github.com/pypa/pip/issues/9955#issuecomment-838065925 # PIP_NO_BUILD_ISOLATION=0 disables build isolation @@ -64,7 +100,7 @@ popd # build and install deepep, require pytorch installed pushd $WORKSPACE -git clone https://github.com/deepseek-ai/DeepEP +clone_repo "https://github.com/deepseek-ai/DeepEP" "DeepEP" "setup.py" cd DeepEP export NVSHMEM_DIR=$WORKSPACE/nvshmem_install PIP_NO_BUILD_ISOLATION=0 pip install -vvv -e . 
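A quick usage note on the two helpers added above, since shell truthiness is easy to misread: `is_git_dirty` exits with status 0 when the checkout is dirty, so it composes directly with `if`, and `clone_repo` builds on that check to make the clone step safe to re-run. The sketch below is illustrative only; it assumes the functions above are already defined in the current shell and that `$WORKSPACE` points at the installer's working directory, with DeepEP used purely as an example:

    # Hypothetical re-run of the clone step. A clean, complete checkout is left
    # alone; an incomplete one (missing .git or setup.py) is wiped and re-cloned;
    # a checkout with uncommitted changes is skipped with a notice.
    cd "$WORKSPACE"
    clone_repo "https://github.com/deepseek-ai/DeepEP" "DeepEP" "setup.py"

    # The dirty check can also be used on its own:
    if is_git_dirty "DeepEP"; then
        echo "DeepEP has uncommitted changes; leaving it as-is"
    fi
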
From d8a2eaec9af28e39f6fb03ad34ff7848450b1d0c Mon Sep 17 00:00:00 2001 From: cascade Date: Wed, 30 Jul 2025 14:23:41 -0700 Subject: [PATCH 535/552] [Feature] Add async tensor parallelism for scaled mm (#20155) Signed-off-by: cascade812 Signed-off-by: x22x22 --- tests/compile/test_async_tp.py | 143 ++++++++++++- vllm/compilation/collective_fusion.py | 244 ++++++++++++++++++++++- vllm/compilation/sequence_parallelism.py | 2 +- 3 files changed, 381 insertions(+), 8 deletions(-) diff --git a/tests/compile/test_async_tp.py b/tests/compile/test_async_tp.py index 916ec2b83df..9a51e6b3514 100644 --- a/tests/compile/test_async_tp.py +++ b/tests/compile/test_async_tp.py @@ -22,6 +22,8 @@ multi_gpu_test) from .backend import TestBackend +FP8_DTYPE = current_platform.fp8_dtype() + prompts = [ "Hello, my name is", "The president of the United States is", @@ -32,9 +34,10 @@ class TestMMRSModel(torch.nn.Module): - def __init__(self, hidden_size=16): + def __init__(self, hidden_size=16, dtype=torch.float16): super().__init__() self.hidden_size = hidden_size + self.dtype = dtype self.gate_proj = torch.nn.Parameter(torch.empty( (self.hidden_size * 2, hidden_size)), requires_grad=False) @@ -64,9 +67,10 @@ def ops_in_model_after(self): class TestAGMMModel(torch.nn.Module): - def __init__(self, hidden_size=16): + def __init__(self, hidden_size=16, dtype=torch.float16): super().__init__() self.hidden_size = hidden_size + self.dtype = dtype self.weight = torch.nn.Parameter(torch.empty( (hidden_size, hidden_size)), requires_grad=False) @@ -91,8 +95,125 @@ def ops_in_model_after(self): return [torch.ops.symm_mem.fused_all_gather_matmul.default] +class _BaseScaledMMModel(torch.nn.Module): + + def __init__(self, hidden_size=16, dtype=torch.float16): + super().__init__() + self.hidden_size = hidden_size + self.dtype = dtype + self.weight = torch.empty([hidden_size, hidden_size], dtype=FP8_DTYPE)\ + .contiguous().transpose(0, 1) + + # Initialize scale_b for _scaled_mm. 
+ self.scale_b = torch.ones(1, self.hidden_size, dtype=torch.float32) + + +class TestScaledMMRSModel(_BaseScaledMMModel): + + def forward(self, input: torch.Tensor): + """ + Forward pass implementing the scaled_mm + reduce scatter in the FX graph + + """ + fp8_input = input.to(FP8_DTYPE) + scale_a = torch.ones(input.shape[0], 1, dtype=torch.float32) + scaled_mm = torch._scaled_mm(fp8_input, + self.weight, + scale_a=scale_a, + scale_b=self.scale_b, + out_dtype=self.dtype) + reduce_scatter = tensor_model_parallel_reduce_scatter(scaled_mm, dim=0) + return reduce_scatter + + def ops_in_model_before(self): + return [torch.ops.vllm.reduce_scatter.default] + + def ops_in_model_after(self): + return [torch.ops.symm_mem.fused_scaled_matmul_reduce_scatter.default] + + +class TestAGScaledMMModel(_BaseScaledMMModel): + + def forward(self, input: torch.Tensor): + """ + Forward pass implementing the all gather + scaled_mm in the FX graph + """ + # Reshape input + fp8_input = input.to(FP8_DTYPE) + all_gather = tensor_model_parallel_all_gather(fp8_input, dim=0) + + scale_a = torch.ones(all_gather.shape[0], 1, dtype=torch.float32) + scaled_mm = torch._scaled_mm(all_gather, + self.weight, + scale_a=scale_a, + scale_b=self.scale_b, + out_dtype=self.dtype) + return scaled_mm + + def ops_in_model_before(self): + return [torch.ops.vllm.all_gather.default] + + def ops_in_model_after(self): + return [torch.ops.symm_mem.fused_all_gather_scaled_matmul.default] + + +class TestCutlassScaledMMRSModel(_BaseScaledMMModel): + + def forward(self, input: torch.Tensor): + """ + Forward pass implementing the cutlass_scaled_mm + reduce scatter + in the FX graph + + """ + fp8_input = input.to(FP8_DTYPE) + scale_a = torch.ones(input.shape[0], 1, dtype=torch.float32) + mm_out = torch.empty((fp8_input.shape[0], self.weight.shape[1]), + dtype=self.dtype, + device=input.device) + torch.ops._C.cutlass_scaled_mm(mm_out, fp8_input, self.weight, scale_a, + self.scale_b, None) + reduce_scatter = tensor_model_parallel_reduce_scatter(mm_out, dim=0) + return reduce_scatter + + def ops_in_model_before(self): + return [torch.ops.vllm.reduce_scatter.default] + + def ops_in_model_after(self): + return [torch.ops.symm_mem.fused_scaled_matmul_reduce_scatter.default] + + +class TestAGCutlassScaledMMModel(_BaseScaledMMModel): + + def forward(self, input: torch.Tensor): + """ + Forward pass implementing the all gather + cutlass_scaled_mm + in the FX graph + """ + # Reshape input + fp8_input = input.to(FP8_DTYPE) + all_gather = tensor_model_parallel_all_gather(fp8_input, dim=0) + + scale_a = torch.ones(all_gather.shape[0], 1, dtype=torch.float32) + + mm_out = torch.empty((all_gather.shape[0], self.weight.shape[1]), + dtype=self.dtype, + device=all_gather.device) + torch.ops._C.cutlass_scaled_mm(mm_out, all_gather, self.weight, + scale_a, self.scale_b, None) + return mm_out + + def ops_in_model_before(self): + return [torch.ops.vllm.all_gather.default] + + def ops_in_model_after(self): + return [torch.ops.symm_mem.fused_all_gather_scaled_matmul.default] + + @multi_gpu_test(num_gpus=2) -@pytest.mark.parametrize("test_model", [TestMMRSModel, TestAGMMModel]) +@pytest.mark.parametrize("test_model", [ + TestMMRSModel, TestAGMMModel, TestScaledMMRSModel, TestAGScaledMMModel, + TestCutlassScaledMMRSModel, TestAGCutlassScaledMMModel +]) @pytest.mark.parametrize("batch_size", [8]) @pytest.mark.parametrize("seq_len", [16]) @pytest.mark.parametrize("hidden_size", [16]) @@ -101,6 +222,14 @@ def ops_in_model_after(self): reason="Only test on CUDA") def 
test_async_tp_pass_replace(test_model: str, batch_size: int, seq_len: int, hidden_size: int, dtype: torch.dtype): + if test_model in (TestScaledMMRSModel, TestAGScaledMMModel, + TestCutlassScaledMMRSModel, + TestAGCutlassScaledMMModel) and dtype == torch.float16: + pytest.skip( + "Only bf16 high precision output types are supported for " \ + "per-token (row-wise) scaling" + ) + num_processes = 2 def run_torch_spawn(fn, nprocs): @@ -155,7 +284,8 @@ def async_tp_pass_on_test_model(local_rank: int, world_size: int, async_tp_pass = AsyncTPPass(vllm_config) backend = TestBackend(async_tp_pass) - model = test_model_cls(hidden_size) + model = test_model_cls(hidden_size, + dtype) # Pass dtype to model constructor hidden_states = torch.randn((batch_size * seq_len, hidden_size), dtype=dtype, @@ -174,7 +304,10 @@ def async_tp_pass_on_test_model(local_rank: int, world_size: int, @create_new_process_for_each_test() -@pytest.mark.parametrize("model_id", ["meta-llama/Llama-3.2-1B-Instruct"]) +@pytest.mark.parametrize("model_id", [ + "meta-llama/Llama-3.2-1B-Instruct", + "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8" +]) @pytest.mark.parametrize("tp_size", [2]) @pytest.mark.parametrize("async_tp_enabled", [True]) @pytest.mark.parametrize("distributed_backend", ["mp"]) diff --git a/vllm/compilation/collective_fusion.py b/vllm/compilation/collective_fusion.py index 0e7961841bd..cb99fe8310e 100644 --- a/vllm/compilation/collective_fusion.py +++ b/vllm/compilation/collective_fusion.py @@ -15,10 +15,13 @@ from vllm.distributed.parallel_state import ( get_tensor_model_parallel_rank, get_tensor_model_parallel_world_size) from vllm.logger import init_logger +from vllm.platforms import current_platform from vllm.utils import direct_register_custom_op from .vllm_inductor_pass import VllmInductorPass +FP8_DTYPE = current_platform.fp8_dtype() + if find_spec("flashinfer"): try: import flashinfer.comm as flashinfer_comm @@ -28,7 +31,6 @@ flashinfer_comm = None else: flashinfer_comm = None -from vllm.platforms import current_platform logger = init_logger(__name__) @@ -118,6 +120,230 @@ def replacement( pm.fwd_only, pm_pass) +class ScaledMMReduceScatterPattern(BasePattern): + + def get_inputs(self): + input = torch.empty([16, 16], device=self.device, dtype=FP8_DTYPE) + mm_weight = torch.empty([16, 16], device=self.device, + dtype=FP8_DTYPE).contiguous().transpose(0, 1) + scale_a = torch.empty([16, 1], device=self.device, dtype=torch.float32) + scale_b = torch.empty([1, 16], device=self.device, dtype=torch.float32) + return [input, mm_weight, scale_a, scale_b] + + def register(self, pm_pass: PatternMatcherPass): + + def pattern(input: torch.Tensor, mat2: torch.Tensor, + scale_a: torch.Tensor, + scale_b: torch.Tensor) -> torch.Tensor: + scaled_mm = torch.ops.aten._scaled_mm.default(input, + mat2=mat2, + scale_a=scale_a, + scale_b=scale_b, + bias=None, + scale_result=None, + out_dtype=self.dtype) + reduce_scatter = torch.ops.vllm.reduce_scatter.default( + scaled_mm, + dim=0, + world_size=self.tp_size, + group_name=self.tp.unique_name) + return reduce_scatter + + def replacement(input: torch.Tensor, mat2: torch.Tensor, + scale_a: torch.Tensor, + scale_b: torch.Tensor) -> torch.Tensor: + gemm_rs = torch.ops.symm_mem.fused_scaled_matmul_reduce_scatter( + input, + mat2, + scale_a, + scale_b, + "avg", + scatter_dim=0, + out_dtype=self.dtype, + group_name=self.tp.device_group.group_name, + ) + + return gemm_rs + + pm.register_replacement(pattern, replacement, self.get_inputs(), + pm.fwd_only, pm_pass) + + +class 
AllGatherScaledMMPattern(BasePattern): + + def get_inputs(self): + x = torch.empty([8, 16], device=self.device, dtype=FP8_DTYPE) + weight = torch.empty([16, 16], device=self.device, + dtype=FP8_DTYPE).contiguous().transpose(0, 1) + + s1 = x.shape[0] * self.tp_size + + scale_a = torch.empty([s1, 1], device=self.device, dtype=torch.float32) + scale_b = torch.empty([1, 16], device=self.device, dtype=torch.float32) + + return [x, weight, scale_a, scale_b] + + def register(self, pm_pass: PatternMatcherPass): + + def pattern( + x: torch.Tensor, + weight: torch.Tensor, + scale_a: torch.Tensor, + scale_b: torch.Tensor, + ) -> torch.Tensor: + all_gather = torch.ops.vllm.all_gather.default( + x, + dim=0, + world_size=self.tp_size, + group_name=self.tp.unique_name) + + return torch.ops.aten._scaled_mm.default(all_gather, + mat2=weight, + scale_a=scale_a, + scale_b=scale_b, + bias=None, + scale_result=None, + out_dtype=self.dtype) + + def replacement(x: torch.Tensor, weight: torch.Tensor, + scale_a: torch.Tensor, + scale_b: torch.Tensor) -> torch.Tensor: + ag_output, mm_outputs = torch.ops.symm_mem.fused_all_gather_scaled_matmul( # noqa + x, + [weight], + scale_a, + [scale_b], + gather_dim=0, + biases=[None], + result_scales=[None], + out_dtypes=[self.dtype], + use_fast_accum=[False], + group_name=self.tp.device_group.group_name, + ) + return mm_outputs + + pm.register_replacement(pattern, replacement, self.get_inputs(), + pm.fwd_only, pm_pass) + + +class CutlassScaledMMReduceScatterPattern(BasePattern): + + def get_inputs(self): + input = torch.empty([16, 16], device=self.device, dtype=FP8_DTYPE) + mm_weight = torch.empty([16, 16], device=self.device, + dtype=FP8_DTYPE).contiguous().transpose(0, 1) + scale_a = torch.empty([16, 1], device=self.device, dtype=torch.float32) + scale_b = torch.empty([1, 16], device=self.device, dtype=torch.float32) + + cutlass_mm_output = torch.empty([16, 16], + device=self.device, + dtype=self.dtype) + return [input, mm_weight, scale_a, scale_b, cutlass_mm_output] + + def register(self, pm_pass: PatternMatcherPass): + + def pattern(input: torch.Tensor, weight: torch.Tensor, + scale_a: torch.Tensor, scale_b: torch.Tensor, + cutlass_mm_output: torch.Tensor) -> torch.Tensor: + cutlass_scaled_mm = torch.ops.higher_order.auto_functionalized( + torch.ops._C.cutlass_scaled_mm.default, + out=cutlass_mm_output, + a=input, + b=weight, + a_scales=scale_a, + b_scales=scale_b, + bias=None) + + reduce_scatter = torch.ops.vllm.reduce_scatter.default( + cutlass_scaled_mm[1], + dim=0, + world_size=self.tp_size, + group_name=self.tp.unique_name) + return reduce_scatter + + def replacement(input: torch.Tensor, mat2: torch.Tensor, + scale_a: torch.Tensor, scale_b: torch.Tensor, + cutlass_mm_output: torch.Tensor) -> torch.Tensor: + gemm_rs = torch.ops.symm_mem.fused_scaled_matmul_reduce_scatter( + input, + mat2, + scale_a, + scale_b, + "avg", + scatter_dim=0, + out_dtype=self.dtype, + group_name=self.tp.device_group.group_name, + ) + + return gemm_rs + + pm.register_replacement(pattern, replacement, self.get_inputs(), + pm.fwd_only, pm_pass) + + +class AllGatherCutlassScaledMMPattern(BasePattern): + + def get_inputs(self): + x = torch.empty([8, 16], device=self.device, dtype=FP8_DTYPE) + weight = torch.empty([16, 16], device=self.device, + dtype=FP8_DTYPE).contiguous().transpose(0, 1) + + s1 = x.shape[0] * self.tp_size + + scale_a = torch.empty([s1, 1], device=self.device, dtype=torch.float32) + scale_b = torch.empty([1, 16], device=self.device, dtype=torch.float32) + + s2 = weight.shape[1] 
+ output = torch.empty([s1, s2], device=self.device, dtype=self.dtype) + + return [x, weight, scale_a, scale_b, output] + + def register(self, pm_pass: PatternMatcherPass): + + def pattern( + x: torch.Tensor, + weight: torch.Tensor, + scale_a: torch.Tensor, + scale_b: torch.Tensor, + output: torch.Tensor, + ) -> torch.Tensor: + all_gather = torch.ops.vllm.all_gather.default( + x, + dim=0, + world_size=self.tp_size, + group_name=self.tp.unique_name) + + cutlass_scaled_mm = torch.ops.higher_order.auto_functionalized( + torch.ops._C.cutlass_scaled_mm.default, + out=output, + a=all_gather, + b=weight, + a_scales=scale_a, + b_scales=scale_b, + bias=None) + return cutlass_scaled_mm[1] + + def replacement(x: torch.Tensor, weight: torch.Tensor, + scale_a: torch.Tensor, scale_b: torch.Tensor, + output: torch.Tensor) -> torch.Tensor: + ag_output, mm_outputs = torch.ops.symm_mem.fused_all_gather_scaled_matmul( # noqa + x, + [weight], + scale_a, + [scale_b], + gather_dim=0, + biases=[None], + result_scales=[None], + out_dtypes=[self.dtype], + use_fast_accum=[False], + group_name=self.tp.device_group.group_name, + ) + return mm_outputs + + pm.register_replacement(pattern, replacement, self.get_inputs(), + pm.fwd_only, pm_pass) + + class AsyncTPPass(VllmInductorPass): def __init__(self, config: VllmConfig): @@ -133,6 +359,20 @@ def __init__(self, config: VllmConfig): AllGatherGEMMPattern(self.model_dtype, self.device).register(self.patterns) + # These fusions are enabled only for bfloat16 models because + # `scaled_mm` or `cutlass_scaled_mm` with per-token (row-wise) scaling + # only supports bfloat16 as the output dtype. + if self.model_dtype == torch.bfloat16: + ScaledMMReduceScatterPattern(self.model_dtype, + self.device).register(self.patterns) + AllGatherScaledMMPattern(self.model_dtype, + self.device).register(self.patterns) + + CutlassScaledMMReduceScatterPattern( + self.model_dtype, self.device).register(self.patterns) + AllGatherCutlassScaledMMPattern( + self.model_dtype, self.device).register(self.patterns) + def is_applicable_for_shape(self, shape: Optional[int]) -> bool: # only do replace for specific shapes tp_size = get_tensor_model_parallel_world_size() @@ -142,7 +382,7 @@ def __call__(self, graph: fx.Graph): self.begin() self.dump_graph(graph, "before_async_tp_pass") count = self.patterns.apply(graph) - logger.debug("Replaced %s patterns", count) + logger.debug("Replaced %s patterns with async TP pass.", count) self.dump_graph(graph, "after_async_tp_pass") self.end_and_log() diff --git a/vllm/compilation/sequence_parallelism.py b/vllm/compilation/sequence_parallelism.py index 6107046e40d..ebc025cba71 100644 --- a/vllm/compilation/sequence_parallelism.py +++ b/vllm/compilation/sequence_parallelism.py @@ -477,6 +477,6 @@ def __call__(self, graph: fx.Graph): self.begin() self.dump_graph(graph, "before_sequence_parallelism_pass") count = self.patterns.apply(graph) - logger.debug("Replaced %s patterns", count) + logger.debug("Replaced %s patterns with sequence parallelism", count) self.dump_graph(graph, "after_sequence_parallelism_pass") self.end_and_log() From 181202f9add9503b3e25aa75e1d3d521f1f5ef08 Mon Sep 17 00:00:00 2001 From: Bram <153647206+br4mm@users.noreply.github.com> Date: Wed, 30 Jul 2025 14:44:02 -0700 Subject: [PATCH 536/552] [Bugfix] Fix None value handling in trace span creation for cancelled requests (#20272) Signed-off-by: x22x22 --- vllm/engine/llm_engine.py | 27 ++++++++++++++++++++------- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git 
a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py index 3f30a34170f..79255b031ee 100644 --- a/vllm/engine/llm_engine.py +++ b/vllm/engine/llm_engine.py @@ -1862,8 +1862,14 @@ def create_trace_span(self, seq_group: SequenceGroup) -> None: context=trace_context, start_time=arrival_time_nano_seconds) as seq_span: metrics = seq_group.metrics - ttft = metrics.first_token_time - metrics.arrival_time - e2e_time = metrics.finished_time - metrics.arrival_time + + # Handle potential None values for cancelled/aborted requests + ttft = (metrics.first_token_time - metrics.arrival_time + if metrics.first_token_time is not None else None) + + e2e_time = (metrics.finished_time - metrics.arrival_time + if metrics.finished_time is not None else None) + seq_span.set_attribute(SpanAttributes.GEN_AI_RESPONSE_MODEL, self.model_config.model) seq_span.set_attribute(SpanAttributes.GEN_AI_REQUEST_ID, @@ -1886,11 +1892,18 @@ def create_trace_span(self, seq_group: SequenceGroup) -> None: seq.get_output_len() for seq in seq_group.get_finished_seqs() ])) - seq_span.set_attribute(SpanAttributes.GEN_AI_LATENCY_TIME_IN_QUEUE, - metrics.time_in_queue) - seq_span.set_attribute( - SpanAttributes.GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN, ttft) - seq_span.set_attribute(SpanAttributes.GEN_AI_LATENCY_E2E, e2e_time) + + # Only set timing attributes if the values are available + if metrics.time_in_queue is not None: + seq_span.set_attribute( + SpanAttributes.GEN_AI_LATENCY_TIME_IN_QUEUE, + metrics.time_in_queue) + if ttft is not None: + seq_span.set_attribute( + SpanAttributes.GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN, ttft) + if e2e_time is not None: + seq_span.set_attribute(SpanAttributes.GEN_AI_LATENCY_E2E, + e2e_time) if metrics.scheduler_time is not None: seq_span.set_attribute( SpanAttributes.GEN_AI_LATENCY_TIME_IN_SCHEDULER, From 3b91b171ad7abc6b897c8802e5b989a3f618b3f1 Mon Sep 17 00:00:00 2001 From: Zebing Lin Date: Wed, 30 Jul 2025 18:00:54 -0400 Subject: [PATCH 537/552] [Core] Move EngineCoreRequest to Request conversion out of EngineCore (#21627) Signed-off-by: linzebing Signed-off-by: x22x22 --- tests/v1/engine/test_engine_core.py | 44 ++++++++++------- vllm/v1/engine/core.py | 74 ++++++++++++++++++----------- vllm/v1/engine/core_client.py | 3 +- 3 files changed, 73 insertions(+), 48 deletions(-) diff --git a/tests/v1/engine/test_engine_core.py b/tests/v1/engine/test_engine_core.py index eb826bf0623..c52b9896712 100644 --- a/tests/v1/engine/test_engine_core.py +++ b/tests/v1/engine/test_engine_core.py @@ -65,7 +65,8 @@ def test_engine_core(monkeypatch: pytest.MonkeyPatch): """Test basic request lifecycle.""" # First request. - engine_core.add_request(make_request()) + engine_core.add_request( + *engine_core.preprocess_add_request(make_request())) assert len(engine_core.scheduler.waiting) == 1 assert len(engine_core.scheduler.running) == 0 @@ -74,7 +75,8 @@ def test_engine_core(monkeypatch: pytest.MonkeyPatch): assert len(engine_core.scheduler.running) == 1 # Second request. - engine_core.add_request(make_request()) + engine_core.add_request( + *engine_core.preprocess_add_request(make_request())) assert len(engine_core.scheduler.waiting) == 1 assert len(engine_core.scheduler.running) == 1 @@ -83,8 +85,10 @@ def test_engine_core(monkeypatch: pytest.MonkeyPatch): assert len(engine_core.scheduler.running) == 2 # Add two requests in a row. 
- engine_core.add_request(make_request()) - engine_core.add_request(make_request()) + engine_core.add_request( + *engine_core.preprocess_add_request(make_request())) + engine_core.add_request( + *engine_core.preprocess_add_request(make_request())) assert len(engine_core.scheduler.waiting) == 2 assert len(engine_core.scheduler.running) == 2 @@ -104,7 +108,7 @@ def test_engine_core(monkeypatch: pytest.MonkeyPatch): req = make_request() request_id = req.request_id - engine_core.add_request(req) + engine_core.add_request(*engine_core.preprocess_add_request(req)) assert len(engine_core.scheduler.waiting) == 1 assert len(engine_core.scheduler.running) == 0 assert engine_core.scheduler.has_unfinished_requests() @@ -131,8 +135,8 @@ def test_engine_core(monkeypatch: pytest.MonkeyPatch): req1 = make_request() req2 = make_request() - engine_core.add_request(req0) - engine_core.add_request(req1) + engine_core.add_request(*engine_core.preprocess_add_request(req0)) + engine_core.add_request(*engine_core.preprocess_add_request(req1)) assert len(engine_core.scheduler.waiting) == 2 assert len(engine_core.scheduler.running) == 0 @@ -140,7 +144,7 @@ def test_engine_core(monkeypatch: pytest.MonkeyPatch): assert len(engine_core.scheduler.waiting) == 0 assert len(engine_core.scheduler.running) == 2 - engine_core.add_request(req2) + engine_core.add_request(*engine_core.preprocess_add_request(req2)) assert len(engine_core.scheduler.waiting) == 1 assert len(engine_core.scheduler.running) == 2 @@ -166,12 +170,12 @@ def test_engine_core(monkeypatch: pytest.MonkeyPatch): req0 = make_request() req1 = make_request() req0.request_id = req1.request_id = "test" - engine_core.add_request(req0) + engine_core.add_request(*engine_core.preprocess_add_request(req0)) while (outs := engine_core.step()[0].get(0)) and outs.outputs: pass - engine_core.add_request(req1) + engine_core.add_request(*engine_core.preprocess_add_request(req1)) while (outs := engine_core.step()[0].get(0)) and outs.outputs: pass @@ -207,7 +211,7 @@ def test_engine_core_advanced_sampling(monkeypatch: pytest.MonkeyPatch): repetition_penalty=0.1, stop_token_ids=[1001, 1002], ) - engine_core.add_request(request) + engine_core.add_request(*engine_core.preprocess_add_request(request)) def _check_engine_state(): assert len(engine_core.scheduler.waiting) == 1 @@ -226,7 +230,7 @@ def _check_engine_state(): top_p=0.99, top_k=50, ) - engine_core.add_request(request2) + engine_core.add_request(*engine_core.preprocess_add_request(request2)) _check_engine_state() @@ -298,9 +302,9 @@ def shutdown(self): # Add two requests in a row. Each request have 12 prompt tokens. 
req0 = make_request_with_max_tokens("0", 5) - engine_core.add_request(req0) + engine_core.add_request(*engine_core.preprocess_add_request(req0)) req1 = make_request_with_max_tokens("1", 5) - engine_core.add_request(req1) + engine_core.add_request(*engine_core.preprocess_add_request(req1)) # Schedule Batch 1: (10, req0) assert engine_core.step_with_batch_queue()[0] is None @@ -436,7 +440,8 @@ def test_engine_core_invalid_request_id_type(monkeypatch: pytest.MonkeyPatch): with pytest.raises(TypeError, match="request_id must be a string, got.*UUID"): - engine_core.add_request(uuid_request) + engine_core.add_request( + *engine_core.preprocess_add_request(uuid_request)) # Test with integer int_request = make_request() @@ -444,7 +449,8 @@ def test_engine_core_invalid_request_id_type(monkeypatch: pytest.MonkeyPatch): with pytest.raises(TypeError, match="request_id must be a string, got.*int"): - engine_core.add_request(int_request) + engine_core.add_request( + *engine_core.preprocess_add_request(int_request)) # Test with None none_request = make_request() @@ -452,10 +458,12 @@ def test_engine_core_invalid_request_id_type(monkeypatch: pytest.MonkeyPatch): with pytest.raises(TypeError, match="request_id must be a string, got.*NoneType"): - engine_core.add_request(none_request) + engine_core.add_request( + *engine_core.preprocess_add_request(none_request)) # Verify engine is still functional after errors valid_request = make_request() - engine_core.add_request(valid_request) + engine_core.add_request( + *engine_core.preprocess_add_request(valid_request)) assert len(engine_core.scheduler.waiting) == 1 assert len(engine_core.scheduler.running) == 0 diff --git a/vllm/v1/engine/core.py b/vllm/v1/engine/core.py index 9f2fca69613..f9a6315df8a 100644 --- a/vllm/v1/engine/core.py +++ b/vllm/v1/engine/core.py @@ -205,8 +205,12 @@ def _initialize_kv_caches( def get_supported_tasks(self) -> tuple[SupportedTask, ...]: return self.model_executor.supported_tasks - def add_request(self, request: EngineCoreRequest): - """Add request to the scheduler.""" + def add_request(self, request: Request, request_wave: int = 0): + """Add request to the scheduler. + + `request_wave`: indicate which wave of requests this is expected to + belong to in DP case + """ # Validate the request_id type. if not isinstance(request.request_id, str): raise TypeError( @@ -222,27 +226,12 @@ def add_request(self, request: EngineCoreRequest): raise ValueError(f"Unsupported task: {pooling_params.task!r} " f"Supported tasks: {supported_pooling_tasks}") - if request.mm_hashes is not None: - # Here, if hash exists for a multimodal input, then it will be - # fetched from the cache, else it will be added to the cache. - # Note that the cache here is mirrored with the client cache, so - # anything that has a hash must have a HIT cache entry here - # as well. - assert request.mm_inputs is not None - request.mm_inputs = self.mm_input_cache_server.get_and_update_p1( - request.mm_inputs, request.mm_hashes) - - req = Request.from_engine_core_request(request) - if req.use_structured_output: - # Start grammar compilation asynchronously - self.structured_output_manager.grammar_init(req) - - if req.kv_transfer_params is not None and ( + if request.kv_transfer_params is not None and ( not self.scheduler.get_kv_connector()): logger.warning("Got kv_transfer_params, but no KVConnector found. 
" "Disabling KVTransfer for this request.") - self.scheduler.add_request(req) + self.scheduler.add_request(request) def abort_requests(self, request_ids: list[str]): """Abort requests from the scheduler.""" @@ -414,6 +403,31 @@ def save_tensorized_model( self.model_executor.save_tensorized_model( tensorizer_config=tensorizer_config, ) + def preprocess_add_request( + self, request: EngineCoreRequest) -> tuple[Request, int]: + """Preprocess the request. + + This function could be directly used in input processing thread to allow + request initialization running in parallel with Model forward + """ + if request.mm_hashes is not None: + assert request.mm_inputs is not None + # Note on thread safety: no race condition. + # `mm_input_cache_server` is reset at the end of LLMEngine init, + # and will only accessed in the input processing thread afterwards. + request.mm_inputs = self.mm_input_cache_server.get_and_update_p1( + request.mm_inputs, request.mm_hashes) + + req = Request.from_engine_core_request(request) + if req.use_structured_output: + # Note on thread safety: no race condition. + # `grammar_init` is only invoked in input processing thread. For + # `structured_output_manager`, each request is independent and + # grammar compilation is async. Scheduler always checks grammar + # compilation status before scheduling request. + self.structured_output_manager.grammar_init(req) + return req, request.current_wave + class EngineCoreProc(EngineCore): """ZMQ-wrapper for running EngineCore in background process.""" @@ -707,7 +721,8 @@ def _handle_client_request(self, request_type: EngineCoreRequestType, """Dispatch request from client.""" if request_type == EngineCoreRequestType.ADD: - self.add_request(request) + req, request_wave = request + self.add_request(req, request_wave) elif request_type == EngineCoreRequestType.ABORT: self.abort_requests(request) elif request_type == EngineCoreRequestType.UTILITY: @@ -806,10 +821,11 @@ def process_input_sockets(self, input_addresses: list[str], bytes(type_frame.buffer)) # Deserialize the request data. - decoder = add_request_decoder if ( - request_type - == EngineCoreRequestType.ADD) else generic_decoder - request = decoder.decode(data_frames) + if request_type == EngineCoreRequestType.ADD: + request = add_request_decoder.decode(data_frames) + request = self.preprocess_add_request(request) + else: + request = generic_decoder.decode(data_frames) # Push to input queue for core busy loop. self.input_queue.put_nowait((request_type, request)) @@ -939,17 +955,17 @@ def shutdown(self): if dp_group := getattr(self, "dp_group", None): stateless_destroy_torch_distributed_process_group(dp_group) - def add_request(self, request: EngineCoreRequest): - if self.has_coordinator and request.current_wave != self.current_wave: - if request.current_wave > self.current_wave: - self.current_wave = request.current_wave + def add_request(self, request: Request, request_wave: int = 0): + if self.has_coordinator and request_wave != self.current_wave: + if request_wave > self.current_wave: + self.current_wave = request_wave elif not self.engines_running: # Request received for an already-completed wave, notify # front-end that we need to start the next one. 
self.output_queue.put_nowait( (-1, EngineCoreOutputs(start_wave=self.current_wave))) - super().add_request(request) + super().add_request(request, request_wave) def _handle_client_request(self, request_type: EngineCoreRequestType, request: Any) -> None: diff --git a/vllm/v1/engine/core_client.py b/vllm/v1/engine/core_client.py index fdf5a5de191..26985df6f62 100644 --- a/vllm/v1/engine/core_client.py +++ b/vllm/v1/engine/core_client.py @@ -250,7 +250,8 @@ def get_supported_tasks(self) -> tuple[SupportedTask, ...]: return self.engine_core.get_supported_tasks() def add_request(self, request: EngineCoreRequest) -> None: - self.engine_core.add_request(request) + req, request_wave = self.engine_core.preprocess_add_request(request) + self.engine_core.add_request(req, request_wave) def abort_requests(self, request_ids: list[str]) -> None: if len(request_ids) > 0: From f2612885466150dc1c2423325fe197cb9b14e20c Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 30 Jul 2025 20:39:46 -0400 Subject: [PATCH 538/552] [Example] Add `async_llm_streaming.py` example for AsyncLLM streaming in python (#21763) Signed-off-by: mgoin Signed-off-by: x22x22 --- .../offline_inference/async_llm_streaming.py | 111 ++++++++++++++++++ 1 file changed, 111 insertions(+) create mode 100644 examples/offline_inference/async_llm_streaming.py diff --git a/examples/offline_inference/async_llm_streaming.py b/examples/offline_inference/async_llm_streaming.py new file mode 100644 index 00000000000..b876d536e3a --- /dev/null +++ b/examples/offline_inference/async_llm_streaming.py @@ -0,0 +1,111 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +Simple example demonstrating streaming offline inference with AsyncLLM (V1 engine). + +This script shows the core functionality of vLLM's AsyncLLM engine for streaming +token-by-token output in offline inference scenarios. It demonstrates DELTA mode +streaming where you receive new tokens as they are generated. + +Usage: + python examples/offline_inference/async_llm_streaming.py +""" + +import asyncio + +from vllm import SamplingParams +from vllm.engine.arg_utils import AsyncEngineArgs +from vllm.sampling_params import RequestOutputKind +from vllm.v1.engine.async_llm import AsyncLLM + + +async def stream_response(engine: AsyncLLM, prompt: str, request_id: str) -> None: + """ + Stream response from AsyncLLM and display tokens as they arrive. + + This function demonstrates the core streaming pattern: + 1. Create SamplingParams with DELTA output kind + 2. Call engine.generate() and iterate over the async generator + 3. Print new tokens as they arrive + 4. 
Handle the finished flag to know when generation is complete + """ + print(f"\n🚀 Prompt: {prompt!r}") + print("💬 Response: ", end="", flush=True) + + # Configure sampling parameters for streaming + sampling_params = SamplingParams( + max_tokens=100, + temperature=0.8, + top_p=0.95, + seed=42, # For reproducible results + output_kind=RequestOutputKind.DELTA, # Get only new tokens each iteration + ) + + try: + # Stream tokens from AsyncLLM + async for output in engine.generate( + request_id=request_id, prompt=prompt, sampling_params=sampling_params + ): + # Process each completion in the output + for completion in output.outputs: + # In DELTA mode, we get only new tokens generated since last iteration + new_text = completion.text + if new_text: + print(new_text, end="", flush=True) + + # Check if generation is finished + if output.finished: + print("\n✅ Generation complete!") + break + + except Exception as e: + print(f"\n❌ Error during streaming: {e}") + raise + + +async def main(): + print("🔧 Initializing AsyncLLM...") + + # Create AsyncLLM engine with simple configuration + engine_args = AsyncEngineArgs( + model="meta-llama/Llama-3.2-1B-Instruct", + enforce_eager=True, # Faster startup for examples + ) + engine = AsyncLLM.from_engine_args(engine_args) + + try: + # Example prompts to demonstrate streaming + prompts = [ + "The future of artificial intelligence is", + "In a galaxy far, far away", + "The key to happiness is", + ] + + print(f"🎯 Running {len(prompts)} streaming examples...") + + # Process each prompt + for i, prompt in enumerate(prompts, 1): + print(f"\n{'=' * 60}") + print(f"Example {i}/{len(prompts)}") + print(f"{'=' * 60}") + + request_id = f"stream-example-{i}" + await stream_response(engine, prompt, request_id) + + # Brief pause between examples + if i < len(prompts): + await asyncio.sleep(0.5) + + print("\n🎉 All streaming examples completed!") + + finally: + # Always clean up the engine + print("🔧 Shutting down engine...") + engine.shutdown() + + +if __name__ == "__main__": + try: + asyncio.run(main()) + except KeyboardInterrupt: + print("\n🛑 Interrupted by user") From 1843059c3a195710befdf2dbfaa5c9476e2bbc15 Mon Sep 17 00:00:00 2001 From: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Thu, 31 Jul 2025 04:38:52 +0100 Subject: [PATCH 539/552] [Bugfix] Relax lang pin for voxtral (#21833) Signed-off-by: Sanchit Gandhi Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/entrypoints/openai/speech_to_text.py | 8 +-- vllm/model_executor/models/interfaces.py | 53 ++++++++++++++-- vllm/model_executor/models/voxtral.py | 25 +++++--- vllm/model_executor/models/whisper.py | 74 +++++------------------ 4 files changed, 80 insertions(+), 80 deletions(-) diff --git a/vllm/entrypoints/openai/speech_to_text.py b/vllm/entrypoints/openai/speech_to_text.py index c2227a21a4b..01140a4bfea 100644 --- a/vllm/entrypoints/openai/speech_to_text.py +++ b/vllm/entrypoints/openai/speech_to_text.py @@ -86,11 +86,7 @@ async def _preprocess_speech_to_text( audio_data: bytes, ) -> tuple[list[PromptType], float]: # Validate request - # TODO language should be optional and can be guessed. - # For now we default to en. 
See - # https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/generation_whisper.py#L1520 - lang = request.language or "en" - self.model_cls.validate_language(lang) + language = self.model_cls.validate_language(request.language) if len(audio_data) / 1024**2 > self.max_audio_filesize_mb: raise ValueError("Maximum file size exceeded.") @@ -112,7 +108,7 @@ async def _preprocess_speech_to_text( audio=chunk, stt_config=self.asr_config, model_config=self.model_config, - language=lang, + language=language, task_type=self.task_type, request_prompt=request.prompt) prompts.append(prompt) diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py index 957b57276b4..b6d9877cd01 100644 --- a/vllm/model_executor/models/interfaces.py +++ b/vllm/model_executor/models/interfaces.py @@ -1,13 +1,14 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -from collections.abc import Iterable, MutableSequence +from collections.abc import Iterable, Mapping, MutableSequence from typing import (TYPE_CHECKING, ClassVar, Literal, Optional, Protocol, Union, overload, runtime_checkable) import numpy as np import torch from torch import Tensor +from transformers.models.whisper.tokenization_whisper import LANGUAGES from typing_extensions import Self, TypeIs from vllm.config import ModelConfig, SpeechToTextConfig @@ -685,6 +686,8 @@ def _find_quant_config(*args, **kwargs) -> Optional[QuantizationConfig]: @runtime_checkable class SupportsTranscription(Protocol): """The interface required for all models that support transcription.""" + # Mapping from ISO639_1 language codes: language names + supported_languages: ClassVar[Mapping[str, str]] supports_transcription: ClassVar[Literal[True]] = True @@ -694,11 +697,22 @@ class SupportsTranscription(Protocol): `True`. """ + def __init_subclass__(cls, **kwargs): + super().__init_subclass__(**kwargs) + # language codes in supported_languages + # that don't exist in the full language map + invalid = set(cls.supported_languages) - set(LANGUAGES.keys()) + if invalid: + raise ValueError( + f"{cls.__name__}.supported_languages contains invalid " + f"language codes: {sorted(invalid)}\n. " + f"Valid choices are: {sorted(LANGUAGES.keys())}") + @classmethod def get_generation_prompt(cls, audio: np.ndarray, stt_config: SpeechToTextConfig, - model_config: ModelConfig, language: str, - task_type: str, + model_config: ModelConfig, + language: Optional[str], task_type: str, request_prompt: str) -> PromptType: """Get the prompt for the ASR model. The model has control over the construction, as long as it @@ -706,9 +720,36 @@ def get_generation_prompt(cls, audio: np.ndarray, ... @classmethod - def validate_language(cls, language: str) -> bool: - """Check if the model supports a specific ISO639_1 language.""" - ... + def get_other_languages(cls) -> Mapping[str, str]: + # other possible language codes from the whisper map + return { + k: v + for k, v in LANGUAGES.items() if k not in cls.supported_languages + } + + @classmethod + def validate_language(cls, language: Optional[str]) -> Optional[str]: + """ + Ensure the language specified in the transcription request + is a valid ISO 639-1 language code. If the request language is + valid, but not natively supported by the model, trigger a + warning (but not an exception). 
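To make the intended behavior concrete, a small sketch (the model class here is hypothetical; the full code map comes from Whisper's `LANGUAGES`):

```python
from vllm.model_executor.models.interfaces import SupportsTranscription

class ToyASRModel(SupportsTranscription):
    # Hypothetical model: natively supports only English and French.
    supported_languages = {"en": "English", "fr": "French"}

ToyASRModel.validate_language("fr")  # returned unchanged
ToyASRModel.validate_language("de")  # in Whisper's map but not native: warns, still returned
ToyASRModel.validate_language("xx")  # unknown code: raises ValueError
```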
+ """ + if language is None or language in cls.supported_languages: + return language + elif language in cls.get_other_languages(): + logger.warning( + "Language %r is not natively supported by %s; " + "results may be less accurate. Supported languages: %r", + language, + cls.__name__, + list(cls.supported_languages.keys()), + ) + return language + else: + raise ValueError( + f"Unsupported language: {language!r}. Must be one of " + f"{list(cls.supported_languages.keys())}.") @classmethod def get_speech_to_text_config( diff --git a/vllm/model_executor/models/voxtral.py b/vllm/model_executor/models/voxtral.py index 97cab628317..6b06c0ac668 100644 --- a/vllm/model_executor/models/voxtral.py +++ b/vllm/model_executor/models/voxtral.py @@ -26,8 +26,7 @@ from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.models import SupportsPP # yapf: disable -from vllm.model_executor.models.whisper import ( - WhisperEncoder, WhisperForConditionalGeneration) +from vllm.model_executor.models.whisper import WhisperEncoder # yapf: enable from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.multimodal import MULTIMODAL_REGISTRY @@ -50,6 +49,18 @@ logger = init_logger(__name__) +ISO639_1_SUPPORTED_LANGS = { + "ar": "Arabic", + "nl": "Dutch", + "en": "English", + "fr": "French", + "de": "German", + "hi": "Hindi", + "it": "Italian", + "pt": "Portuguese", + "es": "Spanish", +} + class VoxtralProcessorAdapter: """ @@ -301,6 +312,7 @@ def _get_data_parser(self) -> MultiModalDataParser: dummy_inputs=VoxtralDummyInputsBuilder) class VoxtralForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsPP, SupportsTranscription): + supported_languages = ISO639_1_SUPPORTED_LANGS def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): super().__init__() @@ -441,8 +453,8 @@ def get_speech_to_text_config(cls, model_config: ModelConfig, # for speech-to-text transcription def get_generation_prompt(cls, audio: np.ndarray, model_config: ModelConfig, - stt_config: SpeechToTextConfig, language: str, - task_type: str, + stt_config: SpeechToTextConfig, + language: Optional[str], task_type: str, request_prompt: str) -> PromptType: tokenizer = cached_tokenizer_from_config(model_config) audio = Audio(audio, int(stt_config.sample_rate), @@ -457,11 +469,6 @@ def get_generation_prompt(cls, audio: np.ndarray, prompts_dict["prompt_token_ids"] = tokenized.tokens return cast(PromptType, prompts_dict) - @classmethod - def validate_language(cls, language: str) -> bool: - # same as whisper - return WhisperForConditionalGeneration.validate_language(language) - @classmethod def get_num_audio_tokens(cls, audio_duration_s: float, stt_config: SpeechToTextConfig, diff --git a/vllm/model_executor/models/whisper.py b/vllm/model_executor/models/whisper.py index d98dab5fac0..d7bafb9ef84 100644 --- a/vllm/model_executor/models/whisper.py +++ b/vllm/model_executor/models/whisper.py @@ -109,51 +109,6 @@ "vi": "Vietnamese", "cy": "Welsh" } -ISO639_1_OTHER_LANGS = { - "lo": "Lao", - "jw": "Javanese", - "tk": "Turkmen", - "yi": "Yiddish", - "so": "Somali", - "bn": "Bengali", - "nn": "Norwegian Nynorsk", - "si": "Sinhala", - "yo": "Yoruba", - "sa": "Sanskrit", - "mi": "Māori", - "fo": "Faroese", # codespell:ignore - "mt": "Maltese", - "tg": "Tajik", - "mg": "Malagasy", - "haw": "Hawaiian", - "km": "Khmer", - "br": "Breton", - "ps": "Pashto", - "ln": "Lingala", - "la": "Latin", - "ml": "Malayalam", - "sq": "Albanian", - "su": "Sundanese", - "eu": "Basque", - "ka": "Georgian", - 
"uz": "Uzbek", - "sn": "Shona", - "ht": "Haitian", - "as": "Assamese", - "mn": "Mongolian", - "te": "Telugu", - "pa": "Panjabi", - "tt": "Tatar", - "gu": "Gujarati", - "oc": "Occitan", - "ha": "Hausa", - "ba": "Bashkir", - "my": "Burmese", - "sd": "Sindhi", - "am": "Amharic", - "lb": "Luxembourgish", - "bo": "Tibetan" -} class WhisperAudioInputs(TypedDict): @@ -807,22 +762,20 @@ class WhisperForConditionalGeneration(nn.Module, SupportsTranscription, # Whisper only supports audio-conditioned generation. supports_transcription_only = True + supported_languages = ISO639_1_SUPPORTED_LANGS @classmethod - def validate_language(cls, language: str) -> bool: - if language in ISO639_1_SUPPORTED_LANGS: - return True - elif language in ISO639_1_OTHER_LANGS: + def validate_language(cls, language: Optional[str]) -> Optional[str]: + if language is None: + # TODO language should be optional and can be guessed. + # For now we default to en. See + # https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/generation_whisper.py#L1520 logger.warning( - "The selected language %s has limited accuracy with" - " reported WER>=0.5. Results may be less accurate " - "for this choice.", language) - return True - else: - raise ValueError(f"Unsupported language: {language}." - "Language should be one of:" + - f" {list(ISO639_1_SUPPORTED_LANGS.values())}" + - f"or {list(ISO639_1_OTHER_LANGS.values())}") + "Defaulting to language='en'. If you wish to transcribe " + "audio in a different language, pass the `language` field " + "in the TranscriptionRequest.") + language = "en" + return super().validate_language(language) @classmethod def get_generation_prompt( @@ -830,9 +783,12 @@ def get_generation_prompt( audio: np.ndarray, model_config: ModelConfig, # not needed here stt_config: SpeechToTextConfig, - language: str, + language: Optional[str], task_type: str, request_prompt: str) -> PromptType: + if language is None: + raise ValueError( + "Language must be specified when creating the Whisper prompt") prompt = { "encoder_prompt": { # Whisper does not support encoder prompt. From 474f25dcd2cff5376f4114e75de6f6074ba2e891 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 30 Jul 2025 23:40:34 -0400 Subject: [PATCH 540/552] [UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA (#21966) Signed-off-by: mgoin Signed-off-by: x22x22 --- vllm/engine/arg_utils.py | 2 +- vllm/platforms/cuda.py | 10 +++++----- vllm/platforms/interface.py | 2 +- vllm/v1/attention/backends/mla/cutlass_mla.py | 2 +- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py index ababa49a53a..c36c79c6931 100644 --- a/vllm/engine/arg_utils.py +++ b/vllm/engine/arg_utils.py @@ -1417,7 +1417,7 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: "PALLAS_VLLM_V1", "TRITON_ATTN_VLLM_V1", "TRITON_MLA", - "CUTLASS_MLA_VLLM_V1", + "CUTLASS_MLA", "FLASHMLA", "FLASHINFER", "FLASHINFER_VLLM_V1", diff --git a/vllm/platforms/cuda.py b/vllm/platforms/cuda.py index c35d22c1d68..87ff6b38580 100644 --- a/vllm/platforms/cuda.py +++ b/vllm/platforms/cuda.py @@ -162,7 +162,7 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None: if cls.is_device_capability(100): # Blackwell => Force CutlassMLA. 
use_cutlass_mla = True - envs.VLLM_ATTENTION_BACKEND = "CUTLASS_MLA_VLLM_V1" + envs.VLLM_ATTENTION_BACKEND = "CUTLASS_MLA" else: # Not Blackwell use_flashmla = True @@ -170,7 +170,7 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None: # Forced case use_flashmla = (envs.VLLM_ATTENTION_BACKEND == "FLASHMLA") use_cutlass_mla = ( - envs.VLLM_ATTENTION_BACKEND == "CUTLASS_MLA_VLLM_V1") + envs.VLLM_ATTENTION_BACKEND == "CUTLASS_MLA") from vllm.attention.ops.flashmla import is_flashmla_supported if use_flashmla and is_flashmla_supported()[0] \ @@ -182,7 +182,7 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None: if use_cutlass_mla and cache_config.block_size != 128: cache_config.block_size = 128 logger.info("Forcing kv cache block size to 128 for " - "CUTLASS_MLA_VLLM_V1 backend.") + "CUTLASS_MLA backend.") compilation_config = vllm_config.compilation_config if (envs.VLLM_ALL2ALL_BACKEND == "deepep_high_throughput" @@ -211,9 +211,9 @@ def get_attn_backend_cls(cls, selected_backend, head_size, dtype, kv_cache_dtype, block_size, use_v1, use_mla) -> str: if use_mla: - # TODO(lucas): refactor to be more concise + # TODO(lucas): refactor to be more concise # we should probably consider factoring out V1 here - if selected_backend == _Backend.CUTLASS_MLA_VLLM_V1: + if selected_backend == _Backend.CUTLASS_MLA: if use_v1: logger.info_once("Using Cutlass MLA backend on V1 engine.") return ("vllm.v1.attention.backends.mla." diff --git a/vllm/platforms/interface.py b/vllm/platforms/interface.py index 02cc392244b..6bae0fe25c7 100644 --- a/vllm/platforms/interface.py +++ b/vllm/platforms/interface.py @@ -53,7 +53,7 @@ class _Backend(enum.Enum): TRITON_MLA_VLLM_V1 = enum.auto() FLASHMLA_VLLM_V1 = enum.auto() FLASHMLA = enum.auto() # Supported by V1 - CUTLASS_MLA_VLLM_V1 = enum.auto() + CUTLASS_MLA = enum.auto() PALLAS = enum.auto() PALLAS_VLLM_V1 = enum.auto() IPEX = enum.auto() diff --git a/vllm/v1/attention/backends/mla/cutlass_mla.py b/vllm/v1/attention/backends/mla/cutlass_mla.py index c787f25cd3a..b23a8f0a5e8 100644 --- a/vllm/v1/attention/backends/mla/cutlass_mla.py +++ b/vllm/v1/attention/backends/mla/cutlass_mla.py @@ -21,7 +21,7 @@ class CutlassMLABackend(MLACommonBackend): @staticmethod def get_name() -> str: - return "CUTLASS_MLA_VLLM_V1" + return "CUTLASS_MLA" @staticmethod def get_impl_cls() -> type["CutlassMLAImpl"]: From 07be3b96d8fab871f0e11bb1eddfd08fb45a6ffe Mon Sep 17 00:00:00 2001 From: Jee Jee Li Date: Thu, 31 Jul 2025 11:41:12 +0800 Subject: [PATCH 541/552] [Misc] Expand SUPPORTED_HIDDEN_SIZES for DeepEP low-latency kernels (#21818) Signed-off-by: Jee Jee Li Signed-off-by: x22x22 --- .../layers/fused_moe/deepep_ll_prepare_finalize.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py b/vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py index 57871ca250a..cfc2bdcf024 100644 --- a/vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py +++ b/vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py @@ -40,7 +40,7 @@ class DeepEPLLPrepareAndFinalize(mk.FusedMoEPrepareAndFinalize): # DeepEP low-latency kernels are compiled only for certain # specific hidden sizes. 
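Because these kernels are specialized per hidden size, a deployment-side guard can fail fast for unsupported models; a hypothetical sketch (the helper name is illustrative, not part of vLLM):

```python
# Mirrors the constraint expressed by SUPPORTED_HIDDEN_SIZES below.
SUPPORTED_HIDDEN_SIZES = [2048, 2560, 4096, 5120, 6144, 7168]

def check_deepep_ll_hidden_size(hidden_size: int) -> None:
    if hidden_size not in SUPPORTED_HIDDEN_SIZES:
        raise ValueError(
            f"DeepEP low-latency kernels are not compiled for "
            f"hidden_size={hidden_size}; supported: {SUPPORTED_HIDDEN_SIZES}")
```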
- SUPPORTED_HIDDEN_SIZES = [2048, 2560, 4096, 5120, 7168] + SUPPORTED_HIDDEN_SIZES = [2048, 2560, 4096, 5120, 6144, 7168] def __init__(self, buffer: deep_ep.Buffer, From 52f1a7e61ab0445d49f00b325f8a2375bcd72978 Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Wed, 30 Jul 2025 23:45:29 -0400 Subject: [PATCH 542/552] [CI Bugfix] Fix CI OOM for `test_shared_storage_connector_hashes` (#21973) Signed-off-by: mgoin Signed-off-by: x22x22 --- tests/v1/kv_connector/unit/test_shared_storage_connector.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tests/v1/kv_connector/unit/test_shared_storage_connector.py b/tests/v1/kv_connector/unit/test_shared_storage_connector.py index ee3e71d3b84..11b7e378441 100644 --- a/tests/v1/kv_connector/unit/test_shared_storage_connector.py +++ b/tests/v1/kv_connector/unit/test_shared_storage_connector.py @@ -10,7 +10,7 @@ from vllm.config import KVTransferConfig from vllm.multimodal.utils import encode_image_base64 -MODEL_NAME = "Qwen/Qwen2.5-VL-3B-Instruct" +MODEL_NAME = "RedHatAI/Qwen2.5-VL-3B-Instruct-quantized.w4a16" SAMPLING_PARAMS = SamplingParams(temperature=0.0, top_k=1, max_tokens=128) @@ -130,6 +130,8 @@ def test_shared_storage_connector_hashes(tmp_path): model=MODEL_NAME, max_model_len=8192, max_num_seqs=1, + gpu_memory_utilization=0.4, + enforce_eager=True, kv_transfer_config=kv_transfer_config, limit_mm_per_prompt={"image": 2}, ) From 3eee204de04379fc70b80eb2166c3db408452e2a Mon Sep 17 00:00:00 2001 From: Ning Xie Date: Thu, 31 Jul 2025 14:22:11 +0800 Subject: [PATCH 543/552] [Bugfix]: fix metadata file copy in test_sharded_state_loader (#21830) Signed-off-by: Andy Xie Signed-off-by: x22x22 --- tests/test_sharded_state_loader.py | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/tests/test_sharded_state_loader.py b/tests/test_sharded_state_loader.py index 64706defb59..1bb4203d21c 100644 --- a/tests/test_sharded_state_loader.py +++ b/tests/test_sharded_state_loader.py @@ -1,6 +1,7 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import fnmatch import multiprocessing as mp import os import shutil @@ -64,9 +65,10 @@ def _run_writer(input_dir, output_dir, weights_patterns, **kwargs): # Copy metadata files to output directory for file in os.listdir(input_dir): if os.path.isdir(os.path.join(input_dir, file)): - continue - if not any(file.endswith(ext) for ext in weights_patterns): - shutil.copy(f"{input_dir}/{file}", output_dir) + shutil.copytree(os.path.join(input_dir, file), + os.path.join(output_dir, file)) + elif not any(fnmatch.fnmatch(file, ext) for ext in weights_patterns): + shutil.copy(os.path.join(input_dir, file), output_dir) def _run_generate(input_dir, queue: mp.Queue, **kwargs): From 6967d0e84c5e43f2656d21a18901552866ef9dde Mon Sep 17 00:00:00 2001 From: Cyrus Leung Date: Thu, 31 Jul 2025 14:46:38 +0800 Subject: [PATCH 544/552] [Deprecation] Remove deprecated args and methods (#21907) Signed-off-by: DarkLight1337 Signed-off-by: x22x22 --- vllm/entrypoints/chat_utils.py | 32 ++++-------------------------- vllm/multimodal/registry.py | 25 ----------------------- vllm/worker/neuron_model_runner.py | 7 +------ 3 files changed, 5 insertions(+), 59 deletions(-) diff --git a/vllm/entrypoints/chat_utils.py b/vllm/entrypoints/chat_utils.py index a6602391d40..6485ed6b148 100644 --- a/vllm/entrypoints/chat_utils.py +++ b/vllm/entrypoints/chat_utils.py @@ -48,7 +48,7 @@ # yapf: enable from vllm.transformers_utils.processor import cached_get_processor 
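Referring back to the sharded-state-loader fix above: the weight patterns are glob-style (hence the switch to `fnmatch`), so a plain suffix check never matches them; a small sketch of the difference:

```python
import fnmatch

# Hypothetical pattern list of the kind passed as weights_patterns.
patterns = ["*.safetensors", "*.bin"]
name = "model-00001-of-00002.safetensors"

any(name.endswith(p) for p in patterns)          # False: '*' is taken literally
any(fnmatch.fnmatch(name, p) for p in patterns)  # True: the glob is expanded
```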
from vllm.transformers_utils.tokenizer import AnyTokenizer, MistralTokenizer -from vllm.utils import deprecate_kwargs, random_uuid +from vllm.utils import random_uuid logger = init_logger(__name__) @@ -383,17 +383,12 @@ def resolve_mistral_chat_template( return None -@deprecate_kwargs( - "trust_remote_code", - additional_message="Please use `model_config.trust_remote_code` instead.", -) def resolve_hf_chat_template( tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], chat_template: Optional[str], tools: Optional[list[dict[str, Any]]], *, model_config: ModelConfig, - trust_remote_code: Optional[bool] = None, ) -> Optional[str]: # 1st priority: The given chat template if chat_template is not None: @@ -488,10 +483,6 @@ def _log_chat_template_content_format( ) -@deprecate_kwargs( - "trust_remote_code", - additional_message="Please use `model_config.trust_remote_code` instead.", -) def resolve_chat_template_content_format( chat_template: Optional[str], tools: Optional[list[dict[str, Any]]], @@ -499,7 +490,6 @@ def resolve_chat_template_content_format( tokenizer: AnyTokenizer, *, model_config: ModelConfig, - trust_remote_code: Optional[bool] = None, ) -> _ChatTemplateContentFormat: if given_format != "auto": return given_format @@ -568,17 +558,9 @@ def add(self, modality: ModalityStr, item: _T) -> Optional[str]: input_modality = modality.replace("_embeds", "") - if mm_registry.has_processor(model_config): - mm_processor = mm_registry.create_processor(model_config) - allowed_counts = mm_processor.info.get_allowed_mm_limits() - allowed_count = allowed_counts.get(input_modality, 0) - else: - mm_config = model_config.multimodal_config - if mm_config is None: - msg = "This model does not support multi-modal inputs" - raise ValueError(msg) - - allowed_count = mm_config.get_limit_per_prompt(input_modality) + mm_processor = mm_registry.create_processor(model_config) + allowed_counts = mm_processor.info.get_allowed_mm_limits() + allowed_count = allowed_counts.get(input_modality, 0) current_count = len(self._items_by_modality[modality]) + 1 if current_count > allowed_count: @@ -1285,10 +1267,6 @@ def parse_chat_messages_futures( return conversation, mm_tracker.all_mm_data() -@deprecate_kwargs( - "trust_remote_code", - additional_message="Please use `model_config.trust_remote_code` instead.", -) def apply_hf_chat_template( tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], conversation: list[ConversationMessage], @@ -1297,8 +1275,6 @@ def apply_hf_chat_template( *, model_config: ModelConfig, tokenize: bool = False, # Different from HF's default - # Deprecated, explicitly capture here so it doesn't slit into kwargs. - trust_remote_code: Optional[bool] = None, **kwargs: Any, ) -> str: hf_chat_template = resolve_hf_chat_template( diff --git a/vllm/multimodal/registry.py b/vllm/multimodal/registry.py index bfa391829d2..5f5b620e0cf 100644 --- a/vllm/multimodal/registry.py +++ b/vllm/multimodal/registry.py @@ -5,7 +5,6 @@ from typing import TYPE_CHECKING, Generic, Optional, Protocol, TypeVar import torch.nn as nn -from typing_extensions import deprecated from vllm.envs import VLLM_MM_INPUT_CACHE_GIB from vllm.inputs import InputProcessingContext @@ -105,13 +104,6 @@ def reset_processor_cache(self) -> bool: return True # Success - @deprecated("Legacy input processor/mapper pipeline has been removed. 
" - "Please update your model runner to use " - "`seq_group_metadata.multi_modal_data` directly without " - "further processing.") - def create_input_mapper(self, model_config: "ModelConfig"): - return lambda data, mm_processor_kwargs: data - def get_max_tokens_per_item_by_modality( self, model_config: "ModelConfig", @@ -182,16 +174,6 @@ def get_max_multimodal_tokens(self, model_config: "ModelConfig") -> int: """ return sum(self.get_max_tokens_by_modality(model_config).values()) - @deprecated("Legacy input processor/mapper pipeline has been removed. " - "Please update your model runner to use " - "`seq_group_metadata.multi_modal_data` directly without " - "further processing.") - def init_mm_limits_per_prompt( - self, - model_config: "ModelConfig", - ) -> None: - pass - def get_mm_limits_per_prompt( self, model_config: "ModelConfig", @@ -246,13 +228,6 @@ def _get_model_cls(self, model_config: "ModelConfig"): model_cls, _ = get_model_architecture(model_config) return model_cls - @deprecated("Legacy input processor/mapper pipeline has been removed. " - "Please update your model runner to use " - "`seq_group_metadata.multi_modal_data` directly without " - "further processing.") - def has_processor(self, model_config: "ModelConfig") -> bool: - return True - def create_processor( self, model_config: "ModelConfig", diff --git a/vllm/worker/neuron_model_runner.py b/vllm/worker/neuron_model_runner.py index 7ccf1a2c0a8..8317b9abff0 100644 --- a/vllm/worker/neuron_model_runner.py +++ b/vllm/worker/neuron_model_runner.py @@ -15,8 +15,7 @@ from vllm.model_executor import SamplingMetadata from vllm.model_executor.layers.sampler import SamplerOutput from vllm.model_executor.model_loader.neuron import get_neuron_model -from vllm.multimodal import (MULTIMODAL_REGISTRY, BatchedTensorInputs, - MultiModalKwargs) +from vllm.multimodal import BatchedTensorInputs, MultiModalKwargs from vllm.platforms import current_platform from vllm.sampling_params import SamplingParams from vllm.sequence import IntermediateTensors, SequenceGroupMetadata @@ -88,10 +87,6 @@ def __init__( self.device = self.device_config.device self.pin_memory = is_pin_memory_available() - # Multi-modal data support - self.multi_modal_input_mapper = MULTIMODAL_REGISTRY \ - .create_input_mapper(self.model_config) - # Lazy initialization. self.model: nn.Module # initialize after load_model. From f97ff9be65af3989758eecc38be39f4e4c73e5c1 Mon Sep 17 00:00:00 2001 From: Daniele <36171005+dtrifiro@users.noreply.github.com> Date: Thu, 31 Jul 2025 09:00:08 +0200 Subject: [PATCH 545/552] [CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES (#21599) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Daniele Trifirò Signed-off-by: x22x22 --- .buildkite/scripts/hardware_ci/run-gh200-test.sh | 3 +-- .github/workflows/scripts/build.sh | 1 - docker/Dockerfile | 3 --- docker/Dockerfile.nightly_torch | 3 --- docs/deployment/docker.md | 3 +-- 5 files changed, 2 insertions(+), 11 deletions(-) diff --git a/.buildkite/scripts/hardware_ci/run-gh200-test.sh b/.buildkite/scripts/hardware_ci/run-gh200-test.sh index 8c64e14606d..f69e4b06680 100644 --- a/.buildkite/scripts/hardware_ci/run-gh200-test.sh +++ b/.buildkite/scripts/hardware_ci/run-gh200-test.sh @@ -16,8 +16,7 @@ DOCKER_BUILDKIT=1 docker build . 
\ --build-arg max_jobs=66 \ --build-arg nvcc_threads=2 \ --build-arg RUN_WHEEL_CHECK=false \ - --build-arg torch_cuda_arch_list="9.0+PTX" \ - --build-arg vllm_fa_cmake_gpu_arches="90-real" + --build-arg torch_cuda_arch_list="9.0+PTX" # Setup cleanup remove_docker_container() { docker rm -f gh200-test || true; } diff --git a/.github/workflows/scripts/build.sh b/.github/workflows/scripts/build.sh index 0f010832b46..c69ebbb42da 100644 --- a/.github/workflows/scripts/build.sh +++ b/.github/workflows/scripts/build.sh @@ -15,7 +15,6 @@ $python_executable -m pip install -r requirements/build.txt -r requirements/cuda export MAX_JOBS=1 # Make sure release wheels are built for the following architectures export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX" -export VLLM_FA_CMAKE_GPU_ARCHES="80-real;90-real" bash tools/check_repo.sh diff --git a/docker/Dockerfile b/docker/Dockerfile index 75b5ab0230c..43522ef8fb8 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -164,9 +164,6 @@ RUN --mount=type=cache,target=/root/.cache/uv \ # see https://github.com/pytorch/pytorch/pull/123243 ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list} -# Override the arch list for flash-attn to reduce the binary size -ARG vllm_fa_cmake_gpu_arches='80-real;90-real' -ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches} #################### BASE BUILD IMAGE #################### #################### WHEEL BUILD IMAGE #################### diff --git a/docker/Dockerfile.nightly_torch b/docker/Dockerfile.nightly_torch index 8d43de77aad..e147b97f0e0 100644 --- a/docker/Dockerfile.nightly_torch +++ b/docker/Dockerfile.nightly_torch @@ -114,9 +114,6 @@ RUN cat torch_build_versions.txt # explicitly set the list to avoid issues with torch 2.2 # see https://github.com/pytorch/pytorch/pull/123243 -# Override the arch list for flash-attn to reduce the binary size -ARG vllm_fa_cmake_gpu_arches='80-real;90-real' -ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches} #################### BASE BUILD IMAGE #################### #################### WHEEL BUILD IMAGE #################### diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md index 5f6cfcb00a3..1f19f2fecfa 100644 --- a/docs/deployment/docker.md +++ b/docs/deployment/docker.md @@ -106,8 +106,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- -t vllm/vllm-gh200-openai:latest \ --build-arg max_jobs=66 \ --build-arg nvcc_threads=2 \ - --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \ - --build-arg vllm_fa_cmake_gpu_arches="90-real" + --build-arg torch_cuda_arch_list="9.0 10.0+PTX" ``` !!! note From b28069d1ecd7766b53633131bdd2b1df238f3848 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Thu, 31 Jul 2025 15:27:48 +0800 Subject: [PATCH 546/552] merge mian to feat/support-long-text-embedding Signed-off-by: x22x22 --- diff_config.py.txt | 40 ++ diff_serving_embedding.py.txt | 763 ++++++++++++++++++++++++++++++++++ requirements/test.txt | 1 - 3 files changed, 803 insertions(+), 1 deletion(-) create mode 100644 diff_config.py.txt create mode 100644 diff_serving_embedding.py.txt diff --git a/diff_config.py.txt b/diff_config.py.txt new file mode 100644 index 00000000000..81c9b072b88 --- /dev/null +++ b/diff_config.py.txt @@ -0,0 +1,40 @@ +diff --git a/vllm/config.py b/vllm/config.py +index a330bafb7..7c8ed575f 100644 +--- a/vllm/config.py ++++ b/vllm/config.py +@@ -3369,6 +3369,35 @@ class PoolerConfig: + ``math-shepherd-mistral-7b-prm`` model. 
+ """ + ++ enable_chunked_processing: Optional[bool] = None ++ """ ++ Whether to enable chunked processing for long inputs that exceed the model's ++ maximum position embeddings. When enabled, long inputs will be split into ++ chunks, processed separately, and then aggregated using weighted averaging. ++ This allows embedding models to handle arbitrarily long text without CUDA ++ errors. Defaults to False. ++ """ ++ ++ max_embed_len: Optional[int] = None ++ """ ++ Maximum input length allowed for embedding generation. When set, allows ++ inputs longer than max_model_len to be accepted for embedding models. ++ This parameter enables accepting long inputs without requiring ++ VLLM_ALLOW_LONG_MAX_MODEL_LEN environment variable. When an input exceeds ++ max_embed_len, it will be handled according to the original max_model_len ++ validation logic. Defaults to None (use max_model_len validation). ++ """ ++ ++ allow_non_mean_chunking: Optional[bool] = None ++ """ ++ Whether to allow chunked processing for non-MEAN pooling types without ++ warnings. By default (None or False), a warning will be shown when using ++ chunked processing with pooling types other than MEAN, as they may produce ++ different results than non-chunked processing. Set to True to explicitly ++ allow and suppress warnings for non-MEAN pooling types. Only applies when ++ enable_chunked_processing is True. ++ """ ++ + def compute_hash(self) -> str: + """ + WARNING: Whenever a new field is added to this config, diff --git a/diff_serving_embedding.py.txt b/diff_serving_embedding.py.txt new file mode 100644 index 00000000000..1b1c98f8627 --- /dev/null +++ b/diff_serving_embedding.py.txt @@ -0,0 +1,763 @@ +diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py +index 84ba00873..49a53cf6c 100644 +--- a/vllm/entrypoints/openai/serving_embedding.py ++++ b/vllm/entrypoints/openai/serving_embedding.py +@@ -2,9 +2,11 @@ + # SPDX-FileCopyrightText: Copyright contributors to the vLLM project + + import base64 +-from typing import Final, Literal, Optional, Union, cast ++from collections.abc import AsyncGenerator ++from typing import Any, Final, Literal, Optional, Union, cast + + import numpy as np ++import torch + from fastapi import Request + from typing_extensions import assert_never, override + +@@ -12,18 +14,25 @@ from vllm.config import ModelConfig + from vllm.engine.protocol import EngineClient + from vllm.entrypoints.chat_utils import ChatTemplateContentFormatOption + from vllm.entrypoints.logger import RequestLogger ++# yapf conflicts with isort for this docstring ++# yapf: disable + from vllm.entrypoints.openai.protocol import (EmbeddingChatRequest, ++ EmbeddingCompletionRequest, + EmbeddingRequest, + EmbeddingResponse, + EmbeddingResponseData, + ErrorResponse, UsageInfo) + from vllm.entrypoints.openai.serving_engine import (EmbeddingServeContext, + OpenAIServing, +- ServeContext) ++ ServeContext, ++ TextTokensPrompt) ++# yapf: enable + from vllm.entrypoints.openai.serving_models import OpenAIServingModels ++from vllm.inputs.data import EmbedsPrompt as EngineEmbedsPrompt ++from vllm.inputs.data import TokensPrompt as EngineTokensPrompt + from vllm.logger import init_logger + from vllm.outputs import (EmbeddingOutput, EmbeddingRequestOutput, +- PoolingRequestOutput) ++ PoolingRequestOutput, RequestOutput) + from vllm.pooling_params import PoolingParams + + logger = init_logger(__name__) +@@ -129,6 +138,717 @@ class EmbeddingMixin(OpenAIServing): + usage=usage, + ) + ++ def 
_get_max_position_embeddings(self) -> int: ++ """Get the model's effective maximum sequence length for chunking. ++ ++ This uses the same logic as vLLM's _get_and_verify_max_len to determine ++ the actual sequence length limit, ++ considering both model config and tokenizer config. ++ When max_model_len is set and smaller than max_position_embeddings, ++ use max_model_len for chunking. ++ """ ++ hf_config = self.model_config.hf_config ++ ++ # Start with max_position_embeddings from model config ++ derived_max_len = getattr(hf_config, 'max_position_embeddings', 512) ++ ++ # Get tokenizer config for pooling models (embedding models) ++ if self.model_config.runner_type == "pooling": ++ from vllm.transformers_utils.config import try_get_tokenizer_config ++ tokenizer_config = try_get_tokenizer_config( ++ self.model_config.tokenizer, ++ trust_remote_code=self.model_config.trust_remote_code, ++ revision=self.model_config.tokenizer_revision) ++ ++ # Consider model_max_length in tokenizer_config ++ # (same logic as _get_and_verify_max_len) ++ if tokenizer_config: ++ tokenizer_model_max_length = tokenizer_config.get( ++ 'model_max_length', derived_max_len) ++ derived_max_len = min(derived_max_len, ++ tokenizer_model_max_length) ++ ++ # Consider max_model_len when it's set and smaller than other limits ++ # max_model_len is set in OpenAIServing.__init__ ++ # from model_config.max_model_len ++ if self.max_model_len is not None: ++ derived_max_len = min(derived_max_len, self.max_model_len) ++ ++ return int(derived_max_len) ++ ++ def _should_use_chunked_processing(self, request) -> bool: ++ """Check if chunked processing should be used for this request.""" ++ if not isinstance(request, ++ (EmbeddingChatRequest, EmbeddingCompletionRequest)): ++ return False ++ ++ pooler_config = getattr(self.model_config, 'pooler_config', None) ++ if not (pooler_config is not None and getattr( ++ pooler_config, 'enable_chunked_processing', False)): ++ return False ++ ++ # Check pooling type compatibility for chunked processing ++ pooling_type = getattr(pooler_config, 'pooling_type', None) ++ if pooling_type: ++ pooling_type_upper = pooling_type.upper() ++ ++ # For LAST and CLS pooling, chunked processing doesn't make ++ # semantic sense because only the last/first chunk ++ # contains the relevant token position ++ if pooling_type_upper in ['LAST', 'CLS']: ++ # Check if user explicitly allowed non-mean chunking ++ allow_non_mean = getattr(pooler_config, ++ 'allow_non_mean_chunking', False) ++ if not allow_non_mean: ++ logger.warning( ++ "Chunked processing with pooling type '%s' " ++ "is not recommended as it may produce semantically " ++ "incorrect results. %s pooling relies on specific " ++ "token positions that lose their meaning when the " ++ "sequence is chunked. Consider using MEAN pooling " ++ "or disable chunked processing. Set " ++ "'allow_non_mean_chunking: true' ", ++ "to override this warning.", pooling_type, ++ pooling_type_upper) ++ return False # Disable chunked processing by default ++ else: ++ logger.info( ++ "Using chunked processing with %s pooling " ++ "(explicitly enabled). 
Note: only the %s chunk " ++ "will be processed to avoid computational waste.", ++ pooling_type_upper, ++ "last" if pooling_type_upper == "LAST" else "first") ++ ++ # Warn about non-MEAN pooling types (for other pooling types) ++ elif pooling_type_upper != 'MEAN': ++ # Check if user explicitly allowed non-mean chunking ++ allow_non_mean = getattr(pooler_config, ++ 'allow_non_mean_chunking', False) ++ if not allow_non_mean: ++ logger.warning( ++ "Chunked processing with pooling type '%s' " ++ "may produce different results than non-chunked " ++ "processing due to limited attention scope within " ++ "chunks. Each token can only attend to tokens within " ++ "its chunk (similar to sliding window attention), " ++ "which changes token representations before pooling. " ++ "While MEAN pooling provides a reasonable " ++ "approximation through weighted averaging aggregation, " ++ "other pooling " ++ "types use different aggregation strategies that " ++ "further approximate the original behavior. Set " ++ "'allow_non_mean_chunking: true' in pooler config " ++ "to suppress this warning.", pooling_type) ++ # Still allow it but with warning ++ else: ++ logger.info( ++ "Using chunked processing with pooling type " ++ "'%s' (explicitly enabled)", pooling_type) ++ ++ return True ++ ++ def _chunk_token_ids(self, token_ids: list[int], ++ chunk_size: int) -> list[list[int]]: ++ """Split token IDs into chunks of specified size.""" ++ if len(token_ids) <= chunk_size: ++ return [token_ids] ++ ++ chunks = [] ++ for i in range(0, len(token_ids), chunk_size): ++ chunk = token_ids[i:i + chunk_size] ++ chunks.append(chunk) ++ return chunks ++ ++ async def _process_chunked_request( ++ self, ++ ctx: EmbeddingServeContext, ++ original_prompt: TextTokensPrompt, ++ pooling_params, ++ trace_headers, ++ prompt_idx: int, ++ ) -> list[AsyncGenerator[PoolingRequestOutput, None]]: ++ """Process a single prompt using chunked processing.""" ++ generators: list[AsyncGenerator[PoolingRequestOutput, None]] = [] ++ token_ids = original_prompt["prompt_token_ids"] ++ ++ # Split into chunks using max_position_embeddings ++ max_pos_embeddings = self._get_max_position_embeddings() ++ chunks = self._chunk_token_ids(token_ids, max_pos_embeddings) ++ ++ # Check pooling type to optimize chunk processing ++ pooler_config = getattr(self.model_config, 'pooler_config', None) ++ pooling_type = getattr(pooler_config, 'pooling_type', 'MEAN') ++ if pooling_type: ++ pooling_type = pooling_type.upper() ++ ++ # For LAST pooling, only process the last chunk ++ # For CLS pooling, only process the first chunk ++ if pooling_type == 'LAST': ++ chunks_to_process = [chunks[-1]] ++ chunk_indices = [len(chunks) - 1] ++ logger.info("LAST pooling: processing only the last chunk") ++ elif pooling_type == 'CLS': ++ chunks_to_process = [chunks[0]] ++ chunk_indices = [0] ++ logger.info("CLS pooling: processing only the first chunk") ++ else: ++ # For MEAN and other pooling types, process all chunks ++ chunks_to_process = chunks ++ chunk_indices = list(range(len(chunks))) ++ logger.info("Using chunked processing for MEAN pooling") ++ ++ for i, (chunk_idx, chunk_tokens) in enumerate( ++ zip(chunk_indices, chunks_to_process)): ++ # Create a request ID for this chunk ++ chunk_request_id = (f"{ctx.request_id}-prompt-{prompt_idx}-" ++ f"chunk-{chunk_idx}") ++ ++ # Create engine prompt for this chunk ++ chunk_engine_prompt = EngineTokensPrompt( ++ prompt_token_ids=chunk_tokens) ++ ++ # Create chunk request prompt for logging ++ chunk_text = "" ++ chunk_request_prompt = 
TextTokensPrompt( ++ prompt=chunk_text, prompt_token_ids=chunk_tokens) ++ ++ # Log the chunk ++ self._log_inputs(chunk_request_id, ++ chunk_request_prompt, ++ params=pooling_params, ++ lora_request=ctx.lora_request) ++ ++ # Create generator for this chunk ++ generator = self.engine_client.encode( ++ chunk_engine_prompt, ++ pooling_params, ++ chunk_request_id, ++ lora_request=ctx.lora_request, ++ trace_headers=trace_headers, ++ priority=getattr(ctx.request, "priority", 0), ++ ) ++ ++ generators.append(generator) ++ ++ return generators ++ ++ def _validate_input( ++ self, ++ request, ++ input_ids: list[int], ++ input_text: str, ++ ) -> TextTokensPrompt: ++ """Override to support chunked processing for embedding requests.""" ++ token_num = len(input_ids) ++ ++ # Note: EmbeddingRequest doesn't have max_tokens ++ if isinstance(request, ++ (EmbeddingChatRequest, EmbeddingCompletionRequest)): ++ # Check if chunked processing is enabled for pooling models ++ pooler_config = getattr(self.model_config, 'pooler_config', None) ++ enable_chunked = (pooler_config is not None and getattr( ++ pooler_config, 'enable_chunked_processing', False)) ++ ++ # Get max_embed_len from pooler config if set ++ max_embed_len = (pooler_config.max_embed_len if pooler_config ++ and pooler_config.max_embed_len else None) ++ ++ # Use max_position_embeddings for chunked processing decisions ++ max_pos_embeddings = self._get_max_position_embeddings() ++ ++ # Determine the effective max length for validation ++ if max_embed_len is not None: ++ # Use max_embed_len for validation instead of max_model_len ++ effective_max_len = max_embed_len ++ length_type = "maximum embedding input length" ++ max_length_value = max_embed_len ++ else: ++ # Fall back to max_model_len validation (original behavior) ++ effective_max_len = self.max_model_len ++ length_type = "maximum context length" ++ max_length_value = self.max_model_len ++ ++ validation_error_msg = ( ++ "This model's {length_type} is {max_length} tokens. " ++ "However, you requested {token_num} tokens in the input for " ++ "embedding generation. Please reduce the length of the input." ++ ).format(length_type=length_type, ++ max_length=max_length_value, ++ token_num=token_num) ++ ++ # Check if input exceeds effective max length ++ if token_num > effective_max_len: ++ raise ValueError(validation_error_msg) ++ ++ # Check for chunked processing ++ # when exceeding max_position_embeddings ++ if token_num > max_pos_embeddings: ++ if enable_chunked: ++ # Allow long inputs when chunked processing is enabled ++ logger.info( ++ "Input length %s exceeds max_position_embeddings " ++ "%s, will use chunked processing", token_num, ++ max_pos_embeddings) ++ else: ++ raise ValueError( ++ f"This model's maximum position embeddings length is " ++ f"{max_pos_embeddings} tokens. However, you requested " ++ f"{token_num} tokens in the input for embedding " ++ f"generation. 
Please reduce the length of the input or " ++ f"enable chunked processing.") ++ ++ return TextTokensPrompt(prompt=input_text, ++ prompt_token_ids=input_ids) ++ ++ # For other request types, use the parent's implementation ++ return super()._validate_input(request, input_ids, input_text) ++ ++ def _is_text_tokens_prompt(self, prompt) -> bool: ++ """Check if a prompt is a TextTokensPrompt (has prompt_token_ids).""" ++ return (isinstance(prompt, dict) and "prompt_token_ids" in prompt ++ and "prompt_embeds" not in prompt) ++ ++ async def _prepare_generators( ++ self, ++ ctx: ServeContext, ++ ) -> Optional[ErrorResponse]: ++ """Override to support chunked processing.""" ++ ctx = cast(EmbeddingServeContext, ctx) ++ generators: list[AsyncGenerator[Union[RequestOutput, ++ PoolingRequestOutput], ++ None]] = [] ++ ++ try: ++ trace_headers = (None if ctx.raw_request is None else await ++ self._get_trace_headers(ctx.raw_request.headers)) ++ ++ if not hasattr(ctx.request, "to_pooling_params"): ++ return self.create_error_response( ++ "Request type does not support pooling parameters") ++ ++ pooling_params = ctx.request.to_pooling_params() ++ ++ # Verify and set the task for pooling params ++ try: ++ pooling_params.verify("embed", self.model_config) ++ except ValueError as e: ++ return self.create_error_response(str(e)) ++ ++ if ctx.engine_prompts is None: ++ return self.create_error_response( ++ "Engine prompts not available") ++ ++ if ctx.request_prompts is None: ++ return self.create_error_response( ++ "Request prompts not available") ++ ++ # Check if we should use chunked processing ++ use_chunked = self._should_use_chunked_processing(ctx.request) ++ ++ for i, engine_prompt in enumerate(ctx.engine_prompts): ++ request_prompt = ctx.request_prompts[i] ++ ++ # Check if this specific prompt needs chunked processing ++ max_pos_embeddings = self._get_max_position_embeddings() ++ if (use_chunked ++ and self._is_text_tokens_prompt(request_prompt)): ++ # Cast to TextTokensPrompt since we've ++ # verified prompt_token_ids ++ text_tokens_prompt = cast(TextTokensPrompt, request_prompt) ++ if len(text_tokens_prompt["prompt_token_ids"] ++ ) > max_pos_embeddings: ++ # Use chunked processing for this prompt ++ chunk_generators = await self._process_chunked_request( ++ ctx, text_tokens_prompt, pooling_params, ++ trace_headers, i) ++ generators.extend(chunk_generators) ++ continue ++ ++ # Normal processing for short prompts or non-token prompts ++ request_id_item = f"{ctx.request_id}-{i}" ++ ++ self._log_inputs(request_id_item, ++ request_prompt, ++ params=pooling_params, ++ lora_request=ctx.lora_request) ++ ++ # Mypy has an existing bug related to inferring the variance ++ # of TypedDicts with `builtins.enumerate`: ++ # https://github.com/python/mypy/issues/8586#issuecomment-2867698435 ++ engine_prompt = cast( ++ Union[EngineTokensPrompt, EngineEmbedsPrompt], ++ engine_prompt) ++ generator = self.engine_client.encode( ++ engine_prompt, ++ pooling_params, ++ request_id_item, ++ lora_request=ctx.lora_request, ++ trace_headers=trace_headers, ++ priority=getattr(ctx.request, "priority", 0), ++ ) ++ ++ generators.append(generator) ++ ++ from vllm.utils import merge_async_iterators ++ ctx.result_generator = merge_async_iterators(*generators) ++ ++ return None ++ ++ except Exception as e: ++ # TODO: Use a vllm-specific Validation Error ++ return self.create_error_response(str(e)) ++ ++ async def _collect_batch( ++ self, ++ ctx: ServeContext, ++ ) -> Optional[ErrorResponse]: ++ """Collect and aggregate batch 
results ++ with support for chunked processing. ++ ++ For chunked requests, performs online aggregation to ++ minimize memory usage. ++ For regular requests, collects results normally. ++ """ ++ ctx = cast(EmbeddingServeContext, ctx) ++ try: ++ if ctx.engine_prompts is None: ++ return self.create_error_response( ++ "Engine prompts not available") ++ ++ if ctx.request_prompts is None: ++ return self.create_error_response( ++ "Request prompts not available") ++ ++ if ctx.result_generator is None: ++ return self.create_error_response( ++ "Result generator not available") ++ ++ # Check if we used chunked processing ++ use_chunked = self._should_use_chunked_processing(ctx.request) ++ ++ if use_chunked: ++ # Online aggregation for chunked requests to ++ # minimize memory usage ++ # Track aggregation state for each prompt ++ prompt_aggregators: dict[int, dict[str, Any]] = {} ++ short_prompts_results: dict[int, PoolingRequestOutput] = {} ++ ++ async for result_idx, result in ctx.result_generator: ++ if "-chunk-" in result.request_id: ++ # Extract prompt_idx from chunked request_id ++ parts = result.request_id.split("-") ++ try: ++ prompt_idx = int(parts[parts.index("prompt") + 1]) ++ ++ # Initialize aggregator for this prompt if needed ++ if prompt_idx not in prompt_aggregators: ++ # Get pooling type to determine ++ # aggregation strategy ++ pooler_config = getattr( ++ self.model_config, 'pooler_config', None) ++ pooling_type = getattr(pooler_config, ++ 'pooling_type', 'MEAN') ++ if pooling_type: ++ pooling_type = pooling_type.upper() ++ ++ prompt_aggregators[prompt_idx] = { ++ 'pooling_type': ++ pooling_type, ++ 'weighted_sum': ++ None, ++ 'total_weight': ++ 0, ++ 'first_result': ++ None, ++ 'last_result': ++ None, ++ 'chunk_count': ++ 0, ++ 'request_id': ++ result.request_id.split("-chunk-")[0] ++ } ++ ++ aggregator = prompt_aggregators[prompt_idx] ++ pooling_type = aggregator['pooling_type'] ++ ++ # Handle different pooling types with ++ # online aggregation ++ if pooling_type == 'MEAN': ++ # Online weighted averaging ++ # Ensure result is PoolingRequestOutput ++ # for embedding processing ++ if not isinstance(result, ++ PoolingRequestOutput): ++ return self.create_error_response( ++ f"Expected PoolingRequestOutput for " ++ f"chunked embedding, got " ++ f"{type(result).__name__}") ++ ++ embedding_data = result.outputs.data ++ if not isinstance(embedding_data, ++ torch.Tensor): ++ embedding_data = torch.tensor( ++ embedding_data, dtype=torch.float32) ++ ++ if result.prompt_token_ids is None: ++ return self.create_error_response( ++ "prompt_token_ids cannot be None for " ++ "chunked processing") ++ weight = len(result.prompt_token_ids) ++ ++ weighted_embedding = embedding_data.to( ++ dtype=torch.float32) * weight ++ ++ if aggregator['weighted_sum'] is None: ++ # First chunk ++ aggregator[ ++ 'weighted_sum'] = weighted_embedding ++ else: ++ # Accumulate ++ current_sum = aggregator['weighted_sum'] ++ if isinstance(current_sum, torch.Tensor): ++ aggregator['weighted_sum'] = ( ++ current_sum + weighted_embedding) ++ ++ total_weight = aggregator['total_weight'] ++ if isinstance(total_weight, (int, float)): ++ aggregator['total_weight'] = ( ++ total_weight + weight) ++ ++ elif pooling_type == 'LAST': ++ # Keep only the ++ # last result (highest chunk index) ++ if not isinstance(result, ++ PoolingRequestOutput): ++ return self.create_error_response( ++ f"Expected PoolingRequestOutput for " ++ f"chunked embedding, got " ++ f"{type(result).__name__}") ++ ++ chunk_idx = int(parts[parts.index("chunk") + ++ 
1]) ++ last_chunk_idx = aggregator.get( ++ 'last_chunk_idx', -1) ++ # Ensure last_chunk_idx is an integer ++ # for comparison ++ if not isinstance(last_chunk_idx, int): ++ last_chunk_idx = -1 ++ if (aggregator['last_result'] is None ++ or chunk_idx > last_chunk_idx): ++ aggregator['last_result'] = result ++ aggregator['last_chunk_idx'] = chunk_idx ++ ++ elif pooling_type == 'CLS': ++ # Keep only the first result (chunk index 0) ++ if not isinstance(result, ++ PoolingRequestOutput): ++ return self.create_error_response( ++ f"Expected PoolingRequestOutput for " ++ f"chunked embedding, got " ++ f"{type(result).__name__}") ++ ++ chunk_idx = int(parts[parts.index("chunk") + ++ 1]) ++ if chunk_idx == 0: ++ aggregator['first_result'] = result ++ ++ chunk_count = aggregator['chunk_count'] ++ if isinstance(chunk_count, int): ++ aggregator['chunk_count'] = chunk_count + 1 ++ ++ except (ValueError, IndexError): ++ return self.create_error_response( ++ f"Invalid chunk request ID format: " ++ f"{result.request_id}") ++ else: ++ # Non-chunked result ++ try: ++ prompt_idx = int(result.request_id.split("-")[-1]) ++ short_prompts_results[prompt_idx] = cast( ++ PoolingRequestOutput, result) ++ except ValueError: ++ return self.create_error_response( ++ f"Invalid request ID format: " ++ f"{result.request_id}") ++ ++ # Build final result batch ++ final_res_batch = [] ++ ++ for prompt_idx, request_prompt in enumerate( ++ ctx.request_prompts): ++ if prompt_idx in prompt_aggregators: ++ # Finalize aggregation for this chunked prompt ++ aggregator = prompt_aggregators[prompt_idx] ++ pooling_type = aggregator['pooling_type'] ++ ++ if pooling_type == 'MEAN': ++ # Finalize weighted average ++ weighted_sum = aggregator['weighted_sum'] ++ total_weight = aggregator['total_weight'] ++ if (weighted_sum is not None ++ and isinstance(weighted_sum, torch.Tensor) ++ and isinstance(total_weight, (int, float)) ++ and total_weight > 0): ++ final_embedding = weighted_sum / total_weight ++ ++ # Create aggregated result ++ from vllm.outputs import PoolingOutput ++ aggregated_output = PoolingOutput( ++ data=final_embedding) ++ ++ # Get original prompt token ids ++ if self._is_text_tokens_prompt(request_prompt): ++ text_tokens_prompt = cast( ++ TextTokensPrompt, request_prompt) ++ original_token_ids = text_tokens_prompt[ ++ "prompt_token_ids"] ++ else: ++ return self.create_error_response( ++ f"Chunked prompt {prompt_idx} is not a " ++ f"text tokens prompt") ++ ++ # Ensure request_id is string ++ request_id = aggregator['request_id'] ++ if not isinstance(request_id, str): ++ return self.create_error_response( ++ f"Invalid request_id type: " ++ f"{type(request_id)}") ++ ++ aggregated_result = PoolingRequestOutput( ++ request_id=request_id, ++ outputs=aggregated_output, ++ prompt_token_ids=original_token_ids, ++ finished=True, ++ ) ++ final_res_batch.append(aggregated_result) ++ else: ++ return self.create_error_response( ++ f"No valid aggregation data for prompt " ++ f"{prompt_idx}") ++ ++ elif pooling_type == 'LAST': ++ if aggregator['last_result'] is not None: ++ # Use the last chunk result ++ last_result = aggregator['last_result'] ++ if not isinstance(last_result, ++ PoolingRequestOutput): ++ return self.create_error_response( ++ f"Expected PoolingRequestOutput for " ++ f"last_result, got " ++ f"{type(last_result).__name__}") ++ ++ if self._is_text_tokens_prompt(request_prompt): ++ text_tokens_prompt = cast( ++ TextTokensPrompt, request_prompt) ++ original_token_ids = text_tokens_prompt[ ++ "prompt_token_ids"] ++ ++ # Ensure 
request_id is string ++ request_id = aggregator['request_id'] ++ if not isinstance(request_id, str): ++ return self.create_error_response( ++ f"Invalid request_id type: " ++ f"{type(request_id)}") ++ ++ aggregated_result = PoolingRequestOutput( ++ request_id=request_id, ++ outputs=last_result.outputs, ++ prompt_token_ids=original_token_ids, ++ finished=True, ++ ) ++ final_res_batch.append(aggregated_result) ++ else: ++ return self.create_error_response( ++ f"Chunked prompt {prompt_idx} is not a " ++ f"text tokens prompt") ++ else: ++ return self.create_error_response( ++ f"No LAST result found for prompt " ++ f"{prompt_idx}") ++ ++ elif pooling_type == 'CLS': ++ if aggregator['first_result'] is not None: ++ # Use the first chunk result ++ first_result = aggregator['first_result'] ++ if not isinstance(first_result, ++ PoolingRequestOutput): ++ return self.create_error_response( ++ f"Expected PoolingRequestOutput for " ++ f"first_result, got " ++ f"{type(first_result).__name__}") ++ ++ if self._is_text_tokens_prompt(request_prompt): ++ text_tokens_prompt = cast( ++ TextTokensPrompt, request_prompt) ++ original_token_ids = text_tokens_prompt[ ++ "prompt_token_ids"] ++ ++ # Ensure request_id is string ++ request_id = aggregator['request_id'] ++ if not isinstance(request_id, str): ++ return self.create_error_response( ++ f"Invalid request_id type: " ++ f"{type(request_id)}") ++ ++ aggregated_result = PoolingRequestOutput( ++ request_id=request_id, ++ outputs=first_result.outputs, ++ prompt_token_ids=original_token_ids, ++ finished=True, ++ ) ++ final_res_batch.append(aggregated_result) ++ else: ++ return self.create_error_response( ++ f"Chunked prompt {prompt_idx} is not a " ++ f"text tokens prompt") ++ else: ++ return self.create_error_response( ++ f"No CLS result found for prompt " ++ f"{prompt_idx}") ++ else: ++ return self.create_error_response( ++ f"Unsupported pooling type for chunked " ++ f"processing: {pooling_type}") ++ ++ elif prompt_idx in short_prompts_results: ++ # This was a short prompt ++ final_res_batch.append( ++ short_prompts_results[prompt_idx]) ++ else: ++ return self.create_error_response( ++ f"Result not found for prompt {prompt_idx}") ++ ++ ctx.final_res_batch = cast( ++ list[Union[RequestOutput, PoolingRequestOutput]], ++ final_res_batch) ++ else: ++ # Normal processing for non-chunked requests ++ num_prompts = len(ctx.engine_prompts) ++ normal_final_res_batch: list[ ++ Optional[PoolingRequestOutput]] = [None] * num_prompts ++ ++ async for result_idx, result in ctx.result_generator: ++ if result_idx < num_prompts: ++ # Cast to PoolingRequestOutput for embedding results ++ normal_final_res_batch[result_idx] = cast( ++ PoolingRequestOutput, result) ++ ++ if None in normal_final_res_batch: ++ return self.create_error_response( ++ "Failed to generate results for all prompts") ++ ++ final_results = [ ++ res for res in normal_final_res_batch if res is not None ++ ] ++ ctx.final_res_batch = cast( ++ list[Union[RequestOutput, PoolingRequestOutput]], ++ final_results) ++ ++ return None ++ ++ except Exception as e: ++ return self.create_error_response(str(e)) ++ + + class OpenAIServingEmbedding(EmbeddingMixin): + request_id_prefix = "embd" diff --git a/requirements/test.txt b/requirements/test.txt index d45048aae58..567002a5705 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -968,7 +968,6 @@ setuptools==77.0.3 # lightning-utilities # mamba-ssm # pytablewriter - # torch # triton shapely==2.1.1 # via From a0955f89188710e40a735f93bda6268fe79cad15 Mon Sep 17 
00:00:00 2001 From: x22x22 Date: Thu, 31 Jul 2025 15:33:56 +0800 Subject: [PATCH 547/552] The files `diff_config.py` and `diff_serving_embedding.py` have been deleted, and the code and configurations that are no longer in use have been cleaned up. Signed-off-by: x22x22 --- diff_config.py.txt | 40 -- diff_serving_embedding.py.txt | 763 ---------------------------------- 2 files changed, 803 deletions(-) delete mode 100644 diff_config.py.txt delete mode 100644 diff_serving_embedding.py.txt diff --git a/diff_config.py.txt b/diff_config.py.txt deleted file mode 100644 index 81c9b072b88..00000000000 --- a/diff_config.py.txt +++ /dev/null @@ -1,40 +0,0 @@ -diff --git a/vllm/config.py b/vllm/config.py -index a330bafb7..7c8ed575f 100644 ---- a/vllm/config.py -+++ b/vllm/config.py -@@ -3369,6 +3369,35 @@ class PoolerConfig: - ``math-shepherd-mistral-7b-prm`` model. - """ - -+ enable_chunked_processing: Optional[bool] = None -+ """ -+ Whether to enable chunked processing for long inputs that exceed the model's -+ maximum position embeddings. When enabled, long inputs will be split into -+ chunks, processed separately, and then aggregated using weighted averaging. -+ This allows embedding models to handle arbitrarily long text without CUDA -+ errors. Defaults to False. -+ """ -+ -+ max_embed_len: Optional[int] = None -+ """ -+ Maximum input length allowed for embedding generation. When set, allows -+ inputs longer than max_model_len to be accepted for embedding models. -+ This parameter enables accepting long inputs without requiring -+ VLLM_ALLOW_LONG_MAX_MODEL_LEN environment variable. When an input exceeds -+ max_embed_len, it will be handled according to the original max_model_len -+ validation logic. Defaults to None (use max_model_len validation). -+ """ -+ -+ allow_non_mean_chunking: Optional[bool] = None -+ """ -+ Whether to allow chunked processing for non-MEAN pooling types without -+ warnings. By default (None or False), a warning will be shown when using -+ chunked processing with pooling types other than MEAN, as they may produce -+ different results than non-chunked processing. Set to True to explicitly -+ allow and suppress warnings for non-MEAN pooling types. Only applies when -+ enable_chunked_processing is True. 
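# Illustrative sketch (not part of the patch): how the three PoolerConfig
# fields documented above are meant to interact at request-validation time.
# The helper name and standalone signature are placeholders; the logic mirrors
# the `_validate_input` override elsewhere in this series.
from typing import Optional


def should_use_chunked(token_num: int,
                       max_model_len: int,
                       max_position_embeddings: int,
                       enable_chunked_processing: bool = False,
                       max_embed_len: Optional[int] = None) -> bool:
    """Return True if the input should be split into chunks."""
    # max_embed_len, when set, replaces max_model_len as the acceptance limit.
    effective_max = max_embed_len if max_embed_len is not None else max_model_len
    if token_num > effective_max:
        raise ValueError(
            f"input of {token_num} tokens exceeds the limit of {effective_max}")
    # Inputs longer than the position limit require chunked processing.
    if token_num > max_position_embeddings:
        if not enable_chunked_processing:
            raise ValueError(
                f"input of {token_num} tokens exceeds max_position_embeddings "
                f"{max_position_embeddings}; enable chunked processing")
        return True
    return False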
-+ """ -+ - def compute_hash(self) -> str: - """ - WARNING: Whenever a new field is added to this config, diff --git a/diff_serving_embedding.py.txt b/diff_serving_embedding.py.txt deleted file mode 100644 index 1b1c98f8627..00000000000 --- a/diff_serving_embedding.py.txt +++ /dev/null @@ -1,763 +0,0 @@ -diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py -index 84ba00873..49a53cf6c 100644 ---- a/vllm/entrypoints/openai/serving_embedding.py -+++ b/vllm/entrypoints/openai/serving_embedding.py -@@ -2,9 +2,11 @@ - # SPDX-FileCopyrightText: Copyright contributors to the vLLM project - - import base64 --from typing import Final, Literal, Optional, Union, cast -+from collections.abc import AsyncGenerator -+from typing import Any, Final, Literal, Optional, Union, cast - - import numpy as np -+import torch - from fastapi import Request - from typing_extensions import assert_never, override - -@@ -12,18 +14,25 @@ from vllm.config import ModelConfig - from vllm.engine.protocol import EngineClient - from vllm.entrypoints.chat_utils import ChatTemplateContentFormatOption - from vllm.entrypoints.logger import RequestLogger -+# yapf conflicts with isort for this docstring -+# yapf: disable - from vllm.entrypoints.openai.protocol import (EmbeddingChatRequest, -+ EmbeddingCompletionRequest, - EmbeddingRequest, - EmbeddingResponse, - EmbeddingResponseData, - ErrorResponse, UsageInfo) - from vllm.entrypoints.openai.serving_engine import (EmbeddingServeContext, - OpenAIServing, -- ServeContext) -+ ServeContext, -+ TextTokensPrompt) -+# yapf: enable - from vllm.entrypoints.openai.serving_models import OpenAIServingModels -+from vllm.inputs.data import EmbedsPrompt as EngineEmbedsPrompt -+from vllm.inputs.data import TokensPrompt as EngineTokensPrompt - from vllm.logger import init_logger - from vllm.outputs import (EmbeddingOutput, EmbeddingRequestOutput, -- PoolingRequestOutput) -+ PoolingRequestOutput, RequestOutput) - from vllm.pooling_params import PoolingParams - - logger = init_logger(__name__) -@@ -129,6 +138,717 @@ class EmbeddingMixin(OpenAIServing): - usage=usage, - ) - -+ def _get_max_position_embeddings(self) -> int: -+ """Get the model's effective maximum sequence length for chunking. -+ -+ This uses the same logic as vLLM's _get_and_verify_max_len to determine -+ the actual sequence length limit, -+ considering both model config and tokenizer config. -+ When max_model_len is set and smaller than max_position_embeddings, -+ use max_model_len for chunking. 
-+ """ -+ hf_config = self.model_config.hf_config -+ -+ # Start with max_position_embeddings from model config -+ derived_max_len = getattr(hf_config, 'max_position_embeddings', 512) -+ -+ # Get tokenizer config for pooling models (embedding models) -+ if self.model_config.runner_type == "pooling": -+ from vllm.transformers_utils.config import try_get_tokenizer_config -+ tokenizer_config = try_get_tokenizer_config( -+ self.model_config.tokenizer, -+ trust_remote_code=self.model_config.trust_remote_code, -+ revision=self.model_config.tokenizer_revision) -+ -+ # Consider model_max_length in tokenizer_config -+ # (same logic as _get_and_verify_max_len) -+ if tokenizer_config: -+ tokenizer_model_max_length = tokenizer_config.get( -+ 'model_max_length', derived_max_len) -+ derived_max_len = min(derived_max_len, -+ tokenizer_model_max_length) -+ -+ # Consider max_model_len when it's set and smaller than other limits -+ # max_model_len is set in OpenAIServing.__init__ -+ # from model_config.max_model_len -+ if self.max_model_len is not None: -+ derived_max_len = min(derived_max_len, self.max_model_len) -+ -+ return int(derived_max_len) -+ -+ def _should_use_chunked_processing(self, request) -> bool: -+ """Check if chunked processing should be used for this request.""" -+ if not isinstance(request, -+ (EmbeddingChatRequest, EmbeddingCompletionRequest)): -+ return False -+ -+ pooler_config = getattr(self.model_config, 'pooler_config', None) -+ if not (pooler_config is not None and getattr( -+ pooler_config, 'enable_chunked_processing', False)): -+ return False -+ -+ # Check pooling type compatibility for chunked processing -+ pooling_type = getattr(pooler_config, 'pooling_type', None) -+ if pooling_type: -+ pooling_type_upper = pooling_type.upper() -+ -+ # For LAST and CLS pooling, chunked processing doesn't make -+ # semantic sense because only the last/first chunk -+ # contains the relevant token position -+ if pooling_type_upper in ['LAST', 'CLS']: -+ # Check if user explicitly allowed non-mean chunking -+ allow_non_mean = getattr(pooler_config, -+ 'allow_non_mean_chunking', False) -+ if not allow_non_mean: -+ logger.warning( -+ "Chunked processing with pooling type '%s' " -+ "is not recommended as it may produce semantically " -+ "incorrect results. %s pooling relies on specific " -+ "token positions that lose their meaning when the " -+ "sequence is chunked. Consider using MEAN pooling " -+ "or disable chunked processing. Set " -+ "'allow_non_mean_chunking: true' ", -+ "to override this warning.", pooling_type, -+ pooling_type_upper) -+ return False # Disable chunked processing by default -+ else: -+ logger.info( -+ "Using chunked processing with %s pooling " -+ "(explicitly enabled). Note: only the %s chunk " -+ "will be processed to avoid computational waste.", -+ pooling_type_upper, -+ "last" if pooling_type_upper == "LAST" else "first") -+ -+ # Warn about non-MEAN pooling types (for other pooling types) -+ elif pooling_type_upper != 'MEAN': -+ # Check if user explicitly allowed non-mean chunking -+ allow_non_mean = getattr(pooler_config, -+ 'allow_non_mean_chunking', False) -+ if not allow_non_mean: -+ logger.warning( -+ "Chunked processing with pooling type '%s' " -+ "may produce different results than non-chunked " -+ "processing due to limited attention scope within " -+ "chunks. Each token can only attend to tokens within " -+ "its chunk (similar to sliding window attention), " -+ "which changes token representations before pooling. 
" -+ "While MEAN pooling provides a reasonable " -+ "approximation through weighted averaging aggregation, " -+ "other pooling " -+ "types use different aggregation strategies that " -+ "further approximate the original behavior. Set " -+ "'allow_non_mean_chunking: true' in pooler config " -+ "to suppress this warning.", pooling_type) -+ # Still allow it but with warning -+ else: -+ logger.info( -+ "Using chunked processing with pooling type " -+ "'%s' (explicitly enabled)", pooling_type) -+ -+ return True -+ -+ def _chunk_token_ids(self, token_ids: list[int], -+ chunk_size: int) -> list[list[int]]: -+ """Split token IDs into chunks of specified size.""" -+ if len(token_ids) <= chunk_size: -+ return [token_ids] -+ -+ chunks = [] -+ for i in range(0, len(token_ids), chunk_size): -+ chunk = token_ids[i:i + chunk_size] -+ chunks.append(chunk) -+ return chunks -+ -+ async def _process_chunked_request( -+ self, -+ ctx: EmbeddingServeContext, -+ original_prompt: TextTokensPrompt, -+ pooling_params, -+ trace_headers, -+ prompt_idx: int, -+ ) -> list[AsyncGenerator[PoolingRequestOutput, None]]: -+ """Process a single prompt using chunked processing.""" -+ generators: list[AsyncGenerator[PoolingRequestOutput, None]] = [] -+ token_ids = original_prompt["prompt_token_ids"] -+ -+ # Split into chunks using max_position_embeddings -+ max_pos_embeddings = self._get_max_position_embeddings() -+ chunks = self._chunk_token_ids(token_ids, max_pos_embeddings) -+ -+ # Check pooling type to optimize chunk processing -+ pooler_config = getattr(self.model_config, 'pooler_config', None) -+ pooling_type = getattr(pooler_config, 'pooling_type', 'MEAN') -+ if pooling_type: -+ pooling_type = pooling_type.upper() -+ -+ # For LAST pooling, only process the last chunk -+ # For CLS pooling, only process the first chunk -+ if pooling_type == 'LAST': -+ chunks_to_process = [chunks[-1]] -+ chunk_indices = [len(chunks) - 1] -+ logger.info("LAST pooling: processing only the last chunk") -+ elif pooling_type == 'CLS': -+ chunks_to_process = [chunks[0]] -+ chunk_indices = [0] -+ logger.info("CLS pooling: processing only the first chunk") -+ else: -+ # For MEAN and other pooling types, process all chunks -+ chunks_to_process = chunks -+ chunk_indices = list(range(len(chunks))) -+ logger.info("Using chunked processing for MEAN pooling") -+ -+ for i, (chunk_idx, chunk_tokens) in enumerate( -+ zip(chunk_indices, chunks_to_process)): -+ # Create a request ID for this chunk -+ chunk_request_id = (f"{ctx.request_id}-prompt-{prompt_idx}-" -+ f"chunk-{chunk_idx}") -+ -+ # Create engine prompt for this chunk -+ chunk_engine_prompt = EngineTokensPrompt( -+ prompt_token_ids=chunk_tokens) -+ -+ # Create chunk request prompt for logging -+ chunk_text = "" -+ chunk_request_prompt = TextTokensPrompt( -+ prompt=chunk_text, prompt_token_ids=chunk_tokens) -+ -+ # Log the chunk -+ self._log_inputs(chunk_request_id, -+ chunk_request_prompt, -+ params=pooling_params, -+ lora_request=ctx.lora_request) -+ -+ # Create generator for this chunk -+ generator = self.engine_client.encode( -+ chunk_engine_prompt, -+ pooling_params, -+ chunk_request_id, -+ lora_request=ctx.lora_request, -+ trace_headers=trace_headers, -+ priority=getattr(ctx.request, "priority", 0), -+ ) -+ -+ generators.append(generator) -+ -+ return generators -+ -+ def _validate_input( -+ self, -+ request, -+ input_ids: list[int], -+ input_text: str, -+ ) -> TextTokensPrompt: -+ """Override to support chunked processing for embedding requests.""" -+ token_num = len(input_ids) -+ -+ # Note: 
EmbeddingRequest doesn't have max_tokens -+ if isinstance(request, -+ (EmbeddingChatRequest, EmbeddingCompletionRequest)): -+ # Check if chunked processing is enabled for pooling models -+ pooler_config = getattr(self.model_config, 'pooler_config', None) -+ enable_chunked = (pooler_config is not None and getattr( -+ pooler_config, 'enable_chunked_processing', False)) -+ -+ # Get max_embed_len from pooler config if set -+ max_embed_len = (pooler_config.max_embed_len if pooler_config -+ and pooler_config.max_embed_len else None) -+ -+ # Use max_position_embeddings for chunked processing decisions -+ max_pos_embeddings = self._get_max_position_embeddings() -+ -+ # Determine the effective max length for validation -+ if max_embed_len is not None: -+ # Use max_embed_len for validation instead of max_model_len -+ effective_max_len = max_embed_len -+ length_type = "maximum embedding input length" -+ max_length_value = max_embed_len -+ else: -+ # Fall back to max_model_len validation (original behavior) -+ effective_max_len = self.max_model_len -+ length_type = "maximum context length" -+ max_length_value = self.max_model_len -+ -+ validation_error_msg = ( -+ "This model's {length_type} is {max_length} tokens. " -+ "However, you requested {token_num} tokens in the input for " -+ "embedding generation. Please reduce the length of the input." -+ ).format(length_type=length_type, -+ max_length=max_length_value, -+ token_num=token_num) -+ -+ # Check if input exceeds effective max length -+ if token_num > effective_max_len: -+ raise ValueError(validation_error_msg) -+ -+ # Check for chunked processing -+ # when exceeding max_position_embeddings -+ if token_num > max_pos_embeddings: -+ if enable_chunked: -+ # Allow long inputs when chunked processing is enabled -+ logger.info( -+ "Input length %s exceeds max_position_embeddings " -+ "%s, will use chunked processing", token_num, -+ max_pos_embeddings) -+ else: -+ raise ValueError( -+ f"This model's maximum position embeddings length is " -+ f"{max_pos_embeddings} tokens. However, you requested " -+ f"{token_num} tokens in the input for embedding " -+ f"generation. 
Please reduce the length of the input or " -+ f"enable chunked processing.") -+ -+ return TextTokensPrompt(prompt=input_text, -+ prompt_token_ids=input_ids) -+ -+ # For other request types, use the parent's implementation -+ return super()._validate_input(request, input_ids, input_text) -+ -+ def _is_text_tokens_prompt(self, prompt) -> bool: -+ """Check if a prompt is a TextTokensPrompt (has prompt_token_ids).""" -+ return (isinstance(prompt, dict) and "prompt_token_ids" in prompt -+ and "prompt_embeds" not in prompt) -+ -+ async def _prepare_generators( -+ self, -+ ctx: ServeContext, -+ ) -> Optional[ErrorResponse]: -+ """Override to support chunked processing.""" -+ ctx = cast(EmbeddingServeContext, ctx) -+ generators: list[AsyncGenerator[Union[RequestOutput, -+ PoolingRequestOutput], -+ None]] = [] -+ -+ try: -+ trace_headers = (None if ctx.raw_request is None else await -+ self._get_trace_headers(ctx.raw_request.headers)) -+ -+ if not hasattr(ctx.request, "to_pooling_params"): -+ return self.create_error_response( -+ "Request type does not support pooling parameters") -+ -+ pooling_params = ctx.request.to_pooling_params() -+ -+ # Verify and set the task for pooling params -+ try: -+ pooling_params.verify("embed", self.model_config) -+ except ValueError as e: -+ return self.create_error_response(str(e)) -+ -+ if ctx.engine_prompts is None: -+ return self.create_error_response( -+ "Engine prompts not available") -+ -+ if ctx.request_prompts is None: -+ return self.create_error_response( -+ "Request prompts not available") -+ -+ # Check if we should use chunked processing -+ use_chunked = self._should_use_chunked_processing(ctx.request) -+ -+ for i, engine_prompt in enumerate(ctx.engine_prompts): -+ request_prompt = ctx.request_prompts[i] -+ -+ # Check if this specific prompt needs chunked processing -+ max_pos_embeddings = self._get_max_position_embeddings() -+ if (use_chunked -+ and self._is_text_tokens_prompt(request_prompt)): -+ # Cast to TextTokensPrompt since we've -+ # verified prompt_token_ids -+ text_tokens_prompt = cast(TextTokensPrompt, request_prompt) -+ if len(text_tokens_prompt["prompt_token_ids"] -+ ) > max_pos_embeddings: -+ # Use chunked processing for this prompt -+ chunk_generators = await self._process_chunked_request( -+ ctx, text_tokens_prompt, pooling_params, -+ trace_headers, i) -+ generators.extend(chunk_generators) -+ continue -+ -+ # Normal processing for short prompts or non-token prompts -+ request_id_item = f"{ctx.request_id}-{i}" -+ -+ self._log_inputs(request_id_item, -+ request_prompt, -+ params=pooling_params, -+ lora_request=ctx.lora_request) -+ -+ # Mypy has an existing bug related to inferring the variance -+ # of TypedDicts with `builtins.enumerate`: -+ # https://github.com/python/mypy/issues/8586#issuecomment-2867698435 -+ engine_prompt = cast( -+ Union[EngineTokensPrompt, EngineEmbedsPrompt], -+ engine_prompt) -+ generator = self.engine_client.encode( -+ engine_prompt, -+ pooling_params, -+ request_id_item, -+ lora_request=ctx.lora_request, -+ trace_headers=trace_headers, -+ priority=getattr(ctx.request, "priority", 0), -+ ) -+ -+ generators.append(generator) -+ -+ from vllm.utils import merge_async_iterators -+ ctx.result_generator = merge_async_iterators(*generators) -+ -+ return None -+ -+ except Exception as e: -+ # TODO: Use a vllm-specific Validation Error -+ return self.create_error_response(str(e)) -+ -+ async def _collect_batch( -+ self, -+ ctx: ServeContext, -+ ) -> Optional[ErrorResponse]: -+ """Collect and aggregate batch 
results -+ with support for chunked processing. -+ -+ For chunked requests, performs online aggregation to -+ minimize memory usage. -+ For regular requests, collects results normally. -+ """ -+ ctx = cast(EmbeddingServeContext, ctx) -+ try: -+ if ctx.engine_prompts is None: -+ return self.create_error_response( -+ "Engine prompts not available") -+ -+ if ctx.request_prompts is None: -+ return self.create_error_response( -+ "Request prompts not available") -+ -+ if ctx.result_generator is None: -+ return self.create_error_response( -+ "Result generator not available") -+ -+ # Check if we used chunked processing -+ use_chunked = self._should_use_chunked_processing(ctx.request) -+ -+ if use_chunked: -+ # Online aggregation for chunked requests to -+ # minimize memory usage -+ # Track aggregation state for each prompt -+ prompt_aggregators: dict[int, dict[str, Any]] = {} -+ short_prompts_results: dict[int, PoolingRequestOutput] = {} -+ -+ async for result_idx, result in ctx.result_generator: -+ if "-chunk-" in result.request_id: -+ # Extract prompt_idx from chunked request_id -+ parts = result.request_id.split("-") -+ try: -+ prompt_idx = int(parts[parts.index("prompt") + 1]) -+ -+ # Initialize aggregator for this prompt if needed -+ if prompt_idx not in prompt_aggregators: -+ # Get pooling type to determine -+ # aggregation strategy -+ pooler_config = getattr( -+ self.model_config, 'pooler_config', None) -+ pooling_type = getattr(pooler_config, -+ 'pooling_type', 'MEAN') -+ if pooling_type: -+ pooling_type = pooling_type.upper() -+ -+ prompt_aggregators[prompt_idx] = { -+ 'pooling_type': -+ pooling_type, -+ 'weighted_sum': -+ None, -+ 'total_weight': -+ 0, -+ 'first_result': -+ None, -+ 'last_result': -+ None, -+ 'chunk_count': -+ 0, -+ 'request_id': -+ result.request_id.split("-chunk-")[0] -+ } -+ -+ aggregator = prompt_aggregators[prompt_idx] -+ pooling_type = aggregator['pooling_type'] -+ -+ # Handle different pooling types with -+ # online aggregation -+ if pooling_type == 'MEAN': -+ # Online weighted averaging -+ # Ensure result is PoolingRequestOutput -+ # for embedding processing -+ if not isinstance(result, -+ PoolingRequestOutput): -+ return self.create_error_response( -+ f"Expected PoolingRequestOutput for " -+ f"chunked embedding, got " -+ f"{type(result).__name__}") -+ -+ embedding_data = result.outputs.data -+ if not isinstance(embedding_data, -+ torch.Tensor): -+ embedding_data = torch.tensor( -+ embedding_data, dtype=torch.float32) -+ -+ if result.prompt_token_ids is None: -+ return self.create_error_response( -+ "prompt_token_ids cannot be None for " -+ "chunked processing") -+ weight = len(result.prompt_token_ids) -+ -+ weighted_embedding = embedding_data.to( -+ dtype=torch.float32) * weight -+ -+ if aggregator['weighted_sum'] is None: -+ # First chunk -+ aggregator[ -+ 'weighted_sum'] = weighted_embedding -+ else: -+ # Accumulate -+ current_sum = aggregator['weighted_sum'] -+ if isinstance(current_sum, torch.Tensor): -+ aggregator['weighted_sum'] = ( -+ current_sum + weighted_embedding) -+ -+ total_weight = aggregator['total_weight'] -+ if isinstance(total_weight, (int, float)): -+ aggregator['total_weight'] = ( -+ total_weight + weight) -+ -+ elif pooling_type == 'LAST': -+ # Keep only the -+ # last result (highest chunk index) -+ if not isinstance(result, -+ PoolingRequestOutput): -+ return self.create_error_response( -+ f"Expected PoolingRequestOutput for " -+ f"chunked embedding, got " -+ f"{type(result).__name__}") -+ -+ chunk_idx = int(parts[parts.index("chunk") + -+ 
1]) -+ last_chunk_idx = aggregator.get( -+ 'last_chunk_idx', -1) -+ # Ensure last_chunk_idx is an integer -+ # for comparison -+ if not isinstance(last_chunk_idx, int): -+ last_chunk_idx = -1 -+ if (aggregator['last_result'] is None -+ or chunk_idx > last_chunk_idx): -+ aggregator['last_result'] = result -+ aggregator['last_chunk_idx'] = chunk_idx -+ -+ elif pooling_type == 'CLS': -+ # Keep only the first result (chunk index 0) -+ if not isinstance(result, -+ PoolingRequestOutput): -+ return self.create_error_response( -+ f"Expected PoolingRequestOutput for " -+ f"chunked embedding, got " -+ f"{type(result).__name__}") -+ -+ chunk_idx = int(parts[parts.index("chunk") + -+ 1]) -+ if chunk_idx == 0: -+ aggregator['first_result'] = result -+ -+ chunk_count = aggregator['chunk_count'] -+ if isinstance(chunk_count, int): -+ aggregator['chunk_count'] = chunk_count + 1 -+ -+ except (ValueError, IndexError): -+ return self.create_error_response( -+ f"Invalid chunk request ID format: " -+ f"{result.request_id}") -+ else: -+ # Non-chunked result -+ try: -+ prompt_idx = int(result.request_id.split("-")[-1]) -+ short_prompts_results[prompt_idx] = cast( -+ PoolingRequestOutput, result) -+ except ValueError: -+ return self.create_error_response( -+ f"Invalid request ID format: " -+ f"{result.request_id}") -+ -+ # Build final result batch -+ final_res_batch = [] -+ -+ for prompt_idx, request_prompt in enumerate( -+ ctx.request_prompts): -+ if prompt_idx in prompt_aggregators: -+ # Finalize aggregation for this chunked prompt -+ aggregator = prompt_aggregators[prompt_idx] -+ pooling_type = aggregator['pooling_type'] -+ -+ if pooling_type == 'MEAN': -+ # Finalize weighted average -+ weighted_sum = aggregator['weighted_sum'] -+ total_weight = aggregator['total_weight'] -+ if (weighted_sum is not None -+ and isinstance(weighted_sum, torch.Tensor) -+ and isinstance(total_weight, (int, float)) -+ and total_weight > 0): -+ final_embedding = weighted_sum / total_weight -+ -+ # Create aggregated result -+ from vllm.outputs import PoolingOutput -+ aggregated_output = PoolingOutput( -+ data=final_embedding) -+ -+ # Get original prompt token ids -+ if self._is_text_tokens_prompt(request_prompt): -+ text_tokens_prompt = cast( -+ TextTokensPrompt, request_prompt) -+ original_token_ids = text_tokens_prompt[ -+ "prompt_token_ids"] -+ else: -+ return self.create_error_response( -+ f"Chunked prompt {prompt_idx} is not a " -+ f"text tokens prompt") -+ -+ # Ensure request_id is string -+ request_id = aggregator['request_id'] -+ if not isinstance(request_id, str): -+ return self.create_error_response( -+ f"Invalid request_id type: " -+ f"{type(request_id)}") -+ -+ aggregated_result = PoolingRequestOutput( -+ request_id=request_id, -+ outputs=aggregated_output, -+ prompt_token_ids=original_token_ids, -+ finished=True, -+ ) -+ final_res_batch.append(aggregated_result) -+ else: -+ return self.create_error_response( -+ f"No valid aggregation data for prompt " -+ f"{prompt_idx}") -+ -+ elif pooling_type == 'LAST': -+ if aggregator['last_result'] is not None: -+ # Use the last chunk result -+ last_result = aggregator['last_result'] -+ if not isinstance(last_result, -+ PoolingRequestOutput): -+ return self.create_error_response( -+ f"Expected PoolingRequestOutput for " -+ f"last_result, got " -+ f"{type(last_result).__name__}") -+ -+ if self._is_text_tokens_prompt(request_prompt): -+ text_tokens_prompt = cast( -+ TextTokensPrompt, request_prompt) -+ original_token_ids = text_tokens_prompt[ -+ "prompt_token_ids"] -+ -+ # Ensure 
request_id is string -+ request_id = aggregator['request_id'] -+ if not isinstance(request_id, str): -+ return self.create_error_response( -+ f"Invalid request_id type: " -+ f"{type(request_id)}") -+ -+ aggregated_result = PoolingRequestOutput( -+ request_id=request_id, -+ outputs=last_result.outputs, -+ prompt_token_ids=original_token_ids, -+ finished=True, -+ ) -+ final_res_batch.append(aggregated_result) -+ else: -+ return self.create_error_response( -+ f"Chunked prompt {prompt_idx} is not a " -+ f"text tokens prompt") -+ else: -+ return self.create_error_response( -+ f"No LAST result found for prompt " -+ f"{prompt_idx}") -+ -+ elif pooling_type == 'CLS': -+ if aggregator['first_result'] is not None: -+ # Use the first chunk result -+ first_result = aggregator['first_result'] -+ if not isinstance(first_result, -+ PoolingRequestOutput): -+ return self.create_error_response( -+ f"Expected PoolingRequestOutput for " -+ f"first_result, got " -+ f"{type(first_result).__name__}") -+ -+ if self._is_text_tokens_prompt(request_prompt): -+ text_tokens_prompt = cast( -+ TextTokensPrompt, request_prompt) -+ original_token_ids = text_tokens_prompt[ -+ "prompt_token_ids"] -+ -+ # Ensure request_id is string -+ request_id = aggregator['request_id'] -+ if not isinstance(request_id, str): -+ return self.create_error_response( -+ f"Invalid request_id type: " -+ f"{type(request_id)}") -+ -+ aggregated_result = PoolingRequestOutput( -+ request_id=request_id, -+ outputs=first_result.outputs, -+ prompt_token_ids=original_token_ids, -+ finished=True, -+ ) -+ final_res_batch.append(aggregated_result) -+ else: -+ return self.create_error_response( -+ f"Chunked prompt {prompt_idx} is not a " -+ f"text tokens prompt") -+ else: -+ return self.create_error_response( -+ f"No CLS result found for prompt " -+ f"{prompt_idx}") -+ else: -+ return self.create_error_response( -+ f"Unsupported pooling type for chunked " -+ f"processing: {pooling_type}") -+ -+ elif prompt_idx in short_prompts_results: -+ # This was a short prompt -+ final_res_batch.append( -+ short_prompts_results[prompt_idx]) -+ else: -+ return self.create_error_response( -+ f"Result not found for prompt {prompt_idx}") -+ -+ ctx.final_res_batch = cast( -+ list[Union[RequestOutput, PoolingRequestOutput]], -+ final_res_batch) -+ else: -+ # Normal processing for non-chunked requests -+ num_prompts = len(ctx.engine_prompts) -+ normal_final_res_batch: list[ -+ Optional[PoolingRequestOutput]] = [None] * num_prompts -+ -+ async for result_idx, result in ctx.result_generator: -+ if result_idx < num_prompts: -+ # Cast to PoolingRequestOutput for embedding results -+ normal_final_res_batch[result_idx] = cast( -+ PoolingRequestOutput, result) -+ -+ if None in normal_final_res_batch: -+ return self.create_error_response( -+ "Failed to generate results for all prompts") -+ -+ final_results = [ -+ res for res in normal_final_res_batch if res is not None -+ ] -+ ctx.final_res_batch = cast( -+ list[Union[RequestOutput, PoolingRequestOutput]], -+ final_results) -+ -+ return None -+ -+ except Exception as e: -+ return self.create_error_response(str(e)) -+ - - class OpenAIServingEmbedding(EmbeddingMixin): - request_id_prefix = "embd" From 54781698e78a718899fce478697c1dbbf30aabba Mon Sep 17 00:00:00 2001 From: x22x22 Date: Thu, 31 Jul 2025 15:38:35 +0800 Subject: [PATCH 548/552] The files `diff_config.py` and `diff_serving_embedding.py` have been deleted, and the code and configurations that are no longer in use have been cleaned up. 
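For reference, the cross-chunk aggregation performed by the `_collect_batch` override above (an online weighted sum per prompt, keyed by request ids of the form `<request_id>-prompt-<i>-chunk-<j>`) reduces to the minimal sketch below. It is illustrative only and not part of the patch; the helper name and the standalone `(embedding, token_count)` inputs are assumptions.

```python
import torch


def aggregate_chunk_embeddings(
        chunk_results: list[tuple[torch.Tensor, int]]) -> torch.Tensor:
    """Weighted mean over (chunk_embedding, chunk_token_count) pairs."""
    weighted_sum = None
    total_weight = 0
    for embedding, num_tokens in chunk_results:
        # Each chunk contributes proportionally to its token count.
        weighted = embedding.to(dtype=torch.float32) * num_tokens
        weighted_sum = weighted if weighted_sum is None else weighted_sum + weighted
        total_weight += num_tokens
    if weighted_sum is None or total_weight <= 0:
        raise ValueError("no chunk results to aggregate")
    return weighted_sum / total_weight


# e.g. two chunks of 4096 and 1904 tokens contribute in proportion to length:
# final = aggregate_chunk_embeddings([(emb_a, 4096), (emb_b, 1904)])
```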
Signed-off-by: x22x22 --- requirements/test.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/requirements/test.txt b/requirements/test.txt index 567002a5705..d45048aae58 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -968,6 +968,7 @@ setuptools==77.0.3 # lightning-utilities # mamba-ssm # pytablewriter + # torch # triton shapely==2.1.1 # via From 971aa11fc3d0a1226d294fa0595ea21024784336 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Thu, 31 Jul 2025 15:54:29 +0800 Subject: [PATCH 549/552] The block processing logic in the embedding processing has been optimized, with the use of mean aggregation enforced and support for other aggregation types removed. Relevant log information has been updated to reflect the new processing approach. Signed-off-by: x22x22 --- vllm/entrypoints/openai/serving_embedding.py | 390 ++++--------------- 1 file changed, 85 insertions(+), 305 deletions(-) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 93bed980f9a..42551a1854f 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -183,69 +183,11 @@ def _should_use_chunked_processing(self, request) -> bool: return False pooler_config = getattr(self.model_config, 'pooler_config', None) - if not (pooler_config is not None and getattr( - pooler_config, 'enable_chunked_processing', False)): - return False - # Check pooling type compatibility for chunked processing - pooling_type = getattr(pooler_config, 'pooling_type', None) - if pooling_type: - pooling_type_upper = pooling_type.upper() - - # For LAST and CLS pooling, chunked processing doesn't make - # semantic sense because only the last/first chunk - # contains the relevant token position - if pooling_type_upper in ['LAST', 'CLS']: - # Check if user explicitly allowed non-mean chunking - allow_non_mean = getattr(pooler_config, - 'allow_non_mean_chunking', False) - if not allow_non_mean: - logger.warning( - "Chunked processing with pooling type '%s' " - "is not recommended as it may produce semantically " - "incorrect results. %s pooling relies on specific " - "token positions that lose their meaning when the " - "sequence is chunked. Consider using MEAN pooling " - "or disable chunked processing. Set " - "'allow_non_mean_chunking: true' ", - "to override this warning.", pooling_type, - pooling_type_upper) - return False # Disable chunked processing by default - else: - logger.info( - "Using chunked processing with %s pooling " - "(explicitly enabled). Note: only the %s chunk " - "will be processed to avoid computational waste.", - pooling_type_upper, - "last" if pooling_type_upper == "LAST" else "first") - - # Warn about non-MEAN pooling types (for other pooling types) - elif pooling_type_upper != 'MEAN': - # Check if user explicitly allowed non-mean chunking - allow_non_mean = getattr(pooler_config, - 'allow_non_mean_chunking', False) - if not allow_non_mean: - logger.warning( - "Chunked processing with pooling type '%s' " - "may produce different results than non-chunked " - "processing due to limited attention scope within " - "chunks. Each token can only attend to tokens within " - "its chunk (similar to sliding window attention), " - "which changes token representations before pooling. " - "While MEAN pooling provides a reasonable " - "approximation through weighted averaging aggregation, " - "other pooling " - "types use different aggregation strategies that " - "further approximate the original behavior. 
Set " - "'allow_non_mean_chunking: true' in pooler config " - "to suppress this warning.", pooling_type) - # Still allow it but with warning - else: - logger.info( - "Using chunked processing with pooling type " - "'%s' (explicitly enabled)", pooling_type) - - return True + # For chunked processing, we always use MEAN aggregation + # for cross-chunk aggregation (native pooling is used within each chunk) + return (pooler_config is not None + and getattr(pooler_config, 'enable_chunked_processing', False)) def _chunk_token_ids(self, token_ids: list[int], chunk_size: int) -> list[list[int]]: @@ -275,27 +217,10 @@ async def _process_chunked_request( max_pos_embeddings = self._get_max_position_embeddings() chunks = self._chunk_token_ids(token_ids, max_pos_embeddings) - # Check pooling type to optimize chunk processing - pooler_config = getattr(self.model_config, 'pooler_config', None) - pooling_type = getattr(pooler_config, 'pooling_type', 'MEAN') - if pooling_type: - pooling_type = pooling_type.upper() - - # For LAST pooling, only process the last chunk - # For CLS pooling, only process the first chunk - if pooling_type == 'LAST': - chunks_to_process = [chunks[-1]] - chunk_indices = [len(chunks) - 1] - logger.info("LAST pooling: processing only the last chunk") - elif pooling_type == 'CLS': - chunks_to_process = [chunks[0]] - chunk_indices = [0] - logger.info("CLS pooling: processing only the first chunk") - else: - # For MEAN and other pooling types, process all chunks - chunks_to_process = chunks - chunk_indices = list(range(len(chunks))) - logger.info("Using chunked processing for MEAN pooling") + # Process all chunks for MEAN aggregation + chunks_to_process = chunks + chunk_indices = list(range(len(chunks))) + logger.info("Using chunked processing with MEAN aggregation") for i, (chunk_idx, chunk_tokens) in enumerate( zip(chunk_indices, chunks_to_process)): @@ -542,26 +467,11 @@ async def _collect_batch( # Initialize aggregator for this prompt if needed if prompt_idx not in prompt_aggregators: - # Get pooling type to determine - # aggregation strategy - pooler_config = getattr( - self.model_config, 'pooler_config', None) - pooling_type = getattr(pooler_config, - 'pooling_type', 'MEAN') - if pooling_type: - pooling_type = pooling_type.upper() - prompt_aggregators[prompt_idx] = { - 'pooling_type': - pooling_type, 'weighted_sum': None, 'total_weight': 0, - 'first_result': - None, - 'last_result': - None, 'chunk_count': 0, 'request_id': @@ -569,88 +479,44 @@ async def _collect_batch( } aggregator = prompt_aggregators[prompt_idx] - pooling_type = aggregator['pooling_type'] - - # Handle different pooling types with - # online aggregation - if pooling_type == 'MEAN': - # Online weighted averaging - # Ensure result is PoolingRequestOutput - # for embedding processing - if not isinstance(result, - PoolingRequestOutput): - return self.create_error_response( - f"Expected PoolingRequestOutput for " - f"chunked embedding, got " - f"{type(result).__name__}") - - embedding_data = result.outputs.data - if not isinstance(embedding_data, - torch.Tensor): - embedding_data = torch.tensor( - embedding_data, dtype=torch.float32) - - if result.prompt_token_ids is None: - return self.create_error_response( - "prompt_token_ids cannot be None for " - "chunked processing") - weight = len(result.prompt_token_ids) - - weighted_embedding = embedding_data.to( - dtype=torch.float32) * weight - - if aggregator['weighted_sum'] is None: - # First chunk - aggregator[ - 'weighted_sum'] = weighted_embedding - else: - # 
Accumulate - current_sum = aggregator['weighted_sum'] - if isinstance(current_sum, torch.Tensor): - aggregator['weighted_sum'] = ( - current_sum + weighted_embedding) - - total_weight = aggregator['total_weight'] - if isinstance(total_weight, (int, float)): - aggregator['total_weight'] = ( - total_weight + weight) - - elif pooling_type == 'LAST': - # Keep only the - # last result (highest chunk index) - if not isinstance(result, - PoolingRequestOutput): - return self.create_error_response( - f"Expected PoolingRequestOutput for " - f"chunked embedding, got " - f"{type(result).__name__}") - - chunk_idx = int(parts[parts.index("chunk") + - 1]) - last_chunk_idx = aggregator.get( - 'last_chunk_idx', -1) - # Ensure last_chunk_idx is an integer - # for comparison - if not isinstance(last_chunk_idx, int): - last_chunk_idx = -1 - if (aggregator['last_result'] is None - or chunk_idx > last_chunk_idx): - aggregator['last_result'] = result - aggregator['last_chunk_idx'] = chunk_idx - - elif pooling_type == 'CLS': - # Keep only the first result (chunk index 0) - if not isinstance(result, - PoolingRequestOutput): - return self.create_error_response( - f"Expected PoolingRequestOutput for " - f"chunked embedding, got " - f"{type(result).__name__}") - - chunk_idx = int(parts[parts.index("chunk") + - 1]) - if chunk_idx == 0: - aggregator['first_result'] = result + + # MEAN pooling with online weighted averaging + # Ensure result is PoolingRequestOutput + # for embedding processing + if not isinstance(result, PoolingRequestOutput): + return self.create_error_response( + f"Expected PoolingRequestOutput for " + f"chunked embedding, got " + f"{type(result).__name__}") + + embedding_data = result.outputs.data + if not isinstance(embedding_data, torch.Tensor): + embedding_data = torch.tensor( + embedding_data, dtype=torch.float32) + + if result.prompt_token_ids is None: + return self.create_error_response( + "prompt_token_ids cannot be None for " + "chunked processing") + weight = len(result.prompt_token_ids) + + weighted_embedding = embedding_data.to( + dtype=torch.float32) * weight + + if aggregator['weighted_sum'] is None: + # First chunk + aggregator['weighted_sum'] = weighted_embedding + else: + # Accumulate + current_sum = aggregator['weighted_sum'] + if isinstance(current_sum, torch.Tensor): + aggregator['weighted_sum'] = ( + current_sum + weighted_embedding) + + total_weight = aggregator['total_weight'] + if isinstance(total_weight, (int, float)): + aggregator['total_weight'] = (total_weight + + weight) chunk_count = aggregator['chunk_count'] if isinstance(chunk_count, int): @@ -677,138 +543,52 @@ async def _collect_batch( for prompt_idx, request_prompt in enumerate( ctx.request_prompts): if prompt_idx in prompt_aggregators: - # Finalize aggregation for this chunked prompt + # Finalize MEAN aggregation for this chunked prompt aggregator = prompt_aggregators[prompt_idx] - pooling_type = aggregator['pooling_type'] - if pooling_type == 'MEAN': - # Finalize weighted average - weighted_sum = aggregator['weighted_sum'] - total_weight = aggregator['total_weight'] - if (weighted_sum is not None - and isinstance(weighted_sum, torch.Tensor) - and isinstance(total_weight, (int, float)) - and total_weight > 0): - final_embedding = weighted_sum / total_weight - - # Create aggregated result - from vllm.outputs import PoolingOutput - aggregated_output = PoolingOutput( - data=final_embedding) - - # Get original prompt token ids - if self._is_text_tokens_prompt(request_prompt): - text_tokens_prompt = cast( - 
TextTokensPrompt, request_prompt) - original_token_ids = text_tokens_prompt[ - "prompt_token_ids"] - else: - return self.create_error_response( - f"Chunked prompt {prompt_idx} is not a " - f"text tokens prompt") - - # Ensure request_id is string - request_id = aggregator['request_id'] - if not isinstance(request_id, str): - return self.create_error_response( - f"Invalid request_id type: " - f"{type(request_id)}") - - aggregated_result = PoolingRequestOutput( - request_id=request_id, - outputs=aggregated_output, - prompt_token_ids=original_token_ids, - finished=True, - ) - final_res_batch.append(aggregated_result) - else: - return self.create_error_response( - f"No valid aggregation data for prompt " - f"{prompt_idx}") - - elif pooling_type == 'LAST': - if aggregator['last_result'] is not None: - # Use the last chunk result - last_result = aggregator['last_result'] - if not isinstance(last_result, - PoolingRequestOutput): - return self.create_error_response( - f"Expected PoolingRequestOutput for " - f"last_result, got " - f"{type(last_result).__name__}") - - if self._is_text_tokens_prompt(request_prompt): - text_tokens_prompt = cast( - TextTokensPrompt, request_prompt) - original_token_ids = text_tokens_prompt[ - "prompt_token_ids"] - - # Ensure request_id is string - request_id = aggregator['request_id'] - if not isinstance(request_id, str): - return self.create_error_response( - f"Invalid request_id type: " - f"{type(request_id)}") - - aggregated_result = PoolingRequestOutput( - request_id=request_id, - outputs=last_result.outputs, - prompt_token_ids=original_token_ids, - finished=True, - ) - final_res_batch.append(aggregated_result) - else: - return self.create_error_response( - f"Chunked prompt {prompt_idx} is not a " - f"text tokens prompt") + # Finalize weighted average + weighted_sum = aggregator['weighted_sum'] + total_weight = aggregator['total_weight'] + if (weighted_sum is not None + and isinstance(weighted_sum, torch.Tensor) + and isinstance(total_weight, (int, float)) + and total_weight > 0): + final_embedding = weighted_sum / total_weight + + # Create aggregated result + from vllm.outputs import PoolingOutput + aggregated_output = PoolingOutput( + data=final_embedding) + + # Get original prompt token ids + if self._is_text_tokens_prompt(request_prompt): + text_tokens_prompt = cast( + TextTokensPrompt, request_prompt) + original_token_ids = text_tokens_prompt[ + "prompt_token_ids"] else: return self.create_error_response( - f"No LAST result found for prompt " - f"{prompt_idx}") - - elif pooling_type == 'CLS': - if aggregator['first_result'] is not None: - # Use the first chunk result - first_result = aggregator['first_result'] - if not isinstance(first_result, - PoolingRequestOutput): - return self.create_error_response( - f"Expected PoolingRequestOutput for " - f"first_result, got " - f"{type(first_result).__name__}") - - if self._is_text_tokens_prompt(request_prompt): - text_tokens_prompt = cast( - TextTokensPrompt, request_prompt) - original_token_ids = text_tokens_prompt[ - "prompt_token_ids"] - - # Ensure request_id is string - request_id = aggregator['request_id'] - if not isinstance(request_id, str): - return self.create_error_response( - f"Invalid request_id type: " - f"{type(request_id)}") - - aggregated_result = PoolingRequestOutput( - request_id=request_id, - outputs=first_result.outputs, - prompt_token_ids=original_token_ids, - finished=True, - ) - final_res_batch.append(aggregated_result) - else: - return self.create_error_response( - f"Chunked prompt {prompt_idx} 
is not a " - f"text tokens prompt") - else: + f"Chunked prompt {prompt_idx} is not a " + f"text tokens prompt") + + # Ensure request_id is string + request_id = aggregator['request_id'] + if not isinstance(request_id, str): return self.create_error_response( - f"No CLS result found for prompt " - f"{prompt_idx}") + f"Invalid request_id type: " + f"{type(request_id)}") + + aggregated_result = PoolingRequestOutput( + request_id=request_id, + outputs=aggregated_output, + prompt_token_ids=original_token_ids, + finished=True, + ) + final_res_batch.append(aggregated_result) else: return self.create_error_response( - f"Unsupported pooling type for chunked " - f"processing: {pooling_type}") + f"No valid aggregation data for prompt " + f"{prompt_idx}") elif prompt_idx in short_prompts_results: # This was a short prompt From b3204df867c1b6c16167cb62e61b665206357ca1 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Thu, 31 Jul 2025 16:31:01 +0800 Subject: [PATCH 550/552] The processing logic for long - text embeddings has been updated. Mean aggregation is uniformly adopted, and support for other aggregation types has been removed. Relevant documents and configurations have been updated to reflect the new processing approach. Configuration options that are no longer in use have been removed to ensure the code's cleanliness. Signed-off-by: x22x22 --- .../openai_embedding_long_text.md | 42 ++++++------ .../openai_embedding_long_text_client.py | 21 +++--- .../openai_embedding_long_text_service.sh | 65 ++++++------------- vllm/config.py | 10 --- 4 files changed, 50 insertions(+), 88 deletions(-) diff --git a/examples/online_serving/openai_embedding_long_text.md b/examples/online_serving/openai_embedding_long_text.md index a94bf95e534..fb83a564ebb 100644 --- a/examples/online_serving/openai_embedding_long_text.md +++ b/examples/online_serving/openai_embedding_long_text.md @@ -50,18 +50,19 @@ The key parameters for chunked processing are in the `--override-pooler-config`: "pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, - "max_embed_len": 3072000, - "allow_non_mean_chunking": true + "max_embed_len": 3072000 } ``` -#### Pooling Type Behavior with Chunked Processing +#### Chunked Processing Behavior -| Pooling Type | Chunks Processed | Performance | Semantic Coverage | Use Case | -|--------------|------------------|-------------|-------------------|----------| -| **MEAN** (recommended) | All chunks | Slower | Complete | General purpose, full documents | -| **CLS** | First chunk only | Fastest | Limited to start | Classification, when beginning matters | -| **LAST** | Last chunk only | Fastest | Limited to end | When ending/conclusion matters | +Chunked processing now uses **MEAN aggregation** for cross-chunk combination, regardless of the model's native pooling type: + +| Component | Behavior | Description | +|-----------|----------|-------------| +| **Within chunks** | Native pooling (MEAN/CLS/LAST) | Uses model's original pooling strategy | +| **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts | +| **Performance** | Optimal | All chunks processed for complete semantic coverage | ### Environment Variables @@ -71,21 +72,15 @@ The key parameters for chunked processing are in the `--override-pooler-config`: | `PORT` | `31090` | Server port | | `GPU_COUNT` | `1` | Number of GPUs to use | | `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) | -| `POOLING_TYPE` | `auto` | Pooling type: `auto`, `MEAN`, `CLS`, 
`LAST` | -| `ALLOW_NON_MEAN_CHUNKING` | `false` | Allow CLS/LAST pooling with chunked processing | +| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` | | `API_KEY` | `EMPTY` | API key for authentication | ## 🔧 How It Works 1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without environment variables 2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity -3. **Pooling-Optimized Processing**: - - **MEAN pooling**: All chunks processed separately through the model - - **CLS pooling**: Only first chunk processed (contains CLS token) - - **LAST pooling**: Only last chunk processed (contains final token) -4. **Intelligent Aggregation**: - - **MEAN**: Results combined using token count-based weighted averaging - - **CLS/LAST**: Direct use of single chunk result (no aggregation needed) +3. **Unified Processing**: All chunks processed separately through the model using native pooling +4. **MEAN Aggregation**: Results combined using token count-based weighted averaging across all chunks 5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing ### Input Length Handling @@ -105,13 +100,14 @@ With `MAX_EMBED_LEN=3072000`, you can process: ## 📊 Performance Characteristics -### By Pooling Type (for long text) +### Chunked Processing Performance -| Pooling Type | Chunks Processed | Processing Time | Memory Usage | Semantic Quality | -|--------------|------------------|-----------------|--------------|------------------| -| **MEAN** | All chunks | Highest | Moderate | Complete coverage | -| **CLS** | First chunk only | Lowest | Minimal | Limited to beginning | -| **LAST** | Last chunk only | Lowest | Minimal | Limited to ending | +| Aspect | Behavior | Performance | +|--------|----------|-------------| +| **Chunk Processing** | All chunks processed with native pooling | Consistent with input length | +| **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead | +| **Memory Usage** | Proportional to number of chunks | Moderate, scalable | +| **Semantic Quality** | Complete text coverage | Optimal for long documents | ## 🧪 Test Cases diff --git a/examples/online_serving/openai_embedding_long_text_client.py b/examples/online_serving/openai_embedding_long_text_client.py index 1909800e420..7e3663f2854 100644 --- a/examples/online_serving/openai_embedding_long_text_client.py +++ b/examples/online_serving/openai_embedding_long_text_client.py @@ -22,13 +22,12 @@ --port 31090 \ --api-key your-api-key - # OR CLS pooling (processes only first chunk, faster but limited coverage) + # OR CLS pooling (native CLS within chunks, MEAN aggregation across chunks) vllm serve BAAI/bge-large-en-v1.5 \ --task embed \ --override-pooler-config \ '{"pooling_type": "CLS", "normalize": true, ' \ - '"enable_chunked_processing": true, "max_embed_len": 1048576, ' \ - '"allow_non_mean_chunking": true}' \ + '"enable_chunked_processing": true, "max_embed_len": 1048576}' \ --served-model-name bge-large-en-v1.5 \ --trust-remote-code \ --port 31090 \ @@ -177,10 +176,10 @@ def test_multiple_long_texts_batch(): print("=" * 70) # Create multiple distinct long texts that will all require chunking - # Note: Results depend on pooling type: - # - MEAN pooling: All chunks processed, full semantic coverage - # - CLS pooling: Only first chunk processed per text (performance optimized) - # - LAST pooling: Only last chunk processed per text (performance 
optimized) + # Note: All pooling types now use MEAN aggregation across chunks: + # - Native pooling (MEAN/CLS/LAST) is used within each chunk + # - MEAN aggregation combines results across all chunks + # - Full semantic coverage for all pooling types long_texts = [ generate_long_text( "First long document about artificial intelligence and machine learning. " @@ -352,10 +351,10 @@ def main(): print(" - ✅ Automatic chunked processing for long text") print(" - ✅ Seamless handling of mixed-length batches") print(" - ✅ Multiple long texts in single batch (chunk ID fix)") - print(" - ✅ Pooling-type optimized processing:") - print(" • MEAN: All chunks processed (complete coverage)") - print(" • CLS: Only first chunk processed (performance optimized)") - print(" • LAST: Only last chunk processed (performance optimized)") + print(" - ✅ Unified chunked processing:") + print(" • Native pooling used within each chunk") + print(" • MEAN aggregation across all chunks") + print(" • Complete semantic coverage for all pooling types") print(" - ✅ Consistent embedding generation") print(" - ✅ Backward compatibility with short text") print("\n📚 For more information, see:") diff --git a/examples/online_serving/openai_embedding_long_text_service.sh b/examples/online_serving/openai_embedding_long_text_service.sh index 0d9a613c2d3..e22cb933ae9 100644 --- a/examples/online_serving/openai_embedding_long_text_service.sh +++ b/examples/online_serving/openai_embedding_long_text_service.sh @@ -20,7 +20,6 @@ API_KEY=${API_KEY:-"your-api-key"} # Enhanced pooling configuration with model-specific defaults POOLING_TYPE=${POOLING_TYPE:-"auto"} # auto, MEAN, CLS, LAST -ALLOW_NON_MEAN_CHUNKING=${ALLOW_NON_MEAN_CHUNKING:-"true"} export VLLM_ENABLE_CHUNKED_PROCESSING=true # export CUDA_VISIBLE_DEVICES=2,3,4,5 # export VLLM_ATTENTION_BACKEND=XFORMERS @@ -36,25 +35,25 @@ get_optimal_pooling_type() { local model="$1" case "$model" in *"e5-"* | *"multilingual-e5"*) - echo "MEAN" # E5 series uses mean pooling (best for chunked processing) + echo "MEAN" # E5 series native pooling ;; *"bge-"*) - echo "CLS" # BGE series uses CLS pooling (only first chunk processed when chunked) + echo "CLS" # BGE series native pooling ;; *"gte-"*) - echo "LAST" # GTE series uses LAST pooling (best for chunked processing) + echo "LAST" # GTE series native pooling ;; *"sentence-t5"* | *"st5"*) - echo "MEAN" # Sentence-T5 uses mean pooling (best for chunked processing) + echo "MEAN" # Sentence-T5 native pooling ;; *"jina-embeddings"*) - echo "MEAN" # Jina embeddings use mean pooling (optimal for chunked processing) + echo "MEAN" # Jina embeddings native pooling ;; *"Qwen"*"Embedding"*) - echo "LAST" # Qwen embeddings use LAST pooling (optimal for chunked processing) + echo "LAST" # Qwen embeddings native pooling ;; *) - echo "MEAN" # Default to MEAN for unknown models (best chunked processing compatibility) + echo "MEAN" # Default native pooling for unknown models ;; esac } @@ -72,8 +71,8 @@ echo " - Port: $PORT" echo " - GPU Count: $GPU_COUNT" echo " - Enhanced Chunked Processing: ${VLLM_ENABLE_CHUNKED_PROCESSING}" echo " - Max Embed Length: ${MAX_EMBED_LEN} tokens" -echo " - Pooling Type: $POOLING_TYPE + Normalization" -echo " - Allow Non-MEAN Chunking: $ALLOW_NON_MEAN_CHUNKING" +echo " - Native Pooling Type: $POOLING_TYPE + Normalization" +echo " - Cross-chunk Aggregation: MEAN (automatic)" echo "" # Validate GPU availability @@ -89,38 +88,16 @@ else echo "⚠️ Warning: nvidia-smi not found. GPU detection skipped." 
fi -# Warning for non-MEAN pooling types -if [ "$POOLING_TYPE" != "MEAN" ] && [ "$ALLOW_NON_MEAN_CHUNKING" != "true" ]; then - echo "" - echo "⚠️ IMPORTANT: Using $POOLING_TYPE pooling with chunked processing" - echo " Chunked processing behavior for different pooling types:" - if [ "$POOLING_TYPE" = "CLS" ]; then - echo " - CLS pooling: Only the FIRST chunk will be processed (performance optimized)" - echo " - This avoids processing unnecessary chunks but may lose information" - elif [ "$POOLING_TYPE" = "LAST" ]; then - echo " - LAST pooling: Only the LAST chunk will be processed (performance optimized)" - echo " - This avoids processing unnecessary chunks but may lose information" - else - echo " - $POOLING_TYPE pooling: All chunks processed, results may differ from non-chunked" - fi - echo " - Each token only attends within its chunk (limited attention scope)" - echo " - Consider using MEAN pooling for full semantic coverage" - echo " - Set ALLOW_NON_MEAN_CHUNKING=true to suppress this warning" - echo "" -fi +# Chunked processing uses unified MEAN aggregation +echo "ℹ️ Chunked Processing: Using $POOLING_TYPE pooling within chunks, MEAN aggregation across chunks" +echo " - All chunks processed for complete semantic coverage" +echo " - Weighted averaging based on chunk token counts" echo "" echo "🔧 Starting server with enhanced chunked processing configuration..." # Build pooler config JSON -POOLER_CONFIG="{\"pooling_type\": \"$POOLING_TYPE\", \"normalize\": true, \"enable_chunked_processing\": ${VLLM_ENABLE_CHUNKED_PROCESSING}, \"max_embed_len\": ${MAX_EMBED_LEN}" - -# Add allow_non_mean_chunking if needed (suppresses warnings for non-MEAN pooling types) -if [ "$ALLOW_NON_MEAN_CHUNKING" = "true" ]; then - POOLER_CONFIG="${POOLER_CONFIG}, \"allow_non_mean_chunking\": true" -fi - -POOLER_CONFIG="${POOLER_CONFIG}}" +POOLER_CONFIG="{\"pooling_type\": \"$POOLING_TYPE\", \"normalize\": true, \"enable_chunked_processing\": ${VLLM_ENABLE_CHUNKED_PROCESSING}, \"max_embed_len\": ${MAX_EMBED_LEN}}" # Start vLLM server with enhanced chunked processing vllm serve "$MODEL_NAME" \ @@ -142,21 +119,21 @@ echo "📡 Server Information:" echo " - Base URL: http://localhost:$PORT" echo " - Model Code: ${MODEL_CODE}" echo " - API Key: $API_KEY" -echo " - Pooling Strategy: $POOLING_TYPE" +echo " - Native Pooling: $POOLING_TYPE | Cross-chunk: MEAN" echo "" echo "🧪 Test the server with:" echo " python examples/online_serving/openai_embedding_long_text_client.py" echo "" echo "📚 Enhanced features enabled:" -echo " ✅ Intelligent pooling type detection and validation" -echo " ✅ Long text chunked processing with proper aggregation" -echo " ✅ Model-specific pooling strategy optimization" +echo " ✅ Intelligent native pooling type detection" +echo " ✅ Unified MEAN aggregation for chunked processing" +echo " ✅ Model-specific native pooling optimization" echo " ✅ Enhanced max embedding length (${MAX_EMBED_LEN} tokens)" -echo " ✅ Automatic chunk aggregation (MEAN/CLS/LAST support)" +echo " ✅ Complete semantic coverage for all pooling types" echo " ✅ OpenAI-compatible API" echo " ✅ GPU acceleration" echo "" echo "🔧 Advanced usage:" echo " - Set POOLING_TYPE=MEAN|CLS|LAST to override auto-detection" -echo " - Set ALLOW_NON_MEAN_CHUNKING=true for non-MEAN pooling without warnings" -echo " - Set MAX_EMBED_LEN to adjust maximum input length" +echo " - Set MAX_EMBED_LEN to adjust maximum input length" +echo " - All pooling types use MEAN aggregation across chunks" diff --git a/vllm/config.py b/vllm/config.py index 
7c8ed575fb2..6564121d401 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -3388,16 +3388,6 @@ class PoolerConfig: validation logic. Defaults to None (use max_model_len validation). """ - allow_non_mean_chunking: Optional[bool] = None - """ - Whether to allow chunked processing for non-MEAN pooling types without - warnings. By default (None or False), a warning will be shown when using - chunked processing with pooling types other than MEAN, as they may produce - different results than non-chunked processing. Set to True to explicitly - allow and suppress warnings for non-MEAN pooling types. Only applies when - enable_chunked_processing is True. - """ - def compute_hash(self) -> str: """ WARNING: Whenever a new field is added to this config, From 5536db0a21833d9c49694e0868fd1b4d279e29b6 Mon Sep 17 00:00:00 2001 From: x22x22 Date: Thu, 31 Jul 2025 17:30:03 +0800 Subject: [PATCH 551/552] Error: docs/models/pooling_models.md:261:110 MD047/single-trailing-newline Files should end with a single newline character 117 Error: docs/models/supported_models.md:777:265 MD047/single-trailing-newline Files should end with a single newline character 118 Error: examples/online_serving/openai_embedding_long_text.md:96 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "- **Academic papers**: Full re..."] 119 Error: examples/online_serving/openai_embedding_long_text.md:130 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] 120 Error: examples/online_serving/openai_embedding_long_text.md:138 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] 121 Error: examples/online_serving/openai_embedding_long_text.md:146 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] 122 Error: examples/online_serving/openai_embedding_long_text.md:159 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] 123 Signed-off-by: x22x22 --- examples/online_serving/openai_embedding_long_text.md | 9 +++++---- .../online_serving/openai_embedding_long_text_service.sh | 3 +-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/examples/online_serving/openai_embedding_long_text.md b/examples/online_serving/openai_embedding_long_text.md index fb83a564ebb..e15123004ab 100644 --- a/examples/online_serving/openai_embedding_long_text.md +++ b/examples/online_serving/openai_embedding_long_text.md @@ -93,6 +93,7 @@ Chunked processing now uses **MEAN aggregation** for cross-chunk combination, re ### Extreme Long Text Support With `MAX_EMBED_LEN=3072000`, you can process: + - **Academic papers**: Full research papers with references - **Legal documents**: Complete contracts and legal texts - **Books**: Entire chapters or small books @@ -127,7 +128,7 @@ The test client demonstrates: 1. **Chunked processing not enabled**: - ``` + ```log ValueError: This model's maximum position embeddings length is 4096 tokens... ``` @@ -135,7 +136,7 @@ The test client demonstrates: 2. **Input exceeds max_embed_len**: - ``` + ```log ValueError: This model's maximum embedding input length is 3072000 tokens... ``` @@ -143,7 +144,7 @@ The test client demonstrates: 3. 
**Memory errors**: - ``` + ```log RuntimeError: CUDA out of memory ``` @@ -156,7 +157,7 @@ The test client demonstrates: Server logs show chunked processing activity: -``` +```log INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096) ``` diff --git a/examples/online_serving/openai_embedding_long_text_service.sh b/examples/online_serving/openai_embedding_long_text_service.sh index e22cb933ae9..03feb485d6d 100644 --- a/examples/online_serving/openai_embedding_long_text_service.sh +++ b/examples/online_serving/openai_embedding_long_text_service.sh @@ -21,7 +21,7 @@ API_KEY=${API_KEY:-"your-api-key"} # Enhanced pooling configuration with model-specific defaults POOLING_TYPE=${POOLING_TYPE:-"auto"} # auto, MEAN, CLS, LAST export VLLM_ENABLE_CHUNKED_PROCESSING=true -# export CUDA_VISIBLE_DEVICES=2,3,4,5 +export CUDA_VISIBLE_DEVICES=2,3,4,5 # export VLLM_ATTENTION_BACKEND=XFORMERS echo "🚀 Starting vLLM Embedding Server with Enhanced Chunked Processing" @@ -106,7 +106,6 @@ vllm serve "$MODEL_NAME" \ --override-pooler-config "$POOLER_CONFIG" \ --served-model-name ${MODEL_CODE} \ --task embed \ - --use-v2-block-manager \ --api-key "$API_KEY" \ --trust-remote-code \ --port "$PORT" \ From a835f52286faaca963534f7b9d74329f0681cf1c Mon Sep 17 00:00:00 2001 From: x22x22 Date: Wed, 6 Aug 2025 00:00:35 +0800 Subject: [PATCH 552/552] The latest update introduces new long-text embedding examples and service scripts, incorporating chunk processing support. The README documentation has been revised to include a quick start guide and comprehensive configuration instructions. Server startup scripts have been enhanced with automatic detection of optimal pooling types, significantly improving performance and compatibility for long-text processing. Signed-off-by: x22x22 --- .../README.md} | 28 ++++++++++--------- .../client.py} | 0 .../service.sh} | 0 3 files changed, 15 insertions(+), 13 deletions(-) rename examples/online_serving/{openai_embedding_long_text.md => openai_embedding_long_text/README.md} (85%) rename examples/online_serving/{openai_embedding_long_text_client.py => openai_embedding_long_text/client.py} (100%) rename examples/online_serving/{openai_embedding_long_text_service.sh => openai_embedding_long_text/service.sh} (100%) diff --git a/examples/online_serving/openai_embedding_long_text.md b/examples/online_serving/openai_embedding_long_text/README.md similarity index 85% rename from examples/online_serving/openai_embedding_long_text.md rename to examples/online_serving/openai_embedding_long_text/README.md index e15123004ab..dcd66a9fee9 100644 --- a/examples/online_serving/openai_embedding_long_text.md +++ b/examples/online_serving/openai_embedding_long_text/README.md @@ -10,17 +10,17 @@ Use the provided script to start a vLLM server with chunked processing enabled: ```bash # Basic usage (supports very long texts up to ~3M tokens) -./openai_embedding_long_text_service.sh +./service.sh # Custom configuration with different models MODEL_NAME="jinaai/jina-embeddings-v3" \ MAX_EMBED_LEN=1048576 \ -./openai_embedding_long_text_service.sh +./service.sh # For extremely long documents MODEL_NAME="intfloat/multilingual-e5-large" \ MAX_EMBED_LEN=3072000 \ -./openai_embedding_long_text_service.sh +./service.sh ``` ### 2. 
Test Long Text Embedding @@ -28,16 +28,16 @@ MAX_EMBED_LEN=3072000 \ Run the comprehensive test client: ```bash -python openai_embedding_long_text_client.py +python client.py ``` ## 📁 Files | File | Description | |------|-------------| -| `openai_embedding_long_text_service.sh` | Server startup script with chunked processing enabled | -| `openai_embedding_long_text_client.py` | Comprehensive test client for long text embedding | -| `openai_embedding_client.py` | Basic embedding client (updated with chunked processing info) | +| `service.sh` | Server startup script with chunked processing enabled | +| `client.py` | Comprehensive test client for long text embedding | +| `../openai_embedding_client.py` | Basic embedding client (updated with chunked processing info) | ## ⚙️ Configuration @@ -47,20 +47,22 @@ The key parameters for chunked processing are in the `--override-pooler-config`: ```json { - "pooling_type": "MEAN", + "pooling_type": "auto", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 3072000 } ``` +**Note**: `pooling_type` sets the model's own pooling strategy for processing within each chunk. The cross-chunk aggregation automatically uses MEAN strategy when input exceeds the model's native maximum length. + #### Chunked Processing Behavior -Chunked processing now uses **MEAN aggregation** for cross-chunk combination, regardless of the model's native pooling type: +Chunked processing uses **MEAN aggregation** for cross-chunk combination when input exceeds the model's native maximum length: | Component | Behavior | Description | |-----------|----------|-------------| -| **Within chunks** | Native pooling (MEAN/CLS/LAST) | Uses model's original pooling strategy | +| **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy | | **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts | | **Performance** | Optimal | All chunks processed for complete semantic coverage | @@ -72,15 +74,15 @@ Chunked processing now uses **MEAN aggregation** for cross-chunk combination, re | `PORT` | `31090` | Server port | | `GPU_COUNT` | `1` | Number of GPUs to use | | `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) | -| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` | +| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` (only affects within-chunk pooling, not cross-chunk aggregation) | | `API_KEY` | `EMPTY` | API key for authentication | ## 🔧 How It Works 1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without environment variables 2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity -3. **Unified Processing**: All chunks processed separately through the model using native pooling -4. **MEAN Aggregation**: Results combined using token count-based weighted averaging across all chunks +3. **Unified Processing**: All chunks processed separately through the model using its configured pooling strategy +4. **MEAN Aggregation**: When input exceeds model's native length, results combined using token count-based weighted averaging across all chunks 5. 
**Consistent Output**: Final embeddings maintain the same dimensionality as standard processing ### Input Length Handling diff --git a/examples/online_serving/openai_embedding_long_text_client.py b/examples/online_serving/openai_embedding_long_text/client.py similarity index 100% rename from examples/online_serving/openai_embedding_long_text_client.py rename to examples/online_serving/openai_embedding_long_text/client.py diff --git a/examples/online_serving/openai_embedding_long_text_service.sh b/examples/online_serving/openai_embedding_long_text/service.sh similarity index 100% rename from examples/online_serving/openai_embedding_long_text_service.sh rename to examples/online_serving/openai_embedding_long_text/service.sh
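
For readers skimming the patch, the following standalone sketch restates steps 2–4 of the "How It Works" list above: splitting a tokenized input at token boundaries and combining the per-chunk embeddings with a token-count-weighted MEAN. The function names, the NumPy usage, and the 4096-token chunk size are illustrative assumptions for this sketch only; they are not the actual implementation in `vllm/entrypoints/openai/serving_embedding.py`.

```python
# Illustrative sketch of chunked-processing aggregation (not the vLLM code path).
import numpy as np


def split_into_chunks(token_ids: list[int], max_chunk_size: int) -> list[list[int]]:
    """Split a tokenized input at token boundaries into fixed-size chunks."""
    return [token_ids[i:i + max_chunk_size]
            for i in range(0, len(token_ids), max_chunk_size)]


def aggregate_chunk_embeddings(chunk_embeddings: list[np.ndarray],
                               chunk_token_counts: list[int],
                               normalize: bool = True) -> np.ndarray:
    """Combine per-chunk embeddings with a token-count-weighted mean."""
    embeddings = np.stack(chunk_embeddings)                     # (num_chunks, dim)
    weights = np.asarray(chunk_token_counts, dtype=np.float64)  # (num_chunks,)
    combined = (embeddings * weights[:, None]).sum(axis=0) / weights.sum()
    if normalize:
        combined = combined / np.linalg.norm(combined)
    return combined


# Example: a 10,000-token input split against an assumed 4096-token limit.
token_ids = list(range(10_000))
chunks = split_into_chunks(token_ids, max_chunk_size=4096)   # 3 chunks: 4096 / 4096 / 1808
fake_embeddings = [np.random.rand(1024) for _ in chunks]     # stand-ins for per-chunk model outputs
final = aggregate_chunk_embeddings(fake_embeddings, [len(c) for c in chunks])
print(final.shape)  # (1024,) -- same dimensionality as a single-chunk embedding
```

The final L2 normalization in the sketch mirrors the `"normalize": true` setting in the pooler config, and the weighting ensures that longer chunks contribute proportionally more, matching the behavior described in the README changes above.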

z1ABa!A{d1Cii5F@|N1`f*vxt-~YBV9;c096L<|8O!H{ z*lYjg0{E6iBXz#MctR!O%pm4+ak@(n&%Oto$;u+&eK|@A6k`ll`1z_7Pse3hxkTJ~ zDxYt&4MfyQQUH11ea0tU$pJq)5ce_13FQ}l1hYT+KuX{V&mW8AFlq2~J4Wg>we|Hy z1(4a_W_&_0o?SxR5}t#Gz|f8MQs69Rt7xLWr`i00lwG!|sYyI8C52inv&bC_#QPyU zBBj$1_cSIyvfXY(9ZnmUxa-OP+lD|5WXruwRf?BlEu_w08L)sPIFbW6V=%tmI$Z1D zsG0Lyq7q|!Ng(n9)18MGcE^6YCtX79G2mqQ|G5^fffC!pn+RV-HvXW+a1L5qu$h94 z$b`F!xJCbnXOKZx{LG>Pv(uQNsPe5OExFBoGDkfxB>UA)1g7qUvjk|>GY7QLVC~(3 zlGd>u5`=}rozKUps#qOWe;zXa7k33?(xMB9x1rzq!eb6G55Sf=6U#(ORV!GduCA_7 zIEfLe1kQAFM-*RDs+{H|J5YxZYx}Q`ZkcpQy*lg-UFXvx=EoxwgmP|E#a-(FseKree?dtCX$1p{>++9iE=-R$q+ z9=tMR%0CXO*NxqNf325TQIXhYEgudLTgZNY!C9*Q%fM9Yemtf4_`4sGBB@GVXpPkM7Y72|-RkyBhUtl7FTX2Zk* z-j|p;TWR&1Bl)W(;dL~7xZU{)*$0ie%%bRBZ+${gi~aC!3N&PTXrm=-qmg zHf`P>oZUIzqoJ_wA7MexHemtzNi-l9P_DX$di((NgipI?F(a7{YQTCVz>ex#^a5^g z52y)=K#-(7m#0ELr4T&nyYTwakC(4AQ`cK|Xk$gai4;>FVJxI-SCp#h*iRTrOfQNK zqm()M}&xoE_1XG2{K4YP-);WEK3c>%@*>tMToN(n&4qIfR$nV$}HiRRh4gSLC zQPeW7*TTVvs_3fzH$BgQp)_-3szA^I=~*!Fpiu;j-z=#NscZm%hT3zTd2*ZPQ0WvJ zOKG1P*4Dg?ZR0hCl^YozwdMkM?>w;@F_Jj%OW=qNl;KQg)I7DK`f%@cV9 z$=DiG50M@ltIOYf9}4elB(q}tLo8+`^Xs`K(BsY)!^=IP6Jd(w#Jzq6$XlgpGwlDG z>JUzFjv@+V0~>=LzWZ#mzd}0^EqurKLaVmR^n=r8Cee+KA4=AAeDGvtwNtsTgf3S6 z2?z3D);AaD5ji$$8PR}>?zXzj+x1C;wQ74Nt$gK-J=!dOwL7f_@=CADNG2-uXnvO7 zZLve9>&D{I!MaogNDiVhBa&uoL4Zc2I`CIMdjp8iPrR$0GJ#43^Y%C3&Es(}Kbw3^ z71Jn3(&YHJILN)%RQrpizWv;<$loS!5f+snQ4FJo4K3T#Y8CNPEmmEQ=LIzVVr4a) zqfRq{K)y4uh^7yWg8+2a+Xok#c)U#Jr z8+ORTSBL6roPchs+iib_lEWxVz#Tc50<3+^j^COTLJhUrE(QW5#MlALas&}fqEqA4 zQrCZwCT%yAljos$sJYTaiclyzUc=oF(aQ6hj;egqeh;feN*(ZVSXghpXm;5bUqScj ze)QmK$N$fzDWo@Yk@U`*Y%|aO-!(<1#ctoaVZ`3Aw|{5II5=PaoRJ@oq#ZMD^p+ZG zN}CqBvErPJQ97OV@!NO&^0%_hd7D(@b%<+`;dJg*Ns*pND0W1uG|=bNoJl&U<3QY^ z0`q~f?ib*_;X)_`6U&P#3EBNVezVQ4zUDv{d-=}`#V`dX73&l|+MSjV6ejiHX>8(< zO$S2qjAgjUYLkrZnQb8S+@D<*kfKluCaurUx7W`Wf+~T2I=R}Jd3&lZ2ErK z#3(QPyhIMkOr^;CNSn)iXw9+b61_HzZC{|5>{RkcURH55Fv)Ku%b2w*Q50W=MljZ1 z*!BlpUJ96K2vvC?o&lCD!t9C4${73O+Jj~O(NQ#-WTDptK%11G%ExW@6DFD3hjDvA zNz}#Yl0e;f=I)?bKOKRLVtgUENgan#{+V`Qo8qRpU0CI+3;mA|%u|#f_1cNEWB%9% z5Cr&{sN&liqQkP_ScUGGF)s*+UvLd!Mj1 z9Y^LZ^Oa9n1XkD|lX}$G05gsN7g7dap_mgb(n$#sWG_v|DZ_#x6VYXd?-gr~6J#E9 zCTOgsr{fV=<=$$=(M$_b$#=}QeqIXvT}<4Lhp@1PcXcjuKEB~T0O};Id~8p*Y~X&1 zNEEVW3q!LZ>ig^w6)lN_2%4zYaGvjZpIr%zlE%)D_Oq8qe+NVBju*6LU=my3uY4Zr zs{hbTtu8S=?ClSh)RV8{P8Un-AL9O{SRQ_hNZH}!{WnVEykGRud%#z$Q!R}vHDkF` z>ys`qd4bXIbo)`KpS&_%<~jCN2!qZm0vDR(>cwxT8)LJJmkne7ysba_BrY`pHzu~# z2^+-#u~?byd=Kn-dk)|I-z*jYaTTK$XTjFU>iKmeGGWIVr*fIAaM7EfWa&G+(X$~H z5y|2jUhgoGm76m`b(MBMU=UTB-h8=&g`|W+B^a^bRbN$2&4qKgjbd})(ig2lc>({? 
zdx~i+T-S&FQX`1BqR7R;gf#If2mx<0v17_Gvemh0E`gL;M2s%KP9lCt+;Of!HAROD zF45JyzdR^+6NVXqFlrRq%|^H^*wSA@$SC=3a(;G+qIVpk9JtR0F{l=FzHx#SeVO|Q#U!O?uZR7O=`wCzdsmsd8g5&9YW zJ2dG5Vx)LH$UDF zE7SnFJA5!l=|>`;|7Jg+;FloF%#8sVg2ZvC>*v4F3S-hRTtv|GDHzd(MwyvRfl)2r z@$TFlCev(#m&I2$rvmN#j@jAt5Mufgdz=UO8VD3=Zhy)DoLr%Dv@ecRH*ANvKYYB@ zRISQ>Gz-+NuJ!u^QLbx!8sGxA1YO|#(C6lQ$fn>qhkmH@JS2{7f2Eb{i^T{+0|kxA zh8BfU+}zwOZ*QyE^rgNZGWL)~cTcUa9R}RFIHeqR6+jM~Ni}rMTvj07aor~Oow_ZF ziIQ8{4cqX@q=ERDA(X52!WMGyBmRj!Ms$m#-Ch81P%ZrQg=f zpTEapP_7I||MrxdGiOHJS+W?I^a8u}1BnS7&2<3;iI@lIr}m+@f!Qj`Z1a~^>qd$v zPh!qTB24aY7+;mK6?VD#ANI-5+kcT|5QoPYoc!d2#AZ7v zg*$tXB6S%AI=9C(;{$+?` z5rDB4PTl$uG#G|~)D0s~t%j)5ZOKt*vnHUoey?UnYiYCr5Gl#wkxbe|Vsei+bYwf~X~FJwD_XkX7A- zAm^dkzT&#`BzBJ{PQUxs+kUu66x^k<_;K2w$M%RkPij2>e|< z@%y*e_Rh|j;hZ1&T#{G>KTiw6kFv2pwf%oc`X+65-|Witf2^8aX?7;F>iQI!cPp}D zwWHVSNoMEBV4dj(;O}IIxUf*RkS(SK!b)R5k~D-JEuBPe%Aek{x%5s84<+778u@)2 zY0K=+Sj_)zlb}8$3Ew^XPVn7t^OF6*b7w$?mP@QK^k9iC5G-C`#A|t}u81 z1mD;CpKsUW07PcXX&|(NNT$#U_!auL1OuI}tI{r=#M^av6sUkBsg6x`iDJ*S5$XXa zyL0`l<+ zGmo_cdx%~m;?(@}3uy}aQnX6T)<=&V?DZNSdgqU5`O&bBx?_+MBxg1l4B1=|L(!=c zshmKxYXzLhz@ioRo|VjA;<=Jn+rt#8lnUACDm1x8+Aac0O34hJ=fvd-t3?wnD^Fm2 zs*Ms84?zrdHOF)=R@KgW5oa@pesxx-J3}{Xb|4wg8aOH7lqai~S6TmhanWe|Oo(0R zUSe{sl^CmVZX)!QOh?37F zD)*|Za>zY-$!asKOu4C&GKNYVi5jkaZxQ?cNPeZ1q=oXO37xOVXPnhvt35cIbX%*k z`*H8xSjlzAI=;S_h}1E81mVtw_krfQ+ZrzR;00Ijab};S)kVG2+PwqKwY&4q4X0)w zGWo*G{H|MQBDRNXKeyk%G6EDWmwn>Tb)^PSsi(O_J)yn&HqPUK(W(AHx581bApa_b zpc^PTgEBC3CHrwsxAFK)VPcji+C|nj$5DEqv^r0K8T~Jp1O@G6qQ_t%%}jWRRWc+c z`-w6}Q86(R7~Tu8^i{r` z>${DE{TgNeT zDA>FQ(IvT63~K(!8)z+|b**)+}+u)LRT#3j<%6ZBR;;)Ow>d(bw9#7Z@s`=gQVv7OrbSLjuGr-`}|g!qWGK zW#Y-4+}b=$Z(;K~DL;nt*dbkE5q#I@ zJ9CX`A^H7rCx<_hrIfIcdRoSmhW}tdp!*YZ3n{&)3Q|{&HimVmZp6muS9C*?plTVC zD+ga@YF&pxL#=m~)$;Oadpe;)Z%_6Uzm4jn(fz|ies2P9Zf+_bGss*}1Y-++tPO4C{z9&b&4{;3m9&cklQ?M-@H~CPq5G zMcw9`(BZKcJJ8<7KK1*llx_w5;-?|wNrzrz{gr&DSHW&{%D$#>RlT^+m;dy<057E( zJ%$izJAettq!fW*@W_Xl;-RlLLvV2|8zJ(EBxj*~(&|6n-P&00{Op4d=|Q!PI!Q+D z+LikuJs2x2Fah4taGsqEp;uHk0}viG4^)RIlkmmPOzraGqAU-$Ja+F`v7udGs%RZK zIV?YdlK-yqr{hdjmKAbUpAx~oT;pZMr1Z<(%mmlUimHjY#6ZH{QUNOJQNCfmh2ycUAfwuqEKv5 z#b?v~!;Nt42sF@~x_W*H=sjrY`~)`WM{2p2=2$FIbZzQ5#Xp%Jqq&mmRarJ6qpMSq z0x#v=NBzf$0-$5h_1gXcpwUuv8OmkS{H`x}V<7_48-SG!wa_Us(yt*0D`$4ORX36V z);sy2CE_l#qBCW5{7T>u%h>@YCZ;n>geb4ine(IR9}N)4g6G9u(wHE_;5!Ce@X1uJ zfdfW4WQf56>o;QFMytZQ*9lHk?t@3l*3veZJjlC>!-k2#dTK=hQSmxhzJ(rl1p2ND z`ghYQK_fWC2(0dFe7vriec*lZEN1gh5qX{Eh({~tVQKq?*qu(&r=tufUo_X@MgZJ{J9^i(kWw>@J8N0t;)Wsh@zKlxh>+qAYI{hOM~G77Q}rhw_CcIoZHGH zUhk;|_Ng!6FKDRk{Lf4THFMZ+C&7tKe!4Dq$gpdEF)O7mzKkpAiA265}Bj49g4~h6BMk{K>DJB%J*Kcf|^(4O$_K2i&n7nd03% zS~yFf28&`O<@<1i;f+mC-^z#Oh83hhEdFZXfHsrZUHu}sh&53iV;O#gXXV(zz(9bZ?JGzdfDR_h@L#n^;Y&Xja~k+&g+pnpA5OlB$++FCx`vAlXw^%* zBM1u6QC}?SO#K%+smZ?A2xk4xF~+N3G1q8X=CRm}^PC_T&`9Wdw2n3Si8x^t1^;7t z>s#reYZ&l%qzWJDtg=m8D7hexX7Ld)MjmnR${l02ZFDo(cfK}|DlS@srZDdfk%k#U znD^FXgn9vY;+FMK;BK|LIe23f4i-OtSfEVz=aE$WE=$Los zGIQW)w^`?sG(ALvT+@23)lvx-rC@lJBQCAPMFvcmK(qIma6cOwy{&PS}J=f1{b=8NuKx zFB~IyxDnBjC}n{s;eVn%ZY{}_L95Y^w{iIXSh&!*bJ=ZdSCsCHmCxheY0dsS2n%** zE1G}J?`k|BVuZ!Bulp%Ro;8@{>zY;tDkV3*Y#DfuAezy3FmfSj5~B=8IqZ|4gTz;3 zq_nxwRTykJ^ohIYhikQ!c>034Ez~V!iR9GS>F#TNv?p6=R|65DD!p2Vql4wXvW0>K z1^X!$S^;y|_rviH(WB8wgGBk8a)YtmA<^m_A&%Mxr+nK~;;dkFx1POwh^>U#l3ZBG zDdD*Xlljag?-mSIAlhPVew}SQThRM_a;%RFsZ~*(MnR>Ij_gSjN)js8m+75y&*CHF zaHZb6uCE^MknzoTB^GwE9U5BPOa$oKpNCtk$WE^HLF9GK(Hy##D_v1(`*C4{cD1>t zVyZxrXpO-URRA==nAo(@r>}0d!6Q!G+k^2>a3S0GKYwn*sD+!PX@j&*gD&uq)6dcG zNhI?n{+px}qY!`~a#t^Zr<^p8zM=}aEqEtXFq~K|Eb@+f_7elpW~-Xz_hl6o*+u#l 
zF4oGn7%QghSymkCB`ZryBwJ+{>h0RkS_U%ZY9k0&yW(+S0npAF`chV=pu2h{`3ECBXY`~VS zyl(d)8M}T+>f7U-iPwumrg(k5DUwMi+f(wdwY|6Gxu!JFIRv#s>$peiU8xK7YZx;v zh;C-SxO*39F%8_F3%QANPxZ^+9wWy zmTBRRlNXIzzKt96v&JB7fXRNQD`snp8dFT_8VzCwEZ*dwea?eRa#UD-kH|0T{S8Fh zQ5U5zs**4ZwotlU$tUvgAYxhfXP(Wyt`q*@i4Imgb2(rjlOa)XigJ zRy-8U4)s*Gs)uSLp1!X0Te-`XBo=fd;`*t4txGP_5UQDu9;{us0cOI=Ex8n6L8LK8 zQ`_-s02k`NN*YOr>|7FXnAWhOXVK1i-=D9a5(WI(T(L1OcwetF#TB2%TVJZc$ElgHNXAm9@$1#a7v zTcd`_rANOjd@BO)zV8^!$3ZD4W%y3at_!TLjtz(+QRe6F+oW~5>16!Y4<#$~xZSWG z;BU~F&w)OGN=LN)!bDQFqn;e#h323)lV{Ifhw>^5*+@xodhQ}eqOd=vw=lZ_THpg= zR)vFuw65s6iMM6zz#dc+c40LCHZ^E<8+v&B-B(iXq85uQbxI(Y%<$x^Y8ZEcK$XWi z(@43mE{Mh8%Mr6F{1}~1rZJbGoED~Okp3_3kWittb|_S+o_9fyg3oq5&|n*2S;U zZUQ+8c1oq(SBZ$htSO%tMoZEIanV|LJGgi!H_aG6wYEPo^q{Lua~c4GtFco z$b(1^B>P+}jv76$a*mg(Le4+l;J03^MH`#UKVY-uV-&-I*tf496Zk=Vh<4|S8K#>t z_cr)_7*If5n@Wb7mT$vGa-af)Q%yU64DtQiwlgH{Xrr~==?G!cF7usB%-(n|3gwcg zGVKN@Q6L&bTH))QjIbEgxV#_qLd#um|M0|hqZ0b&m0112{(T(a2l`;@+n)!orptg$ z+X3T8^K7ys&u-Tw`m!ehF@oZc(i4~si9pCjvVH!6pruZ*Thx$EPY>MUx*+FD6c+!4 zERhOeLFpzIv3;XGQ~kudb6TLy#cA0SEeo7a~}tZ#K%RC@fnJ0|$eEAjAKTejC;uXnNJ!UPq}K+}lkzhn{+t zQ{p;)z%(?1>H2`v@|&%?g101}GsL`13Z??_2NKdOOLk2cdCl?Xk)F*loi7kLaMc%B z<`Uj+nN{^s&L{Zjm%p=jrR(0T#dmsbl+xcu4xLvl3pr!v^v5PG=#OpHBGksKIYSUB`|zF(5`~dT@vhr=XwipT8x}6~%e0H4RDyOa=JaUaoUN<8h_{9#U6%W@`Mzii zCBN?vy79xtaR&>&xtm1mWxb}XRrFl91#UkzrCj&D9?hvz;q!9>?h#5+J8T;-yTf0S z{fWyCZgdq1PPi_Qn%C$tkv?k2E0h29r_4xDEV}ovVxK%F!2$!^0ks?dCdBpoTMA$_ zbQLgIKyvtRkuWoI_s`C;tt<+`98H?DV0y~5?*_|N?iK%5?cM%^ac`=3MLQltF%IIs zS~BxAe>iu^Nrx*v;flQIOF9R2j@qBhf-TO02?|OoHXJz?^9Xnfs8a@F3w%qHDtMK9iC1(k%yD^-A_IP>PHWl4S&^ptRnbswgaqLb zLOS_Ply!v zy$Y*RcxcFz^2k)Aol^a~OjljUf?%d0B$d`?olV( zGi{WrQw{r+PcwM&rL2(AZkJplK6peiPSSI*C*N{ugZ#+t(YHT1 zq%=Wf`x0r`>lz2>`4D}Tv^a#!e|QLtamu&e_59DsA4G2yp;*ydW1T;~Cm| zY*?w|%dvLnC@iN_I8oUcZzSQGu2i6O#j!SEki*ay+j}eH6&0^v<^}jr!19z|5vk3l zDA%saDgC-Ji!1dEp=fm-8FK1sFZntMHJZhB{ zeMt}CmwQYpf(niW(`8AW=fRlndzz)^3%2U4BG*Ta7W0`eOfxB3@nUlrqrHL>-o~>3 zQm1r-@pM3B-%~fJMGNW_KbAyF&4iXOcRYKgo%DILWkI{dMbM8O#|;o@{e;;E57Y1>ACNuQNt0(p&MiH#%=YJxTQFuMiXx?pZfr1QFE8HLF3Zx1eF(k z0s5LZ1*Ds1(T{X44u}3E&NjVHQv&~bsf3cw%ifyk$#Jl2;5e1@i108W+M1LBPAp6m zmYb!3i<&8_kPBBWX*gH$qdPj)3W8Dz48m#Qef7dlktOX4=gcddXZ2m;$RFLI{)3z= zHo~c&GC@%N*iFxf~OggBc)@PTF4jQo7T+G8c`N;s%+&s|HP{^#CR^2^v8W> zNySDZqgu?Kzd8)*QhR*7f>Lq_j284(n@VcoEx#O@wD}{Y40fgH;V=Rc&+m{PNGkjt z98R?Bm8z=} zO4#z>M2&J3_qE)RmE>axULuCOS=;7h1!uDWD=_0*ATa5uN|28Yw;e1|db6XKlT>4$ zL&bBSZk0dDlt&|dxLH^CQNK*f-o&uO!|P%`?+96vIKj{AGWSj$X+=nI0}i+oNwQsa z^xEVutu0epq9+E3F>}P}x;C*kpFl)fmMM5TN3JT1=u~h#hg(Xq)v=oR+xXlAhx_-~ z?0=`K%jRJ5xzC?#7rBY1)3=aAD4A^1eU5gvsqKe5gS(Tk3pCk14;cn-e8$HjK{DCv zN3Q(upt{`YHpYV7^p?24!ZTpq@%C(7W<4({cwWzbwWIYuFf*w=Bs2SyerTgsaT<|H zXCO+&7s>LG_WP-{lXNDkE%KO8ko(%E^(Q$@?QmI!$@U1(kuH;> z%{~G6NhVkFmHqRS9-|axj}y}3-KPT4S7p-%N_2vtw(lSQ#U7S^jFLNK#m2AwgG9o5 z0hPGjy5{*k>mb0eMA88Fi#UZ}^JB|*|2j`Rjs@EilcJqh)IT3Gf!eBGf*>Aq5EfYl zjQtdum4M7JfDMjPRFysBd-AXMX$JK^&0OBEAi+8!!zCO!*R3iJ(RajK`FI#)FY{^> z&7bUSVck37@v&gDZgH95+CNmrj`Hi|Mm$UyK& zwV(@wZw?FNd-u;xJ$6EU^WE!&Q3Wono&c!lomAe*1dn`=0iLGd{g=;1pVw=>(G`Lc zy)tW{%E24hvf5;SbwQBXDDk7j$+#ICh*A^ru1p^G^uN^3PUU+Mr2@=>7aWblfQ5XF z#S#>_o|*$6Xv8zeM0(3*`NXv>!||Co>d7rHX(yWDiT+h&9#ZOOk_lKET5&t5E526%^ zgKsab_BE8zT;W7hru_i6kdHXk)raP9%p`plFkimSp<4^h4EC~>3xWd6qU^aL8-nfa zJ06WWa3X1nDLY{Y>IugykYPR2^uBQv^t!DFG{p`3W>EA!J!H9%?&7&dwsVVu;P%~-fb`wD#s#4H)9*|h^&!KQg=0)5uY_YF z_1+>YiYZ&du2;M`5jfVnLqVjF5x?{rY z>WQydDsU{g()lCDN)6<1-DS!88@xzg(u@-<*RmqNzCJYtF8`u^R5FucyJ0f1plA#s zdgJ(8-MN6uePi-8<5mQ4LjVgQ_t+%7i*s1x=py~e|%_)E2zd4aFm4S4BDgHQvafwYKmtKlQ>R&5Be^00z!H 
zm7-opx2<&6TK)?oJ$H}=C{oN_wy>jv6<7PQ-23iMb4FHsnDMvlP(ae#yj8m`arCDP z1k5Vwffop6ddqE6V3mhrlzCgdyv|(>=10&^ym@7^n!4uBnXtF^3pN6>jYZ9TPBUsc9n?^iS@!#AeBn$?VExA@hf#9^0ybczSKKSVo2g#_PRubEzKmNs|ToJqSsuNeHcacEcYv1ah_Y@=1-zhIa1 zcp`SadmhrA2Ju2r?DHm$53dk+s^K*Rb8ywF&nsQfil$M6r!t}aHXXMA&Ds8aoBK(1 z^y85A$~VaTz6l#_&nd4qN^JFFlj(p{qBY3|b)3yMF9^l4Rz{jZK?Bikj&EL%mp?uO zvhDn8@qejjUdp4{*w`eCy0YZ`Xn#)p>TRo;H2V1?&Y;4Rf_QQ$yie?o-V)<&%&o+;h{1iif zowwS|9Tb2MbW6Je8fQ1qv451(%JUUYlUh{d-h0pU0 zUU>`t%C+sN5oY!^?KgF~k+csef4rLU}gIe@5FK9_`B^#VkoT5==K*NnX8x z=|L!Ap)eHyKnuR5L*qQ|E`aC@RLa}iR{1he#*NZca5KG)1@(k`%nOT-KG&AZLfEzH4 z#@oCHZP&k#KD}&kt4)eWPs`n2T8CRP3KgIZ$2kf-mrtINb{X{Lm*ZR3FkBGQKA}w> zT_T`qbyQ&MmwtcOWD5-!at~~34C9V?V>7q19zoEh(M{+Yaq7N`@+-O+h1->VB~vj51mz<|#CpOU#4|Quk2y>s3L9U2EGO={9TS6tb#x&< zMlv0(0d@csnAGW94#u520;rXsqHG+1Lpgx>yb13EKHr2^8evw7&E4dEyfpYkAhWP6 z0sH9n*mwCE2c&p1C}>&vs{9~wT8BeY5YSReJ$#Ale#IJ*Db?(^qKq1=&h((DZ8LH}+uW4Ix{~3U;I0f;s#^T0l@?a!AV$+e8?~fiN3A9AQXYf`YLxgb8 z(i=2$R4Msv7PiuR3VJrjdN<#>kXzzs{>0IT85|*Jk}E7gR>pTzXqgQTvs%mtbv_Re z!Evxh)wn?#;k-RY`(!C2${a&b|j6aJjF4GQJ1aGC8yl zXg#4vLhib~H2K&r@_Ly;jut{JeV34}m{?CfR3{ftIUh|uqL*Yv;6Ece;i^tFW@i;Gw=?#<)$+)6P;*odb&J53A;AhUCT@^RVI4MvPnC_O`RD`yb2*c5JF99s~ zd4Eg$>j^0^!haaSs+LX{)C&bVZIBH+AXD_+@e#=T!CSFfwml=ZES%#z04yN2} zkoOyo-S}rl2SZ=1pY&zP2!Gk;u)tQgT>e72Wt<&n3X%?D01zo_!qJTZvjjX*EVLH| zctPeCm6)mVh@Tb7iWXwIin98dJ^llVc!pku>KEwU*T0%*S&ivkw!X1x>Hg* zB&0z~8U>`JLApb_L%JJjkPbn*K@lXSL%Q>>V|~B4n|KElyNIwlg4ZP`nd^hx%Tw1IsXpFF2M;#rtz4r~u3tkukYP z%4T?mYPvrC0so<#dK!pmH8r3_~SYk`=x_hZ`YRNOhj9e~pl{{y=gFqQLdVb<9(ju?_Z^ zqxKd=ot~mO0!s`~Ba%~v(i{!$m`A>%Jed(iS97T(ghp>Q=H*$_g^d@d#FE0{(!Z3e z2M*wuOqCS~cwax6KuY4XjXT3e&Dt)k^txTe&k>%NBcvt<+=thqPKHM!W`^Q{P~k1@ zbSOR3A2ZEe0BLs2SImF{Z?IWOQoieIMlmxOpilZpvxUVhN&D`gIB($qx^Xmb7zHKt z<@pbkUL?lqx~`4(r>On$i$1Go$s8r60rPv#9Oe|)X?Ka*36 zV#Qr5GbgZMyBo^jq}WEi2{{GKYuZEYW551vj?%vn6-~G<8axzKi3L3uBPJM&Q`->M3*%V zd*#ET_gU49co_MVpdH5N4XsEAcN#25{GW&wqz2LmwC5Ch57~ajs|sM|Tipc>h}iu> zHDV^a@*{eEDn#zr+^9y*Kfl!sZ@7pTkq*-jGm>GuJJ;B8s%#_vj!`Z1ebDJkJqp_a z+k&=T5rvTKxjVTb@~TGuZqM}QLJ8x7pGZP#!PJXkQ|;fx_C!s!m z&Ic?_2pyVeVllyEPqZrM=E*ka(>VEP7;Dg>{Ee8@&-d?9^LVqJFI@VRFB{{{lbglC zR4?5?u-JsT2hqn!;aF=RR=8)3&kra4$bKfT{@2uSpb!wR4|p33xE004$3DpY2N zE5+Mt*Q2O}D_YKJ0-nzLE&g|h;hr66=0)r8IBlI7MD?BLHcX@mn=q(f(`5i%u#Cx* zw#W58vFh8XHE$Dthjisn=VCsSK)wj!g2~{G%GvVK+5W9As9oXiB)e&ExXW~q z$?Ly87S<*o;$m9BuzK{8kd^5S0O3M$;pXK3Vgbzmqkn)YkK4O`yAUeN884DRB7-m- ze#sN()VrGP?Ffw&Rqo0eIs7{Je<}xorF{!J%6|)LOcF*U?Nfl?tt8V7lO7(CIf6pE zJE(yYV@{bhFb0vJ7n%lydLXDvp$ftx|7|A@nt?{7X0GtZszKGxSjfQ9pfX@S;Qg&H ziU~!tVXM8BS^HOoEDj5eEcwa@iK&496~N6%B(DZ&>So?(7Smy(L4S_Mt(idgceTHe z0Z&YL&h8pUsYL6#W3Q@hd?Yll+N`A4AISB_QH5 zd?cI;ge6p_Wz?kGxOlX5-f;uCloX)q;rZyG$7atU=y@Ea1EN%K-`{WZ{0~Hnv$uN8 zSI45kayJ($LrGZuI%l$&GX5osy?swm=@iWKFLY|&uUWMfp><{!$-92B*<4Q^BS}L4)yqq(s`eb z0|JK?0QO1#zM{ne&1dH;*%AJ#2p2O_&0N1PER*`=F)r)-?<>aCUF`g{bAk@O(8MP4__%^kgo-7}uZ zs$ZB=Li>?SO8X~>jMbc{G7DzX7uti0^;#2Y`bu*bODhK}<&5OopIuRd>RH2}R(NoJ zs@D$v_>7vWu-g1PjMc>>wWoLXrXRSP-ER_T1~nz$B`8#EP=fiWszCV18Ek)u0I}~5 zBYX;&vNCg}W5=saMg|+?%aHOwSXGSv1!tQMm|nt=Xom|pn!!9Tuk+x3Tt-v7KyOH! 
zaqqJhfk60%waEj@$&cEIP#T0ZL6@-kHn)LANF{&-TwY&bb6HITntm7g>KmF-AQ4A5 zWW8Br&io6Y9C<7(}`BpD=wyCxNc+4tE$ZCaqqtRODupAaq6nA|XK1)1}qpdv`5P=2xF>MzwN@q3@1s+mAEog(x?E zkebkG%)=gEZ(=j3RQB^E@x^wd7GzToD0NVL4%;$$GSVoTTxR{+aZtCJ9BDcCA;)v^Knp;&};L9Dp_Eym{7W2Ch#E30GgwBByW1cc!5GzvTCLqm5Aj#_*kKMJ*u3?1+Bjbg{k*Mdx}Of)-J13J+# zp(KFNsN^$EZ-R$pziO;% zwO3+p4$(9Apc(+uE{_l3=IUKFe~xG}WFI?ieMx-P6D;d>b3ywLlJ=uJN7DS~YRTlH z3y7zsn)7nz$A_*R@^)NIeX`L9x3fGJFi+z1=Cl2BK0$ji%IFw)g-S4-!EAU{3q;Ob zZ+DQM4niSc>a{E%MJdmjwh+C64$7^}HUt!`GXgA7?5%Q=VuXUA*n_Pfv+Ace|^px2se)CjpJ%L~d!f*uMeu-p@; zyg5vK%_1bB{q%l+&adzV-a>=H*tFF?!hoJ6uaPC$Ssg7LP9)0OV?bZ z%#kowJ+)V)HT_Cb!v}w2i~1!)_za#YxGjJQK!O*FHwehBCD9$L>_Iy{G%GDb@fjJ+L7^$7?5yt~cxz@NWWKm0MkI7=|=k(=&eV&ZfH_6Uzarr-GM{TFzGm zKngDPz;EZFcPL59)KMslo(&9AnO>s^3C=Z4=E0*4g?a3J<@D=-uwe@*UnXU@SKjNH zfzxYm<~VjoqCj?ZHJQVix0Z9_1-QrN-=_njHaVxfuqj9e&Si9|N%#hYaN-eLk-QhAZCOe$OA9b&u}q*c|7^ z{M=l;U|y^#W@NX2QaL=x7})Rqb{of_AI$NNjWjilc#4GsvrC7F+d9W!(%~K!ixpg< z;|)>q^&YMZDzQB2e4EPER3h}qN?2_)0Z5@g(0iPW>;lqJhjqvzwSLC}l0DLk#;Bzs zwhVgFVZPJX0QEZ`S+gaeK)~z2#ce)pfMh}dLB+}b}X>gn-)1xu5LA_7>fIBxcaHALEY z`NZHe0%P|lCYLayd@>!$EP}J&n)-SyAlu#}4j@@t=sBT|`sl&AJs^7E{a&J*qZJ2) zLWG(iiF6~Wt z@(|fRw1?ueZh)e*9T#f*lM zEs_yt&(1H;+~1nkj%Jh$=(zQ#@{(JiA+{V#Fz+$GL|Iq)idhI% z0!NBg_PMaA8|&amED{Y5kt}$(1CFnGrsq$D=d5>5SWC48SQ{`-^S|?yfwDdb?jNMn z;eZD~I++6e7IOWH3Ekt@{Y?G(^>l4=-!+(qMIcE8S{LzaMR0{V#(bvKFTL3ULgRD6 zA=NG)Q_nl2=nV#IPP5n$sS!x=Bv)_@ji19UKhx|Ko z6uiEaI=(kHY$ml?)e1vraSQs^Ab?e=E=&*b!1FF#Dj_bffhgyU<7Qd^)vK|arjM$p z((`0f%bE~L@38wo+eK=yI@ve<9MyCmj*z*xa5Gl~2~^eO6o0K*KMFpI48o>;WZat? zP~&mY;w&AzUg>9V9mofePDOGx;VT~v5KLGQK4N^t*%8R8T6e!$KRAk_KU2L>Jt4v3 z^h4MVIk1r1nw^U;KOz*0S|`Ms?6{0(KJ{Zi6v{nCB*S2PFm0EgPW)h1gLXteOWSF8 zye3PK`Epr3GeTlD4(cG1HPTm@j6AXf#pEeG)>{dW6mRNG_ZK@ROEg&Lrpoj+?wp-( zQ$z4shxP3NvnI%zjfNBZc)4gX-a{o2(bLNx!lPA8AGCk$sGB7gU^<#9s#>6sW+VT= zs|eNrxDjeR!>n!BpYG_cFV-_ISaomiY@(N@_Sg4LPWCQ#M$0t4%G_8!**Jn6_vIur`)%?Zt$A0GQvcv zjGanrUD1A`nGtvXkB9oA@lX&%M!oN7Xhbl&W5IWOJ>-9Q= z^3G%!5Z%k|C|Nv}S~w@9)|R(W%U(@6E54x0KYZ38`y~F*?18?joXNeJ>|%r4wu5$-kYrK z?tj62dNYEn3~DOIemTy19EoMCrKLHDXJg*;K4c%1#;`c;emKmvlBIM{y1=u+PL92* z;$z%TM7AQ;Yjw^i8>YM`6J(6?+Mb)lAl#F)L`4&ZglD)_lu2Vs_wvpWRBnR!tMPuI z#ZLuDnz>Bp3k zVPqf=0xSMFtkH4%IiQB+4ZSzz;_R%n^R4pYqdr8TMEygy6NBzmr0{K~0JzYlP2PCR zXdodRWQeiWY*)J%aib0ydC9ZDxlh4#GRA^Z#^)+BL%DOe)6}#)f`&9W(bGB2bQ;}B z6|Z{<&g~)*t|V^iK080&@S%zjDXRl(wBMW#opcpWacMDWZJ~C&Onk|+5Bz-gW*F>7 zfplIEm{p`0^*_Nwp_ivV7#Jd`GDeGizWb!Y$sC3PI8i?`V2w#Uk1{AlR?1v;}A5 z?Oaerp+`?j9G6Khi?k}f&n3DTK+HDu;*9a;uq0|NxYCycs$f!;dFi+(4z!1r@6?gt z4SShHzecp;qX=mEHle$DX~idzAIS_{B3XGIho^A;QXTfpC!Z938=dgLZ?L%S?ou03 zF={Emj@$4u+h0}~l$a0Fl|HUM+-k@hK52}X{dQ$`aG7>3`AHB$Jd&_NcvYa5Yp1(J5({Vq}e)y63TX5XYssapAir)l)|iS&C# zE)0x6@nCi`n@1*vN_8vW%P!Ph0Br1x!&Y~a_%-iyw$r{G#6~$10LF47n*I3~aJcn7 z8UPN9sj7*da;CEWoq7@$6{Megv&@sou1MyEK_H^s*RBnuO~hpplwwl_T|?-!M?l4cpvwwhIiAi;5jIe7eUcQP5& zibpG6b{2Zw-7d{rw3CE!0Sd?4p8IQ1%He}aq>okIh~TtJ{nFCm1yD9>tmLjJo#i4i zQ`I-QhpxZ>H&r&~Q^OMG^vr9C(BzVkE44ufKJL zCyK;tpS%i!wSM%z{&aU-eW(_}kIFf{Vyb_?GBrHMWeXzIrLmY98_EcGzlY|OQ$7BI ztmG5ORO9PHit{b`*zPA63n(x8!4-0t)!@6efMl5=Sl!7`fF9!8Jyk+H?Os{qL`Z7b z*`vW#*8O0@Zw17kWuv@rGRt+# z_>m@CFH=%h^K;-7WbXsrH<3RaK*@eU!01R#0huxBxN>ZR3we6c(&i| zlLG)?=li-@ewjak;?g!3k%j|-V<}vG`HQWgRJa{^_!9H%QH_qz2Ytit$vN0Uzsdqu z^hGcI4hRt;!~ScrZiPdfqQa&_I@mZ^Q?sm7T`#@-#t)kyL~#_v#4SSNET4t*paNrB zg>8(xR)4>M9dWB2^#b_ArJI1kq)>JsFpl3VQxT7hmc7-@4P9mZEkCgq+!i1x3u&yi z=w5$lL5(~6pf~(A_T@RAlGkrm60=|4RBrLxKMvoRjzC_! 
z+-nYWWV*Qm;Q=uTLMRwGI}J7zk2xEUS!=Z3cDYb0GIMhtbGbK~QZ-u~X=}EDBTqgh zfkq+q87P*|V>j*}9nX_BJKtZ@@%#dK7?U|n(B~TL-+_rN<>~3U!|T_6(wjQn;`Z@U zN=gn|`JO12)r>L*IJo;$cvxG!?$WPsZp`j(Z{h$|Dw$hWM{YHcJ^nGq*7wVwfF#J3-1Gu_EMe6C6@=D2iEXH;qMfleX1G30v&OaPv9 z(P!B?kdfjI&64U#(9sIOVuU1wB~{yA;JR}!Zu+q2fQmeuKn$(8<v(q(5g3^`RJhgWw(-1&qnNb|_ibQT6-|+D#N+eoi>q+r zkX#DYf6gRF-M}6h7^%EZPE7ccGX#@g0f7(d=Q6jyz}I{G(s#R<@1w1M<=SY@OoQgum`jCfTB_*#C1K< z{vn%;*RMXr50j^uzaKH$bq%~Y-un_iA_Obh393f?1{ncfO;T9<@+yW(lRansAolVz zd8R`9sh|7sOTF-6l_Rx8+EhxztAG>2Pw_w&W&&cpKz7ugT965s3d?jqbx+mI%MkeW zMoZgl553M#(P}qWKCvN1p|2X<_nrqOI904zkhr3C^Hr1chY3vD!Thry=53z@xzs2} ziyeFx>ZBc?m+04^pZ5x!%&MPT$@>ooZA<~?d=;`W+vF6P6y0;lw=opsL;046n)q;tAme6zCG5yg$Feex1Eg^mGJxsNYf$4Lb^;wA}rJ$5eNC zG+wvYP;iZ>1E^OgXfzdhmICVZZPH%mqu)kH^V+{|oqUmx?s@gte3I^V9+a>anSW__ zeJy+M>(A}&fyQSqprRiMQ5PUm1>ki(_HXcuy&!uzq8l#td-;iGKEy8*!dfpsrB1(!q zCnPD;6L9NZ6>}D1mx0(SWuE{IXRZYTox^NW)12FlTS1vz_cxr@X;JBh=N``6WXJr( zOC!!S$G%0DaxHK4N=aKOy`{`lr{i1Id%L_dlo%W%JuehZ*Nt{ucVPM#YGSxl470Hb zs*+-gy`I8Q_@D9pyj?u)Nj-tt0%v|B%zJ2P-|a07;S0)()tZ+*~>#lRg)= z_G;W`l4-gkPj1>ZJ)G}Yp3;g8jr!lV`mYKkH|ci_A+e#&ZG0^XYYO#B9dp?;7=ciP zd#EeF4jjQB=3cuw8JHr|9S;*6nOtY%bWLi+ODk8mo_m^N5?0Y=Biar2HCt(85fG&f zLPu^7eGO7t*BBy(UJYV?fGd3*Far@>6C6##Og1=dVy*T@`%ypDwsxLsH++4(F}AY1 ztAL2{C^v>0u1W9Bw+bIX+42F7iD(iDoyn}CdotZLEv9DMUfSR&foc{c7g0ls!T?h1SC9x=ZFQ9bE<-AXxjBb4?DKZ(k(|SbVW&%RF z0Psb{Ep%^E8t~|VX+4~~LR8A@eqJ#KkY8Bj$7t~{r-*BWP%XNmvj z>z*m!6VwJ^XE(%^i+mjCn_aQ z`$>ZGu=2qzlf#MFS9GljsoM8-bl!YiWzuRgxOL}~fO19WIiV7CH}Qj|lim9Mk(K0$ zjkioZ!^u!gxo%T8A&V9wlYOIpLqkI)4|=9T05b0VpW`Q45vQ2^{QTUn?oEpJFgqW< z57~F!5`WiHv&sd~wySfs*E3+F(o-QEvqPTT74Y6+h|6Xo3`-Qy-j!yL`2|Nw(8#=s zvUM=J5Dy`pI&7l@D$y4=F%Ljogp7#J=H}9I_*K!1N(#&Q<|HrvY=o@@?OJmQ5G~0E zA@Hhl1=B2a15C(B=Q|Agc)J6scCUMyeETCE??Gy#Llk8l3t`J&rQP&xlSrCr-IDT~ zhE{v^v2eG@BCA?EMp;0NVVc2F{C_l8+n zU6lZI(&T^^o15gGzo;gjOu=F>2cUFMfbH%*U!e073q1m4_4a2a#5`6Tx#J^~yl;01 zpfU`AGvGpxfChtvFV4ftMv?uC4;iPD$>B<$XnX9blEa8^WJFQI+Sp4{OP6o}kalrm zZsUfUAXUklR9K!p!R&6q7E^x<`~fru%3l#sOc>-r-3A{3B2Wnb*{{872J2W#QG&_< zQb}VHKA=*(x(y9ApjKqS)*;(`B0qfBf=z%$mv8h5hNckUi!P8);YI+E0ufWQvWp#q zHx8ZS6dGkZI}JJUZb@*Cocl)R!kqiLx@TKo7-b4)Y9x)Igemgb)J!0sMlxVwK0x_< zV8d&=P2+0EO~2$x6dp^Y|0+`;b!hLg-?l zMHr>u0>ohsfF5P|t8^ku+SqF6#(AuHXD}17{W2KDS#lJTB0xuOsX|f+v5yLX|Jhnx zSA>-SB$gspxoTG*09WF0ccyy=@56U#ZS?e9b@aD&m1LR$;P&5|VXq?6vZl%xMsr~l z)U`{%I7&0Z{n4KzP=THh-i~s7+*Oe~ahHfrZYp8xKCZPHL?Q6fCI82Jl;tFZv=J_;AjrllPNtt|RgxYF+ z$DbQ#iU2C}O<+0vu{fDWAETzVD5=xDBtbaUpBG2Mg8ynwJ9N!~Yf3Q1hY_P}DF${g z?#giPwx9x5J(Z$-7S8dZ;2%BJ?iL0q_)2)tbmN2kcK>PFzM=rZFLWwh#Uo&!Q*$x4 z`JF$5p)+8TrBz6MvLgy;BAF>PrC|zqxqt-AU0C@9UQL8(MtnYNr(^}lIZ?c@anX5? 
zOThM$rR8Qn$y^QjjR@KAiUA{)#o5*EnyXQy8MdK;EfqjZ?ijoZ^9RA&j`5>?v9uCYjK37h zvS;?AzzTD=x3n=^UPl-$3Y{v-bNE!%F&9rA46Cf|(;}rXb|V#Ow)^bkea#?|wENMT zkA7)Xsrfvhl&s*bc$o_v>7AXO-kzR!Ke}tfM*5(l9>xO+v)qEO2j44;*|n#E!=b%P zLXrBu@+_sAKJ*}#F4v!h47)uJNp!r{sT&2a6?@pY~ z*K*uapkhhae^5G8!M~RY11b}+Q4E=J%$e;S0#wL7kt6KxH<$6~51)uMz5SU@#AOcE zm|>zA@kKx>R4Y-3CNMy5Jwc*7PBgm?j8gXLt;F#TP*kh=aoD`iP;sglP>4`RKR^mE zFPY0?9IzNoYtsOd`v0k!zn=K6cM*JB4z41pMCRic;gZaic+y6D2|Y+T>(?cyf{l$mJgN zSMJW@Z}8jw#R4=u1(f)&MvqQ^YC|bh6IIkoUqojJ0vREI<0pcO1A(JX?^GgCqkuBa zfeAqSw9p>NlGxeV{iwXTxw+s4A~89@HrrSnn13lA12iQUdPv`a#8iakZb_{=wNj5( ze}DZ~^-SsVr8pKxhT`Y>g|(#A%fIZ1fvV4uhSm7Bw|tG=50U;<7DxYLDkWO6)r5E8 z4MZGN+5s)kS%FT2ow6@H5;jy021@}?{ z>ON4myfI7d>>*b$#wj1_dd0v-E-p#9!cVt<{F%{igWKO+ok^l?edvPZHCZGMbV8N3 z=a{heYB)9#t&asT;4oq?JqOqZo4!>aXnZXVXEaj*Xa_m0eTFc4^923jm)!(fHMwA} z#pL8rUhR-Uw1G*I>WLu9p^ymK$Q%aXL=X~b_p0GG9reCFU&d&(TMYrYNdf1mVvTYI z;IY4MOBWBu8Us|O1p5F99|$CXI?p~$n4Im-aRQ%iqlS9S`~#SHuMQdGJZC{91FEZ9yTngVNeK8$>QGPS_zdHUS^jYh=GmpdWjOjyU( zBl4&o=s;OAW{Q#k0g^Oj_C^(;Es_W&u%v!L=OYH$fqYtY>&=C6a5>v0ArW+q3&CUV zS3=ayhM3lD0LRBJlFT?+Jl6Ge-AK_z9^p7f+}1av6jl@#a`%vT($6S3!S9*KD?mKg zZcvHK5n4?c`&x2ab3ABJl94Q4`CR!@+{NW*#+@takE7RI?jj%c(#gxyxSlq1m*RuC zRf(OQzLB!Q>en02m!6Olhc_|@Kl~R;SgJ~8*PwPfTEpj}gj>10hTgPLc6*N-V8<8a zU0tAcVZf)Ai`R$wYSb4KKn=--8SRUvL`>BBQl^_y?PQR@v$XU8e(TM2Rls6=js%bn zst=PP-6CsR2e!{)=@UYu`vrTR0?9eCtizlZkc@IWUAz+p(qK!&0LMz<&Il=pGNyW| zi%<&V`0DmyBF$8bxbV$6isHQ6M?qbWx5sT9nlG7+W!1wP;f=_fQioOfO)w7mXQLzL zw55m&$MjFy?rxuAqt|3tN50)?eVA@Sz`$4E|2B}8RY)jP(W^D@MO#-6RprE?vDe)Z z?U2CNw1`=>O!p#&=OgT)W>>X@HRs2WZL2V%7Uf$XRZ2IFWY>B)lD3aRUF$<;$M3D$ z1Z1DSRYf_Nsn7#~=hFT@k7#M1%4+q|=<38m75jI+oXkm~PvU)N5R0m zf=JWLrYw1Z3mF5!lMTQJ$7#PNce?G`sx zq?SZ$+vO?eKH9QoUk@HN3ZHSnIPSK#ip+DNj|<=3O4uvbk6)FS7b=wL#gQuuHJ)D6 zg?}?HC#EbNfjnVHMcYE1Rgd0nY3*H*Oi+qsN1-?T0egHoVYt;GRB=n9?{&4>;K;Lj zmxD4HjG44bKBnqp)k!mv;dS@)!~&k1DC*Z%+dhumj&m_Kv0mePxVV>2NT<{@-o&{4 z%aCUlhya~yYE3BTt46tAC9496xtoDp=47Lz1-Iz#XcnBw1f+kmSS>U)Ev*mmoc!Lp z@X&yx49K&(Z0@(;Pv<*5J)HzGs(bw2j37WyQKKD6%u}XzhQ@|CFSw5-mVo8p^Rn9C z5x3&rB%z*%a9%`4{9C<@r}0%ad68bNa} zBpo_B!oav0e0duf3TnO5aN}r=5fE}+$8UH*5!9zP~!J1cnwdSh!S}JGXOMcV!}>KmtwT9LH9Rs?py zm4{Pdi~unc5EN{9h^!n#s@o1mR+bP3s5jjj#$e#Z@je1G<_nrE%e zlGv_4!btl2$A;s8vXveLT-;gN9c{`(Ls6dMG8-hL9vN967@&W>`|%m&76WGa^?M1~ z7y2-=Pf0+1a?ZVJrn#xLtOLhkW#9LGW|zt)2v5}yedR>cgzqk7j5)Xs2^{n~ki|Dq z!}}^X-~(}6zT&Tx2doojf$Gcz!{`H5K0|CRx3PD!DUFHgBS+fK#iSj?bQ&R5I$Ecl z7{Y}vxY?v$=Si2AEE2r2iTUaQQU~t(?#+sTwPMxf?r@^KW>Hmt#?gbNAznYPjPbKY z)6LdHV+Ea{A3buC#H{j0-m0`ttSoR%1qLH^B@?*sFGLimO~&<$sdi?L4CC1KXngk~ z)a?)o{X`mO{61<-o^ddK!=b!Sg|Zh6)SdJYR&RUry%QHGW;~9WK$<6QJYVL~bUy77 z)S08phsZ_am(LZ;jCy4T-=xub&h{?NQ7h4gkWeL+K}|{>?Ea+7?KQ9QNP3zl5I8)Y z5v7p_RJ!M%*_&Y&?XS;mLmp7x6NdBttmT9E6Ra%iI6pI=^?aH55ZO+u((?9To|N8U z>VVIWaz!zXa+w5HMVV2;b3-&|=$=)-b3f%;S_k#`f$e!RUr$$}z+N}_rdc3ZJ)L;? 
zO#6_KVC2O`Jk2x9c+2Iep0v|B9q|HED>PU~&LDPYH{8afEewYEtOsg?vyw2GmB`)K z-Ae1z@0vJmEF^~G=o89V99ec1BjAmh&Sfdp0#~Z8G-|jTzA~|7%nCstk`_s*!UvwmBd^{6eN= zBC2-89 zy;2Vadp}vUl8;}(u*%2-PjHw}X3(cFo2-)&73C zNCl{V`q0DVq(GMeEoBvmO#w(l=|}?QOT-X>rTPhuVIjEHgQ0~VHjVpEUQ}Jn3~<7< zi{1;t+fNbI#A3IMvc(20z8!Yq4;9?E`>u#_2MN0^hiWUjI8aYOg-m-rp|Kq0+b=J@ zhyh^^1lNchMhs^6`qJJYW_i(UDBHfMpL*~vIO|Bs5wo)M3U8RPwE95FO z_f9bnM;bfSr|N?h@M&-=d||I5)3Q$ZF`}&7hQT4-z%ZZ}09+})NA$Y?I?7ZiBXHK+ zZ+~U(kEO*(6L1-q)F5(6FqSS{R<>m}me&ePNG0sbZ`D#A`Wl>!M<9kVuakePJ&@wh z*ZaeV{xGpk$PVR5RuRLUQq?)*0g+-FUVtQB&iY%v*frRq5+E}0mGO4R8g4Hm-Yaa; zgRw@v{@YWx87+4V2zvj-3@%}hCZ`AY_3)X-@K&$nAmA^dnY{25 zbUN1V%DHA3W<@GmETb7*xTTjQvl&mxw72WL9*F_1eoBLHdOvjI1b$gD;WV z(fyl~GYvb<6#i!)sf}Xg%G`owLl=LJmwMy>U%apH?=(i{<7I1yg=QZZjX{KftwcWK z@C8f(DYv%+j9eoGvn`BfsSB#=LmG#3Q(-W%LX$qqhxIcis;3GR2)5i~@?UI`(|LQp z!AxEr?Tzr|7>N)$&*41Nl`uZ)T6$iFqIg=hfHqUNSI-xV)h#1x%+75bzT4Iyu7X}$ z@MDvtS4hM9%Y$vU50{-Ohb31QYa87&=}MY%ML&#R9vM6QNcO5K7a`g9kTY5_Eo`rn z??-%dihL=kuTZSqn(NRL5@;flz=RB6(4J0NU^-OSGFfb;22Q+B#zTPt9=DE`gGN7p zpLtVY6MXr!xQBzUhzXt%NS0TH(p&e3%zE;>RsEQ?`)(_t6^Q3}R6pM3XIZF|j~6Gb zW}Jf}8a$2{PvRGru`jTE?8m$i7HOsHO?i0$+$v*k_f&o30oi0iVrWP)IibW(OwCxl(&3!@I>LJtm)A$wvk2_d^|lC?*Kz~P%68; zayt#q6Q>Klp90RQF|-OI9{k){50kRHSxG?vuInIH;vcMjfx&>j5Ie1QD6yEr*HC68ZYt&cjX-JZx_)159=sOg05c6o;Q9y}sjTxSs^;?~#;I z@By$qG(^#x*&ckP(P!P~+UR_PuBJrOzZ!!3e2%tr!IGW5RDO*N99-7uZ4R%E`7)dB zF(c55RazMGhMFG7=zu|tdE+nlj~%I^2E;#@KA0{#16Y9~`099v4@p+EVsDt%|TCy<$2SjXbti;Bu? zqLl3$Lmh;Lqb!bzsoVwc`|92x%;z9NiO&TDqL|7!xNNX7IMg|Gh4*9ZhnIw}KBcwg zZ>ASTOw84eXqBD4FH33t?$Km1g%m8hcr|^rTD|wW->G_Cdets(H7;Z|Pb(hzx$fOS zJcr3ECf{l*m`Lmw2Mb!7bw7kJ+*ZvmjxyBjUJt3{PPh!^J>AZaVYN=r&3vj%OJ5fq zN(O@k{qu=MQkq<9q9B09arDjlY|$dV2;wm8QW;w#dKId#hwgOOZtCP15}2u}Lvf>j zNT3i$rejHjzGu*KIc{riuur!CONvtNHzfe?wb z(xvpmoad9$lWEJ%hVqN;9ZwHOyta)#Z#k}cC6?hz_Chq;^91@dh%2L6 zot5cCJ(Msbm>}oglv`MU)@*rb=4{m3?2_kjK=Hixxn_}8WvME?zGSf~9diImE{z~| z2%mJ@XB{%6CCMdMm4L-68x*>WSih}=}YD{Sm|Hy(JbSg9iibQRO6^Ji4;z(9St4<{zTNl)v@u% zyxhP32|*H%2JcJvN&ZtA{T9_yzMV9c2rT%MR`kwR7`J|m{|x#sszo)>h~~zltPd&K z3Z}E)LjLwv479g$M=F0m^ch(QJl&S~11qez^2)SGJn^y=zZ(6_6Bh)=&AZ59jxE)Fpdr0 z;aGbRA?DXhImbgJNXmEJO}Fz4v59!l0TaTf?4RPVt1>D5 zNq#lv_lU=&gLlYXc*PR+>!sRDKqH(T*Pm+hNTiL{SK|KZC?jfcwWf52VeFrt!~q>1 zOfgHflbo%9WyO#v{;SbZ$KZ0R8$5fT-!G*{Eu!30!cRKWNWbOZ9sc3>GoO($f~)&v zE+3}+>B;P8pu=&NXkJc%p0ZJCp!&T+a%sUkY<_%?LiMX7_ZdJV^yV%`zk>edw0a}) zr=v8b!PV;H>)-zLB&G!D@a+3MNKN^LvHmIj)W3(HN)TNB|EJvlpK`xUMD{aCHhMU` zFL{M&Z>0gtdhMUdBLb@_2c{FAHRK)bZ<_;y_3R-mh~2~*_RzUg4M*7K|6Q@O(Aj== zYv=gy+6T~8=gT(zcilvZd;`Dn504^&+MnMGHUPR{cQ~h{|4iRtQSiK1JJ=?Fzo5cI z1Q&I6(AXgVyRjMgW`F3Hb^84s3s^$7NFj-TOdABL%O9-d_XcZZ|E}p10XJsfp_ux6 zS>IFe#o-RMhS;A+Q(+oImkR3pxj*X~f&?!Cexp?H{P<6Eqr4NrlMW7n9@?K9V|qbb zw~%fl`R_@#z}_mPTA>#FU6PNW3qOZ=#P~n=h1%5p!Ee5Qmi#j^S8yBgNM^S`rX6%M zJJe9VdD`&skIi~!1HQm=hvW75^97mDw+F!f^M8H#+wOyrB(n#^|7`ul)Yr z#2a1*x*uTSuu}e=Zb3-k(T_U@MgP50j1L%Fqi+&rK7THYc?Z5&nU*c1@<9Sf3lWfu zNNi_ZoP<8NogY4egnupVvt8bQ1_;^=9cv8QbzE3X$Finfz9eCOdomTK zHFUaskFXI+2aBU8_WQ8Jze(}uu_7T@9c-3Mgn5(Z)Xo=^6%A3uEmpBeDGO;`zF zO#SB?zbQ0Zey=|)^rywsyf$0R&o<^6tHKx&%MJlv& z1kN{;DvDfQDZKXioEBf1CU_0Xc;=mRHk&}w#%xM4iCseLqZKo?Yyzi;9 zO%QRKl0BnAwcjpUX57m37)W52C&kQPq=x_g(K;iJK}{yz<5V+tiiQ!j!GTP~2QK`7 zjtv;0S+M+<17qHm16<4d2R2b>Yefa=;PAjJ^~mem)@9KLalG)Mko${t_d{tyF7vpR zp%e{}(4YapXEr_7`xb{Lm~4w&)gXF&RE6tU)_f>)vEPX?wD4i$)zk7Fi>x3%kn5F4 znNH?&ES@aa7oMweJ=rpj!TEmCQm-;yYr&$$ZuV~#r0(6 z!^e3U+AWb$A7>6wu)-<20ZLPNmXyI*M*~O?;?E)u+aKZ$r!yAD%iZDZ0JBXgcR{ok z>39v`-ey^^l@YclcqP(ajIX`;p92>45Ugja9wA9f+?&gj*eQK4U7#Sv^f?5V;|H>M zLWbAv^h&#C)7uYmj2a4MZMUw3k6-yGy&LMlem2SLyq^gqjik~&FU=a()7{2{`9SF} 
z+YhwExo?yFNt|ZMSzFJ4NH+r+6e@gOhiJm9!vWSvq7f02 zPL<8I6l*;Hys>{4^*lRrgl#;@Dw%)e83GsDjIoVb_OT!9Q?9&1uH11OCl4i;5AY(7^xKvt{WZN1u!6;DGa;Ujnn3?=>Z zdYhL;dM%A%j|69&zxz{cJ&{fJI{1v=EsJP8M9#80m^2XvvZoHQ7AB`V)8nA_dqWg$ zl;cy`BDsGBH__jH{;z`tY;rw0*lEzHG4u=tHOM&0sQS#fP0+E3o(_UKg@`u`t-3-t z$10SEsE-7`sa7DzsjP2|n>12h@Ho?L*lpaF`@-Vy-f?tatu;!Z0f5Yuy%2$Q z<$Hzx*LeL~9nw<3x{h0#8Q~eV*uD`nTF@I2x_T2^LGV%;$oa(qcvwn^23s7;J10ee z)PPfO+ZxMxcYV-9%I>tQGjZ=r2G{_o_H!G~e}t|AB@dd%9G45#CXq;YRTpk=q>o_W z?%iJtj#*bLR^>HYD*P6((%)L%8GGOG=Jsl@o463*b-E2VVikp?WrxFNh3=zWE&5@s z6ql5RpiZK?X8pI2z`unO16-uE-0$P1nxoN_Qbu4;qme7ITkV$|6^%$1p&qi-BdHOk zSW!Y%t68K5#)d*go>A4$+rL?~cg}uF*D24@&!$SqapVu-D^orY`|lPF z7&sqnU^4bUVB*3i1@jT$PaQv4o@1B|;=3POto^YJsgpsw?#-x$ND#i= zxzR|v;6_~?%;6Qt$;zfX|3KOV)u>v+{7$=^(s5~OZDNnQB%X%wv5*~L{qI&gKwFLW z0Jv8=WFa`rTAZD?(F^yoRCw)Hg8Bt!9YUG~>BiFICI8!@{GZ42Ob$39SoRA~wy7w+ z*;rL)#H1bn9}`>eDXcOLF#Y}B(}MqV`cshudj{eKdOG7Emrm`pJDU{f_SqNh@Hn$r z11xW9@Wy307J7J)Knw346AOe8t!c|T71s;(+z!5#< zDT(keRbh*1ZGY0BzlB1RLkAYrN6sgzs<2fy_KkmrIcNIlJ>a%$_YtSQ^69*`+!e@E}m%`%6E18Gxm9&Okeeg5F>CHhvGmOMee}??Eij2 zQZV2#*UUcqIY72C#U=+porrDM!G?-q{$>BI%U;Xb%5tP|0w`8_&sK&=>aGNyV|>}M z0AABp6L@|)&PxH~05r|HWV+fPf7s7h-U|#PRNu=E7+z^GmxlxlTC<0N?KK8PsJ>VW zUy9W@4CmaBhG6U`fX1I>S^W=WaIBTp5+VuWd*2C0%@|-q^92V3_0oV_bGY=5RPqbZ zd;@bv7ApFu3CwhUGyRrPND%%%zxUqm#8zLeN$K$fAKSEg(fE|#YwJ7rZV2nnAt6d^CZ6=5jYAnyZVH}dY;B= z_rpLQ%nkjSZA?dClvzXzzhpgqX9VEpYqh-gYrUE#Y1A3+hrybPoSo%t^RlchMr*fw zZFe$xGV$yyXpszqgzjelEcJgrECxaZ9cPtfMlFOcHz+AL+%HxW$3aODWu1;)A2p3C zlC;YlfjChR$8g05o&|T6Dw7c_0cEoD-2D80pw3AR6vrjSi*%dnbZj~h4zG_!h*^nr z0I2`{PYhVu9LaFt-T`F}6fR(=eM+{murjrsR&a}D?V-m2|*#oR?so4=JJP*tN*I5e7 z1{?lr;JhnB6v*(z&dod1 z)jjw52wlcSR+4~{p@fE?k4R^v6t57;Y&c%+-KlSBAq=9&R8$~3%~)IxST@2X*EH6 zB~C>_`fS|~5hIzqH9O{H(nktQz*yM_KKZNzyjP^80~-b>NmY)25|Vg)I3B$d>^d`};>1;oq_T z?-xNJkz%Jd#`oZ9uv%ck@g!h95j8J1{#dj<*Z3Vu2h$M{NHIO>WzP^=f~t8bWXc(l z?JrNKilUq3+E%>hUrwERQIyDV@DflKcajOtx}eAaFDf0Q(iK5|tQY+G>!Ul`c4=roPua zU8cfY3u%yf&pvQG5B6_U;U}LV{BQ|h?sM;njLFzUDqy4%Gl4rp=MWO$WDCBN{-ixMHyYD*)LW;)6BH7T6 zPH1^hmi3LxV7H(KQBF?IjpWuy+^bf?o~c+pyewm3hjISi*t=^W&(RoSgz`6qIwta% z+-6gI!*4D@lXTdDi9=D|S_PrD{jgW?KZYsok5R+j^$3BBBPzKLRX-VxV8q2R)-UQBC%>)w;izdHTAwt9s;9GJ5 z^0AWv=aSb1yAr+Odj#2kZ^>YAn4dgXCBQCt5~5uO&gbq<-{(8uR%?w7Gkcj5Nx1wj zPL^Zb?*?f8O<4X7&ES_{KZjpQFA+%thBX>bBz%$!(t9#Io}PG`di|EF25F9u>KG>8TEZ)Y}Z`VK-ka$WL&&R zI|v>9;OwQz7hJ3Pd#wL{u>|P;G{g-4Pb9>33P>|MLy9*u5Cq-eZT|Fd-8XYqAc4 zdOSdBqYv2Sj$_CEX4wC|MUh6B_2jrj7Xvwm1=bExXh|_0tPis@{@0jek-LEpj^%cIf9*nf_Fm1kZ9BFBX)cr=VRXFZQ zY*^I7!BG5w4@LgZ+#Aje-l7QkAS#(vusrxu3uOQ#N|f0W-9M(F4YR~egBV=2!10p8 zxc&eTC+oBw8~=C&MT`&L zFIY_ae+~jN3`6^qMa^xavZF}^Frd2Zu%kN(G$E&>2V!7E2vn_sZJI2dz~t2Z8g zK)m$(?{)RB7tHct+7nIa~*ko=wk=ekFQuC{a^~ zeMmHH>>vDO7!>R_veuu2nqcF2&Qn-T0E}yuD5mm{5q1N|DSl$+tM7W_d+4?9(^0^7 zo?dtN{l}1yVGdGe>c$y-C3@FiFPKY!^vt|XzWy(f{y*m+D~uPj$H{-Yy@-%tm^&A< zkWfk!n}=)0{9{P||6g=pf8>I4zJ3h#4FYZ0FWKlhPz1?}krGqIJwv5$uLb=6sneS# z={^1O@px;x^b62!g$F?A&ajx^gG6QQJ>`Yz5zRdjwxL)5e)xaO^-Di_VzWUT)v9$g zv?CbYkU6H+|MK_;`;RZ)i68*Js?8L0yy&8J?fJ6X#0`o>-(j%UpW|IVT!I_105qnv ze#U@ekHZPW6iQiDBtJz21~p)qW1H@1;#{B4A^^fv1GF&%)yw(@@^)Up+QeU7p7EJc zJ)bwJdBU{Ya1gCuYHR>FO~yh1c?%Ryox2Z2K?HgPftDmm8iFl-jrGfaP8>L9j>)O{ zt^%+d3KzaGnd94jEbs)aXKGS>_%RH^4PH|*b9Y?fa&SSqDk$}dD!}BrK8qn#wty(e zi(&LXkpr~p+p^|Jz9yrlKbzbKwYC&qpumZXR>VowWS6_Vfw}g_^At>YX}bL5$Z1rO z`5Qo25Y;Kk?)KroTX)rV1@t5i*b3Iyk`(?O>H;x9{Nxif>6$B-ymEJ_&h6NVY!WH4 zWKmaMFj6A&cv2!527senLgWXKb`JIj1c!S-1|4M>tv9|gRi4rCXS4Vz1#(>zOx);Y z+CcVD7#T|rfCZT3_7tRTV~P#`0iGBc2gDr}rv${MG7+0(1Prr9uzOwpy|3Y_;Aw&& zUHw-X;CW75!VIw=@odgM8`BT~x71@8l{b7Ab77KoUrD>*VHjMad_(?M@>bawO#pCN 
z0OcTj!NfP7{OtDp@^1H>SiPcNZQX0C0I90Q+3f5ChvC zdx_}Jfqj0uZ_%@j(N3r&Kli4o&+6O#PAZEW!q%!RL&Od+mOtAm4F1Rz^%UITVMl~G z!vH|2qR6=esAJkc`tI7FXyMP9`ylZhXxIMUhh-N32=EtuFLoR4Pr!9EU8q*+lZD`T zZbRP6MML0K5(@eb`faSvdJew@s(^GF^evep%Hok$8b}3u4GhiL>{H!C)q!*cN|TZ& zZVKI`yXYWodO)FQAaW0dok;~RVv^#&Cjv` z&mg|M*XCDNzm{C%D0qv|M(D|iEa)y`3ku%C0)Jw?*!egGX?s)76h$;8SAq<~s?Hu_ z&+xPEV1O#Zj4WX`u^cA=;U{)8ycqacfb$jCGZ29rlb>kr+gq|z)z2%KJp^v5&($x~ z9WBsdVNnEy<}E9HMGs4p2PPUAdaP_Z^UKV>5J2E)xyua6Uv=$a5^qP7PtoR3je$rq>Lq0D>aAQYE zz*L3+z0DF2#L0mg0FpfVfo|FT8$X7M)))8gW1tM}gJk;)xfk5FwHkNP!H9bsG@*(G z+dVg20{YnZ_g9tI>kj}wGawAhgMVugTm$@C0$^^0%%;DmxHfUlRNIS+yOb1QB!9HJ-4QNh@6ndpHf8ttOh1J3^;G-7zga z2XvW8IuA4-w1_ExTJ3l0LBWihg#!?zzg&S9IM)ZaV1lsFXW&5lygy!P)AAq}Xmxop zpc*Raim^uz&!Bu^CTNTQ=0Jo5x(C_vyZP#37_lx3m!IH<(xC0!fs8}61-OeSTG1iI zrBQI$1i%YIOpgKOc)P{{^a1TsSOc9CHI}cD;iqe&eK;`6W5Fsyry|X3EYr(k& zR5I3Fl0}5nOJ2SEYWGM7@9dg6HQpKSa6Z_I`Hlp9+^AuzGH@(-9y0W7Gp9Evku7ai zVu#Yg8c+dgxmevJKyZPSR40(B@>Y$g{N*ivEHip?ToYgdqo zQd4_HLa)Y6<|1tY--M6jrL(EFMZ#De^E3QEpVksj^gW@YZSq>_AzN<6uC{@YdtE1G zxD$MxI3B1c_d&Bz4O*b!F5gdlVFvY=J++(99#yMWwRxL z;EZ5#2P~hvZDf-%!Ytk~!dqA&KJCjKf94) zPg+K*JYM;>J>7ScQ*N4^h?yKEkfPTeYAoIu1$!9M(F^l}SDb@90Na(7+T&MC_4WJt zaa7Rk8K9uPeiidkTakV~ehqzubVK>rTux%tjfC(zJmnBgBxDaVZXNmd-hdHhJO1hOBl1&oNr!Zbw@37E{VPB|sZ`NiH0h`5R zazYkAC^LCm$H>h&R-|GYa4%5H;WONYmIjF{(X;A&!X`0qKaU*|qt@xZ9S)X%HC%D_*HwL}}{EzZt)D8Zyn;{P7Mtu7_b$jDTc8a6^wZpw1 zeD;UjXWuR_&IaqCt@2QYwkFFSQb-xPBuI{~J#`)gg&O`(miOM9O9TWzZr`f@mq&I; z=RfEqn`9n`i2^;U(HY<0-&^$K)#{IX&mTlPF%v&ni|a35OFX`u^ZFyVPqY(7cKyboWyUd6ieTg69DNw|$_d}2IW!!W9knPLLLBMg}1u?;vV?Cs{(#O*-0XKzI?Z;C6Dq1J* zR)}YBc?wC-yd=qkZmv`H=G}rPzrV$&?VmhS%F!;{_SlMAUY5?f()@Z<_vtDLm+i}Q zG5=2UjeDNOQkgb2*6ve!){ zJpjER$6_a25XcTwqzI!%pUY~}ta!G@{G=22Pz4lV)HHTE1B&jk?GrpCt_|oMI-vhp z?f(6^@$vD%Y?40;=Jbf`VCubf`ADCqkw>~52JGzB(V3yhk zx&J$IHnol$Btzr~hHtSGvcW8-l&2%GIfqV<)C*XmC&>TGXviB)T<_Of*EhRoV5~H6 z=AC&fqgqZ60*P+cXvmWCJ<#`lmOQu-x~qjiGW+<8wIM1z|K;z24MTnwQVf5gSob^_ z)xcFu3o#+-d%5}c!?%c7A1_2Xql!QjK{0fzR;QTVo}U2f5hdx%+rmy*(mq zIjZT~qy9_R5B?M!sHw`(pVZ?@uGZTru_hKtfMbm}|7IVXce;+YUFkhb#k9QKgP!B! 
z<6hx=PSFS6TF>&Sh5^s(40}$|13VLUT=%l{MLnu#y{62RGwT;aY1OgKXX#{5knH&W z<88&)&F?_7C}WU_*uRfVQd=)F$Sxr`rqWUlx@YvqDU1HCb zdlQ~d?Gf4P)+L9o2o)r^CT<`h)7|9mGS=!XfnmXJIY<-X><~bnrmn-h+{z@TNL>d2 zb(ib7X&V=nO>&d4oNe9v4d_~xLd48)MB97tmlC$%1yD)DB?&MPn13z|EYLGjx|z>2 z@2+}%;G09odnS#Do`i#%Z$tN9W6u;L>_N#h0xg*WITf@t{ zURa>#x^0{gX@59DY5V8%g-2&?N{;=ZzK`0O_58E|3T^X4qZj!``; zkB+=IEX2db)3+wv7CR{ptVs|rfTAMJ?NQ(A&@h5=j|C2VigG?=yPST~VP(DrT;iq;LXFs)D9ednyBMo)%@F(M*Ffu5i``jZW#Kzn|nN%AP>@+7BjLSiJi>QnbD810v zaKs00@|im|F#T(AF&5m0hBK?OAAg;Amoq-oP~nQhv*;dqVy~bd7R|B0i_Ori$t;OG zB5mrz#MKTLr@jXgiu6$FQqA)hkB15N`>YR}9#ts78;Bfr229V~+S@i1)7+`v2-W-f zDbd$-&%Z2P`#G^8UvXKN-H@{;;lFsX>HNEC4OaQ-mJ4TV>~lF6)W1;sEP zKzHR-Dg4Ko^*K#V0O;&}op3MUlITm&`xQT3hF^8P(+u@o6|dr^sVrnbSWMV!jEiz4 zP#TxxtpJ+&27NC&cEoutuhA(5*q*3+|_&s6>5x z=o3gyScO$Qo+GH_0^5v$lvTj7797hIHj9DZV;FN7Fd_U{kyQwhLit&d0Ii>2{A=Gk zo)@D>EaN2)5_y%Uksy~Pf_bbFVZiD;E&gm0$SKf3Xm7sflh<7ocW5tfxn%%h%6kL+ zT`Nz53sy>E!xT?{;+;-ZXzCyJ11MmO?OP!iQ`W6`fngzvq=Of}%tT}rt;$New*h=dqVjkR_9fRI zQ%u{aBk1KINNQBQr8D8pJN#QgPmto0z@Fh8)Q5yQmVMsjI*6zAf;-4=w|<6Kv;cLQ zwH2)@FGE+GO@g9^!eM#ZaipHga{>co;2+pmSQN@wDzFV#KFUqVN|{zTcKU%XHV#J^ zFT(AsVPNaCQlv-Wuy9K6e{S7+0{SPEaSK=j@Y{+Iu@dqiw~B7cCMi=%&&70&QGc?# zdof%s*r3Pse%J@WQYj3Ie)s{BgD6B!<*_W}@^wWirU?q~?IlvZ6$ly%%l)EGk{12; zpqFQ?wkL#APoCWC&lBEs-ux9o@n>o|wuy5e33~Xx-IMZ)wOAjQ7hhE)o{@k5P;l@A zj#kD_@&|j%AM#atSGBA7TJZ9w$AC%pT5qg&0dp`nGtpuGlwb6PNd6ECpCYO*Q0=)0 zgfPnxU5Dhq7DH^IgLnEAG~ZzZ{UJoMV~87r+=4{nZg|1(v?z?yIQRlNgpi!5odwT4 z9$kgcx+(jt=T1J!r1#N`eSYZ}=Vs<8V?#XrPOE2UsJWcFT$LqM-9?Fhfa3l)=lB;U zgD}TH05xty*>S3tFnhcBUBCD0O}6L-;s(} zE&zKX_*xY3&$W4A25cAM5ziFm}Z)3muEIYbr&3iPE%irlA8bvHw05nz+ZzX zwO2|z{n6Jq;Dr~Zy*v!hn>EhQ#1D~gTN(Z+Kw0+kFx9%tbFGapW*sEykKI{DpCvzQ zpgX-M-Pf;uI8O8VNxo@`^$6NCGAUJ}b3Da9Cqnde;>&f)PYaI)E!rDD`#0*8*u%K& ze~8!=M~&w689XDJ4lsUB9}s+P_astT^1Z*L)VTm52h7-aF{K`1*>Ua zoy}iGd6RUo=r;z1A^Ev#v!f5Me+HtohS+;y#7d?Xt$&E^5&CDVxvY_#{7Y&@`Y_c6 zl9+VROX>mGKZBR>a$+KoTD(quTeILEQvap2Z1slMNPcN3ZYxPPc|*NZwrxXn$0#t< zr6sun#R4RlVE6<<5f4{s zgF{aV6Wff2s#ZTnti;b2aeylPIBLnbv#^?iYQR|5&&OP0I&owjh130I?XD;=$h1wc0lh`-wQZ+hG@<_Ww$~+s`Q*d%ZoO46bMPUa z)$2FBV9KaidK6l=2MG&IH3I}P0mo*GC(c_eCLy>K)EO-vo&#JIOkpBuQ)G99@XEd% zbQ&0|H2mo-bk;%37!mv(GXK>sz3z9Lp@EK_Ciw*zT4jt8a#LR%B^#s`29@*V-fWS~ z2bl7<=m=2#GCsX~@vFWYaV<*=Wt|2*6?TJSyL-~F&PBj%>)?gPiULq5iDgeQu)k+G zN@uPC%aEt^S|~(abT&BECFzYL;a;7ThVSd!U-ZA;+kU3Ni57148)~Nw#~kthHbXRs z&|~@E-6tk1hcNSOPBTUu@ww0~T>r&Z?SihWufP{ncoYD7lc}de^mFiYbKE&>ZhtQD zJV9Hv6Ny$Jl3ib~yK?Ma{}g^3Y3Xv53PKfXuJJ`b;7CW~+gNG{dZTfCHZ!y7Jlf&S zgdMotpHy_mk@n_5F`A4Zg2=qwqTxxEuJztwY(09AubrXBasySpsf)OOkp~rw z(1hyM^*B$wbRmv<5-m6?NdtI|%-E{kl<3hl;aQ8qxRt}4o3c6b*>>ym0xx0?wg;lL zyUOtn#7SLQKLLpmiUG!!Pxg!>0DfbIAXOtJU@UH9fkj;CZ*m4PZhboJ5XEj*7ADZX zu=;2Xc-rfDp3)K2DpGk>*qQvIW{|JD8@$)W`oJ}5R+n;R8;xIHtrLeVDEns7Z!$3@DE9}o4B zi7)2f0daGiuqDCVr#p>$Epjf5h>eA_&Qh?F9`s##{lBgx&dmkR+v)L38?0*6ONDUL zd7X3->h{+WJkEr1P*LWSxERJs{__fzc~yaM##xNNz5Xd;^{wFmv4Y(vaAAd>4}?h` z=) zmb-AZ+P;#y$SiPPi-wW{O23XtZisYcuK>7e@3RuveHAeA>Y8JRzy=!_ zWN7nsb6Ieb6K0<+_^>V90ZV-h8`0UrPiGU+5s*(ppMJE54l$-m!qj2Pkl>Nv17Zdi z&35q@?Zl$gR@uP+Se6(jxC9!HX%Py`E2QEB)0`Uj7{=3_LZHQ;N+GtjFc>~7iI%Sl zkKOy+411VTL0xZEag%!JM(UJm6pyu-#xD-z}i^~V7VV-E$ z65K7t0t&@j(8v6KDvOfW<|JS~Zj;L;0KMDZx>rJV(0WV+jBr9TsHwUhnz@hw7IxA} z)UH9m8UO@dlY)d-^I#HZok1}SK9i@Tl;n^2{Qxumj03c?wIlE~TVnKy(3|BNDuEVJ zehMAjNYx05f)XSFErOhAm#$-nh3NKs#yAl?1wiR)1l?!#HU*^`^;j*WPg8cUo!Y@Y z77B$sDBpjNWnX#Nl3+4G$b{vVq+${#{$A8yefgEpzUV?8{+9gowaLe>>9D@QEBKv7 zJFK#Ux*XQwndpIdb!gL?@CN{Avqo*n0%Kt#3O^>W<)e#o9 zI1?n;p0Q}EO#QA@opPHYb2o~Jl$D@A3$F0mRBAQjVzY&jLLJ zpj^fd!7%Ao)<%_i4!oG%Iyd5ZGoxTZr$LPjFy`6>ej3>?huTeeOcG?5Krl=YdqL?f 
z_&SkK**=6*zwp-8usO~7)={R<1ZT3zt+#zFu8xoPjlO%+tM!|f+IbCYeYcjbDcZVD1q$>9In&}gStMQBy@7|@)yT)@m9(}nbm>S8l3`h92;qFsGe&s)5@BoQ?6y(6 zhy|O9-NIo`r1VNX6)dR@6eLx$J0M#13ezc2x5)_V?mwmNGoHRJj?emSkU=WIY5H9= zJ#vabp3%nY#A%u66$hg#HK`D3&ES9@j6oJXG~f`7$F+}zbgl~p+-j(x6!r294N+-A zxke30zl$F`VoGO$=x<-X;|^D;)J8MQL8X`HAjV;=G2wYX2V#8fSpso6oh-Kqb_ z7gjxkVwYv%ZwhckpCO*Yg}N_Cl&8sx(lg)iHs}ioB07Qn_LjF_iY2*otR$8^K?hS5 zlD!SORdrnl%jF7W%qgT;yoWFsr}{_sGS}_BDCd0?Y>}$L`E}qz0f9Mg^qqZ5uL=ef zR?9NfWp}dT=88y#fo|%>clIJp9nLrLT{M-?O{%EcNaP{f(bSG1Pl97f^-EfPztudo zxeR3=F}f$#_5v^a4H)sa4@(ws?Rp}~cod#oM~U@|nh(uCDNVe%&44wx9a3vAW}#h0 zn^pJFl(fA2#F4n15pjZY1bSsWNN;fIHD9^=JP%dL0LRG*fHjJdAlN>MJCtpXJF%ww zVBLBYtdd(4A)q()D>3orPV9OL<=fAl2R&IIx>q6rH5z*=2{C8n6XN6I_NIU5+v(X) zR$6vCm+@tFf=#j$#C5+DIcd4M^OuQ$vtd4-;Jed#V3(Ntous!vvWDpA=Zn!S519=E zZi5BsAdkfl?{?l9e2M9~GQOE0Ums&Zh=kgZxR@Go<*mKZFwQ+{6l*& z#o&{pV^x@Ele02#TbN)y{^~j7J4t&N;foLYr?Iz{a^;b9`}1WiqnU~>r_Va)KxI%l zFW2L~4~7qAl=f{6D;+XsWtQvE6i@t*ML9~!SIJLIRwgQ`oMZ~5Jo^_vSg-9B`$8_j zHMGx4;)j193^9IL6<$60Kh!SI+OF2m=DXge->(LO;>{%K%00Ks9TBqKPb@e)qwZ>* zHQOIeK4Y6qNgPUvUU&Y+u1G(StEfiu)oR6aXX`5kgF(_E)pj?hZZ^qm$KJ{f(G$&J z;76y2N+!|78@D~^uD!#%&jwG#Qlyv4QmVQS<(Q1UhpQ#{BSPZID>}ZOW*i8E z{u4QB%1NXH8#-@K81?)2&UZuLCUsRLM!jb$fi6A*IQEa^@vP_T5!`aN{6%HRslA`8 z>OvO83ZkVCjLlDyQM#jIER!y3NAe>Wl@o{>WrSh;u>0CeQ$RH23epaa6*%WhhnTqc zTio|F`6c}=n`Hml^eoi^Ec`&U`k2>(9Yi6t9UryvWqC+TlxNDjyI82f#WSM%#314H zv{v#50n+!@7x!!&$-^)uzF(QQkp3)tonpIHRl2&a`%~NE%rP6pm%_rjxmLZi`-cH% zIS=DzJ&Gb2wT8YkH9Jdld>G=kLNSEFG#;$q%~t;4?)VyEE$#Kr)O-j^>goWU5IXdi zYyMfBKp+mY`}VS4?w!RW^T}V*){qYeed}-ga6Rtl_tLIZDl4fsYi%HQFX|uoZ<;>T zmlQC+=luIKc9G(<_cd}ii})9XKrJp^t+M%39rYo;4CY9$U$7Y7vYc?ni3hn;Zw3V_ zUvi3TG#WNdHSIN1>Wkg_ARk+vKeSoQ{XR=(bP6seSMb>c_YJ3(mS zg39BL(H|b438db4lc#l)P>#UsTAEDjr})eC;raRxcXQ&L)uE;^R$(;#Xt~Ym4Cd0yL6tUh==ax@Kldi9`_;LRE zeTOoqs+<=-Gy<+)_T={{y*?EL3Rbu8a)X(QuU!U>#F5`DisX>Beql7rF4GeZ4oR# zT4&23{&L`!r+7mc+|oZfruVBY0?4?iB-lz#R0>t*0UammHU z|Ipoqh(pb-B`)(^)y7K4*HfgZ?2XOJXBe(KzwCh1Y+bxBpd!K;h}`5kJP)EDRjm(P*SO~rhc~g+tr5`bm&44oX;e;R z-xHgL10iD-3VlTwKV@h4>CK(Ca_vTld>eYPeTI_hjvwX>YJT`(Obg+`s6PG`mcTG%5f_YT;U!A<6HEX;Y3UEGl$Y-ZuKGEs)^lEJ z!Ypf8+5!?Mh7IwB=$23qiuB}`;++*Xd87A(1{7-dZE90-K<+p_E54p;qqd)IOw5J0 zyjpmTvq&UpZEtVdKCGg$az-%9ieY=t2e2?bohmca zFen3V7Z&H<_k?YG+CkKqWh9vHmj{mnD1f>VrIHD6>Cn1^TFv>TsBr?GCr zuO1vboaVNjv9EXgXemWca(U2c;k}skQ%XDAqKfzbNSH2^70>NKF@!8a``GN93!Vqx$G5LdGK4I%V z`{wWT&yQ}Gf3t5J_TSd;Ld7$c4E?$MA?R45u0zfg#P;fm7F%VEo=x=Uo02KB%=d=!ZQx%i?}%*OgEXXaESJnhL{b89)#q z|4IMf@*hwUc;9uO@x&)3wU9xyhbdT_1<}z0Q?BuuC!^rS)yW?gA~qW(Q13ig1JZ!U z)D8V|dJ8RjM1k!4ry~AoCAwM}4_EiqVjK#SUF|dW?#O+a@`}p46#|QwiJbjS zHK4*po*n&tZ}v@m2z8K}VF6u8Do+p2NVUBdg^MsK^+&2+%YvSoYyY`v-45J|QKEP# z0+eArXComWZP@MidZ2_*Lz6C_NfDo!yy1Cl!^~53o?MV-mHI{)#G@wr|VmADs70#A-`(QYBq%W|g&58{tz4U!FUjJUzM zbhKlMDp!5G{V`bny+XKGgO*d%Yt_`|GOA9(`z^R2_pJ}?cuy8MGN) z^{exXi85}ITsTja;@7Ip)@0oI5*eD!^hz`Q&GzV*hgcM!fy}n8#q&n5Y&@0+^;CFF zO2SA`jA_6*AB0zmblG=W$ayH-`LcVjqHu)vZFKkO-V|5`<@q zBo32|h))%UXzFRiS;Wn?aMnyga5d5nq!@3i{KpF*b7CGT%*6B?REG!ux~)U4SY^ux zmkSH7KxyBhjepgwvUE=DOEDb?5rDU(NH!7RP&%~Y3X(Z z*!&oznq^?@vzaCST|<7pFj@EpCw@WU;rSOtE(3iN7YpljFQlC-LFr0*xTSrij`LAf z&UdS&ONloy74c^%t5!!&8XjG?7d$Dze5neYZ%!VJi;dPio*XJFpAwr&J&~Up z>aq5T)HFd<<#~c4b!dQsP~{aPg8+9;W+;`AZT0eye3;0K?&o#8>_C5`b)PfQ7| zE$HjfWrBA<;w5qVy*eKiF$#HP{8w!QBFY+%m7ch${>YPL*z@aONe5DI4jzauMOwZ} z$1FcBC(ky%czb(Q z0*XQ5YFAgXu+`Hoo+?x40!*CX68{v*hLxVS_;u~S#_CQpwT{xJ#Ngikel~rX)U$Zp zs~?91z23J3-o=S(Rh01doj`HxNPB{ zAIsNFEnU%z|5-6Y0{^pn>L*8RsQ|MNOCfieLBc|fT;sPQTbQA_DYN*$ys>@WcBNdr zWaEYQbLluDw$(U4nu(sJufT`8T;`@3#~N#}&gWrkpgQ=JJm>qYO`NHr&{@QkXTNB< 
zNgYwe(1l8LmtO$*)xU-+U*3V*PR!6me(ba+@*lF1OR|ZTMPqc&;V2$WnA<^T7^JPh z_t_w6ipFZ8s&QI&eXdTS)c&V;bl#2!A0;MFC?+bgKjPBK4m8RAe#uwfJ=f z+3xzJA=TUX`C7*K$;SuDYEe5i@hn6WaFrUaGD_y-USd7bl;$7Le6f5o)4-3M;KVGJb*Q#!DnQVUOMQ=-v-F}IBy5=muGhBV4Z1J?``5cd!JjE_{=kV7$) zNd?;HOJxe5@+@=~a84{nihuI1h6D+co4Eba8C$PqQ*gQD)gM*gvrN(c<0XyNi1xpb zo6`z-&o}B5rH6c)aY@}=;`L*E7Pq<6O|q^EKcL}UnYfKuX}DK-zwaH@}UjmgWFdh z%N~^vN$X>88cTQ3RrKBTZa>0N$vskW959~Kjms}3aLj-Gh=*8hCGOz&m4%_8LEfhK zH-mXbD)$8}iCi0+gO~fg;eFw)0z|96fP|8N?QE?9pR~6xnWEa+I}wi&u>zemvVgS# z(^^s&+og#*Zs8I3f!lK|&s zZr0a;@;lGG=ihwkn$eJd(B=i6|D7q9zWR86Rsy*C;^UonW%CDzDuP{Hr57#}j8o;j zbDSTpBpx9)GDXQ79dZd8612(v#;bm?o8iUT_O7nl&aDNH`s{br9FxticAII-17{cf z_^Lsp>!(yLqZF6H+irbIEg93M@8=76wF)XRy;PgRf~py@cuSt&f0ufEoV%m1I>>bD zVV77G;*n<)OT)OjM|H|95fNyr8$KKDMOA0sd=}T!1J5yi&FuD5i@am!@-4i)f zDr-;p=Vv+gy`lII#+UT=>O7RGOq;Xrc5;U64-@tFA2ff%&8hIuUpV)d9bN<>rd}py zHr=)9g+QYo&8c4ZhV*04WGc;=T@jkEeqFhbb4)Mkx;`Al-=`h$UYIR=Ja`OiIzYd? zemmq)5;zttiGaV2?Nv>ZTdMn=DA5;{nFDv**kJwDg0uBqHZS~#mfa1-*vUuE_H7&> za@bja9b1f(^UpbW8YjHY_lr((v3r+F@djh;>Uus@>G56MjU4?^5u<+xsf}Z`_-n>w zwU#0%hIl?mln|0BsZaFT)D@EaZiq#zu>W+M_zYJoo~f^7-dmA(Y-;3q@RrDm5R_}# za9fxee+`$maEVYYk3|dOuRfoh&i-j$>yUF@{lcnox92WRv1j|d))eDcn63W|0&X_n zUw((r@tEOmP`}UW;ZsMX>8RUm@-rXe*Up!8Rna2y7!Yb0F7St%q6&U@liwJWjrC4U3~;Q z`uZRF@ADEik$$tEa)j!A@tu87%DW6SENU%`6Cdd4aHNoV zm6fDh^sNtb+HdIjP@|Zj_a-MdzU|d!NF-9yaiK#UKG3;o;59NKwNV!q*=NlPxZD+i zty_4W1%Ti3ktZe_8T6uQvuFYKQEKP!7_*J`0GUoA)GKpNQeo zZn;@hA&B_(oaUuDezX1hqgx^`%8wphUM!oxt96+7o{;|KbvhH`D{?Ac`|Hfdv86gU zCt^uV^wL1*l#fz%p%ELDt%XMXPD(Qon{vN>SvvpVK`s6GZiZi#LP&yOcbWT3h7ezu z((Olb=@Zp{?+NTgOlg*8&NhGii6jt%eZ5}5fB4Pyi*1xY-r(pi3v_{WP@FfpHSbB- zLGllkQ7JF9=oT5lb%Z53;qsvQ9S+YZHxSjgiM-49rU8aT%u)`8V=6qPVb`%lEuxyT zHJ9VLBBI29DgW||hw9*od&4zbn)F&9AEvkOFJ-4;DG?&3r_tPJtTgp_p@p?7Pa9iX z#x$VUHG!oswtWrvW#6pox^R=({rdY#nMGV~)Bdi8OxU8dX8IkckCf;T<99|Hq^{0h zCK5fY?0gy{%$q$Y^WrspwAc=7!%t1-DaO-RGj%InwRxnWfGhprZ&ZI~I|ER2M zDtvQW$34H(ek-S`j3S8A2zV+M@p)Y&eoZkR(S%FD)Hc(gatLFy)9=Nxq4|q-k7OB$ zq%?-#vZVNxiR$*m>_z*^GC8%%aE5N(lfMSv{5?e*iL=Ru5^lZyY053ix_)den*DoE zrJ!Q@D_J3P+{%5fH^{Bx&+I+z!6`fgc%n4Zo=}yZGs|vqm8moB_O`OJV1DWzQ0cqZ zYBz?0|Fur!%|```i_^_SR0iP}QFEE;!jo=8G9Q+3fPpK9d7J%1HAVVu@U)jhrTHJP zz~R&22w%w%aQK)+KE#i*lpAT8c{fuZa?qUMd)h*4mqxJh;}EOkn)ESKv_<}0;#Y~< zZG{NheCPXTW%6^itKUf-8PXbqcHh|Dn*39>c2{AVn*kz~&+u+)<2Xk!ChnfjI)Pbx zwXd@lac>{TteEaD+3)t0M%7VUi?7U>W5?LBr_cXXeE6?M#Ml&>`6To7OcWql`pLby zuu-Og9nLuSabj!Lv@x!m_)?pa2i-w_^ss;F_MJ`R!zK5Kn@8bX{!Ws>U`dm!6|(4# z^F3!yBBto)ev)sV*dU-;6JZSoTt&bMoC)LOa_9lJI2eIj(xuoZagvcKfy16t<4zVt zgBbCmX(|1E#l7V60?+7BHopRwy`%fwQYIzc1nDJtMTSw7UQH7|tsn42X;rlyLT-MU z)DzArDxV{QEa7O;^8_-RHOVlnO<-WA8oze?sGH4#W51BjdimyDz+>UV9`QZM4;k_6 ziRr@oDQ>0TlKZjd6k#c91wA3s3-U}k043As-CZQ6|&stHSnZ#GP*d2ZQyxG*^p z&dD`hk7Hc&s~O^f>hKj=+9?$CUjkFu{r)P>E{YqA9|sM_u^3TQ0j2Um4rA<5z)9Zl&8XO4nK480Z z+23JjDBoV>-fSrTwMJT0i$|2sachsYw$KgxYxZ!>pn&^olTZu^_^&>=4pf6|f}(W% zJ`677jE7I=PZSGc&GO0%Y}(&phfBQt$cTtuc*?ooDO(emNC+W<3QoqxU>Rh^I<9*i zsy!{rQbH7M1CP=ep1D~6VP03#*0rS}Ct?$cf`m`N=@q?jVNElak3EdXF8|5SoA-l3 zK1pwJGySpcem7pWrvG4m@GK!e0=js6$$B+L1nL!xM+}wx^m4M!@4rTwQ%1eB^XFix#7OLX;JC8MA#_U&?l}Um@raTGp`lZ77UsMz0 zDd>7w2JCZPuM#yV(gRZtM0F!k6?^+Lo$3XfjqDz)nY-2F?W+`r9l$U*xyAaSTCh3P z7D&*3RL@2Pdm{eY#uI0JBQvABzk@uIBl^W8tqQJ(q*{Z%OJ5H-syzv<@=ArgsfV`O zZ^cDxne8JQad5ZwHI+($?0ZujK&F6k}gqtNR-0 zGRtr)4?>1#&K_CI5Z}5<77WAA_kOM9tU$BC^t&>YYkZu24i-}MQ}2=e$y-X`9Wqz^ z12;Aw*RHfS_FA)|O#Y?2xbl7C3}?Sm+^c1!B23o<5?ZW&-EjnT2j~KFXvmJsR{_#ZQ>|TmO}=AU7R$Yp0g0e{gWZt zA>P#a_f(2$eq8VEpSwJY!_Ph!8pI9V%@`A|safoH{J^!@LV?HR2pm=AO7CB~1zzpm zc;UZgKD3e&42#}X$_F#)Hpk-gEEis+5PIgnIA~r6nUT2(v3~iy%UYw7E;!@hufCz6 
zZ1vJx8O(%!L(?NV;97EkJ$tG}v$sMuCtPY^TM-_=13NIFt6Td1`t|@Gn<*0= z{Pb6IvI_1}_v3jdAxceUhaO7*5yKI;tX`@_0c;47xJi2Im*t%O1z4EOgus%`@9rzo zWYf_I-lJDS4cc%&v9r#>ps}5K(fhe+2l_kZMGu?Wq}GeBd3=u0E4Un2e=s3IY%b0x z#C^L5$ClV!7AC*282tZm_SRupKTY4T(%s!ihkzj69U|S0(y1WQAe|y9-6wI90s_(! zA`Q|Yf;60TH@tiPuKT*L=Q!^BkLP*+67ajbv$M0aGqbbvk?V19PUaUT)>y_sdH?OG zzw*k}qVK@{OA_SkLN3z6UMv?^6?*IM@ zS&-yeF#KiNl-jai?#VT>7*eO75$e)WBGTs zS`UV>s3=t8zn0ka{j78rCoRGkB+3snTZaC7QCaW2_V~{NWJc=hrUswu^=7siBy0$g z<3~=-MHP^ux%VIn1QKidZ+e*y>zw@Mpc-w`g)LW%j>V!1uy+`BZ{88y`J=eHczK}m zu5tUx1it9v;MG~(6w`5YQbhO#FACD^m(+Zrw5TIBDNW3wlNp?!m^gvgfOMk&15tA8 zh=D4c5*Lkj+0TCdZT8(Ozl9HZywTKyOB*shMlBa#no#ymHu`A39HW^<=o1Wtw?Nfi zRhw6?#8OIp6Q1`tOvB*Kd?^%kg8Cs_l)z(uqUqYFO=_(xLlZibyNu2!!-&vEvpfIf zPTEo}TZzy*u!dQq)=GKqt9{Y}$g6jQXX`TV%N^fh+c*nR#y|Sgcx|31!7Cz)f8MMc z=DL}lG1rvxYcBXC@Dqj{UcanxBTudt3_W8`m=>#gs7hYRkbBruPoKlt`t(ntq2$O+ zr`^5vy45!y-MO{v&ab#6VOd87i;4@RbfLr4B?iErOiqem7!E zEbXo+;k)k`-ibsyN7PvevK64|T3%{=xx^0MT^5&if5Jg$WCJ-2%FQ)326~LH?EBLi z`3GNn1lm8?CD4CZ%VS-SdBayt4vO*bdAU_JDtx4;*zPeqM>1P3Wah1)AK+ z?mLnsDS@A>>l|?`-0khaKQepT!M3>U2<}|4^wlXNCEL#%H;l#fV-yYQazy(2CzQ%N zrYEyv)$L_2ET2Y`<8M1-*MN)D=TEa0C0vXXb3|%=~%u*`xJOF*n_3xI^SN)6^Q4&5l&JB)pn3{&l6IR;z?IPmBo1Qf_V@;C@OMY zYCj&6#Vuk|jcYs+;CFLWKY>cvHZmCO!p0 z10{$Sb(~%xpu?db5S+AH)qX9#L>}lUAmVuC&*043UHJntl$uv0*`!fLtAfL1#BW%-5wgvW2TalGlP&__jO4AEEKD7cY-r zi^p4Ng`miY0}iO0WS=UOs)06(!PWjMZjXZu+#z@+%;f(`{W)r#Riu<9Y(FK@WC^o5 zY0OmPro(+c=B_jNDzN%$_}0+1a{M$tr#^+3H6rs@Q+Ou9f`#Ah#bOfs{mb@m$^C>! zY2nXX^XajttES!Wl%ICqY@W-G+jL!HY1-)UdbMwqkyT&)KyiFnCX;#Ed*8-lx}N^# z8nbxsYF+zlkIyrEA(w1#_D^nysoAB?h?X~mPj}3uN4o+*HU;-Rc2=%!(P#O^YG>mz z(htPld%g&ZC?Fnb=fYDxMn%<0_D??hyd8ez+T6D2PnVJnQZLwPZqrm2;xw?<8Jg;+ zFMnMBz91EMwXIrml__wMER3g`9Ho%ZM@!164D9f}Cg})<{>~CPoFp%%HwZjqttQs1 zONR{N4$pmNhGG5667L&6*(pXmz6Y6j(&95O97eb{CZncBkd7& ze_bW}X|Z~~??}JvHeK5~d6CA=^DX^)zIAY50K=bs+!IEkzuwuM9pfBW(yP7%)x~4C zd6PtF$CwGcRK@R~04U z82GjK-PPx4Bxl{{Le4|54>S~KQ|k|1(O0}g&v`{=;_*Ov z-H+X6g+W>u1MKe&-lFL16zz0B;q}hk#&i=(a`Lj5T^d<`jfjsd8iH{^QG;;|acz;$ z&sO~kmL9u&EQmAKYURp6dgA^5dW3Lut;zaWr5;zm#CJPEAq=@dP(l(ffKH*?({F0O zoBf-Upp5W{&9$pG*NYX9sI4E)6 z5OXc*_>D(%ULF2cSF_p??Yl1KMk0$2BksCjxhaq2<}bgCT2}(JY;XlJHs{@5)RBoH zw*53c(bOe3UlQf)=Z(!v8vU1TY=_Af_ZuN@6Q44RRDw)4lu=c9rza|IuUo)1ObPmB zfv-u26UO=gS0n7D2$s1--_}E8b8~?oL;@}02=+0AHgnp^?}X-zS$exqx%_;9?`2E| zxP&yiLs#cSwdWP~vy5xY3*{X5em=v!PFnBKiJjqulP_OFI$R_fFE3$Uqd4*_lXl4a zDo%FXci?RT+LWA+eXXciSnpZksp3i5AH}u6>q(*~Bm?Rz%CYtB^9Soj{$=$NhP~ft zcfNrv8y?ATEq2hk%pD>g8hMbrr9!ghr@~5W+7Q>FAMb~Y@_CP=D2*27A>( zd#~svO;@BnKk%sU-b=;j5qLE}jadRR2dpvz_BES|vI2VDw}UVqBgI+7?Q|OjmN!)ykt2nBklyYx41C; zsH4#-zQxZs@W*uOeUkKA3|tHgNh|xW|D6jUX!B|Kj|&n#cmPcr zj4{HXQ&*$Dc;~N9myCcgpJFubRagbh$1tWm^~I-A-okZ!=N_=?`+93mzUXZ8y2l)p zvW6|-EU!PziekPgkKX^t-P}TnDgWg$a*T!?zI4}$tN-k5SZVGRIaylTY!F?vfki>^ zEP6wvw901DRrQj|Mze5^`#}@SK9qlSmEmHs&33faePlP?bMcHzcPn1v%Fq9RS52N+ z9rh+0_ZM2C$kNtLxGd4)Z+U28@M!I@nP%VbrHRKuD?`~e4XFqGOSxlX9@8P6-pte2 zg-xGqH3{y-ZgLbu+D%`hC5s^_z>>u_3{QIAa4V?c#0N2)*d7^cGc73H8oqH;!L$rI zMht$0fH22;)Wa2Pzf0Am+fODF;CD$fqdQG}=WNcM`dehmEi?XMd*zymNz%`<#@F$u z5wHJ|Pn%2n@7S+8U8~k@;U9bRn}s?|14mz7gh+f`aON~R&_Rei*HU5{wC5SkfkLp6 zh{_n$WL?HK-y(=9lCI&&q4G}=x-u1Ct7dV}G*++Ak3uma$=spbmS&2qvbxTQl&Au{ zzV@`we*D%(Pa|A1{VO0z)P?XKs;$;AGmIQ_ZssG&GN_$`7A&+(p0%JsTgFI&2^q3_ zEj4S{a=mVPJv8l*)tK$lpp@=yo+mDb=!RSyEp$lYQ#;6}_Kol){b6P@fycI7F0HHU zQCC!Lfc8B?r)!?x7B|h6Wzp{iszSzT#reg9&lO2YlbrFx)RaZEpPbo8?`v!STHUbx zIDUq$zuF@_zDc+Bse5x<_D|@>HnT%)12yOT)@Ilje1R8EW8r~yCJ#3eeY{#Mj>Y6V zN=Qi2bN${CNnR>(r*iMoxr=*7i+ke$^;>AH5nT>@w^e*%%Aftn 
[... GIT binary patch data omitted ...]

literal 0
HcmV?d00001

diff --git a/docs/assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png b/docs/assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png
new file mode 100644
index 0000000000000000000000000000000000000000..94364e593fe68cdf46af8412a2afc7ba6eb33aea
GIT binary patch
literal 130810

[... GIT binary patch data omitted (130810-byte PNG: prepare_and_finalize_blocks.png) ...]
zD}%Wbo>w{<{0@aMvG1U2Wk)EcxM^MwkXAn!WH)`)&KWjYyD=t}%b`8EoFV<&jWX$> zCH$F1kaTDsiO$vK)R(i_c}e9%H<3jMoF6h3PU>GaD2?<})#ezrGr03$a>PvmvOQbS zN1zOgG$3#r9OtLltn1G&I$UxJMb-aaZ;Tnl-xdsBWj&le6>+uNkb)>2c>iTPrc+;f zjekYg<1IRHm*V-Qb9$BImEHz?Wc&Y3!^lOX9W7ExrN@jd8@<#(nN@Y{71CQ&3)s#! z#I_MrOdnz4yHRJIQRgcS(y!2%rH#k48)i`Oru4#{U*+I$Y*lU=~j0(r5 z;`vZSMsmL~_%io4HiFOiu-P?i13#F6sMCqdBg~rNHaL7|Y!yiI_tK(nGjbO%QiX5a*8!Sb2-LDEdbbjqpn$$YvS5{yYw){$xU}FKU#>q^!!tS z%Kg%9%e;O!L-iNu=hs&OtcR>wisehaeJp;l%x*)t<1*ab%WJ>mFbpa@{9d7KERhz4 z%tub+*}pj`Ha{qlAPwIAc#8HhXeDp}wywam9)gysoxAUiNRZ~vu?6d;$#rT}T0@1X zi&9~B4nCz^vDYcK*lQC85B*=ftQuN?eeZYZnuy{Oke6%_BP%>Q?nW8gur?m20 zrxxL|&L`SF;>;pFrZ?g1=>2tzj(P%JLgd>Q%jTVBoXb2uibO0 z7BXR9{*pVRA$FqK@=Tny9yvAa>aj=FS;;hVa{oqEGu;kH{W+l2Rzy1#c@Hz7!L14& zA@aAC3eAvdzhYT8a9rpcQ`O$+bn)~_DM;~IP#>463Kh{!MU43N7uIai3kjB0$`__r zL6yp4uvc*mUrQ%G-PCWjW6NzIQ_?Bxd)=M2|Hmzw16b4!(gY=YsU%V3%?JzNhe>{M zzF~JGFf0CjaR*W~bNnF(oHEaRCZkp>QKU%JT-atYdVSkB8BSGy=KbcV3NTd)&f#-k4i^%VX9 zTLR;Rga|UBe|{1I6t;i8v(}Bm#>TjMh%}u||5Lv{{_bPjE3r?^h9qbmi``5)zPS5q zJ$w5PgF9mL9*jqLl`~DBD1cHjnZ-MpZDA_PC7~v?T z8LJ0u*2xPR#)YO>N(&od&blS8G_$JOuvW**U?QaWy0?ddCTr&T59@AC!v?Fu7p0Xe z1LX;Nup^t<7@(4d`a`+wf7h1X?4dnX_CUXTl@vMbmlU|;<2jIJZJHwHMn>G0l9$L{ z8~&@$c5$y1DrOz!$MU(el0=sBe4b&%QuRBcycxw7Ze|~gsz(#?p$8z>oqi@5;mvtR zzH94nvLhATmw#XlA)k0?e|_@dx8Zq55NDNJH>$H<+w4 zxn7s0ct6R*j3LxfVuj~!$E~s<%trAeJuE-A#`PwU5&Pa1ZxJ?-)qZaNBWQGgEOR_i zUnNK5sW$bvF9Wa7`r4DRGX-uG-_E{>Xg2dO}Up$MFAabvj1aKAUkVnC|_q;1|Of}j(c{tu`gA?+8dYBbvj$ z(gP0j5StIbIUWm43^_=|V8j(P+Sb~AK9irguT}oBn;L?Ok~jkRGg4CR4hNp~$;gs1 z&WXqbXFAsr-k0}`He};nf7lgME9z6f>_h#SfO;;$!yUGvby$PT+1zTW1aR)|;;L_S zw3M~Q#UBpu@mlnW2p*)!9zW?Gp)t?4sTpF{E+yee2^(7Wd2k*VZbpcT$S;-Z4ewN~ zwUVQn^}T;yNAn`zd#cDkQQqwr5ewP;Y@YVMlu5(OYb~M6X31nB)+bHTe#&$R4iOXpw6X57%~BN0!DnEjULxP$*r-3i-M~swsHeTLQBU4W0Ke~1 zf^!oQHyHwQx8?8svTAGXa!cc9ikL_lvM_2OIEHu2reVBr+_PO+JOv& zl|81ON}AjO=fX^c6m|D+oaJ^7cM5tPFFdc!KeFBVXDE80Nw}v@Lu`on;hw` zXyx*w$s;=)W@C2S>F)Sgy|BKrW7k(CI(@Oi!hRm!RJgqqTgFdZ4G9B@TR-P5{rbES zgXx5SQLODZ@}YmNwxlqfn+~ZTT-}|$@22-s46GNjtW*GrGj zQsj;MKTDRchg_eN@@wjXD~;Px%96EB_q17oOUQ$U2X|uv7&b(qLg$|vLs~+tw z#a&!_!g9LLjAkd(y?AHM4u$fa+dxnfsIMnT^ml$vD0yo6K|C%}W8HsqSRlH6_q^#4 zTM#(vHVf_8^{)598KaQ@@mD?}{b@}E6**#>9|^6t8S^(kQ-gV+=+?9IY(iGj}UJCUITIJea;+MXONu%NACBrAw3MYMPWX zQutAgJ?^}g^o%f#?!NXKFtw?^&kA8*&WT#7-fh&tN&IG!idFaW3oDL=TPlb^oP(%|w4$a6dofW@Tmbv( z7860bqYgxY;d1#}aUeiDsC=;F$ad|1mGhg4FW(QX$9FfA!G}A0Hd)=Z&w9UXiY6a< z9^wLRuPxFpc4NpN0FynoLtl!vQnzEUG2;MeB;!dU(f9vi0i^CxvOLu?#Nn>$sMKhI ze_F~i!+kjvn>rPlzPvC{mVKtQP`zc-nyVJW=mmM9QC1aaE>*IW7 zd%|}Vm5Ukz*64%-Ir06{=(QfCBumhrE}LITGvPfA`toY?_<{>Ib6HTGGq3X(?LI5v z2(x7qlP*siK^py58H$*rA~u3c?lon7oT*p(?kEcPa}`LFLgeL*3h+R4c=>j{6Pm+c z@!$3jL*1u%o+Y^+B_f(7o8Xbs!;&@iANB*xuxzFB8#B6~sAJz(YP_F006@mgHDBcW5Lv!bPxLyAR(# zKdUZ|dbZX*;!S7e(@2yeJ;r7<6uulO|9A6zHn(46J9PSkE32!R<$KOk$xO`{b6u(O z&n+_7+I0CO>@m+y5z-iS(<5lFI}E#_pEzOKkzy4>OVVUmb+9cP8IUH=dTFM zk}e-C;w=c$I%d!?gul=NaTWp!?W7zJeywkcS@qosJE{D}v1CzQ-f9iHQklQmkNQtM zsdMUes9gyTfusMq_zcObE>m0*Zd$W8A=25Dn|s~)d@!2M+#)*lJ8;>GdR)C-`5G}9 zgsv|A{Rh{*PsCrk8b8lIffsAb%E{OKq$l*Wq>E!4VO@UZ&sSuufN6k<;Atq6?-NjP zF-*dqi9*()dyM%jU`zV+ODHE*+yFO>pXqzlmS!vV`?bfBH#Rxe~Kb+g`MFAuHYqBuk;m{ z?F^M)&fG^_tW14({ow)IaJ@8r?;P*xr`g$auPC%BCnb3>hqXbfBBVut%M)(?I;$%K z%7?=c^6#F|PAsO<0@2N$^e}jRVQXylSj%Or(Xuk_A~OZe%}=^tSaSzTs#rPT^3G&Py0Um{^bIKisXf!U zC{!be!U{*M>2k6dJX7_u=L7EP5bF1hdkRInr$`#ugB$nslSD;?S^78JxL$_NOMchQ zk(TSNI}PO%naorMMGxGIeDRH-$p>#=oK z@{9H&ZX}}uEg5j}`o}j0lLg$0(ifKAkbx7kxy>|$iGJ!33+^rv8NZ{@-fYP$;LU~` zy1fPcb5HbvhZH1{R~w`*u3|2E}5hWPTecX)*^oMKB2MZE~Qvny|xX 
zZWbO67GAodIr&!+yfT}E{Qnp8-;=?`g!k(P&LkgYnjJp_UMbIgM6>;rBqXhtD8VY> zB?wWM8nN2a-Yy5X?;~6-eEFM>dNUpspikxEY&Q0^dzCYY*(M;W=2e z4LYVkX2*mk|H`gWP+lGY|8zX^z`E!lLEhe`*I+zO!6=EoX~7|%BhY!OzBI@r?={8nhV=)0x;X7$5gmPp-m&c7e6S`j#bi=A zC&4|j35788yDw{kt{JY*=j>PcuRVo(=ZLB0MwYj!y#y|g+s>!??(rC0d&_4lf1_e4 z@8;Hs6AK$#f-1)~#Os87H}57QzKYurN;%8v>HMr0RdKjy$gIcrv%U6BV2RdmrRsJC zu)7w_F^unO5WTNqJnzwm1w$)MI%YS8S5I+Yq~d+ z`Cbv3l*3+T65Tzp2Di|TSW#K-zY~6z%5l-u#n-wqJs&95sWQ##o`K8{uG(56h#&FT z=uzgfCBE)veb;YD_2KoG=kw+liroSfqXc*B>}O=UA}QY8UyHHAxt0SBHRguwulC*I z+L%P*3i{_n4;;?6{f~Q%?!RrmI{qY+9wew{5Mx@v;F|(3qvn66@Q?dH>khv+peEl8 z#MI~X!=k*g zSJ(|cV-J;1d1eceH7^E-K4^^aXBvWDiXR0da%AVm#BXiPxjh_Tv|CxRpADF&-ha(r zDRQzZMjeC1QfKAbv_8x;jvxx6bb zN|5`Oz8tF={-q?fuS>c*TWikrPeEv9NuljFnMhTNU1cu;AU9(L*%P?QL{>C!J0#qU znAQm!&BEA3U>%YR1ri2JTZ-pg_TNl+wui7maH`3)Fh4q_9b8AVb-(}U`ND+7S3Vv6 z`!f`sCVzreEt3D`-KY z@7bYkAuJ7vl9C=mXIfX3CSChTO^k#*Z7x7-~90`orsFAWa^6~v* z?8_3=NHwNDWi2*NzBg_O)R4VO%@znkj2+r~v1db-I3>lD(dMBYD0D#&AJPgoeLibm zKK8j_W6t{nCISsdo0Q`8AZN>CWk_H>GBNMW)W3cngdAs%0Pz}x>mh#7(tYW};Dm?7 zzNErqn=&XN)L^Z$!gRgg9aHh7$o28EY`D-lRUA9sSl?Vmr$bw}cY$*&1^3l3i-7xD zKw2Y+C%&Di348gLa-F+l?<cPrB-k6BKH_r2N}aZG-;)_;)z*U5t3RU1 zJbFbQ+RjS%Q(aq^aI<315fN`-Q%rkf(QHaPfVL@(Apu6dReq;H zVTMZIfoE5fp|th0G8E-~JwrKvN~~6YO;vYmIpHMXc{3&4u04Fhdhk56j2Tv$;deby zM!yz$AQIxh+u>vtF!QjT<*6@-0)%x`lF$-A%lS}z*qnpejp^b8rnM?;(m|x~U5n`qbq;*?Vz_D1l>s`G9Z!sG;@|2F zw4YwisJrOr{oU%8=(pLA*hl!EKPAqXAVsB6kB+JkumEYtjvqzO{{S2L;#Rvr-6o_<#>6}Bett|X(=Xx z88hEIZ=8!lAk>PTmi_GvqB?4Ysq}4q3^|XFIBAreG&WR<=$^kW?C~jw{|9?-8CGQ!t_=!;f~Yh|!=_uM zy9ES9x?4&>8blhUyQP)x?v_-MmXek(>8@GtM$b9(&ADd&%>0?}To*r(?S9`C&wA>< zpJ%0TnLmL*euVYHQgQ2P+@9d<;(<1o6PeqQlxj`_&Ofpk$II>f1=E;}6q)RmvL%45 zGK;#+CO|q$?7!1d^_sX_LCRy~xJ+?mj+#G(Qyc5Zkqf8Z-cO408~{b3=mIDVf94S zr_k@Gn1=1+g?J?^y;xhPD?wbG(*@a@O~;P7L}2?B|FDld3Nlu*wR;R#5N2Qn5Vlh1 zG^dImxRwD=Ob4aM%&sb0yF$6{4RW^yNTGBsTAQqTqPt|Xau$vIwKpi?^$f{(U0Doa zbw9F;w+B2I666-`Kqev|LF#Qjll{X80HM+w2^QYLs_mqMr%XZY!xCh!2G-UPUjn+50Latoo)nh)`tL@b;CD^=4*dp-ug z$%jGc_h$&%@Rz<{?i-1h{X%U^2t|ql+3p|ik^kK?JVLygd;|G(39tW|w;MFh^w|y|LwEkK{XeD=IfH#!0ctVW&QM?tSH=adX zRi(Lo5M&u;2@~46f$O^LVFTe$e3hFc@rDz+SJ_oyA&zK(e`|ex{`1TcfZMS_nKV{B z`-{SPL7jtz_F}j?y@uZtovfz-Sq6V0^7^4z_KyVMQ%$}PFw<>$g~t2fb@cC8tr^|- zvE9JMEO+j$2{oK>%T^_bG@M{a%_M=AIs$00rAmS_e`zFGta@CfQ5_*Yl~w@h;b?CX zz+BkFwuHQ+oDd!im1@A|9&;{o0fcu%0ZxeAY(mj^wd+} zhIXrI!=2g8c%Y;p#z#ErBSYxmt1PW-Uh_K>eZ1P&1{wRG?6xvxo8?-T!(c$1#hjfl z!P$DOX@?PHEK?u*4DMH#L%zxrw`b%MWreDxZli>j+ReCtn|4dOrx}2hy^J*$2<`%Z z?gYg%20gFp`K&$xd)Jf{XD#;g)C`A3;pK5d<{3UP%SH8EB4&Lrsc?G+S6B^p4`RP2 z@ZHs%9>#f+pF0Zj>ZR5j2-<3$RnOyCgG^Eg*G(r1LBKYV=y4dQ0>WPy(Zzxy)7=(y za|FmuuYC?aQ&HTEe}hOLXpC9BAedUHR3X766WtJC#G)gzl|2$6pq7i(!Dn;S!4u-P zqzif5ox`ez01`s^mWcddaHD^00|9QoY&X$sMKUG+*N9D9&WIT;&DV z-&+=nyVhcM%v97Zf(Z+30ufjFYjx_k7H}@g*CExW^e+cM*3D+$3j?A1dBwex|xTVwX;RU^1`Dks>XC3^qN*CEZOyid~ ziT{(0_?Cg=!9R5BOf6ze!18gy?omypoUha(b9=5M-VvJarigPD`iVR2nV3-^+vBoY z_*ui2g9cBiQQ%?hhN|%74D&&Soypm_ceL5hVio~o1#pD%91G)rq^locsiWPO`M+i> zm@kp;muc4+%aLHl;gNY_cgIQz0GRJJS4pwb=d3_|!A!-{a58?>Iz{P@!c3!Mck#fb z->8Nf`I6zl2Q@NaToGS4+lLG+Mo(8#kt?U<$Qdq)`}L#498#qZt{~s9knZ*bhf#^! 
zjKMT#mgtD+i;; zE~-5IS+#UvD^Du&ZhapUMy10N7U^F7Lz5N-I+YI= z&)@y>Hk?qh8}$UGOlN}GzR;HAg39gt1W9yzY@>f4>= zRyy+Sz!R~AG9rF2j^n@O;ht!+JNh!zR$+%<&q{61+M+-lyMFZeJViy;F}m?3VvS}o zCR%I_kJaAK_39s!16hhbI0MX+qO%!zV$X_9*7+-O4iulewU;xiKXEOh#GS$q5vS@w zzh#$wHdy0#ZunfQfBpd=tIZ2?=-S%ocC|X{d~d3rBz$lqRpQ5C z@T%(4EFG@yZP)&yFW1bQe;&0D^N2?PhK*j*7@=f_(`pTM{^hIvGCC3kG=;L9ou46^ zVPmYFDf{12KdT?1n!GTCj8L?&-6OcSfe22N$Q)Gx(-T0J6LoAKVtzY!NO3l`yFJ?{ zz5uvVbuV0!6*?R6WHpL2T3}8xY7CzlwHjb?p{{XbEaL$j`(qhR3|A`En)L1_b(L7Z z9TUd}d^FOB&ST%kS`?Lkpb|%I8WE?GWSI*1n9G=tb^HL3C*|=8jrLg}p{U_w5sRE! zD$}2pH&c4^sCXuzE5q|5`cE35u%vgkuX|<=7kdU>)NnQ`degkUDyFZs6CeziyEMHo zzha}p)H79{8eaeAXQWBspVUB}4xhI<|aTJp?oWRq9^ z!DX}*k;bPbnTBXJ_rC@RtV{Y?s|qYC64o&Cg+>~y(M?pKca{qPZ~;i#CS(#)J&uD% zHt?+4K>Yjoca}(fTb2D}ttDgJb=@821`d(XE|s0Jj!xc0Cb>SYn#~zyNrS9MANk4Z z%rcr)8M4?nr`IQ3UWZ6t@yVu%HRts`yFOdP(N>|K5KCPv=oe#6DSMvwo-+~WJStsp z<~3^Mq9e$;C7xmD+P+2I%Y$p%T`#M!Yt0gX#X9~0D=~%#7+9!>* ztB=(SVky5b#4Db>a|c~YPkX3{xq@?o<`?e*H+nrkgyC6N|>)9#Ox&c9`1o z?!Ku}*s7!Hw;0?O%eM$<$Dv*J&X>h!93#~v^|i>xlxR zA2%#BP~We_q%~Gk83H9}qChrqITb}6(auEU>cdKZM$4xj4HF*EWJdWHxi1y#wJt*6 z0(Q#7gr~+o?WM9TtvTxMF@pTln_Q{#!Ke8ya3>NOJKv?kCBAjUO>%He4_Pj$10xjq;tO_#_CfN?p)( z2}L()Qd822C~no;pV0k%bo}vr)q^Mv9(}&+e_%Z7SGmYHI3cZvpLlc4<&OXw{wbYxIywIBQ z^k~BK?X*)9rH}PtT8T(Y)RmD;9tQDI+f>W)23=xK6iX?GJnqlLT*c^6lP==ERCzh3 zqxT9z$opyaIhh>T&c3l=VGZ1;u)WA_?0RSr@^nQ8IURVL0W)l*xJ z?}kjClpEDsyOJmmuE^GU+M$x*#yMDEw(cEY@H}5lH=_?++IA%y?0=4HoAC`DwQW+W z9nbb^s_;(Oah7Pccy(G*>`MiqrcgtkCZwWs{?2kocbOi@=l3#9a0R>`DPi&b*Req! z8d=fZdf{p-8EhHdUtiEEH+qq&8HeZf%C>XDGGHOphOgNHl)Uy0iy#ig_O+qLz4y~P zYYLaEXGRq^2SeHr#gg59(?`K8KL zQ>TCex0(xc9JRaadi61hQrNn?a)pc8a{{*+_q%o`p#!37=`B%$It^ysIGyF<m^U-&ESKKKDP-SxPdTAWbv%)RZpUls`*`12}FnUXjMLd&`zopHLsA zYy61_qk9x~6+SEWRF_6STiqk?kUge<3}b<-sW@1Hpvc|#?vqcWN!+AOXeB?I7y$)d z!IXX(7vEDw57&6gdNYK)$-dU!Msxp%_dUwu7uYKNnyFt>4HwkFIx1?UbqCug-A zG`Lm|)qbDI6TigGU6L&q|G3ocL&R0WbI&dCB(|cXDyULvaD|vR%k_MX`i#F!JDkI_ zpYDS%p7{RIZ3n`mvelFmyCo@hh6(KH0{$X~S0!!RTC*Z@hi)s+!!4>elz%qa6x-FC zeN({@PIW{;Xvw!C}%5VHpTxuw*zxN!eJX7DA_lbPkcHDkXg|lbY+0|#*R!aff zw>lD^6K_8~I?QRAaAzGrU*=h&g zo5tyQy&7i0>EU$b?**BHQflDgXw|1Yi}t#Nc6&?m4_UAU&YPl)!4+lOq6`JT1EOc% ztn{+gzX!xsItuc`XlSjl9PvN1z_egpLz#2&+8o9gTIol9E7eN`Ym~Sp?$dz(D#!nJ zIYa#+sh({XJjm4ruG1Rz*A-yHeSAQe?L4&zy8Ax2jD1#Si7TIIeYdmxRF($@g?2~` zARrb#ygw!Z+;H$a$oBBwX(;%iKc%+qo>I(gKB)={Z&Y_PPn`g-yg!CA3Sz^4f*>m8 zej6R|x@_<}C$4Gwm#A!cS8JNtP}#CgzR@q7*r0o&OUMR9sYB=oVY6#rm^}D05+-;d zxt;1QCg9oLO2+sk10MCV+eQN#`5ZDB`CtAsVmJt==T502)Wpk(zs~G#{{m@^p0o;x zj_bt=9(W$bxdH>p1|l%(me1(l1L+}1NKgYnLh_I8vEWW53V0sE{2dk_ko=znqSz!i zuDJlLYfqYqrdXLJja(R*w=kl&90)K=FWW~z5dKjdcs(FdK;UOMK@)b=DjQ^s{%IEz z{y0U*Ctc@e0UD}R0Q~-Ex7^h_$hcwKD`qQ4eR&4tKFATyA^Z}KaE&K|#V?xPuHMDz zc|%#zdmEr!wj2-OO9CJ{Y&awb{tNSzWnv^wPxj}^DPh0{wx@pp=0)y}114z?F`LJs zP5kfw6PWuS=%ua@9$0fdU6pD$p<{ToCSPVrrZ=nzjavT|WE`Yg@xc78=1qn}GSxXB zWG&rESFcW%F~|6?INV<-qG41C(0hIeUILI}!9o)b=!P)aqyIk)%0M#VnvZq9Fov~%bx@!q3LqF_z(ymR)xIh zO~w*Bl>bIs4KXFqxROk$IsppG%!TovDD=PO|No(|1gM)(A~>T=;R+6={ModClJ%Un z?7i@k7zr@%H@BgqQ2N9hLL<9pKhst2;(c0s19X=Y_M-+k#UBG4p@)Vmwrh|Wg6ltT zbm$gDvwuUH(**wN1SKdEtD8l*cy=v3i z%(o%qeA6PncD=$LBQunTc8uKn6SaQ5`jyiKzB(}v*(us0x+a}AH#E#~2y@rrm*eXt zw>kR3Go{kPOiH|yOVDn6|K^*wXrXUfoD#bU?K0vpC_ZkLTiLr=Zs42Um1=cF%z2UA zb?Htf(qKZO;~5|T*HK75abPc8j=fXf9!-0GR!&g-Kh5zDRInSS{rPKhr;r-y_vt2u z6&4D9JyP=bI!?Mazg$*ZDH^ZaOmo$Q<$UJA4Ke5Xv zb)Lf(h^%?}`xg-`zq?ULq-7Qk^gZ3WpKoR1WH9)1sdl^0<10yOPuVh53lY9;O2KYQ zca~db5|-c(l|X494rPybiu|A@7kX{^tCC>fLpNx^vT?WFt&KPm+7HBVFntYK;tL0| zSAB7)b#|Bgo#Z>RD{wwM@F+kTY*^5j)nkw0huDz}h9e` zS3YNKv$`YC7;byIXWXIJ5U3~kirSL#JfRwn_pu?A1T4uvYYDH?QlrqjYRrUJ98qDk 
zLPG;CH33MJrh~*dcl9-rAOA4E%ii(}{E1=7SnjLWgm(l+w9r&d%b40!k&40nH~V=2 zUDvMfDCZYiYS4HWqWgSj<(%&w2`FxXa=oA?BjELe-oV5hw1;HP?_29CRt7)4GXhk7 zjUhO*4676s8ctP`+WfTj^?O6=pa~>OxZ3faqr$*3Jmq}6{2ujdDkEQ_Lt*%AOG@Cg zt&Ptf1)2|6Yu-$^)k`Fmxeg1y>PtpldL8ox@QfW}stqQ2zwt`Knsbz6oV}jc#hC;w z-K+X;*q?oQ<|y`qkbb<&wK&J%8e0QTg#UD7?OidT1V6iionjSPeA$+R(R+$JUFZoVu31?MaQei+HE z5WkSeg;pB#6lhoJN?h#pm7t)0D@^ZU*gJPN4>4v_Ga3%EqSGp34~k<5U@maZPuCbc zyT3Didy)!G6xMkfyeN^R+lw8z1ggCa{#XYq6FQOj56slcX6YBOul+4Vc$2L00;6Zz z@}Iv&PJql&u3LZ?|jKw=Sf7)&CTwjB;Yy0s9Gc_aP)go=H zTPbX=E?mSHfB5(7U9;HWb;d#(7b=qV8td4*z_s7)J?Vd$sg7Af^ab@sA@NgUtQt;m z_asSYDRG&lytLWa4BQ_td&6~q(8XcaR`$}MI*go8EjmtA7Vze%h%}mQ&R|{cdwc-) z%E%ZT3SkGW{f06{9k08$2YqY1+#V~0V@EM}P?jaL$v<=;^Wci>)6Qai=S(p45aQ)5 zR<7>HAMh($7Meuor}D(eyZE*S8qw|!cy$j}o8K)7=@+?ChKSK%%;g^^muF~zX=;2P zkA^C{QO5wcraa!l_&u&Q!~^F2FwT?$ahag10QEplvT@r-6`neu7-XEqKL180CM78; z{&24+jaH+~O>~+UR0Kwm)@!oD;#HbBxXI6!=RD2 zE5#JN8W&?7=o5ZP_c;;JDqN|b*MtQ&`=rf~gpUd&WxT!&;+#$JHo7|87%Ow`*JXOt z@bo`d`{aWcs8ZiZYYcGjcA~Tv1k%xwl>7mbjwvdz+8V`9Q7~;vy*m5^<$0D2f-&xI z8grIBFtIurw{E0U< zLt--S{BwsMjb%y(hD~Y0>irRgyUwJ83Kn&UGWXn#e|%Kv_xwFT4>(5s?avS5Pb3c- z))bw%5?W8~R}_lXB^pl``Dns_%-_fYVhS|yDYwTYA@f^Jh4B?c z&6UremuJ%OaV^M}E9bnKY@qf~p&zl$KXY6G^+5K$6Qxe`cXS4Mv)FoStY^ENt-}WC z$6@7JJd$w|Ohj$!GiCU~QuR6TtfXu*5V)(`-t7EAb+e5e03XgwAC$#0ICWuJeA+pu|}0*@msO z5gT|(AvuB`)=s`Uo*3yn$4A8-I?vvU*RpQ?27EJUA}kmvugWaN!$U=NrF#>AA&A9u z*UXXBYku@z!)RsW-uTd>X7)H^8Lw!PBFAPOHs1X21T#mLI)9I6#a4+sOUe#SB?8Y? z7`(OeQO1w2FOdQ1YHt4e{)qWp*Qr@X>)4(~{`mM_5o8SWZ>bl*>eypZdEk0VvMsad z+J>)0f%zx(*!kJuQ{OVfBTc~C8coZS1tvMF zeX`sllpw&~4W$RR(W~c8-gq zq-Jf%6VD+mWviFcIhGZtsjXiw{dayXTuWRMI;*lCU%uvscSnfuDnWg#!xSt+sr%>y zQEgah>0`ag?$a9~Ql!wQE=&ITqq3v?Aif~6E#~Em#O1B^Y>fte%3|&2AXS0H`ZKBW zODE8M!kFQ?`)q|fVARf&>S6B7m=+HC_9O=#_+iv@0G0ferltoqUW`IgAT#8-VpdQw2L(t;RiF|z3fNrctmw3IP(CP zQB~Y*H3e7f(oldF9wsJXgJ)OQS(rEbLk@A7>+u`e3pd_-ORSsobZ+nzy;SQxBsZEk zOdj3-?7WUls6CI-+5u~ zJf$zaVX- zDQm+Nes&_SWaaagz9@ReXP2|A*t4e}m>;njt7ik>y6bygWbVrfCf);U+?V8Cp1jDX!OLO*;O5tjPjTICQ0@q5afiyF8VKCAhy?f3B5FFV+O+k63! 
zGW78f(gKXBA9lMvMgS$&ziyYHi;P-Qm?T8}vHgM3vk`CE1t>^QR=>^gMu8rcu&jGf zDJoXc?m=0_$CD-YOU>wq&zWY8o7LmWpxUD)&te6km&ZmDTkKBKva%t9ITyWJLZ9DB zY@bk06UE-XG0!K-UV0tn>O2;M>~5z$v~(KvROPP>kYA>`reZ_UmKSUet;}t2$6WTx zDM|Butz(cXuXNJ-puWMX@Bdm6TLc>&xd&ve?@q`_nPe}FPsDQrAE?= zqBty?0W3G(=FV*dy?`YNB{luY#1u{TtAoC`%!#0#!#8FNKpn_})wlv?hsdk`Se-2U zc#%uhszU+;^SyVAZJ~gui1-{ZF!vY^$1xT`$Cy&k8wjvclxKCc8?_ab^L1X7h-oU8 zxgB-56ehHg=7gn6sLxFCIyOGbV4zREoUU*YTE41AId0({x0Q>vVb;sIfAogOD{eaR z(=(&P#>y6@(%=}D51=s46?7UI1{6>FK4Lz91o;Dp;F=WPG=7hJ%tV!4M(@+gqbDk3 zZ^Qr@(E}HV!OHb9&WmK8{Gf;r4W-&nH*>4D2Mu* zA`hrV2yA%>_ou3R#mYUEHkZ47soeg67#<8*aae)IE;?tDI4DYkcstQ&bV%K(C^MUsuF@vm4+#$RI0q zPJ0eN!CA7pwzr$~dAC_p8v=_T!(e1=QFhuc4WyS3bBTYq@(GJf%T>I?3fRNTHXsK0$=V9(5HPXjX`<~`FcT=*Pnx(Ig z*^U)k%=69Z2}G9Ie2`JqRqNlJEgmXxCh5Ggx_ih5Yo*p?SouU$Rg%;9c&y%t>3=1{ zP;~p1J=ZacRc5NG&mn*!mJfLifzD@V+_NyCRfiI?q}zSQzc zrx)Ax+h}fmomt>^T)a>6Vj5W)IcrDuZTN>Ri-JLrA=f$0E_UvenS%!M_)Kx zxMXRw@?{HRt6ny0=N$k>f%CICmDo4ZYki*->+S+VR#_po|BLwuthK|KN9f<)t!zEO z(qPGi&3wx;2v(mdT29JYj%FKsI%=b`x@mI34D3h3_OZ2TI_Nv|GFRNY@}<_9dmZWe z&p@SB2Q@B!&JUf9Ok0+OaKD-c6NS5s$?02_;`Q-J5}jT#zun3=tEfC;Ke(DAr$;CW zG{$Ydl#p=9;z5(YY@0o1(Nm-U2_s%gwtd}87c`!?tZ>3c=&Mof1CV8ZEtw&i*U_c3**ywrQzo3R_(M__N63OpB4W%U0AQ--3hwEnovG#uXgbzu1!EkPO^Z=aSG;%Ubh z4%GC}{#HDo-kh$%XBF7vE%7f89yW)!2l*&WPTpBQUgdaf(s(gnfr5@MnQC*-1lz3_ z&c&U3h&Y4Q^Xz~-OH>!oieHYgSq;#C!$JVOkyom)%4g9(-vlmxCosfJMY~qdF0hn| zR$g0QcPmv8mQDYPB^O5%IUhHj1o+p|3Z%P3^8V>JSh1F5PTtP89 zz3O(Hp%Gf_Ezg}p3Y2en>1k~F20at6%(159*zGWJVyhC zyuLx}cfA*XLFMxtL@G7WSFVS88;6G)#{y&Mb?w+golE{m(pqGoWkw~2*3so^aJCz*n`=LK_L@FWKW8-Lt1Z-7Kw6?iz-phCS!Co+Y~@GJ zjKEN#e6|J0pA+VFUEkj@b8_ODzDSes4)8g1vlAar>73Z?`fzlU6|R*e|I}G-@EvkU zS7#+ZD}Nh80b(l??7CdR1)b~hB(Ijvz533iGBF7grPmYXrth_PaaQH}Ol7|BgFxH- z$)7iT#Y466pxFY*&owMa-g54KdbT}NiQc2Ue&pVNG@@oPktd-O()GFh?vu(fsr%de zpHgIVif-191`Fk08PicJuPavCp6~E zpCIU;l@algulk#cDllD-<>+Rc-L+H@(h5hF(3g}Ao-|aV;j_$5P3-4t)N!T~9uePL zlIQE3WwA7ai))0}Ig-yaCZcR+5R!mfX&##Pr41@9?*!N!7cFDoj0pncT!=*S9STaz z>*aY5eT+dvh%+fzYjnZPR>7$|Pk$#GKE>c7q6?b7`u5#{LR}fX7yWfN-Zay5^3{&= zdH=7>1+ivVr03zAVyUH5yhFrTQ7_ef;f1JDtsDR|D7G~$X{EFQsX8et9tH-&tl-!$ zUz^cCdxSpou6bDuuW6Ga?-F{d{*Mb_hKf|A&KY1mQ-`Uv{NjUna(g>0fdp3wJjr_RMoG}WaERoE`-(p8oD_Z#5k;;J9)r!)n5p6R4-7h8ne{&ZSQ^v zb1t5$C(P9riDK5`Oetp`Tmcc?`t^4*d4`YgM)-)6AywKa3(%AvFCNgfNE`*~jdUH! 
z*s<6WrRl!)i^Cng>caYznmS4bkJ0xILsPv%5lya67HRpbGBsoCHx%%!KEgF-X{P#% zAhsh-rm$r0%q!xqFB#`$+h28)x!sA!ejARYS)BDB64{=4=1xF+*ZKlU^~Fp8O;bYK zYudECK}CC%j#6=jU*k9?XFWv~YK1Xl7!^p3pT~#=!l*~m6j6Hd2&d2jaJ@>wgM z;e?4`Rvokeb59Yzk8g z7||R{pU^2WU>sh=S11EMu5ZqQ7wN$BD{Fw&pDgI%mI!^@2S~o%%ZBja{}|Q|3?PlH zZA$~t^DtJyNN!JyX_LVNQYg@A0dVGdF#ie^0ejzqiq&tun8N~(y7@tj0uL0f_YH<% zYJ^z;{2VvYV1|{H56ulI1WC_7)E0cj6I+T< z$3!_n3%=zyD~*K>;A1|(UVvmkxR1s2c-r}qpd0t!y5WCg^NxW)J7?&DUUI09pb})O zdIU4he;X>C4+)zHcnUww&T^9V?@+Fpof8*9DFAeOAnp;kfu8@vLKu=z>;eCg0wjn8 zB8W9%gb0hv+6}JnMmZ%%%?!RJ^nR_a+yAc&|L1&3vFGp|!v}Xoupaw@|C6o1Nd8w= zBbb1!y@4RPaMT^}oYfO@)Q_TGX(CFG?J)kn4p1CP!HXZq-)hNC>WR*X^lCslA#jL) zbpCgx02=v7!zl$9LOrP{T3Oepu&Czr5HIeZ#S59aHhN+Rk~gV*SG_I!~h zmjf@#4h$f<*%N3@aC%5H7T$vyUZnSd3g#>7*8WePr}hvq41O#iY`EIfoCNi4EO?D8 zF6&)dZ+7sak2El+e`_2Lw#+po3?0m9CzuyhN}t4ov0g3pPuwhSAw;A2b(Q5v1$6F3 z`X7}>phAbJTKo0?9een8g5rlKywFBMi*oya7Uh2y<$ojvQkVZ59G_z>@1A(p1<1;& z?QFAooeADzlS72 z__}WqfZ=;=>|XrP?qtWHP^SUU{p`T@5VPiNsV{{HHjti?LE*{u%XPacXX|pWhjsTW zW%%l`_L?dG_326)rR&k=qbG(Rv^J#k_7m7lUzX~NIBfmG9nO)Lx0h9`bHBbkemm!V zcL&^@ICnUv=Nv*rL{u=A#;C@kUGG5OdHQ={A(C1KZ}V3b8c4p&+Uw9ej@MX|-FZkH z@b#;dWIX7eV05rLNT=Ia+r5x|ps%e>+!adXe|33o!Phi6v`=u*^^Tk7uc1>91nNDu zQCdr?wHs+@UqYwD4qTMHvluP^`)F%&yvdbIv)VEc*!u*UEA7rFMqT%t!Y zF7IRV+y3ql_h~fgO&l*bMgzB^qVb=tW;`rQR&9Ae#Hs&hs_yMtR{YMze*eO>^Q@zV z0KMbYdO=C$Tl_C9Z9JY=7kObD3^m}w_J`<#C*jE=7thZbKu{JnPFW>;^Fp%HS9Zn`0 z36qc|nS8_~kS0>X6!88g%8T&w1%8^=*93N7BsV#70ll9ROwNC%2w0m>(v2pXT#rXR z)A(JUg1c&bX;a@K2II5(1w0RI8!gt+LQATd^&kSX>HyM1q0mz^^A62%V{}Bc!Q=8w z^CT!@s==A{>afIfekoqB>w7F?RW3U$8*Ak1WFdU_>U=xHYOJ` z(^QK!h%Fzyh`|_a;^K4!zpKV%BYXSB2MuWzh0=7pCj7gnQXM?y~e9mml25 z&Rd?GsQ#koS)f^uN95@%rVOwXFRtl!wrXb6HLG5K`-3@wJv8I8LL%U|gK=)WJymNf zhGu>GWG*9Iu&v5s>>w2{tVpwJND#|)r9UlgGaMaEwaqlmJv;3%Gi!xy^_!d9O{xgc zA0htei<#^qD*dP42W|@ZWmCmEeEjE|6}2w=G*p@l3SH6^qqq zp@ybZ&jxHU^LlflLZQ-30f#{uC0yu=TfFfatlThtn@l{5T%@kweS%Wceo|(SXbV`J zL}B2q_%f<3^-UZF7KLtv+jqD-${8<1c+QTt3Not9CE^+!H~hdwMv}q9F^MRH3@Z6% zPGoL7-Y+8cn%(ODu!t(Ehm#5Ep6l<-eHm|d=gXE$TAowQd69zC4JulMUP&p}YS-Dl z-KrX^%w>A{$?JV^fN|jK;J2}$r7~U32`zJ3a$IN%uj7!bMEZ)fRuZRGzBKV zr}vf1FLoo8QS0X2C3{WSR<7ot)T`iqU%&A5G8m3tOedvhveZv z3HIO!0lP&Maikt~*dA4=+GvEfy#$l|$HyYU#(}?r23`+*^2YlFLjt^e#>+;EIEmIb zzbd6AXQ+oJYi*70dTLmJi^3|u#?rtPK2#-taiVtDLCQw?qo(Jsk`s!ibZQ{#F&;72 z={W<0nFId`)8Kq2xGN!&%w+Lye9Te6#Yb`>oUh^Bio~*RbcpJ1bcQbTZN9uwO!S+3 z3jqi5O^K?6oSo$p*Lr~q|L`Oi;MMjTonCLI+>OEM9P1=2U2N>2ru{~M^iAO#3zoPz zbwB{j6LmR&&M*wcXOW!ez0{p=>9A7qZ{{LHQ59~E!6p6^6OSt@*d_%8YWjBR*hW-J zJW8~dk6aIfC?kajFr^CB{dR;67}-(K#I*;r<M^WMK|$dib1|Sg3mHwFzN2-jTcs37HE8O*CUv* zJUN1+`pvZFi)8N48K=b$%&#!MOU;LGkcvD*96F_y{Ww=EA4&Qqz<6QlL04A~cH>*& zU+&t51kLR-#E<=3IscGNDNWZq?w?#!2fEx5Nc}+aFHDjGX83O2LY*bl#=GO|+6YF2)C9+o7jLk!@+9YwPUKUxFUt&Z zn3v+aOq(1M?WsQl0#KMUlZVZ0Nno>P0U=?e0>3E2N^5@A5f~Eu&I3mf@r~!ju7B8X zpZ5joPGkMpWg&EQ>2YS`Euk(Ef0FoJ>_m{wR{W}93_*?y61RW#3tcl2aasQ;n01`*g_3tYA4P7PH9Yl_;%@g?3u53g}R?7=--pE;@v8t>85SSLt$|N z>_XMJul_?cwBdw10Wk}L?Lu!wyw917T{Ep`n|Xf=8h^)eCllTL-5E*|7x_S;*2ci9 zhSreas{0qFaW|>yV+k<^p|0af>SclhhlfWu886US0c!qNn7UFg4Qjct_QeDQ%Petp zL|?n3uQQw!LB=gW{&_bO8Y*A9xQ5K+}fWWO=O?w(e8NI zgsI_&(v@m#(~gFaxhl%Eu&K9@Y+N`APLi9rd1N2Sq@cDb(aVc`xg<{B*XzHEb(`{! 
zr$jplN4^f;H6IKhUAS*LekX|>>&rt-6XhPKCA=|#@SvrWyYvG)TFXBY9OpTexg~cw zGlLiG16F`>UBj^nFb_}?|NWqhc$TMUw`i~w?W#+_mkBvzz7w?^E27uSeb089@?>W2 z_UBL!sWHg|lJf?g&T@sr$?+tzv0Vjlb;ge8^I|x^MvD9;be8g0%(zh1y@U-r%4wG= zg94RJGk?QGSAzYNCkF-w9>oUQ5@Q^N6PnNF$fu3cPtw8SXygd>WJz|6ggbsOdE4{y z*6K97X|yl}6wON@q_3rN;uwp)<2qtnp#C64co9@ZK25Tbz-}HD7Dt&M0;cG5Is8_{ zE8V{I0oX=roXFr3isyd{&c_@{Jwebu*Z<_-QGW@x?QT#GNK30n-15xT8lb27_zGt} z$m+GYzYf~_1pRcYoJBS>tued5FvH)@h66}gD9VTzpGk95Gp3y=T1^L7QWyUwA^)~G z{47U>L5_4HCLl~K!Dc1n)&I##r^ok3$E$k*&lfpa7#muk9shTq-i0|$`rb+tX?Pp8 zag%Z=g_3@Z?|eBUg}pH5&<}6f;`CNBBM82)c4Wr?3u)X`Xd%jhj%(Byn?gVN9M`FY zt}iMCufCp|*hGA>4T2iRFgJ?Nu`zq)P-JU`le| z`Cd2wky`nbGpsvk1Z*!CtY?dZZ2vr7~a74F7NN1h#UrJzN_Z3^{>-LW$J#!uAwo=&TsugSJt>h|X_%&W0O}=9iuYH34;c*|8 z$4ZLx!3roqz)Y^vX?$zDHdR+~b$_f8IMRvV#`HXk`rmr2=jAMdIA+8iSqtsR^zryb z@8L~cLB$M8v@)T7G_9&&cl>!f?7R_?D{aIUYr(Ckd*|CNG7DXyS`Ik!Cwy41Kup~uYib1 z^y+klEJgPU%vd^4F)PzztaxwAzJG1^eG&3G+at@X<##tnI11#4?S`PjQ+-CnFIQqD zhIRU1d^E>r>K$d8udiI0->~X7(t(&c0@z@GuReN{M5jTugc9uV8S*W`>-~PALvH5V;mpR6kGi(UimO5Q{64E;pn9k(NMjj&6@gU*S>z7;|PZa}p`zI<)9nUz1A}Hy<(xzhuOvT9H0`RHVwlM@aVUr(ejk-eWXAjnf9eJxPfLrl# z*(>IcbTrLc2}KY(nc$`rEvrKMhcHp0qj95h_j4zDz2-(;G{phH1_rL61N(#u)55z< zM?S7`-*}uaqu^NH=k&CPN^roBG(bCe%w6m`u}X2h^{+p_30MjjJd=|2jARM9_M1Bg zu82_pFzFs=YqV>#EiJMsyegnD25+Ou2ZbfQ(obIk=hwJWNbY0B+EBk3v+*kPVVW<_ z&&1Mf7rM;L?ukrQn94Kv4Qh9mZ=@O<-QSFVmdF8j(hvzS4=^d8bzt<2w^)k%P&i?t zz5)IoxQ=1{C;eqP!h>0(*_s0NF0@-Xgly^a>a?4=idjKLsjTvN*qi;IXhX*R{{kqD zU~e{=d2|5;M-D`68d&Wp7*;UOB6!|6aGD)resYv=7VBMIoE|<=3Sc67Aj&SE%3n5P z_2-Cb0fwTq>He&PS0!J$o(}H_-6;HV`rJ&AaU+IsK2N>}JoV)e^GsEl= z;ZUg#Wlx&>QNjm*e?C|b2<`C2V{OCsfDU!YhhiK@O&(a!Gh=JwoI{W-SvRUvEmZ&9 z?YZ;i?%|BhFFoBjHwr%5*@r&v;&AWJXZlT;DuRfGa8;#@7Z`FCy;RE1@rT26o1 z5N~}5U>(ZRIaYIJLk*h|HFzvOth3dTy7)6sX~Q!jEX;jid^jW{%dL*adXsa5Q1HD; z2!4U?%X`%LgCh#KX*>S)X{~nZSUVBG+@gz={R0PqzhF3lX4VK8=X0b8S1(sm zgb|mq=Ge?wJp2CA9B=9^S=A&FJ*ZpB);%D!p%pU8Z!pW?m%RaJ@GXP`SU5C7y-kIF zx&X+c4(a56;@?VG2+9YRzv;4PB+^R{SpIrF2=}moyaO%q&3|T97a$Txge0s0NSi}< zoViV<&jQ}EytlG>GoimFj&=ZSPymL{3f0pNeBnIl1{6B~2|4Y5%MYewX&i$PYVu+MN?t%FB&aCIPyCIy- z069-PPk@plyS-!sDJ={sB#@|~0p9Kds%Ue@zeoeniBzb-O{V(tLmHq>Qmu&)aJ+`E z3GXKWeNP|WFVbBY4gN7ZFh<#*S33TxL19EZK>4t=y*tPce<8R$4YY&2Y0MAT+1`LC z!xPaYISCnJ@<{M-Y}APO|KJUb11A+U1iOj;yj;P<{}7bj{WmsY8>m0Six&G4=yjh! 
zz`XzLAM6YW9plg+`k+5}lY@$8;*P$Rp&NS-hWZ=0CO?SaRZ>0xEVskpOS`FHhQ|3x z{f>8TutK@jF&>ZI?^0r3C*g;rldSkZP1Q>DVmG+nd<{(+b18rf2FMAj3mFCfv{a*C z9qmPq_98%1%y~|kr<4 z5v#+%b?p4GjVn#c0#25AopUFt;^aDN38{Nk(%Z$E($UXEMMbG3f^echHtW)w6B-~2 zC*uFX-dhJ%*@b(<1}dmXH%Rve6r{UL>23+>?iK~ylF|)Q(y{3VrP*|YgaVrm>4tZ0 zpXZ$U&NttC=9_uv{qH#A3=FgHec$U|YhBm6u3s#+@~U`Y>^`@FgN(in_a5koM1gc# z@z37RT6;J+!XkFl8aFE&oi-O^R4ObiE&nK)6lmv0tY9!3g@mX6MJCJq61w~Ivl(;= zk0ib)Sy%&JpYI$ksOyF%8z%tI%Ktfb&mn<3jR7ddqQ7?M8=XMMrr!53B9+%$ zN~|(aZl~}#M>$+NM-9e4`6625@34HqDi7CMa5hQ^OBT{|+YS8!P*O6pE|#=Nw)m<7 z>t8GhHWHQfwrM&AB|4-H0Z%~mA($^Rq9x%y&SVB(uqEYaCyrA8NhU?GP(5Izo)>{n z)6Pn4&a6vTdSj$3Oc8$M3xCl|$SW=UA>xW>z$>HC^**;=k}VXE4puD*-S2Ma-QOK0_XV z$E975Qt_M?$A<{rjMc+K>Eehl~zG?+iZ`e|oSB>`eKZY#P`L;(&#wDRA|u0*NI zZJR~85RG+NPVMN{n*B%^%T@l_Fl%0iF|zq-D%uI7Ud^Y4nSFJv*XjHRUctA$he%o+HiF?(v0yNjSzX?Db8)F>_Qc22!MEzWt9W+3EyP0|}5)EXo3ush=K zOACIF0JK3tUBnKXVbt)jO*A9CxmAxfy_M$HhM%JQG$ZG2(LV}>=UeMS&Gbc^~DONsWJ-FS?ZHh!v?qs;9~f99#*ki>H# zS`NR|G6`Qf%Uk2Rz}0@LMDrcCl$=oP3ACg&<8fME_uXQk5fl{5{F^8nh3x~ue=OfB zEh&lmOA6Z?-y;URO|P>S5e0ea(QdXsbeIdxw15vsbFWnKeZPCiu>x%IW3|uF26R!mT0K(gG%n% zUa7rX?9-%Ce$vXDseyZrw%ow^uY)Lw(>T*WQx{KpFr8oRw)UeUM`zxJpip`ZXe0Oe z*Y5SLEBe0+oKdyd{tNNk$1JbX+PzsFbGqDp13r|aI#2^yq^vESkN@2+)gz7}DAE0p zFaTMSJ7hch5^E`*qms=Y&E=bXNB8ak?cGZ(R4!2Do}D{)?wf#5xde~uAFGtfB#H4g z6Le$TwD^Snk~Hj7qpUdHyLOU1W-hG&`eFrzLh=RncXEiun|6NsDU}*FQhJ64${tI_ z(K^?7OVTF{>178hYJhrC4Mj3kW}_?DU#><^8xenDv$uGu7UINLpq7^!nD4SMzw@ee!@5sM!X(Ab zhGByvcg%XW=51c8u(PNW0?vxCj{S^f8@b2Yk~vv9u1QUIJKK+3(YHzU0d*- zg&7i^QdUG7`8P{jGd0RDlHA^^mLK8{mR$d~G$nEavYax~Hc#?xKE6}r!gjZIUh|&n zVITdaIiOn>=stoA?Izz4hx)q&(V@~wl7Sl`EO4wJu7JF@%rvJ^06^u-9-y^ z08YRPCoigz8>tA&^~{rc_8m1-ufYn9R>ryKB%J^Fcsn`f@~7XBkT{%rZT*bGe(T0lU*#UcuJKj!^UmZ#q*Y^5L3K;AytUnF-_~J z&->hrcm3!NipTAg_0a~hD$L=+6;``mj9OH!>C1o8O>|y}IS{C`q0DhM(Ij46lQ!c^ z_^KAFEwfltqhKjfyL#k=pb{|a%IvXl5eqSE!xGi3)nuvs7Fo8`PBaVpFk}jtI*&t9 zrh@k<6B}DbM)vV`+xuMM)`q{#B>3?xCHY;|a*Z(3il*+|-7e7Tx>-_po+`NR)2=G- zzA^m!s?1?N`&VX3xhF(|1V-!70Ru zJTq#FRa<(7^+x;!FTqYtD4w?DZf@j0!t8pb8w(M=yuIV(6*DPVp?B^wV(YFC&Jy2Olf zgZ{*;oV1`sca!{p4qOxbUgN5=bS*f-VmQNJDLvGt5y|Kx8G z_IshS$0Aw1!w|R7q?5ZK)<&^2W5KU37=G1xV3ql~$EmBBCM! 
zJ`_PNP|&4EQKGJfJ(WqxdA_)8el)jB1ur!|f4#z+bTtoYuF*Sp=2;cYQmGxfCZTP9 z8>_O;P65#rg?-EXc!H0L5Ioj~p z?YGK?tfP;Z9wGpKiHmM4`)N z{E$HL)cV|)CBd6e%VaDAHIFfe;_>hIKdn?hx0hemdDBGnVd|e09;ZB3tHo=lFi?0= z0`IVoI*EP$=Vi6;wH;$}QC--TWvpSUE6`{;KN?ds37fC7R4q{7@=Hz{dAm7Y&|nnj z43A?@XgV5|_k*#uh_(ViRlV17;jRkISkm)I`84Vy4L5r_TBGo3F)2^AX5Zv%t-CnP zB>^=+U=Eo;aWOsari0WJ&jZV6I~B&!64%GCB$!Xi`SP%`N7*|{DC6z*8i`7%I8Gi@ zM^6_XQZvz>7cclfg(dLW7Fz5MMYB{Y6a?4xM{zg4vU6{JT)EX0tR>Yk#dFb{26TZ2 z+_duwX-JGGc`Izf5)_*;Uz`w-bkV9JzF8MlS-G#BP}zE3cIl5=E7e+D+GAw~wRyj8 zc5uLhyWci4i!VH zlXF)L(>gJ=#f0yl-!wF^QN%}$`XNTudEGag1SZOSs}kt;uWv!bmzf%hqEFGQ_fV(v z6%DK!ke(*;1Xj>$GyQ zUqCBK>QHfisncvIT~HRm(JkPS>98@{ENBlf>rntEM^lgfISH_&IuBGb2y(U8f@-AdEDBfzm!U@EFOR#ghR+R*A|3M7sj4_ zhno1*Fl2^|Z^tm<5oFDRxN^_g+(Scvx=?6pmM*Kg9odGR427qk<1^ES>dh zLrB($KJI-Mr_W6(jPPAM@dsVN^kQ^lUZKSd9v8V#L^`=kl)Q3++2|C9nnayQ)zc4W zgxzi~xF_8v2Bs)qhY3RZGvCh3z}@|lmInPDef>|4cnkCj3y=10zSWBV`D^Hks&g+M z3m%oHDAP!wD+l;BH0p(y_FMFRFU4!@!=dWK7eA~WKO=n356~iv9PB{fFvezP7mw80 z4yf#VE}iW!tCWgqdv4;n4$^_M$$LhWZHoB2c8KljhC191!h7w zEEU^K@sviGlpvlKV=L0Td}rEL405RVy#5xZ88unO#6ol#kL5Vu;Fp))EmLq3q$O1J z*yFmGYTV;4@JN#+8KpaV(uKN;pO1exvGSB5ec2b2&7i=;^Wj0`^I+opmpvm-7c)cZoK1#fEuox{~qP2?TFe9&yoX`?o@P_?b-DG_ccmd4m@a%lHGX)%0uTzz8D1pg}J4W!-f z3=gkATyXE~X@5cimC+E!dNH#X{kM4l@}a7c?l~e~mDd`a)~0>uV*s3$kb-SSM6j#f zsWNyb&LbM@8gKjI@ZeUFy}A136PbthX}AemT^^Xuo(FTcp$O0xjRV-v&cqPo8U70f z(-ZfR3k zye=V2tm%a(kfP-e<>o}4^-43u_nx3nG&=fVOT>ZIaOCV0xn#mK``Co^r~eNams+mf z*ON)HGp|w1AVUGOpt&YT_{e;H5AL*4S@phP#HSDeAC*14se*;V?+-~21JzvD)LcwN zpnV`L^JQJ1&E*H9XGXpwy$&?ifi`5lpD8~bPd7XMaPKl1W0$Ykr+S=H%|w?(OBDJ| zJ+6?lf4&3+m2l{3D7o82=kMLQmTW@{q=5IY+qF&GLa(#tr-84vIl5jj8jsm&^P1a+ z?ZIja#8=mw5i-2e4;-5Ur^`axGAUBoX3r~?RbD8*JyZPCd$M5gnU-^h31br(909W?T7|#V59z7!+ z&UyiXyq1fH6hti7-NqBL-77V0-6iCpp^+fqa<(h=#q{hZgyp{dI8mzKNW;#qBE8|f z+LRziUv3*P}v9P!QdYi%9`Nuy)Uth#@7o_QsKFl^ir{1TitIFBTT9#Irw9aoM(^BEvX-n zv$RxuZeG@{{_xw*XiZ{NEUb*uPogyq=i#%i3i+U^QE)cxxoj8!Ha*g99^i{} zL9abx2;fku0P-S)GQLQyfXcr4INpT3@-2NxbR6Id2D5BUl_lh5`00dtu00G8boeFW z2^#j-ewV6{OR|^r6agM+({CX3lkHSX~<&KkDb zI_J2PDBlM5hc(x|P@|ien51V%)vN>anGVMW+=H zd?tWUk8He4=E3}ByD#TIw$(u25y&I%&DM$=TpyV+svH?);!een8`ql;r5`Lu3R`hJ z_xU4weto)l>`%RECE-IE3v>4zNcI*MNbO`_SfI+XM%}I3{R&{`@G8r(l@nXMI8^j- zI)|YQVK(iGIe*7tn29t1jD(ovw|~Nr2LOvKfJ1ZNY;fJmc{gxr`H8_`vf2Ae`g7Ce zZX;jx?f?dn`w8KqD#TkZcpDfPApZJsB0Vss$XOZ+}(4x(1<#1Zjw8A8oA3&kNh zYK^rCkGmxRc$rfJ7Iyg2dc4lxjpiadnmfXSy`$RErqN$~yZA_JAAc0LW6)LVci*l6tl?d}5AA8906y_oBzaV?bVl-t)6 zKmk}XemVNR`QN7(yLf>juS4rKeI+lF7gPiaRr88FhSn0A=(QjA_NgdflFduTt|$H4 z1v+8$Rh;soBS4vpf|89DUy2E^Mz95D*)#MR7!81<8qatOiZb6Z7Bez zU_?~L0@No3wNw~X9Ri>whAkJH%-uRl$|P~sBbWu(>)DZ{i%5(FF{c$x>Aj!l(@ax) z6f53G>zUmkPsFTV^gT^89ZJ`aXVqsKNa2p`8c!?(aL9uJ?nR!KcHXhGrl zd~UIV<#`gW&`_?aHvsZ{h9a2ZDZq+HfncW3=d%8F?Ar*YukXo>xd|Q|`kdMcAmnVf zMWujdAr#ejpvLwmseGESe@m6kj7n>-luk9)Kv&DV9OG;TLaO<_=F{|LVrt?Yrg%s6 zLigoGDI~T60bZ8d)j5EpBxo9VN=o-&F-nWys^%%k&Q_g9*5~Ut)==2~w9ry4)pI0J zlI?#Z2TLXN0-GteA}DI_KjpoLhJqb1-+0oKuSh3%Q+_YNl*{(7)RHP{B2zs502%WW z&7qiWUK7Rd53x8p*s}`am>2nDMxn)G5V3qQJ%c(2(anh>0)VvO1R&7Sk!gVcwpO-+ z8fjvX7!zeQYRoQ_zx^n1G%P+6SYjPJ#aj`kW>0F1n4_4(!s8}BF~K0=f=EfsY8#Q9 z@JAl5W%%=XoL#WyYMpQ@16DK|k~{mtVEfIz0EJ)ut>(Fs%?=y53%tV#VwcD^N!uRE zYdrY;qK};2LNiuLV%$~;IF;Eq?FTP=(4+RY-dxB(_xQE2kN^AkUyw74mYqsMvPA(Y22w~v7p7V}c@@G9|+FUtF2KAdu zuqlQlxZBhir$MR=r}J-RX}6Kjf43-_lSk_KPx#XY}c?1%X7Zf6<6I zvdhL}RFNPj$1`k)sXur^*Q8TG@6Os;K_3f1FIUn4!r*Vi%Z{cMb*IJ;a7Xl4pBX~5 z**gU)#zMHp=#Z_O7oqPhK`D4qvkxPtB?`ztF6%b}iFi(y*%zbGiY~f%$Ce7`Om@_f z`9^pCtZ`85oV<)*#)cdEo#!s7+Pq3_wMpH2Iudpmot9!9PzGST>%OB>?b<__r&8ik z?5X=22;nusERG?K!ibECSei^b!1RCly- 
zRXAQU*`~3zWc;^>H-+*V2afYDGG5YpP5y**W?4T5d4o*WgqEbVcqv*7mFAJL5YJYv zBxkEvC=4OKyyWDZfYD+(Q5a`(%_Cf|lqm)~5ZUBbXMrQkJW(mPo+!*c z)h0pj#+XtR`423BKBaNE%sz;pWEuQN#x*KNRz0v#2K0o`29Np$J!vx>%(@m3zCF^&e5rCxi zq|lh3dyT;cfXd7nmtVhI3$9v2c0E9Q%{cX~fcJ@wC4e|p-3%@h8+K^Gh!;EE2H9=8 zgv{2gCfB$0Nc+{tm%X9u0B8xHt9QgB4)#?xohpkeRBI#{Yz3dC zR1hwcAcbe6m<+p}sb_oaK(xjH)`qJ$Kykhd4toRX=3?vM>Tm|#RwoCrlk$Bk>_5qEvq34QLyMO=1|Dx1HEPO*9KW9a+O1$)`dXaa zXaF>_=&E(qPQ82#0IAl<_g+tQqwlpa(XHgsj{anADn1e;DKd&&U&1G19KBJ@jbN| z02O!Hozuo=)g!dkm&v>VJHoMN-ypiO`I_K&m}C{$-07t0?Twl zSawm^!|(jgtFIXW1S)I`Mo#>}6PztywzP-dEq7SU=vXMlg@2~)*^;Gri&CoB#FYK$ z7E&*g7hfh3jnzo9JIYmPs%SJ z$d=e~?EHn}bD$>D#izJhQv^LGaKUX2vEIMIhu?g7b~nVu991VZVLLgUY_Vxj9TuhG zdv)dr;JEueYPhQa3Y#F}k^sP=vm>f7^!GP_IyE|_*W{rr7t4@&s()!6Sy6-z2-Wp* zV2b3SXbkDH@pJ7(@z?IjfUbhm)~c;+yg*W{XL-&m3w?N=vm?@tato`@bt|5f&qE78 z+`v+B*#Zk06SbgVme;?{BUUMNDX{1!?Chdd>a3_0>e^RiC^xGDv^#~LPnB1$ zKxp3`3%!amfSI5yfBj&Eka?>xtB;6Ky(Y7}o(3DL@M|!QzruQrRr{so{So9@sTSM9X6ij5-BV#s#7 z-gK3XVGYM+O)?~|Ei-qmv;}rB2is}su9SEDhwK``1=J#T0maB z`KxhtV|F#fxMk`cSbO0S1koZ{XhwTZ9ea(rE)&>uMX46aq(JF86g2pa&O;}`oK8f# zp_A^ud?)CC(^Q%h-Zu@{X?9OCv^%yXQbb{?7J=XcO>{;Q2}J5A#FT6_0q+4)fV8|c z`!`6b00zRC7wxTcVIs)PJT=ck+k_wO^$0(*yR`O*ATdRG&VWsYfvbRJTD0o4gVX>S zTtLi%P1Oh3pw4L0@J48Y&o%$(3KgKZ1Jlb=aoJ4;5lOOVcEQvX!Q>rZg;7>qCD`^ZS44EumVSmJRaUeN#vJxV7&tB7;U5A z%LH-Z!5*)gh_J9I&H7?T{3QZ;!2$pFbi6>-Y@x}M-Ex$s-en{D1_DIt4>s~LbY`w=UZj1OU>utQ!JoX z;Bb`0jsQ2;r-t3;m=^$d{l$I3Q<}o&9}AY)Wg>7>o@{3!U=f$eIFT|Vr)o5} z1z=*oR|hkbdB_7>JEH81#<$l=q(69sq(O|6N#?RGY)lpO){wSC7bW}Y3fdO20PoSw z9KT%g0J=+MYPOoUylttO60rAUPRnvmqT0MLu{sm5DW8}FU&Rz8eEVC#e#OvYXPbC&!oWF|y&HO#9Xd|G4DGKZd z>cXzT5mpB9cC+URWCEmSPmnoI2E7!7+6knK*4lLrnI=t0G7ge4(Vc-TdSqs1K3~kx zt+k5-x8yyys4ordVFJ&ZYr)qx({^v^AyH}R<<#KO`Mu(-LCN(G5-W_;aR91Hv>;o4 zX2@kZDiMrM@~z0EQOXO4IzGey=G=mD9GoZO7*)S@@Ep$6Cv&QGBV+pFa12G;QG zaXl^NQ5| z!OK0k6l_~QnG}xH3fp^RhzyZGj4=@gh-743e?!60$v%k1s#%))F&C>j9Ao$U$43ty)P)7BS(!Xt z%T55qaXKEqUJC$wivq!QOPEiyT}pk!ZD&>zr=KVLyqnyAT)iTkBQf0i%d>e3+rV%?<4G7X0-+v=2O2;WuZ)9j7Ktf6 z4_7(wV&oms2LSq(eEnlHM1kh)xifJD8Pcin1FIlq_Z~44CvXP^ z!WK^uX28_mLjJr&%;kffiU?g}>;1REj>!DJ_=ww)?_}qiG(1& zx;B>(D8Ov2Po0o9HbPY3M#5Zoe|l*}>^gNo~7OG~YCN=s^(WK$ctF{*8{cQcrF(B%6Qo&jC zkp;94UPcDzXJ7*np~rk-Jaeb{FtWMgCSfsGeySAv#-c8kMOWj;j~_Ok&z!$5>hVY; z6B8;bDpO=y@@|{3VC;aBQMvc(Fz0?6%Z@uJi_Hj9pZo~}8@+kzG;=waw2$IjEm%O} zz>9KE0dgNF#E%Th1BrTc6X%>a1P3N01DW00^Vp~Z@#+U z2JYk%4e-i%O>J^fR6%8!j|O-^F15V*Rfhox!PTdPf^yrRXb};h0XNl$G?4rNeV=Ds z@BH%}w&#^rv1tn#udG}gO9q9N;WIDA(HvRo=8M+cvPQq1lr$X?_95BHcr7Fb8T?XaL|dcy#e$PF9r$5<3GTFSD!DvpYX) zb9bn#k2P6-3a}0}kF$xs*5en4aC_t=y|O(|l(3o4>U6g{{y>>IFtZC!VC@@e|Ac+z z)E>jj7%!OFA@r|uqm(yzwtl4N*F)Dq;L-HU{a$*AnYs9L&-~P|X1&a@*;#a_U9B@?}jdGtR4{Sd-S8Wj!~|jb>{nB?n)zr} z%ZNETbi1d{TS)a2+Qj8s6%NsTWmsW~K5_4=O6R9JQ*!u;>D2Sl-y3!>x96JPmPB~C z`~LQ4e>d&U?)m+3+F*D!x=&@eUpXq!ls4Do zDSmOfO~s;9n~)UKy(KI>Qs4N>yUModlp2)w$u$*>?jb|)S7$k74g#vpqz>+!`eNC;F>y`VK*BNKUB3&j8Tf^rEb_cf((?FLX$t^wRyYzt%g50d*dg zqD&bp*~;2EyqafcxT&Se?lx!K@(lVqg-54UuO4D}&7<%o7#%CjWj9^1I?k9@PBxqm z-G#jWrPZbY=MAH$INF%bG06sWx1SMbr4)PJ{@ImUux*n=dCO4 ziL=GqnNnG@^VgqmL*!D^l}&`sEle1d8;S*{8e6aGEt~t3PAdELDl{fb^_a+8FO%^L z!c%IaEpfW?#lm_`+sqevJDl33b=F^PzZ@gzDM_i#kAbO*Tex~)nuvXs)^Hr_ z6`s7#o9k;VgktkU;g~bpJQY>oe%=nCaeL>6hCR?X3U)cwqBM z+H5J6mfh@1hSzhc&gW_%@`7JNfnYG5H3NgJr1Ky;$hn$=-)*J5YUeCW;P#*hRys9Z zE3m%x)=5S((B1AGJhDx07o|wLZ3D_g{qUmgPM61rSYYn67nYrYAET=t0hgW|m zM>O}@pZbbAt@N|_F8C}0L_zaB1Jz`%zOyGHEplZ$$-i35pl(%BHi`ss#wL-L@HsoUw?(k=_{QG6FT zl;*>@AYXzbt2M>e7KBJeRa6=c2!@Indx{+T?)j}4fw!0LL=ls4SQ=C4>Okyx!5EK3 zITq!O>$S$Y9NyW7^=q#V6zy88dY|m9L8ugUA(8IKGYM1tjXCH=jn$Z_)b*e02x3F^ 
zThsnnZD~>evjdYPeqX0BN~zb$EQ@}8N}*;F8I*;eB2vCx&euCjCr=cvW5Nd?lvCB* z&|atSLFBpYA~hjw1}~&ruGaO@;3G$|Cdd}=q^W}X z-C2eRIP49X1Q!YRdaulXH?%at*u9lb-y4^bmW!$h%hHs3%L+B$Mg+ZgVFfBQXAMkC zu1p7Y_~Dl7E@Rb{I!(s=VzjGnyi}<&=2Ggm66xNpClTY8f6Bwigd}o#%f^C!48OR| zynie6S;Y)7VK8I`3uBry^c4_!$EI60lrHSPS|u=2ktq)0*K9vggPd;7XiN4!+1Q-0 z!2D(3a@Dl(GyI1hCTO6r#p}KGM2Rw!a8C zka;;>Fjh5kvK=-YWnG1_u%2zGg3Dr{BKaqe(V(@1G5+n7tnOd>Ia-ds*V6W<36JFB zM;N!y?5>Ips}rTOwN7p;dx(mW4))1@nd)G3okUJH7XM*lFxkle>J@qH+=^)u9UUu? z;XM~TBB50{g}1S_B&8YPaobH|`q8L|L ziDqmXCLF<}p}F$a<^pkVv^>h#phZQUjK#2eTho?p@;9+|i_>`p z{x!Kj$Ll}I`%q2ZLo;p0ba zY9T%WVv@j4_{hgGu?-NLNY8{;9)s-g3YnqahVcSM@_W#`Wsb+9m3+&9Dw0D6biZ_j zuAHE9NvwID2aEnTqu$XYN=i=_&EV-A!CGCXnAPGfN@8Bqge;_KDN#)5Fl&jzeL<>* z1XjH-8FWLAF61{mYYibB#2uW4rrE>shrIA|f{md^Cp#0JFr^Nw?S8{5!U%bYz0PTh z&=OOVZ?}czaQY3X#3(_Oq}0WHpuGE_5UI#55;VI!(At7{mTFaF#qEVR}vmD6gu@=|<3+~EXZFzQ&P1S=n~KkzW~ z77&N%r3rg$9IOnHXMf-rBwomp@G0>MUlERD4d=*FRXeEB6*gY_p%JHm2~G?lJzY!4xA)^L-1MwqU(zAd4e- zio@WXb2kI~Pn=f70E0Nh-5VCEXsfVo1XU59dZ{Pf3Xd(L2_@){7n3Caa$}G2dV3gV zU=Ky`!nxNuzNMv5#yc9geB}ln)29?Wk>SXy#1n8qY=?hOJ;>r8P<_FL^SCj_h5T4L zqb%Y~9cX&H!^G=&1XYN%8?Va!nO{+BEs}iLl%$$l!WT*$Wv~|PTnHn-jKcrD@XS7G z3&=??KPFxh2gg34z}`$SWo?j(*fphyA2JkmXlJUBN8B^NbOnJC)42el#%mbqQ-f7v zkT+u^r6gb|o`PUz(}F{aFff9&Ew_rPrkDhMuo?+Z9Z+~FI@ zq603{<1b$^D}&=9PEaHPpl1FjBTe?+cO-#tT41;J7<54+S){punRSKdHZQRI6!`@v z$um;J5n#?7fC7G2xC3zuPXHrhkNN-AvH%<&BQ)@2@eUtZ!oc(jBB@6u0KYfUg4+Q$ z)F72S4Vwm4aSEh!>!Yt@5G!`jMj{ih#0Je3Uga|nA0chqS+V=5~4#_fiz|Sac##0{PvWE|0|L?+_Ar8U=Op|!z z(F2Wp`Ufh~yErP;;qS2hRji)-pJDqy88-ZFCY%4j0{qX+`+t;#|L@GaJI3c3_iI1e z&ghk8pb99_NIYjq{h^rctc5M(g{&tHrE-8Y3ouE*3OOWV0iANNFJhhD3`C83JGf_r zcY|UB%Ds=sj_TQG4P-|Zxd-X}t1SK+G)rxm{1g=KeFH^8E_6Y9Z~;_gH}dWZB>Aty zdJD3rXPreF`9D|rpK(FXmu=xI_uVV{-ssOXis^IyG5Jxqs)sVz{mTPXr3MXjy{a?` zuQpV?o$Fc3>b*}(CDQ5{v@6Xg{Wyx0At{pShsfy)I!OqT1N?sgcI3z|IO%%J&Lgvt zA$Tr)OHrl^LNsT<0|aBI~jM7x9;zJ_o|v$rV0t z!)>6A-od9={)Modp~S(Rpk06;EX@{5r@;_5FqG3^T6GgB0+MArH_pczGuKnXqXmlm zWTKoMPU&LxxjTm}*(_`0urcmR<=v|&7i4qD#YY~F1!C6_VuEx?!++4=^hV_?$q8F z>Bd(hSS|iFWM6Da-1^Qr2xv50wpCiBh#2!#3PzBuwg>vlZmzUxa(T^u$H6vAQL83a z=>QlnkK|e&V}3dhS;c~yNhpuyC{jNf7qIB;!;2!0R0B+~(mpnv1T?Bbs7O2j^F(~3 zswQ-%KwMr_lSJ3bB&yk~X|tQdbINY_gpaQaOPn00u&&p{XrC}NQXQuoDUOxf=Q~c_ z95j;7`AY0ED5o?Uu@+2JC7WV2(yx56dj2nG*D>LDWm8alj?Bgf*zArM|L%N)A-X7VRi4`IYwuR+D6<^v?yjNvS$qYp!bC*E z82un)@Tck_=5Yu$f?TC0f4$b)^!&$6Xob~+7t1z@fChZ$<`DYklp2oN!4W(#CT}yA z9kkLDBS`!W!Qmg{ADI>v!B)FJGL~BY7G#NCq0J!Y{15{|gu_Kz(ccL2rVX@gKUnqC z$K=+8*l851Dn#t^vpP$$jBatA3!y#72_ z0z{@)!*#v(ihy3LvG?Yw^HC!7;+vYWBYYN>_W2Nx$|vbUr~|AbC~K#>%bb=OKgKGY zsne=cNy^1YyJdyD7wyQlm= z9wVF@xu_e^zdt1h6mO4L`@4|%-(%WV8IE4Z#xz$m?=BvEZhQwdYx2?NTjs8JJ5R>$ zW0?pIo_oo9t$S`a@FD#9UVuBdyI3pezB0T#0qrYPDG*`5ue+B1)wsCHK(2v!9!a$J zqoh%}qQN&VxE?7XRN{JKZGm|GjKb)tU#S}K04tWA&z4Ym@=xg-PBJx?bUxEkrlVNB z+15W6uV7(Niol)h!ZIvRDG9IY5HX}iZ-?J%T=yPcs0!BjZ}PC1XgN~qip_?#la(lc zvxX1}K7zUi&~gg9vd9vcE{~4T62B7HGI>^Yn3rw3q{$(G4-fvuZE(lvsmU`5EiA`K zMd0zMEEqfw6MXF~#ikP17Q?HRg|W-oq}7v59a@}NE%n5*$wlNbm%*K{4zs8T48;!i zE@#AA@2rtHzIxC_uaGTmP~j*#`4DM~NR}E(JrGtPjWR<$_cZZ~8`jfyofO@yN>@MR z>Zss|390UrON+?Td-Wjil@eSWAhpamnxor^pLK0}s+zLPWGM#MianJV3uD^aNj zgxMeT#-3;I9d*V@Hg>`4iYwi}Cz5AebY!mfomH>!>$sK18$Tca;wj^qnCjnt(ihyT zT*Oy(jvE=HhaU%-Vt$pg{xSQ@pY_fS>{bLvSb(=}(&ad}{r-@*Fl&k>!&9qlq@J?V z`-))f6?Id-%%&7@TJGQlpgs`_r1FkL*~mNOzWxX6Zeyrfb{~7zN9a=K$+ywcUfCMs>3)(+!sVaHs5O0&@%4tSw9mr|KRyBetIF!ngEp6;MG3Uw@ z(#V`VPxy$6{N5gHcfbUSXuO?|qqL8V2G)X~=|TBz>{APe&YmY=t@=q!Kp`J#eFCmK zU3Dji*&r07&V^Nn9UxLpJ|uB4#To4luX(%w?p%E5{@(t=m=*b{GE00^R=6)>O(88I z@1i>Lj(>a9&&Aq2ct?| 
zlpm%efR@VE^|#2P?Dt_ZN0hWxtC70f)mmI3#=O!IZ*N!QbYtHS(OIfzzL4ln8;cFc zwNgBV*lRcHPqNPCM7;1)b6OiJNZ3dY4sV_u#M%dBZ;S71Kms5U3gOSITIMhZX5Hpt zsOM)hydgvDbyf)0Gy?1eo-R_ zqOHn!QYXzT{D(gD1<4ZTBJUn=@UUPelL0jukA0LtLmfd@#Dp5>=4LpvG!DGZZO&Xy ztWWd8g{_0t3qA zYYcaSg|oE?1|NgXqiBARakw`&rg2|TNprTpKMa-nmt#qkD$rRqh{#cnZ4S%9qbI_oO z*+6{Thz&C88UEpojBS^}^2O#lQ#Pd;pZv+|1`OiWc1KyM5mvR}8?#Y%vJI*Bx@vW% zN|-(VgXYyD+;{z-AHpKC3xrVV#ayM?q6LF$rQdiRp|@2n)V`6PShFtv{kCKCwI-+-l!9{sY6>=PJxytoN1n(_mvLEsV9nwYqdZ=292* zX)J}NGg@~M(n;FPmKY=#hwlqca=G@9Hs6vlu6gVSBw${uTipVQ9EUabRv$&!HbF5d zki%+w2QqEjRVCKHf>YRH6oeg4Vjf?Ok2~|_y~-@=IH zNUD8}Cz`P|cyqjgsn!b9zMNb8z|;%9YpttV=C*5J1xif9hNIW*TKzK)gQ^>&wk)s6 z-}`6$8$%W|H9QB@xw_;cvi%bDQ+}!*o-yL5-!ZU!6-j)q54kxIvIrB+cj*zCzPVjQ zC0W1X)V+iw#@GRwCPg&O8F@@1Mjn$633vYTkjTLve)5bi@L*ao&tgQqUcr+#ac_P3 z%a^YC>-jv_SoQ*vtXY0_3GUK(I(=f_bQhKCR==OCGJC;FiWW%MO#hTt(Y>+=`Eq)f z*WwUrve}g2DjVDztKo)t?gakQnZr{d?d+2XcfaF(LN;-1PKSG}-~N^X!Pf&uIlyXc zd9fm;=?Kxj*5UZ&Q}0wIGP#&S&D6XMINh6Li?v?oF;~{p{{hNK7hy)(lnr6aANm<4 zR~BY-_5#)!BbEzCRzw+9_%2|w=~^FOZLBaI>@a7B@!~MY!|2H@_ZFc3CIi#D;gv9J zpwyEgT1njbzu0@ruqfNFZ&VPZZb}4{R#c=_y0Jh>k#14xl#~`lK_v`AT0pu+Qic>1 zgrR$AP`ZSnbFT}$pXdL+&pwX5KkZ|Gc)!ez_sn%(XRfu*wSMcj@^ycU@goWh!7!b- zZ`VQ;*zNOv82Y@QAHRKAYrom&=6P`?R+?+3V+-@FYdklT>+wsa85%LgKL#OEIR`8< z$xl*yJl_Q?)PK9dwjcg1^j<{j&Hjf4|IDmi3Nn+s#%u2R=UlZ7`K8|P!#eYZx6Y=0PmSy+Jw<;5Ul^PtHV>y9EkR65L&$ty;BQ4i z{O3@K&$TE*ZKC=QYtDH*W8@}Ccq;^oDc*O3sdU}A+A16kJYY8}*1KHUzKyeoRE zL-}XafBAVWuEKJ?hkK-o5~RAZGEXDZ;5=%>gg=e%7A%9HO#7*;+yk-DMQ}pa@vOKaB8qRpLZ7 zOp+_ix{Y_}Js$lL`NB2^cqnI#NDrB2TWG9joiYcq7*JI30GSv&g*tq_#I9#Sj(qp$ zcn%ExZ-Ksm9#Hu@NN(ec*lW4Y_a6A+M+mJ*^8L7KuXO^471VEJO+%_9HJ@V1(_HZ^5y4aR}@CZJVUi2XTZ*rM7Qh3SV zx_=hTFpi_rOlN{%y3Q1~A%Jz>1s0^*^nGgNC65V^LhmSqBXVvh@E!6&rw7i%e4$-g zZsaA%IQsV!U664cg~>1H|DVbFKa=(U&z)6$P~^t?N4)2NwSHuM*z=~``Xz-#lk*xr z3SpmLblg)&QNDGa`$b^byQ^FZ=Pqz^@?O_E=gH)mhbyb>osnja{S~`2xKX-fU7pLr z@}snFV`tZ3@1V+UFdMQk z1RTSspoX7+rV>03Kj(*^`vgO!QJ-LoL;(U~OiTLl9NA-d=J4}4LsmWH=O^Lk>8FEu zB&tpqwzcE?o61K9)SZ3$^alK#f|H;bem+I^r zyvns|cpk^V_VgHj12pE_`R$~)kYYm1j;?P0BcKB{5&NML3$RaUB%X^S?%FXwLFic;HBZT{ zV|Ys>qMSV1Rj3HNrBzwUm2qSGnO8*O7K0oG8lNp5=NRk<$o(^u7d+QHGme>^tISp9 za*Hp&+L>ElC2-xzy!EchT=}yXq;sR+j?WG!hsHlaJ@;7mG$bUY>lv#JSpEH+x!Zd5 z@wu5-&6`}HvGOSzi>vkgos=$}W zOvD&wC28;o3^+$WqYNsqr)<8P_!uX#y8n?{ui`=JaE!QmncI))2y4dam%kED89~7f^q&R&2 zo~dPAF%jdMm+X20n~$C@_#*pL+3$qJ7;26wKVr5~Ej*4FrUhoD3vqEajeDLOR`(NF zUSq!uvFNghkJ2i~NG|hCzN3-uVc#e$dA6*$9#|lDQT2&7=3sL#8&&SsYFA|Gnj)3I z+w;k%F-E3}Wq!TMaznD$sfAx#n#=SekNk_c0Q6m0l$xaWN3bYgVNu3^OH$zx406Gu zcuhu6uvQ&twYquUP4M0+5W28qySMUdV^+o{NZGq4+OEQG!AdjRpjj_+P)%F7Qch&J zV|X8X-&$cEvLAlcu9T^pr9Wx7r&lnmD&`0r)2EXHdZ?LNrAZ{;o3J< zarMd97H&hi$xcm?L9du1YMsRM2{buN!ZZ;>+xGMeA%!qE9pjReFgKS#v^?y#5WpwC zd8=C2q_nt*aeZ%@LSMwbgE=R925Xcz!)0AEOPyDhnn6?FFVh;sU)AI>NycBPQI_3Y zjdm-QpF7yc>|;~b@(SIGd@5UIgIV_?>MEtW_HkJ;C2g@$3H>Heh4L9g+YVuK{FUZ? 
ztZAf3n^WFuNuJ`Ym>TEXwYxQrg}>c%2jI-xWu0EQb0W<}L%h5WMcT!Um3)xMnC=vNVm2y1yc)-^=SwN^iJur%+nQi;gm*qa z_cxW%ijz*0dy0x>5XuD+Rt4v1PwVkEoTCgn5Jb^@59m)uu*)?+H{gS*bH}rkhN%OY z&qQs;^6VJC5hCTQUi+I)^#;<3-+=uXB;G~}x}X@|j9p7~funGwI^Nw>2SJ`k&Fc*E z1AE5h6D@pY16rkE?#3L*-fM7-h-jaq{y$L+Ag7JNpoI^;>3^UP0<*H*&<8BO8W|?o zE-R~1lZzWtVp`~uU;t06tDtXFRiyq>8hmp9VgU|EEN46lS)$ucTM649HI(i4|L}s| zAz%H+EE=JhtYP0CeW zDTF7f98j`dH7mgzBHD@P2qCpXiwM^y2&xGtHq{OJuxjtIAYl@7yeUz<&{bb+E|*0j zNsm1E0v>b>SS+~{rNRlhmbp)p-a*%)uz8Cw(b2>QRRV1Rk@NV^JHZPwoZG>Pgz$r> z&k2+0B8*RW8CnbH+(S~Jnr?Jb!6@V+hE^#+sA}-`Gw}LB9%-i6tJfYg?Un6T_^qIv z2n^1`C?ywvxe4ORR5ejlCRkc~(%_rN@UDPzkA%MNGPKuBlg$r)QE%qAPaLqHr`oG) z!$b)soR2DT-61JPfan;qTMk2g75H1Gb}F;bk=bmoH!R z@_$n*GhR`2cbW>ed-y~gX}|#*F#XjrGqWyE=c(b2Tk*|i-j8pSB%ngUd2I0||2tQ? z)xdfcCT2(%JS!lLCLfzY=8B9Ui6MeR$x~HeQd#|54WWVTEgCqpONZM7@8}T7Y2Tk( zKW*sNec-H*DN7oliVuOM|0>ItY2yAX!nbQIV?9%&G>I(T;v7ElrXeyy6UZCiivcgW>khXmpOdW<=-b!9^F>tUS{cB zk$DUia(_S#$yVROSEIz#&5lKiTep5QY>Wvt{9?Ge*rq|?g=~JG29iS=5EN1!Kjaz4 zA@5VMToX7g_%!Vt(&1AMH+76k*HxwZl0Ou+YhQqD%HbD5f@kmucG-}p!~Z^AUJXg0 zkA&Af3Z`yra%o>!As>B(bX$>-i8>;92=^RqJv3R=sQt9vG_q1rkP)vHu^f#b9Ce(X zA0=THszA@3Y!tfoX1CTI7J_Z`0bafT1h0f}*4pvjEMTr^j1?<*bH@@k%@a*zeHu?5 zii^{_eh=$IEPlSM?hMGQ@$lqdOxy^d%rKp@)<}9p0u`SJj+Z;* z;SoTg8yL3G-(kmAS9m!26T4)dG4gYmOMTtT zwBh{^_9kbrrppg!j`nX8LDlWa@@pIYE*>kx<}NOKsqfq`se;@mki0nh6=mZ!vhVQK ztYt^zf@1ie(;`FgjmY8DN&THVcGPl$vXze}*^m=v@0~l_(X*FEf1+$+jZd$J?e~9u zy@p}&nwdU{Z%2#h2FIggky;_!w0=5VBJY`zN1=VcOb`cRe6!Fw*yspP#>H0}ay$s- zb{jTe;lne$aSX2nG-8}F=h0g`weF9^?=^np`yjV=CbF1vd=VKmB>@w2#KZOP(~BGf zTR*isRma$yq}@4ZF5;^oy)U|YQ1s?Vk`+>4W^XKW{gLigvY%=6 z#jIb_zYN*5(licUI>Kw0*_(!p7cy(*DKlkF_7yW)ht<$;o<0Pb62f`wJ^9(Aj_l%J zk$4cs>qJq}?+wk3O-+Qr`ai_e_MedTZb~&))aF+M!M~Lq7HX`A4v{S+P|`7~Sif1w z((0ss$>AC{2>nUv<*E2)OX9EE)m|)C^~G_SuGh$gagO$9EXCk{rBkhGf)$N!RZ!md zN_ri6DHZr&BtAwQiq{tUC%_6yZ* zcD!B?k(b@ZnATYOZ@8~aY9?MM3>R~n6dWHbMrOVQw)Ai=e+4eS8mdASdJniR6xFO0 zP_Akuqm|2h?|8rk2rh=z;U?HVW^u3EwH)QSq2|hJQ8)~jYULopS$0Fm@&vN=DL{%z z%--at7Im2oTVEVbtLn|`4(=`N*(;Gsn#blJe-2Hty?u$FOOC<&z&jvf-RJ*pd#ra! 
z6UymLG)Ib%mwpvO8sde9N@nlaLl&y}28-=*`K?Gp4OPnGrOXE}wqjF7xRvujI$7f9 zQeyDhXssiye}7NB>f0L6URDt=2KrJ3mnwyfEg4xcE~4OTY(}G?F8j;e@+g=kKa);r z=gYlk)k<;Hq`HnM>cCG=&1)R1=Uas^tKF0ZYePxy{Y7q(q;!|kgQ5g30dE?@bUGST|MIMUcLoT0ZuE*zu1MucAh`jdd^;%F%< zw#}_J!gqDAb;PRWnXX=iT?*-M?oZBVImGJN@{?X?)Z-L=FhXtQa(zMXg~1+F>bs#Q1=_{GkdqsjjY8` zuv3xyHANL%QcP`Xj1VewoM=kvxX?I_&&`%#oS$u_fxhkTRIsEVI9mh8?J>X41YR@| z?u1Bp30mkbW7L~gRVknc$c)0*ov^XWsgY#}g}*VR@SJvJXx+chF~Dy-sUr1}f2+{5@dQ*|-&tSy zu%(&*sNv6u!tf#=;pn|k5O&6%8gD~3x+0e(&yGvCE zxlk34>kH0tqA>*uO&S1$gE}x5oYShf^@ffU%BE!Md>I!@aL4eLq+fqQ>ajHuxmI2p}r_U#pKwY}- zJY!?1Eq)Iui(P9E_*SE|Hu>^b7k5{MFXt^|Q?`$k-Lq$nJ&-M9JYqkESG?m8} zE?->eFX{xtdadmSr6n)}io36TW}Z!WGEM?@BfbBl$Hz>f+zxlhsIi~L%Za8afjOXD zkTB;}(h>f)6lv@uwone$DW_8+jXKKRan2Lr+H#-cIY}@VnPOxmACKQGQzog+Kcg#r zR>YwxR!qIu$TMeepzEdS!-sNdBfwEK*Ygo6wbX-ZKKh*WVKF?MK9tWivRFEp&YRJoHY#b5~Z98$O@ zf#qJHG0f6KJ7++9^uBqZ+O3+Q#G&()R*oJv8GDf!3MBKib9O`F#=Ag~HlNXU@jOw@ zN(8UI^Jq1mW}eZIo&D(Lv%NtVLD1+yK}kxFN7%wSKO&Ykefx@(vPCq;&K&X_#+RyG z7xaKZM5QxTxghPi6S57FBjAAVKQUS^iim(eVX}fS543wgg4ntu_^n;ZL|Fa-J0^gV zwfK2{)NnZvdH{n3KQ>x&dpemx@(IdgeLe%q()T5H2d$ov^Hg@&L&~QD-{Fsuxy7=j zp1i7kE7ctqAm3nC;?~SO@016fTFW+%UjO*k=DHP7ZjRUHleDK+K*{TJ$P2X2^Q2Jv z<^v2Of>4oM!k$MC-Pq+4Xl3DQ#1JTbW3uj8aU8^DF4HY#2~*Q+Fu65WvGnMlKUJgrDQ)z7-;okEi#TV1apO~m zVg)YYU;|GoBU{FchpHK6Bc~1L)-$4##Mf`sTrNTz3Mm3hr_6piT=1ro-tgtiKUUuP zAkh?O&6r>iU08rIKwQW-um$2!w(Ptt78suCy&wwS1r|_ zFR&9FFEM%%s$Uhzw3c)7h5DTm8Ye>d4AIlTHWIcq|NhL7lmJgmb4gQrAqjf&w%sGF zjUXKjO)}K@y}%FwC)$2Vr2?3g3?04^dh9OOFSJUez{F+rCU~v9+GBE}#6@L7J=uF^ z3obRaz)DC$bFok2@&En4e9BsrA60BCm)?BB_~NG%6apRAsV*_?LjpHauvAT6=uC_D zBu)y;j)`CYVF@C1S`PQ2~BoyNp6=AbUJmf5zdoC1UtFxfOu#^tp|H8EE?!-ZGu`Q#l>cOpMwxTj|y|&H+l9 zi~Y?4RRuOGS6c;>yUoNmfM~^%$lK`@F#Rx#JHORcZAP{`a-6?`#(It$w?Ra=IwHFL z+CH8#yYJr>Yo2;1VJmLp{+s)9q0LhX5&Xt1z8NHSKmnXD@~gkXvt+ONAyRXv1+Krg#2bj<&2Sj2=?7|3UFFMXyZuM~lglPQC<7;*%pA*PfyU&P-?2CjRTj_H zUZr1rI;c+-saS&b?-ZT6pmw&h6JnUl?5p-%g)G`+ou*ofb{>H%&xVUG`^?mlCOy2r zj*J}J4n2lniReTM4)j)7S1Cn-aKG2}2CvMJ)B6k*KaqVMc(9%V3GN^Zc>M#3SiRcz{X0Khd;ZKW{>+{* z`vh3b|7`dB(|;7g@>i_Bm)aXp5|~0BN7Te9SS#OR9uva)jh+G5+@_+XoJ+I#9B|kK z987YSC%ac}6+1@o=-zumnD(!JjzaV^39`ub@E0q6LhfI)q*@Fye%O%y{BM>MUd`2| zBXK{DJdA`n4tv1%>LBh>j^%4YBBVdn3Gk(FPn^-;_GDKO^Jq9w0Qq$*m^@PYdgA|@ zM`z*RHwb<^BF{C$b0XTI#s5C1hnPqAB0Z6RQ@|`ql*zGstA$UNx|`~e!>^Dv;A9M!pND>ur7M?=#OJZ(^^V`O#8hS5MI_@p}E0UB+#>W68 zeSt5tl%o=Ihir16=$eVxaj|Sh^DWReBQp|v7B1lBKAj^Bm(h7W`vr1aWOR3dswDR8 zEpSa9!=pr2K}Q4q+tUvei&byS>2O|6qakr=#Ngf>COC%7oDhBpDTd5vPeqDICg=Js z(hONY{EQjO-b0m7a8_1_0{@LVwL?@jaO%)Ld@)I1wys&B7g?6QCWtr@11q0CtXcW4 zT5s7EQ0ropfeVvxTb#xmIqis$Eq5x4Z(HyF(r`upSCeG%!cy&$qH;_xoihtOO{`T- z4T|WJjLe@q;gWfuSAO5|>0hfH+9~pA(6imPj#;mp?~XV0e7+zuC4ZaEe!sj?n=SNp zICnOgRhZ_YSLwOcM=WsPpAHiH>H4$ftj71zpg$P5LAzFjgTT&Bz zH|l%VbLovgXy2EGB|yx^A2tp3v&k#OH+8ECCY2dQES*P|vrR%Ad7nsv-wvwvnQS#^ zmsxeT_`Wwx_Szh!-dpd!$}Hn`o&9BKmuaNoB#Wt@6&a7g*Sm2`RoYScrd_v!v$*s_ zvVX#M^(U;1Sx(m4uawg+%1&G9Zv4LVvcha;WZSGCUlWZg=c`)mATD`bfA*>Dd4eW} z$#FsjtI>iR?8~C@h;09A1a8AE`8nKO{;K8J#t8lxp5MJ6s?_6cw$}FCMmRkAI7IbF zSGf6l_me$cG`aVm2NZDMnk&ELaDU`l@Qi?(Zoa8jZn;>I6NZnLC&GBgjTI#$;k40N z!eze5WHc}C4VPKP`wP)U9*K(l-D3s!$|;dkmXEJSj)_aCPz+FEx?AbB)q-uSqRZ>8 zqt!3kEV>4-0)gcD%l6y&Q5SeQRAV(R&M~gnhsNDS zug++!y;rb$a>JW)Ei4x<8BzFGu3Tx1K3gWm9CIj8M+Xsi@PMnGW?8OChE);QluaVG z$-0BFb9JrPDavCZu>`BsVAqwNXt&t$Ub#wALmNF#;a5Gq5|v}Bfn{Dx4kz~KLfOnX zR>(Zbe0_?CVK(P0vLXKzBF^-%C{~NvdCDAnE_z9hKdfY0=?3pe8P?V7Gp_^A?T5?v;2Ku56O6EjIZxf`&~+W!3;YiUx2;7R1HM<4+MTp8o^ZvGUvdaBaYeA(8C zX`s{G3UMqPek>BA!UY=fibwwZmR4FdK$WvcechQrD);#sX#6R2g`CO`><=|47O08e1&v2gtpy!QhpR(lRC zc<5GB3`Vv5cso+kAcDrrxaB6*jK61H6xax%hO 
z#~B?90n?@(=qT}d&_9TN>wy0g`||6c_z!s9DRuX$n>G1vR(DkIU4ZXt8R-F0& zn9G_&sr&TZ#9NY5&~}?hDChY(iPAUzrX1S!GPdkIJ83=3QWto3P}9y)?~c3~Y@%Qi zI2gq;o56oSUMn@sgid5UaN4*9lh_;D6Tnmb)A6VOAm|M*@b9i)dlxr&t0wWDpcYdILNH99sO%zFAmr=Agvvz*%eGw4z=^2envufbs8OkHVt6~yBSwP4R{LeDk6 zJf?;oDpg0ChzSd8fnP)MafZ`m6Q@&ire@Z79>RM7Ev_;NtnfJI;XcQi zI|c4Zmtps*27=$!Uw}fgJ6(<6&Pp@iitSXPvjAVBQo`h zGWYZaYRXODtW}j~{{q_3E+7vSGN~b07^xcBFMwOd99%5A4c&%Z`pJrcFH`DI^>Ls+ z?YH#x`9Z6lZc_Vl7J~oz=B;)>Q;mw%C_`?s#=d)NvHZ^AT$)gwAb-=TE|fJ7nBFX5 zcP#swJ=?F?7|xoC_f)yq_i_i^VftY^DM#VtNaGua0b`$Z=rhb5NaS7EtHqhI@83;0 z*?tD<&Be|8Ub&HGKvxU2w^7~-N=iXej*P@zuMB%(A9$~IYft@yxl@E8*z_-R<8bw* z@ox?+7DGyY0OcJ|!(_}S8OPtZfp?ZO`@jfUI^{osa&#=vfPJY>GBt?-F_z0> zOZgnm=0Aaha1ED4#8c{bGxj`i4yB3m#I)<-4xTFA7q))a>^v)bI|qo1wZKmyv)pf; zS5%uW7Vouk$1}QE%x7moL_&}#1W`B9*106&JPFtwnS)(*U;&-6WXJk>a@f2fX)}B6 z4sMEhd(jxqF~4ei{55c^(f9ar5@Spzew$^$Mov4~(i{or#Xzc`*XO^$Q&!|yD4~R`>NQ73M8_kmx&*wrsZg`!qTX~?E?VIQ zDXqft&lQQK@?^vvxNijBC^Xu2!RkQTL`S%aRopop6q{p9(T~KJ6UT`tey*M!GcyKq z({^=LDbtb}xM4O%37Ee&b`-d7o0<|Z2OC@Q{f_@RkZQMDQ+iKE80sZ{Ol15Q3m_WJ z{2%ZN-KQuUO_5keuc&u@aDTqY7(qnIk}c2by;|soQSuY+ECz~C=7Xt3e^UWR?^<#Z z3-F6BZ+K282CUXG+J8Ur*;pKEOpx(?J+cf=+AJWP<&>5wr6IvIkSybFgXm*VoJW}h zbWV#Zn#K|q=kpJD7BChO$Hfbj*L2(23Gj{Wdt3GwX1t_iwOb1Dp1f^;$)pr+;%Ul> zl{2X^tKW!NW7nFe{L9>95=I;`ywskNBpB-C$!k-9=-BzQ1|Oi0#6Vp{Z9pKB>Nc|S zFHi_0Yj?J;K&3ctvbi%{+sN|3XMG-fbxh3RvoH_zdthJKarAPGQ&e1WbDbf$JFtGr zKchwLb5&*bG|pUogi;k4+d{m?2VyYa3?g=ex+W4^y*x7l#F)hC$F@iJJLh(9VhkS_ zow70+3{NlJQ59A?8jc64$rgA))Ii?-8o7WY|D_TqJ`=#|(`KDl;o3hHovNCU3;e$X zMc*Q^FC@l7`X{exW@_AVe#j?4=ndZT-fcT3|tk4GRC5g5$3A&7 z?PLHK8~{-7fO()qdaA;ijm_!zx5MrI0s|q_pwv;#J{s#~1A0^s_{^njAt2OvZ6Si$ z1|TTkUO)EHLyWoW-D4Z)4?P3?Ibau9@;2>3gJf{MOEEa`U5U%q&C_=hSiYg8h>=ZTlJdvpMu#|0FTa}cNYRdp?^2O4Dbv=7oF9$8lEw=!+AcE>#yyu- zWAxZBMnT(V!6KfKd(76(8n0q6tZwzU3O!gRwjHdxQYrUsc&TYTq{213U(+(}pPBK> zw92`kd@g(FtUBJECH2RN^SO1_Uly< zhjy+*5y-smiR5MQC?f`p|0j2)Bkv9e@n;tumQJsr#Z1@8p>A#tql_)aseuE@E#D8C zXRBx<(u&#}Rk&Snt1(3;2PdLX?!UiNd+)XOqLY5piu=rdef`g#y~y_cMG^Oj-<^SJ zg|SImw0f4qFWnicY|8T=H6_Vg%Vgqkv3ZL1#+k+?dkl_uD(yEaT$zSV0uKD;bDK)L zl8q?Icf3FLEb20@I9*5MnTVa2=5wurF7F9_CIX+!ItcGVkx3Fn(%?Kqud(ISuEYmy zSj@DfWNIhH*X>@BW3HP1_(&($C>y;b6@obN3*l@^(`GDz)q*Xn(BPdp*S z@LBxUT;EuQ-Zv{fl)InSqFzREcz^$f2d+e9Z8kKo(SL$$#qiTd_ABeto8t>_7Nl9l zAK&iOSJsXcGt>}wvn)&So!`9?tRU1;nY`2GX<}bdwraRzw3j=Z)9|wrK$zX?PvU)EkB(jDw++wP% zRi%6?hjHtLbG!DxRj=|aiJ|K$lAimpJW9Z+6u3FNt(9r6#bvJdkKbR4yv$RQHK;<9 zo8X}LN4oQ1J+lO()M-@U8O+?GY!Onl&voGHzSB=Pp*LN!P-VICZpihgP=Hu#wl7~~ zN$gefpN|4eajYkNhPMlZS^Ki*Fot<|xH&RfYoztOdLI;Re-3Y)4Zco?oX3e;ZY|W@`(08(NcO>w{ln^S-ma!Ja40es0BfA@nbx2o4561cq z4y$}mOj;aV2qwdi#+qlpr58@7q4mY&N7ur8J6G9%f=+S>lS@$9LEZXn6XlmPsiY#7 zE!$#NS2JEIhxhM9S(kOqCY=oqNXLemMXzmS-<&Zoin<`s5$oijvNL72{WWUc_Lnv+nlHMxZt1#&~Y_kO0bcYk)Kkq8V}0GSi~p0N~hQ?{Vn zAv7GA4H_YTSTY#ZBrz)MS4fJv;1u44Zkd}8E^M04j=E3%(Y^35-b$ZbmJm0dJ0dcQ! 
zRBpU;k@ql-OFSw~(^zCpy!oqToh3VF#^{XziN9@uuqW<#%r6mE#lG7#pKFC_V^H(e zzBHEi&Vw+1*EOjN*CH+Os_Ju3y>k`FpMQ59UCwAp8*#kzmGVbHMx&_Y8*3En=~+>B zHmN!ZrQdqd6?60Lzt(no=0aMRL~Uzxx(4~j1Z`-F4@#F<_lg%px>@ukDy^8a+`7`u zlIN+cQ@xCJHm`?!WQFbFs#}`^A0NEj(-f3%-=sFCiSRsV)0e7TdaR|>62Tc`RiTZk z3^;Q}nEf}ojK!0lpNG+7MAyzeYDWaiA%00%4eAq9;J_Hhn^sQ(tEPq)=q@nLUc5nv#VjKm0;LEKy1Wj)TNg2z`Lhm z-okq)XrRbq?p#ztYpHw4?5Q{U{5GlTF`RwBB13z-BGea9OdpfHe}7@0UAw`*qaE|w zR&d7m!@Lm2Uc^&hJEHs^hf3Vty45oa%H_Fn9yxnr+0||l&30I`&*Bdcg4Ugyh>J`4 zTIbv~E_t5yZr>*Jl(&V8!1}sC|A2K&tt*&Uat*aCw1-PU0(u1 zwfrP34#2GZed4mnFP@5-88t)W>h>60#k>7CC3RQ&@WhJ(+~uMqO$#5@zt`U2 z_4;M$-Gbe35KBc%C`y|8e>jy!H&MG-31h2GM)oeG-lS*kncR~fg{&jclG~r6huYgT z7gE0r(|}cym)fu%h>GQ*=6C`gg!zT-4v9nW9mL-{2)Q+iZxaT4csf7vXfkACZ<~n^ z0V&L-nWx=w#Wv~ep?!i{1a~}L9G^gblLX??BH)0Rm)mdnD0@7NuH5&?yGBr~z(vrU4a(NlG{;GP7QK{3IA@9&y8qJUEJ@K%yuF%NMhC%Z*~R zY`Vus?v*54Ih>DWCh?1=@4nlha7#1SZ05FPh&h{s3N%4!Oa&irRN(UPuO|$qKUHx= z$?%h;F3x&?ckA2Q+X7||2Y_D8Rxf*_%O@^`STmBZq`{H}xNsb~y+DTk9YZwge{-=v zQgC%}as7@pn{}DmS<+0kIpd=CE3+|zJ&V2n7W?641t7zR&a?k)g=xBHB|+Qr+my^q zZrPF+=AcLeP&FFaI#8X9$zX`xv3M&4!BP}D&)frufZ@Yo=49^NDkWNeF8X&<5PP}@ zwysJ>SU?MSUeZ50Ul78de;M~LNd7AyPYiyPa#atXVIHqFloSgY>ZdIjuy=7%_Qlcv zOv8VH%hJKtaH$uaig7PZb=xG-bAcCez4gEc7AVo&C3_}e)CFgIt(e*ka^%XR& ztW(Mf{s|>`;g@F}%y5>F3u}4QTpkRe(%+d)5DaAt9|Od^M>b7z%PM%mTfi(?ADg@O zKUM`=bZ2h?hjJ)5l%+pb*et4&rkM@IY2K|8` z-EJzgS($A0_%$O2L5d&4A46`Dy&uZXj-G%jsBYZ$>$@Jw^|wU?BZZMjh8j(v zo=*H#+;Qc&y+oifKV8&k6ih+>~)mkVazc*GO&(9Y&-pR_mO&^eb)d22Ub5{9}H95EuEc zpN3QpZ>sx+tk=9EjnIW=h9!bgu73w;r6XM7Q?5#HJGdN5O3%k6M8JZ~*(L&a?``a~ zmm#$oC|16Oqwk$-t(vG`>@-J0%ocv>0TXL6`&#vB{aH5~PBsGAw2gTb$?G`g*9E6V z-M&7Y`0Jg+AaUX24@f!#M5w=J%oNfhVW()>uG;*Wt!Ktnow(x(8VxvwlOL1Ux&FZH z{?~r-p}Bm`z4m3WM~1brq0e%-&z$rgl5%p7cQg=XqbIY4d&MAa64=7+!r~{^J@NZG z(y@QR`o={r|F~!K%Y>cX1C_kBA41=^!vC_POO@@ zz!OmzHFy!`-(2mylR!E@WUE{;Tz*H2b*45jUv&fz4MHQ4M3jpsCikOI({Z_t?tb5% zOTORm8)G8dnDkziG9XJ2nDeFqzo#1n;1A0(lz6C7tl{o5w|UAstU~(%0*R(R>0qEk zvAQEaS8$J=gQ6Q9Y`tKKg(R-K-~eAR2Iv24D#}m&EiE>BDOf^q0nQ z7Dta!IUv5?_KC#D@OIgLQlz9a5=xHXjS;pV&IVv$2E1~EoT8Ic^Ark7kX_u?zBR_M+)RdWkQMcF%PB^c{A6dieE;c^*ybiQ1V z6L%4x>j7VKFL+*A7Udmf2&gu-M~ixmK5P{|tfznwo+<);h1gCgBnTpF1|NrMZ$>ZD zNj{>y0LvlwHdzagW!}vXYEByi4y>g25@=l$u=xv060|$0YC&W29>_s z0RHAY73Y=-5$@TZoEV?oWqHkvd$*nwQECB7Ho3=_B}lfAu<7;+5`Tgq$y}$ZOt7od z6G%@y>c{9MFiVmf$MWhvqHJ6zb--=R$DC#1|75;p@#VJ{#CR{;Row_$da)%;+ z2E-56?WzG^78H#~=^9!cI3;4Ol*hYhd6}2IthywC|K4QwOs$Wx9ilpsz99H>i2}Y# z3&@~?BNKhuq(*nIVsrdW_?3rp5Yx(9=gYlOGu>u|o73&d z9;@y0LUx0qJDy)AMIou6;v((^y;0j6{z{7jENJ6yfIdpxj=9_7C+mpsrlFtFXZj;- zVEaw2G5$`W>P)6xdtVCeUiV`U$SZ{oJ#7W8E$xe!gxw3vKJ<;;LFye2by3Rf-i11H z-EdGJgn|>CvH27Jr}N%L4OTJ7 z^F9p1V48-n4Hx)sJ0^IHpl@Axcv^SW$mPRQdmTRa+WhwinVtT+gbmh@k<=BK)nSlv z5IoPFyJH0))q4=Gw)G6KF8HF2%{N_WHxtOWam5I$XEjHg*5uC_5y^`@x!v2WcUh0i z;pw|u^7LHf3{X9N&uVXN;hkP%>}`;*Z#Tb+>fytMxeM|T3L%bx1Weple!tRE zf0A2IsFOFpjL*?2)x*70w!ey?YGkWNzy|W|O&Er2z*YQpkBQYhRzfhU|K8o*1%0{rr za5@sP)P~|RqkcrIjof(_-vg-;RKa37PG!iY*DwqWOQ3P> z^FFkel_e|KepM{ZsS@420Q=0hJLV9K1cp%3@!X7sNj7B5(ypF!M} zGpkXHLY-G6&j;`m8Bi*5+4iHfH(IaQCIc!HV%ZyBy_k5%F5jqx5v~XIr}0S^&+{hM z|2HJ-+=bGy+Mvb9m&70`gx9#z0MeV6O%S^Szz^yg+buWZc<;+aO1R8tSM0cA$(u9r z*Ygy+W7_8DJ6`0tWNRZ)^uxrpe^Z)TD}qM~Ek7tk@K~z8@|piqKQ8g$nF3_%2m%6*1Z5O;m|cjRJACBqzep$kIMa?qgnQ{g za!oDUUS@BGSiIzv;Jp8cE@0syXzu~A69tHIw!b}lV|-}~5=*!sYRU3L*g9f}E+Y9J z8>>v0QOV#s3$s*dy%r>Z8}lCraiD4cSJ70Lf}5h8RgLBeB$mwIwXRu0p!Fj@JI%xX z7y^re%--Tji18<=zK*lChGH?L`Y*DHvIT?^4A_+^Ln=dM1!FO`v>z;h zsK}^H+mPy?D^MKNWyE`?M8z{-XKP_vpMEW_v4;V))~UD>S*Zko>C2{W>PKu}W*2OWZ@;ZFG<6d3$>bM%CG^^2SK`AW9x+#V#lAL(OI 
zOJkn39*6GKzp^a|I;l+n=(@9Ts+vP)CSwLBFH7Xc7@I||_2lRe_||!t8kdv4a(%zn zxSt~>b(E*b#)I2P2;lVN& z&aClIG;nCZiu@OCk*MK)1WqwxpfdY)xsDOpcAObFxZ-31&_ElC`l zdfO$nG5G7JbVvG6Qcwx%U>2)E!YKItAwfjkHkltfTg;;3!7vvdr$OHTx3f-gs9G`z zdGD}M5n})!sP?Na_(LH{?8aZ5&l?$gs5$QX6NFeX6!?CV(@NNszG7Q4lnP`Rv9#}x z3&M;qDT1-^-`d0h05n&Z0@w+!lVeiVCk3ue>Ofg`_MO|w=Q!RDxVJiQTb>W-!z8H;vPfWvzr9?Chkoiub_8VFs>Nabn-Mhr-G zXY?aXHH=Zc3Zl8usMY}OY)Q}EwtAOvv2w@G8??8$8+q@QaPS88n7%?y}y1W>;0 zA1KIUb-o+DgBIM2D)8-j{9a?F=4){gGoRsOHB4xS0npRNN2#7(e&bDOO{Pp}{MDAgHg>erSN;k6q8tS$p-*c|<=9 zZRH-(D+iLz(iep=R_xEe`1+JZ6p-Kvyjn3Vp=Eqj5zVdjx?|pBWsl9H+sDScf9;Ev zq73>-hZf|%2k%B7u-&cOwl_#+#=TT13H_3)q|e@EMz9mM7G;fNB{JKPa`QcIvQ;55 zE*IvcXUpQ5_J6f^oexc(-CL20V8Ma37C~&OpaMccKtRMvWhFPOl^T$ zHetL927(Y~n6go1i3|fF>}d>BM%ekC$I$lIzu^7y@*$t{jNJFR@B5tVI@iTg3x{)0 zOqn<_H_hppAL+5H9i-D4h7n-8gTNf~h-=%1tcUdhI8ocqS@76!`Fv>-MOFJ%ddwB; zrnw(Eb@LL`jOB~(m6qWEZZ^;Lp???i`xENHv!q|G(3GIgDr&lljfN8{m0KRYCWLKR z3f#y6FU7?ngG99)4pa2pRa&xXl%LB+2YH_xhwZX_=i8w%yk>5}_FA|I`5pZtZKGtY zu2f+!N|-YA`o|e#;Q_g3RBT=i|IDo4y4%EOiS!M;=?sBa{yH^iLZeTO!^1e$ybzB$ zP+=%NQW8p%9}F2vUp%XRQT##4tlkTJ(ez?z_4ns3)<*=z`caoxo=axm8olEjPCrOFpRG0;+ci;QA@vEg`Na*)iH3lVULYKtcC^PZi zN|a4Znt#~TH;heP77|`ht<)91?!*CULPG;$D*`9eD$umk)J~;}u_lLk1L%L<`qCm9 z0bU~HB`5k}?fUoezocm=oqzs5S&0s(aim3sK*VzM6jw{&0X3Bfl3aNKnH`sBFSo&^ zd3mrxVWF+$uwiSE25*>6wcR(Z8KZuk(+Y%OQd7kb$-aJTG~t$9ArjvAK$vMplD@GV zy29`n)~k?I+$R8gQE9+ZH0g8#CHfWG2s_@xwccWb!r7WIlzPtD%bf9zlbz_(lw`WQ z^ebDC@L@RadN$#ftJ&=@bBoA=xcc9VnX(pM)I^b#ZgibXM>9F(4*AsTL~OF{+?1r6 z%=Yc84n*%IQ@YBX$|=>9p~dF(9>a&Md9 zyIs%k=NQQ`)a~!C@r(5rPYY{(`dK0muPa)NQ8<~Gi|R@FC%x&+$xRR)3VILy@6l0W`-!VImDR~uJ0w4WS3pF|F0?8535)yxiJ zc=HFCH?HC`t}RTv8s{b{$`f}g07=Lbs`V57ej1k9y%=u2Se6NTuvbn)RZnIg3lf)` z17OPP*I(Q8d)?(dK`)AvlZ^3VR;U@;)qhDgJWA})O%WQ~X&#;ct21JzzXHZgFynQ&{2pEY}%!hojhr%pfGdPJfa8eqFWviI8RFc}RsleaPBy+K;MNo|P} zqS?#cC+&=d$-m_|MVks!FnjX%lHZ~*yanU%q$}>GSbfT7DBB`u*0%{c9d6jb{2`SH z>af`T`0%RM5DX>`%iFAV{y@-5kcf?+4P~=cQ>c2HSo&craJhZ**oU#~A<3fcvCDa| zTm$*W5VRcP6agDoEZS(l0PZMu5GHbDmu&4Xt`QA$M~pMDR|*pnU%x?RB+W~8MZ11i zP**@2`Ir!zuwwssBO!EcTpw>wUPseYPAjNeScUI1-L0HoKRxfDTCSzkS`x|YXlO|3 z`=%c(8p0w?gquV@aj2GQNy*Phbv5xu_++FjV~VIgX?^UQ-2&N0+Tr{x(IF^>0EcCz z@?g?%pJ>FxkwtMf+8=+*IW*o7 z5d^VKK7Zd_0((_z(c!aG$j{8pAe;D`gH-+;X8*bsZY{*!E+Re&XV8w4*_uBgM)*rd zzKiV%lvtVz!xTWr*#|#+Bpb*LoptN)BM@=F^ z%5_rL0QC3-iLml32B=)R0b_r)d(YleHt+paHu+z4tS*cp=nx%>IdE~ohCLW_k>omW zTUqeEZ*spVSxl)+v1rFNWC~tCS#N)PNjkzF%$7|pfwbhzgV|C-+v`|a zr$Mt+e@BO*Vp-wtuGak3{Bn3*2*>g~!AgXgk8F41mw_XSoO-51~n3|^}9O*BO{ z^5jouLP^q^ssiecGpM3W!6Xp37|rzNUj@NNQD83T+cKAlP5^u|0ZcVPs&K(f#GwIY zY+mA+%zy5l<&X`OXJ>TN!}#Uw;a(oJC!#4IKsX}n_UR^M%0}RDpq91K7(9Cj3RrJD zzS-CUbA|pF{W}`y9jBSbW$}ZtV)~f}t7);E+0gNHr|Hj=SwNK_%r~@a8-PrtsOJ}4 zeNgW)r`NYsww9vKmW}@j{XM`|84%4V7c+n$S_d+x2;d~>Ioe_jy#mPxq5-Nru4qx^ z1HyfrVA5PU{cQLTGHnLOl zMD0Pte=?h000qN9)>IoV13)0EUVfYYJ=b_hrE}I2epr)5doZ^_;nFB%03?%=xK4uf z8=k6wF+xU4!LPWsqM4d=(>3jbej5&;_c*=X57R(poU0wkj;oqX#7=iiiklRy63_b4 zrJ#Z}ToYf+yKsnWv;xYj6p?49* zo^7R7rakGq*y`2M{RyOZ0b{g`)Zz?1(QjKY=e(c<25c?C5L`7z-;8-w%U0Y6GmDIs z#CwYj&P*j+n&5+v47anfJ|6TX{vM89AH0-BTx{c z1H_>-?u|y~o-fTkC-??Dr;=ADZ!`K`BbC>&b+fQiazl@M-2hjcfI0E%rB02{9ciJ0 z6O+;f?Or;rjyS^BNM0$`dIM+d!5 z85na!?ftbHxpV@!;mO5;}g8`6ELej32QA4{n0G{hVU zpobNiR&9#F;CHavO6cCiL4XIUNf|qSE9sEECZQCF;3wAu03^lzVi1@10vVzxv)A%|J0BFsvcX_jpQ(Z_LSO?i7Iy-EA<$nf~Y5`ml6P1TsA`$XE zM2JOnf;Is55XKuk@Q{BV&yov8VnN&>i;xFegu20-PTC?^L=Z@-8Aq1~9r8kRAv( zPnrPY>9b>{$(QIZjG>+0afxPwKjmcr$|T$!b~B_ydd1; zAAV_H99(YpXFr%<2#5k_&=~Wx6}MHTOGBpT4o8y${XyQvMQ-2j_TLZpA6m9S-a%)M z1xW4#5K}?mX+OZ2e_j6qx&B(2=TX-0PXWf&yCB<*Yd5>X#dUXvku#TV(M!E{`GMIdiGZCn$Qrl lxvw+D`uN|?|0~17 
Date: Tue, 29 Jul 2025 13:32:46 -0400 Subject: [PATCH 481/552] [Doc] update Contributing page's testing section (#18272) Signed-off-by: David Xia Signed-off-by: x22x22 --- docs/contributing/README.md | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/docs/contributing/README.md b/docs/contributing/README.md index e3ae5055b99..5a2a70d57e8 100644 --- a/docs/contributing/README.md +++ b/docs/contributing/README.md @@ -26,6 +26,8 @@ See . ## Developing +--8<-- "docs/getting_started/installation/python_env_setup.inc.md" + Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Check out the [building from source][build-from-source] documentation for details. @@ -42,7 +44,7 @@ For an optimized workflow when iterating on C++/CUDA kernels, see the [Increment Install MkDocs along with the [plugins](https://github.com/vllm-project/vllm/blob/main/mkdocs.yaml) used in the vLLM documentation, as well as required dependencies: ```bash -pip install -r requirements/docs.txt +uv pip install -r requirements/docs.txt ``` !!! note @@ -98,13 +100,14 @@ For additional features and advanced configurations, refer to the official [MkDo ??? console "Commands" ```bash - pip install -r requirements/common.txt -r requirements/dev.txt + # These commands are only for Nvidia CUDA platforms. + uv pip install -r requirements/common.txt -r requirements/dev.txt --torch-backend=auto # Linting, formatting and static type checking - pre-commit install --hook-type pre-commit --hook-type commit-msg + pre-commit install # You can manually run pre-commit with - pre-commit run --all-files + pre-commit run --all-files --show-diff-on-failure # To manually run something from CI that does not run # locally by default, you can run: @@ -122,6 +125,10 @@ For additional features and advanced configurations, refer to the official [MkDo Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment. +!!! note "Install python3-dev if Python.h is missing" + If any of the above commands fails with `Python.h: No such file or directory`, install + `python3-dev` with `sudo apt install python3-dev`. + !!! note Currently, the repository is not fully checked by `mypy`. @@ -153,7 +160,7 @@ Using `-s` with `git commit` will automatically add this header. !!! tip You can enable automatic sign-off via your IDE: - + - **PyCharm**: Click on the `Show Commit Options` icon to the right of the `Commit and Push...` button in the `Commit` window. It will bring up a `git` window where you can modify the `Author` and enable `Sign-off commit`. 
- **VSCode**: Open the [Settings editor](https://code.visualstudio.com/docs/configure/settings) From c79c338a54b0806cacdd7400ac22d6818438888c Mon Sep 17 00:00:00 2001 From: Michael Goin Date: Tue, 29 Jul 2025 15:51:58 -0400 Subject: [PATCH 482/552] Add `flashinfer_python` to CUDA wheel requirements (#21389) Signed-off-by: mgoin Signed-off-by: x22x22 --- docker/Dockerfile | 4 +++- requirements/cuda.txt | 2 ++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index b87401c5935..0cd2cfad66f 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -386,6 +386,8 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist # Install FlashInfer from source ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git" +# Keep this in sync with https://github.com/vllm-project/vllm/blob/main/requirements/cuda.txt +# We use `--force-reinstall --no-deps` to avoid issues with the existing FlashInfer wheel. ARG FLASHINFER_GIT_REF="v0.2.9rc2" RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH' . /etc/environment @@ -408,7 +410,7 @@ RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH' TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \ python3 -m flashinfer.aot TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \ - uv pip install --system --no-build-isolation . + uv pip install --system --no-build-isolation --force-reinstall --no-deps . popd rm -rf flashinfer BASH diff --git a/requirements/cuda.txt b/requirements/cuda.txt index c1273b224ea..5557c868aca 100644 --- a/requirements/cuda.txt +++ b/requirements/cuda.txt @@ -12,3 +12,5 @@ torchaudio==2.7.1 torchvision==0.22.1 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version # https://github.com/facebookresearch/xformers/releases/tag/v0.0.31 xformers==0.0.31; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch >= 2.7 +# FlashInfer should be updated together with the Dockerfile +flashinfer_python==0.2.9rc2 \ No newline at end of file From 5d8ed8a3d412666ab190ac7927c2faff0caebbc0 Mon Sep 17 00:00:00 2001 From: Doug Smith Date: Tue, 29 Jul 2025 17:45:19 -0400 Subject: [PATCH 483/552] docker: docker-aware precompiled wheel support (#21127) Signed-off-by: dougbtv Signed-off-by: x22x22 --- docker/Dockerfile | 26 +++++++++++++-------- setup.py | 58 +++++++++++++++++++++++++++++++++++------------ vllm/envs.py | 11 +++++++-- 3 files changed, 68 insertions(+), 27 deletions(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index 0cd2cfad66f..75b5ab0230c 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -209,16 +209,7 @@ ARG SCCACHE_REGION_NAME=us-west-2 ARG SCCACHE_S3_NO_CREDENTIALS=0 # Flag to control whether to use pre-built vLLM wheels -ARG VLLM_USE_PRECOMPILED -# TODO: in setup.py VLLM_USE_PRECOMPILED is sensitive to truthiness, it will take =0 as "true", this should be fixed -ENV VLLM_USE_PRECOMPILED="" -RUN if [ "${VLLM_USE_PRECOMPILED}" = "1" ]; then \ - export VLLM_USE_PRECOMPILED=1 && \ - echo "Using precompiled wheels"; \ - else \ - unset VLLM_USE_PRECOMPILED && \ - echo "Leaving VLLM_USE_PRECOMPILED unset to build wheels from source"; \ - fi +ARG VLLM_USE_PRECOMPILED="" # if USE_SCCACHE is set, use sccache to speed up compilation RUN --mount=type=cache,target=/root/.cache/uv \ @@ -235,6 +226,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \ && export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \ 
&& export SCCACHE_IDLE_TIMEOUT=0 \ && export CMAKE_BUILD_TYPE=Release \ + && export VLLM_USE_PRECOMPILED="${VLLM_USE_PRECOMPILED}" \ + && export VLLM_DOCKER_BUILD_CONTEXT=1 \ && sccache --show-stats \ && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \ && sccache --show-stats; \ @@ -248,9 +241,22 @@ RUN --mount=type=cache,target=/root/.cache/ccache \ # Clean any existing CMake artifacts rm -rf .deps && \ mkdir -p .deps && \ + export VLLM_USE_PRECOMPILED="${VLLM_USE_PRECOMPILED}" && \ + export VLLM_DOCKER_BUILD_CONTEXT=1 && \ python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \ fi +# When using precompiled wheels, keep only the newest manylinux1 wheel and delete others +RUN if [ "$VLLM_USE_PRECOMPILED" = "1" ]; then \ + echo "Cleaning up extra wheels in dist/..." && \ + # Identify the most recent manylinux1_x86_64 wheel + KEEP_WHEEL=$(ls -t dist/*manylinux1_x86_64.whl 2>/dev/null | head -n1) && \ + if [ -n "$KEEP_WHEEL" ]; then \ + echo "Keeping wheel: $KEEP_WHEEL"; \ + find dist/ -type f -name "*.whl" ! -path "${KEEP_WHEEL}" -delete; \ + fi; \ + fi + # Check the size of the wheel if RUN_WHEEL_CHECK is true COPY .buildkite/check-wheel-size.py check-wheel-size.py # sync the default value with .buildkite/check-wheel-size.py diff --git a/setup.py b/setup.py index d46e678e7aa..58e5833f16a 100644 --- a/setup.py +++ b/setup.py @@ -7,6 +7,7 @@ import logging import os import re +import shutil import subprocess import sys from pathlib import Path @@ -297,6 +298,10 @@ def get_base_commit_in_main_branch(self) -> str: ]).decode("utf-8") upstream_main_commit = json.loads(resp_json)["sha"] + # In Docker build context, .git may be immutable or missing. + if envs.VLLM_DOCKER_BUILD_CONTEXT: + return upstream_main_commit + # Check if the upstream_main_commit exists in the local repo try: subprocess.check_output( @@ -357,19 +362,48 @@ def run(self) -> None: # create a temporary directory to store the wheel temp_dir = tempfile.mkdtemp(prefix="vllm-wheels") wheel_path = os.path.join(temp_dir, wheel_filename) - print(f"Downloading wheel from {wheel_location} to {wheel_path}") - from urllib.request import urlretrieve - try: urlretrieve(wheel_location, filename=wheel_path) except Exception as e: from setuptools.errors import SetupError - raise SetupError( f"Failed to get vLLM wheel from {wheel_location}") from e + # During a docker build: determine correct filename, copy wheel. 
+ if envs.VLLM_DOCKER_BUILD_CONTEXT: + dist_dir = "/workspace/dist" + os.makedirs(dist_dir, exist_ok=True) + # Determine correct wheel filename from METADATA + with zipfile.ZipFile(wheel_path, "r") as z: + metadata_file = next( + (n for n in z.namelist() + if n.endswith(".dist-info/METADATA")), + None, + ) + if not metadata_file: + raise RuntimeError( + "Could not find METADATA in precompiled wheel.") + metadata = z.read(metadata_file).decode() + version_line = next((line for line in metadata.splitlines() + if line.startswith("Version: ")), None) + if not version_line: + raise RuntimeError( + "Could not determine version from METADATA.") + version = version_line.split(": ")[1].strip() + + # Build correct filename using internal version + arch_tag = "cp38-abi3-manylinux1_x86_64" + corrected_wheel_name = f"vllm-{version}-{arch_tag}.whl" + final_wheel_path = os.path.join(dist_dir, corrected_wheel_name) + + print(f"Docker build context detected, copying precompiled wheel " + f"({version}) to {final_wheel_path}") + shutil.copy2(wheel_path, final_wheel_path) + return + + # Unzip the wheel when not in Docker context with zipfile.ZipFile(wheel_path) as wheel: files_to_copy = [ "vllm/_C.abi3.so", @@ -378,15 +412,9 @@ def run(self) -> None: "vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so", "vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so", "vllm/cumem_allocator.abi3.so", - # "vllm/_version.py", # not available in nightly wheels yet ] - file_members = list( filter(lambda x: x.filename in files_to_copy, wheel.filelist)) - - # vllm_flash_attn python code: - # Regex from - # `glob.translate('vllm/vllm_flash_attn/**/*.py', recursive=True)` compiled_regex = re.compile( r"vllm/vllm_flash_attn/(?:[^/.][^/]*/)*(?!\.)[^/]*\.py") file_members += list( @@ -403,11 +431,8 @@ def run(self) -> None: package_data[package_name] = [] wheel.extract(file) - if file_name.endswith(".py"): - # python files shouldn't be added to package_data - continue - - package_data[package_name].append(file_name) + if not file_name.endswith(".py"): + package_data[package_name].append(file_name) def _no_device() -> bool: @@ -415,6 +440,9 @@ def _no_device() -> bool: def _is_cuda() -> bool: + # Allow forced CUDA in Docker/precompiled builds, even without torch.cuda + if envs.VLLM_USE_PRECOMPILED and envs.VLLM_DOCKER_BUILD_CONTEXT: + return True has_cuda = torch.version.cuda is not None return (VLLM_TARGET_DEVICE == "cuda" and has_cuda and not (_is_neuron() or _is_tpu())) diff --git a/vllm/envs.py b/vllm/envs.py index fcfad4eec16..9b6d8c8be24 100755 --- a/vllm/envs.py +++ b/vllm/envs.py @@ -68,6 +68,7 @@ MAX_JOBS: Optional[str] = None NVCC_THREADS: Optional[str] = None VLLM_USE_PRECOMPILED: bool = False + VLLM_DOCKER_BUILD_CONTEXT: bool = False VLLM_TEST_USE_PRECOMPILED_NIGHTLY_WHEEL: bool = False VLLM_NO_DEPRECATION_WARNING: bool = False VLLM_KEEP_ALIVE_ON_ENGINE_DEATH: bool = False @@ -222,8 +223,14 @@ def get_vllm_port() -> Optional[int]: # If set, vllm will use precompiled binaries (*.so) "VLLM_USE_PRECOMPILED": - lambda: bool(os.environ.get("VLLM_USE_PRECOMPILED")) or bool( - os.environ.get("VLLM_PRECOMPILED_WHEEL_LOCATION")), + lambda: os.environ.get("VLLM_USE_PRECOMPILED", "").strip().lower() in + ("1", "true") or bool(os.environ.get("VLLM_PRECOMPILED_WHEEL_LOCATION")), + + # Used to mark that setup.py is running in a Docker build context, + # in order to force the use of precompiled binaries. 
+ "VLLM_DOCKER_BUILD_CONTEXT": + lambda: os.environ.get("VLLM_DOCKER_BUILD_CONTEXT", "").strip().lower() in + ("1", "true"), # Whether to force using nightly wheel in python build. # This is used for testing the nightly wheel in python build. From 8b78f9838f10fc2f7fb749a8e4235fd3b5b0ccc1 Mon Sep 17 00:00:00 2001 From: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Date: Tue, 29 Jul 2025 17:56:29 -0400 Subject: [PATCH 484/552] Revert "[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure (#21647)" (#21850) Signed-off-by: Gregory Shtrasberg Signed-off-by: x22x22 --- CMakeLists.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 664fb6a0ee9..ea56b8451f2 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -243,6 +243,7 @@ set(VLLM_EXT_SRC "csrc/sampler.cu" "csrc/cuda_view.cu" "csrc/quantization/gptq/q_gemm.cu" + "csrc/quantization/compressed_tensors/int8_quant_kernels.cu" "csrc/quantization/fp8/common.cu" "csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu" "csrc/quantization/gguf/gguf_kernel.cu" @@ -296,8 +297,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") "csrc/sparse/cutlass/sparse_scaled_mm_entry.cu" "csrc/cutlass_extensions/common.cpp" "csrc/attention/mla/cutlass_mla_entry.cu" - "csrc/quantization/fp8/per_token_group_quant.cu" - "csrc/quantization/compressed_tensors/int8_quant_kernels.cu") + "csrc/quantization/fp8/per_token_group_quant.cu") set_gencode_flags_for_srcs( SRCS "${VLLM_EXT_SRC}" From 9513a0b7af626e4715ec79ba293f898b39a44aec Mon Sep 17 00:00:00 2001 From: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com> Date: Tue, 29 Jul 2025 16:34:19 -0700 Subject: [PATCH 485/552] [BugFix] Fix interleaved sliding window not set for Gemma3n (#21863) Signed-off-by: Yong Hoon Shin Signed-off-by: x22x22 --- vllm/config.py | 9 +++++++-- vllm/model_executor/models/gemma3n.py | 9 +++++++-- 2 files changed, 14 insertions(+), 4 deletions(-) diff --git a/vllm/config.py b/vllm/config.py index 7e75716b80b..d236bcf8625 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -723,11 +723,16 @@ def _task_to_convert(task: TaskOption) -> ConvertType: ) # Workaround for Gemma 2 which uses interleaved sliding window - # attention, but it's not specified in its config. TODO: remove this - # when Gemma 2 is fixed in Transformers. + # attention, but it's not specified in its config. + # TODO: remove this when Gemma 2 config updated in HuggingFace. if self.hf_text_config.model_type == "gemma2": self.hf_text_config.sliding_window_pattern = 2 + # TODO: remove this when Gemma 3n config updated in HuggingFace. 
+ if self.hf_text_config.model_type == "gemma3n_text": + # 4 sliding window attention followed by 1 full attention + self.hf_text_config.sliding_window_pattern = "LLLLG" + sliding_window = getattr(self.hf_text_config, "sliding_window", None) sliding_window_pattern = getattr(self.hf_text_config, "sliding_window_pattern", None) diff --git a/vllm/model_executor/models/gemma3n.py b/vllm/model_executor/models/gemma3n.py index 7d163320e0d..168665cc296 100644 --- a/vllm/model_executor/models/gemma3n.py +++ b/vllm/model_executor/models/gemma3n.py @@ -297,8 +297,13 @@ def __init__(self, has_weight=False) layer_idx = extract_layer_index(prefix) - if config.layer_types[layer_idx] == "sliding_attention": - self.sliding_window = config.sliding_window + + is_sliding_window = ( + getattr(config, "interleaved_sliding_window", None) is not None + and config.layer_types[layer_idx] == "sliding_attention") + + if is_sliding_window: + self.sliding_window = config.interleaved_sliding_window rope_theta = config.rope_local_base_freq rope_scaling = {"rope_type": "default"} else: From fa3ac7e9e7035ffe6ef3a2a4a4453f33a3e4c101 Mon Sep 17 00:00:00 2001 From: Simon Mo Date: Tue, 29 Jul 2025 17:11:50 -0700 Subject: [PATCH 486/552] [ci] add b200 test placeholder (#21866) Signed-off-by: simon-mo Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index 6cda800b647..f95f038840d 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -643,6 +643,17 @@ steps: - python3 examples/offline_inference/audio_language.py --model-type whisper - python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl +- label: Blackwell Test + working_dir: "/vllm-workspace/" + gpu: b200 + # optional: true + source_file_dependencies: + - csrc/ + - vllm/ + commands: + - nvidia-smi + - python3 examples/offline_inference/basic/chat.py + ##### 1 GPU test ##### ##### multi gpus test ##### From cc6445327a6cb711b8d632cb1e63a099b6ce1b0c Mon Sep 17 00:00:00 2001 From: Simon Mo Date: Tue, 29 Jul 2025 18:03:27 -0700 Subject: [PATCH 487/552] [ci] mark blackwell test optional for now (#21878) Signed-off-by: x22x22 --- .buildkite/test-pipeline.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index f95f038840d..2bf0b6fd9a1 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -646,7 +646,7 @@ steps: - label: Blackwell Test working_dir: "/vllm-workspace/" gpu: b200 - # optional: true + optional: true source_file_dependencies: - csrc/ - vllm/ From 7351db9c261b011cc167ea0580ec45d8a34a824b Mon Sep 17 00:00:00 2001 From: milesial Date: Tue, 29 Jul 2025 18:16:25 -0700 Subject: [PATCH 488/552] [Bugfix] Correct max tokens for non-contiguous embeds (#21798) Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Signed-off-by: x22x22 --- vllm/multimodal/profiling.py | 31 ++++++++++++++++++++++++++++--- vllm/multimodal/registry.py | 2 +- 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/vllm/multimodal/profiling.py b/vllm/multimodal/profiling.py index 7f6fb47a21f..d96803b643f 100644 --- a/vllm/multimodal/profiling.py +++ b/vllm/multimodal/profiling.py @@ -180,11 +180,14 @@ def _get_dummy_mm_inputs( def _get_mm_num_tokens( self, mm_inputs: MultiModalInputs, + mm_embeddings_only: bool 
= True, ) -> Mapping[str, int]: placeholders_by_modality = mm_inputs["mm_placeholders"] return { - modality: sum(item.get_num_embeds() for item in placeholders) + modality: + sum(item.get_num_embeds() if mm_embeddings_only else item.length + for item in placeholders) for modality, placeholders in placeholders_by_modality.items() } @@ -253,10 +256,11 @@ def get_decoder_dummy_data( multi_modal_placeholders=mm_inputs["mm_placeholders"], ) - def get_mm_max_tokens( + def _get_mm_max_tokens( self, seq_len: int, mm_counts: Optional[Mapping[str, int]] = None, + mm_embeddings_only: bool = True, ) -> Mapping[str, int]: if mm_counts is None: mm_counts = self.get_mm_limits() @@ -285,4 +289,25 @@ def get_mm_max_tokens( return max_tokens_per_item mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts) - return self._get_mm_num_tokens(mm_inputs) + return self._get_mm_num_tokens(mm_inputs, + mm_embeddings_only=mm_embeddings_only) + + def get_mm_max_contiguous_tokens( + self, + seq_len: int, + mm_counts: Optional[Mapping[str, int]] = None, + ): + """ + Returns the maximum length of the multimodal (image placeholders+text) + tokens, including any break/text tokens in-between image embeddings. + + [IMG] [IMG] [IMG] [IMG] [IMG] [IMG] + Returns 9, even when the number of image embeddings is 6. + + This is important to take into account when profiling and + initializing the encoder cache size. + """ + + return self._get_mm_max_tokens(seq_len, + mm_counts, + mm_embeddings_only=False) diff --git a/vllm/multimodal/registry.py b/vllm/multimodal/registry.py index c44fcacd246..bfa391829d2 100644 --- a/vllm/multimodal/registry.py +++ b/vllm/multimodal/registry.py @@ -129,7 +129,7 @@ def get_max_tokens_per_item_by_modality( seq_len = model_config.max_model_len mm_limits = self.get_mm_limits_per_prompt(model_config) - return profiler.get_mm_max_tokens( + return profiler.get_mm_max_contiguous_tokens( seq_len, { modality: 1 From 8f99a83b54bce079dbf1fc9bc6528dbd0603fba8 Mon Sep 17 00:00:00 2001 From: Chen Zhang Date: Tue, 29 Jul 2025 18:45:29 -0700 Subject: [PATCH 489/552] [v1][attention] Support Hybrid Allocator + FlashInfer (#21412) Signed-off-by: Chen Zhang Signed-off-by: x22x22 --- tests/v1/attention/test_attention_backends.py | 19 ++++++----- tests/v1/spec_decode/test_eagle.py | 1 + tests/v1/worker/test_gpu_model_runner.py | 3 +- vllm/config.py | 32 ++++++++++++++----- vllm/v1/attention/backends/cpu_attn.py | 4 +-- vllm/v1/attention/backends/flash_attn.py | 4 +-- vllm/v1/attention/backends/flashinfer.py | 18 ++++------- vllm/v1/attention/backends/flex_attention.py | 4 +-- vllm/v1/attention/backends/mamba_attn.py | 4 +-- vllm/v1/attention/backends/mla/common.py | 4 ++- vllm/v1/attention/backends/mla/flashmla.py | 7 ++-- .../attention/backends/mla/rocm_aiter_mla.py | 7 ++-- vllm/v1/attention/backends/rocm_aiter_fa.py | 4 +-- vllm/v1/attention/backends/triton_attn.py | 4 +-- vllm/v1/attention/backends/utils.py | 14 +++++--- vllm/v1/worker/gpu_model_runner.py | 13 +++++--- 16 files changed, 85 insertions(+), 57 deletions(-) diff --git a/tests/v1/attention/test_attention_backends.py b/tests/v1/attention/test_attention_backends.py index 9bd0b99798d..f197cbb7bbb 100644 --- a/tests/v1/attention/test_attention_backends.py +++ b/tests/v1/attention/test_attention_backends.py @@ -198,7 +198,8 @@ def __init__(self, device: torch.device): def run_attention_backend(backend: _Backend, kv_cache_spec: FullAttentionSpec, - vllm_config, device: torch.device, + layer_names: list[str], vllm_config, + device: torch.device, 
common_attn_metadata: CommonAttentionMetadata, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, @@ -211,31 +212,33 @@ def run_attention_backend(backend: _Backend, kv_cache_spec: FullAttentionSpec, if backend == _Backend.FLASHINFER_VLLM_V1: import unittest.mock - from vllm.v1.attention.backends.flashinfer import PerLayerParameters + from vllm.v1.attention.backends.utils import PerLayerParameters - def mock_get_per_layer_parameters(vllm_config, impl_cls): + def mock_get_per_layer_parameters(vllm_config, layer_names, impl_cls): # Return mock parameters for a single layer head_size = vllm_config.model_config.get_head_size() return { - "mock_layer": + layer_name: PerLayerParameters( window_left=-1, # No sliding window logits_soft_cap=0.0, # No soft cap sm_scale=1.0 / (head_size**0.5) # Standard scale ) + for layer_name in layer_names } with unittest.mock.patch( 'vllm.v1.attention.backends.flashinfer.get_per_layer_parameters', mock_get_per_layer_parameters): - builder = builder_cls(kv_cache_spec, vllm_config, device) + builder = builder_cls(kv_cache_spec, layer_names, vllm_config, + device) attn_metadata = builder.build( common_prefix_len=0, common_attn_metadata=common_attn_metadata, ) else: # Build metadata - builder = builder_cls(kv_cache_spec, vllm_config, device) + builder = builder_cls(kv_cache_spec, layer_names, vllm_config, device) attn_metadata = builder.build( common_prefix_len=0, common_attn_metadata=common_attn_metadata, @@ -427,8 +430,8 @@ def test_backend_correctness(batch_spec_name: str, model: str): set_kv_cache_layout("HND") backend_output = run_attention_backend(backend_name, kv_cache_spec, - vllm_config, device, - common_attn_metadata, + ["placeholder"], vllm_config, + device, common_attn_metadata, query_vllm, key_vllm, value_vllm, kv_cache_for_backend) diff --git a/tests/v1/spec_decode/test_eagle.py b/tests/v1/spec_decode/test_eagle.py index da7e5e2c467..a126c7c943e 100644 --- a/tests/v1/spec_decode/test_eagle.py +++ b/tests/v1/spec_decode/test_eagle.py @@ -305,6 +305,7 @@ def create_deterministic_logits(token_ids): _Backend.FLASH_ATTN_VLLM_V1) attn_metadata_builder = attn_metadata_builder_cls( kv_cache_spec=create_standard_kv_cache_spec(proposer.vllm_config), + layer_names=proposer.attn_layer_names, vllm_config=proposer.vllm_config, device=device, ) diff --git a/tests/v1/worker/test_gpu_model_runner.py b/tests/v1/worker/test_gpu_model_runner.py index e14fbe1e47e..231dfcbb688 100644 --- a/tests/v1/worker/test_gpu_model_runner.py +++ b/tests/v1/worker/test_gpu_model_runner.py @@ -745,7 +745,8 @@ def test_hybrid_attention_mamba_tensor_shapes(monkeypatch): layer_4 = "model.layers.4.mixer" layer_5 = "model.layers.5.mixer" - with set_current_vllm_config(vllm_config): + with set_current_vllm_config(vllm_config), monkeypatch.context() as m: + m.setenv("VLLM_ATTENTION_BACKEND", "FLASHINFER") hf_config = vllm_config.model_config.hf_config fwd_context = {} for key in [layer_0, layer_1]: diff --git a/vllm/config.py b/vllm/config.py index d236bcf8625..52985229ad7 100644 --- a/vllm/config.py +++ b/vllm/config.py @@ -740,8 +740,8 @@ def _task_to_convert(task: TaskOption) -> ConvertType: isinstance(sliding_window, list)) if not self.disable_sliding_window and has_interleaved_attention: - if (backend := - envs.VLLM_ATTENTION_BACKEND) in ("XFORMERS", "FLASHINFER"): + if not envs.VLLM_USE_V1 and (backend := envs.VLLM_ATTENTION_BACKEND + ) in ("XFORMERS", "FLASHINFER"): sliding_window_len_min = get_min_sliding_window( self.hf_text_config.sliding_window) @@ -5094,13 +5094,29 @@ def 
assert_hashable(text): T = TypeVar("T") -def get_layers_from_vllm_config(vllm_config: VllmConfig, - layer_type: type[T]) -> dict[str, T]: +def get_layers_from_vllm_config( + vllm_config: VllmConfig, + layer_type: type[T], + layer_names: Optional[list[str]] = None) -> dict[str, T]: + """ + Get layers from the vLLM config. + + Args: + vllm_config: The vLLM config. + layer_type: The type of the layer to get. + layer_names: The names of the layers to get. If None, return all layers. + """ + + if layer_names is None: + layer_names = list( + vllm_config.compilation_config.static_forward_context.keys()) + + forward_context = vllm_config.compilation_config.static_forward_context + return { - layer_name: layer - for layer_name, layer in - vllm_config.compilation_config.static_forward_context.items() - if isinstance(layer, layer_type) + layer_name: forward_context[layer_name] + for layer_name in layer_names + if isinstance(forward_context[layer_name], layer_type) } diff --git a/vllm/v1/attention/backends/cpu_attn.py b/vllm/v1/attention/backends/cpu_attn.py index 3b6d753863d..9ed46331863 100644 --- a/vllm/v1/attention/backends/cpu_attn.py +++ b/vllm/v1/attention/backends/cpu_attn.py @@ -315,8 +315,8 @@ def get_seq_len_block_table_args( class TorchSDPAMetadataBuilderV1(AttentionMetadataBuilder[TorchSDPAMetadata]): - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device) -> None: + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device) -> None: self.kv_cache_spec = kv_cache_spec self.vllm_config = vllm_config self.scheduler_config = vllm_config.scheduler_config diff --git a/vllm/v1/attention/backends/flash_attn.py b/vllm/v1/attention/backends/flash_attn.py index 7c8a5e056fe..4c2a6c6b985 100755 --- a/vllm/v1/attention/backends/flash_attn.py +++ b/vllm/v1/attention/backends/flash_attn.py @@ -148,8 +148,8 @@ class FlashAttentionMetadataBuilder( AttentionMetadataBuilder[FlashAttentionMetadata]): full_cudagraph_supported: ClassVar[bool] = get_flash_attn_version() == 3 - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device): + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): self.vllm_config = vllm_config self.model_config = vllm_config.model_config self.parallel_config = vllm_config.parallel_config diff --git a/vllm/v1/attention/backends/flashinfer.py b/vllm/v1/attention/backends/flashinfer.py index 775780807ea..27552f0e7c1 100755 --- a/vllm/v1/attention/backends/flashinfer.py +++ b/vllm/v1/attention/backends/flashinfer.py @@ -21,10 +21,9 @@ from vllm.utils import cdiv from vllm.v1.attention.backends.flash_attn import use_cascade_attention from vllm.v1.attention.backends.utils import ( - AttentionMetadataBuilder, CommonAttentionMetadata, PerLayerParameters, - get_kv_cache_layout, get_per_layer_parameters, - infer_global_hyperparameters, reorder_batch_to_split_decodes_and_prefills, - split_decodes_and_prefills) + AttentionMetadataBuilder, CommonAttentionMetadata, get_kv_cache_layout, + get_per_layer_parameters, infer_global_hyperparameters, + reorder_batch_to_split_decodes_and_prefills, split_decodes_and_prefills) from vllm.v1.kv_cache_interface import AttentionSpec if TYPE_CHECKING: @@ -219,8 +218,8 @@ def __post_init__(self): class FlashInferMetadataBuilder(AttentionMetadataBuilder[FlashInferMetadata]): - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - 
device: torch.device): + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): self.device = device self._workspace_buffer = None self._prefill_wrapper = None # Wrapper for prefill/append @@ -228,7 +227,8 @@ def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, self._cascade_wrapper = None # Wrapper for cascade attention # Global hyperparameters shared by all attention layers - self.global_hyperparameters: Optional[PerLayerParameters] = None + self.global_hyperparameters = infer_global_hyperparameters( + get_per_layer_parameters(vllm_config, layer_names, FlashInferImpl)) self.vllm_config = vllm_config self.cache_config = vllm_config.cache_config @@ -283,10 +283,6 @@ def _get_cascade_wrapper(self): def _plan(self, num_prefills: int, num_decodes: int, attn_metadata: FlashInferMetadata): - if self.global_hyperparameters is None: - self.global_hyperparameters = infer_global_hyperparameters( - get_per_layer_parameters(self.vllm_config, FlashInferImpl)) - if attn_metadata.use_cascade: attn_metadata.cascade_wrapper = self._get_cascade_wrapper() attn_metadata.cascade_wrapper.plan( diff --git a/vllm/v1/attention/backends/flex_attention.py b/vllm/v1/attention/backends/flex_attention.py index ad63f92cd88..bb0d890c775 100644 --- a/vllm/v1/attention/backends/flex_attention.py +++ b/vllm/v1/attention/backends/flex_attention.py @@ -258,8 +258,8 @@ def __post_init__(self): class FlexAttentionMetadataBuilder( AttentionMetadataBuilder[FlexAttentionMetadata]): - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device): + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): self.model_config = vllm_config.model_config self.parallel_config = vllm_config.parallel_config self.cache_config = vllm_config.cache_config diff --git a/vllm/v1/attention/backends/mamba_attn.py b/vllm/v1/attention/backends/mamba_attn.py index dca5de46c06..8b702e28d67 100644 --- a/vllm/v1/attention/backends/mamba_attn.py +++ b/vllm/v1/attention/backends/mamba_attn.py @@ -87,8 +87,8 @@ class Mamba2AttentionMetadata: class Mamba2AttentionMetadataBuilder( AttentionMetadataBuilder[Mamba2AttentionMetadata]): - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device): + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): assert isinstance(kv_cache_spec, MambaSpec) self.kv_cache_spec = kv_cache_spec self.chunk_size = vllm_config.model_config.get_mamba_chunk_size() diff --git a/vllm/v1/attention/backends/mla/common.py b/vllm/v1/attention/backends/mla/common.py index cf17d933023..0095d752178 100755 --- a/vllm/v1/attention/backends/mla/common.py +++ b/vllm/v1/attention/backends/mla/common.py @@ -406,6 +406,7 @@ class MLACommonMetadataBuilder(AttentionMetadataBuilder[M]): def __init__(self, kv_cache_spec: AttentionSpec, + layer_names: list[str], vllm_config: VllmConfig, device: torch.device, metadata_cls: Optional[type[M]] = None): @@ -471,7 +472,8 @@ def __init__(self, BatchPrefillWithRaggedKVCacheWrapper] = [] self._global_hyperparameters = infer_global_hyperparameters( - get_per_layer_parameters(vllm_config, MLACommonImpl)) + get_per_layer_parameters(vllm_config, layer_names, + MLACommonImpl)) if self._use_cudnn_prefill: self.cudnn_workspace = torch.empty( diff --git a/vllm/v1/attention/backends/mla/flashmla.py 
b/vllm/v1/attention/backends/mla/flashmla.py index d3e5300dbbd..39463b9c061 100644 --- a/vllm/v1/attention/backends/mla/flashmla.py +++ b/vllm/v1/attention/backends/mla/flashmla.py @@ -56,9 +56,10 @@ class FlashMLAMetadata(MLACommonMetadata[FlashMLADecodeMetadata]): class FlashMLAMetadataBuilder(MLACommonMetadataBuilder[FlashMLAMetadata]): full_cudagraph_supported: ClassVar[bool] = True # Decode-only - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device): - super().__init__(kv_cache_spec, vllm_config, device, FlashMLAMetadata) + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): + super().__init__(kv_cache_spec, layer_names, vllm_config, device, + FlashMLAMetadata) self.compilation_config = vllm_config.compilation_config self.num_q_heads = vllm_config.model_config.get_num_attention_heads( diff --git a/vllm/v1/attention/backends/mla/rocm_aiter_mla.py b/vllm/v1/attention/backends/mla/rocm_aiter_mla.py index 834c2345583..5c5891f035a 100644 --- a/vllm/v1/attention/backends/mla/rocm_aiter_mla.py +++ b/vllm/v1/attention/backends/mla/rocm_aiter_mla.py @@ -66,9 +66,10 @@ class AiterMLAMetadata(MLACommonMetadata[AiterMLADecodeMetadata]): class AiterMLAMetadataBuilder(MLACommonMetadataBuilder[AiterMLAMetadata]): full_cudagraph_supported: ClassVar[bool] = True # decode only - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device): - super().__init__(kv_cache_spec, vllm_config, device, AiterMLAMetadata) + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): + super().__init__(kv_cache_spec, layer_names, vllm_config, device, + AiterMLAMetadata) assert self.kv_cache_spec.block_size == 1, "AITER MLA" \ "only supports block size 1." 
diff --git a/vllm/v1/attention/backends/rocm_aiter_fa.py b/vllm/v1/attention/backends/rocm_aiter_fa.py index 85a5dc8c91c..dd10b7f0273 100644 --- a/vllm/v1/attention/backends/rocm_aiter_fa.py +++ b/vllm/v1/attention/backends/rocm_aiter_fa.py @@ -231,8 +231,8 @@ class AiterFlashAttentionMetadataBuilder( AttentionMetadataBuilder[AiterFlashAttentionMetadata]): full_cudagraph_supported: ClassVar[bool] = True - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device): + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): self.vllm_config = vllm_config self.model_config = vllm_config.model_config self.parallel_config = vllm_config.parallel_config diff --git a/vllm/v1/attention/backends/triton_attn.py b/vllm/v1/attention/backends/triton_attn.py index 83471ca51b7..195fbd3b1b9 100644 --- a/vllm/v1/attention/backends/triton_attn.py +++ b/vllm/v1/attention/backends/triton_attn.py @@ -59,8 +59,8 @@ class TritonAttentionMetadataBuilder( AttentionMetadataBuilder[TritonAttentionMetadata]): full_cudagraph_supported: ClassVar[bool] = True - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device): + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): self.device = device self.block_size = kv_cache_spec.block_size self.kv_cache_spec = kv_cache_spec diff --git a/vllm/v1/attention/backends/utils.py b/vllm/v1/attention/backends/utils.py index b13362f8a8d..d1599ba10b6 100644 --- a/vllm/v1/attention/backends/utils.py +++ b/vllm/v1/attention/backends/utils.py @@ -70,8 +70,8 @@ class AttentionMetadataBuilder(abc.ABC, Generic[M]): full_cudagraph_supported: ClassVar[bool] = False @abstractmethod - def __init__(self, kv_cache_spec: AttentionSpec, vllm_config: VllmConfig, - device: torch.device): + def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str], + vllm_config: VllmConfig, device: torch.device): self.kv_cache_spec = kv_cache_spec @abstractmethod @@ -164,14 +164,14 @@ class PerLayerParameters: def get_per_layer_parameters( - vllm_config: VllmConfig, + vllm_config: VllmConfig, layer_names: list[str], cls_: type['AttentionImpl']) -> dict[str, PerLayerParameters]: """ - Scan all attention layers and determine some hyperparameters + Scan layers in `layer_names` and determine some hyperparameters to use during `plan`. """ - layers = get_layers_from_vllm_config(vllm_config, Attention) + layers = get_layers_from_vllm_config(vllm_config, Attention, layer_names) per_layer_params: dict[str, PerLayerParameters] = {} for key, layer in layers.items(): @@ -208,6 +208,10 @@ def infer_global_hyperparameters( param_sets = list(per_layer_params.values()) global_params = param_sets[0] for params in param_sets: + if params.window_left != global_params.window_left: + raise ValueError( + "Window left is not the same for all layers. 
One potential fix " + "is to set disable_sliding_window=True") assert params == global_params, ( "FlashInfer backend currently only supports models in which all " "layers share the same values for the following hyperparameters: " diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py index 84ad582c9c9..3befb6adf27 100644 --- a/vllm/v1/worker/gpu_model_runner.py +++ b/vllm/v1/worker/gpu_model_runner.py @@ -2521,7 +2521,7 @@ def freeze_gc(): elapsed_time, cuda_graph_size / (1 << 30)) def _initialize_single_attn_backend( - self, kv_cache_spec: KVCacheSpec + self, kv_cache_spec: KVCacheSpec, layer_names: list[str] ) -> tuple[AttentionBackend, AttentionMetadataBuilder]: if isinstance(kv_cache_spec, AttentionSpec): attn_backend_i = get_attn_backend( @@ -2551,6 +2551,7 @@ def _initialize_single_attn_backend( attn_metadata_builder_i = attn_backend_i.get_builder_cls()( kv_cache_spec, + layer_names, self.vllm_config, self.device, ) @@ -2574,8 +2575,9 @@ def initialize_attn_backend(self, kv_cache_config: KVCacheConfig) -> None: kv_cache_config.kv_cache_groups): kv_cache_spec = kv_cache_group_spec.kv_cache_spec - attn_backend_i, attn_metadata_builder_i = \ - self._initialize_single_attn_backend(kv_cache_spec) + attn_backend_i, attn_metadata_builder_i = ( + self._initialize_single_attn_backend( + kv_cache_spec, kv_cache_group_spec.layer_names)) self.attn_backends.append(attn_backend_i) self.attn_metadata_builders.append(attn_metadata_builder_i) @@ -2606,8 +2608,9 @@ def initialize_attn_backend(self, kv_cache_config: KVCacheConfig) -> None: assert len(attn_specs) == len(attn_layers), \ "All or none of the layers are expected to be encoder-only" - attn_backend, attn_metadata_builder = \ - self._initialize_single_attn_backend(attn_specs[0]) + attn_backend, attn_metadata_builder = ( + self._initialize_single_attn_backend(attn_specs[0], + attn_layers.keys())) self.attn_backends.append(attn_backend) self.attn_metadata_builders.append(attn_metadata_builder) self.is_encoder_only_model = True From 3285bc46b5982e991b1fe37065cfba421f90aa10 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Wed, 30 Jul 2025 03:45:08 +0100 Subject: [PATCH 490/552] [Docs] Switch to better markdown linting pre-commit hook (#21851) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: x22x22 --- .buildkite/nightly-benchmarks/README.md | 5 + .../nightly-benchmarks/nightly-annotation.md | 21 ++-- .../nightly-descriptions.md | 34 +++---- .../performance-benchmarks-descriptions.md | 1 + .github/PULL_REQUEST_TEMPLATE.md | 4 +- .markdownlint.yaml | 13 +++ .pre-commit-config.yaml | 7 +- README.md | 7 ++ RELEASE.md | 5 +- benchmarks/README.md | 99 +++++++++++-------- benchmarks/auto_tune/README.md | 8 +- benchmarks/kernels/deepgemm/README.md | 4 +- csrc/quantization/cutlass_w8a8/Epilogues.md | 5 +- docs/cli/README.md | 4 +- docs/configuration/tpu.md | 15 ++- docs/contributing/ci/failures.md | 8 +- .../contributing/ci/update_pytorch_version.md | 4 +- docs/contributing/deprecation_policy.md | 6 +- docs/contributing/profiling.md | 4 +- docs/contributing/vulnerability_management.md | 6 +- docs/deployment/frameworks/anything-llm.md | 12 +-- docs/deployment/frameworks/chatbox.md | 10 +- docs/deployment/frameworks/dify.md | 10 +- docs/deployment/frameworks/haystack.md | 2 - .../retrieval_augmented_generation.md | 1 + .../integrations/production-stack.md | 9 +- docs/deployment/k8s.md | 2 +- docs/design/metrics.md | 4 +- 
docs/design/p2p_nccl_connector.md | 4 +- docs/design/prefix_caching.md | 11 ++- docs/design/torch_compile.md | 6 +- docs/features/compatibility_matrix.md | 6 +- docs/features/lora.md | 2 + docs/features/multimodal_inputs.md | 2 + docs/features/quantization/auto_round.md | 2 +- docs/features/quantization/int4.md | 4 +- .../quantization/quantized_kvcache.md | 1 + docs/features/quantization/quark.md | 1 + docs/features/quantization/torchao.md | 1 + docs/getting_started/installation/cpu.md | 6 +- .../installation/intel_gaudi.md | 8 +- docs/models/hardware_supported_models/tpu.md | 5 +- docs/serving/distributed_serving.md | 2 +- docs/serving/expert_parallel_deployment.md | 3 +- docs/serving/openai_compatible_server.md | 1 + docs/usage/security.md | 32 +++--- docs/usage/v1_guide.md | 10 +- .../disaggregated-prefill-v1/README.md | 2 +- .../offline_inference/openai_batch/README.md | 8 +- examples/others/lmcache/README.md | 4 + examples/others/logging_configuration.md | 6 +- pyproject.toml | 10 -- tools/ep_kernels/README.md | 9 +- vllm/plugins/lora_resolvers/README.md | 3 +- 54 files changed, 267 insertions(+), 192 deletions(-) create mode 100644 .markdownlint.yaml diff --git a/.buildkite/nightly-benchmarks/README.md b/.buildkite/nightly-benchmarks/README.md index ae42f70077c..fcde284efea 100644 --- a/.buildkite/nightly-benchmarks/README.md +++ b/.buildkite/nightly-benchmarks/README.md @@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc ## Trigger the benchmark Performance benchmark will be triggered when: + - A PR being merged into vllm. - Every commit for those PRs with `perf-benchmarks` label AND `ready` label. @@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh ``` Runtime environment variables: + - `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0. - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file). - `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file). @@ -46,12 +48,14 @@ Runtime environment variables: - `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string. Nightly benchmark will be triggered when: + - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label. ## Performance benchmark details See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases. > NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead. +> ### Latency test Here is an example of one test inside `latency-tests.json`: @@ -149,6 +153,7 @@ Here is an example using the script to compare result_a and result_b without det Here is an example using the script to compare result_a and result_b with detail test name. 
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json` + | | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio | |---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------| | 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 | diff --git a/.buildkite/nightly-benchmarks/nightly-annotation.md b/.buildkite/nightly-benchmarks/nightly-annotation.md index ef11c040057..466def07b6f 100644 --- a/.buildkite/nightly-benchmarks/nightly-annotation.md +++ b/.buildkite/nightly-benchmarks/nightly-annotation.md @@ -1,3 +1,4 @@ +# Nightly benchmark annotation ## Description @@ -13,15 +14,15 @@ Please download the visualization scripts in the post - Find the docker we use in `benchmarking pipeline` - Deploy the docker, and inside the docker: - - Download `nightly-benchmarks.zip`. - - In the same folder, run the following code: - - ```bash - export HF_TOKEN= - apt update - apt install -y git - unzip nightly-benchmarks.zip - VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh - ``` + - Download `nightly-benchmarks.zip`. + - In the same folder, run the following code: + + ```bash + export HF_TOKEN= + apt update + apt install -y git + unzip nightly-benchmarks.zip + VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh + ``` And the results will be inside `./benchmarks/results`. diff --git a/.buildkite/nightly-benchmarks/nightly-descriptions.md b/.buildkite/nightly-benchmarks/nightly-descriptions.md index 5f003f42f07..8afde017d38 100644 --- a/.buildkite/nightly-benchmarks/nightly-descriptions.md +++ b/.buildkite/nightly-benchmarks/nightly-descriptions.md @@ -13,25 +13,25 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/ ## Setup - Docker images: - - vLLM: `vllm/vllm-openai:v0.6.2` - - SGLang: `lmsysorg/sglang:v0.3.2-cu121` - - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12` - - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3` - - *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.* - - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark. + - vLLM: `vllm/vllm-openai:v0.6.2` + - SGLang: `lmsysorg/sglang:v0.3.2-cu121` + - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12` + - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3` + - *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.* + - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark. - Hardware - - 8x Nvidia A100 GPUs + - 8x Nvidia A100 GPUs - Workload: - - Dataset - - ShareGPT dataset - - Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output) - - Decode-heavy dataset (in average 462 input tokens, 256 output tokens) - - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use. - - Models: llama-3 8B, llama-3 70B. - - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. 
([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)). - - Average QPS (query per second): 2, 4, 8, 16, 32 and inf. - - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed. - - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better). + - Dataset + - ShareGPT dataset + - Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output) + - Decode-heavy dataset (in average 462 input tokens, 256 output tokens) + - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use. + - Models: llama-3 8B, llama-3 70B. + - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)). + - Average QPS (query per second): 2, 4, 8, 16, 32 and inf. + - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed. + - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better). ## Known issues diff --git a/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md b/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md index a1f8441ccda..8bb16bd3cf3 100644 --- a/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md +++ b/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md @@ -1,3 +1,4 @@ +# Performance benchmarks descriptions ## Latency tests diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 017ec7ca82d..d4aceab4472 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,4 +1,5 @@ -## Essential Elements of an Effective PR Description Checklist +# Essential Elements of an Effective PR Description Checklist + - [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)". - [ ] The test plan, such as providing test command. 
- [ ] The test results, such as pasting the results comparison before and after, or e2e results @@ -14,5 +15,4 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE B ## (Optional) Documentation Update - **BEFORE SUBMITTING, PLEASE READ ** (anything written below this line will be removed by GitHub Actions) diff --git a/.markdownlint.yaml b/.markdownlint.yaml new file mode 100644 index 00000000000..c86fed9555d --- /dev/null +++ b/.markdownlint.yaml @@ -0,0 +1,13 @@ +MD007: + indent: 4 +MD013: false +MD024: + siblings_only: true +MD033: false +MD042: false +MD045: false +MD046: false +MD051: false +MD052: false +MD053: false +MD059: false diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 5197820fb40..045096cb863 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -35,12 +35,11 @@ repos: exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))|vllm/third_party/.*' types_or: [c++, cuda] args: [--style=file, --verbose] -- repo: https://github.com/jackdewinter/pymarkdown - rev: v0.9.29 +- repo: https://github.com/igorshubovych/markdownlint-cli + rev: v0.45.0 hooks: - - id: pymarkdown + - id: markdownlint-fix exclude: '.*\.inc\.md' - args: [fix] - repo: https://github.com/rhysd/actionlint rev: v1.7.7 hooks: diff --git a/README.md b/README.md index dc2f0afbe35..5348405b72d 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,4 @@ +