diff --git a/docs/source/serving.rst b/docs/source/serving.rst
index cb61b159c4..9efa905b0d 100644
--- a/docs/source/serving.rst
+++ b/docs/source/serving.rst
@@ -1,12 +1,441 @@
 (Part 3) Serving on vLLM, SGLang, ExecuTorch
-------------------------------------------------
+============================================
 
-TorchAO provides an end-to-end pre-training, fine-tuning, and serving
-model optimization flow by leveraging our quantization and sparsity
-techniques integrated into our partner frameworks. This is part 3 of 3
-such tutorials showcasing this end-to-end flow, focusing on the
-serving step.
+TorchAO provides an end-to-end pre-training, fine-tuning, and serving model optimization flow by leveraging our quantization and sparsity techniques integrated into our partner frameworks. This is part 3 of 3 such tutorials showcasing this end-to-end flow, focusing on the serving step.
 
 .. image:: ../static/e2e_flow_part3.png
 
+This tutorial demonstrates how to perform post-training quantization and deploy models for inference using torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch.
+
+.. contents::
+    :local:
+    :depth: 2
+
+Post-training Quantization with HuggingFace
+-------------------------------------------
+
+HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.
+
+.. code-block:: bash
+
+    pip install git+https://github.com/huggingface/transformers@main
+    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+    pip install torch
+    pip install accelerate
+
+For this example, we'll use ``Float8DynamicActivationFloat8WeightConfig`` on the Phi-4-mini-instruct model.
+
+.. code-block:: python
+
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+    from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
+
+    model_id = "microsoft/Phi-4-mini-instruct"
+
+    quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
+    quantization_config = TorchAoConfig(quant_type=quant_config)
+    quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+    # Push the quantized model to the Hugging Face hub
+    USER_ID = "YOUR_USER_ID"
+    MODEL_NAME = model_id.split("/")[-1]
+    save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
+    quantized_model.push_to_hub(save_to, safe_serialization=False)
+    tokenizer.push_to_hub(save_to)
+
+.. note::
+    For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs `_.
+
+Serving and Inference
+---------------------
+
+Serving and Inference with vLLM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements.
+
+First, install vLLM with torchao support:
+
+.. code-block:: bash
+
+    pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+
+To serve the model in vLLM, we use the checkpoint we quantized and pushed to the Hugging Face hub in the previous step (:ref:`Post-training Quantization with HuggingFace`):
+
+.. code-block:: bash
+
+    # Server
+    vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+
+    # Client
+    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+        "model": "pytorch/Phi-4-mini-instruct-float8dq",
+        "messages": [
+            {"role": "user", "content": "Give me a short introduction to large language models."}
+        ],
+        "temperature": 0.6,
+        "top_p": 0.95,
+        "top_k": 20,
+        "max_tokens": 32768
+    }'
+
+Serving a float8 dynamic quantized model with vLLM shows a 36% VRAM reduction and a 1.15x-1.2x inference speedup, with little to no accuracy impact, on H100. See :ref:`Memory Benchmarking` and :ref:`Performance Benchmarking` for more details.
+
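+The same checkpoint can also be run through vLLM's offline Python API. The snippet below is a minimal sketch (the prompt is arbitrary, and the sampling settings simply mirror the client request above; torchao must be installed in the same environment):
+
+.. code-block:: python
+
+    from vllm import LLM, SamplingParams
+
+    # Load the float8 dynamic quantized checkpoint; the tokenizer comes from the
+    # original model, matching the `vllm serve` command above.
+    llm = LLM(
+        model="pytorch/Phi-4-mini-instruct-float8dq",
+        tokenizer="microsoft/Phi-4-mini-instruct",
+    )
+    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=256)
+
+    outputs = llm.generate(
+        ["Give me a short introduction to large language models."],
+        sampling_params,
+    )
+    print(outputs[0].outputs[0].text)
+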
+.. note::
+    For more information on the vLLM integration, please refer to the detailed guide :ref:`torchao_vllm_integration`.
+
+Serving and Inference with SGLang
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+(Coming soon!)
+
+Inference with Transformers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Install the required packages:
+
+.. code-block:: bash
+
+    pip install git+https://github.com/huggingface/transformers@main
+    pip install torchao
+    pip install torch
+    pip install accelerate
+
+.. code-block:: python
+
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+
+    torch.random.manual_seed(0)
+
+    model_path = "pytorch/Phi-4-mini-instruct-float8dq"
+
+    model = AutoModelForCausalLM.from_pretrained(
+        model_path,
+        device_map="auto",
+        torch_dtype="auto",
+        trust_remote_code=True,
+    )
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+    messages = [
+        {"role": "system", "content": "You are a helpful AI assistant."},
+        {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
+        {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
+        {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
+    ]
+
+    pipe = pipeline(
+        "text-generation",
+        model=model,
+        tokenizer=tokenizer,
+    )
+
+    generation_args = {
+        "max_new_tokens": 500,
+        "return_full_text": False,
+        "temperature": 0.0,
+        "do_sample": False,
+    }
+
+    output = pipe(messages, **generation_args)
+    print(output[0]['generated_text'])
+
+Mobile Deployment with ExecuTorch
+---------------------------------
+
+ExecuTorch enables on-device inference using torchao's mobile-optimized quantization schemes. The 8da4w (8-bit dynamic activation, 4-bit weight) configuration is specifically designed for mobile deployment. Optionally, before lowering to ExecuTorch, we can fine-tune the model with QAT (see :doc:`finetuning`), which has been shown to improve the quality of quantized models.
+
+[Optional] Untie Embedding Weights
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Optionally, we can quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
+
+.. code-block:: python
+
+    from transformers import (
+        AutoModelForCausalLM,
+        AutoProcessor,
+        AutoTokenizer,
+    )
+    import torch
+    from transformers.modeling_utils import find_tied_parameters
+
+    model_id = "microsoft/Phi-4-mini-instruct"
+    untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+    print(untied_model)
+    print("tied weights:", find_tied_parameters(untied_model))
+    if getattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"):
+        setattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings", False)
+
+    untied_model._tied_weights_keys = []
+    untied_model.lm_head.weight = torch.nn.Parameter(untied_model.lm_head.weight.clone())
+
+    print("tied weights:", find_tied_parameters(untied_model))
+
+    USER_ID = "YOUR_USER_ID"
+    MODEL_NAME = model_id.split("/")[-1]
+    save_to = f"{USER_ID}/{MODEL_NAME}-untied-weights"
+
+    untied_model.push_to_hub(save_to)
+    tokenizer.push_to_hub(save_to)
+
+    # or save locally
+    save_to_local_path = f"{MODEL_NAME}-untied-weights"
+    untied_model.save_pretrained(save_to_local_path)
+    tokenizer.save_pretrained(save_to_local_path)
+
+Step 1: Create Mobile-Optimized Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Quantize the model for mobile deployment using TorchAO's ``Int8DynamicActivationIntxWeightConfig`` configuration. If we untied the embedding and lm_head in the previous step, we can quantize the embedding with the ``IntxWeightOnlyConfig`` configuration and the lm_head with the ``Int8DynamicActivationIntxWeightConfig`` configuration.
+
+.. code-block:: python
+
+    from transformers import (
+        AutoModelForCausalLM,
+        AutoProcessor,
+        AutoTokenizer,
+        TorchAoConfig,
+    )
+    from torchao.quantization.quant_api import (
+        IntxWeightOnlyConfig,
+        Int8DynamicActivationIntxWeightConfig,
+        ModuleFqnToConfig,
+        quantize_,
+    )
+    from torchao.quantization.granularity import PerGroup, PerAxis
+    import torch
+
+    # we start from the model with untied weights
+    model_id = "microsoft/Phi-4-mini-instruct"
+    USER_ID = "YOUR_USER_ID"
+    MODEL_NAME = model_id.split("/")[-1]
+    untied_model_id = f"{USER_ID}/{MODEL_NAME}-untied-weights"
+    untied_model_local_path = f"{MODEL_NAME}-untied-weights"
+
+    # embedding_config is needed only if we untied the embedding and lm_head in the
+    # previous step; otherwise the linear config alone is enough
+    embedding_config = IntxWeightOnlyConfig(
+        weight_dtype=torch.int8,
+        granularity=PerAxis(0),
+    )
+    linear_config = Int8DynamicActivationIntxWeightConfig(
+        weight_dtype=torch.int4,
+        weight_granularity=PerGroup(32),
+        weight_scale_dtype=torch.bfloat16,
+    )
+    quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
+    quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
+
+    # either use `untied_model_id` or `untied_model_local_path`
+    quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+    # Push to hub
+    MODEL_NAME = model_id.split("/")[-1]
+    save_to = f"{USER_ID}/{MODEL_NAME}-8da4w"
+    quantized_model.push_to_hub(save_to, safe_serialization=False)
+    tokenizer.push_to_hub(save_to)
+
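+As an optional sanity check before lowering, you can generate a short completion with the quantized model. This is a sketch that reuses ``quantized_model`` and ``tokenizer`` from the block above; it runs through PyTorch fallback kernels, so speed is not representative of on-device performance:
+
+.. code-block:: python
+
+    # Quick smoke test of the 8da4w model before exporting to ExecuTorch
+    inputs = tokenizer("Give me a short introduction to large language models.", return_tensors="pt").to(quantized_model.device)
+    output_ids = quantized_model.generate(**inputs, max_new_tokens=64)
+    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+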
+Step 2: Export to ExecuTorch
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Convert the quantized model to a .pte file, which can be run on a mobile device.
+
+.. code-block:: bash
+
+    # Install ExecuTorch
+    git clone https://github.com/pytorch/executorch.git
+    cd executorch
+    ./install_requirements.sh
+
+    # Convert checkpoint format for ExecuTorch
+    python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin
+
+    # Export to PTE format with torchao optimizations preserved
+    PARAMS="executorch/examples/models/phi_4_mini/config.json"
+    python -m executorch.examples.models.llama.export_llama \
+        --model "phi_4_mini" \
+        --checkpoint "pytorch_model_converted.bin" \
+        --params "$PARAMS" \
+        -kv \
+        --use_sdpa_with_kv_cache \
+        -X \
+        --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \
+        --max_seq_length 128 \
+        --max_context_length 128 \
+        --output_name="phi4-mini-8da4w.pte"
+
+The .pte file can be run with ExecuTorch on a mobile phone. Follow the `instructions `_ for doing this on an iOS device.
+
+Mobile Performance Characteristics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The torchao-optimized 8da4w model provides:
+
+- **Memory**: ~3.2GB on iPhone 15 Pro
+- **Speed**: ~17 tokens/sec on iPhone 15 Pro
+- **Accuracy**: maintained within 5-10% of the original model on most benchmarks
+
+.. note::
+    For detailed instructions on testing the ExecuTorch model and reproducing the benchmarks, please refer to the `HF Phi-4-mini-instruct-8da4w model `_.
+
+Evaluation
+----------
+
+Model Quality Assessment
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Evaluate quantized models using lm-evaluation-harness:
+
+.. code-block:: bash
+
+    # Install the evaluation framework; lm-eval needs to be installed from source:
+    # https://github.com/EleutherAI/lm-evaluation-harness#install
+
+    # Evaluate the baseline model
+    lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
+
+    # Evaluate the torchao-quantized model (float8dq)
+    lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
+
+Memory Benchmarking
+^^^^^^^^^^^^^^^^^^^
+
+For Phi-4-mini-instruct, float8 dynamic quantization reduces peak memory usage by 36% compared to the baseline model.
+
+.. code-block:: python
+
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+
+    # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
+    model_id = "pytorch/Phi-4-mini-instruct-float8dq"
+    quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+    torch.cuda.reset_peak_memory_stats()
+
+    prompt = "Hey, are you conscious? Can you talk to me?"
+    messages = [
+        {
+            "role": "system",
+            "content": "",
+        },
+        {"role": "user", "content": prompt},
+    ]
+    templated_prompt = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+    )
+    print("Prompt:", prompt)
+    print("Templated prompt:", templated_prompt)
+    inputs = tokenizer(
+        templated_prompt,
+        return_tensors="pt",
+    ).to("cuda")
+    generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
+    output_text = tokenizer.batch_decode(
+        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
+    )
+    print("Response:", output_text[0][len(prompt):])
+
+    mem = torch.cuda.max_memory_reserved() / 1e9
+    print(f"Peak Memory Usage: {mem:.02f} GB")
+
+Output:
+
+.. code:: console
+
+    Prompt: Hey, are you conscious? Can you talk to me?
+    Templated prompt: <|system|><|end|><|user|>Hey, are you conscious? Can you talk to me?<|end|><|assistant|>
+    Response: Hello! Yes, I am a digital assistant, and I am fully operational and ready to assist you. How can I help you today?
+    Peak Memory Usage: 5.70 GB
+
++------------------+---------------------+------------------------------+
+| Benchmark        | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq |
++==================+=====================+==============================+
+| Peak Memory (GB) | 8.91                | 5.70 (36% reduction)         |
++------------------+---------------------+------------------------------+
+
+Performance Benchmarking
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Latency Benchmarking
+""""""""""""""""""""
+
+.. code-block:: bash
+
+    # baseline
+    python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
+
+    # float8dq
+    VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
+
+Serving Benchmarking
+""""""""""""""""""""
+
+We benchmarked the throughput in a serving environment.
+
+.. code-block:: bash
+
+    # Setup: get the vLLM source code
+    git clone git@github.com:vllm-project/vllm.git
+
+    # Install vLLM from source
+    cd vllm
+    VLLM_USE_PRECOMPILED=1 pip install --editable .
+
+    # Run the benchmarks from the vLLM root folder.
+
+    # Download the sharegpt dataset:
+    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+
+    # Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
+    # Note: you can change the number of prompts to benchmark with the --num-prompts argument of the benchmark_serving script.
+
+    # For baseline
+    # Server:
+    vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
+    # Client:
+    python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
+
+    # For float8dq
+    # Server:
+    VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+    # Client:
+    python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
+
+Results (H100 machine)
+""""""""""""""""""""""
+
++----------------------------+---------------------+------------------------------+
+| Benchmark                  | Phi-4-mini-instruct | Phi-4-mini-instruct-float8dq |
++============================+=====================+==============================+
+| latency (batch_size=1)     | 1.64s               | 1.41s (1.16x speedup)        |
++----------------------------+---------------------+------------------------------+
+| latency (batch_size=128)   | 3.1s                | 2.72s (1.14x speedup)        |
++----------------------------+---------------------+------------------------------+
+| serving (num_prompts=1)    | 1.35 req/s          | 1.57 req/s (1.16x speedup)   |
++----------------------------+---------------------+------------------------------+
+| serving (num_prompts=1000) | 66.68 req/s         | 80.53 req/s (1.21x speedup)  |
++----------------------------+---------------------+------------------------------+
+
+Conclusion
+----------
+
+This tutorial demonstrated how torchao's quantization and sparsity techniques integrate seamlessly across the entire ML deployment stack:
+
+- **HuggingFace Transformers** provides easy model loading with torchao quantization
+- **vLLM** leverages torchao's optimized kernels for high-throughput serving
+- **ExecuTorch** enables mobile deployment with torchao's mobile-optimized schemes
+- **lm-evaluation-harness** provides model quality assessment
+
+All these frameworks use torchao as the underlying optimization engine, ensuring consistent performance gains and ease of integration. The techniques shown here deliver significant memory savings (a 36% peak-memory reduction with float8 dynamic quantization, and roughly 4x smaller weights with the 4-bit 8da4w scheme) and performance improvements (1.15x-1.2x speedups on H100) while keeping model quality within acceptable bounds for most applications.
+
+For production deployments, always benchmark on your specific use case and hardware to validate the performance and accuracy trade-offs.
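+
+For example, a quick spot-check along the lines below (a sketch, assuming a CUDA GPU and the checkpoints used in this tutorial) compares end-to-end generation latency and peak memory before you commit to the full vLLM benchmark setup:
+
+.. code-block:: python
+
+    import time
+
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+
+    prompt = "Give me a short introduction to large language models."
+
+    for model_id in ["microsoft/Phi-4-mini-instruct", "pytorch/Phi-4-mini-instruct-float8dq"]:
+        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+        tokenizer = AutoTokenizer.from_pretrained(model_id)
+        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+        # reset after loading so the peak covers weights plus generation overhead
+        torch.cuda.reset_peak_memory_stats()
+        model.generate(**inputs, max_new_tokens=32)  # warmup
+        torch.cuda.synchronize()
+
+        start = time.perf_counter()
+        model.generate(**inputs, max_new_tokens=128)  # may stop early at EOS
+        torch.cuda.synchronize()
+        elapsed = time.perf_counter() - start
+
+        peak_gb = torch.cuda.max_memory_reserved() / 1e9
+        print(f"{model_id}: {elapsed:.2f}s for up to 128 new tokens, peak memory {peak_gb:.2f} GB")
+
+        del model
+        torch.cuda.empty_cache()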