O'Reilly Book - Fall 2025
Available on Amazon
The book includes a comprehensive 175+ item performance checklist covering:
- ✅ Performance Tuning Mindset and Cost Optimization
- ✅ Reproducibility and Documentation Best Practices
- ✅ System Architecture and Hardware Planning
- ✅ Operating System and Driver Optimizations
- ✅ GPU Programming and CUDA Tuning
- ✅ Distributed Training and Network Optimization
- ✅ Efficient Inference and Serving
- ✅ Power and Thermal Management
- ✅ Latest Profiling Tools and Techniques
- ✅ Architecture-Specific Optimizations
This repository contains comprehensive code examples, tools, and resources for AI Systems Performance Engineering. It accompanies the O'Reilly book covering GPU optimization, distributed training, inference scaling, and performance tuning for modern AI workloads.
- GPU Architecture, PyTorch, CUDA, and OpenAI Triton Programming
- Distributed Training & Inference
- Memory Optimization & Profiling
- PyTorch Performance Tuning
- Multi-Node Scaling Strategies
- NVIDIA GPU with CUDA support
- Python 3.9+ (required for PyTorch 2.8)
- PyTorch with CUDA
- Docker (optional)
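Before cloning the repository, it can help to confirm that your local PyTorch build actually sees the GPU. A minimal check (assumes PyTorch is already installed; the exact versions and device names you see will differ):

```python
import torch

# Confirm that PyTorch was built with CUDA and can see at least one GPU.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Report the device name and compute capability, e.g. (9, 0) for Hopper.
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
```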
# Clone the repository
git clone https://github.com/your-repo/ai-performance-engineering.git
cd ai-performance-engineering
# Install dependencies for a specific chapter
cd code/ch1
pip install -r requirements.txt
# Run examples
python performance_basics.py
This repository supports multiple NVIDIA GPU architectures. Switch between Hopper (H100/H200) and Blackwell (B200/B300) architectures:
# Switch to Hopper H100/H200 (sm_90)
./code/switch_architecture.sh sm_90
# Switch to Blackwell B200/B300 (sm_100)
./code/switch_architecture.sh sm_100
# Auto-detect and build for current architecture
./code/build_all.sh
Supported Architectures:
- Hopper H100/H200 (sm_90): 80-141 GB memory, 4-6 PFLOPS
- Blackwell B200/B300 (sm_100): 192-288 GB memory, 20-30 PFLOPS

For detailed architecture specifications and performance benchmarks, see code/README.md.
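If you are unsure which flag to pass to switch_architecture.sh, the compute capability that PyTorch reports maps directly onto it. A rough sketch of that mapping (only the two architectures listed above are covered; the helper name is illustrative, not part of the repository):

```python
import torch

def detect_arch() -> str:
    """Map the current GPU's compute capability to an sm_XX flag."""
    major, minor = torch.cuda.get_device_capability(0)
    # (9, 0) -> "sm_90" (Hopper), (10, 0) -> "sm_100" (Blackwell)
    return f"sm_{major}{minor}"

if __name__ == "__main__":
    # Pass the result to ./code/switch_architecture.sh, e.g. sm_90 or sm_100
    print(detect_arch())
```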
Updated for PyTorch 2.8, CUDA 12.8, and Triton 3.3:
- PyTorch 2.8: Enhanced compiler, dynamic shapes, improved profiler
- CUDA 12.8: Blackwell (sm_100) support, improved kernel performance
- Triton 3.3: Latest Triton optimizations, architecture-specific kernels
- Enhanced Profiling: Nsight Systems 2024.1, Nsight Compute 2024.1
- HTA: Holistic Trace Analysis for multi-GPU systems
- Perf: Enhanced system-level analysis
- Architecture Optimizations: Hopper/Blackwell-specific features
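As a quick illustration of the PyTorch 2.8 pieces listed above, the sketch below compiles a small model with torch.compile and captures a trace with the built-in profiler. This is a generic example, not code from the book's chapters:

```python
import torch

# A small model; torch.compile works on any nn.Module or plain function.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda()

compiled = torch.compile(model)          # Inductor backend by default
x = torch.randn(64, 1024, device="cuda")

# Warm up so one-time compilation does not dominate the trace.
for _ in range(3):
    compiled(x)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
) as prof:
    compiled(x)

# Open the exported trace in chrome://tracing or Perfetto.
prof.export_chrome_trace("compiled_model_trace.json")
```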
- The AI Systems Performance Engineer
- Benchmarking and Profiling
- Scaling Distributed Training and Inference
- Managing Resources Efficiently
- Cross-Team Collaboration
- Transparency and Reproducibility
- The CPU and GPU "Superchip"
- NVIDIA Grace CPU & Blackwell GPU
- NVIDIA GPU Tensor Cores and Transformer Engine
- Streaming Multiprocessors, Threads, and Warps
- Ultra-Scale Networking
- NVLink and NVSwitch
- Multi-GPU Programming
- Operating System Configuration
- GPU Driver and Software Stack
- NUMA Awareness and CPU Pinning
- Container Runtime Optimizations
- Kubernetes for Topology-Aware Orchestration
- Memory Isolation and Resource Management
- Overlapping Communication and Computation
- NCCL for Distributed Multi-GPU Communication
- Topology Awareness in NCCL
- Distributed Data Parallel Strategies
- NVIDIA Inference Transfer Library (NIXL)
- In-Network SHARP Aggregation
- Fast Storage and Data Locality
- NVIDIA GPUDirect Storage
- Distributed, Parallel File Systems
- Multi-Modal Data Processing with NVIDIA DALI
- Creating High-Quality LLM Datasets
- Understanding GPU Architecture
- Threads, Warps, Blocks, and Grids
- CUDA Programming Refresher
- Understanding GPU Memory Hierarchy
- Maintaining High Occupancy and GPU Utilization
- Roofline Model Analysis
- Coalesced vs. Uncoalesced Global Memory Access
- Vectorized Memory Access
- Tiling and Data Reuse Using Shared Memory
- Warp Shuffle Intrinsics
- Asynchronous Memory Prefetching
- Profiling and Diagnosing GPU Bottlenecks
- Nsight Systems and Compute Analysis
- Tuning Occupancy
- Improving Warp Execution Efficiency
- Exposing Instruction-Level Parallelism
- Multi-Level Micro-Tiling
- Kernel Fusion
- Mixed Precision and Tensor Cores
- Using CUTLASS for Optimal Performance
- Inline PTX and SASS Tuning
- Intra-Kernel Pipelining Techniques
- Warp-Specialized Producer-Consumer Model
- Persistent Kernels and Megakernels
- Thread Block Clusters and Distributed Shared Memory
- Cooperative Groups
- Using Streams to Overlap Compute with Data Transfers
- Stream-Ordered Memory Allocator
- Fine-Grained Synchronization with Events
- Zero-Overhead Launch with CUDA Graphs
- Dynamic Scheduling with Atomic Work Queues
- Batch Repeated Kernel Launches with CUDA Graphs
- Dynamic Parallelism
- Orchestrate Across Multiple GPUs with NVSHMEM
- NVTX Markers and Profiling Tools
- PyTorch Compiler (torch.compile)
- Profiling and Tuning Memory in PyTorch
- Scaling with PyTorch Distributed
- Multi-GPU Profiling with HTA
- PyTorch Compiler Deep Dive
- Writing Custom Kernels with OpenAI Triton
- PyTorch XLA Backend
- Advanced Triton Kernel Implementations
- Disaggregated Prefill and Decode Architecture
- Parallelism Strategies for MoE Models
- Speculative and Parallel Decoding Techniques
- Dynamic Routing Strategies
- Workflow for Profiling and Tuning Performance
- Dynamic Request Batching and Scheduling
- Systems-Level Optimizations
- Quantization Approaches for Real-Time Inference
- Application-Level Optimizations
- Prefill-Decode Disaggregation Benefits
- Prefill Workers Design
- Decode Workers Design
- Disaggregated Routing and Scheduling Policies
- Scalability Considerations
- Optimized Decode Kernels (FlashMLA, ThunderMLA, FlexDecoding)
- Tuning KV Cache Utilization and Management
- Heterogeneous Hardware and Parallelism Strategies
- SLO-Aware Request Management
- Adaptive Parallelism Strategies
- Dynamic Precision Changes
- Kernel Auto-Tuning
- Reinforcement Learning Agents for Runtime Tuning
- Adaptive Batching and Scheduling
- AlphaTensor AI-Discovered Algorithms
- Automated GPU Kernel Optimizations
- Self-Improving AI Agents
- Scaling Toward Multi-Million GPU Clusters
- code/profiler_scripts/comprehensive_profile.sh - Comprehensive GPU profiling
- code/profiler_scripts/enhanced_profiling.sh - Enhanced profiling with Nsight
- code/profiler_scripts/hta_profile.sh - Holistic Trace Analysis
- tools/comprehensive_profiling.py - Python-based profiling utilities
- tools/compare_nsight/ - Nsight Systems comparison tools
- tools/inference_gpu_cluster_sizing/ - Cluster sizing notebooks
# Comprehensive profiling
nsys profile -t cuda,nvtx,osrt,triton -o timeline_profile python script.py
# Kernel analysis
ncu --metrics achieved_occupancy,warp_execution_efficiency -o kernel_profile python script.py
# HTA for multi-GPU
nsys profile -t cuda,nvtx,osrt,cudnn,cublas,nccl,triton -o hta_profile python script.py
# System analysis
perf record -g -p $(pgrep python) -o perf.data
perf report -i perf.data
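The nvtx ranges traced by the nsys commands above have to be emitted by the application itself. A minimal way to do that from PyTorch is the torch.cuda.nvtx API that ships with PyTorch (the range name here is just an example); run the script under one of the nsys commands above and the named range shows up on the timeline:

```python
import torch

x = torch.randn(4096, 4096, device="cuda")

# Annotate a region so it appears as a named range in the nsys timeline.
torch.cuda.nvtx.range_push("matmul_block")
for _ in range(10):
    y = x @ x
torch.cuda.nvtx.range_pop()

# Make sure the queued GPU work finishes before the process exits.
torch.cuda.synchronize()
```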
- Meetup Group: AI Performance Engineering
- YouTube Channel: AI Performance Engineering
- YouTube Video
- PyTorch Optimizations: Data Loader Pipeline
- Cross-Architecture CUDA and ROCm Kernel Development
We welcome contributions! Please see our Contributing Guide for:
- Code examples and improvements
- Documentation updates
- Performance optimization techniques
- Bug reports and feature requests
This project is licensed under the MIT License - see the LICENSE file for details.
- Book: AI Systems Performance Engineering on Amazon
- Meetup: AI Performance Engineering Meetup Group
- YouTube: AI Performance Engineering Channel
Built with ❤️ in San Francisco for the AI performance engineering community