HF VRAM Calculator

Python 3.8+ License: MIT uv

A professional Python CLI tool for estimating GPU memory requirements for Hugging Face models with different data types and parallelization strategies.

⚡ Latest Features: Smart dtype detection, 12 quantization formats, 20+ GPU models, professional Rich UI

Quick Demo

# Install and run
pip install hf-vram-calc
hf-vram-calc microsoft/DialoGPT-medium

# Output: Beautiful tables showing 0.9GB inference, GPU compatibility, parallelization strategies

Features

  • ๐Ÿ” Automatic Model Analysis: Fetch configurations from Hugging Face Hub automatically
  • ๐Ÿง  Smart Data Type Detection: Intelligent dtype recommendation from model names, config, or defaults
  • ๐Ÿ“Š Comprehensive Data Type Support: fp32, fp16, bf16, fp8, int8, int4, mxfp4, nvfp4, awq_int4, gptq_int4, nf4, fp4
  • ๐ŸŽฏ Multi-Scenario Memory Estimation:
    • Inference: Model weights + KV cache overhead (ร—1.2 factor)
    • Training: Full Adam optimizer states (ร—4ร—1.3 factors)
    • LoRA Fine-tuning: Low-rank adaptation with trainable parameter overhead
  • โšก Advanced Parallelization Analysis:
    • Tensor Parallelism (TP): 1, 2, 4, 8
    • Pipeline Parallelism (PP): 1, 2, 4, 8
    • Expert Parallelism (EP) for MoE models
    • Data Parallelism (DP): 2, 4, 8
    • Combined strategies (TP + PP combinations)
  • ๐ŸŽฎ GPU Compatibility Matrix:
    • 20+ GPU models (RTX 4090, A100, H100, L40S, etc.)
    • Automatic compatibility checking for inference/training/LoRA
    • Minimum GPU memory requirement calculations
  • ๐Ÿ“ˆ Professional Rich UI:
    • ๐ŸŽจ Beautiful color-coded tables and panels
    • ๐Ÿ“Š Real-time progress indicators
    • ๐Ÿš€ Modern CLI interface with emoji icons
    • ๐Ÿ’ก Smart recommendations and warnings
  • ๐Ÿ”ง Flexible Configuration:
    • Customizable LoRA rank, batch size, sequence length
    • External JSON configuration files
    • User-defined GPU models and data types
  • ๐Ÿ“‹ Parameter Display: Raw count + human-readable format (e.g., "405,016,576 (405.0M)")

Installation

Quick Install (from PyPI)

pip install hf-vram-calc

Build from Source

# Clone the repository
git clone <repository-url>
cd hf-vram-calc

# Build with uv (recommended)
uv build
uv pip install dist/hf_vram_calc-1.0.0-py3-none-any.whl

# Or install directly
uv pip install .

Dependencies: requests (HTTP), rich (beautiful CLI), Python ≥3.8

For detailed build instructions, see: BUILD.md

Usage

Basic Usage - Smart Dtype Detection

# Automatic dtype recommendation from model config/name
hf-vram-calc microsoft/DialoGPT-medium

# Model name contains dtype - automatically detects fp16
hf-vram-calc nvidia/DeepSeek-R1-0528-FP4

Specify Data Type Override

# Override with specific data type
hf-vram-calc meta-llama/Llama-2-7b-hf --dtype bf16
hf-vram-calc mistralai/Mistral-7B-v0.1 --dtype nvfp4

Advanced Configuration

# Custom batch size and sequence length
hf-vram-calc mistralai/Mistral-7B-v0.1 --batch-size 4 --sequence-length 4096

# Custom LoRA rank for fine-tuning estimation  
hf-vram-calc microsoft/DialoGPT-medium --lora-rank 128

# Detailed analysis (enabled by default)
hf-vram-calc meta-llama/Llama-2-7b-hf --show-detailed

System Information

# List all available data types and GPU models
hf-vram-calc --list-types

# Use custom configuration directory
hf-vram-calc --config-dir ./my_config microsoft/DialoGPT-medium

# Show help
hf-vram-calc --help

Command Line Arguments

Required

  • model_name: Hugging Face model name (e.g., microsoft/DialoGPT-medium)

Data Type Control

  • --dtype {fp32,fp16,bf16,fp8,int8,int4,mxfp4,nvfp4,awq_int4,fp4,nf4,gptq_int4}: Override automatic dtype detection
  • --list-types: List all available data types and GPU models

Memory Estimation Parameters

  • --batch-size BATCH_SIZE: Batch size for activation estimation (default: 1)
  • --sequence-length SEQUENCE_LENGTH: Sequence length for memory calculation (default: 2048)
  • --lora-rank LORA_RANK: LoRA rank for fine-tuning estimation (default: 64)

Display & Configuration

  • --show-detailed: Show detailed parallelization and GPU compatibility (default: enabled)
  • --config-dir CONFIG_DIR: Custom configuration directory path
  • --help: Show complete help message with examples

Smart Behavior

  • No --dtype: Uses intelligent priority (model name → config → fp16 default)
  • With --dtype: Overrides automatic detection with specified type
  • Invalid model: Graceful error handling with helpful suggestions

Quick Start Examples

# Estimate memory for different models
hf-vram-calc microsoft/DialoGPT-medium              # → 0.9GB inference (FP16)
hf-vram-calc meta-llama/Llama-2-7b-hf              # → ~13GB inference
hf-vram-calc nvidia/DeepSeek-R1-0528-FP4           # → Auto-detects FP4 from name

# Compare different quantization methods
hf-vram-calc meta-llama/Llama-2-7b-hf --dtype fp16     # → ~13GB
hf-vram-calc meta-llama/Llama-2-7b-hf --dtype int4     # → ~3.5GB
hf-vram-calc meta-llama/Llama-2-7b-hf --dtype awq_int4 # → ~3.5GB

# Find optimal parallelization strategy
hf-vram-calc mistralai/Mistral-7B-v0.1 --show-detailed  # → TP/PP recommendations

# Check what's available
hf-vram-calc --list-types                               # → All types & GPUs

Data Type Priority & Detection

Automatic Data Type Recommendation

The tool uses intelligent priority-based dtype selection:

  1. Model Name Detection (Highest Priority)
    • model-fp16, model-bf16 → extracts dtype from the model name
    • model-4bit, model-gptq, model-awq → detects the quantization format
  2. Config torch_dtype (Medium Priority)
    • Reads torch_dtype from the model's config.json
    • Maps torch.float16 → fp16, torch.bfloat16 → bf16, etc.
  3. Default Fallback (Lowest Priority)
    • Defaults to fp16 when no dtype is detected
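
A minimal Python sketch of this priority chain (illustrative only; the function name and pattern tables below are assumptions, not the tool's actual internals):

# Sketch of the dtype priority chain. Pattern table follows the detection
# patterns listed below; order matters, so more specific names come first.
NAME_PATTERNS = {
    "bf16": ("bf16", "bfloat16"),
    "fp16": ("fp16", "float16", "half"),
    "awq_int4": ("awq",),
    "gptq_int4": ("gptq",),
    "nvfp4": ("nvfp4",),
    "mxfp4": ("mxfp4",),
    "int4": ("int4", "4bit"),
    "fp4": ("fp4",),
}
CONFIG_MAP = {"float16": "fp16", "bfloat16": "bf16", "float32": "fp32"}

def recommend_dtype(model_name: str, config: dict) -> str:
    name = model_name.lower()
    for dtype, patterns in NAME_PATTERNS.items():    # 1. model name (highest)
        if any(p in name for p in patterns):
            return dtype
    torch_dtype = str(config.get("torch_dtype", "")).replace("torch.", "")
    if torch_dtype in CONFIG_MAP:                    # 2. config torch_dtype
        return CONFIG_MAP[torch_dtype]
    return "fp16"                                    # 3. default fallback

print(recommend_dtype("nvidia/DeepSeek-R1-0528-FP4", {}))  # -> fp4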

Supported Data Types

Data Type   Bytes/Param   Description               Detection Patterns
fp32        4.0           32-bit floating point     fp32, float32
fp16        2.0           16-bit floating point     fp16, float16, half
bf16        2.0           Brain Float 16            bf16, bfloat16
fp8         1.0           8-bit floating point      fp8, float8
int8        1.0           8-bit integer             int8, 8bit
int4        0.5           4-bit integer             int4, 4bit
mxfp4       0.5           Microscaling FP4          mxfp4
nvfp4       0.5           NVIDIA FP4                nvfp4
awq_int4    0.5           AWQ 4-bit quantization    awq, awq-int4
gptq_int4   0.5           GPTQ 4-bit quantization   gptq, gptq-int4
nf4         0.5           4-bit NormalFloat         nf4, bnb-4bit
fp4         0.5           4-bit floating point      fp4
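
Weight size follows directly from this table as parameter count × bytes/param. For example, reproducing the 0.75 GB total shown later for DialoGPT-medium at fp16 (the figure matches a 1024³ divisor):

params = 405_016_576                  # DialoGPT-medium (from the example output)
weights_gb = params * 2.0 / 1024**3   # fp16: 2.0 bytes per parameter
print(f"{weights_gb:.2f}")            # -> 0.75, the "Total Size (GB)" figure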

Parallelization Strategies

Tensor Parallelism (TP)

Splits model weights by tensor dimensions across multiple GPUs.

Pipeline Parallelism (PP)

Distributes different model layers to different GPUs.

Expert Parallelism (EP)

For MoE (Mixture of Experts) models, distributes expert networks to different GPUs.

Data Parallelism (DP)

Each GPU holds a complete copy of the model; only the input data is split across GPUs.
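
To a first approximation (inferred from the example table below), TP and PP divide the weight memory while DP leaves per-GPU memory unchanged. A simplified sketch that ignores communication buffers and uneven layer splits:

def memory_per_gpu(single_gpu_gb: float, tp: int = 1, pp: int = 1, dp: int = 1) -> float:
    # TP and PP shard the model across GPUs; DP replicates it on each GPU,
    # so dp does not appear in the denominator.
    return single_gpu_gb / (tp * pp)

print(memory_per_gpu(0.91, tp=2))         # -> 0.455 ("Tensor Parallel" row)
print(memory_per_gpu(0.91, tp=4, pp=4))   # -> 0.057 ("TP + PP" row)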

Example Output

Smart Dtype Detection Example

$ hf-vram-calc microsoft/DialoGPT-medium
Using recommended data type: FP16
Use --dtype to specify different type, or see --list-types for all options
  ๐Ÿ” Fetching configuration for microsoft/DialoGPT-medium...
  ๐Ÿ“‹ Parsing model configuration...                         
  ๐Ÿงฎ Calculating model parameters...                        
  ๐Ÿ’พ Computing memory requirements...                       

                          ╭─────── 🤖 Model Information ───────╮
                          │                                    │
                          │  Model: microsoft/DialoGPT-medium  │
                          │  Architecture: gpt2                │
                          │  Parameters: 405,016,576 (405.0M)  │
                          │  Recommended dtype: FP16           │
                          │                                    │
                          ╰────────────────────────────────────╯

        💾 Memory Requirements by Data Type and Scenario
╭──────────────┬──────────────┬──────────────┬─────────────────┬──────────────╮
│              │   Total Size │    Inference │        Training │         LoRA │
│  Data Type   │         (GB) │         (GB) │     (Adam) (GB) │         (GB) │
├──────────────┼──────────────┼──────────────┼─────────────────┼──────────────┤
│     FP16     │         0.75 │         0.91 │            3.92 │         0.94 │
╰──────────────┴──────────────┴──────────────┴─────────────────┴──────────────╯

          ⚡ Parallelization Strategies (FP16 Inference)
╔════════════════════╤══════╤══════╤══════╤══════╤══════════════╤══════════════╗
║                    │      │      │      │      │   Memory/GPU │   Min GPU    ║
║ Strategy           │  TP  │  PP  │  EP  │  DP  │         (GB) │   Required   ║
╟────────────────────┼──────┼──────┼──────┼──────┼──────────────┼──────────────╢
║ Single GPU         │  1   │  1   │  1   │  1   │         0.91 │     4GB+     ║
║ Tensor Parallel    │  2   │  1   │  1   │  1   │         0.45 │     4GB+     ║
║ TP + PP            │  4   │  4   │  1   │  1   │         0.06 │     4GB+     ║
╚════════════════════╧══════╧══════╧══════╧══════╧══════════════╧══════════════╝

                  🎮 GPU Compatibility Matrix
┏━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━┓
┃ GPU Type        │   Memory   │  Inference   │   Training   │     LoRA     ┃
┠─────────────────┼────────────┼──────────────┼──────────────┼──────────────┨
┃ RTX 4090        │    24GB    │      ✓       │      ✓       │      ✓       ┃
┃ A100 80GB       │    80GB    │      ✓       │      ✓       │      ✓       ┃
┃ H100 80GB       │    80GB    │      ✓       │      ✓       │      ✓       ┃
┗━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━┛

╭─── 📋 Minimum GPU Requirements ───╮
│                                   │
│  Single GPU Inference: 0.9GB      │
│  Single GPU Training: 3.9GB       │
│  Single GPU LoRA: 0.9GB           │
│                                   │
╰───────────────────────────────────╯

Large Model with User Override

$ hf-vram-calc nvidia/DeepSeek-R1-0528-FP4 --dtype nvfp4

                          ╭──────── 🤖 Model Information ────────╮
                          │                                      │
                          │  Model: nvidia/DeepSeek-R1-0528-FP4  │
                          │  Architecture: deepseek_v3           │
                          │  Parameters: 30,510,606,336 (30.5B)  │
                          │  Original torch_dtype: bfloat16      │
                          │  User specified dtype: NVFP4         │
                          │                                      │
                          ╰──────────────────────────────────────╯

        💾 Memory Requirements by Data Type and Scenario
╭──────────────┬──────────────┬──────────────┬─────────────────┬──────────────╮
│              │   Total Size │    Inference │        Training │         LoRA │
│  Data Type   │         (GB) │         (GB) │     (Adam) (GB) │         (GB) │
├──────────────┼──────────────┼──────────────┼─────────────────┼──────────────┤
│    NVFP4     │        14.21 │        17.05 │           73.88 │        19.34 │
╰──────────────┴──────────────┴──────────────┴─────────────────┴──────────────╯

List Available Types

$ hf-vram-calc --list-types
Available Data Types:
╭───────────┬─────────────┬────────────────────────╮
│ Data Type │ Bytes/Param │ Description            │
├───────────┼─────────────┼────────────────────────┤
│ FP32      │           4 │ 32-bit floating point  │
│ FP16      │           2 │ 16-bit floating point  │
│ BF16      │           2 │ Brain Float 16         │
│ NVFP4     │         0.5 │ NVIDIA FP4             │
│ AWQ_INT4  │         0.5 │ AWQ 4-bit quantization │
│ GPTQ_INT4 │         0.5 │ GPTQ 4-bit quantization│
╰───────────┴─────────────┴────────────────────────╯

Available GPU Types:
╭───────────────────┬─────────────┬────────────┬──────────────╮
│ GPU Name          │ Memory (GB) │ Category   │ Architecture │
├───────────────────┼─────────────┼────────────┼──────────────┤
│ RTX 4090          │          24 │ consumer   │ Ada Lovelace │
│ A100 80GB         │          80 │ datacenter │ Ampere       │
│ H100 80GB         │          80 │ datacenter │ Hopper       │
╰───────────────────┴─────────────┴────────────┴──────────────╯

Calculation Formulas

Inference Memory

Inference Memory = Model Weights × 1.2

Includes model weights and KV cache overhead.

Training Memory (with Adam)

Training Memory = Model Weights × 4 × 1.3
  • 4× factor: Model weights (1×) + Gradients (1×) + Adam optimizer states (2×)
  • 1.3× factor: 30% additional overhead (activation caching, etc.)

LoRA Fine-tuning Memory

LoRA Memory = (Model Weights + LoRA Parameter Overhead) × 1.2

The LoRA parameter overhead is calculated from the rank and the target-module ratio.
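
Putting the three formulas together for the FP16 DialoGPT-medium example from above (the LoRA overhead value here is a placeholder; the tool derives it from the rank and target modules):

GIB = 1024**3
weights = 405_016_576 * 2.0 / GIB        # fp16 weights: ~0.75 GB

inference = weights * 1.2                # -> ~0.91 GB
training = weights * 4 * 1.3             # -> ~3.92 GB
lora_overhead = 0.03                     # placeholder; depends on rank/modules
lora = (weights + lora_overhead) * 1.2   # -> ~0.94 GB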

Advanced Features

Configuration System

External JSON configuration files for maximum flexibility:

  • data_types.json - Add custom quantization formats
  • gpu_types.json - Define new GPU models and specifications
  • display_settings.json - Customize UI appearance and limits

# Use custom config directory
hf-vram-calc --config-dir ./custom_config model_name

# Add custom data type example (data_types.json)
{
  "my_custom_int2": {
    "bytes_per_param": 0.25,
    "description": "Custom 2-bit quantization"
  }
}
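
Custom GPUs can presumably be added the same way; a hypothetical gpu_types.json entry, with field names assumed from the --list-types output (name, memory, category, architecture), not taken from the actual schema:

# Add custom GPU example (gpu_types.json) - field names are assumptions
{
  "RTX 5090": {
    "memory_gb": 32,
    "category": "consumer",
    "architecture": "Blackwell"
  }
}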

Memory Calculation Details

Scenario    Formula                           Explanation
Inference   Model × 1.2                       Includes KV cache and activation overhead
Training    Model × 4 × 1.3                   Weights (1×) + Gradients (1×) + Adam (2×) + 30% overhead
LoRA        (Model + LoRA_params × 4) × 1.2   Base model + trainable parameters with optimizer

Parallelization Efficiency

  • TP (Tensor Parallel): Near-linear scaling, slight communication overhead
  • PP (Pipeline Parallel): Good efficiency, pipeline bubble ~10-15%
  • EP (Expert Parallel): MoE-specific, depends on expert routing efficiency
  • DP (Data Parallel): No memory reduction per GPU, full model replica

Supported Architectures

Fully Supported ✅

  • GPT Family: GPT-2, GPT-3, GPT-4, GPT-NeoX, etc.
  • LLaMA Family: LLaMA, LLaMA-2, Code Llama, Vicuna, etc.
  • Mistral Family: Mistral 7B, Mixtral 8x7B (MoE), etc.
  • Other Transformers: BERT, RoBERTa, T5, FLAN-T5, etc.
  • New Architectures: DeepSeek, Qwen, ChatGLM, Baichuan, etc.

Architecture Detection

  • Automatic field mapping for different config.json formats
  • Fallback support for uncommon architectures
  • MoE handling for Mixture-of-Experts models
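
A sketch of what such field mapping can look like (the alias lists are examples of common config.json spellings across architectures, not the tool's actual table):

# hypothetical alias table for normalizing config.json fields
FIELD_ALIASES = {
    "hidden_size": ("hidden_size", "n_embd", "d_model"),
    "num_layers": ("num_hidden_layers", "n_layer", "num_layers"),
    "num_heads": ("num_attention_heads", "n_head"),
    "vocab_size": ("vocab_size",),
}

def read_field(config: dict, canonical: str, default=None):
    # return the first alias present in the config, else the default
    for key in FIELD_ALIASES[canonical]:
        if key in config:
            return config[key]
    return default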

Accuracy & Limitations

✅ Highly Accurate For:

  • Parameter counting (exact calculation)
  • Memory estimation (within 5-10% of actual)
  • Parallelization ratios (theoretical maximum)

โš ๏ธ Considerations:

  • Activation memory varies with sequence length and optimization
  • Real-world efficiency may differ due to framework overhead
  • Quantization accuracy depends on specific implementation
  • MoE models require expert routing consideration

Build & Development

Built with modern Python tooling:

  • uv: Fast Python package management and building
  • Rich: Professional terminal interface
  • Requests: HTTP client for model config fetching
  • JSON configuration: Flexible external configuration system

For development setup, see: BUILD.md

Contributing

We welcome contributions! Areas for improvement:

  • 🔧 New quantization formats (add to data_types.json)
  • 🎮 GPU models (update gpu_types.json)
  • 📊 Architecture support (enhance config parsing)
  • 🚀 Performance optimizations
  • 📚 Documentation improvements
  • 🧪 Test coverage expansion

See Also

  • 📚 BUILD.md - Complete build and installation guide
  • ⚙️ CONFIG_GUIDE.md - Configuration customization details
  • 📝 Examples in help: hf-vram-calc --help for usage examples

Version History

  • v1.0.0: Complete rewrite with uv build, smart dtype detection, professional UI
  • v0.x: Legacy single-file version (deprecated)

License

MIT License - see LICENSE file for details.


Made with โค๏ธ for the ML community | Built with uv and Rich
