
[RFC]: Decoupling vLLM Configuration from Hugging Face #24384

@charlotte12l


Motivation.

Currently, vLLM assumes that all models follow the Hugging Face format. Configuration is parsed directly into a transformers.PretrainedConfig instance, which is then embedded into ModelConfig as hf_config. This tight coupling introduces several problems:

  • Poor extensibility: non-HF models (e.g., Mistral-native) cannot be integrated cleanly. Their configuration must first be awkwardly adapted into a PretrainedConfig-like object.

  • Maintenance overhead: many fields in PretrainedConfig are irrelevant to inference, but vLLM does not clearly separate used from unused fields. Users who want to support their own model formats must carefully map them into HF’s schema, and this kind of manual mapping is fragile and error-prone.

  • Under-specified, inconsistently named critical fields (added on Nov 1): fields vLLM relies on at runtime, such as num_kv_heads, num_experts, and max_model_len, are not emphasized or standardized in PretrainedConfig. Different models use different names, forcing ModelConfig to perform bespoke mappings.

  • Missing architecture hints force runtime introspection (added on Nov 1): useful architecture details (e.g., the attention type of each layer, needed for KV-cache initialization) are not guaranteed to be available in PretrainedConfig (perhaps some are now, but what if other hints are needed in the future?). Today we infer them at runtime (e.g., via forward_context), adding complexity and risk. Getting such fields into HF first, and then into every model, slows vLLM development.

This proposal aims to resolve these issues by introducing a clean separation of concerns:

  • Define a standardized, vLLM‑native configuration schema that specifies exactly the fields the engine needs.
  • Provide pluggable parsers that translate external configuration formats (HF, Mistral‑native, GGUF, etc.) into that schema.
  • Simplify code paths: fewer runtime probes (e.g., no per‑layer attention discovery via forward_context) and earlier detection of missing information.

Proposed Change.

1. Unified Configuration Schema

Introduce a new class, tentatively named ModelArchitectureConfig (or a better name lol), that contains only the essential fields required by vLLM for inference:

from typing import List


class ModelArchitectureConfig:
    """Standardized, vLLM-native schema containing only the fields
    the engine needs for inference."""

    architectures: List[str]
    model_type: str
    hidden_size: int
    num_hidden_layers: int
    num_attention_heads: int
    head_dim: int
    vocab_size: int

    def __init__(
        self,
        architectures: List[str],
        model_type: str,
        hidden_size: int,
        num_hidden_layers: int,
        num_attention_heads: int,
        head_dim: int,
        vocab_size: int,
        **kwargs,
    ):
        self.architectures = architectures
        self.model_type = model_type
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.head_dim = head_dim
        self.vocab_size = vocab_size
        # Keep any remaining model-specific fields as-is.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def validate(self) -> None:
        """Check that the required fields are present and consistent."""
        ...
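
A quick construction sketch (hypothetical values), showing that model-specific extras are retained via **kwargs:

cfg = ModelArchitectureConfig(
    architectures=["LlamaForCausalLM"],
    model_type="llama",
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    head_dim=128,
    vocab_size=32000,
    rope_theta=10000.0,  # extra field, kept as-is via **kwargs
)
cfg.validate()
assert cfg.rope_theta == 10000.0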

Current structure:

VLLMConfig
 └── ModelConfig
       └── hf_config: PretrainedConfig

Proposed structure:

VLLMConfig
 └── ModelConfig
       ├── model_architecture_config: ModelArchitectureConfig   # always present
       └── hf_config: PretrainedConfig                          # optional

  • Engine logic consumes only model_architecture_config.
  • hf_config is retained when loading from Hugging Face, but becomes optional.
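
A minimal sketch of the resulting ModelConfig shape (the dataclass form is an assumption; field names follow the tree above):

from dataclasses import dataclass
from typing import Optional

from transformers import PretrainedConfig


@dataclass
class ModelConfig:
    # Always present; engine logic consumes only this.
    model_architecture_config: ModelArchitectureConfig
    # Populated only when the model was loaded from Hugging Face.
    hf_config: Optional[PretrainedConfig] = None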

2. Parsers

With ModelArchitectureConfig, each model format can be supported through a dedicated parser. A parser guarantees that the fields vLLM requires at runtime are correctly extracted, while leaving the remaining model-specific fields untouched.

Similar to #24277, we can do:

from pathlib import Path
from typing import Optional, Union

# ConfigParserBase and register_config_parser are introduced in #24277.

class HFConfigParser(ConfigParserBase):

    def parse(self,
              model: Union[str, Path],
              trust_remote_code: bool,
              revision: Optional[str] = None,
              code_revision: Optional[str] = None,
              **kwargs) -> ModelArchitectureConfig:
        ...


@register_config_parser("custom_config_parser")
class CustomConfigParser(ConfigParserBase):

    def parse(self,
              model: Union[str, Path],
              trust_remote_code: bool,
              revision: Optional[str] = None,
              code_revision: Optional[str] = None,
              **kwargs) -> ModelArchitectureConfig:
        raise NotImplementedError
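
For illustration, a trimmed-down HF parse body might look like the following. This is a sketch only: it reads a local config.json and maps HF field names onto the schema; the real parser would reuse vLLM's existing HF loading paths and honor trust_remote_code and revisions. The helper name is hypothetical.

import json
from pathlib import Path
from typing import Union


def parse_hf_config_json(model: Union[str, Path]) -> ModelArchitectureConfig:
    # Hypothetical helper: read a local config.json and map HF field
    # names onto the unified schema.
    raw = json.loads((Path(model) / "config.json").read_text())
    return ModelArchitectureConfig(
        architectures=raw["architectures"],
        model_type=raw["model_type"],
        hidden_size=raw["hidden_size"],
        num_hidden_layers=raw["num_hidden_layers"],
        num_attention_heads=raw["num_attention_heads"],
        # HF configs do not always set head_dim explicitly.
        head_dim=raw.get("head_dim",
                         raw["hidden_size"] // raw["num_attention_heads"]),
        vocab_size=raw["vocab_size"],
        # Pass everything else through as model-specific extras.
        **{k: v for k, v in raw.items()
           if k not in ModelArchitectureConfig.__annotations__},
    )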

3. Engine surface change

Where the engine previously reached for model_config.hf_config, switch to model_config.model_architecture_config.
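
For example, at a hypothetical call site:

# Before: engine code reads HF-shaped fields directly.
num_layers = model_config.hf_config.num_hidden_layers

# After: engine code consumes only the unified schema.
num_layers = model_config.model_architecture_config.num_hidden_layers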

4. Migration Plan (updated on Nov 1)

A. Add the ModelArchitectureConfig interface, along with HFArchitectureConfigParser and MistralArchitectureConfigParser, to parse trained model params into ModelArchitectureConfig.
B. Update ModelConfig to include model_architecture_config and make hf_config optional.
C. During KV-cache initialization, call model_architecture_config.layer_attention_types to get each layer’s attention type. Also refactor get_kv_cache_spec a bit so it can derive the spec from an uninitialized module (see the sketch after this list).
D. Gradually replace calls to hf_config in the vLLM runtime with model_architecture_config.
E. Gradually replace calls to hf_config in model_executor/models/*.py with model_architecture_config.
F. Keep hf_config but mark it as deprecated, populating it only if the source was a Hugging Face model.
G. Completely remove hf_config.
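
A rough sketch of step C, assuming layer_attention_types is a per-layer list of type tags (the exact representation is still open):

def get_kv_cache_spec(model_config) -> dict[int, str]:
    # Derive each layer's KV-cache spec from the config alone,
    # without needing an initialized module.
    arch = model_config.model_architecture_config
    specs: dict[int, str] = {}
    for layer_idx, attn_type in enumerate(arch.layer_attention_types):
        # e.g. "full_attention" vs. "sliding_window"
        specs[layer_idx] = attn_type
    return specs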

Feedback Period.

1-2 weeks

CC List.

@22quinn @zhuohan123 @yeqcharlotte @houseroad @simon-mo

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
