Add Qwen3 model support #423
Conversation
Implements support for the Qwen3 model family, including Qwen3-4B-Instruct. Key features:
- QK normalization for improved training stability
- Grouped Query Attention (32 query heads, 8 KV heads)
- High RoPE theta (5M) for extended context (262K tokens)
- Support for causal language modeling and sequence classification
- Complete parameter mapping for HuggingFace model loading
- Example scripts demonstrating text generation and chat usage

Tested with Qwen3-4B-Instruct-2507; generates coherent English output.
I will test it tomorrow with my H200 to be sure that everything is working. On my MacBook the answers seem OK, but generation is slow.
Implements a last token pooling strategy in text_embedding to support Qwen3-Embedding models, which use the last token's hidden state for generating text embeddings.
- Add :last_token_pooling option to text_embedding
- Extract the last non-padding token using attention_mask
- Add a Qwen3-Embedding-0.6B example demonstrating:
  - Text embedding generation (1024-dim vectors)
  - Semantic similarity computation
  - Instruction-aware embeddings
  - Batch processing

Tested with Qwen3-Embedding-0.6B; produces correct similarity scores.
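The pooling step described in this commit can be sketched with plain Nx. The tensor shapes follow the usual Bumblebee conventions, but the module and function names here are illustrative, not the actual implementation:

```elixir
defmodule LastTokenPoolingSketch do
  # hidden_state: {batch, seq_len, hidden}
  # attention_mask: {batch, seq_len}, 1 for real tokens, 0 for padding
  def pool(hidden_state, attention_mask) do
    # index of the last non-padding token in each sequence
    last_indices =
      attention_mask
      |> Nx.sum(axes: [1])
      |> Nx.subtract(1)
      |> Nx.max(0)

    batch = Nx.axis_size(hidden_state, 0)
    hidden = Nx.axis_size(hidden_state, 2)

    # gather the hidden state at that index for each batch element
    indices =
      last_indices
      |> Nx.reshape({batch, 1, 1})
      |> Nx.broadcast({batch, 1, hidden})

    hidden_state
    |> Nx.take_along_axis(indices, axis: 1)
    |> Nx.squeeze(axes: [1])
  end
end
```

With left- or right-padded batches this always returns one `{batch, hidden}` tensor of last-token states, which is then L2-normalized for similarity computation.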
Implements the :for_embedding architecture for Qwen3 models with last token pooling, enabling direct use with Bumblebee.Text.text_embedding/3.

Changes:
- Add :for_embedding architecture to the Qwen3 model
- Register Qwen3ForEmbedding in model mappings
- Add instruction prompts example showing Qwen team recommendations
- Update examples to use the cleaner serving-based API
- Add .lexical/ to gitignore
- Clean up mix.exs dependencies (remove emlx, nx override)

Examples demonstrate:
- Basic embedding generation (1024-dim vectors)
- Semantic similarity computation
- Instruction-aware prompts (1-5% performance improvement)
- Custom task instructions for code search
- Multilingual embedding support

Tested with Qwen3-Embedding-0.6B; generates correct similarity scores.
Implements document reranking using Qwen3-Reranker models. Rerankers score query-document pairs for relevance, improving retrieval quality in RAG and search applications.

Features:
- Automatic yes/no token detection from the tokenizer
- Proper input format with instruction, query, and document
- Softmax-based relevance scoring (0-1 range)
- Support for custom task instructions

Example demonstrates:
- Basic query-document scoring
- Custom instructions for code search
- Reranking search results (top-k selection)

Results show correct ranking:
- Relevant docs score 0.99+
- Irrelevant docs score near 0.0
- Custom instructions work for domain-specific tasks

Works with Qwen3-Reranker-0.6B/4B/8B models.
Move all Qwen3-related examples and documentation into examples/qwen3/ for better organization and discoverability.

Changes:
- Create the examples/qwen3/ directory
- Move qwen3.exs, qwen3_embedding.exs, qwen3_embedding_prompts.exs, qwen3_reranker.exs
- Move QWEN3_IEX_GUIDE.md to examples/qwen3/
- Update examples/README.md to reference the qwen3/ subdirectory

All examples are now accessible under examples/qwen3/ with a consistent structure.
I was interested in getting a Qwen3 vision model working, like https://huggingface.co/huihui-ai/Huihui-MiniCPM-V-4_5-abliterated
lib/bumblebee.ex
Outdated
"mbart" => :mbart, | ||
"phi" => :code_gen, | ||
"phi3" => :llama, | ||
"qwen3" => :gpt2, |
Looking at https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/tokenizer_config.json, it says "tokenizer_class": "Qwen2Tokenizer", so we should add a :qwen2 tokenizer type. In practice we just need to add it to @tokenizer_types. For the default special tokens see https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/tokenization_qwen2_fast.py#L68-L76.
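Concretely, the suggestion amounts to an entry along these lines in the @tokenizer_types map in lib/bumblebee.ex. The special tokens below are the defaults from the linked Qwen2 fast tokenizer; treat the exact map shape as a sketch, since the real format may differ:

```elixir
@tokenizer_types %{
  # ... existing entries ...
  "qwen2" => %{
    special_tokens: %{
      unk: "<|endoftext|>",
      eos: "<|endoftext|>",
      pad: "<|endoftext|>"
    }
  }
}
```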
Ok cool. Fun fact: Claude initially added it and then changed it to :gpt2...
lib/bumblebee/text/qwen3.ex
Outdated
# QK normalization (Qwen3-specific) - normalize over head_dim
query =
  if spec.use_qk_norm do
    Layers.rms_norm(query,
      name: join(name, "query_norm"),
      epsilon: spec.layer_norm_epsilon,
      channel_index: -1
    )
  else
    query
  end

key =
  if spec.use_qk_norm do
    Layers.rms_norm(key,
      name: join(name, "key_norm"),
      epsilon: spec.layer_norm_epsilon,
      channel_index: -1
    )
  else
    key
  end
This is the only divergence from the usual logic, right? Instead of rewriting all of the implementation here, you can add a new option to Layers.Transformer.blocks. I would add :query_norm and :key_norm, both being 2-arity functions. There is already a :layer_norm option kinda similar to that (and we already have QKV-specific options: :query_use_bias, :key_use_bias, :value_use_bias).
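A sketch of what the caller side could look like from the Qwen3 model, passing the proposed 2-arity functions. The option names follow the suggestion above; the full Layers.Transformer.blocks option list is elided and the exact signature may differ:

```elixir
Layers.Transformer.blocks(hidden_state,
  num_blocks: spec.num_blocks,
  num_attention_heads: spec.num_attention_heads,
  # proposed options: applied to the split query/key heads,
  # after head splitting and before rotary embedding
  query_norm: fn query, name ->
    Layers.rms_norm(query, name: join(name, "query_norm"), epsilon: spec.layer_norm_epsilon)
  end,
  key_norm: fn key, name ->
    Layers.rms_norm(key, name: join(name, "key_norm"), epsilon: spec.layer_norm_epsilon)
  end,
  name: join(name, "blocks")
)
```

This keeps the Qwen3-specific behavior in the model spec while the generic block implementation stays shared across models.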
I would skip these examples, since they are not as easy to find. We could instead add a section to https://github.com/elixir-nx/bumblebee/blob/main/notebooks/llms.livemd#mistral, or if it's more elaborate, perhaps a separate Qwen3 notebook.
Awesome feedback! I will work on it later today (EST).
- Remove .lexical/ from the project gitignore (should be in a global gitignore)
- Add :qwen2 tokenizer type with correct Qwen3 special tokens
- Refactor QK normalization to use a generalized approach:
  - Add :query_norm and :key_norm options to Layers.Transformer
  - Apply normalization after head splitting, before rotary embedding
  - Update Qwen3 to use Layers.Transformer.blocks instead of a custom implementation
  - Remove ~200 lines of custom decoder/attention code
- Remove the standalone examples directory per review feedback

The generalized QK normalization approach makes the transformer layer more flexible and maintainable, allowing other models to use similar patterns.
Use 'decoder.blocks' as the name prefix when calling Layers.Transformer.blocks to match the expected params mapping pattern decoder.blocks.{n}.*. This aligns with how other models like BERT use the transformer blocks.
Fix model_type_to_tokenizer_type mapping to use :qwen2 instead of :gpt2 for qwen3 models. This ensures Qwen3 models load with the correct tokenizer configuration including proper special tokens.
Create notebooks/qwen3.livemd demonstrating:
- Text generation using Qwen3-4B-Instruct-2507
- Embeddings using Qwen3-Embedding-0.6B with similarity examples
- Reranking using Qwen3-Reranker-0.6B with query-document scoring

This replaces the deleted standalone examples with a consolidated, easy-to-follow notebook format, as suggested in the PR review.
Update the embeddings section to use the proper instruction format: 'Instruct: Given a query, retrieve relevant documents\nQuery: {query}\n{text}' This ensures consistency with the reranker example and follows Qwen3 embedding best practices for better semantic search results.
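The instruction format from this commit is plain string interpolation; a minimal helper (the function name is hypothetical, the task wording is the example from the commit):

```elixir
build_embedding_input = fn task, query, text ->
  "Instruct: #{task}\nQuery: #{query}\n#{text}"
end

build_embedding_input.(
  "Given a query, retrieve relevant documents",
  "What is the capital of France?",
  "Paris is the capital and largest city of France."
)
```

Per the Qwen team's recommendation, the instruction prefix is applied to queries, while documents are typically embedded as-is.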
Add a comprehensive test suite for Qwen3 using tiny-random/qwen3:
- Test the :base architecture with QK normalization enabled
- Test :for_causal_language_modeling with logits verification
- Test :for_sequence_classification (shape only, random params)
- Test the :for_embedding architecture

Reference values generated from tiny-random/qwen3 model predictions. All tests pass successfully (4 tests, 0 failures).
Generation looking good!

iex(16)> prompt = """
...(16)> <|im_start|>system
...(16)> You are a helpful assistant.<|im_end|>
...(16)> <|im_start|>user
...(16)> What is the capital of France?<|im_end|>
...(16)> <|im_start|>assistant
...(16)> """
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"
iex(17)>
nil
iex(18)> result = Nx.Serving.run(serving, prompt)
%{
results: [
%{
text: "The capital of France is Paris.",
token_summary: %{input: 26, output: 8, padding: 0}
}
]
}
Still more tests to do and write!
Add Qwen3 Model Family Support
Summary
This PR adds comprehensive support for the Qwen3 model family from Alibaba Cloud, including text generation,
embeddings, and reranking models. Qwen3 is a state-of-the-art multilingual language model with advanced features like
QK normalization and support for up to 262K context length.
What's New
Architectures:
- :base
- :for_causal_language_modeling
- :for_sequence_classification
- :for_embedding

Key Features:
- QK normalization (a Qwen3 innovation) for improved training stability
- Grouped Query Attention (32 query heads, 8 KV heads)
- High RoPE theta (5M) for extended context (up to 262K tokens)
- Last token pooling for embedding and reranking workflows
Files Changed
Core Implementation:
- lib/bumblebee/text/qwen3.ex (model implementation)
- lib/bumblebee.ex (model and tokenizer type mappings)
- lib/bumblebee/layers/transformer.ex (:query_norm/:key_norm options)

Examples and Documentation:
- notebooks/qwen3.livemd (text generation, embeddings, reranking)
Testing
Text Generation (Qwen3-4B-Instruct)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-4B-Instruct-2507"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-4B-Instruct-2507"})
{:ok, config} = Bumblebee.load_generation_config({:hf, "Qwen/Qwen3-4B-Instruct-2507"})
serving = Bumblebee.Text.generation(model, tokenizer, config)
Nx.Serving.run(serving, "The future of AI")
Results: Generates coherent English text, answers questions correctly, creates stories and code.
Text Embeddings (Qwen3-Embedding-0.6B)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Embedding-0.6B"},
architecture: :for_embedding
)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})
serving = Bumblebee.Text.text_embedding(model, tokenizer,
output_attribute: :embedding,
embedding_processor: :l2_norm
)
e1 = Nx.Serving.run(serving, "The cat sat on the mat")
e2 = Nx.Serving.run(serving, "A feline rested on the rug")
Nx.dot(e1.embedding, e2.embedding) |> Nx.to_number() # 0.73 (similar)
Results: Produces 1024-dimensional embeddings with correct similarity scores for paraphrase pairs.
Reranking (Qwen3-Reranker-0.6B)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Reranker-0.6B"})
# Score query-document relevance:
# relevant documents score 0.99+, irrelevant documents near 0.0
Results: Correctly ranks documents by relevance to queries.
Compatible Models
Text Generation: Qwen3-4B-Instruct-2507 (tested)
Embeddings: Qwen3-Embedding-0.6B (tested)
Reranking: Qwen3-Reranker-0.6B/4B/8B (0.6B tested)
Technical Implementation
QK Normalization
Unlike standard transformers, Qwen3 applies RMS normalization to query and key states:
hidden -> dense -> split_heads -> rms_norm -> rotary -> attention
Architecture Support
QK normalization is implemented via the new :query_norm/:key_norm options on Layers.Transformer.blocks, maintaining compatibility with Bumblebee's transformer patterns.
Embedding Architecture
New :for_embedding architecture automatically pools the last non-padding token for text embedding tasks.
Reranking
Uses the causal LM architecture with yes/no token logit extraction and softmax scoring.
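The scoring step reduces to a two-way softmax over the final-position logits for the "yes" and "no" tokens; a minimal Nx sketch (yes_token_id and no_token_id are looked up from the tokenizer and shown here as placeholder variables):

```elixir
# logits: {batch, seq_len, vocab_size} output of the causal LM
last_position = Nx.axis_size(logits, 1) - 1
last_logits = logits[[0..-1//1, last_position]]

# pick the logits for the "yes" and "no" tokens
yes_no = Nx.take(last_logits, Nx.tensor([yes_token_id, no_token_id]), axis: 1)

# numerically stable two-way softmax; the "yes" probability
# is the relevance score in [0, 1]
exp = Nx.exp(Nx.subtract(yes_no, Nx.reduce_max(yes_no, axes: [1], keep_axes: true)))
scores = Nx.divide(exp, Nx.sum(exp, axes: [1], keep_axes: true))[[0..-1//1, 0]]
```

Because only two logits enter the softmax, relevant documents saturate toward 1.0 and irrelevant ones toward 0.0, matching the ranges reported above.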
Breaking Changes
None. This is purely additive.