Created ReplicateKVHeadTransform to integrate KV-heads replication module within Qefficient library. #625

quic-dhirajku · 2025-11-19T06:09:54Z

The transform enables KV-head replication for both CausalLMs and VLMs.
The feature is enabled by passing the n_kv_head_repeat parameter when initializing the QEff wrapper class for the corresponding model.
n_kv_head_repeat is the multiplier applied to the original number of KV heads. The operation also updates the num_key_value_heads parameter in the model's config and hash params and adds a parameter, orig_kv_heads, to them. This allows the same model to be exported with different numbers of KV heads without causing a hash conflict.
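The config bookkeeping described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names replicate_kv_config and params_hash, the dict-based config, and the SHA-256-over-JSON hash are hypothetical and are not the QEfficient implementation. Only n_kv_head_repeat, num_key_value_heads and orig_kv_heads come from this PR; num_attention_heads is assumed to follow the usual Hugging Face config naming.

```python
import hashlib
import json


def replicate_kv_config(config: dict, n_kv_head_repeat: int) -> dict:
    """Return a copy of `config` with the KV-head count multiplied.

    Hypothetical helper: mirrors the bookkeeping described in the PR,
    not the actual transform.
    """
    orig_kv_heads = config["num_key_value_heads"]
    new_kv_heads = orig_kv_heads * n_kv_head_repeat
    # The attention-head count must stay divisible by the KV-head count.
    if config["num_attention_heads"] % new_kv_heads != 0:
        raise ValueError(
            f"{config['num_attention_heads']} attention heads are not "
            f"divisible by {new_kv_heads} KV heads"
        )
    updated = dict(config)  # leave the caller's config untouched
    updated["num_key_value_heads"] = new_kv_heads
    updated["orig_kv_heads"] = orig_kv_heads
    return updated


def params_hash(params: dict) -> str:
    """Illustrative hash over sorted JSON; the real hashing scheme may differ."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


base = {"num_attention_heads": 32, "num_key_value_heads": 8}
doubled = replicate_kv_config(base, 2)
quadrupled = replicate_kv_config(base, 4)
```

Because the updated num_key_value_heads value is part of the hashed parameters in this sketch, the two exports of the same base model hash differently.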
Added tests for both CausalLMs and VLMs that compare the outputs of the PyTorch HF model and the AIC model with this functionality enabled. Two new optional parameters, n_kv_head_repeat and test_kv_replicate, are added for testing purposes. Setting test_kv_replicate to True replicates the KV heads of every model until the number of KV heads equals the number of attention heads. This ensures tests don't fail on models where simply doubling num_key_value_heads would produce a KV-head count that does not divide num_heads.
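The divisibility concern behind test_kv_replicate can be shown with a small arithmetic sketch. The function full_replication_factor is a hypothetical name used for illustration, and the head counts are example numbers rather than values taken from the tests in this PR.

```python
def full_replication_factor(num_attention_heads: int, num_key_value_heads: int) -> int:
    """Repeat factor that makes the KV-head count equal the attention-head count.

    Hypothetical helper illustrating the test_kv_replicate behaviour.
    """
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("attention heads must be a multiple of KV heads")
    return num_attention_heads // num_key_value_heads


# Example numbers only: 28 attention heads with 4 KV heads.
num_heads, kv_heads = 28, 4

# Doubling gives 8 KV heads, which does not divide 28.
doubling_is_valid = num_heads % (kv_heads * 2) == 0

# Replicating up to the attention-head count is valid whenever
# the original grouped-query layout was valid.
factor = full_replication_factor(num_heads, kv_heads)
replicated_kv_heads = kv_heads * factor
```

Here a fixed repeat of 2 would fail the divisibility check, while the derived factor always lands on a KV-head count equal to the attention-head count.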

Signed-off-by: Dhiraj Kumar Sah <[email protected]>

quic-rishinr · 2025-11-20T08:55:13Z

@ochougul please review

quic-dhirajku requested review from ochougul, quic-amitraj, quic-hemagnih and quic-rishinr as code owners November 19, 2025 06:09

quic-dhirajku force-pushed the replicate_kv_heads_transform branch from 1dfdea6 to 3c90390 Compare November 19, 2025 06:15
