MLA by quic-mamta · Pull Request #789 · quic/efficient-transformers

quic-mamta · 2026-02-10T09:51:57Z

Caches the compressed KV.
Also adds online/offline absorption of the MLA K and Q up-projections.
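
For readers unfamiliar with the term, here is a minimal sketch of up-projection absorption. It assumes the usual MLA formulation, in which a key is reconstructed from the cached latent as `k = W_uk @ c_kv`. The attention score `q · k` can then be computed as `(W_uk^T q) · c_kv`, so the up-projection is folded into the query and never applied to the cached context. All names here (`W_uk`, `c_kv`, the helper functions) are illustrative and not taken from this PR; pure Python keeps the example dependency-free.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of matrix m as a list of rows."""
    return [list(col) for col in zip(*m)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy sizes: head_dim = 3, latent (compressed) dim = 2.
W_uk = [[1.0, 2.0],
        [0.0, 1.0],
        [3.0, -1.0]]   # K up-projection, shape (head_dim, latent_dim)
q = [0.5, -1.0, 2.0]   # one query head, shape (head_dim,)
c_kv = [4.0, 0.25]     # cached latent for one token, shape (latent_dim,)

# Without absorption: up-project the cached latent, then score.
k = matvec(W_uk, c_kv)
score_plain = dot(q, k)

# With absorption: fold W_uk into the query and score against the latent.
q_absorbed = matvec(transpose(W_uk), q)
score_absorbed = dot(q_absorbed, c_kv)

assert abs(score_plain - score_absorbed) < 1e-9
```

Both paths give the same score; the absorbed path only touches the small latent vector per cached token.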

The export hash needs to be different for each MLA absorption config; this still needs to be fixed.
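
One way to make the hash differ per absorption configuration is to include the MLA settings in the data being hashed. The sketch below assumes nothing about QEfficient's actual hashing code: `export_hash`, the `mla_absorption` dictionary, and the parameter names are hypothetical and only illustrate the idea.

```python
import hashlib
import json


def export_hash(export_params: dict, mla_absorption: dict) -> str:
    """Return a short, stable hash covering export params and MLA settings.

    Hypothetical helper for illustration; not QEfficient's real API.
    """
    payload = {"export_params": export_params, "mla_absorption": mla_absorption}
    # sort_keys gives a canonical serialization, so key order does not matter.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


params = {"model": "example-model", "ctx_len": 128}
h_none = export_hash(params, {"enable_mla": True, "absorption": "none"})
h_online = export_hash(params, {"enable_mla": True, "absorption": "online"})
h_same = export_hash(params, {"absorption": "online", "enable_mla": True})

assert h_none != h_online   # different absorption config, different hash
assert h_online == h_same   # same config in a different key order, same hash
```

With the absorption settings in the payload, two exports that differ only in MLA configuration no longer collide on the same cache entry.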

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>

anujgupt-github · 2026-02-19T17:33:47Z

QEfficient/transformers/models/modeling_auto.py

        enable_chunking = kwargs.get("enable_chunking", False)

+        # TODO: HACK handle better
+        if enable_mla := kwargs.get("enable_mla", False):


Why do we need this boolean in kwargs?
If the model has MLA, it should just be enabled.

Because we can also treat it as a full-KV model, which lets us skip the up-projection over the full CTX on each decode iteration.
We can get rid of this flag once we have data on whether full KV or MLA performs better.
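
The trade-off in this reply can be made concrete with a rough per-token cache-size count. The sketch assumes the standard MLA layout, where the compressed cache stores one latent vector plus one shared RoPE key per token, while a full-KV cache stores per-head keys and values. The dimension values are illustrative placeholders, not measurements and not the configuration used in this PR.

```python
def full_kv_elems(num_heads: int, qk_head_dim: int, v_head_dim: int) -> int:
    """Elements cached per token per layer when keys and values are stored per head."""
    return num_heads * (qk_head_dim + v_head_dim)


def compressed_kv_elems(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """Elements cached per token per layer when only the latent and RoPE key are stored."""
    return kv_lora_rank + qk_rope_head_dim


# Placeholder dimensions, chosen only to show the arithmetic.
full = full_kv_elems(num_heads=16, qk_head_dim=192, v_head_dim=128)
compressed = compressed_kv_elems(kv_lora_rank=512, qk_rope_head_dim=64)
ratio = full / compressed

print(f"full KV: {full}, compressed: {compressed}, ratio: {ratio:.2f}x")
```

The compressed cache is smaller, but without absorption each decode step has to up-project the whole cached context. The full-KV path spends memory to avoid that work, which is why the flag lets both be compared.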

ochougul and others added 6 commits January 28, 2026 07:37

added modeling for kimik2

390b817

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

able to run kimi model, need to check accuracy

acdcdbb

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

bugfix

469fd7c

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

experimentation branch commit

c29fb18

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

added MLA with/WO fusion, the caching for different config needs to be sorted

787254c

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

Add prefill only moe changes from kimik2 branch

ba3218c

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>

quic-mamta requested review from ochougul, quic-amitraj, quic-hemagnih and quic-rishinr as code owners February 10, 2026 09:51

mamtsing added 2 commits February 16, 2026 21:06

Change Cache for compressed KV and k_rope

5f80105

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>

fix dynamic axis and output mismatch

a47fff0

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>

anujgupt-github reviewed Feb 19, 2026

View reviewed changes

Split kv_a_proj_with_mqa weights to get ckv and k_pe

c16a0c7

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
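
As an illustration of the weight split named in this commit, the sketch below assumes the layout used in the Hugging Face DeepSeek modeling code, where `kv_a_proj_with_mqa` produces `kv_lora_rank + qk_rope_head_dim` outputs: the first `kv_lora_rank` are the compressed KV and the rest are the RoPE key. Splitting the weight rows at the same boundary yields two smaller projections with identical outputs. The helper name and toy sizes are hypothetical, and the PR's actual code may differ.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def split_kv_a_weight(weight, kv_lora_rank):
    """Split a fused (out_features, in_features) weight along its output rows."""
    return weight[:kv_lora_rank], weight[kv_lora_rank:]


# Hypothetical toy sizes: kv_lora_rank = 2, qk_rope_head_dim = 1, hidden = 3.
kv_lora_rank, qk_rope_head_dim = 2, 1
fused_weight = [[1.0, 0.0, 2.0],
                [0.0, 1.0, -1.0],
                [0.5, 0.5, 0.5]]
hidden = [2.0, 4.0, 6.0]

w_ckv, w_k_pe = split_kv_a_weight(fused_weight, kv_lora_rank)

fused_out = matvec(fused_weight, hidden)  # one fused projection
ckv = matvec(w_ckv, hidden)               # compressed KV part
k_pe = matvec(w_k_pe, hidden)             # RoPE key part

# The two split projections reproduce the fused output exactly.
assert ckv + k_pe == fused_out
```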

quic-mamta force-pushed the mla_fusion branch from b8b3326 to c16a0c7 Compare February 24, 2026 17:45

ochougul and others added 7 commits March 5, 2026 06:10

added compressed-tensors

b4aaac1

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

added tiktoken

62ad3c2

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

added tiktoken

a7119e5

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

added changes to load int4 weights correctly with our quantizer

f8ff6af

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

added support for MatMulNBits

899d77f

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

Update modeling_deepseek_qeff.py

3ac7e8b

Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>

local changes, ugly

696fd2f

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

MLA #789
quic-mamta wants to merge 16 commits into main from mla_fusion

quic-mamta commented Feb 10, 2026

anujgupt-github Feb 19, 2026

ochougul Mar 5, 2026

4 participants
