
[Exp]: Python 3.12 upgrade #607

Closed
abukhoy wants to merge 22 commits into quic:main from abukhoy:python-3.12-upgrade

Conversation

@abukhoy
Contributor

@abukhoy abukhoy commented Nov 5, 2025

No description provided.

Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
abukhoy and others added 13 commits November 5, 2025 11:45
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
added build-essential and dev apt packages

Signed-off-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com>
Signed-off-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
# Support for Diffusers Architecture in Efficient Transformers

## Overview
This pull request introduces **Diffusers architecture support** to the
**Efficient Transformers** framework, enabling seamless integration of
diffusion models.

## Key Highlights
1. **Support for the model
[black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell)**
2. **Flexible Configuration**
   - Supports JSON-based configuration files for easy compilation and
     execution.
3. **Performance Benchmarking**
   - Implements a performance matrix for Diffusers models to enable
     benchmarking of each module.
4. **Testing Framework**
   - Includes initial test scripts for Diffusers (in progress).
5. **Support for ONNX subfunction graphs via the flag `use_onnx_function`**
6. **Support for parallel compilation of modules via the flag
   `parallel_compile`**

---------

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcommm.com>
Signed-off-by: tv-karthikeya <vtirumal@qti.qualcomm.com>
Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
Co-authored-by: tv-karthikeya <vtirumal@qti.qualcomm.com>
Co-authored-by: Amit Raj <amitraj@qti.qualcommm.com>
Co-authored-by: Karthikeya <venkatakarthikeya01@gmail.com>
Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
# We should be using disaggregated serving for the GPT-OSS model for best performance
- The GPT-OSS models have a total_experts/experts_per_tok ratio of 128/4
(120b) and 32/4 (20b)
- The prefill-only model always reads all experts exactly once
- The decode-only model treats weights as activations, meaning that only
the chosen experts are read
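
A rough back-of-the-envelope sketch of why the two phases favour different expert-reading strategies. It assumes every token picks its experts uniformly at random and independently of the other tokens, which real routers do not do, and the function name is hypothetical (it is not part of QEfficient).

```python
def expected_fraction_of_experts_touched(
    total_experts: int, experts_per_tok: int, num_tokens: int
) -> float:
    """Expected share of experts selected by at least one of `num_tokens` tokens.

    Assumption: uniform, independent routing (an idealisation of a real router).
    """
    p_untouched = (1.0 - experts_per_tok / total_experts) ** num_tokens
    return 1.0 - p_untouched


# The two ratios quoted above: 128/4 and 32/4.
for total_experts in (128, 32):
    decode = expected_fraction_of_experts_touched(total_experts, 4, num_tokens=1)
    prefill = expected_fraction_of_experts_touched(total_experts, 4, num_tokens=128)
    print(
        f"{total_experts}/4: one decode token touches {decode:.1%} of experts, "
        f"a 128-token prefill about {prefill:.1%}"
    )
```

Under this idealised assumption a single decode token touches only a small share of the experts, while a prefill of a hundred or so tokens touches nearly all of them. That is consistent with reading every expert once in prefill and only the chosen experts in decode.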

# Prefill-only model
## Blocking: default behaviour when `prefill_only=True` in compile API
- `NUM_Q_BLOCKS=<int>` sets the number of Q blocks in attention
- `NUM_FFN_BLOCKS=<int>` sets the number of blocks in FFN
- `ENABLE_OPT_SWA=<0 or 1>` enables or disables optimized SWA. When enabled,
only the valid KVs for a given block are used in attention, reducing MACs
- prefix_caching is not supported in this mode
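
The MAC saving from optimized SWA can be illustrated with a small counting sketch. This is not how the compiler implements blocking; it only assumes a causal sliding window in which query row `i` attends to keys `j` with `i - window < j <= i`. The helper names and the toy sizes are made up for illustration.

```python
def kv_range_for_q_block(q_start: int, q_end: int, window: int) -> tuple[int, int]:
    """Half-open KV range [lo, hi) needed by query rows [q_start, q_end).

    Row i attends to keys j with i - window < j <= i (causal sliding window).
    """
    return max(0, q_start - window + 1), q_end


def score_entries(seq_len: int, num_q_blocks: int, window: int) -> tuple[int, int]:
    """Return (dense, windowed) counts of query-key score entries computed.

    Each entry costs head_dim multiply-accumulates, so the ratio of the two
    counts is also the ratio of MACs spent on attention scores.
    """
    block = -(-seq_len // num_q_blocks)  # ceiling division
    dense = seq_len * seq_len
    windowed = 0
    for q_start in range(0, seq_len, block):
        q_end = min(q_start + block, seq_len)
        lo, hi = kv_range_for_q_block(q_start, q_end, window)
        windowed += (q_end - q_start) * (hi - lo)
    return dense, windowed


dense, windowed = score_entries(seq_len=8, num_q_blocks=4, window=3)
print(dense, windowed)  # → 64 28
```

More Q blocks give tighter KV ranges per block, at the cost of more, smaller matrix multiplications.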

## Chunking: pass `enable_chunking=True` and `prefill_only=True` in compile API
- Optimized SWA, i.e. reading only valid KVs as per the diagonal attention
mask, is enabled by default for this version
- This model can be used for prefix_caching by passing
`kv_cache_batch_size=<int>` in compile API

# Decode-only model
## Retain sliding window length of KV for sliding window layers: default behaviour when `prefill_seq_len=1` in compile API
- This reduces the amount of DDR used by the model
- CB is enabled for this version: pass `continuous_batching=True` in the
`from_pretrained` call, strictly pass `full_batch_size=<int>`, and
optionally pass `kv_cache_batch_size=<int>` if needed
## Full KV for sliding window layers: pass `retain_full_kv=True` along with `prefill_seq_len=1` in compile API
- This uses more DDR, as ctx_len KV is retained even for sliding
window layers, but only sliding-window-length KV is read in
attention
- CB is enabled for this version: pass `continuous_batching=True` in the
`from_pretrained` call, strictly pass `full_batch_size=<int>`, and
optionally pass `kv_cache_batch_size=<int>` if needed
- This is enabled for the multi-turn chat use case, where we run
prefill -> decode and then use the combined prefill and decode cache
to run prefill again, so we want to retain the full KV for sliding
window layers
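
The DDR trade-off between the two decode-only variants can be estimated with a simple size calculation. This is a sketch only: the function and its parameters are hypothetical, the dimensions are made up rather than taken from any GPT-OSS config, and real on-device layouts (padding, quantized KV formats) will differ.

```python
def kv_cache_bytes(
    num_full_layers: int,
    num_sliding_layers: int,
    ctx_len: int,
    sliding_window: int,
    num_kv_heads: int,
    head_dim: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,
    retain_full_kv: bool = False,
) -> int:
    """Rough KV-cache size in bytes (keys + values) for a mixed full/sliding model."""
    per_token = 2 * num_kv_heads * head_dim * bytes_per_elem  # K and V
    # Sliding layers keep either the whole context or just the window.
    sliding_len = ctx_len if retain_full_kv else min(sliding_window, ctx_len)
    tokens = num_full_layers * ctx_len + num_sliding_layers * sliding_len
    return batch_size * tokens * per_token


# Illustrative, made-up dimensions.
shape = dict(num_full_layers=12, num_sliding_layers=12, ctx_len=8192,
             sliding_window=128, num_kv_heads=8, head_dim=64)
compact = kv_cache_bytes(**shape)
full = kv_cache_bytes(**shape, retain_full_kv=True)
print(f"window-length KV: {compact / 2**20:.1f} MiB, full KV: {full / 2**20:.1f} MiB")
```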


NOTE:
* The decode-only model currently fails compilation with
`use_onnx_subfunctions=True`, so avoid using it
* The 120B model needs NPI. There are two versions of the NPI, one with and
one without subfunctions; both are uploaded here. Pass the file as
`node_precision_info=<path to file>`
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only
model, otherwise compilation times are too high. With this flag the model
is expected to export and then fail during compile, as it needs assert sdk,
so the user is expected to run the compilation manually by pasting the
command printed in the error

---------

Signed-off-by: vbaddi <quic_vbaddi@quicinc.com>
Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Signed-off-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com>
Co-authored-by: Vinayak Baddi <quic_vbaddi@quicinc.com>
Co-authored-by: Vinayak Baddi <vbaddi@qti.qualcomm.com>
Co-authored-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Co-authored-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
Update tests of onnx_subfunction to compare the hash of the .onnx file
when `use_onnx_subfunction` flag is toggled
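
A minimal sketch of what a file-hash comparison of two exports could look like, using only the standard library. The helper names are hypothetical and this is not the test code from the commit.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks so large .onnx files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exports_differ(onnx_path_a: Path, onnx_path_b: Path) -> bool:
    """True when the two exported files are not byte-identical."""
    return file_sha256(onnx_path_a) != file_sha256(onnx_path_b)
```

A hash only shows that the two files differ, not how they differ, so it cannot confirm on its own that subfunctions were actually emitted.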

---------

Signed-off-by: Amit Raj <amitraj@qti.qualcommm.com>
Co-authored-by: Amit Raj <amitraj@qti.qualcommm.com>
**Overview**

On-device sampling can significantly reduce host overhead and improve
inference throughput; however, so far it has only been implemented for
`QEffForCausalLM` models. This PR extends on-device sampling support to
the language decoder of dual QPC vision language models,
`QEffCausalLMForTextImageToTextModel`. In addition, it fixes a bug in
the Gumbel noise so that it correctly simulates a multinomial distribution
for random sampling.
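
For background, the general Gumbel-max trick draws from a multinomial (softmax) distribution by adding independent Gumbel(0, 1) noise to each logit and taking the argmax. The pure-Python sketch below illustrates that textbook technique only. It is not the sampler code from this PR and does not show what the bug was.

```python
import math
import random


def gumbel_max_sample(logits: list[float], rng: random.Random) -> int:
    """Draw one index with probability softmax(logits) via the Gumbel-max trick."""
    best_idx, best_val = -1, -math.inf
    for idx, logit in enumerate(logits):
        # Clamp u into the open interval (0, 1) so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        if logit + gumbel > best_val:
            best_idx, best_val = idx, logit + gumbel
    return best_idx


rng = random.Random(0)
probs = [0.5, 0.3, 0.2]
logits = [math.log(p) for p in probs]
num_draws = 20_000
counts = [0] * len(probs)
for _ in range(num_draws):
    counts[gumbel_max_sample(logits, rng)] += 1
print([round(c / num_draws, 3) for c in counts])  # should be close to probs
```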

**Implementation details**

```
class _QEffAutoModelForImageTextToTextDualQPC:
    def __init__(
        self,
        model: nn.Module,
        continuous_batching: bool = False,
        qaic_config: Optional[dict] = None,
        **kwargs,
    ):
        # Omitting unchanged parts
        self.lang_model = QEffCausalLMForTextImageToTextModel(model, qaic_config=qaic_config, **kwargs)
        # ---Sampling---
        # Note: SamplerTransform should be applied after all other transforms
        # are done. The role of the sampler is to just add nodes at the output of the
        # previous transform function.
        self.lang_model.model, _ = SamplerTransform.apply(self.lang_model.model, qaic_config, **kwargs)
```

**Usage**

Usage is similar to enabling on-device sampling for
`QEffForCausalLM`.

```
from QEfficient import QEFFAutoModelForImageTextToText

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

qeff_model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_id,
    attn_implementation="eager",
    kv_offload=True,
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
    },
)
```

---------

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: sanising <sanising@qti.qualcomm.com>
Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
Co-authored-by: sanising <sanising@qti.qualcomm.com>
Co-authored-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
…of hash comparison (quic#670)

## Summary
Refactored the subfunction unit test to directly verify ONNX subfunction
usage by inspecting the exported model structure, replacing the previous
hash-based validation approach.

## Changes
- Removed hash-based checks (`export_hash` and file hash comparisons)
- Added ONNX model inspection utilities:
  - `has_gpt2block_function()`: Checks for QEffGPT2Block function
    definitions
- Added explicit assertions to verify:
  - QEffGPT2Block function is defined when `use_onnx_subfunctions=True`
  - QEffGPT2Block function is NOT defined when
    `use_onnx_subfunctions=False`
  - QEffGPT2Block calls exist in graph nodes when subfunctions are enabled
  - No QEffGPT2Block calls when subfunctions are disabled
- Maintained functional equivalence testing (generation output
comparison)
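
The structural checks described above could be sketched roughly as follows. The two helpers are hypothetical and are not the utilities added in this commit. They rely only on the `functions` field of an ONNX `ModelProto` and the `op_type` of its graph nodes, and the example uses lightweight stand-in objects so that it runs without the `onnx` package.

```python
from types import SimpleNamespace


def has_function(model, name_fragment: str) -> bool:
    """True if any function definition in the model has `name_fragment` in its name."""
    return any(name_fragment in fn.name for fn in model.functions)


def calls_function(model, name_fragment: str) -> bool:
    """True if any top-level graph node's op_type contains `name_fragment`."""
    return any(name_fragment in node.op_type for node in model.graph.node)


# Stand-ins that mimic the ModelProto fields used above; with the onnx package
# installed, `onnx.load(path)` returns an object exposing the same attributes.
with_subfunctions = SimpleNamespace(
    functions=[SimpleNamespace(name="QEffGPT2Block")],
    graph=SimpleNamespace(
        node=[SimpleNamespace(op_type="QEffGPT2Block"), SimpleNamespace(op_type="MatMul")]
    ),
)
without_subfunctions = SimpleNamespace(
    functions=[],
    graph=SimpleNamespace(node=[SimpleNamespace(op_type="MatMul")]),
)

print(has_function(with_subfunctions, "QEffGPT2Block"))       # → True
print(calls_function(without_subfunctions, "QEffGPT2Block"))  # → False
```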

Signed-off-by: Vinayak Baddi <quic_vbaddi@quicinc.com>
Co-authored-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
@abukhoy abukhoy marked this pull request as draft December 18, 2025 05:56
Signed-off-by: Abukhoyer SHaik <abukhoye@qti.qualcomm.com>
@abukhoy abukhoy marked this pull request as ready for review December 18, 2025 06:07
quic-dhirajku and others added 3 commits December 18, 2025 12:11
quic#661)

Installing PyTorch 2.9 for the FT CI test

---------

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
## ✨ Add Support for Guided Decoding to On Device Sampling

### 📌 Overview

This PR introduces **guided decoding** capabilities in On Device
Sampling for `QEffForCausalLM` and `QEffCausalLMForTextImageToTextModel`
models.


### 🚀 Motivation

As outlined in [this blog on structured
decoding](https://blog.vllm.ai/2025/01/14/struct-decode-intro.html),
structured decoding represents a fundamental shift in controlling LLM
outputs. Instead of relying on post-processing, constraints are enforced
during token generation via **logits manipulation**. This approach
ensures:

*   **Format compliance** at generation time.
*   Reduced error rates for structured outputs.
* Performance improvements through optimized backends like **XGrammar**,
which can deliver up to **5× faster token generation under load**.

The constraints are provided through `token_bitmasks` which is a Boolean
matrix of shape `(batch_size, vocab_size)`. Here, each element indicates
whether a token should be kept (1) or masked (0). During sampling, this
mask is applied to the logits before token selection, ensuring that only
allowed tokens are considered.
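
A minimal pure-Python sketch of the masking step, assuming `logits` and `token_bitmasks` are both laid out as `(batch_size, vocab_size)` nested lists. The helper names are hypothetical, and the greedy argmax merely stands in for the real sampler. The on-device implementation is not shown in this description.

```python
import math


def apply_token_bitmask(
    logits: list[list[float]], token_bitmasks: list[list[bool]]
) -> list[list[float]]:
    """Set logits of disallowed tokens (mask value 0/False) to -inf."""
    return [
        [logit if keep else -math.inf for logit, keep in zip(row, mask_row)]
        for row, mask_row in zip(logits, token_bitmasks)
    ]


def greedy_pick(masked_logits: list[list[float]]) -> list[int]:
    """Argmax per batch row; stands in for the real sampler."""
    return [max(range(len(row)), key=row.__getitem__) for row in masked_logits]


logits = [[2.0, 5.0, 1.0, 0.5],   # batch row 0: token 1 has the highest raw logit
          [0.1, 0.2, 3.0, 0.3]]   # batch row 1: token 2 has the highest raw logit
token_bitmasks = [[True, False, True, True],   # token 1 is forbidden for row 0
                  [True, True, True, True]]    # everything is allowed for row 1
print(greedy_pick(apply_token_bitmask(logits, token_bitmasks)))  # → [0, 2]
```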

By performing this operation directly on the device, we eliminate
host-device transfers, reduce latency, and improve throughput for
structured decoding workloads.



### 🛠️ Implementation Details

The guided decoding logic is injected via `include_guided_decoding=True`
during model loading. No changes to the model architecture are required.

```python
from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Load model with On Device Sampler enabled
qeff_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
        "include_guided_decoding": True,
    },
)

# Compile as usual
qeff_model.compile(
    prefill_seq_len=128,
    ctx_len=256,
    full_batch_size=16,
    num_devices=4,
    num_speculative_tokens=0,
    mxint8_kv_cache=True,
    mxfp6_matmul=True,
)
```

To disable guided decoding, simply set `include_guided_decoding=False`.

---------

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: sanising <sanising@qti.qualcomm.com>
Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
Co-authored-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
Co-authored-by: sanising <sanising@qti.qualcomm.com>
Co-authored-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
@abukhoy
Contributor Author

abukhoy commented Dec 22, 2025

This has been moved to #685.
We will close this PR soon.

@quic-rishinr
Contributor

Closing this PR, as it will be taken care of in PR #685.


9 participants