Has anyone implemented a qwen-vl multimodal demo using expo? #276

bbhxwl · 2026-01-14T01:06:12Z

Has anyone implemented a qwen-vl multimodal demo using expo?

a-ghorbani · 2026-01-14T08:26:34Z

a-ghorbani
Jan 14, 2026
Collaborator

Not with Expo (we use the React Native CLI), but we are able to run Qwen3-VL-2B on a phone.

I'm not entirely sure what issue you're running into, but in our case the problem was related to context window limits.

A couple of things to keep in mind:

Qwen-VL models use dynamic image sizing rather than a fixed size.

The number of image tokens (the vision encoder's output) scales with the image dimensions:

tokens = (H / effective_patch) × (W / effective_patch)

For Qwen3-VL, the effective patch size is patch_size × merge_size = 16 × 2 = 32 pixels (see preprocessor_config.json and clip.cpp#L3104).

For example, my phone's camera produces 3024 × 4032 images, which means (rounding 3024 / 32 = 94.5 down to 94 whole patches):

tokens = (3024 / 32) × (4032 / 32) ≈ 94 × 126 = 11,844 tokens
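
As a rough illustration of the arithmetic above, here is a minimal TypeScript sketch. The helper name `estimateImageTokens` is hypothetical, the constants are simply the Qwen3-VL values quoted in this thread, and flooring to whole patches is an assumption; the exact rounding inside llama.cpp may differ.

```typescript
// Assumed Qwen3-VL values quoted in this thread (see preprocessor_config.json).
const PATCH_SIZE = 16;
const MERGE_SIZE = 2;
const EFFECTIVE_PATCH = PATCH_SIZE * MERGE_SIZE; // 32 pixels

// Hypothetical helper: approximate number of image tokens for a given size.
// Assumption: only whole patches are counted (floor), which may not match
// llama.cpp's exact rounding.
function estimateImageTokens(width: number, height: number): number {
  const cols = Math.floor(width / EFFECTIVE_PATCH);
  const rows = Math.floor(height / EFFECTIVE_PATCH);
  return rows * cols;
}

// 3024 x 4032 phone photo: 94 x 126 patches
console.log(estimateImageTokens(3024, 4032)); // 11844
```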

To manage this, llama.cpp caps the image tokens at 4096 by default for Qwen-VL models (source). It calculates the maximum allowed pixel count from this token limit, then scales the image down to fit (calc_size_preserved_ratio).

So:

High-res images (common from phone cameras) can easily exceed 4096 tokens before the cap is applied, so you need to either:
1. downscale the image before passing it to the model, using the formula above to calculate the number of tokens it needs, or
2. explicitly set a lower max token count and let llama.cpp/mtmd scale the image: PR #275
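
For option 1, a target size can be worked out from the token budget before resizing. The sketch below is only an illustration under stated assumptions: `fitToTokenBudget` and `Size` are hypothetical names, the 32-pixel effective patch is the Qwen3-VL value from this thread, and the rounding is not a copy of llama.cpp's calc_size_preserved_ratio. The actual resize would still need an image library of your choice.

```typescript
// Assumption: 32 px effective patch (16 px patch x 2 merge), as quoted
// in this thread for Qwen3-VL.
const PATCH_PX = 32;

interface Size {
  width: number;
  height: number;
}

// Hypothetical helper (not llama.cpp's calc_size_preserved_ratio):
// shrink an image so its estimated token count fits within maxTokens,
// roughly keeping the aspect ratio.
function fitToTokenBudget(size: Size, maxTokens: number): Size {
  const maxPixels = maxTokens * PATCH_PX * PATCH_PX;
  const pixels = size.width * size.height;
  // Never upscale: the scale factor is capped at 1.
  const scale = Math.min(1, Math.sqrt(maxPixels / pixels));
  // Snap each side down to a whole number of patches (at least one patch).
  const snap = (v: number): number =>
    Math.max(PATCH_PX, Math.floor((v * scale) / PATCH_PX) * PATCH_PX);
  return { width: snap(size.width), height: snap(size.height) };
}

const target = fitToTokenBudget({ width: 3024, height: 4032 }, 4096);
const tokens = (target.width / PATCH_PX) * (target.height / PATCH_PX);
console.log(target, tokens);
```

For the 3024 × 4032 example this yields 1760 × 2336, or 55 × 73 = 4015 tokens, which stays under the 4096 cap. Because each side is clamped to at least one patch, a very elongated image could still exceed a very small budget, so it is worth re-checking the token count after resizing.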
