# Vision/VLM Branch Patterns (Notes)

This repo's `axllm` branch currently runs **text-only** models by:

- `tokenizer->encode(history)` to token ids
- compute `tokens_diff` vs `last_tokens_ids` (append-only fast path)
- `SetKVCache(k/v, precompute_len, input_num_token)`
- build `out_embed` for `tokens_diff` via `embed_selector.getByIndex(id)`
- `Run(out_embed)` to do prefill/decode
- `GetKVCache(...)`, append assistant reply, update `last_tokens_ids`

So the "context" (KV cache) support on `axllm` is based on **token-id diff** + caching.
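
The diff step above amounts to a longest-common-prefix scan. A minimal sketch (names like `ComputeTokenDiff` are illustrative, not the repo's actual API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Append-only fast path: compare the freshly encoded ids against the previous
// turn's ids; the shared prefix is already covered by the KV cache, and only
// the tail needs embedding + prefill.
struct TokenDiff {
    size_t precompute_len;          // tokens already in the KV cache
    std::vector<int> tokens_diff;   // new tail to embed and prefill
};

TokenDiff ComputeTokenDiff(const std::vector<int>& last_tokens_ids,
                           const std::vector<int>& token_ids) {
    size_t common = 0;
    while (common < last_tokens_ids.size() && common < token_ids.size() &&
           last_tokens_ids[common] == token_ids[common]) {
        ++common;
    }
    return {common,
            std::vector<int>(token_ids.begin() + common, token_ids.end())};
}
```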

Below are distilled patterns from the VLM branches (Qwen/InternVL/FastVLM/SmolVLM2, incl. multi-frame/video variants).

## 1) How Images/Videos Get Into The Prompt

All VLM branches follow the same high-level idea:

1. The tokenizer's chat template emits **placeholder tokens** for each media item.
2. The vision encoder produces **per-placeholder embedding vectors**.
3. The code finds the placeholder positions in `input_ids` and **replaces** the corresponding token embeddings in `out_embed`.

The differences are in (a) which placeholder tokens are used and (b) how to locate them.
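
Step 3 can be sketched as a scan-and-overwrite over a flat embedding buffer. A hedged sketch, assuming row-major bf16 buffers with one `embed_size` row per token (function and parameter names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Replace the embedding row at every placeholder position with the next row
// from the vision encoder output. bf16 values are carried as raw uint16_t.
void InjectVisionEmbeds(const std::vector<int>& input_ids,
                        int image_token_id,
                        const std::vector<uint16_t>& vision_embeds, // bf16 rows
                        size_t embed_size,
                        std::vector<uint16_t>& out_embed) {
    size_t next_row = 0;
    for (size_t i = 0; i < input_ids.size(); ++i) {
        if (input_ids[i] != image_token_id) continue;
        if ((next_row + 1) * embed_size > vision_embeds.size()) break; // safety
        std::memcpy(&out_embed[i * embed_size],
                    &vision_embeds[next_row * embed_size],
                    embed_size * sizeof(uint16_t));
        ++next_row;
    }
}
```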

### A. Qwen3/Qwen2.5-VL style

- Prompt uses `<|vision_start|> ... <|vision_end|>` wrapping.
- Inside the vision block, placeholders are repeated:
  - image: `<|image_pad|>` repeated `num_media * num_media_tokens` times
  - video: `<|video_pad|>` repeated `num_media * num_media_tokens` times
- Injection offsets are typically found by scanning for `vision_start_token_id` and taking `offset = i + 1` (the first placeholder after the start token).

### B. InternVL style

- Prompt uses `<img> ... </img>` wrapping.
- Placeholder token is usually `<IMG_CONTEXT>`, repeated `num_media_tokens` times.
- Some templates place `content.data` before the placeholders; others after.
- Injection offsets are found by scanning `input_ids` for `IMAGE_CONTEXT_TOKEN` (or via `<|vision_start|>` then the next token).

### C. FastVLM style

- Similar to InternVL: uses a simple placeholder token (e.g. `"<image>"`) repeated.
- Code scans for a fixed `IMAGE_CONTEXT_TOKEN` id in `input_ids` and overwrites those slots.

### D. SmolVLM2 style

- Placeholder token is `"<image>"` (often a single token id).
- The template may include additional "header tokens" around the image tokens.
- Injection offsets are found by detecting **runs** of consecutive `IMAGE_CONTEXT_TOKEN` ids, pushing the first index of each run.

## 2) Vision Encoder Output Shapes And Preprocess

There are two broad encoder IO styles:

### A. "Classic image encoder" (single image -> embedding sequence)

Common in `ax-fastvlm`, `ax-internvl`:

- Determine input layout:
  - NCHW float input: normalize `(x/255 - mean) / std`, write as `float` into the input tensor.
  - NHWC u8 input: resize + RGB, memcpy into the input tensor.
- Determine output dtype:
  - if output size matches `elem_count * 2` => bf16 output
  - if output size matches `elem_count * 4` => fp32 output (then convert to bf16)
- Result is a flat bf16 array whose length is `num_media_tokens * tokens_embed_size` (conceptually).
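
The dtype probe above is just a byte-size comparison. A minimal sketch:

```cpp
#include <cassert>
#include <cstddef>

// Compare the output tensor's byte size against its logical element count to
// decide how to interpret the buffer.
enum class OutDtype { BF16, FP32, Unknown };

OutDtype DetectOutputDtype(size_t output_bytes, size_t elem_count) {
    if (output_bytes == elem_count * 2) return OutDtype::BF16;
    if (output_bytes == elem_count * 4) return OutDtype::FP32; // convert to bf16 afterwards
    return OutDtype::Unknown;
}
```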

### B. Qwen-VL "video processor" (frames -> patches -> embedding sequence)

Common in `ax-qwen2_5-vl`, `ax-qwen3-vl`, `axcl-qwen3-vl`:

- Preprocess frames (even for a single image) via a "video-like" patching pipeline:
  - resize to `(vision_config.height, vision_config.width)`
  - convert to RGB
  - temporal patching (`temporal_patch_size`)
  - spatial merge (`spatial_merge_size`)
  - patch size (`patch_size`)
- Produces `pixel_values` per grid segment (for videos: multiple segments).
- Each segment is fed to `image_encoder` to produce an embedding block.
- Tracks `cfg.image_grid_thw` and/or `cfg.video_grid_thw` for mRoPE.
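
Conceptually, the `(t, h, w)` grid follows from the frame count and the resized frame size. A hedged sketch (the exact formulas in the branches may differ, e.g. around padding and merge handling):

```cpp
#include <array>
#include <cassert>

// Illustrative: one grid cell per patch after temporal + spatial patching.
struct VisionConfig {
    int height, width;
    int patch_size;
    int temporal_patch_size;
};

std::array<int, 3> ComputeGridTHW(int num_frames, const VisionConfig& cfg) {
    int grid_t = (num_frames + cfg.temporal_patch_size - 1) / cfg.temporal_patch_size;
    int grid_h = cfg.height / cfg.patch_size;
    int grid_w = cfg.width / cfg.patch_size;
    return {grid_t, grid_h, grid_w};
}
```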

## 3) Extra Side-Inputs Used By Some VLMs

### A. mRoPE / position ids (Qwen-VL)

Qwen-VL branches compute `position_ids` (3 x seq_len) based on:

- `input_ids`
- `cfg.image_grid_thw` / `cfg.video_grid_thw`
- vision settings: `spatial_merge_size`, sometimes video time scaling (`second_per_grid_ts`)

These `position_ids` are then used in prefill by writing them into the model's `indices` input
instead of a simple monotonically increasing index.
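
A greatly simplified sketch of the shape involved: for pure text tokens all three rows carry the same monotonically increasing value, while vision spans would instead enumerate temporal/height/width grid coordinates (omitted here). This is illustrative only, not the branches' actual mRoPE implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Text-only degenerate case of the 3 x seq_len position_ids layout.
std::vector<std::vector<int>> TextOnlyPositionIds(size_t seq_len) {
    std::vector<std::vector<int>> position_ids(3, std::vector<int>(seq_len));
    for (size_t i = 0; i < seq_len; ++i)
        position_ids[0][i] = position_ids[1][i] = position_ids[2][i] = (int)i;
    return position_ids;
}
```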

### B. deepstack features (some AXCL Qwen-VL variants)

Some image encoders output additional tensors (e.g. 3 "deepstack features").
During prefill, for tokens where `visual_pos_mask[j] == 1`, the code adds those features
into the intermediate embedding stream (bf16 -> fp32 add -> bf16).
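
The bf16 round-trip in that add step can be sketched as follows. bf16 is the upper 16 bits of an IEEE-754 float, so the add happens in fp32 and is truncated back (the real code's rounding behavior may differ):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bf16 <-> fp32 helpers: bf16 is the high half of the fp32 bit pattern.
float Bf16ToF32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

uint16_t F32ToBf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16); // truncate (no rounding) for simplicity
}

// bf16 -> fp32 add -> bf16, as used when merging deepstack features.
uint16_t AddBf16(uint16_t a, uint16_t b) {
    return F32ToBf16(Bf16ToF32(a) + Bf16ToF32(b));
}
```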

### C. visual_pos_mask

Computed from `input_ids` by marking positions that equal `image_token_id` or `video_token_id`.
Used to align deepstack features with only the visual placeholder positions.

## 4) How Multi-Image / Multi-Frame Is Represented

All branches encode multi-image input as:

- `Content{ role=USER, type=IMAGE, data=prompt, num_media=N, num_media_tokens=T }`
- The tokenizer repeats placeholder tokens `N * T` times.
- The vision encoder returns `N` blocks, each of length `T * tokens_embed_size`.

For video:

- Some branches treat each temporal grid segment as one "media block".
- Some compute `cfg.video_grid_thw = {{grid_t, grid_h, grid_w}}` and then expand internally.

## 5) Key Gap vs `axllm` (Context Support)

The VLM branches above usually build `input_ids` for the full prompt and then build the full
`out_embed` (text token embeddings + injected vision embeddings) in one go.

`axllm` is different:

- It only materializes embeddings for `tokens_diff` (the incremental tail) to support KV-cache context.
- It currently has **no hook** to replace placeholder token embeddings with vision embeddings.

So a pluggable image encoder for `axllm` must solve:

- When `tokens_diff` contains placeholder ids, produce embeddings that include:
  - normal token embeddings for text tokens
  - vision embeddings for the placeholder slots
- While keeping the existing "token-id diff + KV cache" logic intact.

Practical implication:

- The "media -> placeholder slots" mapping must be reproducible from `(history, token_ids)` and/or
  persisted state, so incremental encoding can inject the correct vision embeddings only for the
  newly appended part of the conversation.
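
A hedged sketch of what tail-only injection could look like: a map from absolute token position to a vision embedding row lets the tail lookup stay correct across incremental turns, because the diff tail starts at `precompute_len`. Names, layout, and the map type are illustrative, not the branch's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Overwrite only those rows of the tail's embedding buffer whose absolute
// position is a known vision placeholder slot; text tokens keep the token
// embedding already written into out_embed.
void EmbedTail(const std::vector<int>& tokens_diff,
               size_t precompute_len,
               size_t embed_size,
               const std::map<size_t, const uint16_t*>& pos2vision,
               std::vector<uint16_t>& out_embed /* tokens_diff.size() * embed_size */) {
    for (size_t j = 0; j < tokens_diff.size(); ++j) {
        size_t abs_pos = precompute_len + j; // position in the full sequence
        auto it = pos2vision.find(abs_pos);
        if (it == pos2vision.end()) continue;
        std::memcpy(&out_embed[j * embed_size], it->second,
                    embed_size * sizeof(uint16_t));
    }
}
```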

## 6) Abstraction Axes For A Pluggable Vision Module (Proposed)

To unify the branches without changing `axllm`'s main control flow, the pluggable module needs
to cover these responsibilities:

- `Tokenizer side`:
  - define placeholder tokens + how many tokens per media item (`num_media_tokens`)
  - define how to locate placeholder offsets in `input_ids`
  - (optional) provide `position_ids` generation rules (mRoPE)
- `Vision side`:
  - load/init image encoder axmodel(s)
  - preprocess image/video frames
  - produce per-media embedding blocks (bf16, `tokens_embed_size` aligned)
  - (optional) deepstack features
- `Injection side`:
  - given `input_ids` and media blocks, produce an "embedding stream" where placeholder slots
    are overwritten by vision embeddings
  - for `axllm` context mode: support doing the above for **only the tail** (`tokens_diff`),
    while keeping placeholder alignment correct.

These notes are intentionally implementation-agnostic so we can use them as a checklist when
refactoring `axllm` to support LLM + VLM with context.

## 7) Current `axllm` Implementation Notes (This Branch)

This branch adds a pluggable vision module behind a runtime config switch (no compile-time toggle):

- `config.json`: `vlm_type` (or `VLM_TYPE`) selects the vision module.
- If `vlm_type` is not `"None"` (0), `filename_image_encoder_axmodel` must be set.
- Vision preprocessing backend is selected at **CMake configure time**:
  - Prefer OpenCV if found.
  - Otherwise fall back to `third_party/SimpleCV` and print a CMake warning (slight differences vs OpenCV are possible).
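
A minimal sketch of the relevant `config.json` fields; only `vlm_type` and `filename_image_encoder_axmodel` are named in these notes, and the `"qwen3-vl"` value shown here is an illustrative placeholder, not a confirmed accepted value:

```json
{
  "vlm_type": "qwen3-vl",
  "filename_image_encoder_axmodel": "image_encoder.axmodel"
}
```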

VLM runtime data flow (keeps the existing token-diff + KV-cache logic):

- The tokenizer still generates placeholder tokens based on `Content.num_media` and `Content.num_media_tokens`.
- The vision module prepares a copy of `history` with those two fields filled in, then:
  - encodes images/videos with `image_encoder.axmodel`
  - builds a `pos2vision` mapping for placeholder positions in `input_ids`
  - (Qwen-VL) computes `position_ids` (mRoPE) and a decode start override
- The LLM loop only builds embeddings for `tokens_diff`, and replaces only the vision placeholder slots in that tail.