
Commit 86030e4 (1 parent: d57bf22)

[ggma] Add documentation for TinyLlama example

- Created `runtime/ggma/examples/generate_text/tinyllama.md` with step-by-step guide.
- Includes prerequisites, model generation commands, full processing pipeline, and a summary.

ONE-DCO-1.0-Signed-off-by: Sanggyu Lee <sg5.lee@samsung.com>

File tree: 4 files changed (+233, −0 lines)
decode.py (71 additions, 0 deletions)

```python
# User input
prompt = "Lily picked up a flower."
model_name = "Maykeye/TinyLLama-v0"

# Tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding="max_length",
    max_length=30,
    truncation=True,
)

# Generator
import torch

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

from tico.utils.record_input import RecordingInput

# past_key_values
# ---------------
# During prefill, "past_key_values" is not None but an empty Cache instance.
# Passing None instead makes torch.export happy.

input_to_remove = [
    "attention_mask",
    # For left pad,  [0, ⋯, 0, 1, ⋯, 1]
    # For right pad, [1, ⋯, 1, 0, ⋯, 0]
    # (0 marks a pad token)
    # This script uses right padding and passes an all-1 attention mask
    # (including pad positions); the NPU computes all positions, pad or not.
]
# Capture only decode steps, i.e. calls where the KV cache is already non-empty.
condition_fn = lambda args_dict: args_dict["past_key_values"].get_seq_length() != 0

with torch.no_grad(), RecordingInput(
    model, condition_fn, input_to_remove=input_to_remove
) as rec:
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
captured_input = rec.captured_input

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# Tico
import tico
from tico.serialize.operators.adapters.onert.llama_attention import (
    llama_attention_forward_adapter,
)
from transformers.models.llama.modeling_llama import LlamaAttention

# LlamaAttention.forward = llama_attention_forward_adapter

model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
circle_model = tico.convert(model, captured_input)
circle_model.save("tinyllama.decode.circle")
```
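A note on `condition_fn` above: it tells `RecordingInput` to skip the prefill call (empty KV cache) and capture only decode steps. The intent can be sketched with a stand-in cache object; `FakeCache` here is hypothetical and only mimics the `get_seq_length` method of the real transformers `Cache`:

```python
class FakeCache:
    """Hypothetical stand-in for the transformers Cache API; only the one
    method condition_fn needs is implemented."""

    def __init__(self, seq_length):
        self._seq_length = seq_length

    def get_seq_length(self):
        return self._seq_length


# Same predicate as in the script: capture only when the cache is non-empty.
condition_fn = lambda args_dict: args_dict["past_key_values"].get_seq_length() != 0

# Prefill call: the cache is still empty, so the call is not captured.
print(condition_fn({"past_key_values": FakeCache(0)}))   # False

# Decode call: the cache already holds the prompt, so the call is captured.
print(condition_fn({"past_key_values": FakeCache(30)}))  # True
```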
prefill.py (76 additions, 0 deletions)

```python
# User input
prompt = "Lily picked up a flower."
model_name = "Maykeye/TinyLLama-v0"

# Tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding="max_length",
    max_length=32,
    truncation=True,
)

# Generator
import torch

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

from tico.utils.record_input import RecordingInput

# past_key_values
# ---------------
# During prefill, "past_key_values" is not None but an empty Cache instance.
# Passing None instead makes torch.export happy.

input_to_remove = [
    "past_key_values",
    # DynamicCache has been a flattenable (pytree-registered) type since
    # transformers 4.50; see _pytree.py > tree_flatten, where SUPPORTED_NODES
    # includes transformers.DynamicCache.
    # After flattening, DynamicCache becomes {"key_cache": [], "value_cache": []}:
    # the dict values are returned as leaves and the dict keys are stored in
    # the treespec.
    #
    # On prefill, DynamicCache is empty, so the dict is empty after flattening,
    # and PyTorch drops the empty dict. If the number of args is 4 (including
    # the cache), it becomes 3! To avoid this error, don't pass an empty
    # cache; just pass None.
    "attention_mask",
    # For left pad,  [0, ⋯, 0, 1, ⋯, 1]
    # For right pad, [1, ⋯, 1, 0, ⋯, 0]
    # (0 marks a pad token)
    # This script uses right padding and passes an all-1 attention mask
    # (including pad positions); the NPU computes all positions, pad or not.
    "cache_position",
    # The list of cache positions, e.g. [0, 1, ..., 11].
    # For the NPU, we always store all values (including pad).
]

with torch.no_grad(), RecordingInput(model, input_to_remove=input_to_remove) as rec:
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
captured_input = rec.captured_input

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# Tico
import tico

model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
circle_model = tico.convert(model, captured_input)
circle_model.save("tinyllama.prefill.circle")
```
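The empty-cache pitfall described in the `past_key_values` comment can be illustrated with a simplified stand-in for pytree flattening (plain Python; the real implementation lives in `torch.utils._pytree` and handles many more container types):

```python
# Simplified stand-in for pytree flattening: containers dissolve into the
# treespec and only leaf values survive as flattened arguments.
def tree_flatten(obj):
    if isinstance(obj, dict):
        return [leaf for v in obj.values() for leaf in tree_flatten(v)]
    if isinstance(obj, list):
        return [leaf for v in obj for leaf in tree_flatten(v)]
    return [obj]

# An empty DynamicCache flattens to {"key_cache": [], "value_cache": []},
# which yields zero leaves: the whole argument vanishes from the flat list.
print(tree_flatten({"key_cache": [], "value_cache": []}))          # []

# A filled cache keeps its tensors as leaves, so the argument count is stable.
print(tree_flatten({"key_cache": ["k0"], "value_cache": ["v0"]}))  # ['k0', 'v0']
```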
requirements.txt (2 additions, 0 deletions)

```
transformers==4.50.3
torch
```
runtime/ggma/examples/generate_text/tinyllama.md (84 additions, 0 deletions)
# TinyLlama Example Documentation

This document provides a step-by-step guide for generating and processing a text generation model.

## Summary

1. Set up the environment and install dependencies.
2. Generate the initial `prefill` and `decode` Circle model files.
3. Run the pipeline to optimize, reshape, and prune the model, producing a final `decode.circle` ready for inference.

## Prerequisites

1. **Python virtual environment**
   ```bash
   cd runtime/ggma/examples/generate_text/
   python3 -m venv _
   source _/bin/activate
   ```

2. **Install required Python packages**
   ```bash
   pip install -r requirements.txt
   ```

3. **Install TICO (Torch IR to Circle ONE)**
   ```bash
   # Clone the repository
   git clone https://github.com/Samsung/TICO.git
   # Install it in editable mode
   pip install -e TICO
   ```

## Generating Model Files

Run the provided scripts to create the prefill and decode Circle model files:

```bash
python prefill.py  # Generates tinyllama.prefill.circle
python decode.py   # Generates tinyllama.decode.circle
```

You can verify the generated files:

```bash
ls -lh *.circle
# Expected output:
# -rw-rw-r-- 1 gyu gyu 18M Nov 14 14:09 tinyllama.decode.circle
# -rw-rw-r-- 1 gyu gyu 18M Nov 14 14:09 tinyllama.prefill.circle
```

## Full Processing Pipeline

The following pipeline shows how to chain several tools to transform the model:

```bash
with.py tinyllama.decode.circle | \
fuse.attention.py | \
fuse.bmm_lhs_const.py | \
reshape.fc_weight.py | \
reshape.io.py input --by_shape [1,16,30,4] [1,16,32,4] | \
transpose.io.kvcache.py | \
remove.io.py output --keep_by_id 0 | \
select.op.py --by_id 0-181 | \
gc.py | \
retype.input_ids.py > decode.circle
```

### Explanation of each step

| Tool | Purpose |
|------|---------|
| `with.py` | Reads the Circle model file and writes it to stdout, starting the pipeline. |
| `fuse.attention.py` | Fuses attention-related operators for optimization. |
| `fuse.bmm_lhs_const.py` | Fuses constant left-hand-side matrices in batch matrix multiplication. |
| `reshape.fc_weight.py` | Reshapes fully-connected layer weights. |
| `reshape.io.py input --by_shape [...]` | Reshapes input tensors to the specified shapes. |
| `transpose.io.kvcache.py` | Transposes the KV-cache tensors. |
| `remove.io.py output --keep_by_id 0` | Keeps only the output tensor with ID 0, removing the rest. |
| `select.op.py --by_id 0-181` | Selects operators with IDs from 0 to 181. |
| `gc.py` | Performs garbage collection, removing unused tensors and operators. |
| `retype.input_ids.py` | Changes the data type of the input IDs as needed. |
| `> decode.circle` | Saves the final processed model to `decode.circle`. |

Feel free to adjust the pipeline arguments (e.g., shapes, IDs) to suit your specific model configuration.
